Provide better long-term storage for translation models
Open, Medium, Public

Assigned To

None

Authored By

	Pginer-WMF
	Apr 27 2023, 10:01 AM

Description

As part of the exploration of a self-hosted translation service (T331505), the question came up of where to store the models. Currently, the machine learning models for translation are stored at https://people.wikimedia.org/~santhosh/nllb/ and https://people.wikimedia.org/~santhosh/opusmt

However, since the ML cluster uses Swift, storing the models there may be more resilient in the long term.

Details

Subject	Repo	Branch	Lines +/-
machinetranslation: Add egress mesh template	operations/deployment-charts	master	+2 -1
profile::thanos::swift: add machinetranslation user	operations/puppet	production	+5 -0
thanos: add machinetranslation user	labs/private	master	+1 -0
machinetranslation: Update people egress	operations/deployment-charts	master	+4 -4
machinetranslation: Enable thanos-swift service mesh	operations/deployment-charts	master	+4 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• santhosh	T331505 Self hosted machine translation service
		Open		None	T335491 Provide better long-term storage for translation models

Event Timeline

Pginer-WMF created this task. Apr 27 2023, 10:01 AM

Pginer-WMF mentioned this in T331505: Self hosted machine translation service.

Pginer-WMF triaged this task as Medium priority. Apr 27 2023, 10:38 AM

Change 913109 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Enable thanos-swift service mesh

https://gerrit.wikimedia.org/r/913109

gerritbot added a project: Patch-For-Review. Apr 28 2023, 8:19 AM

Change 913109 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Enable thanos-swift service mesh

https://gerrit.wikimedia.org/r/913109

Maintenance_bot removed a project: Patch-For-Review. Apr 28 2023, 12:30 PM

Pginer-WMF added a project: MinT. May 2 2023, 9:08 AM

Amire80 moved this task from Backlog to Infrastructure on the MinT board. May 9 2023, 9:31 AM

• santhosh removed • santhosh as the assignee of this task. Jun 13 2023, 9:36 AM

@MatthewVernon We probably want to store those models in Swift.

Would you like us to file a different task, or is this one enough?

To provide a bit more context, what we are talking about is:

Number of files: 9 in total so far
Sizes: ranging from a few kilobytes to at most 2.3 GB
Update frequency: no update has happened so far, and the service has already been live for a couple of months. I'll let @Pginer-WMF and @santhosh add details on that one.
The workload is present in both eqiad and codfw, so we need the files to eventually appear, in the same version, in both DCs.

I've chatted a bit with @elukey; the setup would be pretty similar to Lift Wing's Thanos Swift container setup, although it's probably best not to reuse that container, for separation-of-concerns reasons.

Thanks!

In T335491#8928309, @akosiaris wrote:

Update frequency: no update has happened so far, and the service has already been live for a couple of months. I'll let @Pginer-WMF and @santhosh add details on that one.

Models are created by external organizations, so the release of new versions is at their discretion. I'd say that updates with new versions are expected on a scale of years rather than months. However, machine learning has been a fast-moving field recently, so who knows what may happen next.

Another piece of context: while NLLB-200 supports many languages with a single model, other projects such as Opus create models for specific language pairs. So we expect more models (of potentially smaller size) to be added in the near future as we expand language coverage. For example, the Opus models supporting the 14 languages listed in T333969 are in our near-term plans (since those languages are not supported by other services).

This service is also designed so that any interested person or organization can simply run it on their own systems. To make that possible, there should be at least one publicly accessible location for these models in addition to our internal storage system.
This model download location should be configurable.

The same applies to developers on our team and to the demo instance we run on wmflabs.
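
One possible way to make the download location configurable, as a minimal sketch: the base URL is read from an environment variable and falls back to the current public location given in the task description. The variable name `MINT_MODELS_BASE_URL`, the function `model_url`, and the file name `model.bin` are hypothetical and are not taken from the MinT code base.

```python
import os

# Default taken from the task description; everything else here is hypothetical.
DEFAULT_MODELS_BASE_URL = "https://people.wikimedia.org/~santhosh/nllb/"
ENV_VAR = "MINT_MODELS_BASE_URL"


def model_url(filename, env=None):
    """Build the download URL for a model file.

    The base location comes from the environment when it is set, so an
    external operator can point the service at their own mirror.
    """
    env = os.environ if env is None else env
    base = env.get(ENV_VAR, DEFAULT_MODELS_BASE_URL)
    return base.rstrip("/") + "/" + filename


if __name__ == "__main__":
    print(model_url("model.bin"))
```

With something like this, a production deployment, a wmflabs demo instance, and an external installation could each set a different base URL without any code change.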

The architecture of MinT is moving towards what we offer with Lift Wing, that is, a standardized way to provide ML model servers in Wikimedia (we have the same goal: to allow the community to test models and, where appropriate, ask for them to be productionized). We are going to solve the same problems over and over while proceeding in parallel. I understand that migrating to Lift Wing would be a big effort for Content Translation, but please keep it in mind for the future.

Eevans subscribed. Jun 15 2023, 7:01 PM

Change 931086 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] machinetranslation: Update people egress

https://gerrit.wikimedia.org/r/931086

gerritbot added a project: Patch-For-Review. Jun 19 2023, 8:45 AM

Change 931086 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Update people egress

https://gerrit.wikimedia.org/r/931086

Maintenance_bot removed a project: Patch-For-Review. Jun 19 2023, 9:11 AM

Mentioned in SAL (#wikimedia-operations) [2023-06-19T09:15:03Z] <kart_> Updated MinT to 2023-06-16-042302-production, Updated people egress (T339271, T335491)

Stashbot mentioned this in T339271: MinT translates to Hindi when English-Santali is selected. Jun 19 2023, 9:15 AM

@akosiaris sure; do you have opinions on what a good username would look like for this use case?

In T335491#8947044, @MatthewVernon wrote:

@akosiaris sure; do you have opinions on what a good username would look like for this use case?

Perfect, thanks! The service is called machinetranslation in our Kubernetes-related stanzas; if we can stick with that, it would be awesome.

Change 931296 had a related patch set uploaded (by MVernon; author: MVernon):

[labs/private@master] thanos: add machinetranslation user

https://gerrit.wikimedia.org/r/931296

Change 931297 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] profile::thanos::swift: add machinetranslation user

https://gerrit.wikimedia.org/r/931297

KartikMistry moved this task from Quarter Backlog to In Progress on the Language-Team (Language-2023-April-June) board. Jun 20 2023, 7:24 AM

Change 931296 merged by MVernon:

[labs/private@master] thanos: add machinetranslation user

https://gerrit.wikimedia.org/r/931296

MatthewVernon mentioned this in rLPRI6b7ac2e421ec: thanos: add machinetranslation user. Jun 20 2023, 2:29 PM

Change 931297 merged by MVernon:

[operations/puppet@production] profile::thanos::swift: add machinetranslation user

https://gerrit.wikimedia.org/r/931297

BartTerpstra awarded a token. Jun 20 2023, 2:30 PM

Account in thanos should be ready now.

Maintenance_bot removed a project: Patch-For-Review. Jun 20 2023, 3:11 PM

Pginer-WMF edited projects, added Language-Team (Language-2023-July-September); removed Language-Team (Language-2023-April-June). Jun 30 2023, 11:46 AM

Pginer-WMF moved this task from Quarter Backlog to In Progress on the Language-Team (Language-2023-July-September) board.

@akosiaris What should be the next step for this task?

Nikerabbit moved this task from In Progress to Blocked on the Language-Team (Language-2023-July-September) board. Sep 6 2023, 7:34 AM

Change 956444 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Add egress mesh template

https://gerrit.wikimedia.org/r/956444

gerritbot added a project: Patch-For-Review. Sep 11 2023, 2:18 PM

Change 956444 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Add egress mesh template

https://gerrit.wikimedia.org/r/956444

Maintenance_bot removed a project: Patch-For-Review. Sep 12 2023, 7:30 AM

Pginer-WMF edited projects, added Language-Team (Language-2023-October-December); removed Language-Team (Language-2023-July-September). Oct 2 2023, 8:59 AM

Pginer-WMF moved this task from Quarter Backlog to Blocked on the Language-Team (Language-2023-October-December) board.

@MatthewVernon What should we do next on this task? Is anything required from the Language team?

The thanos account has been created and is ready to go, the client software needs to be told about it (I assume via a puppet update); I don't think there's anything blocking on Swift here.

From the ticket history it looks like either @elukey or @akosiaris might be the people to do this?

@elukey, @akosiaris What can be the next step for this?

@KartikMistry I think that the MinT Python code should be able to pull the model binary from Swift when bootstrapping, so that we no longer rely on people.wikimedia.org (which is not as highly available as Swift). Lift Wing does this as well; please consider reaching out to us next time so we don't duplicate the same architecture in multiple places :)
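
A minimal sketch of what "pull the model binary when bootstrapping" could look like, using only the Python standard library. It assumes the models are reachable under some HTTP(S) base URL and leaves out authentication and any Swift-specific client; the function name `ensure_model` and its parameters are hypothetical, not MinT's or Lift Wing's actual code.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def ensure_model(base_url, name, models_dir):
    """Return the local path of a model, downloading it only if it is missing."""
    target = Path(models_dir) / name
    if target.exists():
        # Already present, e.g. from an earlier start or a mounted volume.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    url = base_url.rstrip("/") + "/" + name
    # Download into a temporary file in the same directory, then rename, so
    # an interrupted download never leaves a partial file under the final name.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Because the function returns early when the file already exists, the same code path would also work where the models directory is pre-populated.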

@elukey, what do you mean by 'reaching out to you by next time'? Regarding the architecture of MinT and why it is not using Lift Wing, we had that discussion in the past. I don't think it is useful to repeat it. There is a reason why we put the models on people.wikimedia.org: it was per a recommendation from SRE, and this ticket was created to make the storage more reliable. We still need a public location for model downloads, as MinT is not designed for WMF infrastructure alone.

My current understanding is that S3 model volumes (for example, s3://wmf-ml-models/llm/langid/20231011160342/) need to be mounted in the Docker container. For MinT's test instance we already do the same by mounting /mnt/nfs/secondary-scratch/language/models:/app/models, so we don't download the models from people.wikimedia.org every time. I think we need a similar Kubernetes configuration for MinT.

Hi @santhosh, sorry for the lag but I missed the notification!

In T335491#9369595, @santhosh wrote:

@elukey, what do you mean by 'reaching out to you by next time'? Regarding the architecture of MinT and why it is not using Lift Wing, we had that discussion in the past. I don't think it is useful to repeat it.

The suggestion that I made at the time (it has since happened with another model, so the workflow is good now) was for our teams to work together to avoid replicating the same structure in multiple places. I think it is useful to repeat, since originally we had some problems running NLP models on GPUs (because of the Nvidia blocker), but the ctranslate version of the service (CPU only) was a good candidate for a pilot on Lift Wing and we never really worked on it together. This is not to assign any blame, but I warned your team at the time that you would have to solve problems already solved in Lift Wing, like model binary retrieval. It seems a waste to duplicate efforts, which is why I raised the point.

For example, we now have NLLB running on Lift Wing with ctranslate, and we are currently doing some performance evaluations. Would it be viable for your team to work with us to figure out if MinT could delegate the NLLB prediction part to a model server on Lift Wing? It would solve this problem nicely, and improve reusability :)
More info: T351740

There is a reason why we put the models on people.wikimedia.org: it was per a recommendation from SRE, and this ticket was created to make the storage more reliable. We still need a public location for model downloads, as MinT is not designed for WMF infrastructure alone.

If you mean downloading the model binary from the outside Internet, we already have a solution in place. All the models running in Lift Wing are mirrored to https://analytics.wikimedia.org/published/wmf-ml-models/ (together with sha512 checksums).
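
As an illustration of how a downloader could use such sha512 checksums, here is a minimal sketch. The layout of the checksum files on the mirror is not described in this task, so the expected digest is simply passed in as a hex string; `sha512_of` and `verify_model` are hypothetical names.

```python
import hashlib


def sha512_of(path, chunk_size=1024 * 1024):
    """Compute the sha512 hex digest of a file without loading it into memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in fixed-size chunks; model files can be a couple of gigabytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_hex):
    """Return True if the file's sha512 digest matches the published one."""
    return sha512_of(path) == expected_hex.strip().lower()
```

A service could run this check once after download and refuse to start if the digest does not match.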

My current understanding is that S3 model volumes (for example, s3://wmf-ml-models/llm/langid/20231011160342/) need to be mounted in the Docker container. For MinT's test instance we already do the same by mounting /mnt/nfs/secondary-scratch/language/models:/app/models, so we don't download the models from people.wikimedia.org every time. I think we need a similar Kubernetes configuration for MinT.

Lift Wing has a special container that pulls the model binary from Swift when the pod comes up, backed by a replicated endpoint like Thanos Swift. Using NFS in production is convenient for the moment, but NFS is not something that SRE recommends, as it may lead to slowness and a maintenance burden over time. Also, the NFS infrastructure from which you pull the data (I am not sure which one it is yet) is not as replicated as Swift, and surely much slower.

Again, I want to reiterate that I am not trying to assign blame, to imply that Lift Wing is better, or to create competition between teams. If I gave any impression like this, please accept my apology, since it is not what I meant. Lift Wing is new and I am trying to work with other teams to show what it is capable of, and sometimes it is sad to see the same work repeated multiple times. Having said this, MinT is great, so if you don't like the Lift Wing solution, that is fine; please proceed with NFS. But if you are open to discussing alternatives, my team is happy to help :)

Pginer-WMF edited projects, added Language-Team (Language-2024-January-March); removed Language-Team (Language-2023-October-December). Jan 8 2024, 10:35 AM

Pginer-WMF moved this task from Quarter Backlog to Blocked on the Language-Team (Language-2024-January-March) board.

Pginer-WMF edited projects, added Language-Team (Language-2024-April-June); removed Language-Team (Language-2024-January-March). Tue, Apr 2, 3:22 PM

Pginer-WMF moved this task from Quarter Backlog to Blocked on the Language-Team (Language-2024-April-June) board.
