Provide better long-term storage for translation models
Open, Medium, Public

Description

As part of the exploration of a self-hosted translation service (T331505), the question arose of where to store the models. Currently, machine learning models for translation are stored at https://people.wikimedia.org/~santhosh/nllb/ and https://people.wikimedia.org/~santhosh/opusmt.

However, since the ML cluster uses Swift, storing them there may be more resilient long-term.

Event Timeline

Change 913109 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Enable thanos-swift service mesh

https://gerrit.wikimedia.org/r/913109

Change 913109 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Enable thanos-swift service mesh

https://gerrit.wikimedia.org/r/913109

akosiaris added a subscriber: MatthewVernon.

@MatthewVernon We probably want to store those models in Swift.

Would you like us to file a different task, or is this one enough?

To provide a bit more context. What we are talking about is:

  • # of files: 9 total up to now
  • Sizes: ranging from a few kilobytes to at most 2.3GB.
  • Updating frequency: none so far; the models have been live for a couple of months. I'll let @Pginer-WMF and @santhosh add details on that one.
  • The workload is present in both eqiad and codfw, so we need the files to eventually appear, at the same version, in both DCs.

I've chatted a bit with @elukey; the setup would be pretty similar to Lift Wing's thanos-swift container setup, although it's probably best not to re-use that container, for separation-of-concerns reasons (a sketch of what the upload could look like follows below).

Thanks!
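For reference, a minimal sketch of what uploading a model to a dedicated thanos-swift container could look like using python-swiftclient. The auth URL, account, and container name below are placeholder assumptions, not production values:

```python
# Hedged sketch: upload one model file to a dedicated Swift container.
# Endpoint, credentials, and container name are assumptions for illustration.
from swiftclient.client import Connection

conn = Connection(
    authurl="https://thanos-swift.discovery.wmnet/auth/v1.0",  # assumed endpoint
    user="machinetranslation:prod",                            # hypothetical account
    key="REDACTED",
    auth_version="1",
)

# A dedicated container (rather than re-using Lift Wing's) keeps the
# separation of concerns mentioned above.
conn.put_container("machinetranslation-models")

with open("nllb200-600M.bin", "rb") as fh:
    conn.put_object(
        "machinetranslation-models",
        "nllb/nllb200-600M.bin",
        contents=fh,
        chunk_size=64 * 1024 * 1024,  # stream files up to ~2.3GB without buffering
    )
```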

  • Updating frequency: none so far; the models have been live for a couple of months. I'll let @Pginer-WMF and @santhosh add details on that one.

Models are created by external organizations, so the release of new versions is at their discretion. I'd say that updates with new versions are expected on the scale of years rather than months. However, machine learning has been a fast-moving field recently, so who knows what may happen next.

Another piece of context: while NLLB-200 supports many languages with a single model, other projects such as Opus create models for specific language pairs. So we expect more models (of potentially smaller size) to be added in the near future as we expand language coverage. For example, the Opus models supporting the 14 languages listed in T333969 are in our near-term plans (since those languages are not supported by other services).

This service is also designed so that any interested person or organization can run it on their own systems. To make that happen, there should be at least one publicly accessible location for these models in addition to our internal storage system.
This model download location should be configurable.

This is also the case for developers on our team and for the demo instance we run on wmflabs.
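A tiny sketch of the "configurable download location" idea: internal deployments would point an environment variable at Swift, while external users and the wmflabs demo fall back to a public mirror. The variable name and default URL are illustrative assumptions, not existing MinT settings:

```python
# Hypothetical configuration hook: MODEL_STORAGE_URL overrides where
# models are downloaded from; the default is the current public location.
import os

DEFAULT_PUBLIC_LOCATION = "https://people.wikimedia.org/~santhosh/nllb"

def model_base_url() -> str:
    """Base URL for model downloads, overridable per deployment."""
    return os.environ.get("MODEL_STORAGE_URL", DEFAULT_PUBLIC_LOCATION)
```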

The architecture of MinT is moving towards what we offer with Lift Wing, that is, a standardized way to provide ML model servers at Wikimedia (we share the same goal: allowing the community to test models and ask for their productionization where appropriate). Proceeding in parallel means we will solve the same problems over and over. I understand that migrating to Lift Wing would be a big effort for Content Translation, but please keep it in mind for the future.

Change 931086 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] machinetranslation: Update people egress

https://gerrit.wikimedia.org/r/931086

Change 931086 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Update people egress

https://gerrit.wikimedia.org/r/931086

Mentioned in SAL (#wikimedia-operations) [2023-06-19T09:15:03Z] <kart_> Updated MinT to 2023-06-16-042302-production, Updated people egress (T339271, T335491)

@akosiaris sure; do you have opinions on what a good username would look like for this use case?

Perfect, thanks! The service is called machinetranslation in our Kubernetes-related stanzas; if we can stick with this, it would be awesome.

Change 931296 had a related patch set uploaded (by MVernon; author: MVernon):

[labs/private@master] thanos: add machinetranslation user

https://gerrit.wikimedia.org/r/931296

Change 931297 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] profile::thanos::swift: add machinetranslation user

https://gerrit.wikimedia.org/r/931297

Change 931296 merged by MVernon:

[labs/private@master] thanos: add machinetranslation user

https://gerrit.wikimedia.org/r/931296

Change 931297 merged by MVernon:

[operations/puppet@production] profile::thanos::swift: add machinetranslation user

https://gerrit.wikimedia.org/r/931297

Account in thanos should be ready now.

@akosiaris What should be the next step for this task?

Change 956444 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] machinetranslation: Add egress mesh template

https://gerrit.wikimedia.org/r/956444

Change 956444 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Add egress mesh template

https://gerrit.wikimedia.org/r/956444

@MatthewVernon What should we do next on this task? Is anything required from the Language team?

The thanos account has been created and is ready to go, the client software needs to be told about it (I assume via a puppet update); I don't think there's anything blocking on Swift here.

From the ticket history it looks like either @elukey or @akosiaris might be the people to do this?

@KartikMistry I think that the MinT Python code should be able to pull the model binary from Swift when bootstrapping, so that we no longer rely on people.wikimedia.org (which is not as highly available as Swift). Lift Wing does this as well; please consider reaching out to us next time so we don't duplicate the same architecture in multiple places :)
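A hedged sketch of what "pull the model binary from Swift when bootstrapping" could look like in the MinT Python code. The endpoint, credentials, and object layout are assumptions, not the real configuration:

```python
# Hedged sketch: download a model object from Swift at startup unless a
# cached copy already exists. Names and endpoint are illustrative.
import pathlib
from swiftclient.client import Connection

MODEL_DIR = pathlib.Path("/app/models")

def fetch_model(container: str, object_name: str) -> pathlib.Path:
    target = MODEL_DIR / object_name
    if target.exists():  # already fetched on a previous boot
        return target
    conn = Connection(
        authurl="https://thanos-swift.discovery.wmnet/auth/v1.0",  # assumed
        user="machinetranslation:prod",  # hypothetical account
        key="REDACTED",
        auth_version="1",
    )
    # resp_chunk_size makes get_object return an iterator of chunks,
    # avoiding loading a multi-GB model into memory at once.
    _headers, body = conn.get_object(container, object_name,
                                     resp_chunk_size=64 * 1024 * 1024)
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "wb") as fh:
        for chunk in body:
            fh.write(chunk)
    return target
```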

@elukey, what do you mean by 'reaching out to you next time'? Regarding the architecture of MinT and why it is not using Lift Wing, we had that discussion in the past; I don't think it is useful to repeat it. There is a reason why we put the models on people.wikimedia.org: it was per a recommendation from SRE, and this ticket was created to make the storage more reliable. We still need a public location for model downloads, as MinT is not designed for WMF infrastructure alone.

My current understanding is that s3 model volumes (for example, s3://wmf-ml-models/llm/langid/20231011160342/) need to be mounted into the Docker container. For MinT's test instance we already do something similar by mounting /mnt/nfs/secondary-scratch/language/models:/app/models, so we don't download the models from people.wikimedia.org every time. I think we need a similar Kubernetes configuration for MinT.
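A minimal sketch of that pattern, assuming the /app/models mount path quoted from the test instance; the fetch fallback is hypothetical:

```python
# Hedged sketch: prefer a model already provided by a volume mount (NFS
# or an s3 volume), and only fall back to downloading when it is absent.
import pathlib
from typing import Callable, Optional

MODEL_DIR = pathlib.Path("/app/models")  # mount target used by the test instance

def ensure_model(name: str,
                 fetch: Optional[Callable[[str], pathlib.Path]] = None) -> pathlib.Path:
    path = MODEL_DIR / name
    if path.exists():  # served by the mounted volume, nothing to download
        return path
    if fetch is None:
        raise FileNotFoundError(f"{path} is not mounted and no fetcher was given")
    return fetch(name)  # e.g. a Swift download like the sketch above
```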

Hi @santhosh, sorry for the lag but I missed the notification!

@elukey, what do you mean by 'reaching out to you next time'? Regarding the architecture of MinT and why it is not using Lift Wing, we had that discussion in the past; I don't think it is useful to repeat it.

The suggestion I made at the time (it has already happened with another model, so the workflow is good now) was for our teams to work together to avoid replicating the same structure in multiple places. I think it is useful to repeat: originally we had some problems running NLP models on GPUs (because of the Nvidia blocker), but the ctranslate version of the service (CPU only) was a good candidate for a pilot on Lift Wing, and we never really worked on it together. This is not to assign any blame, but I warned your team at the time that you'd have to solve problems already solved in Lift Wing, like model binary retrieval. It seems a waste to duplicate efforts; that is why I raised the point.

For example, we now have NLLB running on Lift Wing with ctranslate, and we are currently doing some perf evaluations. Would it be viable for your team to work with us to figure out if MinT could delegate the NLLB prediction part to a model server on Lift Wing? It would solve this problem nicely and improve reusability :)
More info: T351740

There is a reason why we put the models on people.wikimedia.org: it was per a recommendation from SRE, and this ticket was created to make the storage more reliable. We still need a public location for model downloads, as MinT is not designed for WMF infrastructure alone.

If you mean downloading the model binary from the public Internet, we already have a solution in place: all the models running on Lift Wing are mirrored to https://analytics.wikimedia.org/published/wmf-ml-models/ (plus sha512 checksums, etc.).
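A short sketch of consuming that public mirror: download a model and verify it against its published sha512 checksum. The exact file layout under wmf-ml-models/ is an assumption based on the comment above:

```python
# Hedged sketch: fetch a model from the public mirror and verify its
# sha512. The relative path and expected checksum would come from the
# mirror's published listings; names here are illustrative.
import hashlib
import urllib.request

MIRROR = "https://analytics.wikimedia.org/published/wmf-ml-models"

def download_and_verify(rel_path: str, expected_sha512: str, dest: str) -> None:
    urllib.request.urlretrieve(f"{MIRROR}/{rel_path}", dest)
    digest = hashlib.sha512()
    with open(dest, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # 1 MiB at a time
            digest.update(chunk)
    if digest.hexdigest() != expected_sha512:
        raise ValueError(f"sha512 mismatch for {rel_path}")
```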

My current understanding is that s3 model volumes (for example, s3://wmf-ml-models/llm/langid/20231011160342/) need to be mounted into the Docker container. For MinT's test instance we already do something similar by mounting /mnt/nfs/secondary-scratch/language/models:/app/models, so we don't download the models from people.wikimedia.org every time. I think we need a similar Kubernetes configuration for MinT.

Lift Wing has a special container that pulls the model binary from Swift when the pod comes up, backed by a replicated endpoint like Thanos Swift. Using NFS in production is convenient for the moment, but NFS is not something SRE suggests using, as it may lead to slowness and a maintenance burden over time. Also, the NFS infrastructure you pull the data from (not sure which one it is yet) is not as replicated as Swift, and surely much slower.

Again, I want to reiterate that I am not trying to assign blame, to imply that Lift Wing is better, or to set up a competition between teams. If I gave any impression like that, please accept my apology; it is not what I meant. Lift Wing is new, and I am trying to work with other teams to show what it is capable of; sometimes it is sad to see the same work repeated multiple times. Having said this, MinT is great, so if you don't like the Lift Wing solution, that is fine; please proceed with NFS. But if you are open to discussing alternatives, my team is really open to helping :)