
New Service Deployment Request: NLLB-200 for machine translation
Closed, Resolved · Public

Description

Background

The NLLB-200 machine translation system, provided by a research team from Meta (Facebook), was initially running on AWS hosting managed by Meta as a temporary solution. Recently we migrated it to an AWS account managed by WMF (T321781). This allowed us to keep supporting the initial set of communities, several of which had no previous machine translation options. However, the budget constraints of this approach prevent us from using the machine translation system to its full potential. Hosting this system directly on Wikimedia infrastructure was not an option because of its dependency on NVIDIA GPUs and hence non-free CUDA drivers.

A recent exploration by @santhosh discovered an alternative mechanism to get the same or better performance using just CPUs. This is achieved by a one-time conversion of the model with the help of CTranslate2, which optimizes it for inference in low-processor and low-memory settings. A version of this is running at https://translate.wmcloud.org/; it provides good translation performance, but it runs on a cloud VM.
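For reference, CTranslate2 ships a converter for Hugging Face transformers checkpoints. A sketch of the one-time conversion step described above, assuming the public `facebook/nllb-200-distilled-600M` checkpoint (the model actually used in production may differ):

```shell
# One-time conversion of an NLLB checkpoint to the CTranslate2 format.
# Assumes the ctranslate2, transformers and sentencepiece packages are
# installed; the checkpoint name here is an assumption, not necessarily
# the production model.
pip install ctranslate2 transformers sentencepiece

ct2-transformers-converter \
    --model facebook/nllb-200-distilled-600M \
    --output_dir nllb-200-ct2 \
    --quantization int8   # optional: int8 further reduces CPU/RAM needs
```

The resulting directory can then be loaded with `ctranslate2.Translator("nllb-200-ct2", device="cpu")` for CPU-only inference.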

The WMF Language team would like to host this system in production.

Diagram

image.png (832×872 px, 69 KB)

Requirements

The model is exposed as a translation web API using a Python Flask service, wrapped in gunicorn (source code). It creates multiple workers, one worker per CPU core. The current NLLB model is a single ~3 GB file that supports all 200 languages. This model needs to be loaded into memory; from experiments, 32 GB is the baseline RAM required, but more RAM means lower computation latency.
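The worker-per-core setup described above is usually expressed in a gunicorn config file. A minimal sketch (file name and exact tuning are assumptions; the service's repo may differ):

```python
# gunicorn.conf.py - minimal sketch of a worker-per-CPU-core setup.
# With a ~3 GB model per worker, worker count times model size drives
# the RAM baseline unless the app preloads the model in the master
# process and shares it with workers via fork (copy-on-write).
import multiprocessing

workers = multiprocessing.cpu_count()  # one worker per CPU core
preload_app = True  # load the model once before forking workers
timeout = 120       # translation requests can be slow on CPU
```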

To get an idea of the current request count and response latency, here are our metrics from the AWS system:

image.png (620×913 px, 48 KB)

As you can see, the load consists of sporadic spikes rather than a uniform load. About 20 (smaller) languages are served by the system.

But moving this in-house gives us an opportunity to increase language coverage and to move some of the language pairs currently served by Google's paid API to this system. So we need to make room for that opportunistic exploration too.

Timeline

Since the current AWS hosting is not budget-friendly, we would like to complete this migration as soon as possible.

Ownership

Similar to the hosting of the in-house Apertium machine translation system, once this system is up and running we expect the Language team's DevOps staff to monitor it and act as its point of contact.

(Created after an initial conversation with @akosiaris)

Event Timeline

Thanks a lot for the write-up @santhosh!

Since the service is written in Python, I am wondering if we could host it on Lift Wing (the new ML infra that should replace ORES). Lift Wing is basically a Kubernetes cluster that runs KServe, a new framework for model serving that should (in theory) ease the process of deploying models.

The Research team is already collaborating with us; for example, the Revert Risk model is currently deployed via a dedicated [[ https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/revert-risk-model/model-server/model.py | model-server ]]. The overall idea is:

  • We host the model on Swift, and each k8s pod fetches it when bootstrapping.
  • The model server Python class requires the user to implement two methods, preprocess and predict (feature calculation and model prediction, basically).
  • If needed, the model server code could be exposed via api.wikimedia.org (still a WIP feature).
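The two-method contract above could be sketched as follows. This is a schematic illustration, not the actual Revert Risk or NLLB server: the `Model` base class here stands in for `kserve.Model` so the snippet is self-contained, and the translation itself is stubbed.

```python
# Schematic of the preprocess/predict split used by Lift Wing model
# servers. "Model" is a stand-in for kserve.Model so the sketch runs
# without KServe installed; names and payload shapes are assumptions.
class Model:  # stand-in for kserve.Model
    def __init__(self, name: str):
        self.name = name
        self.ready = False

class NllbModel(Model):
    def load(self) -> None:
        # Real code would fetch the ~3 GB model (e.g. from Swift) and
        # load it (e.g. with CTranslate2) before marking itself ready.
        self.translator = lambda tokens, tgt: tokens  # placeholder
        self.ready = True

    def preprocess(self, payload: dict) -> dict:
        # Feature calculation: tokenize the source text.
        return {
            "tokens": payload["text"].split(),
            "target_lang": payload["target_lang"],
        }

    def predict(self, inputs: dict) -> dict:
        tokens = self.translator(inputs["tokens"], inputs["target_lang"])
        return {"translation": " ".join(tokens)}

model = NllbModel("nllb-200")
model.load()
out = model.predict(model.preprocess({"text": "hello world", "target_lang": "es"}))
```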

Given the heavy requirements of the model, we'd need dedicated k8s workers for sure, but it may be a compromise to avoid another dedicated fleet of nodes. I am going to add the Machine-Learning tag to this task so my team is aware, and I'll follow up with @akosiaris to understand his ideas/suggestions/etc. (maybe KServe is not really needed, etc.).

Adding some things for transparency. We had a meeting that mapped out the next few steps (some need to be done regardless of where we'll host the service):

The current request patterns indicate a spiky nature of up to ~200 rps, followed by very long periods of idleness. We expect to be able to smooth over the traffic.

While there are some questions regarding how exactly we will size instances (pods), the largest possible pod we can host (taking over an entire node) appears to be able to serve ~24 rps.
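As a back-of-the-envelope check of these numbers (peak ~200 rps, ~24 rps per maximally sized pod):

```python
# Rough capacity arithmetic for an unsmoothed traffic peak.
import math

peak_rps = 200        # observed spike, per the AWS metrics above
pod_rps = 24          # largest pod, taking over an entire node
pods_for_peak = math.ceil(peak_rps / pod_rps)
print(pods_for_peak)  # 9
```

This is why smoothing the spikes matters: sizing for the (much lower) average load needs far fewer pods than sizing for the raw peak.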

The service won't be externally reachable; it is only going to be used internally by cxserver.

A rough plan for the next steps (the order isn't strict; things can be done in parallel):

  1. Adding Prometheus metrics support to the app
  2. Adding structured logging per ECS to the app
  3. Adding a /healthz endpoint for readiness probes (the app is expected to take some time to load and expand the 3 GB model in memory)
  4. Adding the capability for the app to fetch the 3 GB model from some HTTP location (what exactly is still a question; people.wikimedia.org or, per @elukey's suggestion, Swift are good candidates, the latter vastly preferred down the line)
  5. Moving the repo to Gerrit, where the deployment pipeline is enabled
  6. Enabling the deployment pipeline and getting the first container built
  7. Getting a new helm chart using the create_service.sh script of the deployment-charts repo
  8. Validating that the helm chart works, using that first container
  9. Getting a namespace on the proper cluster to deploy the service into
  10. Deploying (first deploy with help; subsequent deploys will be done by the Language team)
  11. QA tests
  12. Announce
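Step 3 above (a readiness endpoint) might look like the following. This is a framework-agnostic stdlib sketch rather than the service's actual Flask code; the key point is returning 503 until the model has finished loading, so kubernetes does not route traffic to a pod that is still expanding the 3 GB model into memory.

```python
# Minimal /healthz readiness endpoint: 503 while the model is still
# loading, 200 once it is ready. Stdlib-only sketch; the real service
# would expose the same route from its Flask app.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

model_ready = threading.Event()  # set once the model is loaded

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        code = 200 if model_ready.is_set() else 503
        body = b"ok" if code == 200 else b"loading"
        self.send_response(code)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the sketch quiet
        pass

def serve(port: int = 0) -> HTTPServer:
    # port=0 asks the OS for a free port; useful for local testing.
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```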

I had a chat with @elukey on Friday regarding this. To summarize, nothing of the above changes, and for now we target the wikikube cluster (while leaving open the option to revisit whether ml-serve is a better target down the line; a migration should be relatively easy anyway). It's better to host the model on Swift (we already do so for other models anyway).
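Step 4 above (fetching the model over HTTP at bootstrap) can be sketched with the stdlib. The URL and checksum here are placeholders; the real service would point at the Swift object and verify its published digest.

```python
# Download-if-missing fetch of a model file at startup, with a SHA-256
# check so a truncated download is not silently loaded. URL and digest
# are placeholders for the real Swift object.
import hashlib
import urllib.request
from pathlib import Path
from typing import Optional

def fetch_model(url: str, dest: Path, sha256: Optional[str] = None) -> Path:
    if not dest.exists():
        tmp = dest.with_suffix(".part")
        urllib.request.urlretrieve(url, tmp)  # works for http(s):// and file:// URLs
        tmp.rename(dest)  # only expose the file once fully downloaded
    if sha256 is not None:
        h = hashlib.sha256()
        with dest.open("rb") as f:
            # Hash in 1 MiB chunks; a ~3 GB model must not be read at once.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != sha256:
            dest.unlink()
            raise ValueError(f"checksum mismatch for {dest}")
    return dest
```

Writing to a `.part` file and renaming it means a pod killed mid-download never leaves a half-written file at the final path.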

akosiaris renamed this task from Hosting machine request for machine translation to New Service Deployment Request: NLLB-200 for machine translation. Mar 9 2023, 11:25 AM

I've transformed this (roughly) into a Service-deployment-requests task and changed the title to reflect that.

Pginer-WMF claimed this task.

Since MinT was launched, the service has been running in support of Content and Section Translation.