Background
The NLLB-200 machine translation system, provided by a research team from Meta (Facebook), was running on AWS hosting managed by Meta as a temporary solution. Recently we migrated it to an AWS account managed by WMF (T321781). This allowed us to keep supporting the initial set of communities, several of which had no previous machine translation options. However, the budget constraints of this approach prevent us from using the machine translation system to its full potential. Hosting this system directly on Wikimedia infrastructure was not an option because of its dependency on NVIDIA GPUs and hence non-free CUDA drivers.
A recent exploration by @santhosh found an alternative mechanism that achieves the same or better performance using only CPUs. This is done through a one-time conversion of the model with the help of CTranslate2, which optimizes the model for inference in low-processor and low-memory settings. A version of this is running at https://translate.wmcloud.org/; it provides good translation performance, but it runs on a cloud VM.
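The one-time conversion step described above can be sketched as follows. This is a hedged illustration, not the team's actual script: the Hugging Face checkpoint name, output path, and int8 quantization choice are assumptions (CTranslate2 supports several NLLB checkpoint sizes and quantization modes).

```python
# Sketch of the one-time NLLB -> CTranslate2 conversion.
# Requires `pip install ctranslate2 transformers`; the checkpoint name and
# output directory below are illustrative assumptions.
def convert_nllb_to_ctranslate2(
    model_name: str = "facebook/nllb-200-distilled-600M",  # hypothetical choice
    output_dir: str = "nllb-200-ct2",
) -> str:
    """Convert a transformers NLLB checkpoint into CTranslate2's format.

    int8 quantization shrinks the on-disk and in-memory footprint, which
    helps CPU-only inference.
    """
    import ctranslate2  # imported lazily: only needed when actually converting

    converter = ctranslate2.converters.TransformersConverter(model_name)
    converter.convert(output_dir, quantization="int8")
    return output_dir
```

The converted model directory is then loaded at serving time with `ctranslate2.Translator`, which is what makes CPU-only inference practical.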
The WMF Language team would like to host this system in production.
Diagram
Requirements
The model is exposed as a translation web API using a Python Flask service, wrapped in Gunicorn (source code). It creates multiple workers, one worker per CPU core. The current NLLB model is a single ~3GB file that supports all 200 languages. This model needs to be loaded into memory; from experiments, 32GB of RAM is the baseline requirement, but more RAM means lower computation latency.
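The one-worker-per-CPU-core setup can be sketched as a Gunicorn configuration file. The bind address and timeout here are illustrative assumptions, not the service's actual settings:

```python
# gunicorn_conf.py -- minimal sketch of the worker-per-core setup.
# Values other than the worker count are hypothetical.
import multiprocessing

workers = multiprocessing.cpu_count()  # one Gunicorn worker per CPU core
bind = "0.0.0.0:8000"                  # hypothetical listen address
timeout = 120                          # generous timeout: model inference can be slow
```

One sizing consideration: by default each worker process loads its own copy of the model; setting `preload_app = True` loads it once in the master before forking, so workers can share pages copy-on-write, which matters against the 32GB RAM baseline.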
To get an idea of the current request count and response latency, here are our metrics from the AWS system:
As you can see, the traffic consists of sporadic spikes rather than a uniform load. About 20 languages (smaller languages) are served by the system.
Moving this in-house gives us an opportunity to increase language coverage and to move some of the language pairs currently served by Google's paid API to this system, so we need to leave room for that opportunistic exploration too.
Timeline
Since the current AWS hosting is not budget friendly, we would like to complete this migration as soon as possible.
Ownership
As with the in-house hosting of the Apertium machine translation system, once the system is up and running, we expect the Language team's DevOps staff to monitor it and act as the point of contact for it.
(Created after an initial conversation with @akosiaris)

