As part of Meta's No Language Left Behind project, the NLLB-200 neural machine translation models (previously named Flores) have been released with an open source license. The model is currently available in Content Translation for a set of 23 languages (T307970), including several historically underserved languages, such as Swati and Tswana, that are not supported by other translation services. A recent report analyzing the different translation services available suggests that the translation quality provided by the NLLB-200 model is good:
Overall across all languages, NLLB-200 currently has the lowest percent of articles created with content translation that are deleted (0.13%) compared to all other MT services available, while it has the highest percent of translations modified under 10%, indicating that the modification rates for this machine translation service are a signal of good machine translation quality.
This is consistent with requests from communities such as Igbo and Icelandic to use the service as default over the current alternatives. NLLB-200 is in a unique position: it is open source, provides good quality, and supports a high number of languages, all at the same time. This makes it a key resource to better support access to knowledge by anyone, regardless of the language they speak. This ticket proposes to create an instance to run this model for the currently supported languages and more.
Current status, opportunities and challenges
Currently, the model is accessed as an external service using an API that the research team at Meta provided to test the models (more details). Creating our own instance to run the model will allow us to:
- Reduce dependencies on the current service provided by Meta, making sure it is available for the long term.
- Reduce dependencies on the external services available such as Google and Yandex, which cover hundreds of languages not supported by the existing open source systems available such as Apertium.
- Support more languages. The model provides support for 200, but only 23 languages are exposed through the current API. Supporting more languages with machine translation is often requested by the Wikimedia communities (T86700), including requests for languages that are supported by the NLLB-200 model but not available in the current API (e.g., Santali).
- Expand the use of machine translation to other Wikimedia products (e.g., multilingual talk pages).
Hosting and running the model may present some challenges. Based on previous analysis, GPU acceleration (Nvidia-based in particular) is needed to run the models with the required level of performance. Given that the drivers to access those GPUs are not yet open source, we may need to explore ways to avoid technical issues due to potential vulnerabilities, so that the lack of fully open hardware does not prevent us from supporting anyone's access to knowledge in any language.
As with any other machine translation service, the current NLLB-200 API is integrated in a way that only publicly available Wikipedia content is exchanged, without any personal user information. Only the returned translated content (which is sanitized) is consumed back. The future system can work as isolated as needed from the rest of the infrastructure.
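The data exchange described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the field names, the `build_request` and `sanitize` helpers, and the tag-stripping approach are all hypothetical and do not reflect the actual Content Translation integration or the API provided by Meta.

```python
import html
import re

# Hypothetical payload fields: only language codes and public article text.
ALLOWED_FIELDS = ("source_language", "target_language", "content")

# Naive pattern for markup-like fragments; an assumption made for this
# sketch, not the sanitizer used in production.
_TAG_RE = re.compile(r"<[^>]*>")


def build_request(source_language, target_language, content):
    """Build a request payload that carries no user-related information."""
    return {
        "source_language": source_language,
        "target_language": target_language,
        "content": content,
    }


def sanitize(translated):
    """Strip markup-like fragments from the returned text, then escape the rest."""
    text = _TAG_RE.sub("", translated)
    return html.escape(text, quote=True)


if __name__ == "__main__":
    payload = build_request("en", "ig", "Public article text.")
    print(sorted(payload))
    print(sanitize("<b>Hello</b> & welcome"))
```

Keeping the payload limited to a fixed set of fields makes it easy to check that nothing else leaves the infrastructure, and sanitizing on the way back means the translation service does not need to be trusted to return safe content.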