
Compile multilingual resources for translation models in collaboration with communities
Open, Medium, Public

Description

Machine learning translation models, such as the ones used in MinT, provide better translations when they have more training data. The Opus project is compiling freely licensed multilingual data to train models of this kind. This ticket proposes collaborating with different Wikipedia communities to identify good-quality, freely licensed multilingual resources for their languages that can be used to train translation models.

Any language could benefit from better translation quality, but we'll give priority to languages for which no machine translation is available or for which quality seems low, especially those that show signs of usage under these circumstances, since that is where a quality improvement is expected to have the biggest impact. The list in T343340 can be a good starting point.

Test sets

In addition to identifying existing resources, communities can help define aligned test sets for checking translation quality. Test data can use the same format as Flores and related test sets. We can consider contributing to the OLDI initiative.
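
As a rough illustration of what "aligned" means in practice, the sketch below assumes the test data is stored the way Flores is commonly distributed: one plain-text file per language, one sentence per line, where line N in every file is a translation of the same sentence. The function names and the file names in the usage note are hypothetical and are not part of any existing tool; this is only a minimal check a community could run before submitting a test set.

```python
from pathlib import Path


def read_lines(path):
    """Read a UTF-8 text file and return its lines without line endings."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def load_aligned_pairs(source_lines, target_lines):
    """Pair line i of the source with line i of the target.

    Assumption: both inputs follow a one-sentence-per-line layout, so a
    different line count or an empty line means the files are misaligned.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"line count mismatch: {len(source_lines)} source lines "
            f"vs {len(target_lines)} target lines"
        )
    pairs = []
    for number, (src, tgt) in enumerate(zip(source_lines, target_lines), start=1):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            raise ValueError(f"empty sentence on line {number}")
        pairs.append((src, tgt))
    return pairs
```

With two such files on disk (hypothetical names), the check would be called as `load_aligned_pairs(read_lines("source.devtest"), read_lines("target.devtest"))`. It only validates the layout; it says nothing about whether the translations themselves are correct.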