Machine translation models such as those used by MinT may not produce the best translations for individual words or short sentences due to the lack of context. For example, the English expression "Hello!" is translated by MinT using NLLB-200 as "- ¡Hola, qué haces?" in Spanish (where "¡Hola!" would be expected instead).
This ticket proposes complementing model-based translation with community-provided translations from Tatoeba (or a similar community) when there is an exact match. That is, given a source sentence to translate (e.g., "Hello!"), before requesting a translation from a model such as NLLB-200, the Tatoeba sentences will be searched for an exact match in the source language. If such a match exists for the language pair, the translation from Tatoeba will be used. If not, the machine translation model will be used instead.
This approach is expected to provide two key benefits to MinT:
- Better translations. Tatoeba is more likely to include short sentences and expressions, which is where translation models tend to be most problematic, so the two approaches could complement each other well.
- A more direct way for users to improve translations. When users encounter a translation that is wrong, they could easily contribute a better translation in Tatoeba. Since the exact-match approach does not require complex machine learning training, the updated translations from Tatoeba can be incorporated at a much quicker pace. As a result, a user providing an improved translation in Tatoeba is more likely to see the translation fixed within a short period of time.
This pre-search approach can be provided as an option in the MinT API, so that it can be turned off to test the models directly when needed.
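As an illustration of the proposed flow, the following is a minimal sketch in Python, not the MinT implementation. All names are hypothetical: the `translate` function, the `use_memory` option, the in-memory dictionary standing in for the Tatoeba data, and the `stub_model` function standing in for NLLB-200. The language codes and the lack of any text normalization before matching are also assumptions.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical translation memory: (source_lang, target_lang, source_text)
# maps to a community-provided translation. A real service would build this
# from Tatoeba data; the single entry below is for illustration only.
TranslationMemory = Dict[Tuple[str, str, str], str]


def lookup_exact(
    memory: TranslationMemory, source_lang: str, target_lang: str, text: str
) -> Optional[str]:
    """Return the stored translation only if the source text matches exactly."""
    return memory.get((source_lang, target_lang, text))


def translate(
    text: str,
    source_lang: str,
    target_lang: str,
    memory: TranslationMemory,
    model_translate: Callable[[str, str, str], str],
    use_memory: bool = True,
) -> Tuple[str, str]:
    """Try the translation memory first, then fall back to the model.

    Returns the translation and a label saying where it came from.
    Setting use_memory=False skips the lookup, so the model can be
    tested directly.
    """
    if use_memory:
        match = lookup_exact(memory, source_lang, target_lang, text)
        if match is not None:
            return match, "translation-memory"
    return model_translate(text, source_lang, target_lang), "model"


def stub_model(text: str, source_lang: str, target_lang: str) -> str:
    """Stand-in for a machine translation model (assumption, not NLLB-200)."""
    canned = {"Hello!": "- ¡Hola, qué haces?"}
    return canned.get(text, f"[model translation of: {text}]")


memory: TranslationMemory = {("en", "es", "Hello!"): "¡Hola!"}

with_memory = translate("Hello!", "en", "es", memory, stub_model)
model_only = translate("Hello!", "en", "es", memory, stub_model, use_memory=False)
no_match = translate("Good morning, everyone.", "en", "es", memory, stub_model)
```

Returning the origin of each translation alongside the text is a design choice of this sketch; it would let API clients tell community-provided results apart from model output.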
Sub-tasks. This work has been structured in separate sub-tasks:
- T351872: Create a library/service that serves matches from Tatoeba
- T351875: Integrate Tatoeba translation memory into MinT
As a reference, the number of sentences supported by Tatoeba in each language gives an idea of the current number of community-provided translations that can be improved in each language. Having this system integrated into MinT (and available to the Wikimedia products using MinT) can also encourage the contribution of more content to Tatoeba in more languages.