Machine translation models such as [those used by MinT](https://www.mediawiki.org/wiki/MinT#About_MinT) may not produce the best translations for individual words or short sentences due to the lack of context. For example, the English expression "Hello!" is translated by MinT using NLLB-200 as "- ¡Hola, qué haces?" in Spanish (where "¡Hola!" would be expected instead).
{F41523111, position=float}
This ticket proposes complementing model-based translation with community-provided translations from [Tatoeba](https://tatoeba.org/) (or a similar community) when there is an exact match. That is, given a source sentence to translate (e.g., "Hello!"), before requesting the translation from a model such as NLLB-200, the sentences from Tatoeba will be searched for [an exact match in the source language](https://tatoeba.org/en/sentences/show/373330). If such a match exists for the language pair, [the translation from Tatoeba](https://tatoeba.org/en/sentences/show/1502706) will be used. If not, the machine translation model will be used instead.
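The lookup-then-fallback flow described above could be sketched roughly as follows. This is a minimal illustration, not MinT code: the names `build_index`, `translate`, and `model_translate`, the in-memory dictionary, and the `eng`/`spa` language codes are all assumptions made for the example. It also assumes the Tatoeba data has already been loaded as (source language, source text, target language, target text) tuples.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical index: (source_lang, target_lang, source_text) -> translation
ExactMatchIndex = Dict[Tuple[str, str, str], str]


def build_index(pairs: Iterable[Tuple[str, str, str, str]]) -> ExactMatchIndex:
    """Build an exact-match index from (src_lang, src_text, tgt_lang, tgt_text) tuples."""
    index: ExactMatchIndex = {}
    for src_lang, src_text, tgt_lang, tgt_text in pairs:
        # Assumption: when several translations exist, keep the first one seen.
        index.setdefault((src_lang, tgt_lang, src_text), tgt_text)
    return index


def translate(
    text: str,
    src_lang: str,
    tgt_lang: str,
    index: ExactMatchIndex,
    model_translate: Callable[[str, str, str], str],
) -> str:
    """Return the community translation on an exact match, else ask the model."""
    match = index.get((src_lang, tgt_lang, text))
    if match is not None:
        return match
    return model_translate(text, src_lang, tgt_lang)


def fake_model(text: str, src_lang: str, tgt_lang: str) -> str:
    # Stand-in for a real model call such as NLLB-200 (hypothetical stub).
    return f"[model:{tgt_lang}] {text}"


index = build_index([("eng", "Hello!", "spa", "¡Hola!")])
hello = translate("Hello!", "eng", "spa", index, fake_model)
other = translate("Hello there, friend!", "eng", "spa", index, fake_model)
```

Here `hello` comes from the community index and `other` falls through to the model stub. Whether to normalize whitespace, case, or punctuation before the lookup is left open; the sketch compares strings exactly as given.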
This approach is expected to provide two key benefits to MinT:
- **Better translations.** Tatoeba is more likely to include short sentences and expressions, which is where translation models tend to be most problematic. The two approaches therefore seem likely to complement each other well.
- **A more direct way for users to improve translations.** When users encounter a translation that is wrong, they could easily contribute a better translation in Tatoeba. Since the exact match approach does not require complex machine learning training, the updated translations from Tatoeba can be incorporated at a much quicker pace. As a result, a user who contributes an improved translation to Tatoeba is more likely to see the translation fixed within a short period of time.
This pre-search approach can be provided as an option on the MinT API, making it possible to test models directly when needed.
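One way such an option might look is sketched below. The parameter name `use_community_matches`, the function `translate_request`, and the `source` field in the result are hypothetical and do not reflect the actual MinT API. They only illustrate how a caller could bypass the pre-search to test a model directly.

```python
from typing import Callable, Dict, Mapping, Tuple


def translate_request(
    text: str,
    src_lang: str,
    tgt_lang: str,
    community_index: Mapping[Tuple[str, str, str], str],
    model_translate: Callable[[str, str, str], str],
    use_community_matches: bool = True,  # hypothetical option name
) -> Dict[str, str]:
    """Translate text, optionally skipping the community exact-match pre-search."""
    if use_community_matches:
        match = community_index.get((src_lang, tgt_lang, text))
        if match is not None:
            # Report where the translation came from (an assumption of this sketch).
            return {"translation": match, "source": "community"}
    return {"translation": model_translate(text, src_lang, tgt_lang), "source": "model"}


def stub_model(text: str, src_lang: str, tgt_lang: str) -> str:
    # Stand-in for the real model call (hypothetical stub).
    return f"[model:{tgt_lang}] {text}"


community = {("eng", "spa", "Hello!"): "¡Hola!"}
with_presearch = translate_request("Hello!", "eng", "spa", community, stub_model)
model_only = translate_request(
    "Hello!", "eng", "spa", community, stub_model, use_community_matches=False
)
```

With the option enabled, `with_presearch` is served from the community index. With it disabled, `model_only` always reflects the raw model output, which keeps direct model testing possible.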
___
As a reference, the [number of sentences supported by Tatoeba in each language](https://tatoeba.org/en/stats/sentences_by_language) gives an idea of the current number of community-provided translations that can be improved in each language.