Use community-provided translations when there is an exact match
Open, Medium, Public

Description

Machine translation models such as those used by MinT may not produce the best translations for individual words or short sentences due to the lack of context. For example, the English expression "Hello!" is translated by MinT using NLLB-200 as "- ¡Hola, qué haces?" in Spanish (where "¡Hola!" would be expected instead).

translate.wmcloud.org_(Wiki Tablet).png (768×1 px, 45 KB)

This ticket proposes to complement model-based translation with community-provided translations from Tatoeba (or a similar community) when there is an exact match. That is, given a source sentence to translate (e.g., "Hello!"), before requesting the translation from a model such as NLLB-200, the sentences from Tatoeba will be searched for an exact match in the source language. If such a match exists for the language pair, the translation from Tatoeba will be used. If not, the machine translation model will be used instead.

This approach is expected to provide two key benefits to MinT:

  • Better translations. Tatoeba is more likely to include short sentences and expressions, which is where translation models can be more problematic. So the two approaches seem likely to complement each other well.
  • A more direct way for users to improve translations. When users encounter a translation that is wrong, they could easily contribute a better translation in Tatoeba. Since the exact match approach does not require complex machine learning training, the updated translations from Tatoeba can be incorporated at a much quicker pace. As a result, a user providing an improved translation in Tatoeba is likely to see the translation fixed in a shorter period of time.

This pre-search approach can be provided as an option on the MinT API, making it possible to test models directly when needed.
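
The lookup-then-fallback flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not MinT code: the `translate` function, the `use_community` flag, the in-memory `community` dictionary, and the `fake_model` stand-in are all hypothetical names, and a real integration would need to load and index the Tatoeba data in some other way.

```python
from typing import Callable, Dict, Tuple

# Hypothetical in-memory index of community sentence pairs, keyed by
# (source language, target language, exact source sentence).
CommunityIndex = Dict[Tuple[str, str, str], str]


def translate(
    text: str,
    source_lang: str,
    target_lang: str,
    community: CommunityIndex,
    model_translate: Callable[[str, str, str], str],
    use_community: bool = True,
) -> Tuple[str, str]:
    """Return (translation, origin), where origin is "community" or "model"."""
    if use_community:
        # Exact match only: the whole sentence must match, character by character.
        match = community.get((source_lang, target_lang, text))
        if match is not None:
            return match, "community"
    # No exact match (or the pre-search was disabled): fall back to the model.
    return model_translate(text, source_lang, target_lang), "model"


def fake_model(text: str, source_lang: str, target_lang: str) -> str:
    # Stand-in for a real model call; it only tags the text with the target language.
    return f"[{target_lang}] {text}"


community: CommunityIndex = {("en", "es", "Hello!"): "¡Hola!"}

hit = translate("Hello!", "en", "es", community, fake_model)
miss = translate("Hello there!", "en", "es", community, fake_model)
model_only = translate("Hello!", "en", "es", community, fake_model, use_community=False)
```

Here `hit` comes from the community index, `miss` falls through to the model because the sentence is not an exact match, and `model_only` skips the pre-search so that the model output can be inspected directly.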

Sub-tasks. This work has been structured in separate sub-tasks:


As a reference, the number of sentences supported by Tatoeba in each language gives an idea of the current number of community-provided translations that can be improved in each language. Having this system integrated into MinT (and available for the Wikimedia products using MinT) can also encourage the contribution of more content to Tatoeba in more languages.

Event Timeline

Pginer-WMF triaged this task as Medium priority. Nov 21 2023, 4:53 PM

Notes:

  • Does not work for words within sentences, since that requires “fitting” the selected word into the original sentence while keeping the grammar accurate (in localization we call this “avoiding lego messages”). This limitation also applies to translation memory for phrases (incomplete sentences). The chances of encountering exactly the same sentence again are low in general prose.
  • Examples
    • This is Basil (person) - ഇവൻ (male)/ഇവൾ (female) ബേസിലാണ് (the translation of "Basil" gets fused with the affirmative ആണ്, i.e. agglutination; the same applies below)
    • This is basil (herb) - ഇത് തുളസിയാണ് (grammatical gender does not apply to things)
  • Grammatical gender, subject-verb agreement, agglutination, and inflection are quite common outside the realm of analytic languages
  • The word-context issue is an almost solved problem in the latest translation models; it is only present in MinT due to its limitations regarding context and anaphora resolution. (Broad-context translation with self-attention; in the literature, examples like “interest” and “bank” are often used.)

This proposal is intended to support corrections for exact sentences. If Machine Translation (MT) provides a wrong translation for a specific sentence, a user would have an option to fix that specific translation. That allows users to directly correct the output of MT services. Right now, the options we provide for improving machine translations are not that easy or direct.

This is along the lines of the "anyone can edit" principle, and is similar to editing a wiki page: if a user fixes a typo in one specific sentence, it gets fixed only for that instance (even if similar sentences exist). This is not intended to re-create an MT system, but to add a layer of human corrections on top of it. Using exact sentences may be conservative, but it also seems the most reliable, predictable, and direct approach to let users fix automatic translations.

This proposal is intended to support corrections for exact sentences.
...
if a user fixes a typo in one specific sentence, it gets fixed only for that instance (even if similar sentences exist).

In Wikipedia, what is the chance of a specific sentence appearing again in another place in exactly the same way? According to one study, it is 0.02% for the wikitext-103 corpus. There is a chance of repetition in boilerplate text, templates, or short phrases (for example, proverbs). But the number of combinations you can generate from a huge vocabulary (on the scale of 100k words) with 10 or so words is huge, so the chance of seeing a given sentence again is near 0.
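
To give a rough sense of scale for the combinatorial argument, the arithmetic can be worked out as below. This is a loose upper bound under a simplifying assumption (any vocabulary word may appear at any position), so it overcounts heavily: most such sequences are not grammatical sentences, and real word frequencies are very skewed. It is not a derivation of the 0.02% figure from the study, and the variable names are illustrative only.

```python
# Upper bound on distinct word sequences, assuming (hypothetically) that any
# vocabulary word can appear at any position in the sentence.
vocabulary_size = 100_000  # "on the scale of 100k"
sentence_length = 10       # "10 or so words"

possible_sequences = vocabulary_size ** sentence_length

# Number of decimal digits minus one gives the power of ten.
power_of_ten = len(str(possible_sequences)) - 1
print(f"about 10^{power_of_ten} possible sequences")
```

With these numbers the bound is 10^50 sequences, which is why an exact repeat of a ten-word sentence in running prose is so unlikely.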

That makes sense. I was not disputing that, especially if we think of the particular case of a sentence in the middle of a paragraph of a Wikipedia article. The situation may be different in other contexts such as section titles, talk page messages, documentation, or UI messages. Also, a single sentence can be accessed many times by readers (which motivated caching in MinT in T363308). But I don't think that is, in any case, very relevant for the proposal.

The proposal is intended to allow users to fix one translation at a time and see it fixed immediately. That seems a good benefit, even if the fix cannot be applied to other instances. With this approach we can tell users "if it is wrong, you can fix it immediately". If there are alternative approaches that allow users to directly fix a machine translation in a simple and immediate way, we can consider them, and it would be useful to compare their pros and cons with this proposal.