
Compile requirements for a general system to make the most of community-provided translations
Open, Medium, Public

Description

Machine translation is useful but far from perfect. The translations obtained may be wrong, and there is no direct way to fix them so that the same mistake is avoided the next time. Community-validated translations can help compensate for some of the issues of machine translation and improve the final quality (T351748).

This ticket is intended to explore the different requirements that can inform the creation of a community-provided translation resource as a complement to machine translation.

Over time there have been multiple requests for translation tools in this space. These include translation memory, dictionaries, glossaries, and options for users to correct a specific translation and get it right the next time (T96165).

Scenarios to support. All these approaches can operate at different levels:

  • Language. A user obtains a translation with issues (grammar issues, different meaning, etc.) and decides to provide a better translation. The next time, the system provides the better translation.
  • Community. Communities can set their preferences to use certain terms to achieve consistency in the translations.
  • Topic. When translating the article about the emperor Basil I, MT renders the name as the herb of the same name in each instance in the article. A correction tied to the topic would fix every instance.
  • User. Translators may have their own preferences and style, kept as their own "Manual translation correction book" (T339907).
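
One possible way to picture how the four levels could interact is a lookup that prefers the most specific matching correction. The sketch below is only an illustration: the `Correction` record, the `resolve` function, the precedence order (user, then topic, then community, then language), and the sample data are all assumptions, not something this ticket has decided.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical precedence, most specific first. The ticket does not define an order.
SCOPES = ("user", "topic", "community", "language")


@dataclass(frozen=True)
class Correction:
    scope: str        # one of SCOPES
    scope_id: str     # e.g. a user name, an article title, a wiki, a language pair
    source_text: str
    target_text: str


def resolve(corrections: list[Correction], source_text: str,
            context: dict[str, str]) -> Optional[Correction]:
    """Return the most specific correction that matches the context, or None."""
    for scope in SCOPES:
        scope_id = context.get(scope)
        if scope_id is None:
            continue
        for c in corrections:
            if (c.scope, c.scope_id, c.source_text) == (scope, scope_id, source_text):
                return c
    return None


# Illustrative data only (English to Spanish).
EXAMPLE = [
    Correction("language", "en-es", "Basil", "albahaca"),
    Correction("topic", "Basil I", "Basil", "Basilio"),
]

in_article = resolve(EXAMPLE, "Basil", {"language": "en-es", "topic": "Basil I"})
elsewhere = resolve(EXAMPLE, "Basil", {"language": "en-es"})
```

Here `in_article` picks the topic-level entry and `elsewhere` falls back to the language-level one. Whether a user preference should really override a community preference is an open question for the requirements.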

Initial considerations (to be reviewed and expanded):

  • Multiple sources. Community-created translations can come from Tatoeba, Translatewiki, Wikipedia article translations, etc.
  • Predictable priority. When multiple sources have a translation, it should be predictable which one is used, and we should be able to know where it is coming from.
  • A way for users to fix translations. It should be clear where users can go to add a new translation, or fix/invalidate an existing one.
  • Periodic updates. Keeping this data fresh will help users to feel their fixes are applied quickly. Merging updated data from the different sources should work.
  • Usable by different products. This system should be generic enough to be helpful to different tools that can benefit from this resource.
  • Integrated in MinT (with option to bypass when needed). MinT will be one of the primary integrations. For users of MinT, the result should be just getting better translations. No additional set-up or decision making when using MinT is expected.
  • No data sharing with external systems. MinT comes with the expectation that the data (source and translation) is processed only inside Wikimedia infrastructure.
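
The "multiple sources", "predictable priority" and "bypass" considerations above can be sketched together as a single lookup. This is a minimal sketch under stated assumptions: the `Hit` type, the `lookup` function, the source identifiers, the priority order, and the sample entries are hypothetical, and the stores are plain in-memory dictionaries so that nothing leaves the local process.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    translation: str
    source: str  # provenance: which source supplied the translation


# Hypothetical priority order; the ticket leaves the actual order open.
SOURCE_PRIORITY = ["content_translation", "translatewiki", "tatoeba"]


def lookup(stores: dict[str, dict[str, str]], text: str,
           bypass: bool = False) -> Optional[Hit]:
    """Return the first match in priority order, tagged with its source.

    With bypass=True the community data is skipped entirely, so the caller
    would fall back to plain machine translation.
    """
    if bypass:
        return None
    for name in SOURCE_PRIORITY:
        translation = stores.get(name, {}).get(text)
        if translation is not None:
            return Hit(translation, name)
    return None


# Illustrative local data only (English to Spanish).
STORES = {
    "tatoeba": {"hello": "hola", "thanks": "gracias"},
    "translatewiki": {"hello": "hola", "save": "guardar"},
}

hit = lookup(STORES, "hello")
skipped = lookup(STORES, "hello", bypass=True)
```

Because the order is a fixed list, the same input always resolves to the same source, and the `source` field lets a user see where a translation came from when they want to fix or invalidate it.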

Sources to incorporate:

  • Tatoeba
  • Wikidata lexemes
  • Content Translation data
  • Localization infrastructure data (Translate extension on Wikimedia sites, and translatewiki.net)