
Machine translation learned by comparing content across Wikipedia languages
Closed, Resolved · Public

Description

Build a machine translation engine using the manual and semi-manual translations of Wikipedia articles.

What it does:

  • Learns from Wikipedia translations across languages and builds translation models from the resulting parallel corpora for machine translation.

Wiki things it helps with:

  • Content translation project https://www.mediawiki.org/wiki/Content_translation
    • Stephen: it would be neat if AI could automatically identify words (within context) that were previously translated identically. For example, the translation of the word "Wikimedia", once made, is probably identical regardless of context. The translation of the word "build" is not; AI could identify usages with identical context. This is all done manually on TranslateWiki by volunteers right now.
  • TWN has a translation memory based on previous translations, but it could perhaps be enhanced by using parallel corpora built from translations
  • Niklas: perhaps this AI could enhance existing MT by selecting locally translated expressions

Things that might help us get this AI built:

Event Timeline

I don't think it makes sense to develop something that learns from Wikipedia translations when existing tools already do this; what is missing is rather the integration/adoption of those tools. Why can one not use DeepL with the Content Translation tool? Moreover, manual translation aided by machine translation is also problematic: after translation, the source article can get edited a lot, and those changes are not synced to the target article, which can remain flawed, outdated, or incomplete.

I think what's proposed here would be better integrated into a machine translation system that automatically resolves a wikilink to the target-language article name instead of machine-translating the link text. But first, please add the capability to use DeepL with the tools rather than only Google Translate.
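To make the wikilink-resolution idea concrete, here is a minimal sketch. The interlanguage mapping is simulated with a hard-coded dictionary; in practice it could come from Wikidata sitelinks or the MediaWiki langlinks API. The function and variable names are hypothetical, not an existing tool's API:

```python
import re

# Stand-in for real interlanguage-link data (en title -> uk title).
LANGLINKS = {
    "Kyiv": "Київ",
    "Dnieper": "Дніпро",
}

def resolve_wikilinks(wikitext: str, langlinks: dict) -> str:
    """Replace [[Title]] links with the target-language title when known.

    Piped links like [[Title|label]] are deliberately left untouched in
    this sketch; unresolved titles are also kept as-is.
    """
    def repl(m):
        title = m.group(1)
        target = langlinks.get(title)
        return f"[[{target}]]" if target else m.group(0)
    return re.sub(r"\[\[([^\]|]+)\]\]", repl, wikitext)

print(resolve_wikilinks("The [[Dnieper]] flows through [[Kyiv]].", LANGLINKS))
# prints: The [[Дніпро]] flows through [[Київ]].
```

The point is that the link target is looked up, not translated, so it always lands on the actual article in the target wiki.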

Also, reusing how things have been translated previously would introduce flaws wherever the correct translation differs because the context differs. I think machine translation is one of the most important tasks to work on, but what's proposed here doesn't seem feasible or useful to develop at this stage, with other parts still missing. You may be interested in this study, by the way.

In short, such tools already exist, and there's no need for Wikimedia to spend resources on training AI systems when advanced translation systems already do so. We could just use them and then adapt them, for example to correctly translate wikilinks and the ref template syntax.

Well, DeepL seems to make errors in wiki markup when tried on wikitext, although at first glance it shows better performance (in English-to-Ukrainian translation) than Google Translate. Is it capable of prompt engineering or in-context learning?

I gave up using Wikimedia's Translation Tool for the English-to-Ukrainian direction because Google Translate's performance is very poor in this direction, and because the tool itself is pretty unstable and glitchy. Instead, I use ChatGPT+ at the wikitext level with some prompt engineering and in-context learning from manual corrections of its per-section translations, and sometimes from suggested samples from other translations (which might include a previous translation of the same article, when I'm updating an outdated translation).

I assume that performing per-section translation using an LLM (maybe, though not necessarily, GPT) with RAG over manual corrections of previous sections and other existing translations (discovered not only by the Translation Tool itself, but also via templates like Шаблон:Перекладена стаття, "Translated article") could provide better results than the current solution. There may also be a need for some manual tuning of the RAG, e.g. giving more weight to Wikipedia categories for a better context match.
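The retrieval step of this RAG idea can be sketched as follows. This is a minimal illustration, assuming a crude token-overlap heuristic stands in for a real embedding-based retriever; all function names, the tiny "translation memory", and the prompt wording are illustrative assumptions:

```python
# Retrieve previously corrected section translations most similar to the new
# source section, then assemble a prompt that supplies them as in-context
# examples for the LLM.

def token_overlap(a: str, b: str) -> float:
    """Crude similarity: Jaccard overlap of lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def retrieve_examples(source_section, memory, k=2):
    """memory: list of (source_text, corrected_translation) pairs."""
    ranked = sorted(memory, key=lambda p: token_overlap(source_section, p[0]),
                    reverse=True)
    return ranked[:k]

def build_prompt(source_section, memory):
    examples = retrieve_examples(source_section, memory)
    parts = [
        "Translate the following Wikipedia section from English to Ukrainian.",
        "Follow the terminology used in these previously corrected translations:",
    ]
    for src, tgt in examples:
        parts.append(f"Source: {src}\nTranslation: {tgt}")
    parts.append(f"Source: {source_section}\nTranslation:")
    return "\n\n".join(parts)

# Tiny in-memory "translation memory" of manually corrected pairs:
memory = [
    ("The city was founded in 1256.", "Місто було засноване 1256 року."),
    ("The river flows north.", "Річка тече на північ."),
]
prompt = build_prompt("The city was founded by merchants.", memory)
```

A production version would replace `token_overlap` with embedding similarity and could weight matches by Wikipedia category overlap, as suggested above.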

@OlexaRiznyk I don't think so but good questions. These are really interesting insights.

Maybe you and other subscribers to this issue could contribute to this wishlist proposal I just submitted:

https://meta.wikimedia.org/wiki/Community_Wishlist/Wishes/Wikipedia_Machine_Translation_Project

It is about building a machine translation system to translate WP at scale, rather than largely manually at a small scale; for example, ways to correctly translate a wikitext template into the equivalent template in another language's WP. I figured it's not enough to hint at the emerging possibility here and there and hope that somebody other than me gets this off the ground. I think people should think about and work on this sooner rather than later, as time is running out before some other organization sets something like this up, when it should be the Wikimedia community. If you're interested in it, have ideas for next steps, or have any ideas on how to improve the proposal, please comment on its talk page.

Pginer-WMF claimed this task.

I'm inclined to close this ticket. Recently, MinT was created. This translation service uses models including those from OpusMT, which already integrates the published translations from Content Translation. In addition, the LPL team has plans to expand MinT's capabilities to incorporate community-provided translations (T351748). So we already have translation models incorporating the manual translations created by Wikipedia translators, and there are plans to continue work on this front as part of collaborations with the external entities that develop those models.

For exposing machine translation to readers, we launched the experimental feature MinT for Wiki Readers (T359072). The MVP will be useful for understanding how helpful this can be and for learning from the experience for future improvements.