
Next steps for machine translation
Closed, Resolved (Public)


Type of activity: Pre-scheduled session
Main topic: Artificial Intelligence to build and navigate content

The problem

We would like to make all our content available to all readers in all the languages of the world.
Machine translation capabilities are greatly improving. Further, we potentially possess the Rosetta Stone for building excellent machine translation tools: a large and continually growing archive of parallel texts.

A virtuous cycle is possible, where every article or part of an article which is translated by a user trains our machine translation software to make better automatic translations and suggestions, so that the rote work of updating translation A when translation B has been edited is increasingly automated.

This is a hard problem, in general, and it's not entirely clear whether WMF should build the machine translation expertise (or software) in house. But there are a number of initial steps we could take in the near-ish term to pave the way.

Expected outcome

Consensus on a vision for machine translation as an integral part of our projects in the future.
Consensus on practical initial steps to take.

Current status of the discussion

  • This was discussed as a potential Main Topic for WikiDev17
  • Amir and I discussed this during the October Editing offsite, but there was no formal session.
  • Our proposed first steps (perhaps a stepping-off point for this session):
    1. Export CX translation pairs in an appropriate format for Moses training data.
    2. Add part-of-speech info to Wikidata interlanguage link relations.
    3. Export interlanguage links in an appropriate format for an Apertium dictionary.
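
Steps 1 and 3 above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing exporter: it assumes the CX pairs are already available as (source, target) sentence strings, that Moses is fed two line-aligned plain-text files (one segment per line, with a language-code extension), and that an Apertium bilingual dictionary entry pairs a left and right lemma with a part-of-speech symbol. The function names `write_moses_corpus` and `apertium_entry` are hypothetical, multiword lemmas are not handled, and tokenization or cleaning of the corpus would be a separate step.

```python
from pathlib import Path
from xml.sax.saxutils import escape


def write_moses_corpus(pairs, out_dir, src_lang, tgt_lang, stem="corpus"):
    """Write (source, target) pairs as two line-aligned text files.

    Hypothetical helper: line N of the source file corresponds to
    line N of the target file.
    """
    out_dir = Path(out_dir)
    src_path = out_dir / f"{stem}.{src_lang}"
    tgt_path = out_dir / f"{stem}.{tgt_lang}"
    kept = 0
    with src_path.open("w", encoding="utf-8") as src, \
            tgt_path.open("w", encoding="utf-8") as tgt:
        for source, target in pairs:
            # Collapse whitespace so every segment stays on a single line.
            source = " ".join(source.split())
            target = " ".join(target.split())
            if not source or not target:
                continue  # an empty side would break the line alignment
            src.write(source + "\n")
            tgt.write(target + "\n")
            kept += 1
    return src_path, tgt_path, kept


def apertium_entry(left, right, pos):
    """Format one bilingual dictionary entry in .dix-style XML.

    Hypothetical helper: `pos` is assumed to be a simple symbol name
    such as "n", applied to both sides of the entry.
    """
    tag = '<s n="{}"/>'.format(escape(pos, {'"': "&quot;"}))
    return "<e><p><l>{}{}</l><r>{}{}</r></p></e>".format(
        escape(left), tag, escape(right), tag
    )
```

For example, `write_moses_corpus(pairs, "out", "en", "fr")` would produce `out/corpus.en` and `out/corpus.fr` (the directory must already exist), and `apertium_entry("house", "casa", "n")` would produce a single `<e>` element. Step 2 matters for the second helper: without part-of-speech info on the link relations there is nothing to put in the `<s>` symbol.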


Event Timeline

Qgil added subscribers: Halfak, Qgil.

Can you add projects/tags related to the topics discussed, please?

Terminological note: We have a comparable corpus (texts on the same topics in different languages), while a parallel corpus would require them to be equivalent translations.

@Nikerabbit that's true currently, although CX-translated texts are probably closer to parallel corpora. But on the other hand, we have no good reason to *require* parallel corpora right now, and in fact encourage the opposite: minority-language wikis should feel ownership of their content, be free to change it, etc.

*But* the bigger picture of this proposal is that we can start to make the translation aspect part of our core content, and part of the service we provide. You are not *required* to construct parallel corpora, but if you *do* then we can provide some aids to your wiki: (a) your work can improve machine translation of your language, and (b) if edits are made to the source material we can semi-automatically suggest edits to keep your translation in parallel.

For "big" wikis like (say) enwiki/dewiki, they probably won't see the value of this, and will continue to create comparable corpora, not parallel corpora.

But I've worked with smaller wikis who are overwhelmed by the amount of work tracking changes in our bigger projects and keeping their little wikis up to date. They would probably find the trade-off to be a good one, and they would be contributing to the open source training corpora for their language at the same time, possibly benefiting other projects as well.

It would be up to the individual wikis and individual authors. So long as we can track "translations which are intended to be parallel" and provide authors a means to opt in to this, we can start providing better services on a translation-by-translation basis.
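
To make aid (b) a bit more concrete, here is a minimal sketch of how stale parts of an opted-in translation might be flagged. It rests on assumptions that are not part of the proposal itself: the source article is available as a list of sentences before and after an edit, and the translation was aligned one-to-one with the old source sentences. The function name `stale_translation_segments` is hypothetical, and it only reports which segments need attention; suggesting the actual edit would be the machine translation system's job.

```python
from difflib import SequenceMatcher


def stale_translation_segments(old_source, new_source):
    """Compare two revisions of the source, sentence by sentence.

    Returns (stale, added):
      stale - indices into old_source whose sentences were changed or
              removed, so the aligned translation segments need review
      added - indices into new_source holding changed or new sentences
              that still need translating
    """
    matcher = SequenceMatcher(a=old_source, b=new_source, autojunk=False)
    stale, added = [], []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op in ("replace", "delete"):
            stale.extend(range(a_start, a_end))
        if op in ("replace", "insert"):
            added.extend(range(b_start, b_end))
    return stale, added


old = ["A.", "B.", "C."]
new = ["A.", "B changed.", "C.", "D."]
stale, added = stale_translation_segments(old, new)
```

Here the second sentence of the old source was edited and a fourth was appended, so the translator would be pointed at one existing segment to revise and two source sentences to translate. Translations that were never meant to be parallel would simply not be tracked this way.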

Discussed briefly at

My "big picture" vision here is that we start using our machine translation tools to tie our projects more tightly together, so we feel more like "one project aided by a bunch of babel fish" and less like "a thousand separate projects, each in their own tower".

(Mangling the tower of babel metaphor a bit, I hope you'll forgive me.)

@cscott Hey! As the Developer Summit is less than four weeks away, we are working on a plan to incorporate the ‘unconference sessions’ that have been proposed so far, as well as those that will be generated on the spot. Could you confirm whether you plan to facilitate this session at the summit? If your answer is ‘YES’, I would like to encourage you to update and arrange the task description fields to appear in the following format:

Session title
Main topic
Type of activity
Description Move ‘The Problem’, ‘Expected Outcome’, ‘Current status of the discussion’ and ‘Links’ to this section
Proposed by Your name linked to your MediaWiki URL, or profile elsewhere on the internet
Preferred group size
Any supplies that you would need to run the session e.g. post-its
Interested attendees (sign up below)

  1. Add your name here

We will be reaching out to the summit participants next week asking them to express their interest in unconference sessions by signing up.

To maintain consistency, please consider referring to the template of the following task description:

Pginer-WMF triaged this task as Medium priority. Apr 11 2018, 3:10 PM
Pginer-WMF moved this task from Backlog to Other teams/Watching on the Language-Team board.
Pginer-WMF subscribed.

This seems to be a session for an event in 2017. Currently, we are exploring the integration of an open-source neural machine translation service. More details in this ticket: T234194: Explore the integration of OpusMT

@cscott, since the ticket seems related to that specific event, which is over now, I think it is better to close it, but feel free to reopen if needed.