Next steps for machine translation
Closed, Resolved · Public

Assigned To

Authored By

	cscott
	Nov 1 2016, 5:54 AM

Description

Type of activity: Pre-scheduled session
Main topic: Artificial Intelligence to build and navigate content

The problem

We would like to make all our content available to all readers in all the languages of the world.
Machine translation capabilities are improving greatly. Further, we potentially possess the Rosetta Stone for building excellent machine translation tools: a large and continually growing archive of parallel texts.

A virtuous cycle is possible, where every article or part of an article which is translated by a user trains our machine translation software to make better automatic translations and suggestions, so that the rote work of updating translation A when translation B has been edited is increasingly automated.

This is a hard problem, in general, and it's not entirely clear whether WMF should build the machine translation expertise (or software) in house. But there are a number of initial steps we could take in the near-ish term to pave the way.

Expected outcome

Consensus on a vision for machine translation as an integral part of our projects in the future.
Consensus on practical initial steps to take.

Current status of the discussion

This was discussed as a potential Main Topic for WikiDev17.
Amir and I discussed this during the October Editing offsite, but there was no formal session.
Our proposed first steps (perhaps a stepping-off point for this session):
1. Export CX translation pairs in an appropriate format for Moses training data.
2. Add part-of-speech info to Wikidata interlanguage link relations.
3. Export interlanguage links in an appropriate format for an Apertium dictionary.
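
As a rough illustration of steps 1 and 3, here is a minimal Python sketch. It assumes the CX translation pairs are already available as (source, target) text pairs; the real CX export format is not specified in this task, and the function names, file stem, and language codes are hypothetical. Moses training data is conventionally two plain-text files aligned line by line, and an Apertium bilingual dictionary entry pairs a left and a right lemma with a part-of-speech symbol, which is where the data from step 2 would be used.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def write_moses_corpus(pairs, src_lang, tgt_lang, out_dir, stem="corpus"):
    """Write (source, target) pairs as two line-aligned plain-text files.

    Line i of <stem>.<src_lang> corresponds to line i of <stem>.<tgt_lang>.
    The input shape is an assumption, not the actual CX dump format.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    src_path = out_dir / f"{stem}.{src_lang}"
    tgt_path = out_dir / f"{stem}.{tgt_lang}"
    written = 0
    with open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for source, target in pairs:
            # Collapse whitespace so each segment occupies exactly one line.
            source = " ".join(source.split())
            target = " ".join(target.split())
            if not source or not target:
                continue  # drop pairs with an empty side
            src.write(source + "\n")
            tgt.write(target + "\n")
            written += 1
    return src_path, tgt_path, written


def bidix_entry(left, right, pos):
    """Build one Apertium-style bilingual dictionary entry as an XML string.

    Simplification: multi-word lemmas are not handled here.
    """
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for tag, lemma in (("l", left), ("r", right)):
        side = ET.SubElement(pair, tag)
        side.text = lemma
        ET.SubElement(side, "s", {"n": pos})  # part-of-speech symbol, e.g. "n"
    return ET.tostring(entry, encoding="unicode")
```

For example, `write_moses_corpus(pairs, "en", "de", "out")` would produce `out/corpus.en` and `out/corpus.de`, and `bidix_entry("dog", "Hund", "n")` would produce a single `<e>` element that could be placed in a dictionary section.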

Related Objects

Status	Assigned	Task
Resolved	Qgil	T153007 Technical Collaboration annual plan FY2017-18
Resolved	Qgil	T159313 Draft WMF annual plan program about technical events
Resolved	Qgil	T149300 Future of the Wikimedia Developer Summit
Resolved	• Rfarrand	T153996 Wikimedia Developer Summit 2017: Feedback Survey
Resolved	• Rfarrand	T141926 Wikimedia Developer Summit 2017
Resolved	Qgil	T141938 Prepare a program for Wikimedia Developer Summit 2017 to effectively address current high level movement needs
Resolved	Halfak	T147708 Facilitate Wikidev'17 main topic "Artificial Intelligence to build and navigate content"
Resolved	cscott	T149666 Next steps for machine translation

Event Timeline

cscott created this task. · Nov 1 2016, 5:54 AM

Restricted Application added a subscriber: Aklapper. · Nov 1 2016, 5:54 AM

cscott mentioned this in T149575: Mediawiki extensions to reach the next billion. · Nov 3 2016, 9:24 PM

Can you add projects/tags related to the topics discussed, please?

cscott added projects: Language-Team, ContentTranslation, Wikidata. · Nov 10 2016, 8:23 PM

cscott moved this task from Missing basic information to Missing proven interest on the Wikimedia-Developer-Summit (2017) board. · Nov 10 2016, 8:39 PM

Terminological note: We have comparable corpora (texts on the same topics in different languages), whereas parallel corpora would require the texts to be equivalent translations.

@Nikerabbit that's true currently, although CX-translated texts are probably closer to parallel corpora. But on the other hand, we have no good reason to *require* parallel corpora right now, and in fact encourage the opposite: minority-language wikis should feel ownership of their content, be free to change it, etc.

*But* the bigger picture of this proposal is that we can start to make the translation aspect part of our core content, and part of the service we provide. You are not *required* to construct parallel corpora, but if you *do* then we can provide some aids to your wiki: (a) your work can improve machine translation of your language, and (b) if edits are made to the source material we can semi-automatically suggest edits to keep your translation in parallel.

"Big" wikis like (say) enwiki/dewiki probably won't see the value of this, and will continue to create comparable corpora, not parallel corpora.

But I've worked with smaller wikis who are overwhelmed by the amount of work tracking changes in our bigger projects and keeping their little wikis up to date. They would probably find the trade-off to be a good one, and they would be contributing to the open source training corpora for their language at the same time, possibly benefiting other projects as well.

It would be up to the individual wikis and individual authors. So long as we can track "translations which are intended to be parallel" and provide authors a means to opt in to this, we can start providing better services on a translation-by-translation basis.
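
To make the tracking idea concrete, here is a minimal Python sketch under stated assumptions: every name in it is hypothetical, and it does not describe how ContentTranslation stores anything today. It records which source revision a translation was made from, plus the author's opt-in flag, and flags the translation for a suggested update once the source has moved on.

```python
from dataclasses import dataclass


@dataclass
class TranslationLink:
    """Hypothetical record tying a translated page to its source page."""
    source_title: str
    target_title: str
    translated_from_rev: int   # source revision the translation was based on
    intended_parallel: bool    # True only if the author opted in


def needs_update(link: TranslationLink, latest_source_rev: int) -> bool:
    """Suggest an update only for opted-in translations whose source changed."""
    return link.intended_parallel and latest_source_rev != link.translated_from_rev
```

Translations that never opted in are left alone, which keeps the "comparable corpora" case untouched.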

cscott updated the task description. · Nov 11 2016, 10:50 PM

Psychoslave awarded a token. · Nov 15 2016, 3:56 PM

Psychoslave subscribed.

Siznax subscribed. · Nov 15 2016, 6:56 PM

Addshore awarded a token. · Nov 16 2016, 3:11 PM

Addshore subscribed.

cscott added a parent task: T147708: Facilitate Wikidev'17 main topic "Artificial Intelligence to build and navigate content". · Nov 17 2016, 10:39 PM

Discussed briefly at https://lists.wikimedia.org/pipermail/wikimedia-l/2016-November/085545.html:

My "big picture" vision here is that we start using our machine translation tools to tie our projects more tightly together, so we feel more like "one project aided by a bunch of babel fish" and less like "a thousand separate projects, each in their own tower".

(Mangling the tower of babel metaphor a bit, I hope you'll forgive me.)

Sebastian_Berlin-WMSE subscribed. · Nov 18 2016, 7:37 AM

Lydia_Pintscher subscribed. · Nov 18 2016, 12:41 PM

Halfak awarded a token. · Nov 18 2016, 3:13 PM

Zack subscribed. · Nov 18 2016, 10:15 PM

Basvb subscribed. · Nov 19 2016, 6:53 AM

• ZhouZ awarded a token. · Nov 21 2016, 6:29 PM

Amire80 moved this task from Needs Triage to Long term on the ContentTranslation board. · Nov 23 2016, 11:34 AM

WMDE-leszek subscribed. · Nov 28 2016, 9:24 AM

Qgil moved this task from Missing proven interest to Proposed Unconference Sessions on the Wikimedia-Developer-Summit (2017) board. · Nov 29 2016, 11:11 AM

Qgil moved this task from Proposed Unconference Sessions to Unconference candidates on the Wikimedia-Developer-Summit (2017) board. · Dec 13 2016, 11:38 AM

@cscott Hey! As the Developer Summit is less than four weeks away, we are working on a plan to incorporate the ‘unconference sessions’ that have been proposed so far and those that will be generated on the spot. Could you confirm whether you plan to facilitate this session at the summit? If your answer is 'YES', I would encourage you to update and arrange the task description fields in the following format:

Session title
Main topic
Type of activity
Description: Move ‘The Problem’, ‘Expected Outcome’, ‘Current status of the discussion’ and ‘Links’ to this section
Proposed by: Your name linked to your MediaWiki URL, or profile elsewhere on the internet
Preferred group size
Any supplies that you would need to run the session, e.g. post-its
Interested attendees (sign up below):

Add your name here

We will be reaching out to the summit participants next week asking them to express their interest in unconference sessions by signing up.

To maintain consistency, please refer to the template in the following task description: https://phabricator.wikimedia.org/T149564.

Pginer-WMF triaged this task as Medium priority. · Apr 11 2018, 3:10 PM

Pginer-WMF moved this task from Backlog to Other teams/Watching on the Language-Team board.

Pginer-WMF removed a project: Language-Team. · Apr 11 2018, 3:15 PM

Addshore moved this task from incoming to monitoring on the Wikidata board. · Sep 19 2018, 8:03 AM

Arrbee moved this task from Long term to Check & Move on the ContentTranslation board. · Jan 20 2020, 8:12 AM

Restricted Application added a subscriber: alaa. · Jan 20 2020, 8:12 AM

This seems to be a session for an event in 2017. Currently, we are exploring the integration of an open-source neural machine translation service. More details are in this ticket: T234194: Explore the integration of OpusMT

@cscott, since the ticket seems related to that specific event, which is over now, I think it is better to close it, but feel free to reopen if needed.
