
Explore the integration of MarianMT
Open, Medium, Public

Description

MarianMT is an open source neural machine translation framework, and the OPUS project is building language models for MarianMT based on its corpora. The OPUS project compiles a parallel corpus of translation examples, including those created with Content translation.

Integrating this project into Content translation (and other Wikimedia projects) would provide new opportunities to expand the use of machine translation to new languages and new use cases. This would be the first approach that is both open source and based on neural machine translation, which sets it apart from the existing options. This makes it possible, for example, to feed user corrections made with Content translation back into the system to improve translation quality.

This ticket proposes to explore the possibility of such an integration by defining the initial steps to follow, including the technical aspects to evaluate, among other considerations.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 30 2019, 10:31 AM
Pginer-WMF triaged this task as Medium priority. · Sep 30 2019, 10:32 AM
santhosh updated the task description. · Oct 1 2019, 3:54 AM

As a start, I attempted to make the development and deployment of the system simple. I created a Docker container that takes care of setting up the system: https://github.com/Helsinki-NLP/Opus-MT/pull/1
It needs more work. My aim is to get a web server that abstracts the MT engines and their complexities, and exposes a web API that does translation.

santhosh added a comment. · Edited · Oct 23 2019, 12:00 PM

Today I built a web frontend for MarianMT with OPUS data models:

  • Uses Tornado to fork subprocesses and communicate with them
  • Defines a configuration for language pairs and their models
  • Provides a simple web interface where a language pair can be selected and content to translate can be submitted
  • Defines a web API at the /api URL that takes the from, to, and source parameters in the body of a POST request and returns JSON with the translated content under the translation key
  • Removes all existing Python scripts and replaces them with server.py
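
As a rough illustration of the /api contract described above, here is a minimal client sketch. It assumes the POST body is JSON-encoded (the body could equally be form-encoded; this has not been checked against server.py), and the helper names build_request, parse_response, and translate are hypothetical, not part of the pull request.

```python
import json
import urllib.request

# Endpoint path taken from the description above; the host is the test instance.
API_URL = "http://opusmt.wmflabs.org/api"


def build_request(source_lang, target_lang, text, url=API_URL):
    """Build a POST request carrying the from, to and source parameters.

    Assumption: the server reads a JSON object from the request body.
    """
    payload = {"from": source_lang, "to": target_lang, "source": text}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(body):
    """Extract the translated content from the JSON response body."""
    return json.loads(body)["translation"]


def translate(source_lang, target_lang, text, url=API_URL):
    """Send the request and return the translated text."""
    request = build_request(source_lang, target_lang, text, url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_response(response.read().decode("utf-8"))
```

Calling translate("en", "es", "Hello") would send the request to the running instance; build_request and parse_response are kept separate so the payload shape can be checked without network access.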

I submitted a pull request at https://github.com/Helsinki-NLP/Opus-MT/pull/2, but it is not ready to merge yet. I need to do some testing of the Docker images.

Here is a screenshot of the web interface with an en-es translation.

This is running at http://opusmt.wmflabs.org/

Observations

  1. The prepared language models from https://object.pouta.csc.fi/OPUS-MT produce translations, but the results need a lot of improvement before we can start using them in our use cases. We need to work with upstream to see what can be done.
  2. The Python Tornado-based web interface and API are not merged upstream. The version used for http://opusmt.wmflabs.org/ is https://github.com/santhoshtr/Opus-MT
  3. This issue affects all Indic languages: https://github.com/Helsinki-NLP/Opus-MT/issues/4

Based on the OpusMT documentation, it seems that there is support for a few languages that are not supported by the other available MT systems: Assamese (as), Breton (br), Kinyarwanda (rw), and Walloon (wa).

This may be a useful list for future initiatives. These language communities may be willing to try an experimental MT system even if the initial quality is very low (compared to having no MT), and to spend time on manual corrections to help make it better.

After the language team offsite conversations, the next steps would be as follows:

  1. Add language models to our OpusMT instance for the languages not supported elsewhere: Assamese (as), Breton (br), Kinyarwanda (rw), and Walloon (wa).
  2. Create an MT client for Content translation to support the integration (labelled as experimental).
  3. Enable the new MT service on testing wikis but keep it disabled on real wikis (until conversations with the communities show their interest in trying it).