Improve section mapping by integrating MT
Open, Medium, Public

Description

As part of the exploration on how to improve section mapping (T276212) for Section Translation, one option to explore is the use of the Machine Translation (MT) services available.

MT can provide translations for sections even if they are not in the database of previously translated sections. Since these requests introduce a delay, it is worth considering when it is best to make them.

Considering that not all languages have MT available, we need to make sure that while MT improves the mapping process we don't add a strong dependency on this approach, making it possible to still get mappings for languages without MT support.

Event Timeline

Pginer-WMF triaged this task as Medium priority. Mar 2 2021, 11:47 AM
Pginer-WMF created this task.

A rough outline of MT integration:

  • Machine translation typically involves API calls, which are costly (time, money). Since the use case here is to translate section titles and check whether they exist among the target language's section titles, we can machine translate highly frequent section titles offline and keep those translations in a database (the same database we use for section title alignment can be used).
    • Find the top n section titles in English, machine translate them into the languages of interest, and insert the source-target section title pairs into the section alignment database.
      • This need not be limited to English; it can also be done for other widely used source languages, for example Spanish.
  • If we use the same section title alignment database, we can give preference to human translations learned from the CX corpus when they exist, and fall back to MT otherwise.
  • Evaluate the coverage: does a language have title pairs for the n most frequent section titles? If not, explore additional options:
    • Can we do MT on the fly? If we see no alignment in the database, can we make an MT request from cxserver at run time?
    • Can we integrate machine-learning-based section recommendation at this point? Or should it run in parallel with the above processing?
  • Can we do MT on the fly? If we see no alignment in the database, can we make an MT request from cxserver at run time?
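A minimal sketch of two points from the outline: preferring human translations over MT when both exist, and measuring coverage for the most frequent titles. This is not the cxserver implementation. The in-memory dictionary, the origin labels "cx" and "mt", and the function names `lookup` and `coverage` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the section alignment database.
# Key: (source_lang, target_lang, source_title)
# Value: target titles keyed by origin ("cx" = human translation from
# the CX corpus, "mt" = machine translation). Labels are assumptions.
Alignments = Dict[Tuple[str, str, str], Dict[str, str]]

# Lower number means higher preference.
PREFERENCE = {"cx": 0, "mt": 1}


def lookup(db: Alignments, source_lang: str, target_lang: str,
           title: str) -> Optional[str]:
    """Return the preferred target title, or None if there is no alignment."""
    candidates = db.get((source_lang, target_lang, title), {})
    if not candidates:
        return None
    # Unknown origins sort after every known origin.
    best = min(candidates, key=lambda o: PREFERENCE.get(o, len(PREFERENCE)))
    return candidates[best]


def coverage(db: Alignments, source_lang: str, target_lang: str,
             frequent_titles: List[str]) -> float:
    """Fraction of the given frequent titles that have any alignment."""
    if not frequent_titles:
        return 0.0
    found = sum(
        1 for t in frequent_titles
        if lookup(db, source_lang, target_lang, t) is not None
    )
    return found / len(frequent_titles)


sample_db: Alignments = {
    ("en", "es", "History"): {"mt": "Historia (MT)", "cx": "Historia"},
    ("en", "es", "References"): {"mt": "Referencias"},
}
```

With this layout, `lookup(sample_db, "en", "es", "History")` picks the "cx" entry, while "References" falls back to the "mt" entry. A low `coverage` value for a language pair would be the signal to explore the additional options listed above.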

Another relevant aspect to consider for on-the-fly MT is whether the missing entry should also be added to the database for future access (similar to a cache).
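One way the cache-like behaviour could look, as a rough sketch only: look in the alignment database first, request MT only on a miss, and write a usable result back so the next lookup skips the MT call. The function name `find_target_title` and the `translate` callable are hypothetical; `translate` stands in for whatever MT request cxserver would make and is assumed to return None when no MT is available for the language pair.

```python
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (source_lang, target_lang, source_title)
# Hypothetical MT client: returns a translation, or None when the
# language pair has no MT support.
Translate = Callable[[str, str, str], Optional[str]]


def find_target_title(db: Dict[Key, str], translate: Translate,
                      source_lang: str, target_lang: str,
                      title: str) -> Optional[str]:
    """Look up a title; on a miss, try MT and cache a usable result."""
    key = (source_lang, target_lang, title)
    if key in db:
        return db[key]  # Known alignment, no MT request needed.

    translated = translate(source_lang, target_lang, title)
    # No MT for this pair, or MT echoed the source text: nothing to
    # store. The caller still gets an answer (None) without MT.
    if not translated or translated == title:
        return None

    db[key] = translated  # Write back so future lookups are local.
    return translated
```

Because a missing translation simply yields None, languages without MT support still go through the same lookup path and keep whatever mappings the database already has.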

Change 680291 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Use machine translation to populate section alignment db

https://gerrit.wikimedia.org/r/680291

Added 11678 new section title mappings for frequently used section titles in 200 languages.

Frequently used titles were extracted from the CX corpus, and the top 200 titles were used for machine translation.

Before machine translation, they were checked against the existing mappings in our section title alignment database.

The updated database now has 692376 alignment entries.
Previously: 680698.

Theoretically, 200 frequently used titles in English, translated into 199 other languages, should result in 39800 new entries. But we don't have MT service providers for all these languages. While running the script, it was observed that MT sometimes returns a translation identical to the source text, probably due to the nature of the word (for example, proper nouns) or to quality issues in the MT service. Entries whose source section title was the same as the target section title were not added to the database.
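The filtering described above can be sketched roughly as follows. This is an illustration under assumptions, not the script from the patch: `new_mt_entries` and the `translate` callable are hypothetical names, and `translate` is assumed to return None for language pairs without an MT provider. The arithmetic at the end only restates the numbers quoted in this comment.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

Key = Tuple[str, str, str]         # (source_lang, target_lang, source_title)
Entry = Tuple[str, str, str, str]  # key fields plus the target title


def new_mt_entries(titles: Iterable[str], target_langs: Iterable[str],
                   existing: Set[Key],
                   translate: Callable[[str, str, str], Optional[str]],
                   source_lang: str = "en") -> List[Entry]:
    """Collect MT-based alignments worth inserting into the database."""
    titles = list(titles)
    added: List[Entry] = []
    for lang in target_langs:
        for title in titles:
            # Skip titles that already have a mapping for this pair.
            if (source_lang, lang, title) in existing:
                continue
            translated = translate(source_lang, lang, title)
            # Skip pairs without MT and translations equal to the source.
            if translated is None or translated == title:
                continue
            added.append((source_lang, lang, title, translated))
    return added


# Figures quoted in the comment: the theoretical maximum versus the
# number of entries actually added.
upper_bound = 200 * 199     # 39800
observed = 692376 - 680698  # 11678
```

The gap between `upper_bound` and `observed` comes from the three skip conditions in the loop: existing mappings, languages without an MT provider, and translations identical to the source title.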

If a title is still not found, we need to explore additional options such as on-the-fly machine translation. This part is not done in the current work (https://gerrit.wikimedia.org/r/680291).

Change 680291 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Use machine translation to populate section alignment db

https://gerrit.wikimedia.org/r/680291