
Improve section mapping by integrating MT
Open, Medium, Public

Description

As part of the exploration on how to improve section mapping (T276212) for Section Translation, one option to explore is the use of the Machine Translation (MT) services available.

MT can provide translations for sections even if those are not in the database of previously translated sections. Since these requests introduce a delay, it is worth considering when the best time to apply them is.

Considering that not all languages have MT available, we need to make sure that while MT improves the mapping process we don't add a strong dependency on it, so that it is still possible to get mappings for languages without MT support.

Event Timeline

Pginer-WMF triaged this task as Medium priority.Mar 2 2021, 11:47 AM
Pginer-WMF created this task.

A rough outline of MT integration:

  • Machine translation typically involves API calls, which are costly (time, money). Since the use case here is to translate section titles and check whether they exist among the target language's section titles, we can consider doing machine translation of highly frequent section titles offline and keeping those translations in a database (the same database we use for section title alignment can be used).
    • Find the top n section titles present in English, machine translate them to the languages of interest, and insert the source-target pairs of section titles into the section alignment database.
      • This need not be limited to English; it can be done for the most used source languages too, for example, Spanish.
  • If we use the same section title alignment database, we can give preference to human translations learned from the CX corpus if they exist, and fall back to MT otherwise.
  • Evaluate the coverage: does a language have title pairs for the n most frequent section titles? If not, explore additional options:
    • Can we do MT on the fly? If we see no alignment in the database, can we make an MT request from cxserver at run time?
    • Can we integrate machine-learning-based section recommendation at this point? Or should it run in parallel with the above processing?
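The lookup order in the outline above can be sketched as follows. This is a minimal illustration, not cxserver code; the in-memory `ALIGNMENTS` table and the `origin` field are hypothetical stand-ins for the section title alignment database and its provenance information.

```python
# Hypothetical alignment table keyed by (source_lang, target_lang, title).
# The "origin" field distinguishes human translations (learned from the
# CX corpus) from machine translations added offline.
ALIGNMENTS = {
    ("en", "es", "History"): [
        {"target": "Historia", "origin": "human"},
        {"target": "Antecedentes", "origin": "mt"},
    ],
}

def map_section_title(source_lang, target_lang, title):
    """Return the best-known target title, preferring human translations."""
    candidates = ALIGNMENTS.get((source_lang, target_lang, title), [])
    # Prefer alignments learned from the CX corpus over offline MT entries.
    for candidate in candidates:
        if candidate["origin"] == "human":
            return candidate["target"]
    return candidates[0]["target"] if candidates else None

print(map_section_title("en", "es", "History"))  # Historia
```

A `None` result is where the additional options above (on-the-fly MT, section recommendation) would kick in.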

Another relevant aspect to consider for such an approach is whether the missing entry is also incorporated into the database for future access (similar to a cache).

Change 680291 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Use machine translation to populate section alignment db

https://gerrit.wikimedia.org/r/680291

Added 11678 new section title mappings for frequently used section titles in 200 languages.

Frequently used titles were extracted from the CX corpus, and the top 200 titles were used for machine translation.

Before machine translation, they were checked against existing mappings in our section title alignment database.

The updated database now has 692376 alignment entries.
Previously: 680698.

Theoretically, 200 frequently used titles in English, translated to 199 other languages, should result in 39800 new entries. But we don't have MT service providers for all these languages. While running the script, it was observed that MT sometimes gives a translation identical to the source text, probably due to the nature of the text (for example, proper nouns) or due to quality issues of the MT service. Entries with the source section title identical to the target section title were not added to the database.
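The offline population step described above can be sketched like this. It is an illustration only: `machine_translate` is a stand-in for an MT request (e.g. via cxserver), and `existing_mappings` represents title pairs already learned from human translations.

```python
def populate_alignments(top_titles, target_langs, existing_mappings,
                        machine_translate):
    """Return new (source_title, target_lang, translation) entries to insert."""
    new_entries = []
    for title in top_titles:
        for lang in target_langs:
            if (title, lang) in existing_mappings:
                continue  # already mapped from human translations
            translation = machine_translate(title, lang)
            # Skip when MT is unavailable or echoes the source text
            # (e.g. proper nouns or low-quality MT output).
            if translation is None or translation == title:
                continue
            new_entries.append((title, lang, translation))
    return new_entries
```

The echo check is why the final entry count falls short of the theoretical 39800: languages without an MT provider and identical source/target pairs are both filtered out.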

If a title is still not found, we need to explore additional options such as on-the-fly machine translation. That part is not done in the current work (https://gerrit.wikimedia.org/r/680291).

Change 680291 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Use machine translation to populate section alignment db

https://gerrit.wikimedia.org/r/680291

Change 693842 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2021-05-15-034540-production

https://gerrit.wikimedia.org/r/693842

Change 693842 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2021-05-15-034540-production

https://gerrit.wikimedia.org/r/693842

Mentioned in SAL (#wikimedia-operations) [2021-05-25T06:16:46Z] <kart_> Updated cxserver to 2021-05-15-034540-production (T276214)

In what way can I test this besides:

  • check that all suggested articles have the right sections to translate
  • find a non-popular article, search for it and check that it has the right sections to translate

?


Maybe testing efforts should be focused on creating a list of examples to use as a benchmark for the final completion of the ticket. That is, when Machine Translation is used as a fallback beyond the entries added manually so far.

Collecting examples of sections that are not the most common ones and are not properly mapped right now, but that we expect to be mapped once MT is applied. For example, the Oasis (band) article in English has a section named "Legal battles over songwriter credits". The Spanish version has a section named "Batallas legales sobre créditos de composición". I'd expect those to be mapped; however, when trying in SX, that section does not appear either as available or missing:

[Screenshot: Special:ContentTranslation on test.m.wikipedia.org for Oasis (band), en→es with sx=true (PNG, 750 px wide, 187 KB)]

Having such a list of examples in advance and checking them after further improvements could be useful to validate the progress.

For example, the Oasis (band) article in English has a section named "Legal battles over songwriter credits". The Spanish version has a section named "Batallas legales sobre créditos de composición". I'd expect those to be mapped; however, when trying in SX, that section does not appear either as available or missing.

This is a good example, but cases like this are not covered in my iteration. As I explained above, I did a frequency analysis and filled the database with the 200 most frequently occurring titles for all languages possible. For the above example, I would expect a frequency of occurrence of 1. One way to address this is to use MT engines at the time of the section suggestion request, but I am not convinced about its efficiency yet; it needs a real cost-benefit analysis. To illustrate, let us take the same example.

Google translate: Batallas legales por los créditos de los compositores
Actual title in the wiki: Batallas legales sobre créditos de composición

Here, our on-the-fly MT usage will fail to identify that both are section titles that can be mapped, unless we devise a clever fuzzy matching system.

Since every time we are mapping section titles in just two languages, it makes sense to me to:

  1. check section alignment db to find matched sections like we currently do
  2. then for all unmapped sections, compare machine translations of the source section titles to the target section titles and find matches that meet a similarity threshold. In the above example Santhosh provided, the Google translation and the actual title in the wiki are (let's say) 95% similar, so we could consider that a match if our threshold is, for example, 90%. This step can be made more sophisticated by taking into account the length of the titles (short titles might falsely lead to high similarity scores and thus false matches).
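The fuzzy matching in step 2 could be sketched with Python's standard `difflib` similarity ratio. The threshold and minimum-length guard are illustrative assumptions, not agreed values; a production version might use a better string metric, and the threshold would need tuning against real title pairs like the Oasis example above.

```python
from difflib import SequenceMatcher

def fuzzy_match(mt_title, target_titles, threshold=0.9, min_len=10):
    """Return the target title most similar to the MT output, if any."""
    best, best_score = None, 0.0
    for candidate in target_titles:
        # Guard against short titles producing spuriously high scores.
        if len(mt_title) < min_len or len(candidate) < min_len:
            continue
        score = SequenceMatcher(None, mt_title.lower(),
                                candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

Note that a strict threshold may still reject valid pairs whose translations diverge substantially in wording, so the choice of threshold is exactly the cost-benefit question raised above.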

This is a good example, but cases like this are not covered in my iteration.

Yes. The ideal would have been to prepare a similar set of examples before this was implemented, but we cannot go back in time. So I proposed to do this before the next set of improvements, to be better equipped by then.

As I explained above, I did a frequency analysis and filled the database with the 200 most frequently occurring titles for all languages possible. For the above example, I would expect a frequency of occurrence of 1.

I think the intervention you made is a good improvement. The challenge now is to find cases that were failing before and have been solved. That's why I suggested that a set of examples for the future could be helpful, but I'm open to other approaches.

Google translate: Batallas legales por los créditos de los compositores
Actual title in the wiki: Batallas legales sobre créditos de composición

Here, our on-the-fly MT usage will fail to identify that both are section titles that can be mapped, unless we devise a clever fuzzy matching system.

Makes sense, but that illustrates the usefulness of this kind of benchmark. If we have a diverse set of examples, we can see which cases get solved with the next iteration (MT) and whether another one may be needed (fuzzy MT?).

One way to address this is to use MT engines at the time of the section suggestion request, but I am not convinced about its efficiency yet; it needs a real cost-benefit analysis. To illustrate, let us take the same example.

One aspect that could help is to log how many times a section for which we have no data is encountered. In any case, I'd expect that:

  • If this happens very rarely, the impact of having an on-the-fly MT check with caching (i.e., adding the missing section to the database once it is resolved) would also be small (only happening in a few rare cases).
  • If this happens very frequently, then the cost of MT may be worth it to avoid lots of sections not being mapped.
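The logging-plus-caching idea above can be sketched as follows. All names here are illustrative (this is not a cxserver API): the miss counter provides the frequency data for the cost-benefit decision, and resolved titles are written back to the database so each miss is paid for only once.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("section-mapping")

class OnTheFlyMapper:
    def __init__(self, alignment_db, machine_translate):
        self.db = alignment_db        # dict: (title, lang) -> target title
        self.mt = machine_translate   # callable: (title, lang) -> str or None
        self.misses = 0               # how often we had no stored data

    def lookup(self, title, lang):
        key = (title, lang)
        if key in self.db:
            return self.db[key]
        self.misses += 1
        log.info("No alignment for %r -> %s (miss #%d)",
                 title, lang, self.misses)
        translation = self.mt(title, lang)
        if translation and translation != title:
            self.db[key] = translation  # cache for future requests
        return translation
```

If the logged miss rate turns out to be low, this path stays cheap; if it is high, that is the signal that paying for real-time MT (or a better offline approach) is justified.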

If we expect the approach based on the alignment API (T270485), or any other approach, to improve the mapping in a more cost-effective way, we can wait for that to be applied, or set the priorities so that the real-time MT approach is only used after those.

I created a sub-task to capture the completed work (T283815). I'm moving this task (more general about MT support) to the backlog for this track of work to continue. As a next step, we'll compile examples of current issues with section mapping to be used as a benchmark (T283817).