Basic service for mapping sections
Closed, Resolved · Public

Description

Summary

For the Section Translation feature of the Content Translation project, we need to develop an API that identifies the sections missing between articles in two languages. The user interface will then use this API to show the missing sections to the user.

Details

As part of the Section Translation process, users pick a section to translate (T241587). To facilitate such selection the tool will show which article sections are missing and which are present in the target language.

In order to support this, we need a service that given an article (Q-ID or page name) and language pair (source, and target) provides the mapping of likely equivalent sections.

The purpose of this ticket is to provide a basic approach that can complement the more advanced approach from the Research team (T224234). This would work as a fall-back and as an initial version until the latter is ready.

Some strategies that can be considered to find the mappings:

  • Translate section titles. Make a fuzzy match of the automatic translation of the source section titles with those in the target language. This is similar to what is done to re-apply formatting and links after those are lost by plain text translation services in Content Translation.
  • Use Content Translation info. For articles translated with Content Translation, use the mapping information from the tool to identify equivalent sections.
  • Inspect section contents and map the linked topics. Extract the articles linked in each section, extract their Q-IDs and check for number of coincidences across source and target sections.
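The third strategy (comparing linked topics) could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the Jaccard score, the 0.3 threshold, and the placeholder Q-IDs are all hypothetical and not taken from any implementation discussed in this task.

```python
from typing import Dict, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of linked Q-IDs that two sections have in common."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def map_sections_by_links(
    source: Dict[str, Set[str]],
    target: Dict[str, Set[str]],
    threshold: float = 0.3,  # hypothetical cut-off, would need tuning
) -> Dict[str, Optional[str]]:
    """For each source section, pick the target section with the highest
    overlap of linked Q-IDs, or None if no overlap reaches the threshold."""
    mapping: Dict[str, Optional[str]] = {}
    for s_title, s_qids in source.items():
        best_title, best_score = None, 0.0
        for t_title, t_qids in target.items():
            score = jaccard(s_qids, t_qids)
            if score > best_score:
                best_title, best_score = t_title, score
        mapping[s_title] = best_title if best_score >= threshold else None
    return mapping


# Placeholder Q-IDs, for illustration only.
source = {"History": {"Q1", "Q2", "Q3"}, "Construction": {"Q10", "Q11"}}
target = {"வரலாறு": {"Q1", "Q2", "Q4"}}
result = map_sections_by_links(source, target)
```

Here "History" shares two of four distinct Q-IDs with "வரலாறு" and is mapped to it, while "Construction" shares none and would be reported as missing.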

Since the more advanced approach may not cover all languages or topics, the basic approach is expected to coexist with it. So it is worth considering how the new approach will be integrated into the system when designing the current one.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 22 2020, 5:41 PM
Pginer-WMF triaged this task as High priority. · Jan 22 2020, 5:41 PM
Pginer-WMF renamed this task from Basic service for section mapping to Basic support for mapping sections. · Jan 23 2020, 9:07 AM
Pginer-WMF renamed this task from Basic support for mapping sections to Basic service for mapping sections. · Jan 23 2020, 10:03 AM
santhosh added a subscriber: diego. · Edited · Feb 24 2020, 6:35 AM

As an experiment, I wrote a program that downloads the entire Content Translation parallel corpus from https://dumps.wikimedia.org/other/contenttranslation/20200214/ and finds all section title translations done so far. It took about 9 hours to parse the huge JSON files, and at the end we have a 31.7 MB SQLite database with this information.

Database: https://people.wikimedia.org/~santhosh/cx-section-titles-aligned.db

I captured language pairs, title pairs, and their frequency of occurrence. Here are the top 25 items sorted in descending order of frequency:

The database has 503100 section titles for 2104 distinct language pairs. Note that we captured h2, h3, h4 headings.
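As a rough sketch of how such a frequency table can be built and queried: the table name, column names, and sample rows below are assumptions made for illustration, and the actual schema of cx-section-titles-aligned.db may differ.

```python
import sqlite3
from collections import Counter

# Hypothetical aligned heading pairs:
# (source_language, target_language, source_title, target_title)
pairs = [
    ("en", "ta", "History", "வரலாறு"),
    ("en", "ta", "History", "வரலாறு"),
    ("en", "es", "History", "Historia"),
    ("en", "ta", "References", "மேற்கோள்கள்"),
]

# Count how often each title pair occurs for each language pair.
counts = Counter(pairs)

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE titles (
           source_language TEXT, target_language TEXT,
           source_title TEXT, target_title TEXT,
           frequency INTEGER)"""
)
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?, ?, ?)",
    [(*key, n) for key, n in counts.items()],
)


def top_pairs(connection: sqlite3.Connection, limit: int = 25):
    """Return the most frequent title pairs, highest frequency first."""
    return connection.execute(
        """SELECT source_language, target_language,
                  source_title, target_title, frequency
           FROM titles
           ORDER BY frequency DESC, source_language, target_language, source_title
           LIMIT ?""",
        (limit,),
    ).fetchall()


top = top_pairs(conn)
```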

Using this database helps us in different ways:

  1. Given an article pair in different languages, we can find sections present in one and not in the other. For example, "History" in en is mapped to "வரலாறு" in Tamil 346 times in our database. This information helps us check whether a History section is present in the Tamil article.
  2. We can also suggest the target section title: "Add வரலாறு section in Tamil" instead of "Add History section in Tamil".
  3. We can allow some fuzziness in matching. For example, if "Archaeology" does not have a mapping to Tamil but "Archaeology and excavations" does, the latter can still help.
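The three uses above can be sketched together. This is a minimal illustration, not the cxserver implementation: the `known` dictionary stands in for a database lookup, the fuzzy step uses Python's `difflib` with an arbitrary 0.5 cutoff, and the Tamil title for "Archaeology and excavations" is a placeholder string rather than a real translation.

```python
import difflib
from typing import Dict, List, Optional, Tuple

# Stand-in for the en -> ta rows of the database (hypothetical content).
known: Dict[str, str] = {
    "History": "வரலாறு",
    "Archaeology and excavations": "<ta title for Archaeology and excavations>",
}


def find_missing(
    source_sections: List[str],
    target_sections: List[str],
    mappings: Dict[str, str],
    cutoff: float = 0.5,  # hypothetical similarity threshold
) -> Tuple[Dict[str, str], Dict[str, Optional[str]]]:
    """Split source sections into those present in the target article and
    those missing, with a suggested target title where one is known."""
    present: Dict[str, str] = {}
    missing: Dict[str, Optional[str]] = {}
    target_set = set(target_sections)
    for title in source_sections:
        key = title if title in mappings else None
        if key is None:
            # Fuzzy fallback: use the closest known source title, if any.
            close = difflib.get_close_matches(title, list(mappings), n=1, cutoff=cutoff)
            key = close[0] if close else None
        suggested = mappings[key] if key is not None else None
        if suggested is not None and suggested in target_set:
            present[title] = suggested
        else:
            missing[title] = suggested  # None means no suggestion available
    return present, missing


present, missing = find_missing(
    ["History", "Archaeology", "Gallery"], ["வரலாறு"], known
)
```

In this example "History" is found in the target article, "Archaeology" is reported missing with a title suggested through the fuzzy match, and "Gallery" is reported missing with no suggestion.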

There are also some limitations:

  1. This database is based only on past translations, so it is always incomplete. We may need additional methods to find missing sections.
  2. It is based on translations up to Feb 14, 2020. We will need a way to update this database.

@diego Will this database help to augment some of the explorations you are doing?

Updated: Corrected the database size

Change 574445 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] API for finding missing sections in an article pair

https://gerrit.wikimedia.org/r/574445

diego added a comment. · Feb 27 2020, 4:48 PM

Hi @santhosh! Yep, this is super useful. I was considering doing something similar to find some ground truth for my approaches.

This database is only based on past translations. So it is always incomplete. We may need additional methods to find missing section

This might be an issue, but it is also true that, by definition, it should cover the most common sections and language pairs.

Change 574445 merged by jenkins-bot:
[mediawiki/services/cxserver@master] API for finding missing sections in an article pair

https://gerrit.wikimedia.org/r/574445

Change 578935 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[operations/deployment-charts@master] WIP: cxserver: Add sectionmapping config for production

https://gerrit.wikimedia.org/r/578935

diego added a comment. · Mar 11 2020, 8:50 PM

As an experiment, I wrote a program that downloads the entire content translation parallel corpus from https://dumps.wikimedia.org/other/contenttranslation/20200214/ and find all section title translation done so far. It took about 9 hours to parse huge jsons files and at the end we have a 31.7 GB Sqlite database with this information.

Database: https://people.wikimedia.org/~santhosh/cx-section-titles-aligned.db

@santhosh that file is 30Mb, where can I get the full 31.7GB file?

Sorry, I wrote it incorrectly. It is MB, not GB. The file I placed there is the same one. Sorry again.

Change 578935 merged by jenkins-bot:
[operations/deployment-charts@master] cxserver: Add sectionmapping config for production

https://gerrit.wikimedia.org/r/578935

Mentioned in SAL (#wikimedia-operations) [2020-03-12T06:14:02Z] <kart_> Updated cxserver to 2020-03-12-041806-production and added sectionmapping db config (T246316, T243430, T202276)

diego added a comment. · Mar 18 2020, 3:23 PM

weekly update:

  • I'm experimenting with the API. The plan is to release a usable version during next week.

The API based on the section mapping database prepared from CX parallel corpus is up and running now: https://cxserver.wikimedia.org/v2/suggest/sections/Sitar/en/ml
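A small helper for building such request URLs might look like the sketch below. The path pattern is inferred from the single example URL above; the helper name and the percent-encoding of the title are assumptions. The response format is not described in this thread, so the sketch stops at building the URL.

```python
from urllib.parse import quote

# Base path taken from the example URL in this thread.
BASE_URL = "https://cxserver.wikimedia.org/v2/suggest/sections"


def section_suggestion_url(title: str, source_lang: str, target_lang: str) -> str:
    """Build the section suggestion URL for an article and a language pair.

    Percent-encoding the title is an assumption; the thread only shows a
    title without spaces or special characters.
    """
    return f"{BASE_URL}/{quote(title, safe='')}/{source_lang}/{target_lang}"


url = section_suggestion_url("Sitar", "en", "ml")
```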

This is great! Thanks, @santhosh.

Some observations and related questions:

  • Section order. I noticed that the order in which sections appear in the article is not preserved when those sections are returned by the API. When listing them in the UI, it makes sense to present them in the original order. Users may pick in any order, but the article order is likely to make sense to them. Do you think the API should keep the order, or should the UI be in charge of reordering?
  • Special sections. There are some sections that we may want to treat differently. For example, it may not make sense to pick the "References" section for translation, since it normally contains automatically generated content based on the citations present in the rest of the article. I think that this is something for the app using this API to consider and filter (e.g., not showing the "References" section), but wanted to check if that makes sense.
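If the client ends up handling both points, the logic could be as small as the sketch below. It assumes the client already has the source article's section titles in article order; the function name and the default exclusion set are hypothetical.

```python
from typing import FrozenSet, List


def order_and_filter(
    suggested: List[str],
    article_order: List[str],
    excluded: FrozenSet[str] = frozenset({"References"}),  # hypothetical default
) -> List[str]:
    """Drop special sections and restore the order used in the source article."""
    position = {title: index for index, title in enumerate(article_order)}
    kept = [title for title in suggested if title not in excluded]
    # Titles not found in the article are placed at the end.
    return sorted(kept, key=lambda title: position.get(title, len(article_order)))


article_order = ["History", "Construction", "Playing", "References"]
suggested = ["References", "Playing", "History"]
ordered = order_and_filter(suggested, article_order)
```

With this input the result lists "History" before "Playing" and omits "References".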

Change 591343 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/extensions/ContentTranslation@master] WIP: Section selector

https://gerrit.wikimedia.org/r/591343

Change 594462 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/extensions/ContentTranslation@master] Refactor: Use model classes

https://gerrit.wikimedia.org/r/594462

Change 591343 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Section selector - basic version

https://gerrit.wikimedia.org/r/591343

Change 594462 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Refactor: Use model classes

https://gerrit.wikimedia.org/r/594462

santhosh updated the task description. · May 28 2020, 4:50 AM

I created two follow up tickets based on Pau's comments:

Moving this ticket from in-progress so that we can work on follow ups for specific improvements.

Jpita added a subscriber: Jpita. · Jun 29 2020, 10:04 AM

Anything I need to do here @santhosh ?

santhosh closed this task as Resolved. · Jul 9 2020, 4:49 AM

No, just FYI now. Marking as resolved.