
Research support for cross-wiki content propagation
Closed, Resolved · Public

Description

During the 2019-20 fiscal year the Language team plans to support Section translation (check the General concept and the initial design ideas) as part of the Translation Boost initiative. That is, supporting users in expanding existing articles by translating a new section from another language. For example, we want to make it easy to expand the "Ukulele" article in Tagalog by translating the "history" section from English.

This ticket provides an overview of current and future research work that could help in this context. The areas below describe the support needed, how such support would help, and fallback approaches that can be applied while the necessary capabilities are not yet available.

Section mappings

Sections of an article represent relevant aspects of a topic. For a given article and language pair, we want to show which sections are present and which are missing in each version, so that users can select which aspects to expand by translating from another language. The example below shows that "history" is a section present in English but missing in Tagalog for the Ukulele article:

[Mockup: cx-dash-section-selector copy 4.png]
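For illustration, a minimal sketch of how section presence could be compared for one article and language pair using the MediaWiki parse API. Matching by literal heading text is only a placeholder here; headings rarely match verbatim across languages, which is why the actual mapping would rely on the Research team's section-alignment work.

```
import requests

# Sketch only: compare which sections exist in two language versions of an
# article via the MediaWiki parse API. Literal heading comparison is just a
# placeholder for a real section-alignment model.
def get_sections(lang, title):
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "parse", "page": title,
                "prop": "sections", "format": "json"},
    ).json()
    return [s["line"] for s in resp["parse"]["sections"]]

source_sections = get_sections("en", "Ukulele")
target_sections = set(get_sections("tl", "Ukulele"))
missing_in_target = [s for s in source_sections if s not in target_sections]
print(missing_in_target)
```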

Status and fallbacks

The Research team has already worked on identifying relevant missing sections that users can add to an article, based on those present in other languages. Initial discussions suggest that this work can be repurposed to identify sections that exist in one given language and are missing in another.

Until this approach is available, simpler (although more limiting) approaches can be used so that this work does not become a blocker:

  • Focus on target articles with no sections at all, so that any section present in another language is guaranteed to be missing.
  • Focus on articles that were created with Content translation and where no additional sections were added after they were published. In this way, the section mapping is already available (a minimal check for this case is sketched after this list).
  • Let users check (and report) whether the page already contains a given section.
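As an illustration of the second fallback, a minimal sketch that checks whether a page was created with Content translation by looking at the change tags of its oldest revision. The tag name "contenttranslation" is an assumption here, and checking that no sections were added afterwards is left out:

```
import requests

def created_with_cx(lang, title):
    # Assumption: CX-created pages carry the "contenttranslation" change tag
    # on their first revision.
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "revisions", "titles": title,
                "rvlimit": 1, "rvdir": "newer",  # oldest revision first
                "rvprop": "tags", "format": "json"},
    ).json()
    page = next(iter(resp["query"]["pages"].values()))
    first_rev = page.get("revisions", [{}])[0]
    return "contenttranslation" in first_rev.get("tags", [])

print(created_with_cx("tl", "Ukulele"))
```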

Suggestions for sections

In addition to letting users pick a specific article to expand with a new section, we want to surface suggestions that users can translate. That is, we want to surface the opportunity to translate the history section of the Ukulele article in the same way we currently surface opportunities to translate new articles in Content translation. The idea is illustrated below (note how the suggestions list includes two parts: "new pages" and "expand with new sections"):

[Mockup: cx-dash-suggest-sections.png]

Status and fallbacks
Currently, only articles missing in the target language are surfaced in the recommendation system.

A possible fallback would be to suggest missing sections for:

  • translations that the user has created previously (more targeted at encouraging the user to continue their work than at helping them discover new topics)
  • articles that are featured in the source language but present in the target language as non-featured, as an indicator of the potential for expansion (see the sketch after this list)
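A sketch of the second idea, using Wikidata badges to spot articles that are featured in the source language but not in the target language. The badge item Q17437796 (featured article) and the presence of badge data in the sitelinks output are assumptions here:

```
import requests

FEATURED_BADGE = "Q17437796"  # assumed item for the "featured article" badge

def expansion_candidate(title, source_wiki="enwiki", target_wiki="tlwiki"):
    # Fetch the item's sitelinks (with badges) from Wikidata.
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "sites": source_wiki,
                "titles": title, "props": "sitelinks", "format": "json"},
    ).json()
    entity = next(iter(resp["entities"].values()))
    sitelinks = entity.get("sitelinks", {})
    source = sitelinks.get(source_wiki, {})
    target = sitelinks.get(target_wiki)
    # Featured in the source wiki, present but not featured in the target wiki.
    return (target is not None
            and FEATURED_BADGE in source.get("badges", [])
            and FEATURED_BADGE not in target.get("badges", []))

print(expansion_candidate("Ukulele"))
```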

Other related aspects

There are two additional aspects that may intersect with the work above and that Research can help with:

  • Section relevance. How we determine which sections are most relevant to translate. This can inform how we present sections to translate, suggest them, or notify users that a new section worth translating has been created.
  • Custom suggestions. We are exploring how the current suggestions can focus on a particular topic area. That is, exposing a catalog of topics (Geography, Maths, etc.) for users to select from. This is something that could be supported in the current recommendation system by using a seed article, but a similar need would arise for suggested sections.

Event Timeline

@Pginer-WMF is the task description updated? I'd like to do a pass over this task and see what we can potentially pick up starting January 2020.

Thanks for pinging, @leila. I reviewed and updated the description. In short: we are starting the work on Section translation and will benefit from collaborating with Research. Currently, we are evaluating the initial designs in order to start implementation during the next quarter (January-March). The work from the Research team related to section mapping will be very useful to (a) let users know which sections are available/missing for a given article and language, and (b) surface missing sections in the suggestions for users to translate. Looking forward to starting to collaborate in these areas.

@Pginer-WMF this is very helpful. Let me talk with a couple of folks in the team and get back to you.

Hi @Pginer-WMF,
Have you already had a look at our Section Recommendation demo app? It currently works for 6 languages. Expanding it, and especially maintaining it for many languages, could be complex; however, a simplified version of that system using a dump approach instead of an API (like we did with the template parameter alignment) could be feasible.
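For illustration, a rough sketch of what the dump-based route could look like, assuming a pages-articles XML dump and the mwxml / mwparserfromhell libraries (this is not the actual Section Recommendation pipeline):

```
import bz2
import mwxml
import mwparserfromhell

# Collect section headings per article from a pages-articles dump; these
# per-language tables could then be aligned offline instead of via an API.
dump = mwxml.Dump.from_file(bz2.open("tlwiki-latest-pages-articles.xml.bz2"))
sections_by_title = {}
for page in dump:
    if page.namespace != 0:  # main namespace only
        continue
    for revision in page:  # pages-articles dumps have one revision per page
        wikicode = mwparserfromhell.parse(revision.text or "")
        sections_by_title[page.title] = [str(h.title).strip()
                                         for h in wikicode.filter_headings()]
```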

Yes. I was aware of some of the work done around section recommendation. I don't know much about the internals, so your input on how to make this scale to a larger set of languages is very useful.

For the initial stages it is ok to start with our set of target languages. Regarding the dump approach, one of the challenges may be keeping the dumps up to date, since articles seem more likely to change than templates (where we used the dump approach previously). In any case, we can discuss the specific details as we plan the work in this area.

diego triaged this task as High priority.

Update from last two weeks:

  • Created a cloud instance to host the API.
  • Working on simplifying the previous section alignment model.
  • Main limitations/challenges are in terms of resources (RAM and HDD) to make this approach work for many languages.

Is there a guide I can read for contributing, or is this for the employed team only? I've been working on something tangentially similar at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(idea_lab)#Topic_Similarities_Across_Language_Editions_%28Bot%29 but it seems that this direction has more traction.

Hi @Theory42
I wouldn't say "employed team only"; I'll share all the code I'm creating for this, but currently I can't think of tasks where I need help. Please feel free to contribute to the repo previously mentioned, and tell me if you need my help.

Hey @elukey, for this task I need to download at least 50 language models, each of them around 8G, so I'll use around 400G. I'll do my best to make this work with that data on HDFS, but to start I need to have it on a local machine. I'm now using stat1007 for my experiments. Is it ok if I store the models there temporarily?

Hey Diego, as we have discussed many times in the past, stat1007 is usually crowded and there are 2/3 other nodes with terabytes of free space that are better for your use case. As a result, since you are using ~700GB of space on stat1007, its /srv partition is almost full. It is not a big issue at the moment since there is some space left, but please next time do a quick check on the free space before adding files :) If you need to keep files on stat1007 for a few days it is fine, otherwise I'd ask you to move to, say, stat1004/5 (on those you can keep local files for longer).

As a general rule, on stat boxes it is ok to store data for these tests (notebooks are the only problem); if it is better for you, it is also fine to skip pushing to HDFS when the files are temporary. The only thing we ask is to check free space to make sure that others can work on the same hosts.

Edit: I checked the metrics for space used on stat1007; it increased overnight, but not by as much as I expected. Diego, are the 400G to be added on top of the 700 that you already have in there? If so, please don't proceed; let's chat about how to shuffle files around first.

Edit2: I realized that we don't have a suggested/canonical way in the docs to check disk space usage, so I added some notes to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibilities. I will also send an email to the analytics mailing list :)
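For example, one simple way to check free space from Python before copying large files (the canonical commands documented on the wiki page above may differ):

```
import shutil

# Check how full /srv is before adding large model files.
total, used, free = shutil.disk_usage("/srv")
print(f"/srv: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB "
      f"({used / total:.0%} used)")
```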

@elukey I've deleted 120Gb. Moved back to 580G :)

weekly update:

  • Created the alignment matrix for 40 languages (the top ranked by number of translations with the CX tool), covering all possible pairs, C(40,2) (see the sketch below)
  • Extracted sections in all those 40 languages

What is next:

  • Create aligned vector representations for all sections, for all combinations.
  • Implement the API
  • Test a simpler approach using LASER.
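
For reference, a tiny sketch of the pair enumeration mentioned above; the language list is hypothetical, but for 40 languages C(40,2) gives 780 unordered pairs:

```
from itertools import combinations

languages = ["en", "es", "fr", "ru", "ar", "tl"]  # ...up to the 40 CX languages
pairs = list(combinations(languages, 2))  # all unordered language pairs
print(len(pairs))  # 40 * 39 / 2 = 780 for the full list
```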

Thanks! A little bit better, but we are still at 92% of disk space usage. Moving files now is probably a pain, but do you have an estimated time before the space used can be reduced further? (Just to understand how to shuffle things around during the next days/weeks.)

I'll need approximately 3 weeks to finish this.

Weekly update:

  • Testing the alignments.
  • Testing LASER.

Weekly update:

  • Comparing results with the ones created from the parallel corpora of the CX tool (T243430#5910987)

Weekly update:

  • Working on the API.

Weekly update:

  • The Language team has released an API based on the translation corpus. Their approach is simpler than the multilingual word-embeddings solution, and the quality is similar.