Store translations of frequent section titles
Open, Medium, Public

Description

There are section titles that appear in a great number of Wikipedia articles: "Biography", "Early life", "Bibliography", "External links", "References", "History", "Legacy", etc.

When machine translation is not available (and even if it is) these section titles can be auto-translated by the CX software. The translations can be stored in the usual i18n JSON files.
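
A minimal sketch of how such a lookup could work, not how CX actually does it. The message keys (`cx-section-title-…`), the inline JSON strings standing in for `en.json` and `fr.json`, and the function names are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical stand-ins for two i18n JSON files (en.json and fr.json).
# The message keys are invented for this sketch.
EN_JSON = """{
    "cx-section-title-biography": "Biography",
    "cx-section-title-external-links": "External links"
}"""
FR_JSON = """{
    "cx-section-title-biography": "Biographie",
    "cx-section-title-external-links": "Liens externes"
}"""


def build_title_map(source_json, target_json):
    """Map each source-language title to its target-language title via shared keys."""
    source = json.loads(source_json)
    target = json.loads(target_json)
    return {
        title.casefold(): target[key]
        for key, title in source.items()
        if key in target
    }


def translate_title(title, title_map):
    """Return the stored translation, or None if the title is not a known frequent title."""
    return title_map.get(title.strip().casefold())


TITLE_MAP = build_title_map(EN_JSON, FR_JSON)
print(translate_title("External links", TITLE_MAP))
```

Returning None for unknown titles leaves room to fall back to machine translation or to leave the heading empty for the translator.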

This should save translators a few seconds on each of these sections.

(This feature was suggested by User:Chimel31 at https://meta.wikimedia.org/wiki/Research:Increasing_article_coverage .)

Event Timeline

Amire80 raised the priority of this task to Medium.
Amire80 updated the task description.
Amire80 subscribed.
Amire80 set Security to None.

This is also useful for Wikivoyage.

However, I don't think it's a good idea to hardcode a list of frequent section titles.

What do you mean by "hardcode"?

The way I see it, the headings will be added by default if any text is written in the following paragraph, and the translator will be able to edit them just like any other text.

How can we get the list of frequent section titles?

For a precise number we could analyze dumps, although it would probably take a very long time given that they are so huge in some languages.

Or maybe @ssastry and @cscott have a way to query Parsoid data quickly?

Of course, we could start from some intuitive ones: "Biography", "History", "Early life", "Personal life", "Awards", "Bibliography", "External links", "References", "Legacy", "Death", "Geography", "In popular culture", etc.

Is this the sort of thing you're looking for? It takes a few minutes to generate with something like:

find /public/dumps/public -path '*/201506*/*pages-articles.xml.bz2' -print0 | xargs -0 -n 1 -I '{}' jsub -cwd LC_COLLATE=C bzgrep -HE '^==' {} | sort | uniq -c | sort -nr | head -1000 >> frequent-headers.txt

Result: F191888.
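
For a single dump, the same kind of count can be sketched in Python. This is only an illustration and not the script that produced F191888: the heading regular expression, the function names, and the sample lines are assumptions. It treats any line of the form `== Title ==` (two to six equals signs on each side) as a section heading.

```python
import bz2
import re
from collections import Counter

# Assumed heading pattern: the same run of 2 to 6 "=" on both sides of the title.
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def count_headings(lines):
    """Count section titles in an iterable of wikitext lines."""
    counts = Counter()
    for line in lines:
        match = HEADING_RE.match(line.rstrip("\n"))
        if match:
            counts[match.group(2)] += 1
    return counts


def count_headings_in_dump(path):
    """Stream a pages-articles.xml.bz2 file line by line without unpacking it to disk."""
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        return count_headings(dump)


# Small in-memory sample standing in for real dump content.
sample = [
    "== History ==\n",
    "Some text.\n",
    "=== Early life ===\n",
    "== History ==\n",
    "==References==\n",
]
for title, count in count_headings(sample).most_common(3):
    print(count, title)
```

Calling `count_headings_in_dump(path).most_common(1000)` on one language's dump would give a list comparable in shape to the one above, counted per wiki rather than per file line.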

Pginer-WMF subscribed.

As part of the upcoming work on Section translation, we may have access to section mapping data, which may enable us to translate sections in this way.