
Content Translation should suggest automatic translations for common heading names
Open, Medium · Public

Description

I discussed this once at Wikitech-L, but somehow never filed a Phab task (or maybe I did file it, but I cannot find it now; please mark this one as a dupe if you can). I recalled this thanks to the somewhat related work that the Research people are doing: T182211: Develop a standalone classifier for section translation (alignment) across languages.

There are many common section headings in Wikimedia projects: Biography, Early life, Filmography, References, Etymology, Understanding (Wikivoyage), and several dozen (or hundreds of) others. They are common in many languages, and a list could be compiled and used to suggest automatic translations. Such translations may be better than machine translation from Apertium or Yandex, because they'll be in the right context and based on the wiki editors' existing experience.
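A minimal sketch of the idea, assuming a hand-compiled lookup table that is consulted before falling back to machine translation. The table `COMMON_HEADINGS`, the function `suggest_heading`, and the `machine_translate` callback are all hypothetical names for illustration, not part of Content Translation, and the entries are sample values only:

```python
from typing import Callable, Dict, Optional

# Hypothetical table: normalized source heading -> {target language code: heading}.
# The entries are illustrative samples, not data extracted from any wiki.
COMMON_HEADINGS: Dict[str, Dict[str, str]] = {
    "references": {"es": "Referencias", "fr": "Références"},
    "biography": {"es": "Biografía", "fr": "Biographie"},
}


def suggest_heading(
    heading: str,
    target_lang: str,
    machine_translate: Optional[Callable[[str, str], str]] = None,
) -> Optional[str]:
    """Suggest a translation for a section heading.

    Prefer the community-derived table; fall back to the supplied
    machine translation callback (an assumption, e.g. a wrapper around
    an MT service) only when the heading is not in the table.
    """
    key = heading.strip().casefold()
    known = COMMON_HEADINGS.get(key, {}).get(target_lang)
    if known is not None:
        return known
    if machine_translate is not None:
        return machine_translate(heading, target_lang)
    return None
```

Looking up the table first means a known heading always gets the wording editors already use, and the machine translation service is only asked about headings the list does not cover.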

Event Timeline

Amire80 created this task. Feb 23 2018, 1:59 PM
Restricted Application added a subscriber: Aklapper. Feb 23 2018, 1:59 PM

@Amire80 I like this task. A few thoughts:

  • We'd love to work with you if you decide to pick this up. I expect T182211 to be in a good place in a couple of months for us to be able to surface recommendations for candidate sections.
  • Some learnings from what we have done so far:
    • Section title recommendations can help make the most common headings more consistent. I hypothesize that readers can comprehend the content better if there is some agreed-upon general structure for the way articles are written. (@Capt_Swing do we have any evidence that what I hypothesize is correct?) In English Wikipedia, ~1000 section titles account for 70% of all section title usage.
    • We need to make sure the tools and services make it clear that serendipity is key and we stay away from encouraging editors to make Wikipedia language editions homogeneous. I know you know this, it's mostly a self-reminder for me. ;)
    • There is a long tail of sections that are used only once in a language such as English. We need to dig deeper into this long tail and make sense of it. I hypothesize that while some of them are good sections that we need to surface for reuse, others are not something the algorithms should encourage more of.
    • Looking at different languages, we learned that in Arabic, for example, the definite article "ال" is used in some section titles and not in others. Unless I'm missing something, this creates unnecessary inconsistency in many cases.
    • We should figure out what to do with synonyms such as Film and Movie. One straightforward option is to surface such synonyms to each language community and encourage them to choose one versus the other, especially if Capt_Swing or others know of evidence that shows consistency can help with comprehension or learning.

I'm sure this task requires engineering and design work, too. This is just my limited-view input. :)

leila added a subscriber: diego.
Pginer-WMF triaged this task as Medium priority. Jul 20 2018, 9:59 AM
Pginer-WMF moved this task from Needs Triage to Enhancements on the ContentTranslation board.
leila edited projects, added Research-Backlog; removed Research. Dec 9 2019, 9:02 PM

Moving this to Research-Backlog. @Amire80 it would be helpful to have an update from you about the status of this task. Is this something you're still interested in? Has anything changed on your end since we last visited it?

It would be a good feature to have, but it's better to ask @Pginer-WMF about planning.

As part of the work on Section Translation, we'll work on mapping sections between articles across languages. The main purpose is to be able to surface that "the 'history' section is present in the English version of 'Ukulele' but missing in the Tagalog version". However, if supporting such mapping makes it easy to also support section title translations, we can consider that too.
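As a rough illustration of how such a mapping could serve both purposes, here is a sketch that reports source sections with no counterpart in the target article. The function `missing_sections` and the `title_map` dictionary are hypothetical, and treating unmapped titles as missing is an assumption made for this sketch, not a description of how Section Translation works:

```python
from typing import Dict, List


def missing_sections(
    source_sections: List[str],
    target_sections: List[str],
    title_map: Dict[str, str],
) -> List[str]:
    """Return source section titles that have no counterpart in the target.

    title_map maps a case-folded source title to its usual title in the
    target language. A source title with no entry in title_map is treated
    as missing (an assumption for this sketch).
    """
    present = {title.casefold() for title in target_sections}
    missing = []
    for title in source_sections:
        mapped = title_map.get(title.casefold())
        if mapped is None or mapped.casefold() not in present:
            missing.append(title)
    return missing


# Made-up section lists for illustration; not taken from the real articles.
en_to_tl = {"history": "Kasaysayan"}
print(missing_sections(["History"], ["Mga sanggunian"], en_to_tl))
```

The same `title_map` that drives the missing-section check is exactly the heading translation table this task asks for, which is why supporting one could make the other cheap.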