
Apertium mishandling some links
Open, Medium, Public


When content is translated with Apertium, some of the links are not reapplied correctly. To some extent this is expected (since Apertium works with plain text), but for some cases we may want to review the algorithm that applies the links back to the Apertium translation.

In particular, links for places such as "Peterborough, Ontario" are lost despite the text being exactly the same in the source and the translation (maybe the "," is getting in the way of the algorithm).

The example below is based on a translation of "Gene Kelly" from English to Spanish ("Early life" section):

Screenshot 2019-01-15 at 10.52.01.png (235×1 px, 97 KB)

(The original report is included below. In addition to the issue with links, it also suggests improvements to the translations provided by the external service, which is out of the scope of Content Translation, although related improvements may help, such as T197662: CX2: Quickly switch between alternative link label translations.)

Here are some examples of Apertium mishandling links, in 3 distinct ways:

  • Failing to create a link
  • Failing to mark a "missing link"
  • Translating a name link literally

You can see all the examples below on the same page. Using english>espanol, start a new translation of the page "Gene Kelly".

Failing to create a link

Examples of Apertium failing to create a link and substituting plain text instead, even though the target page was available on es wiki. (In all these instances, Yandex and Google were able to successfully create the link.)
• Paragraph 1 under “Early Life”, third sentence: Peterborough, Ontario
• Paragraph 3 under “Early Life”, second sentence: Johnstown, Pennsylvania
• Paragraph 2 under “Political and Religious Views”, first sentence: Beverly Hills, California

Failing to mark a missing link

Examples of Apertium failing to handle a missing link correctly, while Yandex and Google handle it successfully:
• Paragraph 2 under “Stage Career”, first sentence: Leave It to Me.
• Paragraph 2 under “Working method and influence on filmed dance”, sentence 7: Loew’s Penn Theater.

Translating a name link literally

Examples of Apertium translating a surname link literally (while Yandex and Google successfully translate):
• Paragraph 1 under “Film Career”, sentence 5: link “Lucille Ball” is translated as “Lucille Pelota”
• Paragraph 2 (near top of page), sentence 2: link “Judy Garland” is translated as “Judy Guirnalda”

Event Timeline

Pginer-WMF renamed this task from Apertium mishandling some links to Apertium mishandling some links. Jan 15 2019, 10:10 AM
Pginer-WMF updated the task description. (Show Details)
Pginer-WMF added a subscriber: Pginer-WMF.

Thanks for the report @Barbvd. These issues are directly related to the way Apertium works, and we are limited in how much we can improve the situation. I have updated the description to focus on the issue that can be improved from the Content Translation side, and provided more details below.

  • Failing to create a link
  • Failing to mark a "missing link"

Unlike other translation services, Apertium does not support rich text, providing only plain-text translations. When the plain-text translation is obtained from Apertium, the text formatting and links are reapplied to it by Content Translation.

This process cannot be perfect: word order changes, and a segment of the source content, such as the label of a link, may be translated very differently in isolation than in the context of a sentence. So this process is expected to produce some false positives. What was a bit surprising in your examples is that it failed for city names that were kept unchanged in the translation. My guess is that the "," character may be confusing the algorithm, which is worth investigating.
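To make the idea concrete, here is a minimal sketch of one way links could be reapplied to a plain-text translation. It is not the actual Content Translation algorithm: `reapply_link`, `toy_translate`, the canned translations, and the wiki-style `[[title|label]]` output are all assumptions made up for illustration. The sketch translates the link label in isolation, looks for it in the translated sentence, and falls back to the unchanged source label.

```python
from typing import Callable, Optional

# Toy stand-in for a plain-text MT call; the entries are made up for illustration.
TOY_LABEL_TRANSLATIONS = {
    "Judy Garland": "Judy Guirnalda",
    "early life": "vida temprana",
}


def toy_translate(text: str) -> str:
    """Return a canned translation, or the text unchanged if none is known."""
    return TOY_LABEL_TRANSLATIONS.get(text, text)


def reapply_link(
    translated_sentence: str,
    source_label: str,
    target_title: str,
    translate: Callable[[str], str],
) -> Optional[str]:
    """Wrap the link label in the translated sentence, or return None if not found."""
    # Try the label translated in isolation first, then the untouched source label.
    candidates = [translate(source_label), source_label]
    for label in candidates:
        position = translated_sentence.find(label)
        if position != -1:
            end = position + len(label)
            return (
                translated_sentence[:position]
                + f"[[{target_title}|{label}]]"
                + translated_sentence[end:]
            )
    # The in-context translation differs from every candidate: the link is lost.
    return None


# A label that survives translation unchanged is found by exact matching.
print(reapply_link(
    "Su padre nació en Peterborough, Ontario.",
    "Peterborough, Ontario",
    "Peterborough (Ontario)",
    toy_translate,
))

# A label translated differently in context cannot be located.
print(reapply_link(
    "Sus primeros años fueron difíciles.",
    "early life",
    "Infancia",
    toy_translate,
))
```

With plain substring matching like this, an unchanged label such as "Peterborough, Ontario" is found even though it contains a comma, which is why the reported failures point to some earlier step (for example, how the text is segmented before matching) rather than to the matching idea itself. That reading is a guess, not a confirmed diagnosis.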

  • Translating a name link literally

We rely on the translations that the external services provide. Since Apertium is a rule-based translation system, it may be more prone to these kinds of errors than other services following different approaches. We plan to work on a separate ticket that may help in these situations: T197662: CX2: Quickly switch between alternative link label translations

Pginer-WMF moved this task from Needs Triage to Bugs on the ContentTranslation board.
KartikMistry raised the priority of this task from Medium to Needs Triage. Jan 15 2019, 10:11 AM
KartikMistry triaged this task as Medium priority.
KartikMistry updated the task description. (Show Details)

We do have a plan to change Apertium to handle markup correctly, so one can trust that certain tags are always kept (and ordering of close/open tags is preserved), and there has been some work towards that end already, but it's all in "work in progress" branches and needs cleanup and testing. I may find some time in a few months to do that. I can't say for certain whether it'll immediately solve this issue without changes on the Content Translation side though.

The nitty-gritty:

@Pginer-WMF, thank you for that background - very helpful!

I think you might be right about the comma being the culprit in some cases. There's an example on that same page of a comma possibly causing problems in a link that is NOT a place name. Under "Becoming established in Hollywood", Apertium fails to create the missing link dialog for "lieutenant, junior grade".

Maybe other types of punctuation throw Apertium off as well. It fails to create a link for "Singin' in the rain" (paragraph 5 under MGM). Interestingly, it leaves the word Singin' in its original form, translating the complete title as "Singin' en la Lluvia". Yandex translates it as "Cantando bajo la Lluvia."

Google simply uses the English title -- as it does with "An American in Paris" in the same paragraph. I get the impression that using the source language for titles is a style decision on their part. (Although they do translate titles sometimes. As an example, paragraph 2 under "1946-52: MGM" shows both strategies -- "The Three Musketeers" is translated, but "Take Me Out to the Ball Game" and "On the Town" are not.)

Another character that seems to trigger the sentence split logic prematurely is "ḥ".
This was reported by a translator trying to translate Yeha from English into Spanish.

Inside Content Translation, the word "yiḥa" was automatically translated as "yiḥun". However, using Apertium directly on their website keeps "yiḥa" intact as the translation result. The suspicion is that the "ḥ" character was interpreted as the end of a sentence and the remaining "a" was translated as "un" in Spanish.

Similar issues were reported for Polish, where "łski" becomes "ł esquí", "esquí" being the Spanish word for skiing.
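Both reports are consistent with some step treating only ASCII letters as part of a word, so that "ḥ" and "ł" act as boundaries. The sketch below illustrates that hypothesis only; it does not reproduce Content Translation's real segmentation code, and the names `ASCII_WORD`, `UNICODE_WORD`, and `split_words` are invented for the example.

```python
import re

# Hypothetical ASCII-only word pattern: any other character acts as a boundary.
ASCII_WORD = re.compile(r"[A-Za-z]+")
# Unicode-aware alternative: word characters minus digits and underscore.
UNICODE_WORD = re.compile(r"[^\W\d_]+")

YIHA = "yi\u1e25a"   # "yiḥa", using the precomposed letter U+1E25
LSKI = "\u0142ski"   # "łski", using the letter U+0142


def split_words(text: str, pattern: re.Pattern) -> list[str]:
    """Return the word-like chunks of text according to the given pattern."""
    return pattern.findall(text)


print(split_words(YIHA, ASCII_WORD))    # ['yi', 'a']
print(split_words(LSKI, ASCII_WORD))    # ['ski']
print(split_words(YIHA, UNICODE_WORD))  # ['yiḥa']
print(split_words(LSKI, UNICODE_WORD))  # ['łski']
```

Under the ASCII-only pattern, "a" and "ski" end up as standalone chunks, which would match the observed "un" and "esquí" if those chunks were then translated separately.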