
Content Translation treats machine translation of links that begin in the middle of a word incorrectly
Closed, Invalid · Public

Description

To reproduce: translate the article "ויקטור גרייבסקי" from Hebrew to Russian (it's easy to find more examples if this example doesn't work).

The first sentence of the second paragraph in Hebrew is "ויקטור גרייבסקי נולד בקרקוב שבפולין בשנת 1925 בשם ויקטור שפילמן." (In wiki syntax: ויקטור גרייבסקי נולד ב[[קרקוב]] שב[[הרפובליקה הפולנית השנייה|פולין]] בשנת [[1925]] בשם '''ויקטור שפילמן'''.)

It is translated to Russian as "Виктор grajewski родился вКракове сестьПольше в 1925 имени Виктор шпильман."

The translation has many errors, but one of them is definitely introduced by the CX software and not by the Yandex machine translation. The Hebrew word "בקרקוב" is translated as "вКракове". It means "in Krakow". It is written as one word in Hebrew, but as two words in Russian (and English). Without CX, Yandex translates it correctly as "в Кракове"; you can try it at https://translate.yandex.com/. In the original Hebrew text the first letter "ב" is a preposition, and "קרקוב" is the name of the city Krakow. The name of the city is a link, but the preposition is not part of the link.

There is another example in the same sentence: "שבפולין" is translated as "сестьПольше". The Hebrew text means "which is in Poland", but it is translated incorrectly because of the processing that CX does along the way, so it comes out as something like "sitPoland". This is wrong for several reasons:

  • The words in Russian are stuck together without space.
  • The prefix "שב" ("which is in") is sent for translation separately, probably because "פולין" (Poland) is a link and the prefix is not part of the link. As a separate word, the prefix happens to mean "sit", and that is how it was translated, but it should not have been sent for translation separately in the first place. It must be sent as one word: "שבפולין". If you paste the same sentence into Yandex.Translate as plain text, without any links, the translation of these words is different and better: "в Кракове, Польша".

I guess that this is a bug in how Content Translation's annotation mapping works. This affects Hebrew and Arabic, and possibly other languages which have prepositions and other prefixes at the beginning of the word before links (maybe French and Italian, which frequently have articles like l' at the beginning of the word).
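To find more examples of this pattern, a small script can list the links that start in the middle of a word. The following is a minimal Python sketch and not CX code: the regular expression handles only simple [[target]] and [[target|label]] links, and the names strip_links and mid_word_links are made up for illustration.

```python
import re

# Matches [[target]] or [[target|label]] wikitext links (simple cases only).
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def strip_links(wikitext):
    """Return (plain_text, spans); each span is (start, end, target) in plain_text."""
    plain = []
    spans = []
    pos = 0      # read position in the wikitext
    length = 0   # length of the plain text emitted so far
    for m in LINK_RE.finditer(wikitext):
        before = wikitext[pos:m.start()]
        plain.append(before)
        length += len(before)
        label = m.group(2) or m.group(1)  # visible text of the link
        spans.append((length, length + len(label), m.group(1)))
        plain.append(label)
        length += len(label)
        pos = m.end()
    plain.append(wikitext[pos:])
    return "".join(plain), spans


def mid_word_links(plain, spans):
    """Return (prefix, label) pairs for links that begin in the middle of a word."""
    found = []
    for start, end, _target in spans:
        # Walk back to the previous whitespace to find where the word starts.
        word_start = start
        while word_start > 0 and not plain[word_start - 1].isspace():
            word_start -= 1
        if word_start < start:
            found.append((plain[word_start:start], plain[start:end]))
    return found


# The example sentence from this task, shortened.
source = "ויקטור גרייבסקי נולד ב[[קרקוב]] שב[[הרפובליקה הפולנית השנייה|פולין]] בשנת [[1925]]"
plain, spans = strip_links(source)
print(plain)
print(mid_word_links(plain, spans))
```

For the sentence above it reports the prefixes "ב" and "שב" together with the link labels "קרקוב" and "פולין"; the link [[1925]] is skipped because it follows a space. The word-boundary check looks only at whitespace, so a link preceded by an opening parenthesis would also be reported.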

Event Timeline

Amire80 triaged this task as Medium priority. (Jul 22 2016, 8:41 PM)
Arrbee renamed this task from "Content Translation treats machine translation of links that begin in the middle of a word incorrectly" to "[to triage] Content Translation treats machine translation of links that begin in the middle of a word incorrectly". (Jul 6 2018, 5:09 PM)
Arrbee moved this task from Bugs to Check & Move on the ContentTranslation board.
Pginer-WMF subscribed.

@Amire80, can you confirm whether this happens in version 2 of Content Translation?

Arrbee renamed this task from "[to triage] Content Translation treats machine translation of links that begin in the middle of a word incorrectly" to "Content Translation treats machine translation of links that begin in the middle of a word incorrectly". (Aug 6 2018, 9:58 AM)

I tested it further, and found that the issue is in Yandex.Translate and not in CX.