Detection of entity names seems to fail quite often
Closed, Declined (Public)


Given that I translate an article
And that article contains a common word as an entity name
When I translate a block with that entity name
Then the entity name should be retained

What happens quite often is that the name is translated. If the word is part of the page title, that is a very clear indication that it is in fact an entity name. It is also a very clear indication of an entity name if the word is capitalized without a preceding sentence terminator. In either case it should probably not be translated.
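A minimal sketch of these two heuristics, only to illustrate the idea. The function name `likely_entity_names` is hypothetical and not part of CX or Apertium; the sketch assumes whitespace tokenization, treats `.`, `!` and `?` as sentence terminators, and only considers capitalized words as candidates.

```python
import re

def likely_entity_names(text, page_title):
    """Return capitalized words that look like entity names (hypothetical helper).

    A capitalized word is flagged if it also occurs in the page title,
    or if it does not directly follow a sentence terminator.
    """
    title_words = {w.lower() for w in re.findall(r"\w+", page_title)}
    tokens = text.split()
    found = set()
    for i, token in enumerate(tokens):
        # Drop surrounding punctuation and quotation marks from the token.
        word = token.strip(".,;:!?\"'()«»")
        if not word or not word[0].isupper():
            continue
        # Assumption: a sentence starts at the first token or after . ! ?
        starts_sentence = i == 0 or tokens[i - 1][-1] in ".!?"
        if word.lower() in title_words or not starts_sentence:
            found.add(word)
    return found

names = likely_entity_names(
    "Frode Grytten er ein forfattar. Han skriv om Odda.",
    "Frode Grytten",
)
```

Here "Frode" is kept because it is in the title, "Grytten" and "Odda" because they are capitalized mid-sentence, and "Han" is skipped because it only starts a sentence. Words flagged this way could be excluded from translation or shown to the translator for review.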

A user said that emphasized text should be left as it is. I'm not sure this is correct, as emphasis and quotation marks are interchangeable in Norwegian typesetting. If quotation marks are used, it is a rather clear indication that the text should not be translated. Emphasis is, however, an indication that capital letters should be retained.
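One way to keep quoted text untranslated is to mask it before the text goes to the translation engine and restore it afterwards. The sketch below assumes straight double quotes and guillemets as quotation marks; the helper names and the `__KEEP0__` placeholder format are hypothetical, and it assumes the engine passes such placeholders through unchanged, which would need to be verified per engine. The example sentence is made up.

```python
import re

# Assumption: quoted spans use "..." or «...» and are not nested.
QUOTED = re.compile(r'"[^"]+"|«[^»]+»')

def protect_quoted(text):
    """Replace quoted spans with placeholders; return masked text and the spans."""
    kept = []

    def stash(match):
        kept.append(match.group(0))
        return f"__KEEP{len(kept) - 1}__"

    return QUOTED.sub(stash, text), kept

def restore_quoted(text, kept):
    """Put the original quoted spans back in place of the placeholders."""
    for i, original in enumerate(kept):
        text = text.replace(f"__KEEP{i}__", original)
    return text

masked, kept = protect_quoted("Bandet heiter «Fråd».")
# Stand-in for the machine translation step (nno to nob), done by hand here.
translated = masked.replace("heiter", "heter")
restored = restore_quoted(translated, kept)
```

The quoted name survives unchanged while the rest of the sentence is translated.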

Another option could be to check linked articles for entity names, possibly also articles in the same category. Some words will also be used in connection with names, and could act as markers to detect entity names that should be written with capital letters.

An article that had a name messed up was Frode Grytten, one of the test articles for the nno-nob pair from Apertium.

Note that "Frode" (male name) became "Fråd" (froth) in the first one, and I had to correct the translation.

Second note: in some cases a name has an established translation. That should be respected.

Third note: "quite often" is not quite functional… ;)

Event Timeline

Amire80 triaged this task as Medium priority. Jun 19 2016, 12:26 PM
Amire80 subscribed.

Thanks a lot for the report.

The ideal solution would probably be to offer multiple translations for the same word. We already have a task to do this for suggestions from several engines (T90161), but the same engine can offer several translations, too. This must be added in the CX extension's UI, and must, of course, be supported by the translation backend. I assume that Google already supports it (which we don't currently offer, but we may offer in the future, theoretically), but I'm not sure about Apertium and Yandex.

Pginer-WMF subscribed.

Machine translation is provided as a way to speed up the translation process, and it is not expected to always be correct. Improving these external services is out of the scope of the Content Translation tool, and adding complex rules on top of them may not be practical. Having said that, there are ways to reduce the need for recurrent corrections, which would improve the experience in cases like the ones described in this ticket even if the external services have not been fixed. This ticket captures our current thinking in this area: T96165: Learn from user corrections to avoid editing the same term again and again

Here, as in T103066, I'd say that the really right solution is having better semantics in the text, and this is outside the scope of CX.

I still support the idea of showing multiple translation options, as I wrote above, but this can be done some day as part of T90161.