Detection of entity names seems to fail quite often
Closed, Declined · Public

Assigned To

None

Authored By

	jeblad
	Jun 9 2016, 8:29 AM

Description

Given that I translate an article
And that article contains a common word used as an entity name
When I translate a block with that entity name
Then the entity name should be retained

What happens quite often is that the name gets translated. If the word is part of the page title, that is a very clear indication that it is in fact an entity name. Another very clear indication is a capital letter on a word that does not follow a sentence terminator. In either case the word should probably not be translated.
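The two indications above could be sketched roughly as follows. This is only an illustration of the idea, not how Apertium or Content Translation works; the function name `likely_entity_names`, the set of sentence terminators, and the sample sentence are all assumptions made for the example.

```python
import re

# Assumed set of characters that end a sentence (hypothetical choice).
SENTENCE_END = ".!?:"

def likely_entity_names(text, page_title):
    """Return capitalized words flagged by the two heuristics:
    (1) the word appears in the page title, or
    (2) the word is capitalized but does not start a sentence."""
    title_words = {w.lower() for w in re.findall(r"\w+", page_title)}
    flagged = set()
    prev_end = None  # end offset of the previous word, None for the first word
    for m in re.finditer(r"\w+", text):
        word = m.group()
        between = text[prev_end:m.start()] if prev_end is not None else ""
        starts_sentence = prev_end is None or any(c in SENTENCE_END for c in between)
        if word[0].isupper():
            if word.lower() in title_words or not starts_sentence:
                flagged.add(word)
        prev_end = m.end()
    return flagged

# Made-up sample sentence in Nynorsk, for illustration only.
sample = "Frode Grytten er ein norsk forfattar. Han voks opp i Odda."
print(sorted(likely_entity_names(sample, "Frode Grytten")))
# ['Frode', 'Grytten', 'Odda']
```

Words flagged this way could then be protected from translation. The sketch ignores multi-word names and will misfire in languages that capitalize common nouns.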

A user said that emphasized text should be left as it is. I'm not sure this is correct, as emphasis and quotation marks are interchangeable in Norwegian typesetting. If quotation marks are used, it is a rather clear indication that the text should not be translated. Emphasis is, however, an indication that capital letters should be retained.

Another option could be to check linked articles for entity names, possibly also articles in the same category. Some words will also be used in connection with names, and could act as markers to detect entity names that should be written with capital letters.
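A minimal sketch of this second idea, assuming the titles of linked articles and a list of marker words are already available. The function name `names_from_context`, its parameters, and the marker word "forfattaren" (the author) are hypothetical and chosen only for the example.

```python
import re

def names_from_context(text, linked_titles, marker_words):
    """Flag words that occur in the titles of linked articles,
    or that directly follow a marker word."""
    tokens = re.findall(r"\w+", text)
    known = {w.lower() for title in linked_titles for w in re.findall(r"\w+", title)}
    markers = {m.lower() for m in marker_words}
    found = set()
    for i, tok in enumerate(tokens):
        if tok.lower() in known:
            found.add(tok)  # word is part of a linked article's title
        elif i > 0 and tokens[i - 1].lower() in markers:
            found.add(tok)  # word follows a marker such as "forfattaren"
    return found

print(sorted(names_from_context(
    "forfattaren frode skreiv om Odda",
    linked_titles=["Odda"],
    marker_words=["forfattaren"],
)))
# ['Odda', 'frode']
```

Note that "frode" is caught even though it is written in lower case, which is the situation where a marker word could tell the translator to restore the capital letter. Common words inside linked titles would need to be filtered out in practice.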

An article that had a name messed up was Frode Grytten, one of the test articles for the nno-nob pair from Apertium.

https://svn.code.sf.net/p/apertium/svn/branches/scandi-mt-eval/frode_grytten.nno-nob.a.txt (machine translated)
http://piratepad.net/bTi9sk5a8a (proofread)

Note that "Frode" (male name) became "Fråd" (froth) in the first one, and I had to correct the translation.

Second note: in some cases a name has an established translation. That should be respected.

Third note: "quite often" is not quite functional… ;)

Related Objects

Mentioned Here: T103066: Adapt number format across languages
T96165: Learn from user corrections to avoid editing the same term again and again
T90161: Suggestions from multiple translation services

Event Timeline

jeblad created this task. Jun 9 2016, 8:29 AM

Restricted Application added subscribers: Zppix, Aklapper. Jun 9 2016, 8:29 AM

jeblad updated the task description. Jun 9 2016, 8:34 AM

jeblad updated the task description. Jun 9 2016, 8:43 AM

jeblad updated the task description. Jun 9 2016, 12:06 PM

Thanks a lot for the report.

The ideal solution would probably be to offer multiple translations for the same word. We already have a task to do this for suggestions from several engines (T90161), but a single engine can offer several translations, too. This must be added to the CX extension's UI and must, of course, be supported by the translation backend. I assume that Google already supports it (we don't currently offer Google, but theoretically we may in the future), but I'm not sure about Apertium and Yandex.

Amire80 added a project: Essential-Work. Jun 19 2016, 12:27 PM

Amire80 moved this task from Needs Triage to Long term on the ContentTranslation board.

Restricted Application added a subscriber: jhsoby. Feb 27 2017, 8:02 AM

Annelingua subscribed. Dec 5 2017, 11:09 PM

Arrbee moved this task from Long term to Check & Move on the ContentTranslation board. Jan 20 2020, 7:51 AM

Machine translation is provided as a way to speed up the translation process, and it is not expected to always be correct. Improving these external services is out of scope for the Content Translation tool, and adding complex rules on top of them may not be practical. Having said that, there are ways to reduce the need for recurrent corrections, which would improve the experience in cases like the ones described in this ticket even if the external services have not been fixed. This ticket captures our current thinking in this area: T96165: Learn from user corrections to avoid editing the same term again and again
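To make the idea of learning from corrections concrete, here is a small sketch of a correction memory applied on top of machine translation output. It is not the design discussed in T96165 and not part of Content Translation; the class `CorrectionMemory` and its methods are hypothetical.

```python
import re

class CorrectionMemory:
    """Remembers user corrections and reapplies them to later MT output."""

    def __init__(self):
        self._fixes = {}  # wrong term -> corrected term

    def record(self, wrong, corrected):
        self._fixes[wrong] = corrected

    def apply(self, machine_output):
        if not self._fixes:
            return machine_output
        # Longest terms first, so a multi-word fix wins over a shorter one.
        terms = sorted(self._fixes, key=len, reverse=True)
        pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
        return pattern.sub(lambda m: self._fixes[m.group(1)], machine_output)

memory = CorrectionMemory()
memory.record("Fråd", "Frode")  # the correction described in this task
print(memory.apply("Fråd Grytten er en norsk forfatter."))
# Frode Grytten er en norsk forfatter.
```

The match is case-sensitive on purpose, so a legitimate lower-case "fråd" (froth) is left alone. A real implementation would likely key corrections on the source term rather than on the wrong output.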

Here, as in T103066, I'd say that the right solution is to have better semantics in the text, and this is outside the scope of CX.

I still support the idea of showing multiple translation options, as I wrote above, but this can be done some day as part of T90161.
