
CX2: Should not transform named HTML entities into numeric HTML entities
Closed, Resolved, Public

Description

CX2 should not transform things like &ndash; into their numeric equivalent &#x2D; because the numeric forms are much harder to read.

Example on frwiki, "Félicie de Hauteville" (translation compared to original):

  • (c. 1078 - c. 1102) compared to (c. 1078 – c. 1102)
  • (c. 1070 - 3 février 1116) compared to (c. 1070 – 3 February 1116)
  • (avant 1101 - ?) compared to (before 1101 – ?)
  • (1101 - 1er mars 1131) compared to (1101 – 1 March 1131)

Event Timeline

Pginer-WMF subscribed.

We may need to check whether this is a side effect of the sanitization applied to content coming from MT services (or whether it can be fixed as part of that sanitization). It may be related to T213257.

Change 486237 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Do not allow MT engines to change the entity values

https://gerrit.wikimedia.org/r/486237

Change 486237 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Do not allow MT engines to change the entity values

https://gerrit.wikimedia.org/r/486237
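
To illustrate the idea in the patch title ("Do not allow MT engines to change the entity values"), here is a minimal sketch of one way to keep entities intact across an MT call. It is not taken from change 486237 or from cxserver itself: the function `translate_preserving_entities`, the placeholder scheme using private-use characters, and the stand-in engine `fake_mt` are all hypothetical.

```python
import re
from typing import Callable, List

# Matches named (&ndash;), decimal (&#8211;) and hexadecimal (&#x2013;) references.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def translate_preserving_entities(text: str, translate: Callable[[str], str]) -> str:
    """Hide entities from `translate`, then restore them verbatim (hypothetical helper)."""
    saved: List[str] = []

    def stash(match: "re.Match[str]") -> str:
        # Replace each entity with an opaque placeholder built from
        # private-use characters, remembering the original text.
        saved.append(match.group(0))
        return f"\uE000{len(saved) - 1}\uE001"

    masked = ENTITY_RE.sub(stash, text)
    translated = translate(masked)
    # Put the original entities back exactly as they were written.
    return re.sub(
        "\uE000(\\d+)\uE001",
        lambda m: saved[int(m.group(1))],
        translated,
    )


def fake_mt(segment: str) -> str:
    """Stand-in for an MT engine that rewrites entities (assumption, for demo only)."""
    return segment.replace("&ndash;", "&#x2013;").replace(" is ", " es ")


source = "(c. 1078 &ndash; c. 1102) is a name"
print(fake_mt(source))                                  # entity rewritten by the engine
print(translate_preserving_entities(source, fake_mt))   # entity preserved
```

This approach only works if the engine passes the placeholders through unchanged; an engine that drops or reorders them would need a different strategy, such as comparing entities before and after translation.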

Mentioned in SAL (#wikimedia-operations) [2019-01-25T05:38:07Z] <kartik@deploy1001> Started deploy [cxserver/deploy@a5d7181]: Update cxserver to 356f0a1 (T213257, T213275)

Mentioned in SAL (#wikimedia-operations) [2019-01-25T05:42:16Z] <kartik@deploy1001> Finished deploy [cxserver/deploy@a5d7181]: Update cxserver to 356f0a1 (T213257, T213275) (duration: 04m 09s)

@santhosh - there are several issues that I'd like you to review:

(1) I checked the fix in cx2-testing on the same article translation: en:Felicia of Sicily -> fr:Félicie de Hauteville. The only MT option in cx2-testing for en-fr translation is Yandex, which does not work, so the translation fell back to 'Copy original content'. I used this option to translate the problematic content, e.g. (c. 1078 &ndash; c. 1102). The result: with 'Copy original content', (c. 1078 &ndash; c. 1102) became (c. 1078 &#x2D; c. 1102) upon publishing.

(2) Translating en -> es with the Apertium MT engine (the text in the sample below is replaced with 'Lorem ipsum' to avoid the "too much unmodified text" warning), I still see &ndash; -> &#x2013;

'''Id magna nec elit laoreet commodo quis eget nisi'''   (c. 1078 &#x40;&#x2013; c. 1102) es un nombre que está utilizado para una.

* [[Sofía de Hungría|Sophia]] (antes de que 1101 &#x40;&#x2013; ?), mujer de un húngaro noble
* King [[Esteban II de Hungría|Stephen II de Hungría]] (1101 &#x40;&#x2013; 1 Marcha 1131)
* Ladislaus (?)

(3) The task description says:

"CX2 should not transform things like &ndash; into their numeric equivalent &#x2D;"

However, the numeric equivalent of &ndash; is not &#x2D; (see http://www.howtocreate.co.uk/sidehtmlentity.html):

Dec       Hex        Entity
&#8211;   &#x2013;   &ndash;

There is no entity equivalent for &#x2D;:

Dec       Hex        Entity
&#45;     &#x2d;     (none)
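
The two tables above can be checked with a short script. This is a sketch using Python's standard `html` module; the helper name `describe` is hypothetical, it assumes the reference decodes to a single character, and the named-entity lookup uses the HTML 4 table in `html.entities.codepoint2name`.

```python
import html
from html.entities import codepoint2name
from typing import Optional, Tuple


def describe(reference: str) -> Tuple[str, str, Optional[str]]:
    """Return the (decimal, hexadecimal, named) forms of one character reference.

    Hypothetical helper: assumes `reference` decodes to exactly one character.
    """
    char = html.unescape(reference)
    codepoint = ord(char)
    # codepoint2name maps code points to HTML 4 entity names (without & and ;).
    name = codepoint2name.get(codepoint)
    named = f"&{name};" if name else None
    return (f"&#{codepoint};", f"&#x{codepoint:X};", named)


print(describe("&ndash;"))   # all three forms of the en dash, U+2013
print(describe("&#x2D;"))    # hyphen-minus, U+002D: no named form in this table
```

Any of the three en dash forms passed to `describe` yields the same row, while `&#x2D;` decodes to a different character altogether.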

(4) &ndash; represents an en dash, which is used for ranges and is correctly used in the article for (c. 1078 &ndash; c. 1102). &#x2D; is a hyphen, which is incorrect for a range such as (c. 1078 &#x2D; c. 1102) according to English punctuation rules.

No additional work is required per @santhosh; re-checked for possible regressions, and everything seems fine.