CX2: Should not use   for regular white-space characters
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	NicoV
	Jan 9 2019, 10:47 AM

Description

CX2 should use regular white-space characters, not their HTML encoded equivalent  

Example on frwiki Maaraue : https://fr.wikipedia.org/w/index.php?title=Maaraue&action=edit&oldid=155643355 with the following texts

u. a.
entre autres

Details

	Subject	Repo	Branch	Lines +/-
	Do not allow MT engines to change the entity values	mediawiki/services/cxserver	master	+30 -4

Customize query in gerrit

Related Objects

Mentioned In: T213275: CX2: Should not transform named HTML entites into numeric HTML entities
Mentioned Here: rGCXS356f0a1943d4: Do not allow MT engines to change the entity values
T56262: Parsoid: mw:Entity not preserved if data-parsoid missing
T213275: CX2: Should not transform named HTML entites into numeric HTML entities

Event Timeline

NicoV created this task.Jan 9 2019, 10:47 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 9 2019, 10:47 AM

Pginer-WMF updated the task description. (Show Details)Jan 9 2019, 12:19 PM

This can be reproduced by when translating Maaraue from German to French using Yandex. The last content paragraph will result in "entre autres" when published.

We may need to check if this is a side-effect of the content sanitization done for content coming from MT services (or if it is something that can be fixed as part of it). May be related to T213275.

Pginer-WMF triaged this task as Medium priority.Jan 9 2019, 12:29 PM

Pginer-WMF mentioned this in T213275: CX2: Should not transform named HTML entites into numeric HTML entities.Jan 9 2019, 12:33 PM

The original article has the following wikitext

auf dem sich u.&nbsp;a.

Parsoid/Restbase converts this to the following html

auf dem sich u.<span typeof="mw:Entity" id="mwAw"> </span>a.

When a MT service translate this, for example, Yandex translates this to French:

entre<span id=\"mwKQ\" typeof=\"mw:Entity\"> </span>autres

Converting this back to wikitext using parsoid/restbase gives the wikitext:

entre&#x20;autres

&nbsp;  => <span typeof="mw:Entity" id="mwAw"> </span> => &#x20;

@ssastry Do you think this is a valid round trip? The reporter of this issue says, they prefer   instead of numerical representation. More examples at T213275: CX2: Should not transform named HTML entites into numeric HTML entities

@santhosh The problem is not only about prefering named HTML entities mentionned in T213275, but the conversion here is even incorrect:   is a regular whitespace character, not at all equivalent to   which is a non-breaking whitespace... I think   should be converted to a regular whitespace, I don't see any reason to have it instead of the regular whitespace inside a normal text.

Using HTML entity for a non-breaking whitespace is ok (preferably a named HTML entity for me as it's more understandable), because the difference between the actual character and a regular whitespace is not visually obvious.
In the other hand, using HTML entity for a normal character (regular whitespace, ANSI letter, classic accented letter...), is unecessary and complexify the wikitext for the editor.

In T213257#4895742, @santhosh wrote:
The original article has the following wikitext
auf dem sich u.&nbsp;a.
Parsoid/Restbase converts this to the following html
auf dem sich u. a.
When a MT service translate this, for example, Yandex translates this to French:
entre autres
Converting this back to wikitext using parsoid/restbase gives the wikitext:
entre&#x20;autres

Can you clarify which of the s in your example contain   vs ? Parsoid uses the literal unicode   character, but that sometimes makes it hard to tell what's actually going on when looking at HTML with the naked eye.

It appears to me that what is going on is that Parsoid is creating   but the translation service is stripping/normalizing the whitespace, resulting in . Parsoid faithfully interprets this as a explicit request to use an entity to encode the plain space character. The root cause is that the translation service is not preserving the original whitespace.

The workaround would probably be to do something like add an attribute to the redundantly identifying the whitespace enclosed, then use that to correct for the mangling done by the translation service. That is:

Parsoid:

<span typeof="mw:Entity">&nbsp;</span>

Mangled for translation:

<span typeof="mw:Entity" data-mw-entity="nbsp">&nsbp;</span>

Returned from translation service:

<span typeof="mw:Entity" data-mw-entity="nbsp">&#20;</span>

Unmangled using the data-mw-entity attribute:

<span typeof="mw:Entity">&nsbp;</span>

Parsoid generates the wikitext:

&nbsp;

Does this make sense? This mangling/unmangling would only need to be done for translation services which don't properly preserve whitespace.

@cscott Translation service is not required to reproduce the issue.

Give A B in https://en.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_wikitext_to_html_title_revision. It gives

<p id="mwAg">A<span typeof="mw:Entity" id="mwAw"> </span>B</p>

Copy that and paste into https://en.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_html_to_wikitext_title_revision. It gives

A&#x20;B

In T213257#4901234, @santhosh wrote:
@cscott Translation service is not required to reproduce the issue.

Give A B in https://en.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_wikitext_to_html_title_revision. It gives
A B
Copy that and paste into https://en.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_html_to_wikitext_title_revision. It gives
A&#x20;B

@santhosh this is because you are testing without data-parsoid which stores more information about the entity in the original source. Without that, Parsoid has no way of know what the original entity was.

To illustrate, see this:

[subbu@earth:~/work/wmf/parsoid] echo -e "x\n&nbsp;\n&#x20;\ny" | parse.js --body_only
<p data-parsoid='{"dsr":[0,17,0,0]}'>x
<span typeof="mw:Entity" data-parsoid='{"src":"&amp;nbsp;","srcContent":" ","dsr":[2,8,null,null]}'> </span>
<span typeof="mw:Entity" data-parsoid='{"src":"&amp;#x20;","srcContent":" ","dsr":[9,15,null,null]}'> </span>
y</p>

If you remove data-parsoid attribute, Parsoid doesn't know if it is   or  .

So, I think what is happening here is that the CX2 request to RESTBase is probably losing the UUID info of the original revision that initiated the translation (look at mw:TimeUuid meta tag in the <head> for example in enwiki:Hospet). That UUID is necessary to map Parsoid HTML ids to the corresponding data-parsoid (the mwAg, mwAg values in the id attributes). Check how VE handles this. VE is required to preserve this uuid info to prevent dirty diffs on the edited document.

...and the cut-and-paste mechanism santosh is using to test seems to be losing the U+00A0 character.

@ssastry:   is an "ordinary" space.   is a nonbreaking space. When santosh copy-and-pastes the span with the U+00A0 character in it, something (maybe his browser) is converting the U+00A0 to a U+0020. Your response is mixing up A0 and 20. I don't think the underlying problem is due to data-parsoid, it's being caused by exactly which character is inside the span. Unfortunately examples are hard to paste here because you can't visually tell the difference between A0 and 20. You need to look at a hex editor to be sure you're testing with the right unicode character.

FWIW, I verified that Firefox converts 00A0 to 0020 on cut-and-paste, at least on Linux, by copying the 00A0 in the first sentence of https://en.wikipedia.org/wiki/Non-breaking_space, pasting it into emacs, and using M-x hexl-mode. In order to properly test this you can't use copy-paste to construct your examples.

Here is a test file demonstrating that data-parsoid is not needed, what's important is the contents of the .

test-nbsp.html87 BDownload

$ bin/parse.js --html2wt < test-nbsp.html
a&#x20;b

c&nbsp;d

In T213257#4902622, @cscott wrote:
Here is a test file demonstrating that data-parsoid is not needed, what's important is the contents of the .

test-nbsp.html87 BDownload
$ bin/parse.js --html2wt < test-nbsp.html
a&#x20;b

c&nbsp;d

I stand corrected and I verified independently. Parsoid is doing the right thing here even without data-parsoid. Looks like that was an old bug (T56262: Parsoid: mw:Entity not preserved if data-parsoid missing) that was fixed early on. So, sorry about adding noise to the discussion. :)

santhosh claimed this task.Jan 24 2019, 8:48 AM

Change 486237 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Do not allow MT engines to change the entity values

https://gerrit.wikimedia.org/r/486237

gerritbot added a project: Patch-For-Review.Jan 24 2019, 10:00 AM

Thanks @cscott, @ssastry. The content inside the span tricked me, I should have checked the character inside it. In further testing, I found that the MT engine behavior is randomabout handling these inputs. Some of them changes nbsp to space, ndash to hyphen in random. I added a guard against these changes now.

Pginer-WMF added a project: Language-Team (Language-2019-January-March).Jan 24 2019, 10:19 AM

Pginer-WMF moved this task from Check & Move to CX2 on the ContentTranslation board.

Pginer-WMF moved this task from Backlog to In Progress on the Language-Team (Language-2019-January-March) board.

santhosh moved this task from In Progress to Needs QA-before deployment on the Language-Team (Language-2019-January-March) board.Jan 24 2019, 10:25 AM

santhosh moved this task from Needs QA-before deployment to In Review on the Language-Team (Language-2019-January-March) board.

Change 486237 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Do not allow MT engines to change the entity values

https://gerrit.wikimedia.org/r/486237

Your patch looks good to me.

santhosh moved this task from In Review to Needs QA-before deployment on the Language-Team (Language-2019-January-March) board.Jan 25 2019, 3:49 AM

Mentioned in SAL (#wikimedia-operations) [2019-01-25T05:38:07Z] <kartik@deploy1001> Started deploy [cxserver/deploy@a5d7181]: Update cxserver to 356f0a1 (T213257, T213275)

Mentioned in SAL (#wikimedia-operations) [2019-01-25T05:42:16Z] <kartik@deploy1001> Finished deploy [cxserver/deploy@a5d7181]: Update cxserver to 356f0a1 (T213257, T213275) (duration: 04m 09s)

I checked in cx2-testing (no MT translation) - unlike the case in T213275: CX2: Should not transform named HTML entites into numeric HTML entities,   is preserved when 'Copy original content' option is used. I added the task to the list of issues to check after the deployment.

@Etonkovidova This is deployed to cxserver in Production, you can check it there now.

Re-checked in wmf.14 - translating de:Maaraue to fr using CX2 testing will preserve  .

• Petar.petkovic removed a project: Patch-For-Review.Feb 5 2019, 8:08 AM

• Petar.petkovic removed subscribers: Stashbot, gerritbot.

• Petar.petkovic moved this task from Needs QA-before deployment to Done on the Language-Team (Language-2019-January-March) board.Feb 14 2019, 8:20 AM

CX2: Should not use &#x20; for regular white-space charactersClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

CX2: Should not use for regular white-space characters
Closed, ResolvedPublic
Actions