Page MenuHomePhabricator

CX2: Should not use   for regular white-space characters
Closed, ResolvedPublic

Description

CX2 should use regular white-space characters, not their HTML encoded equivalent  

Example on frwiki Maaraue : https://fr.wikipedia.org/w/index.php?title=Maaraue&action=edit&oldid=155643355 with the following texts

  • u. a.
  • entre autres

Event Timeline

Pginer-WMF subscribed.

This can be reproduced by when translating Maaraue from German to French using Yandex. The last content paragraph will result in "entre autres" when published.

We may need to check if this is a side-effect of the content sanitization done for content coming from MT services (or if it is something that can be fixed as part of it). May be related to T213275.

Pginer-WMF triaged this task as Medium priority.Jan 9 2019, 12:29 PM

The original article has the following wikitext

auf dem sich u. a.

Parsoid/Restbase converts this to the following html

auf dem sich u.<span typeof="mw:Entity" id="mwAw"> </span>a.

When a MT service translate this, for example, Yandex translates this to French:

entre<span id=\"mwKQ\" typeof=\"mw:Entity\"> </span>autres

Converting this back to wikitext using parsoid/restbase gives the wikitext:

entre&#x20;autres
santhosh added a subscriber: ssastry.
&nbsp;  => <span typeof="mw:Entity" id="mwAw"> </span> => &#x20;

@ssastry Do you think this is a valid round trip? The reporter of this issue says, they prefer &nbsp; instead of numerical representation. More examples at T213275: CX2: Should not transform named HTML entites into numeric HTML entities

@santhosh The problem is not only about prefering named HTML entities mentionned in T213275, but the conversion here is even incorrect: &#x20; is a regular whitespace character, not at all equivalent to &nbsp; which is a non-breaking whitespace... I think &#x20; should be converted to a regular whitespace, I don't see any reason to have it instead of the regular whitespace inside a normal text.

Using HTML entity for a non-breaking whitespace is ok (preferably a named HTML entity for me as it's more understandable), because the difference between the actual character and a regular whitespace is not visually obvious.
In the other hand, using HTML entity for a normal character (regular whitespace, ANSI letter, classic accented letter...), is unecessary and complexify the wikitext for the editor.

The original article has the following wikitext

auf dem sich u.&nbsp;a.

Parsoid/Restbase converts this to the following html

auf dem sich u.<span typeof="mw:Entity" id="mwAw"> </span>a.

When a MT service translate this, for example, Yandex translates this to French:

entre<span id=\"mwKQ\" typeof=\"mw:Entity\"> </span>autres

Converting this back to wikitext using parsoid/restbase gives the wikitext:

entre&#x20;autres

Can you clarify which of the <span>s in your example contain &nbsp; vs &#20;? Parsoid uses the literal unicode &nbsp; character, but that sometimes makes it hard to tell what's actually going on when looking at HTML with the naked eye.

It appears to me that what is going on is that Parsoid is creating <span ...>&nbsp;</span> but the translation service is stripping/normalizing the whitespace, resulting in <span ...>&#20;</span>. Parsoid faithfully interprets this as a explicit request to use an entity to encode the plain space character. The root cause is that the translation service is not preserving the original whitespace.

The workaround would probably be to do something like add an attribute to the <span> redundantly identifying the whitespace enclosed, then use that to correct for the mangling done by the translation service. That is:

Parsoid:

<span typeof="mw:Entity">&nbsp;</span>

Mangled for translation:

<span typeof="mw:Entity" data-mw-entity="nbsp">&nsbp;</span>

Returned from translation service:

<span typeof="mw:Entity" data-mw-entity="nbsp">&#20;</span>

Unmangled using the data-mw-entity attribute:

<span typeof="mw:Entity">&nsbp;</span>

Parsoid generates the wikitext:

&nbsp;

Does this make sense? This mangling/unmangling would only need to be done for translation services which don't properly preserve whitespace.

@cscott Translation service is not required to reproduce the issue.

  1. Give A&nbsp;B in https://en.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_wikitext_to_html_title_revision. It gives
<p id="mwAg">A<span typeof="mw:Entity" id="mwAw"> </span>B</p>
  1. Copy that and paste into https://en.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_html_to_wikitext_title_revision. It gives
A&#x20;B

@cscott Translation service is not required to reproduce the issue.

  1. Give A&nbsp;B in https://en.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_wikitext_to_html_title_revision. It gives
<p id="mwAg">A<span typeof="mw:Entity" id="mwAw"> </span>B</p>
  1. Copy that and paste into https://en.wikipedia.org/api/rest_v1/#!/Transforms/post_transform_html_to_wikitext_title_revision. It gives
A&#x20;B

@santhosh this is because you are testing without data-parsoid which stores more information about the entity in the original source. Without that, Parsoid has no way of know what the original entity was.

To illustrate, see this:

[subbu@earth:~/work/wmf/parsoid] echo -e "x\n&nbsp;\n&#x20;\ny" | parse.js --body_only
<p data-parsoid='{"dsr":[0,17,0,0]}'>x
<span typeof="mw:Entity" data-parsoid='{"src":"&amp;nbsp;","srcContent":" ","dsr":[2,8,null,null]}'> </span>
<span typeof="mw:Entity" data-parsoid='{"src":"&amp;#x20;","srcContent":" ","dsr":[9,15,null,null]}'> </span>
y</p>

If you remove data-parsoid attribute, Parsoid doesn't know if it is &nbsp; or &#x20;.

So, I think what is happening here is that the CX2 request to RESTBase is probably losing the UUID info of the original revision that initiated the translation (look at mw:TimeUuid meta tag in the <head> for example in enwiki:Hospet). That UUID is necessary to map Parsoid HTML ids to the corresponding data-parsoid (the mwAg, mwAg values in the id attributes). Check how VE handles this. VE is required to preserve this uuid info to prevent dirty diffs on the edited document.

...and the cut-and-paste mechanism santosh is using to test seems to be losing the U+00A0 character.

@ssastry: &#x20; is an "ordinary" space. &#xA0; is a nonbreaking space. When santosh copy-and-pastes the span with the U+00A0 character in it, something (maybe his browser) is converting the U+00A0 to a U+0020. Your response is mixing up A0 and 20. I don't think the underlying problem is due to data-parsoid, it's being caused by exactly which character is inside the span. Unfortunately examples are hard to paste here because you can't visually tell the difference between A0 and 20. You need to look at a hex editor to be sure you're testing with the right unicode character.

FWIW, I verified that Firefox converts 00A0 to 0020 on cut-and-paste, at least on Linux, by copying the 00A0 in the first sentence of https://en.wikipedia.org/wiki/Non-breaking_space, pasting it into emacs, and using M-x hexl-mode. In order to properly test this you can't use copy-paste to construct your examples.

Here is a test file demonstrating that data-parsoid is not needed, what's important is the contents of the <span>.

$ bin/parse.js --html2wt < test-nbsp.html
a&#x20;b

c&nbsp;d

Here is a test file demonstrating that data-parsoid is not needed, what's important is the contents of the <span>.

$ bin/parse.js --html2wt < test-nbsp.html
a&#x20;b

c&nbsp;d

I stand corrected and I verified independently. Parsoid is doing the right thing here even without data-parsoid. Looks like that was an old bug (T56262: Parsoid: mw:Entity not preserved if data-parsoid missing) that was fixed early on. So, sorry about adding noise to the discussion. :)

Change 486237 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Do not allow MT engines to change the entity values

https://gerrit.wikimedia.org/r/486237

Thanks @cscott, @ssastry. The content inside the span tricked me, I should have checked the character inside it. In further testing, I found that the MT engine behavior is randomabout handling these inputs. Some of them changes nbsp to space, ndash to hyphen in random. I added a guard against these changes now.

Change 486237 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Do not allow MT engines to change the entity values

https://gerrit.wikimedia.org/r/486237

Mentioned in SAL (#wikimedia-operations) [2019-01-25T05:38:07Z] <kartik@deploy1001> Started deploy [cxserver/deploy@a5d7181]: Update cxserver to 356f0a1 (T213257, T213275)

Mentioned in SAL (#wikimedia-operations) [2019-01-25T05:42:16Z] <kartik@deploy1001> Finished deploy [cxserver/deploy@a5d7181]: Update cxserver to 356f0a1 (T213257, T213275) (duration: 04m 09s)

I checked in cx2-testing (no MT translation) - unlike the case in T213275: CX2: Should not transform named HTML entites into numeric HTML entities, &nbsp; is preserved when 'Copy original content' option is used. I added the task to the list of issues to check after the deployment.

@Etonkovidova This is deployed to cxserver in Production, you can check it there now.

Re-checked in wmf.14 - translating de:Maaraue to fr using CX2 testing will preserve &nbsp;.