
HTML entity replaced by the Unicode character in an edit
Closed, ResolvedPublic

Description

HTML entities were replaced by the Unicode characters in these edits made with DiscussionTools:

In both diffs, there are other entities that were not affected (even on the same line of text).

More occurrences:

Event Timeline

This doesn't look like a Parsoid issue to me. This might be a UI issue in Discussion Tools that stripped the entity span. See below where Parsoid converts a literal char to the entity even when data-parsoid is stripped.

[subbu@earth:~/work/wmf/parsoid] echo "Ω" |php bin/parse.php --wt2wt
Ω

[subbu@earth:~/work/wmf/parsoid] echo '<span typeof="mw:Entity">Ω</span>' | php bin/parse.php --html2wt
&#x2126;

@matmarex, can you verify / repro?

ssastry triaged this task as Medium priority. Nov 18 2020, 4:27 AM

No idea how it's happening, and I couldn't reproduce, but here's another case: https://he.wikipedia.org/?diff=29904425

The entities in question are never loaded into the DT UI, so it sounds like a Remex bug or configuration issue. Cc @cscott.

@Esanders can you say more about that? In T266140#6625639, I say that Parsoid will not convert the entity to a char if the wrapping span is present. Remex shouldn't be stripping the spans.

I edited the source to convert the char back to the entity, then opened the page in VE via ?veaction=edit and if I modify even the list item, the entity is left alone (as expected).

@Esanders suspects that when they parse the page using Remex they are somehow losing the entity. I'm not convinced, but Ed's going to try to trace the HTML into and out of DiscussionTools to figure out more precisely what's going on.

@cscott testing locally, the bug only surfaces when I have RESTBase enabled.

The HTML produced by remex looks fine though, so I suspect something to do with RESTBase IDs.

I've tracked this down to the point where we do the HTML->wikitext serialisation, and compared it with a VE edit that wasn't corrupting the entity.

The HTML from the discussion tools edit has the rel and href attributes re-ordered in an adjacent <a>, presumably by the round-trip through remex:

VE:

<dl id="mwBg"><dd id="mwBw">test <span typeof="mw:Entity" id="mwCA">Ω</span> <a rel="mw:WikiLink" href="./User:Ed" title="User:Ed" class="new" id="mwCQ">Ed</a></dd></dl>

->

:test &#x2126; [[User:Ed|Ed]]

DT:

<dl id="mwBg"><dd id="mwBw">test <span typeof="mw:Entity" id="mwCA"></span> <a href="./User:Ed" rel="mw:WikiLink" title="User:Ed" class="new" id="mwCQ">Ed</a></dd></dl>

->

:test Ω [[User:Ed|Ed]]

It seems like there are two things to investigate here:

  1. Preserving attribute order in remex to help selser
  2. Even with the attribute re-order, the HTML should've round-tripped correctly

cc @ssastry @cscott

  1. Preserving attribute order in remex to help selser

We do have customized diff handlers that ignore ordering in some cases, although I'm not sure whether that is generalized. That said, regarding 2. below:

  2. Even with the attribute re-order, the HTML should've round-tripped correctly

Yes, for sure. If this isn't happening, this is a Parsoid bug. That said, the entity HTML is identical and Parsoid seems to handle that correctly.

[subbu@earth:~/work/wmf/parsoid] echo '<span typeof="mw:Entity" id="mwCA">Ω</span>' | php bin/parse.php --html2wt
&#x2126;

So, I guess I'll run this through the full selser path by sending it a new html where the attrs are reordered and see if that duplicates this behavior.

Okay, so I tested, and it appears that domdiff is agnostic to attr ordering:
[SELSER] | ORIG-src with DSR [0,29] = ":test &#x2126; [[User:Ed|Ed]]"
And, I introduced artificial diffs to trigger the regular serialization of the entity and it does the right thing.
So, there is still something else going on between what DTools is posting and what Parsoid sees.
Can you dump the posted HTML from DTools backend to Parsoid/RESTBase and compare that?

Can you dump the posted HTML from DTools backend to Parsoid/RESTBase and compare that?

As mentioned offline, the HTML in my last comment is from the DTools backend, right before it was sent to RB.

In my local test w/ RESTBase, I got this:

image.png (578×2 px, 100 KB)

This looks like what I'd expect for missing data-parsoid -- the entity is still there, it's just been normalized.

Still can't reproduce the entity going away completely, but stay tuned.

Getting there (slowly):
OLD DOM:

<p id="mwAw" data-parsoid='{"dsr":[17,343,0,0]}'>Vi kan förvänta oss att bilden är komplicerad när det gäller huruvida individer från Göteborg har ett tionde fonem, och i vilka ord de i så fall uttalar med ett tionde fonem. Det kan finnas infödda människor med arbetaryrken som uttalar många typiska <i id="mwBA" data-parsoid='{"dsr":[279,285,2,2]}'>ô</i><span typeof="mw:Entity" id="mwBQ" data-parsoid='{"src":"&amp;#x2011;","srcContent":"‑","dsr":[285,293,null,null]}'>‑</span>ord med en regional form av <i id="mwBg" data-parsoid='{"dsr":[321,327,2,2]}'>å</i><span typeof="mw:Entity" id="mwBw" data-parsoid='{"src":"&amp;#x2001;","srcContent":" ","dsr":[327,335,null,null]}'> </span>fonemet.</p>

After DOM diff:

<p id="mwAw" data-parsoid='{"dsr":[17,343,0,0]}' data-parsoid-diff='{"id":4946,"diff":["subtree-changed"]}'>Vi kan förvänta oss att bilden är komplicerad när det gäller huruvida individer från Göteborg har ett tionde fonem, och i vilka ord de i så fall uttalar med ett tionde fonem. Det kan finnas infödda människor med arbetaryrken som uttalar många typiska <i id="mwBA" data-parsoid='{"dsr":[279,285,2,2]}'>ô</i><span typeof="mw:Entity" id="mwBQ" data-parsoid='{"src":"&amp;#x2011;","srcContent":"‑","dsr":[285,293,null,null]}'>‑</span>ord med en regional form av <i id="mwBg" data-parsoid='{"dsr":[321,327,2,2]}'>å</i><span typeof="mw:Entity" id="mwBw" data-parsoid='{"src":"&amp;#x2001;","srcContent":" ","dsr":[327,335,null,null]}' data-parsoid-diff='{"id":4946,"diff":["children-changed","subtree-changed"]}'><meta typeof="mw:DiffMarker/deleted" data-parsoid="{}"/> </span>fonemet.</p>

Everything looks good, but selser is marking the entity as deleted for some reason.

Ah, the diff is saying that the content of the entity span has changed. So, it is marking the original text node deleted.

So, the question is whether the content did indeed change in transit (or whether dom-diff has a bug).

Oh, and indeed it has changed: the srcContent is \342\200\201 but the actual span contents are \342\200\203. How did *that* happen, I wonder?

And of course since those are invisible characters, I need to look *real close* to see where that happened...
(but not all of them are invisible characters; I thought Ed's test case had an omega...)
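
For reference, those octal escapes decode as follows (a quick PHP check):

<?php
// \342\200\201 = 0xE2 0x80 0x81 = U+2001 EM QUAD  (the stored srcContent)
// \342\200\203 = 0xE2 0x80 0x83 = U+2003 EM SPACE (the actual span contents)
var_dump( bin2hex( "\342\200\201" ) ); // string(6) "e28081"
var_dump( bin2hex( "\342\200\203" ) ); // string(6) "e28083"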

This is a very strange bug:

>>> json_encode(Validator::cleanUp("abc\u{2001}\u{2003}"))
=> ""abc\u2003\u2003""
>>> json_encode(Validator::NFD("abc\u{2001}\u{2003}"))
=> ""abc\u2003\u2003""

The Unicode NFC normalization being done in WebRequest::getVal() -> WebRequest::getGPCVal() -> Language::normalizeUnicode() -> UtfNormal\Validator::cleanUp -> UtfNormal\Validator::NFC -> UtfNormal\Validator::NFD is corrupting the input HTML.
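
A minimal way to reproduce the same mappings with PHP's built-in intl Normalizer (assuming the intl extension is available; core goes through the UtfNormal library, but the canonical mappings are identical):

<?php
// NFC applies canonical (singleton) decompositions, which is exactly what
// mangles the entity spans in transit:
//   U+2001 EM QUAD  -> U+2003 EM SPACE            (the Swedish example above)
//   U+2126 OHM SIGN -> U+03A9 GREEK CAPITAL OMEGA (Ed's Ω test case)
foreach ( [ "\u{2001}", "\u{2126}" ] as $char ) {
    $nfc = Normalizer::normalize( $char, Normalizer::FORM_C );
    printf( "%s -> %s\n", bin2hex( $char ), bin2hex( $nfc ) );
}
// Output: e28081 -> e28083   (U+2001 -> U+2003)
//         e284a6 -> cea9     (U+2126 -> U+03A9)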

Change 649751 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/extensions/VisualEditor@master] Don't NFC normalize Parsoid HTML

https://gerrit.wikimedia.org/r/649751

Option 1: ^ the above is one possible fix here.

Option 2: Another (cleaner?) option is to add a 'raw' type to ParamValidator (somehow) and change ApiParamValidatorCallbacks::getValue() in core to return WebRequest::getRawVal() for parameters with the raw type.

Option 3: A third option is to think hard and ensure that Parsoid always emits output in Unicode Normalization Form C. *Probably* this could be easily done w/ a post-processing pass, using the same utfnormal library that core does. This could result in dirty diff if the input wikitext was not in NFC, but it "should" Just Work (tm) for entities. The tricky thing here is that the wikitext author is explicitly requesting a non-NFC character using an entity escape. The NFC-clean thing to do is to leave the original in contentSrc but put the NFC form in content. Some weird cases possible here if the user manually specified a decomposed unicode form using multiple entity escapes that converts to a single unicode codepoint in NFC, etc. Perhaps not insolvable, but not entirely straightforward. (I believe the *wikitext in the database* will always be NFC due to this parameter normalization, but if the wikitext author uses explicit entity escapes they can cause the *output HTML* not to be NFC.)

Option 4: A fourth option is to treat this as a weird special case in selser for now and NFC convert the text contents before comparing them in DOMDiff.

https://unicode.org/reports/tr15/ for reference.
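
To make option 2 concrete, this is the practical difference between the two WebRequest accessors (an illustrative sketch, not the actual patch; the real change is wiring a 'raw' type through ParamValidator, as in the patches below):

<?php
// Illustrative only: fetching a parameter that carries Parsoid HTML
// containing U+2001 from an entity span. $request is a MediaWiki WebRequest.
function fetchHtmlParam( WebRequest $request ): array {
    return [
        // getVal() -> getGPCVal() -> Language::normalizeUnicode()
        // -> UtfNormal\Validator::cleanUp(), so U+2001 silently becomes U+2003
        'normalized' => $request->getVal( 'html' ),
        // getRawVal() skips the Unicode NFC normalization entirely,
        // so the bytes arrive exactly as the client posted them
        'raw' => $request->getRawVal( 'html' ),
    ];
}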

Change 649754 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] api: Add new 'raw' parameter type which avoids Unicode NFC normalization

https://gerrit.wikimedia.org/r/649754

Change 649755 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/extensions/VisualEditor@master] ApiVisualEditorEdit: the 'html' parameter should be raw to avoid normalization

https://gerrit.wikimedia.org/r/649755

^ this pair of patches implements "option 2" above.

Of everything, I prefer Option 4 since this will affect all clients and magically fixing VE to do the right thing won't help other editing clients out there.

I don't particularly like option 4, because I think it's a little too 'magical'. It covers up the NFC normalization under selser, which makes it less likely to cause dirty diffs (good!) but more surprising when the same bug creeps into edited HTML. But maybe defense in depth is warranted. I think option 2 is necessary because I think there are plenty of cases where the action API should *not* be trying to normalize the input string -- just immediately adjacent to the area changed in option 1 we see an attempt to pass *compressed* HTML to ApiVisualEditorEdit. I'm sure that ran into all sorts of mysterious problems because the binary deflated string was being NFC normalized...

So I think we need a 'raw' option, and I think the HTML should be passed using raw mode. That will make 'deflate' work, if nothing else. The remaining questions are:

  1. should &#x2001; generate U+2001 (literal) or U+2003 (NFC normalize all output, at least in the simple cases)
  2. what about (from figure 5 of https://unicode.org/reports/tr15): &#x1E0B;&#x0323; -- the NFC output should be U+1E0D U+0307 -- but only if they are adjacent like that, &#x1E0B; should not turn into U+1E0D in isolation. NFC normalizing the output is going to run into <span> boundaries. Perhaps the semantically correct thing to do is to grow the "entity affected region" to include all bytes in the NFC codepoint (only one of which need be written using entities), but this starts looking like a real headache. What if some of the characters are template-generated? (OTOH, this starts looking like a task for LanguageConverter, which is built to handle regions inside text nodes.)
  3. should we NFC normalize text regions during dom diff?

My personal opinion is that (1) should generate U+2001 and we should punt on (2) -- that is, we should explicitly state that the output HTML is not guaranteed to be in Unicode NFC and that's OK. That means we need raw mode when passing HTML around, but again this is only for the action API (there's no normalization in the REST API AFAIK). Then for (3) I don't particularly care one way or the other, but I'd lean toward not doing it, because it adds a bunch of string-conversion overhead for something that is only very rarely useful. Maybe it's worth doing if it can be done only in edge cases; i.e. do the fast string comparison first and only if it fails do an NFC normalization and try again. But I'd really want to watch performance closely.
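
A minimal sketch of what that fallback comparison might look like (a hypothetical helper, not actual DOMDiff code; assumes the intl Normalizer is available):

<?php
// Hypothetical sketch: compare text nodes with a cheap byte-equality fast
// path, falling back to NFC normalization only when the raw comparison fails.
function textNodesCanonicallyEqual( string $a, string $b ): bool {
    if ( $a === $b ) {
        return true; // fast path: identical bytes, no normalization cost
    }
    // Slow path: treat canonically equivalent text (e.g. U+2001 vs U+2003,
    // U+2126 vs U+03A9) as unchanged so selser keeps the original source.
    return Normalizer::normalize( $a, Normalizer::FORM_C )
        === Normalizer::normalize( $b, Normalizer::FORM_C );
}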

Change 649754 merged by jenkins-bot:
[mediawiki/core@master] api: Add new 'raw' parameter type which avoids Unicode NFC normalization

https://gerrit.wikimedia.org/r/649754

Change 649755 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] ApiVisualEditorEdit: the 'html' parameter should be raw to avoid normalization

https://gerrit.wikimedia.org/r/649755

Change 649751 abandoned by C. Scott Ananian:
[mediawiki/extensions/VisualEditor@master] Don't NFC normalize Parsoid HTML

Reason:
Abandoned in favor of I0d34c9a01f1132c2616ed3392ea40d8b73e15325

https://gerrit.wikimedia.org/r/649751

ppelberg claimed this task.