Page MenuHomePhabricator

Spaces before a colon in a preformatted block replaced with random other letters
Closed, ResolvedPublicBUG REPORT

Description

Diff: https://office.wikimedia.org/w/index.php?diff=310675&diffmode=source (private wiki, sorry, I haven't managed to reproduce this on a public wiki yet)

In this diff, the text that I meant to add was added, but also every space that appears before a colon in a preformatted block was replaced with a random letter.

This seems to be related to the fact that Parsoid transforms spaces before colons in preformatted blocks to non-breaking spaces (<span typeof="mw:DisplaySpace">&nbsp;</span>), which is visible (and causes a minor display issue in VE) in my sandbox. Those spaces are the ones that got mangled. But I can't reproduce that corruption on my simple sandbox page where I have just one of these DisplaySpace instances.

Event Timeline

I wonder if this was caused by https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/768281/ ... that has only rolled out till group 1 wikis .. hence probably only officewiki impact.

Hmm .. I cannot reproduce this locally (with the wikitext of the page) with both combinations (HTML from before the fix, and selsered with new patch), and vice versa.

Plus, the DSR offsets look correct. Will see if I can reproduce the bug on officewiki by editing that revision.

I can reproduce it on the page with that revision.

ssastry triaged this task as High priority.Apr 7 2022, 6:21 AM

Oh, I can also reproduce it on the latest revision.

Still not reproducible locally and DSR offsets look identical from the HTML version downloaded off officewiki and the HTML version that i generate locally. Hmm..

Can also be reproduced on https://office.wikimedia.org/wiki/User:Ssastry/Sandbox (currently at revid 310730 ) if you edit the first preformatted line.

https://office.wikimedia.org/wiki/User:Ssastry/Sandbox is now the simplest page possible where I could reproduce it (say if i edited "x" to "ax") and used 'Review Changes' to look at the diff.

Okay with that snippet, I can now reproduce this error locally on my laptop.
Plus, I also established it is present before my fix. So, I'll unblock the train and we'll look at this bug separately.

[subbu@earth:~/work/wmf/parsoid] selser.sh
1,2c1,2
<  x
<  y :)
---
>  a
>  yy:)
[subbu@earth:~/work/wmf/parsoid] git checkout e4192a17f3fafad3d085c25539e0d26625da57c3
Note: switching to 'e4192a17f3fafad3d085c25539e0d26625da57c3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at e4192a17f Followup to 29a92a69: Fix missing unwrap invocation in PWrap
[subbu@earth:~/work/wmf/parsoid] selser.sh
1,2c1,2
<  x
<  y :)
---
>  a
>  yy:)

Hrm .. Turns out the DSR on the DisplaySpace is incorrect even after my PreHander patch (off-by-N chars .. I suppose based on how many pre-lines there are?)

<pre data-parsoid='{"dsr":[0,8,1,0]}'>x
y<span typeof="mw:DisplaySpace" data-parsoid='{"dsr":[4,5,0,0]}'> </span>:)</pre>

That should be [5,6,0,0]. So, looks like there is more to investigate and fix.

Aha .. I see the problem .. There is DSR updating in the DisplaySpace handler and so, the mw:IndentPreWS meta tag that I added to avoid DSR fixup problems need to be present till the very end. I am stripping the meta tags too early in Cleanup:stripMarkerMetas which runs before some of these passes.

So, maybe I need to strip it in CleanUp::cleanupAndSaveDataParsoid instead.

I am surprised no one tripped on this bug till today! That Bash page has been around for a long time.

But yes, while this bug has been around for a while, this should not have triggered on unedited blobs. So, that part is still unresolved.

Change 778210 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Don't strip mw:IndentPre meta tags till the very end

https://gerrit.wikimedia.org/r/778210

Test edit on officewiki: ... shows that unedited portions of the page are being dirtied with this bug.

Test edit on testwiki: ... shows no problem with dirtying unedited portions of the page that would trigger this bug.

Test edit on mediawiki.org .. shows no problem with dirtying unedited portions of the page that would trigger this bug.

So, looks like this is not a bug on wikis with RESTBase and is limited to officewiki (and maybe wikitech and other private wikis).

EDIT: Yes on wikitech as well

Okay, for some reason, the dirtying seems restricted to the mw:DisplaySpace bug. This officewiki edit verifies that.

So, the impact is far less severe than I feared.

I am going unblock the train again.

Arguably we should not be adding display space in <pre> tags. I don't know if that avoids the bug, but displayspace is a formatting hack for french : style " punctuation and is completely unnecessary for preformatted text.

Ah, good point! That too! So, that can be fixed in the DisplaySpace DOM pass. 3 bugs for the price of 1.

Change 778210 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Don't strip mw:IndentPre meta tags till the very end

https://gerrit.wikimedia.org/r/778210

Change 779085 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.16.0-a5

https://gerrit.wikimedia.org/r/779085

Change 779085 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.16.0-a5

https://gerrit.wikimedia.org/r/779085

ssastry claimed this task.