Page MenuHomePhabricator

VE is removing spaces (dirty diffs) on some wikis (wikitech, officewiki)
Closed, ResolvedPublic

Description

The removed spaces include spaces around section headings (== Foo ====Foo==), in list markup (* Foo*Foo), and around the pipes in table markup.

Some examples:

Looking through Special:RecentChanges, the earliest example I managed to find was https://office.wikimedia.org/w/index.php?diff=252191&oldid=252189 on 2019-05-29 at 21:44 UTC. This edit at 19:04 and this edit at 20:15 did not remove spaces present elsewhere in the page.

Event Timeline

Anomie created this task.Jun 6 2019, 2:14 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 6 2019, 2:14 PM

This topic has come up on the staff irc channel a few times in the last week. This is probably VE / RESTBase interaction issues. @Esanders @mobrovac @matmarex might have more info about what changed.

Officewiki si a private wiki, and we don't have RESTBase set up for private wikis.

The earlier instance I know of is these diffs posted by @gpaumier on IRC on 30 May: https://office.wikimedia.org/w/index.php?diff=252216&oldid=252101 and https://office.wikimedia.org/w/index.php?diff=prev&oldid=252262

We've recently enabled the ability to switch editors on non-RESTBase wikis (T222312), which is expected to cause dirty diffs. There's also a config option to disable editor switching again if dirty diffs are more annoying than not being able to switch editors (see that task). However at the time these edits were made, switching editors should not have been possible yet, and so there must be some other reason for the dirty diffs.

Editing team members are currently travelling back home from an offsite, so please give us some time.

Anomie added a comment.Jun 7 2019, 1:15 PM

I just saw an edit on wikitech with a similar dirty diff: https://wikitech.wikimedia.org/w/index.php?title=Logstash&diff=1828764&oldid=1827679


I have no idea if it's a valid test, but since the time this seems to have started corresponds nicely to a Parsoid deployment SAL entry

20:35 arlolra: Updated Parsoid to 8546c79 (T219927, T211125)
20:28 arlolra@deploy1001: Finished deploy [parsoid/deploy@6caac43]: Updating Parsoid to 8546c79 (duration: 07m 46s)
20:20 arlolra@deploy1001: Started deploy [parsoid/deploy@6caac43]: Updating Parsoid to 8546c79

I decided to poke at it with parse.js a little to try to reproduce. The SAL log previous to that was on 2019-05-15,

20:18 arlolra@deploy1001: Finished deploy [parsoid/deploy@8f28977]: Updating Parsoid to 6658cad (duration: 06m 23s)
20:12 arlolra@deploy1001: Started deploy [parsoid/deploy@8f28977]: Updating Parsoid to 6658cad

Using that version (6658cad), I found that this managed to produce correct output:

$ echo "== foo ==" > /tmp/1
$ bin/parse.js --wt2html --inputfile /tmp/1 | bin/parse.js --html2wt --selser --oldtextfile /tmp/1
Waiting for stdin...
== foo ==

while in 8546c79 it didn't:

$ bin/parse.js --wt2html --inputfile /tmp/1 | bin/parse.js --html2wt --selser --oldtextfile /tmp/1
Waiting for stdin...
==foo==

Bisecting from there turned up rGPAR59fd077b2257: Stop generating an old dom when none is provided as the revision that broke it.

ssastry triaged this task as Normal priority.Jun 7 2019, 2:49 PM
ssastry edited projects, added Parsoid-Edit-Support; removed Parsoid.
ssastry added a subscriber: Arlolra.Jun 7 2019, 2:52 PM

I just saw an edit on wikitech with a similar dirty diff: https://wikitech.wikimedia.org/w/index.php?title=Logstash&diff=1828764&oldid=1827679

I have no idea if it's a valid test, but since the time this seems to have started corresponds nicely to a Parsoid deployment SAL entry

20:35 arlolra: Updated Parsoid to 8546c79 (T219927, T211125)
20:28 arlolra@deploy1001: Finished deploy [parsoid/deploy@6caac43]: Updating Parsoid to 8546c79 (duration: 07m 46s)
20:20 arlolra@deploy1001: Started deploy [parsoid/deploy@6caac43]: Updating Parsoid to 8546c79

I decided to poke at it with parse.js a little to try to reproduce. The SAL log previous to that was on 2019-05-15,

20:18 arlolra@deploy1001: Finished deploy [parsoid/deploy@8f28977]: Updating Parsoid to 6658cad (duration: 06m 23s)
20:12 arlolra@deploy1001: Started deploy [parsoid/deploy@8f28977]: Updating Parsoid to 6658cad

Using that version (6658cad), I found that this managed to produce correct output:

$ echo "== foo ==" > /tmp/1
$ bin/parse.js --wt2html --inputfile /tmp/1 | bin/parse.js --html2wt --selser --oldtextfile /tmp/1

Did you not provide an oldhtmlfile deliberately to mimic absence of RESTBase? Or was it unintentional? Because if you had provided the old HTML file, you wouldn't have uncovered this.

Waiting for stdin...

foo

while in 8546c79 it didn't:

$ bin/parse.js --wt2html --inputfile /tmp/1 | bin/parse.js --html2wt --selser --oldtextfile /tmp/1
Waiting for stdin...

foo

Bisecting from there turned up {59fd077b22} as the revision that broke it.

Interesing. I didn't realize that all these private wikis were effectively relying on Parsoid parsing the page again prior to serializing the edited HTML. The reparse generates the original HTML for the page which lets Parsoid do dom-diff and then apply selser. However, that reparse to use as the basis for dom diff is flawed. It works in the common case but there are edge cases (template edits, use of parser functions that are non-deterministic) which can potentially cause page corruption. But, I suppose these edge cases don't quite exist or are extremely rare on private wikis.

We do have some solutions for that problem ( T224241: Generate ids based on dsr info). Anyway, for now, we could revert that patch and when T224241 is resolved, regenerating the HTML prior to html->wt would be the right thing to do.

@Arlolra FYI.

Change 515091 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Revert "Stop generating an old dom when none is provided"

https://gerrit.wikimedia.org/r/515091

cscott added a subscriber: cscott.Jun 7 2019, 3:20 PM

Hm! Turns out that we've been worrying about the complex caching semantics required of RESTbase during edits, while all along merrily ignoring caching altogether....

.... while all along merrily ignoring caching altogether....

Only on private wikis.

Only on private wikis.

If RESTBase decides (based on these complex semantics) that it doesn't have cached data for an edit, it applies to non-private wikis too.

Only on private wikis.

If RESTBase decides (based on these complex semantics) that it doesn't have cached data for an edit, it applies to non-private wikis too.

Those are edge cases, and given the near absence of dirty diff complaints on non-private wikis, I am inclined to believe those are quite rare.

Anomie added a comment.Jun 7 2019, 5:17 PM

Did you not provide an oldhtmlfile deliberately to mimic absence of RESTBase? Or was it unintentional? Because if you had provided the old HTML file, you wouldn't have uncovered this.

Completely unintentional. I started with bin/parse.js --wt2html --inputfile /tmp/1 | bin/parse.js --html2wt but that never produced the correct output. When I tried adding --selser the resulting error message only mentioned --oldtext and --oldtextinput, and adding only that got the right output out of 6658cad.

Change 515091 abandoned by Subramanya Sastry:
Revert "Stop generating an old dom when none is provided"

Reason:
I don't know what I am talking about.

https://gerrit.wikimedia.org/r/515091

Change 515279 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Revert "Stop generating an old dom when none is provided"

https://gerrit.wikimedia.org/r/515279

Change 515279 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Revert "Stop generating an old dom when none is provided"

https://gerrit.wikimedia.org/r/515279

This should be resolved by the Parsoid patch and now we're just waiting for the next Parsoid deployment, correct?

This should be resolved by the Parsoid patch and now we're just waiting for the next Parsoid deployment, correct?

Correct. https://www.mediawiki.org/wiki/Parsoid/Deployments#Monday,_June_10,_2019_around_1:15_pm_PT:_2bf94f0_to_be_deployed_(deploy-2019-06-10_branch)_No_SRE_around

Krinkle renamed this task from VE on officewiki is removing spaces (dirty diffs) to VE is removing spaces (dirty diffs) on some wikis (wikitech, officewiki).Jun 12 2019, 7:06 PM

Mentioned in SAL (#wikimedia-operations) [2019-06-17T20:53:05Z] <arlolra> Updated Parsoid to rGPAR2bf94f07e8c0 (T225217)

Arlolra closed this task as Resolved.Jun 17 2019, 8:55 PM
Arlolra claimed this task.
Restricted Application added a project: User-Ryasmeen. · View Herald TranscriptJun 17 2019, 8:55 PM