Page MenuHomePhabricator

Whitespace normalized outside of modified tree
Closed, ResolvedPublic

Assigned To
Authored By
Arlolra
May 23 2018, 9:03 PM
Referenced Files
F18497474: html
May 23 2018, 10:56 PM
F18497471: edit
May 23 2018, 10:56 PM
F18497476: json
May 23 2018, 10:56 PM
F18497479: text
May 23 2018, 10:56 PM

Description

As a result of T157418

16:23 <+arlolra> how long until we start hearing complaints about https://fr.wikipedia.org/w/index.php?title=Tom_Palmer_(rugby_%C3%A0_XV)&curid=1169884&diff=148797916&oldid=148797790&diffmode=source
16:23 <+arlolra> https://fr.wikipedia.org/w/index.php?title=Paul_Destombes&curid=11726121&diff=148798038&oldid=148798015&diffmode=source
[...]
16:32 <+subbu|away> arlolra, but, I clicked on that one link .. and from what i can make out .. heading space is gone? that is odd .. selser should have prevented that.
16:33 <+subbu|away> worth tracking .... and reverting if it looks like a problem.
16:33 <+subbu|away> later.
[...]
16:38 <+arlolra> yup ... i will monitor
16:39 <+arlolra> https://fr.wikipedia.org/w/index.php?title=Fr%C3%A9d%C3%A9ric_Michalak&curid=239627&diff=148798760&oldid=148798254&diffmode=source
16:39 <+arlolra> https://fr.wikipedia.org/w/index.php?title=Utilisateur:Daniel_AC_Mathieu/Brouillon24&curid=11536443&diff=148798728&oldid=148798186&diffmode=source
16:39 <+arlolra> yeah, we're going to need to revert
16:41 <+arlolra> https://en.wikipedia.org/w/index.php?title=Jack_Posobiec&curid=54287201&diff=842658685&oldid=842656084&diffmode=source
16:41 <+subbu|away> Looks like headings only? arlolra maybe test in a sandbox to narrow donw the problem before reverting.
16:43 <+arlolra> why is it that new heading are added with the spacing
16:43 <+arlolra> https://en.wikipedia.org/w/index.php?title=Blinn_College&curid=1246005&diff=842658946&oldid=842657772&diffmode=source
16:44 <+arlolra> it doesn't seem like we should be adding that if we're not going to preserve it
16:46 <+arlolra> https://fr.wikipedia.org/w/index.php?title=Abraham_Van_Helsing&curid=1612328&diff=148799230&oldid=148555643&diffmode=source
16:46 <+arlolra> https://fr.wikipedia.org/w/index.php?title=Rita_Moreno&curid=968101&diff=148799144&oldid=148798292&diffmode=source
16:47 <+arlolra> https://fr.wikipedia.org/w/index.php?title=Histoire_de_Saint-Malo&curid=2233500&diff=148799086&oldid=148798323&diffmode=source
16:47 <+subbu|away> anyway, collect data for fixing and revert. i wonder if it is with fallback ids.
16:48 <+arlolra> https://en.wikipedia.org/w/index.php?title=User:LawrenceScafuri/My_Work&curid=57495649&diff=842659509&oldid=842657849&diffmode=source
16:49 <+arlolra> the problem seems much more prevalent on frwiki
16:49 <+arlolra> https://fr.wikipedia.org/w/index.php?title=Navires_de_Nemi&curid=3762401&diff=148799395&oldid=148799247&diffmode=source
16:49 <+arlolra> https://fr.wikipedia.org/w/index.php?title=LZ_129_Hindenburg&curid=125296&diff=148799310&oldid=148799261&diffmode=source
16:50 <+arlolra> yeah, this seems pretty unacceptable
16:50 <+subbu|away> yes because of fallback ids for headings.
16:50 <+arlolra> i'll revert and file a bug

Event Timeline

Arlolra triaged this task as High priority.May 23 2018, 9:04 PM

From https://en.wikipedia.beta.wmflabs.org/wiki/Transatlantic_Trade_and_Investment_Partnership

node bin/parse --pbinfile json --oldhtmlfile html --inputfile edit --oldtextfile text --selser

As expected, it is the fallback ids (!).

Orig HTML has:

<h2 id="Ссылки"><span id=".D0.A1.D1.81.D1.8B.D0.BB.D0.BA.D0.B8" typeof="mw:FallbackId"></span>Ссылки</h2>

VE's HTML has:

<h2 id="Ссылки">Ссылки</h2>

VE strips fallback spans and section wrappers before doing its edits. So, our DOM-diff should strip section wrappers and fallback spans before diffing the DOMs. This is an old bug. This didn't matter before because the headings preserved whitespace in the HTML and even non-selser normalization would wt2wt properly.

This is why I couldn't reproduce it locally till we actually dumped VE's HTML and used that for selser tests.

This also explains the mysterious html2wt perf degradation since mid-December that I noticed recently.

https://grafana.wikimedia.org/dashboard/db/parsoid-timing-html2wt?panelId=26&fullscreen&orgId=1&from=now-1y&to=now

There is a noticeable upward bump around mid-December ... and guess what ... html5 ids with fallbacks was deployed on December 12.

Btw, just want to draw attention to this comment,

16:43 <+arlolra> why is it that new heading are added with the spacing
16:43 <+arlolra> https://en.wikipedia.org/w/index.php?title=Blinn_College&curid=1246005&diff=842658946&oldid=842657772&diffmode=source
16:44 <+arlolra> it doesn't seem like we should be adding that if we're not going to preserve it

Change 434847 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] html2wt: Strip mw:FallbackId spans before DOM diffs

https://gerrit.wikimedia.org/r/434847

This also explains the mysterious html2wt perf degradation since mid-December that I noticed recently.

https://grafana.wikimedia.org/dashboard/db/parsoid-timing-html2wt?panelId=26&fullscreen&orgId=1&from=now-1y&to=now

There is a noticeable upward bump around mid-December ... and guess what ... html5 ids with fallbacks was deployed on December 12.

More accurately, this is from https://gerrit.wikimedia.org/r/#/c/395831/ on the VE side which was merged Dec 12th and was probably deployed 12th - 15th as part of the train that week.

Btw, just want to draw attention to this comment,

16:43 <+arlolra> why is it that new heading are added with the spacing
16:43 <+arlolra> https://en.wikipedia.org/w/index.php?title=Blinn_College&curid=1246005&diff=842658946&oldid=842657772&diffmode=source
16:44 <+arlolra> it doesn't seem like we should be adding that if we're not going to preserve it

No, these two things are somewhat orthogonal. (a) Whitespace trimming in wikitext -> html is wikitext spec (b) Editors / wikis want whitespace in lists, headings, tables, etc. for readability and so, new content should continue to add them.

The question however is what do we do about headings /list items / table cells that were already present in original wikitext. That is where this gets tricky. There is no good automatic solution. I'm going to file a separate ticket for it.

Btw, just want to draw attention to this comment,

16:43 <+arlolra> why is it that new heading are added with the spacing
16:43 <+arlolra> https://en.wikipedia.org/w/index.php?title=Blinn_College&curid=1246005&diff=842658946&oldid=842657772&diffmode=source
16:44 <+arlolra> it doesn't seem like we should be adding that if we're not going to preserve it

No, these two things are somewhat orthogonal. (a) Whitespace trimming in wikitext -> html is wikitext spec (b) Editors / wikis want whitespace in lists, headings, tables, etc. for readability and so, new content should continue to add them.

The question however is what do we do about headings /list items / table cells that were already present in original wikitext. That is where this gets tricky. There is no good automatic solution. I'm going to file a separate ticket for it.

Filed T195486: html2wt of existing list items, headings, table cells

Change 434847 merged by jenkins-bot:
[mediawiki/services/parsoid@master] html2wt: Strip mw:FallbackId spans before DOM diffs

https://gerrit.wikimedia.org/r/434847

ssastry renamed this task from Heading whitespace normalized outside of modified tree to Whitespace normalized outside of modified tree.May 29 2018, 7:17 PM

Change 436091 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WS-minimization heuristics only apply to comments & text nodes

https://gerrit.wikimedia.org/r/436091

Change 436091 merged by jenkins-bot:
[mediawiki/services/parsoid@master] WS-minimization heuristics only apply to comments & text nodes

https://gerrit.wikimedia.org/r/436091

Vvjjkkii renamed this task from Whitespace normalized outside of modified tree to 5ecaaaaaaa.Jul 1 2018, 1:08 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed ssastry as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii edited subscribers, added: ssastry; removed: gerritbot, Aklapper.
CommunityTechBot renamed this task from 5ecaaaaaaa to Whitespace normalized outside of modified tree.Jul 2 2018, 1:32 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to ssastry.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot edited subscribers, added: gerritbot, Aklapper; removed: ssastry.