The following links:
https://en.wikivoyage.org/wiki/Talk%3AIslamabad?useparsoid=0
https://en.wikivoyage.org/wiki/Talk%3AIslamabad?useparsoid=1
render differently in Parsoid and legacy:
Legacy
Parsoid
| ihurbain | |
| Jun 28 2024, 12:11 PM |
| F55974952: Screenshot_2024-06-28-19-18-33-15_3aea4af51f236e4932235fdada7d1643.jpg | |
| Jun 28 2024, 5:22 PM |
| F55963108: image.png | |
| Jun 28 2024, 12:11 PM |
| F55963120: image.png | |
| Jun 28 2024, 12:11 PM |
The following links:
https://en.wikivoyage.org/wiki/Talk%3AIslamabad?useparsoid=0
https://en.wikivoyage.org/wiki/Talk%3AIslamabad?useparsoid=1
render differently in Parsoid and legacy:
Legacy
Parsoid
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Open | None | T378089 [EPIC] Wikivoyages rollout followup work | |||
| Resolved | BUG REPORT | ihurbain | T368720 <code> rendering differences (visualdiff testing) |
This is a misnested tag - the code tag is an inline tag wrapping multiple paragraphs which are blocks (modulo HTML 5 framing). Linter appropriately identifies it as such:
In other words, the page is already broken and Parsoid happens to make it more obvious.
You should review lint errors on a page to see if that may be the source of a rendering diff.
On the other hand, legacy fixes that in the output, and Parsoid does weird stuff:
$ echo -e "<code>1\n\n2\n\n3\n\n4</code>" |php ./bin/parse.php <p data-parsoid='{"dsr":[0,7,0,0]}'><code data-parsoid='{"stx":"html","autoInsertedEnd":true,"dsr":[0,7,6,0]}'>1</code></p><code data-parsoid='{"stx":"html","autoInsertedStart":true,"autoInsertedEnd":true,"dsr":[7,13,0,0]}'> <p data-parsoid='{"dsr":[9,10,0,0]}'>2</p> <p data-parsoid='{"dsr":[12,13,0,0]}'>3</p></code> <p data-parsoid='{"dsr":[15,23,0,0]}'><code data-parsoid='{"stx":"html","autoInsertedStart":true,"dsr":[15,23,0,7]}'>4</code></p>
So, yeah, the input is not great, but there's still something iffy going on. If anything I'd be happier with a single <code> around all of that (because that would point in the direction of things just not being normalized), but here clearly SOMETHING happens, possibly incompletely.
It also doesn't feel like OBVIOUSLY wrong wikitext (although it does trip the linter), so it's probably worth having a look, regardless of the specific page.
Okay, fine.
At the point it enters remex, it looks like
<p><code>1</p><p>2</p><p>3</code></p>
and the behaviour is at least consistent with what happens in other places. It's not pretty, but as mentioned, inline formatting elements are not really supposed to span multiple paragraphs. Modifying either remex or what feeds it would have the potential to break A Lot of things (in particular, articles starting with a formatting tag, as is common).
I don't like it, it makes me grumpy, but it's technically bad input, and is linted as such, so I'm going to close this issue as won't solve.
There, I spent a significant amount of time hunting down where the discrepancy comes from.
Given wikitext
a b
what enters Remex in the legacy parser is
<p>a </p><p>b </p>
and what enters Remex with Parsoid is
<p>a</p> <p>b</p>
So far, so good.
The problem is that, if these paragraphs are included in a formatting element, we have reconstructAFE kicking in at some point. With the legacy parser, it happens on the string boundary, that is just before b. With Parsoid, it also happens on the string boundary... that is, the two \n\n between the paragraphs. So the <code> is re-opened before the new paragraph opening, and this is where the discrepancies come from.
I've explored a tiny bit around "what if we don't run reconstructAFE on character runs that are just [\t\n\f\r]+ (because there was a line above checking for just that), but it was messing up the list handling. But, there still might be SOMETHING to explore around that. (right now I'm thinking "maybe I just want to avoid it specifically on \n\n", which might be enough because if there are more of them, we get some <br/> tags kicking in).
Change #1052779 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):
[mediawiki/libs/RemexHtml@master] Suppress AFE reconstruction on \n\n
I played around a bit with that thing, and I'm arriving at a proposal looking like a variation of \n\n (PS2 on https://gerrit.wikimedia.org/r/c/mediawiki/libs/RemexHtml/+/1052779).
I considered a few other paths:
So right now I'm proposing: "removing all AFE on whitespace runs that contain at least 2 \n". It's not super satisfying, because it feels super hacky. Right now, this only means changing 3 parserTests in parsoid (and I think one in core, I need to check), including 2 that are "unexpected change to known failure output", and one for which the result is actually nicer in my view.
One thing that makes me hesitate is that we have the Inline HTML vs wiki block nesting test that currently has
Inline HTML vs wiki block nesting !! wikitext <b>Bold paragraph New wiki paragraph !! html/php <p><b>Bold paragraph </b></p><p><b>New wiki paragraph </b></p> !! html/parsoid <p><b>Bold paragraph</b> </p><p>New wiki paragraph </p> !! end
and this would not be consistent with the desired output here, which doesn't have a <b> tag on the second paragraph. (It is, however, consistent with legacy; and this test is already failing by adding <b> tags in there, just slightly differently.)
Mmmh. Easier, probably: exclude all whitespace unless it's a single \n. This way we avoid touching lists & tables in "less-corner-case" cases, and it feels less hacky/easier to justify. (and it seems to be passing parserTests as well.)
@Arlolra found another instance of this pattern using <small> in visual diff testing of hewikivoyage:
https://he.wikivoyage.org/w/index.php?title=%D7%A9%D7%99%D7%97%D7%95%D7%A0%D7%99%D7%9D&oldid=259135
Can we change the Parsoid output to be closer to the legacy output. Have one of the newlines inside the paragraph?
<p>a </p> <p>b </p>
Does that make any difference to the AFE opening? It might mess with roundtripping as the wikitext serializer currently sits.
Change #1118845 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):
[mediawiki/services/parsoid@master] [DNM] playing around with token order before remex
@ABreault-WMF the above patch feels like an EXCESSIVELY cheeky way of doing that, buuuuuuuut it might not be *completely* preposterous? I kind of want your opinion before looking into cleaning it up/making it less preposterous :)
Change #1052779 abandoned by Isabelle Hurbain-Palatin:
[mediawiki/libs/RemexHtml@master] Allow to not reconstruct AFE on only-space content
Reason:
I7665fee2b25ceff693427771ed1ef580bf8e2965 looks more promising
Change #1122548 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):
[mediawiki/extensions/Cite@master] Temporarily disable a couple of parserTests
Change #1122642 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):
[mediawiki/extensions/Cite@master] Re-enabled fixed Cite tests
Change #1122548 merged by jenkins-bot:
[mediawiki/extensions/Cite@master] Temporarily disable a couple of parserTests
Change #1118845 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Process white space tokens after <p> tags out of order in Remex
Change #1124186 had a related patch set uploaded (by Arlolra; author: Arlolra):
[mediawiki/vendor@master] Bump wikimedia/parsoid to V0.21.0-a19
Change #1124186 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.21.0-a19
Change #1125493 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):
[mediawiki/services/parsoid@master] Process white space tokens after <p> tags out of order in Remex
Change #1125493 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Process new line tokens after <p> tags out of order in Remex
Change #1132724 had a related patch set uploaded (by Arlolra; author: Arlolra):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.21.0-a23
Change #1132724 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.21.0-a23
Change #1122642 merged by jenkins-bot:
[mediawiki/extensions/Cite@master] Re-enabled fixed Cite tests