Page MenuHomePhabricator

<code> rendering differences (visualdiff testing)
Closed, ResolvedPublicBUG REPORT

Description

The following links:

https://en.wikivoyage.org/wiki/Talk%3AIslamabad?useparsoid=0
https://en.wikivoyage.org/wiki/Talk%3AIslamabad?useparsoid=1

render differently in Parsoid and legacy:

Legacy

image.png (397×532 px, 71 KB)

Parsoid

image.png (351×538 px, 60 KB)

Related Objects

Event Timeline

This is a misnested tag - the code tag is an inline tag wrapping multiple paragraphs which are blocks (modulo HTML 5 framing). Linter appropriately identifies it as such:

Screenshot_2024-06-28-19-18-33-15_3aea4af51f236e4932235fdada7d1643.jpg (578×1 px, 144 KB)

In other words, the page is already broken and Parsoid happens to make it more obvious.

You should review lint errors on a page to see if that may be the source of a rendering diff.

On the other hand, legacy fixes that in the output, and Parsoid does weird stuff:

$ echo -e "<code>1\n\n2\n\n3\n\n4</code>" |php ./bin/parse.php
<p data-parsoid='{"dsr":[0,7,0,0]}'><code data-parsoid='{"stx":"html","autoInsertedEnd":true,"dsr":[0,7,6,0]}'>1</code></p><code data-parsoid='{"stx":"html","autoInsertedStart":true,"autoInsertedEnd":true,"dsr":[7,13,0,0]}'>

<p data-parsoid='{"dsr":[9,10,0,0]}'>2</p>

<p data-parsoid='{"dsr":[12,13,0,0]}'>3</p></code>

<p data-parsoid='{"dsr":[15,23,0,0]}'><code data-parsoid='{"stx":"html","autoInsertedStart":true,"dsr":[15,23,0,7]}'>4</code></p>

So, yeah, the input is not great, but there's still something iffy going on. If anything I'd be happier with a single <code> around all of that (because that would point in the direction of things just not being normalized), but here clearly SOMETHING happens, possibly incompletely.

It also doesn't feel like OBVIOUSLY wrong wikitext (although it does trip the linter), so it's probably worth having a look, regardless of the specific page.

Okay, fine.

At the point it enters remex, it looks like

<p><code>1</p><p>2</p><p>3</code></p>

and the behaviour is at least consistent with what happens in other places. It's not pretty, but as mentioned, inline formatting elements are not really supposed to span multiple paragraphs. Modifying either remex or what feeds it would have the potential to break A Lot of things (in particular, articles starting with a formatting tag, as is common).

I don't like it, it makes me grumpy, but it's technically bad input, and is linted as such, so I'm going to close this issue as won't solve.

ihurbain claimed this task.
ihurbain moved this task from Backlog to In Progress on the Content-Transform-Team-WIP board.

There, I spent a significant amount of time hunting down where the discrepancy comes from.

Given wikitext

a

b

what enters Remex in the legacy parser is

<p>a
</p><p>b
</p>

and what enters Remex with Parsoid is

<p>a</p>

<p>b</p>

So far, so good.

The problem is that, if these paragraphs are included in a formatting element, we have reconstructAFE kicking in at some point. With the legacy parser, it happens on the string boundary, that is just before b. With Parsoid, it also happens on the string boundary... that is, the two \n\n between the paragraphs. So the <code> is re-opened before the new paragraph opening, and this is where the discrepancies come from.

I've explored a tiny bit around "what if we don't run reconstructAFE on character runs that are just [\t\n\f\r]+ (because there was a line above checking for just that), but it was messing up the list handling. But, there still might be SOMETHING to explore around that. (right now I'm thinking "maybe I just want to avoid it specifically on \n\n", which might be enough because if there are more of them, we get some <br/> tags kicking in).

Change #1052779 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/libs/RemexHtml@master] Suppress AFE reconstruction on \n\n

https://gerrit.wikimedia.org/r/1052779

I played around a bit with that thing, and I'm arriving at a proposal looking like a variation of \n\n (PS2 on https://gerrit.wikimedia.org/r/c/mediawiki/libs/RemexHtml/+/1052779).

I considered a few other paths:

  • removing the new lines between paragraphs when generating Parsoid HTML. Requires to either modify a large number of tests OR normalize them by replacing new lines by '' rather than ' ', and this feels iffy. We also lose readability in the output, which may or may not be a concern.
  • removing all AFE on all whitespace runs. This poses problems with wt2wt around lists - we lose end tags (because they're marked as auto-inserted, and none gets re-added at the end).
  • Not really considered yet: removing AFE on all whitespace, AND find a way to fix wt2wt. Would require more investigation if we were to go that route.
  • Removing AFE only on \n\n. Feels not exceedingly satisfying, because it breaks on patterns like \n\t\n, which I would expect may happen in <code> block if they got copy-pasted from somewhere else

So right now I'm proposing: "removing all AFE on whitespace runs that contain at least 2 \n". It's not super satisfying, because it feels super hacky. Right now, this only means changing 3 parserTests in parsoid (and I think one in core, I need to check), including 2 that are "unexpected change to known failure output", and one for which the result is actually nicer in my view.

One thing that makes me hesitate is that we have the Inline HTML vs wiki block nesting test that currently has

Inline HTML vs wiki block nesting
!! wikitext
<b>Bold paragraph

New wiki paragraph
!! html/php
<p><b>Bold paragraph
</b></p><p><b>New wiki paragraph
</b></p>
!! html/parsoid
<p><b>Bold paragraph</b>
</p><p>New wiki paragraph
</p>
!! end

and this would not be consistent with the desired output here, which doesn't have a <b> tag on the second paragraph. (It is, however, consistent with legacy; and this test is already failing by adding <b> tags in there, just slightly differently.)

Mmmh. Easier, probably: exclude all whitespace unless it's a single \n. This way we avoid touching lists & tables in "less-corner-case" cases, and it feels less hacky/easier to justify. (and it seems to be passing parserTests as well.)

what enters Remex in the legacy parser is

<p>a
</p><p>b
</p>

and what enters Remex with Parsoid is

<p>a</p>

<p>b</p>

So far, so good.

Can we change the Parsoid output to be closer to the legacy output. Have one of the newlines inside the paragraph?

<p>a
</p>
<p>b
</p>

Does that make any difference to the AFE opening? It might mess with roundtripping as the wikitext serializer currently sits.

Change #1118845 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] [DNM] playing around with token order before remex

https://gerrit.wikimedia.org/r/1118845

@ABreault-WMF the above patch feels like an EXCESSIVELY cheeky way of doing that, buuuuuuuut it might not be *completely* preposterous? I kind of want your opinion before looking into cleaning it up/making it less preposterous :)

Change #1052779 abandoned by Isabelle Hurbain-Palatin:

[mediawiki/libs/RemexHtml@master] Allow to not reconstruct AFE on only-space content

Reason:

I7665fee2b25ceff693427771ed1ef580bf8e2965 looks more promising

https://gerrit.wikimedia.org/r/1052779

Change #1122548 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/Cite@master] Temporarily disable a couple of parserTests

https://gerrit.wikimedia.org/r/1122548

Change #1122642 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/Cite@master] Re-enabled fixed Cite tests

https://gerrit.wikimedia.org/r/1122642

Change #1122548 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Temporarily disable a couple of parserTests

https://gerrit.wikimedia.org/r/1122548

Change #1118845 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Process white space tokens after <p> tags out of order in Remex

https://gerrit.wikimedia.org/r/1118845

Change #1124186 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump wikimedia/parsoid to V0.21.0-a19

https://gerrit.wikimedia.org/r/1124186

Change #1124186 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.21.0-a19

https://gerrit.wikimedia.org/r/1124186

Change #1125493 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Process white space tokens after <p> tags out of order in Remex

https://gerrit.wikimedia.org/r/1125493

Change #1125493 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Process new line tokens after <p> tags out of order in Remex

https://gerrit.wikimedia.org/r/1125493

Change #1132724 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.21.0-a23

https://gerrit.wikimedia.org/r/1132724

Change #1132724 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.21.0-a23

https://gerrit.wikimedia.org/r/1132724

Change #1122642 merged by jenkins-bot:

[mediawiki/extensions/Cite@master] Re-enabled fixed Cite tests

https://gerrit.wikimedia.org/r/1122642