Page MenuHomePhabricator

Changing indentation of a comment which already included broken HTML causes grossly inconsistent page results
Closed, DeclinedPublic

Description

This first example, with three colons, works fine:

<small>This text is small

:::</small>This text is normal

The rest of the page is fine.

This second example, with four colons, breaks the entire page:

<small>This text is small

::::</small>This text is small, but it shouldn't be.

This text, and the entire page, is incorrectly rendered as small.

I ran into this issue when viewing EnWiki's WP:Administrators's_noticeboard/Archive267. This bizarre result appears to be relatively recent breakage in the parser.

Event Timeline

Alsee created this task.Jul 8 2018, 2:02 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 8 2018, 2:02 PM
Aklapper renamed this task from Four colons breaks markup, spilling it across the entire page to Four colons at beginning of line breaks markup, spilling it across the entire page.Jul 8 2018, 2:11 PM
Alsee renamed this task from Four colons at beginning of line breaks markup, spilling it across the entire page to Four or more colons at beginning of line breaks markup, spilling it across the entire page.Jul 8 2018, 2:12 PM
Izno closed this task as Declined.Aug 17 2018, 12:00 AM
Izno added a subscriber: Izno.

This is a MediaWiki-extensions-Linter error exposed by the switch to RemexHtml . It is deliberate.

Alsee added a comment.EditedAug 19 2018, 6:40 PM

@Izno switching to Remex was deliberate, however I don't think it could be deliberate to have radically different, confusing, and broken differences in behavior between three colons and four colons. Independent of any lint issues, is there any reason that this bizarre behavior should not be fixed?

The lint issue exposed a clear bug in Remex or the parser or somewhere.

Izno reopened this task as Open.Aug 19 2018, 7:12 PM
Izno edited projects, added RemexHtml; removed Regression.

Well. I don't know what's going on after testing either. Firefox Quantum indeed displays only the ones with 4 or more as small. It might be that the quantity of >3 is in the rendering algorithm and it's a browser thing rather than a wiki thing? All of the garbage is in the HTML as expected. What browsers have you used to test this?

There's also some really weird HTML being inserted--random <br /> that are not in the source wikitext.

<p><small>content
</small></p><small></small><dl><small></small><dd><small></small> more content</dd></dl>
<p><br />
And some later content
</p><p><br />
</p><p><small>content
</small></p><small></small><dl><small></small><dd><small></small><dl><small></small><dd><small></small> more content</dd></dl></dd></dl>
<p>And some more later content
</p><p><br />
</p><p><br />
<small>content
</small></p><small></small><dl><small></small><dd><small></small><dl><small></small><dd><small></small><dl><small></small><dd><small></small> more content</dd></dl></dd></dl></dd></dl>
<p><br />
And some later content
</p><p><br />
</p><p><small>content
</small></p><small></small><dl><small></small><dd><small></small><dl><small></small><dd><small></small><dl><small></small><dd><small></small><dl><small></small><dd><small> more content</small></dd></dl></dd></dl></dd></dl></dd></dl><small><p><br />
And some more later content
</p><p><br />
</p><p><br />
<small>content
</small></p></small><small><small></small><dl><small></small><dd><small></small><dl><small></small><dd><small></small><dl><small></small><dd><small></small><dl><small></small><dd><small><dl><dd> more content</dd></dl></small></dd></dl></dd></dl></dd></dl></dd></dl></small><small><small><p><br />
And some more later content
</p></small></small><small><small>
Alsee added a comment.Aug 20 2018, 8:15 AM

The problem also appears in Chrome 67.0.3396.99.

For anyone else to examine the problem, it occurs in this page version. The bottom half of the page is <small> text.

In particular it's triggered at:

:::::<small>For over ten years; but he does not boast about it...
:::::::</small>Golbez is far too valuable...

The issue is mysteriously dependent on the number of colons on the second line.

Arlolra added a subscriber: Arlolra.EditedAug 20 2018, 10:03 PM

There's also some really weird HTML being inserted--random <br /> that are not in the source wikitext.

It's not random, it's expected based multiple newlines in the source (at least from inspecting the revision linked to).

See https://github.com/wikimedia/mediawiki/blob/master/includes/parser/BlockLevelPass.php#L362-L376

is there any reason that this bizarre behavior should not be fixed?

The output of the four colon example from the parser, prior to Remex, is

<div class="mw-parser-output"><p><small>This text is small
</p>
<dl><dd><dl><dd><dl><dd><dl><dd></small>This text is small, but it shouldn't be.</dd></dl></dd></dl></dd></dl></dd></dl>
<p>This text, and the entire page, is incorrectly rendered as small.
</p></div>

If you put that in an html file and open it in a browser following the html5 tree-building algorithm, you'll see that it matches the output from Remex. From that perspective, this is expected. I'm not sure why things are specified as such, we can dig into it, but this at least doesn't appear to be an implementation divergence.

The source of the problem is that the closing of the formatting tag is nested in the definition list. I appreciate that it's unexpected that this wouldn't work after three nestings, but I'm not sure there's a compelling reason to make efforts support it.

Izno added a comment.Aug 21 2018, 12:19 AM

It's not random, it's expected based multiple newlines in the source (at least from inspecting the revision linked to).
See https://github.com/wikimedia/mediawiki/blob/master/includes/parser/BlockLevelPass.php#L362-L376

Oh, huh. I didn't realize that extra whitespace I've seen elsewhere was generated by breaks rather than empty paragraphs. One mystery demystified.

ssastry closed this task as Declined.Aug 21 2018, 4:42 AM
ssastry added a subscriber: ssastry.

This usage

<small>text
::::</small> more text

has misnested html tags. Please close the small tag correctly.

<small>text</small>
:::: more text

With the broken wikitext, the fixup is determined by the HTML5 spec (which browsers replicate as Arlola noted above). There is nothing to fix here in MediaWiki as far as I can tell.

Alsee renamed this task from Four or more colons at beginning of line breaks markup, spilling it across the entire page to Changing comment indentation level causes grossly inconsistent page results.Aug 21 2018, 9:28 PM

@ssastry, I've re-titled the bug to try and focus on the actual bug. Please don't get distracted by the tag details. That is merely the known example-case which exposes this new and bad parser behavior.

User story: Alice is a casual but productive editor. Wikitext seemed a bit scary at first, but she happily discovered that wikitext is not HTML. Wikitext is easier and better than HTML. She doesn't have to be a programmer to use it, and she doesn't have to know crazy HTML rules about nesting and stuff. She is is looking at a talk page and decides that the indentation on a comment needs to be changed. She adds or removes a colon on one or more comments. Because this is such a trivial edit, she doesn't need to preview before saving. However due to parser bug T199057, the edit breaks the entire page. Bob, Carol, and David are all expert editors or even admins. They all come across the broken page and they're digging through the edit history trying to figure how the page got broken. They are all extremely confused and waste a lot of time because they can't find any apparent cause for the page to go from "fine" to "completely broken".

Changing the indentation of a comment isn't supposed to break a page.

User story: Alice is a casual but productive editor. Wikitext seemed a bit scary at first, but she happily discovered that wikitext is not HTML. Wikitext is easier and better than HTML. She doesn't have to be a programmer to use it, and she doesn't have to know crazy HTML rules about nesting and stuff.

Except wikitext-of-2018 = simple-native-wikitext-markup-of-200x + HTML5-2018. So, if HTML tags get used, it is unfortunately no longer that simple. Going forward, linting tools will help reduce some of this complexity if we start adding more cleanup categories to flag scenarios like this that confuse Alice. It is upto more experienced editors to clean up some of the markup to eliminate surprising effects like this (if necessary with bots).

Changing the indentation of a comment isn't supposed to break a page.

Understood. But, this is standard HTML5 behavior that got triggered by an edge case. The fix in this case would be to remove the offending </small> tag from the indented comment. With Tidy, there were some class of weird edge case behaviors. With RemexHtml, there will be a different class of weird edge case behaviors. RemexHtml is the preferred provider of weirdness because it matches the weirdness experience provided by browsers (which we believe will aid editors when they are debugging their markup). There is no perfect fix unfortunately because we cannot eliminate edge cases that arise because of bad markup.

The longer-term strategy is to improve wikitext as a markup language. That is a longer project that we have plans to make careful incremental progress on, but that is beyond the scope of this ticket.

Aklapper renamed this task from Changing comment indentation level causes grossly inconsistent page results to Changing indentation of a comment which already included broken HTML causes grossly inconsistent page results.Aug 22 2018, 8:59 AM
Alsee added a comment.EditedAug 22 2018, 3:56 PM

! In T199057#4521165, @ssastry wrote:
this is standard HTML5 behavior

That's a bug. HTML behavior is obviously wrong and undesirable on a wiki. Consider this example:

<b>Bold list
* bold item one
* bold item two</b>

If that were HTML, it would be broken. But it's not HTML, it's wikitext. Editors are not HTML programmers, they do not view it as a DOM tree, they view it in a linear fashion with a start-bold marker and an end-bold marker, and they expect everything in between to be bold. Expecting editors to be HTML programmers would drive off users. The intent of the markup is obvious and wikitext does the right thing. Wikitext's purpose is to allow non-coders to write arbitrary markup and, insofar as possible, generate the required HTML to provide the intended intuitive result. This is why Wikipedia was successful, this is why editors like wikitext.

I know there are complicated issues and there will be difficult corner cases. But this is a new bug, and I do not believe anyone considers the result here desirable. Somebody broke wikitext. Changing the indentation on a comment clearly should not mangle a page.

! In T199057#4521165, @ssastry wrote:
this is standard HTML5 behavior

That's a bug. HTML behavior is obviously wrong and undesirable on a wiki.

We're getting into off-topic territory here, but I think the time has passed on having wikitext not use HTML semantics. There's a really good write up at https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/FAQ#Why_are_you_replacing_it?_And_with_what? on the why this needed to happen. Yes, wikitext was changed, but in the long run it'll be for the better, even if some individual edge cases will regress.

Alsee added a comment.EditedAug 22 2018, 6:31 PM

@Legoktm the WMF has concluded that it would be harmful to apply HTLM semantics to a tag wrapping a list. So instead of wandering off topic into generalisms, I'm happy to focus on the bug at hand.

Do you have any input on the issue of the new page-breaking inconsistency between handling of comments which are three-indented and four-indented? Do you consider this desirable behavior?

P.S. Can anyone explain why a 4th colon breaks the page?

matmarex added a subscriber: matmarex.EditedAug 22 2018, 7:01 PM

This behavior appears to be the result of the HTML5 "adoption agency algorithm", implemented by Remex: https://www.w3.org/TR/html5/syntax.html#adoption-agency-algorithm (large page).

Specifically, this part:

If outer loop counter is greater than or equal to eight, then abort these steps.

Four colons :::: generate eight tags <dl><dd><dl><dd><dl><dd><dl><dd>.

The behavior is the same with e.g. four stars **** to generate <ul><li><ul><li><ul><li><ul><li>, or just straight up eight divs <div><div><div><div><div><div><div><div>.

the WMF has concluded that it would be harmful to apply HTLM semantics to a tag wrapping a list.

Where?

Izno added a comment.Aug 22 2018, 8:11 PM

This behavior appears to be the result of the HTML5 "adoption agency algorithm", implemented by Remex: https://www.w3.org/TR/html5/syntax.html#adoption-agency-algorithm (large page). Specifically, this part:

If outer loop counter is greater than or equal to eight, then abort these steps.

Four colons :::: generate eight tags <dl><dd><dl><dd><dl><dd><dl><dd>. The behavior is the same with e.g. four stars **** to generate <ul><li><ul><li><ul><li><ul><li>, or just straight up eight divs <div><div><div><div><div><div><div><div>.

This is the specific post I was looking for. I think this bug is appropriately closed invalid or declined as it is now, since it is specified in the algorithm.