Page MenuHomePhabricator

Linter does not detect misc-tidy-replacement-issues when div is wrapped in bold markup
Open, LowPublicBUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

I have included the original markup found in an article, and simplified markup to reduce the case to a minimal form.

What happens?:
When Template:Nowrap is wrapped around Template:Unbulleted list, a misc-tidy-replacement-issues "span wrapping a div" error is properly detected.

When the unbulleted list is wrapped in bold markup (three ' characters) and then surrounded by Template:Nowrap, no Linter misc-tidy-replacement-issues is detected.

What should have happened instead?:
A Linter error should be detected in the latter case.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:

Event Timeline

Not just bold markup with three apostrophes, also bold markup with b tags, also italics with two apostrophes or i tags, or small with small tags, and I would assume just about any inline tags that open and close inside the span tags and around the div tags. See https://en.wikipedia.org/w/index.php?title=User:Anomalocaris/sandbox/Lint_Test&oldid=1104580787

Arlolra triaged this task as Medium priority.Aug 23 2023, 10:14 PM
Arlolra moved this task from Backlog to Parsoid on the MediaWiki-extensions-Linter board.
Arlolra subscribed.

Similar to the discussion at T294720#8670929, we need to show that tidy would have done this migration in the presence of a bold tag, since we're trying to detect a rendering difference, post switching to remex, that could result in a visual regression.

The code in the linter is pretty naively looking if the first child in the span is a div,

		$fc = DiffDOMUtils::firstNonSepChild( $node );
		if ( !$fc instanceof Element || DOMCompat::nodeName( $fc ) !== 'div' ) {
			return;
		}

https://github.com/wikimedia/mediawiki-services-parsoid/blob/master/src/Wt2Html/PP/Processors/Linter.php#L869-L872

and the commit adding it doesn't provide much info,
https://github.com/wikimedia/mediawiki-services-parsoid/commit/14bc371505dd1812650ac22538a5fb4ea0281398

Using a version of tidy-html5 (which, at the time of the switch, tidy was html4 based), we can see,

> echo "<span class='nowrap'><div>foo</div></span>" | tidy

line 1 column 1 - Warning: missing </span> before <div>
line 1 column 27 - Warning: inserting implicit <span>
line 1 column 36 - Warning: discarding unexpected </span>
line 1 column 1 - Warning: trimming empty <span>

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div><span class='nowrap'>foo</span></div>
</body>
</html>

that it probably would have done the same,

echo "<span class='nowrap'><b><div>foo</div></b></span>" | tidy

line 1 column 22 - Warning: missing </b> before <div>
line 1 column 1 - Warning: missing </span> before <div>
line 1 column 30 - Warning: inserting implicit <span>
line 1 column 30 - Warning: inserting implicit <b>
line 1 column 39 - Warning: discarding unexpected </b>
line 1 column 43 - Warning: discarding unexpected </span>
line 1 column 22 - Warning: trimming empty <b>
line 1 column 1 - Warning: trimming empty <span>

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div><span class='nowrap'><b>foo</b></span></div>
</body>
</html>

Change 952015 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Lint tidyDivSpanFlip when div is nested

https://gerrit.wikimedia.org/r/952015

This is something I had once asked back in the day after Tidy had been replaced but didn't act on it - I think we should retire some of these linter issues. These existed to give editors a chance to fix up content that might have changed rendering. Given that Tidy is long gone, the rendering we see is what exists and if someone wants to change the appearance, they can do so without needing a linter category.

My proposal is to retire this and also examine all the other tidy-centric issues after examining which of those still have value in a post-Tidy HTML5 world.

This is something I had once asked back in the day after Tidy had been replaced but didn't act on it - I think we should retire some of these linter issues. These existed to give editors a chance to fix up content that might have changed rendering. Given that Tidy is long gone, the rendering we see is what exists and if someone wants to change the appearance, they can do so without needing a linter category.

My proposal is to retire this and also examine all the other tidy-centric issues after examining which of those still have value in a post-Tidy HTML5 world.

This idea merits a widely advertised discussion. I have many thoughts.

This is something I had once asked back in the day after Tidy had been replaced but didn't act on it - I think we should retire some of these linter issues. These existed to give editors a chance to fix up content that might have changed rendering. Given that Tidy is long gone, the rendering we see is what exists and if someone wants to change the appearance, they can do so without needing a linter category.

My proposal is to retire this and also examine all the other tidy-centric issues after examining which of those still have value in a post-Tidy HTML5 world.

Identifying violations of the HTML spec (whether the parsing spec or the wider domain spec) are value add to me and divSpanFlip is one such in at least the latter category (it being invalid to have a span wrapping a div). They inevitably indicate badly-authored documents and should be corrected. I think I've mentioned elsewhere that I'd like to see the linter expand to identifying failures to respect the domain model in HTML 5 (but skimming there is no task for it).

There isn't really another existing alternative to that work, and this specific linter already exists, so...

Happy to have another task to discuss it generally if you want.

(While looking around for tasks, T200562: Consider not outputting content model violations to help editing clients showed up where ironically I had almost said the exact same thing regarding it and just never posted it. So maybe I haven't mentioned it....)

(cc @MSantos for visiblity)

This idea merits a widely advertised discussion. I have many thoughts.

One of us can probably create a wiki page (or a phab task as appropriate) one of these days.

This is something I had once asked back in the day after Tidy had been replaced but didn't act on it - I think we should retire some of these linter issues. These existed to give editors a chance to fix up content that might have changed rendering. Given that Tidy is long gone, the rendering we see is what exists and if someone wants to change the appearance, they can do so without needing a linter category.

My proposal is to retire this and also examine all the other tidy-centric issues after examining which of those still have value in a post-Tidy HTML5 world.

Identifying violations of the HTML spec (whether the parsing spec or the wider domain spec) are value add to me and divSpanFlip is one such in at least the latter category (it being invalid to have a span wrapping a div). They inevitably indicate badly-authored documents and should be corrected. I think I've mentioned elsewhere that I'd like to see the linter expand to identifying failures to respect the domain model in HTML 5 (but skimming there is no task for it).

There isn't really another existing alternative to that work, and this specific linter already exists, so...

I think this is not yet a well-understood goal. See this and this. That second wiki page is probably a good place to start documenting ideas and thoughts around HTML5 compliance -- so please feel free to edit and add notes and arguments there. Not all HTML5 spec violations are created equal since the HTML5 tree builder (parsing) algorithm exists to accepts the vast variety of HTML out on the web and assign consistent semantics to it and have them render identically across browsers. As such, there is a well-accepted and well-understood model of how to handle HTML5 spec violations. So, the ones that matter most are those that lead to serious rendering differences (than intended) vs. cluttering the error logs with all kinds of notices that have no significant ramifications.

As for HTML5 output compliance, there are at least two different approaches; (a) Parsoid automatically fixes up the output to ensure the output is HTML5 spec compliant. In this approach, there is no need for the large body of lint errors (except where there may be some value to flagging something where the final rendering might not match intended rendering) (b) Parsoid doesn't automatically fix up output -- but flags content model violations for editors to fix up.

Before we decide on either (a) or (b), we first need to arrive at a decision as to how important it is to arrive at HTML5 output compliance. But, if we decide that there is merit to it, my vote would be for some version of (a) and not burden the large body of editors with the need to understand the intricacies of the HTML5 spec and only focus on template and module authors.

Anyway, my comment was not just about *this* specific lint, but also to look at all the tidy-specific lints and see which of them have value in a post-tidy world. Now that Tidy is gone, it might be a good time to revisit what our collective goals and aspirations are for a Linting tool and update that page. I think we should separate lints related to HTML5 compliance from lints that are related to bad (native) wikitext markup (bogus image options, unclosed quotes in headings, etc.). And in the former, things like fostered content, links in links are valuable no matter what. I think we could reorganize lints differently from high/medium/low priority (those priorities made more sense while replacing Tidy).

Anyway, I am throwing a bunch of ideas and thoughts for discussion. I know @Legoktm also wrote up something in the context of writing bots to fix lint errors. There is no urgent rush to figure this all out tomorrow, but it is probably time for some reset for this functionality so it continues to be valuable.

Change 952015 abandoned by Arlolra:

[mediawiki/services/parsoid@master] Lint tidyDivSpanFlip when div is nested

Reason:

Isn't consensus that we should do this

https://gerrit.wikimedia.org/r/952015

Arlolra lowered the priority of this task from Medium to Low.

The above patch fixes this issue but since there isn't consensus, I'm abandoning it for now. It can be restored if it's desired.

One way to start the discussion would be to create a wiki page listing all of the current Linter errors. For each one, figure out whether it can cause real rendering problems on pages (e.g. link-in-link errors cause real display problems every time; some unclosed tags cause all subsequent text to be smaller or cause subsequent sections to be nested inside an unclosed div). For obsolete tags, figure out whether support for obsolete tags is going to be removed. And so on.

For errors that cause real rendering problems, the Linter tracking should continue to exist, just as we have error tracking categories in templates so that editors can fix real errors. Otherwise, I can't think of a way to find out which pages have rendering problems.

A sidebar: We might also consider whether to convert Linter error tracking into hidden categories rather than using a one-off system of list pages. There is another ticket around here for that.