Deprecate nonstandard behavior of self-closed HTML tags in wikitext.
Closed, Resolved, Public

Description

The HTML5 standard says that the XML-ish self-closed tag syntax <TAGNAME/> (note the trailing slash) is ignored: tags are "self-closed" iff the tag name matches a list of "void tags". The only valid void HTML tags are area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, wbr.

As such, <b/> and <span/> are treated exactly the same as <b> and <span> in an HTML5 parser. But the situation is a bit complicated in MediaWiki.

  • Without Tidy turned on, the Sanitizer mostly enforced this constraint but rewrote <b/> as &lt;b/>. That isn't strictly according to the HTML5 spec (which would rewrite it as <b>) but does get the point across that this is invalid HTML syntax.
  • When Tidy is enabled, Tidy replaces <b/> with nothing; that is, it removes the invalid tag from the output. This has led to its (ab)use as a way to protect leading/trailing whitespace and punctuation in templates. However, there are alternative ways to do this, including <nowiki/> and &#32;, which don't violate the HTML5 parsing rules.
  • However, when we replace Tidy with an HTML5 parser (see T89331), MediaWiki will start enforcing the HTML5 standard and parse <b/> and <span/> as start tags, which can break rendering on pages that (deliberately or accidentally) rely on Tidy removing these tags.
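The three behaviours described in the list above can be compared side by side with a small simulation. This is only a rough sketch, not MediaWiki, Tidy, or any real HTML5 parser: the function name `render_self_closed`, the mode names, the regular expression, and the short `EXTENSION_TAGS` set are assumptions made for illustration.

```python
import re

# Void tags as listed in the task description.
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
})

# Illustrative subset of extension tags (assumption; the real set is wiki-specific).
EXTENSION_TAGS = frozenset({"ref", "references", "nowiki"})

# Group 1 is the tag name, group 2 the attribute text (possibly empty).
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*?)?)\s*/>")


def render_self_closed(wikitext: str, mode: str) -> str:
    """Mimic how each pipeline treats non-void self-closed HTML tags."""
    def repl(match: re.Match) -> str:
        name = match.group(1).lower()
        if name in VOID_TAGS or name in EXTENSION_TAGS:
            return match.group(0)          # valid form, left untouched
        if mode == "sanitizer":            # no Tidy: the tag is escaped
            return "&lt;" + match.group(0)[1:]
        if mode == "tidy":                 # Tidy: the tag is dropped
            return ""
        if mode == "html5":                # HTML5: the slash is ignored
            return f"<{match.group(1)}{match.group(2)}>"
        raise ValueError(f"unknown mode: {mode}")

    return SELF_CLOSED.sub(repl, wikitext)
```

For example, `render_self_closed("a<b/>c", "tidy")` yields `ac`, while the `"html5"` mode yields `a<b>c`, which leaves an unclosed bold element open for the rest of the text.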

In order to facilitate a smooth migration away from Tidy, we are deprecating the use of non-void self-closed HTML tags (so, to repeat, area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, wbr can be written in self-closed tag form and need not be changed). Additionally, we have started tagging pages using this invalid form with the [[Category:Pages using invalid self-closed HTML tags]] tracking category. Once pages which use this construct are cleaned up, we'll change both the "tidy" and the "no tidy" case to be consistent with the HTML5 parsing standard; that is, <b/> will be transformed into <b>.

Additionally, registered extension tags aren't subject to this consideration since they aren't HTML5 tags. So, for example, <ref /> and <references /> can continue to be used.
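For editors who want to check a page or template by hand, the rule above can be approximated with a short script. This is a minimal sketch under stated assumptions, not how the tracking category is actually implemented: `find_invalid_self_closed` is a hypothetical helper, the regular expression does not understand comments or nowiki sections, and `EXTENSION_TAGS` is only a sample of the extension tags a given wiki may have registered.

```python
import re

# Void tags that may legitimately be written in self-closed form.
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
})

# Sample of registered extension tags (assumption; varies per wiki).
EXTENSION_TAGS = frozenset({"ref", "references"})

# Matches <name/> or <name attrs />; group 1 is the tag name.
SELF_CLOSED_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9-]*)(?:\s[^<>]*)?/>")


def find_invalid_self_closed(wikitext: str) -> list[tuple[str, int]]:
    """Return (tag name, character offset) for each deprecated self-closed tag."""
    hits = []
    for match in SELF_CLOSED_TAG.finditer(wikitext):
        name = match.group(1).lower()
        if name not in VOID_TAGS and name not in EXTENSION_TAGS:
            hits.append((name, match.start()))
    return hits
```

Running it on `'x<b/> <br/> <span id="a" /> <ref name="n" />'` reports only the `b` and `span` tags; the `br` and `ref` tags are left alone.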

See also: https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/FAQ#Simplified_instructions_for_fixing_pages

Event Timeline

There are a very large number of changes, so older changes are hidden.
NicoV added a comment. Jul 19 2016, 3:14 PM

@NicoV Besides I don't think it's effectively doable to show whether the issue is on the current page or in any transcluded template (mind the transclusion can be several levels deep too), when you hit Preview you already see the tracking category.

@Danny_B I was talking about [[Category:Pages using duplicate arguments in template calls]] because that's exactly what is already done for that tracking category: it tells you whether the issue is on the current page or in a transcluded template (and it includes the levels when there are several levels). It also tells you which argument is duplicated. So it is doable. I don't know how it was done, but it was done and it is very helpful.

So doing it here in the same way would also be very helpful: current page or transcluded template (with levels), plus information about the problem itself (the tag name, for example, would be very helpful to narrow the search).

I agree that you do see the tracking category when you hit Preview, but for some articles that doesn't help much in finding where the problem is when the issue is not trivial. For example, I'm currently trying to fix the articles on frwiki, and I've encountered a few pages where I was unable to find the cause of the categorization:
https://fr.wikipedia.org/wiki/Insurrection_de_Boko_Haram
https://fr.wikipedia.org/wiki/Hautes_Tatras
https://fr.wikipedia.org/wiki/Discussion_Portail:Aur%C3%A8s

Hautes Tatras fixed by https://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Panorama_annot%C3%A9_Hautes_Tatras&diff=prev&oldid=127982911

It's easy - just open all linked templates and look for improper selfclosing tags...

jrbs added a subscriber: jrbs. Jul 25 2016, 10:08 PM
Elitre added a subscriber: Elitre. Aug 2 2016, 9:54 AM
Arbnos removed a subscriber: Arbnos. Aug 3 2016, 10:34 AM
JJMC89 added a subscriber: JJMC89. Sep 22 2016, 10:07 PM
Jonesey95 added a subscriber: Jonesey95. Edited Oct 8 2016, 5:37 PM

Is this supposed to check for self-closed instances of every tag listed on lines 376-381 of https://gerrit.wikimedia.org/r/#/c/286928/11/includes/Sanitizer.php ?

If so, I think it may be missing at least one. See this page for an example of a "pre" tag that is self-closed, but to which the error category has not been applied:

https://en.wikipedia.org/w/index.php?title=User:Jonesey95/sandbox2&oldid=743233850

[edited to add:] Can someone please point us to a complete list of tags that can be listed on the category page? Thanks.

See T134423#2301066.

The only valid self-closed HTML tags are: area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, wbr. In addition, <pre> is treated as an extension tag in MediaWiki and is also exempt. So, all other HTML tags besides these should be fixed if they use the invalid self-closed form. To be very clear and to repeat what I said elsewhere, extension tags (like ref, references, gallery, syntaxhighlight, nowiki, etc.) aren't affected.

@tstarling would you please run a new report run of P3012 ?

Wiki  div  span
arwiki1316
cawiki56
cebwiki00
commonswiki20
dewiki12
enwiki1122
enwikinews83
enwikisource01
enwiktionary00
eswiki43
fawiki1246
fiwiki31
frwiki10
frwikisource10
frwiktionary10
huwiki07
idwiki533
incubatorwiki1522
itwiki70
jawiki58
kowiki321
metawiki422
mgwiktionary04
nlwiki00
nowiki11
plwiki240
ptwiki58
rowiki15
ruwiki01
ruwiktionary33
shwiki330
srwiki09
svwiki10
trwiki21
ukwiki821
viwiki415
warwiki17
wikidatawiki11
zhwiki510
zhwiktionary12
This comment was removed by SamanthaNguyen.
ssastry moved this task from Backlog to In Progress on the MediaWiki-Parser board. Jan 4 2017, 7:30 PM
Liuxinyu970226 added a subscriber: liangent. Edited Jan 12 2017, 5:34 AM

for zhwiki, per discussion under Tech News: 2016-20 (@liangent ), most of the rest are jQuery('<div/>') (and jQuery('<span/>')?) and don't need "such fixing"...

Note, though, that for creating a single element with jQuery the MW coding conventions prefer jQuery( '<div>' ) without the trailing slash, so if you want to follow them in on-wiki gadgets/user scripts, you could change these occurrences, too.

Fito added a subscriber: Fito. Feb 13 2017, 4:15 AM

Change 350901 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Fix self-closed HTML tag test.

https://gerrit.wikimedia.org/r/350901

Change 350901 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Fix self-closed HTML tag test.

https://gerrit.wikimedia.org/r/350901

Quiddity removed a subscriber: Quiddity. May 4 2017, 11:47 PM
ssastry closed this task as Resolved. Jul 19 2017, 9:29 PM
ssastry claimed this task.

We have finished deprecating these tags and identifying their uses via the tracking category as well as via the Linter extension. Editors have been fixing pages and addressing this issue. So, there is nothing more to do here.

Elitre updated the task description. Jul 20 2017, 10:51 AM
Liuxinyu970226 moved this task from Backlog to Closed on the Chinese-Sites board. Jul 23 2017, 9:46 AM
Verdy_p added a subscriber: Verdy_p. Edited Aug 10 2017, 8:13 PM

This cleanup is clearly invalid.

HTML5 just says that void elements cannot have any end tag and must then be self-closed or implicitly closed immediately without parsing any content in them (they can only have attributes).

Other elements that ''may'' have contents (but are not required to) are perfectly valid when they are self-closed, such as <span id="example"/>. It is frequent for such tags to have empty content, notably at their initial creation, where only attributes may be used and are in fact enough (the visible content will be generated elsewhere).
Note that <span id="example"/> is used in MediaWiki instead of <a id="example"/> to insert anchors at an isolated position independently of what's around (which may also be empty).
I clearly don't see any interest in forcing us to write it as <span id="example"></span>. And note that MediaWiki should not blindly strip that empty span, as it has important attributes.

You have completely misinterpreted what HTML5 says! And in fact HTML5 still supports the XML syntax where explicit closure of all tags (including self-closing) is required, even if the SGML/HTML syntax allows tags to be implicitly closed depending on surrounding contents (for example, a <p> is implicitly closed by the next implicitly closed <p>, or by the next explicitly closed <p>...</p> or <p/> appearing when parsing its child elements, or by the closure of already opened container elements, such as <div>.....<p>....</div>, where a missing </p> is implied just before </div>, so that <div>.....<p>....</div></p> is effectively invalid: the </p> after </div> does not match any pending <p>, which has already been closed inside the <div>).

<elementname/> has always been equivalent to <elementname></elementname> for all elements that may have contents (but HTML does not require any element to have contents, not even HTML5, so using self-closing tags is perfectly valid and causes no ambiguity at all in parsers). And I don't know which kind of parsing difficulties you want to resolve by restricting self-closing tags ONLY to void tags; there's no such requirement in HTML5, and in fact HTML5 in XML syntax still requires self-closing tags for all void elements.

So HTML5 will forbid using <br>content</br> (the second tag is recognized as a second break with TidyHTML but is invalid in HTML5), but the "content" is in fact not the content of the first break but its next sibling: that's the only thing you will want to remove in MediaWiki, i.e. rewriting the wikicode as <br>content<br> or <br/>content<br/>, both being equivalent, but the second form still being required in XML syntax.

Restricted Application added a subscriber: Danmichaelo. Aug 10 2017, 8:13 PM
Verdy_p added a comment. Edited Aug 10 2017, 8:26 PM

So in summary, you will NOT stop supporting all self-closing tags, but will stop supporting self-closing tags on elements that are not void elements: </br> will become invalid (a common mistake in MediaWiki).

And I wonder what you will gain: during parsing, you just need to treat </br> not as an end tag but as a self-closing start/end tag, as if it was just <br> or <br/> (the second one being what most contributors expected when they used that invalid close tag). It is very frequent in talk pages (and nobody will fix them, we don't care): this was a useful feature with implicit autocorrection that did not impact the rest of the parsing. Dropping this basic fix when dropping TidyHTML will just make things worse.

However, I approve of deprecating the support for partially spanning elements such as:

  • <b><i>AAA</b>BBB</i>, which should be rewritten in wikicode as <b><i>AAA</i></b><i>BBB</i> (or more simply as <i><b>AAA</b>BBB</i>, but the content model is different)
  • <b>000<i>AAA</b>BBB</i>, which should be rewritten in wikicode as <b>000<i>AAA</i></b><i>BBB</i>

I am having trouble distilling meaning from this long comment, but I think you're wrong on at least one point. <elementname/> has only been equivalent to <elementname></elementname> in XHTML; in HTML, <elementname/> means the same as <elementname> (it's just an opening tag and the "stray" slash is ignored). As some tags like <br> do not require a closing tag, this is not a problem for them, but for tags that require contents it is.

No, ever since it was defined in HTML, it inherited what was initially defined in SGML as a shortcut to close elements that have empty content. Except in very old HTML4 browsers (that did not comply with the HTML4 standard in their tricky mode) it has always been accepted as valid. HTML5 explicitly references XML as one of its fully supported syntaxes (other syntaxes are the SGML/HTML4 syntax, others are possibly in JSON or something else that can represent the normative DOM).
I don't know any browser used today (and compatible with what MediaWiki generates) that will break on a <span id="..."/>. And anyway we are speaking about the wiki syntax, which can be much more liberal and will still generate XML-compatible content (so any "<br>" will still be converted to "<br/>" to allow strict XML parsing)!
I don't know why you want to remove the old SGML feature which is still supported in HTML5 (that also promotes the use of compact code with the HTML syntax, where self-closing tags like "<span/>" will be smarter than "<span id='...'></span>", which is needlessly overlong and only required for strict conformance with XML parsers).

As well, there are no elements in HTML5 that REQUIRE contents. Contents are optional everywhere (including in <table>, where a missing <tbody> is implicitly inferred, or in <html>, where a missing <head> or <body> is also implicitly inferred).
For HTML5 the document <html> is perfectly valid, just like <html/>; it has no content at all (the missing elements are inferred in the DOM after parsing, and there will be a default empty <title> in its default empty <head> child), so the empty document or a simple word is also a valid HTML5 document (this is not syntactically the same as the HTML inferred from the DOM after parsing, where elements are also canonicalized and may also have their element names capitalized, but this alternate canonicalized syntax gives the same DOM).

I am sorry but you're wrong. Here's a simple test file you can use to verify that <b/> is parsed as an opening tag. It is so in modern browsers, and it has always been (I tested with Firefox 3.6, I don't have older browsers readily available.)

@Verdy_p: Please read https://html.spec.whatwg.org/multipage/parsing.html#parse-error-non-void-html-element-start-tag-with-trailing-solidus
In HTML5, a slash at the end of a start tag

  • is ignored for void elements like <br>
  • is a parser error for non-void elements like <div> (but is just ignored, too)
  • only is treated as an actual self-closing element for foreign elements (i.e. for <svg> and <math>, which aren't allowed in wikitext anyway).
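The three cases in the list above can be restated as a tiny decision function. This is a sketch for illustration only: `SlashEffect` and `trailing_slash_effect` are hypothetical names, and the function encodes just the rule as summarised in this comment, not the full tokenizer and tree-construction behaviour of the HTML specification.

```python
from enum import Enum

# Void tags as listed in the task description.
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
})


class SlashEffect(Enum):
    IGNORED = "slash ignored; element is void anyway"
    PARSE_ERROR_IGNORED = "parse error; slash ignored, element stays open"
    SELF_CLOSES = "element is really self-closed"


def trailing_slash_effect(tag_name: str, in_foreign_content: bool = False) -> SlashEffect:
    """Classify what a trailing slash on a start tag does, per the summary above."""
    if in_foreign_content:
        # Foreign elements (SVG, MathML) honour the XML-style self-closing form.
        return SlashEffect.SELF_CLOSES
    if tag_name.lower() in VOID_TAGS:
        return SlashEffect.IGNORED
    return SlashEffect.PARSE_ERROR_IGNORED
```

So `trailing_slash_effect("br")` falls in the first case, `trailing_slash_effect("div")` in the second, and only a call with `in_foreign_content=True` yields a true self-closing element.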

In HTML5, a slash at the end of a start tag

  • is ignored for void elements like <br>

Yes, I do not oppose that. But there's no value in rejecting it in MediaWiki (note also that </br> is a common error currently accepted, where the slash is simply ignored as if it was a self-closing tag and not an end tag).

  • is a parser error for non-void elements like <div> (but is just ignored, too)

Absolutely not. It is only an error if you drop it from MediaWiki. And it is not an error in HTML5 with XML, which it refers to normatively. It will become an error in the MediaWiki parser if you drop this common use. We are not talking about what HTML5 parsers are doing (with what MediaWiki will generate in the HTML page), but what MediaWiki will recognize, and I don't see any interest in dropping it.

I don't see any motivation for now rejecting <span id="..." /> in the MediaWiki syntax, even if the generated HTML will not emit it and will expand it to <span id="..."></span>.

  • only is treated as an actual self-closing element for foreign elements (i.e. for <svg> and <math>, which aren't allowed in wikitext anyway).

Yes, but MediaWiki has its own support for svg and math tags via hooks, via content models, or in the image renderers. They add some limits to disallow unsafe elements or attributes, just as the MediaWiki parser does not actually parse HTML but rejects the <a> element, and also rejects too many safe elements that are part of HTML5 or even part of HTML4, such as <col>, <colgroup>, <thead>, <tbody>, and <tfoot>.

So there's a fundamental confusion here: you are mixing up HTML parsers and MediaWiki parsers. They do not handle the same language and do not operate at the same level.

You are advocating a change in MediaWiki parsers that creates more problems than it solves, and MediaWiki syntax will NEVER conform to HTML5's SGML/HTML syntax (due to restrictions for security), nor to its XML syntax (which has other restrictions).

Your test is also not conclusive: it is just about what browsers render when they parse HTML, not about what they render with the MediaWiki language, and wiki pages are NOT HTML pages.

Verdy_p added a comment. Edited Aug 11 2017, 4:09 PM

So, conclusion: you are completely WRONG. This is not the correct scope of MediaWiki versus HTML.

Verdy_p reopened this task as Open. Aug 11 2017, 4:10 PM
matmarex closed this task as Resolved. Aug 11 2017, 4:26 PM

Yes, MediaWiki's parsing is different from normal HTML parsing, but the very point of this task was to bring them closer for consistency. This has been accomplished; therefore this task is resolved. If you disagree with the premise then I'm afraid you're the only one.

I'm not alone. You are adding unnecessary work in pages and many plugins for absolutely no benefit, just because some browsers have difficulty interpreting and rendering some HTML tricks or selecting the appropriate parser to use (HTML4 quirks, SGML, XML, HTML5...) and the behavior to adopt.
This is now a very small minority of browsers, and anyway this does not depend at all on MediaWiki parsers, but only on what MediaWiki will generate from the wikicode.
So you're pushing the burden of this parsing onto the contributors of pages and will break millions of pages, causing unnecessary work that will need to be done in so many places. Most contributors will not understand what was wrong, or will refuse to do this very deceptive maintenance of the content to match what you desire, which is extremely easy to support in MediaWiki itself.

OK, TidyHTML will be internally replaced by another tool, but I don't see why you want to break compatibility with something that has proven to be useful, and refuse to port what was implemented and could be used safely without causing any problem (notably self-closing tags) in pages or MediaWiki extensions.

On the contrary, you still do nothing to support basic HTML features that people should use as they are safe (notably the safe elements "col", "colgroup", "thead", "tfoot", "tbody", "caption", and the semantic elements added in HTML5): instead we are still forced to use complex CSS everywhere in pages and templates, and cannot develop accessible content as we could (notably for tables).

Let me chime in here for a bit.

  • We right now have a backward compatibility fix in both Parsoid and PHP parser to handle self-closing tags. But, we don't want to have that fix indefinitely and keep accumulating unneeded cruft in the code. So, yes, ideally, we want to match parsing of HTML tags to how they are handled by a HTML parser.
  • While it is theoretically true that wikitext is not HTML, for all practical purposes, editors use HTML tags in wikitext markup and expect them to behave as HTML tags. Exceptions to that are source of confusion and edge cases in parsing.
  • The # of affected pages is not in the millions definitely and most wikis have already done the bulk of the fixup. Bots can probably do the bulk of the remaining work.
  • The issue of col, colgroup, thead, etc. is somewhat orthogonal to this discussion. I understand the comparison you are making here, but that can be advocated for and discussed on its own merits. Let us not bring in that into the discussion here.

But the TL;DR summary is that @Verdy_p wants us to treat HTML tags in wiki markup as their own thing and not tie them to the HTML5 spec. That argument is not entirely without merit -- and we have in fact considered a more restricted variant of that in https://www.mediawiki.org/wiki/Parsing/Notes/HTML5_Compliance#Fixing_non-compliance and https://www.mediawiki.org/wiki/User:Legoktm/HTML%2BMediaWiki. We plan to use this approach for other use cases like using <figure-inline> for inline images in our output. But note that this is still *building* upon the HTML5 spec by extending it, and not introducing more liberal (vs. more restrictive) exceptions to HTML5 recognized tags.

So, given that self-closing tags do not fit that HTML5 + MediaWiki extension model at all, given that consistency with the base spec is desirable (MediaWiki recently added support for HTML5 ids), and given that fixing the remaining pages that rely on this behavior is not a big burden and can be done automatically (even by Parsoid as a one-time fix by replacing self-closing tags with the <span></span> equivalent), I am not convinced there are strong use-case arguments for preserving this self-closing tag inconsistency.
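The automatic one-time fix mentioned above (replacing a self-closed tag with its explicit open/close equivalent) could look roughly like the following. This is a hedged sketch of what a bot might do, not Parsoid's actual implementation: `expand_self_closed` is a hypothetical helper, the regular expression ignores comments and nowiki sections, and `EXTENSION_TAGS` is only an illustrative subset.

```python
import re

# Void tags that may stay in self-closed form.
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
})

# Illustrative subset of extension tags (assumption; varies per wiki).
EXTENSION_TAGS = frozenset({"ref", "references", "nowiki"})

# Group 1 is the tag name, group 2 the raw attribute text (possibly empty).
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*)?)/>")


def expand_self_closed(wikitext: str) -> str:
    """Rewrite <tag attrs/> as <tag attrs></tag> for non-void HTML tags."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name.lower() in VOID_TAGS or name.lower() in EXTENSION_TAGS:
            return match.group(0)
        attrs = match.group(2).rstrip()    # drop the space before the slash
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSED.sub(repl, wikitext)
```

With this, `<span id="x" />` becomes `<span id="x"></span>`, so an anchor keeps its attributes, while `<br />` and `<ref name="n" />` are left unchanged.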

If the goal was consistency, then you would also invalidate the inconsistent extension tags (notably the <tvar>...</>). It's a fact that the notation <span id="..."/> is not inconsistent, given that it is supported by a well-known standard (XML).
And contributors in MediaWiki cannot rely only on the HTML5 standard; they are also used to legacy HTML4, XML, SVG and many other syntaxes that are partly supported (with restrictions and non-conforming extensions that are specific to MediaWiki, which selects what to support or not and constantly modifies the syntax used everywhere).

There was no inconsistency in the MediaWiki specifications as long as they were stable and did not require re-editing millions of articles or templates (not just in Wikimedia wikis but also in many others). HTML has been developed as a standard that preserves the legacy and provides upward compatibility as much as possible, but you want to deviate from this trend when changing the supported MediaWiki markup. To gain what? Nothing except creating a huge workload to fix so many pages for no added content value: wikis don't want this unnecessary workload to stack up and finally remain unfixed, creating many more long-term problems than the temporary problem you may have in replacing TidyHTML with something that you don't want to adapt, even if this adaptation is extremely minor.

If you were developing an OS API and suddenly changed its spec, you would receive complaints from many developers and users that their existing apps are no longer compatible and don't work the way it was documented and massively used (and this is the case here). Maybe bots could solve some of these problems, but relying on bots to fix things is a bad decision and completely in opposition to what the vast majority of wiki contributors want: this technical change will just harass them, when someone suddenly comes and says that what they did correctly in the past, and was even documented, is now considered bad. They want to create content, and don't want that content to be suddenly broken by an inconsistent decision using false arguments that this documentation was "inconsistent", even though MediaWiki does not follow this rule for various things.

They will also not like it if bots trying to fix such pseudo-issues actually make things even worse with tons of massive edits whose value is in fact completely void, and which just stress the servers with more workload and fill the history with tons of edits that will be hard to follow.

Wikis need stability if you don't want contributors to abandon the projects and stop creating actual content that will be constantly "fixed" for no added value. This is then only an unmotivated decision by some developers of MediaWiki who are deciding what is good for the project without consulting the community (but in fact, given that this change is really technical, many contributors will not understand the issue: you will treat them as if they were too stupid to learn, when in fact they could oppose the limited vision of a few MediaWiki developers).

The statistics about this project are clear: most of the changes needed are massive, most contributors will receive notifications of changes in pages they created, it will just produce a lot of noise for them, people will stop listening to notifications, will stop monitoring pages, and finally the content will just stall as is (and even the issues found in statistics will remain unsolved for years, long after you have released and deployed a version using your change). And I bet that many non-WM wikis will choose to revert this change or will create their own patches to support again what you'll have removed. This means that you open the door to forks, and to splitting the communities instead of joining them in a common effort.

matmarex removed a subscriber: matmarex. Aug 11 2017, 8:20 PM

There are help pages but there is no wikitext specification. Wikitext behavior is defined by what the implementation currently does. And, the implementation has changed over the years, and will continue to change to meet new needs and demands, some of which will require tweaking wikitext on pages and templates. We want to migrate the use of HTML tags in wikitext to be consistent with HTML5 standards. Do note that enwiki has discussed the HTML5 question even before we considered it. To repeat what I said in T134423#3519293, "millions" of pages will not need fix up because of the changes we are introducing. The substantial changes require fixing templates. The closest that comes to your repeated claim about millions of pages is if we started requiring that obsolete tags (<center>, <big>, <font>, etc) should be fixed up. But, we are not requiring that. Right now, it is up to individual wikis what they want to do with it. As I indicated above, enwiki has discussed that, independent of us.

Wikis have always had bots and other changes made for MOS reasons or other style related reasons. And, such changes to pages will continue to be made. The changes being made because of replacing Tidy is not necessarily different -- except they are coming from MediaWiki devs, not wiki editors. In order to reduce the changes necessary, we have added Tidy compatibility code in RemexHTML code to prevent the need for unnecessary fixes that can be handled in code. So, we are definitely not introducing work for editors inconsiderately. But, beyond categorical assertions about statistics being clear, if you have evidence of lots of editor complaints about changes being made to pages because of this deprecation on this task, please point me to those complaints. As far as I am aware, editors have been making these changes readily.

As for your complaints about tvar, we do have plans to fix the translate extension. As for other extension tags, code in extension tags is extension specific and is not always wikitext. So, the fact that they use svg, latex, xml, html4, bash, or whatever else that may rely on is not relevant to the discussion of whether the HTML tags used in wikitext should be HTML4 or HTML5.

As for non-WMF wikis, yes, we do not have any way of providing them support for making these kinds of fixes. For that and other reasons, the Tidy setting will continue to be configurable. So, if they choose to, they can continue to use Tidy4 and not make any of these changes to their wikis. But, Tidy4 is unmaintained at this point. The replacement is html5-tidy. So, there really is no long-term path outside of adopting the HTML5 standard.

But, coming back to this specific task, note that so far, we have only deprecated the use of these self-closing tags and have encouraged editors to fix them. We have not broken that behaviour. But yes, we could break that behaviour in the future if we find that the usage on wikis has dropped sufficiently. Ideally, we would not support self-closed tag behaviour indefinitely, but in the scheme of things that need fixing, this is relatively minor, so if there is pushback from wikis against moving from deprecation to breakage, we can reconsider this. But, without sufficient evidence that fixing self-closing tags to adhere to the HTML5 standard is unduly burdening editors, we will continue down this path of increased consistency and compatibility with HTML5 standards.

Restricted Application added a subscriber: jeblad. Sep 4 2017, 9:41 PM
Dvorapa added a comment. Edited Sep 4 2017, 11:17 PM

Side note: I thought we are moving off from Tidy to plain php libraries and I also thought Tidy's successor does not support self-closing tags (at least it is in MediaWiki-extensions-Linter's high priority group) and therefore we are going to break that behavior.

Yes to both. We are replacing Tidy and Tidy's successor does not support self-closing tags. But, we have a workaround in the parsers before it hits (Tidy or) RemexHTML to prevent breakage. So, self-closing tags won't break if we never remove the workarounds in the parsers. We would like to remove these workarounds and hence we are encouraging editors to fix them. See first question in https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/FAQ#Other_FAQs

@ssastry Cool, I understand the concerns of wikis about breakage, but I personally support it. I'm watching both the Linter group and the maintenance category on Czech Wikipedia, and almost daily I fix new issues where an editor adds something like this: <sup>some note<sup/>. It happens to both experienced editors (typo) and beginners (lack of HTML knowledge). I'm in favor of breakage because the editor would immediately notice there is something wrong. If I didn't fix errors from Linter or the category, nobody would notice for weeks or months or even years.

Dvorapa added a comment. Edited Sep 4 2017, 11:58 PM

But my opinion about breakage applies only to self-closing tags. I don't like fixing Tidy whitespace bugs, nowrap and block on one line bugs or redundant tables and so on, because they are not syntactically wrong, at most they are wrong only semantically. For them, workarounds would maybe be sufficient...

This discussion is tangential to this ticket. I'll respond here for now, but we should move additional discussion to the talk page on mediawiki so others can see it as well.

We don't have workarounds for the other categories since the right fix is not automatically available right now -- it requires editors to look at the error and fix it. Note that the redundant table category also hides many syntax errors (ex: missing closing table tag), and they are real syntactic errors when we move to HTML5. You cannot nest a table inside a table row - it has to be nested inside a table cell. But, in reality, that is not the right automatic fix because Tidy does something different.

As for the nowrap and tidy whitespace bug, I think they are found on only a small set of pages (usually 10s and 100s) and in most cases, fixing a few templates will fix them.

Dvorapa added a comment.EditedSep 5 2017, 12:22 AM

This discussion is tangential to this ticket. I'll respond here for now, but we should move additional discussion to the talk page on mediawiki so others can see it as well.

Sure

As for the nowrap and tidy whitespace bug, I think they are found on only a small set of pages (usually 10s and 100s) and in most cases, fixing a few templates will fix them.

The Tidy whitespace bug is especially annoying in templates with #if constructs around <span>s and <br>s. Currently I'm thinking about how to fix this:

<span style="white-space:nowrap; display:inline;">'''{{ #if: {{{skóre|}}} | {{{skóre}}} | v }}''' {{#if:{{{prodl|}}}|<span style="font-size: 85%">([[Prodloužení|prodl.]])</span> }}</span>{{ #if: {{{celkově|}}} | <br /> <div style="font-size: 85%">('''{{{celkově}}}''' [[Playoff|celkově]])</div> | }}{{ #if: {{{penaltyskóre|}}} | <br /> <div  style="font-size: 85%">('''{{{penaltyskóre}}}''' [[Penaltový rozstřel|pen]])</div> | }}

I think I have to change the whole logic, but I'm not sure how

The redundant table group contained thousands of pages and I don't think this will ever be reduced to zero on Czech Wikipedia.

<span style="white-space:nowrap; display:inline;">'''{{ #if: {{{skóre|}}} | {{{skóre}}} | v }}''' {{#if:{{{prodl|}}}|<span style="font-size: 85%">([[Prodloužení|prodl.]])</span> }}</span>{{ #if: {{{celkově|}}} | <br /> <div style="font-size: 85%">('''{{{celkově}}}''' [[Playoff|celkově]])</div> | }}{{ #if: {{{penaltyskóre|}}} | <br /> <div style="font-size: 85%">('''{{{penaltyskóre}}}''' [[Penaltový rozstřel|pen]])</div> | }}

I think I have to change the whole logic, but I'm not sure how

See https://www.mediawiki.org/wiki/Topic:Txk8zb0ba4g3zsdi

Change 585519 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Use HTML5 semantics for self-closed HTML tags in wikitext

https://gerrit.wikimedia.org/r/585519

Verdy_p added a comment.EditedApr 2 2020, 4:16 PM
In T134423#3578929, @ssastry wrote:

<span style="white-space:nowrap; display:inline;">'''{{ #if: {{{skóre|}}} | {{{skóre}}} | v }}''' {{#if:{{{prodl|}}}|<span style="font-size: 85%">([[Prodloužení|prodl.]])</span> }}</span>{{ #if: {{{celkově|}}} | <br /> <div style="font-size: 85%">('''{{{celkově}}}''' [[Playoff|celkově]])</div> | }}{{ #if: {{{penaltyskóre|}}} | <br /> <div style="font-size: 85%">('''{{{penaltyskóre}}}''' [[Penaltový rozstřel|pen]])</div> | }}

I think I have to change the whole logic, but I'm not sure how

The empty "br" element before the "div" is completely futile. Just drop it, and add a top margin style to the following "div" if you need the spacing. "br" is intended as an inline element in the middle of a block, not between blocks. If you do use a "br" element between blocks, MediaWiki will correctly put it into an empty "p" element. Here, however, it follows an inline "span", so the "br" is appended to the end of the block or paragraph containing that "span"; since it sits after inline text and just before a "div", it has no effect.

So:

  • drop these "<br />", they make no sense in your example before any "div"; or
  • replace the "div" elements with "span" (which is probably what was meant here, as they contain parenthetical details and continue the block containing the leading score ...), but then write the line breaks as "<br>".

Note: MediaWiki should still treat "<br />" as it *must* be treated in XHTML (which also has a valid HTML5 dialect, even if basic HTML5 does not require the XML syntax). This means that "<br />" should be treated as if it were "<br>" in the HTML5 syntax. If MediaWiki discards "<br />" completely, this is a serious bug that should NEVER have occurred (and did not occur in MediaWiki before it started to convert to HTML5).

In my opinion, tracking this usage is only useful now because of the past serious bug where the tag was silently discarded. It should never have been needed at all, and it should not even be treated as an error (given that it is not really deprecated and is still valid in HTML5 with the XML syntax, which is still allowed with an XML document declaration; after all, we never put any document declaration in wiki pages, so HTML vs. XML syntax is not relevant for them).

However tracking other tags that may have contents (like "b", "span", "div") is more useful and of course they have to be corrected as they create a parsing ambiguity (between HTML5 and XML parsers that behave differently)... except that the wiki parser is not even a valid HTML5 or XML parser, as it uses its own (simpler?) document type and parsing rules (with its own separate ambiguities and caveats). We should not depend at all on the formats that MediaWiki will generate for publishing, and should not confuse them with the source format that it uses for its own parser.

Change 585519 merged by jenkins-bot:
[mediawiki/core@master] Use HTML5 semantics for self-closed HTML tags in wikitext

https://gerrit.wikimedia.org/r/585519

Change 599906 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Sync parserTests with core

https://gerrit.wikimedia.org/r/599906

Change 599906 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Sync parserTests with core

https://gerrit.wikimedia.org/r/599906

Change 600013 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Remove unused deprecated-self-close-category message

https://gerrit.wikimedia.org/r/600013

Jdlrobson added a subscriber: Jdlrobson.EditedJun 7 2020, 7:16 PM

I think this broke de.wikiquote.org - https://de.wikiquote.org/wiki/Benutzer_Diskussion:Jon_(WMF) can you confirm?

ssastry added a comment.EditedJun 7 2020, 8:33 PM

It is going out in tomorrow's Tech News: https://meta.wikimedia.org/wiki/Tech/News/2020/24

In any case dewikiquote has a handful of these lint errors according to https://de.wikiquote.org/wiki/Spezial:LintErrors/self-closed-tag and someone should just fix them. You happened to edit the one page in the main namespace that has this error :-) .. https://de.wikiquote.org/wiki/Spezial:LintErrors/self-closed-tag?namespace=0

Change 603571 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a16

https://gerrit.wikimedia.org/r/603571

Change 603571 merged by jenkins-bot:
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a16

https://gerrit.wikimedia.org/r/603571

Change 600013 merged by jenkins-bot:
[mediawiki/core@master] Remove unused deprecated-self-close-category message

https://gerrit.wikimedia.org/r/600013

Why was the error-tracking category (Pages using invalid self-closed HTML tags) removed? Will there be something to replace it? Gnomes use it regularly to find pages with invalid code; will it be possible to find such code in some other way? Will each wiki have to create its own detection method?

There is already a linter self-closed-tag category. Since there are very few instances of this error left, that linter category is sufficient for fixing them.

More importantly, since self-closing tags will now be rendered differently than before (the backward compatibility behavior has been removed), any new uses that cause rendering failures will be obvious right away.

Thanks for the response. I continue to hope that someday, the Linter tracking will actually apply Wikipedia categories, so that we can work with them in the usual way.

One note from a gnome to a developer: "new uses that cause rendering failures will be obvious right away" is not a valid supposition on the English Wikipedia. There are far too many pages, with new pages being created every day, for any failures to be obvious. It is best for consequential failures to be tracked and gathered in a systematic way.

One note from a gnome to a developer: "new uses that cause rendering failures will be obvious right away" is not a valid supposition on the English Wikipedia. There are far too many pages, with new pages being created every day, for any failures to be obvious.

Fair enough! :-)

I continue to hope that someday, the Linter tracking will actually apply Wikipedia categories, so that we can work with them in the usual way.
One note from a gnome to a developer: "new uses that cause rendering failures will be obvious right away" is not a valid supposition.

From a gnome on the (much smaller) Czech Wikipedia: I confirm and totally agree with both statements. The dualism is not a good thing, and (even on a smaller wiki) editors do not always care whether there is an obvious visual issue with their edit.