Deprecate nonstandard behavior of self-closed HTML tags in wikitext.
Closed, Resolved. Public

Description

The HTML5 standard says that the XML-ish self-closed tag syntax <TAGNAME/> (note the trailing slash) is ignored: a tag is "self-closed" if and only if its name is on the list of "void tags". The only valid void HTML tags are area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, and wbr.

As such, <b/> and <span/> are treated exactly the same as <b> and <span> by an HTML5 parser. The situation is more complicated in MediaWiki, however.

  • Without tidy turned on, the Sanitizer mostly enforced this constraint but rewrote <b/> as &lt;b/>. That isn't strictly according to the HTML5 spec (which would rewrite it as <b>) but does get the point across that this is invalid HTML syntax.
  • When tidy is enabled, tidy replaces <b/> with nothing, that is, it removes the invalid tag from the output. This has led to its (ab)use as a way to protect leading/trailing whitespace and punctuation in templates. However, there are alternative ways to do this, including <nowiki/> and &#32;, which don't violate the HTML5 parsing rules.
  • However, once we replace Tidy with an HTML5 parser (see T89331), MediaWiki will start enforcing the HTML5 standard and parse <b/> and <span/> as start tags, which can break rendering on pages that (deliberately or accidentally) rely on Tidy removing these tags.
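The three behaviours listed above can be compared side by side. The following is a minimal sketch, not MediaWiki's actual Sanitizer or Tidy code: the function name `render_self_closed`, the mode names, and the regular expression are all assumptions made for illustration, and the regex only recognises simple tags.

```python
import re

# Void elements as listed in the task description.
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
})

# Matches a self-closed tag such as <b/> or <span id="x" />.
SELF_CLOSED = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)(\s[^<>]*?)?\s*/>")


def render_self_closed(wikitext: str, mode: str) -> str:
    """Rewrite non-void self-closed tags the way each mode is described above.

    `mode` is a hypothetical switch: "sanitizer" (no Tidy), "tidy", or "html5".
    """

    def rewrite(match: re.Match) -> str:
        name = match.group(1)
        attrs = (match.group(2) or "").rstrip()
        if name.lower() in VOID_TAGS:
            return match.group(0)           # valid in every mode, left untouched
        if mode == "sanitizer":
            return "&lt;" + match.group(0)[1:]  # escape the opening bracket
        if mode == "tidy":
            return ""                       # the invalid tag is dropped
        if mode == "html5":
            return f"<{name}{attrs}>"       # slash ignored, a start tag remains
        raise ValueError(f"unknown mode: {mode}")

    return SELF_CLOSED.sub(rewrite, wikitext)
```

For the input `a<b/>c`, the "sanitizer" mode yields `a&lt;b/>c`, the "tidy" mode yields `ac`, and the "html5" mode yields `a<b>c`, which is why pages relying on the Tidy behaviour can change once an HTML5 parser is used.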

In order to facilitate a smooth migration away from Tidy, we are deprecating the use of non-void self-closed HTML tags (so, to repeat, area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, wbr can be written in self-closed tag form and need not be changed). Additionally, we have started tagging pages using this invalid form with the [[Category:Pages using invalid self-closed HTML tags]] tracking category. Once pages which use this construct are cleaned up, we'll change both the "tidy" and the "no tidy" case to be consistent with the HTML5 parsing standard; that is, <b/> will be transformed into <b>.

Additionally, registered extension tags aren't subject to this consideration since they aren't HTML5 tags. So, for example, <ref /> and <references /> can continue to be used.
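As a rough illustration of which tags the deprecation covers, here is a minimal sketch of a checker. It is not the code that populates the tracking category: the function name is hypothetical, the `EXTENSION_TAGS` set is only a small assumed sample (the real set depends on the extensions installed on a wiki), and templates are not expanded.

```python
import re

# Void elements as listed in the task description.
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
})

# Assumed sample of registered extension tags; not an exhaustive list.
EXTENSION_TAGS = frozenset({"ref", "references", "nowiki"})

# Matches a self-closed tag such as <b/> or <span id="x" />.
SELF_CLOSED = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)(?:\s[^<>]*?)?\s*/>")


def invalid_self_closed_tags(wikitext, extension_tags=EXTENSION_TAGS):
    """Return the names of self-closed tags that are neither void nor extension tags."""
    found = []
    for match in SELF_CLOSED.finditer(wikitext):
        name = match.group(1).lower()
        if name not in VOID_TAGS and name not in extension_tags:
            found.append(name)
    return found
```

Called on `x<br />y<ref name="a" /><span id="b"/><B/>`, this sketch reports only `span` and `b`; `<br />` and `<ref />` pass untouched.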

See also: https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/FAQ#Simplified_instructions_for_fixing_pages

I have a question to clarify:

Does that mean that basically all <hr /> and <br /> in wikitext, which bots and gadgets have until now been producing when cleaning up the <br>, <br/> and </br> forms, will have to be converted to <hr> and <br>?

No. HTML5 accepts the self-closing form for tags and treats it as a start tag.

This is fine for <hr> and <br>: the spec defines them as void elements, so a start tag alone has the intended effect.

Can someone please give a clear answer on this? It is fully illogical to propagate these (void) XHTML-style tags (which were only created to look like self-closing XHTML tags). Yet bots and scripts currently convert <hr> and <br> to XHTML!

So the simple question is: are <hr /> and <br /> deprecated or not?

Not deprecated.

Perhelion added a comment. Edited Jul 15 2016, 3:45 PM

OK, can you say why? Can you see that this is illogical or incomprehensible (especially when we think about the future)? (Anyway, <br /> is mentioned in the task description.)

So people who convert HTML(5) to XHTML (<br> to <br/>) are right to do so anyway?

ssastry updated the task description. Jul 15 2016, 4:49 PM

The tracking category contains pages that do not use deprecated self-closing tags (but that contain at least some of <br />, <hr />, <ref />, <references />). I am not sure whether to report it here or open a new task for it.

ssastry added a comment. Edited Jul 15 2016, 5:11 PM

I don't fully understand your question. But I updated the task description, which will hopefully clear up any confusion. Internally, the HTML5 parser converts all self-closing forms to a start tag. So, <br/> is converted to <br> and <b/> is converted to <b>. As the updated description explains, this is a problem for non-void tags. Tidy removes a <b/> tag, but an HTML5 parser will change it to a <b>. So, once Tidy is replaced with an HTML5-compliant cleanup tool (T89331), the rendering of pages that use <b/> might change, and you may see a lot of text in bold.
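To make the "lot of text in bold" effect concrete, here is a deliberately simplified sketch. It is not a real HTML5 tree builder: it only tracks <b>, </b> and <b/> with a counter, and the function name `bold_text` and the `html5` flag are assumptions for illustration.

```python
import re

# Recognises <b>, </b>, <b/> and runs of plain text; nothing else.
TOKEN = re.compile(r"<(/?)b(/?)>|([^<]+)")


def bold_text(source, html5):
    """Return the text that would end up inside an open <b> element.

    With html5=False, <b/> is dropped (the Tidy behaviour described above).
    With html5=True, <b/> is treated as a plain <b> start tag.
    """
    depth = 0
    bold = []
    for closing, self_closed, text in TOKEN.findall(source):
        if text:
            if depth > 0:
                bold.append(text)
        elif closing:
            depth = max(0, depth - 1)   # an end tag closes one open <b>, if any
        elif self_closed:
            if html5:
                depth += 1              # slash ignored: behaves like <b>
        else:
            depth += 1                  # ordinary <b> start tag
    return "".join(bold)
```

For the input `Label:<b/> rest of the page`, nothing is bold when `html5=False`, while with `html5=True` everything after the tag is bold because nothing ever closes it.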

So yes, it is okay to use <br />. It is also okay to convert <br> to <br />, but it is not necessary.

New bug please. I am going to resolve this bug shortly. But, it could be that the page is using a template that generates a deprecated self-closing tag. But, if you have a sample page, please open a new bug with a link. Thanks.

Solved in the meantime. The issue is that the tracking category isn't fully populated, so it was (and actually still is) not listing any of the templates used on the given page.

Perhelion added a comment. Edited Jul 18 2016, 12:03 PM

A new HTML5 check is needed for the custom user signature setting, because while fixing several pages I found that most of the problems were user signatures. Is this a new bug report? T140606

NicoV added a subscriber: NicoV. Jul 19 2016, 11:18 AM

In addition to the categorization in [[Category:Pages using invalid self-closed HTML tags]], would it be possible to also have an error message when previewing in edit mode, like what is done for [[Category:Pages using duplicate arguments in template calls]] ? It would help a lot to find where the problem is, and if it's coming from a template used inside the page, rather than from the text in the page.

@NicoV Besides, I don't think it's practically doable to show whether the issue is on the current page or in a transcluded template (mind that the transclusion can be several levels deep too), and when you hit Preview you already see the tracking category.

OTOH, if the error notice were there to emphasize that there is such an issue (although without mentioning the source, as I pointed out above), it would be helpful if the message listed the improperly self-closed tags, as they are sometimes quite hard to find due to weird wild constructions such as <tag{{#if:{{{content|}}}|>{{{content}}}/tag|/}}>, so one would have a narrower scope of what to look for...

NicoV added a comment. Jul 19 2016, 3:14 PM

@Danny_B I was talking about [[Category:Pages using duplicate arguments in template calls]] because that's exactly what is already done for that tracking category: it tells you whether the issue is on the current page or in a transcluded template (and it includes the levels when there are several levels...). It also tells you which argument is duplicated. So it is doable... I don't know how it was done, but it was done and it is very helpful.

So doing it here in the same way would also be very helpful: current page or transcluded template (with levels), plus information about the problem itself (the tag name, for example, would be very helpful to narrow the search).

I agree that you do see the tracking category when you hit Preview, but for some articles that does not help much in finding where the problem is when the issue is not trivial. For example, I'm currently trying to fix the articles on frwiki, and I've encountered a few pages where I was unable to find the cause of the categorization. For example:
https://fr.wikipedia.org/wiki/Insurrection_de_Boko_Haram
https://fr.wikipedia.org/wiki/Hautes_Tatras
https://fr.wikipedia.org/wiki/Discussion_Portail:Aur%C3%A8s

Hautes Tatras fixed by https://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Panorama_annot%C3%A9_Hautes_Tatras&diff=prev&oldid=127982911

It's easy - just open all linked templates and look for improper selfclosing tags...

jrbs added a subscriber: jrbs. Jul 25 2016, 10:08 PM
Elitre added a subscriber: Elitre. Aug 2 2016, 9:54 AM
Arbnos removed a subscriber: Arbnos. Aug 3 2016, 10:34 AM
JJMC89 added a subscriber: JJMC89. Sep 22 2016, 10:07 PM
Jonesey95 added a subscriber: Jonesey95. Edited Oct 8 2016, 5:37 PM

Is this supposed to check for self-closed instances of every tag listed on lines 376-381 of https://gerrit.wikimedia.org/r/#/c/286928/11/includes/Sanitizer.php ?

If so, I think it may be missing at least one. See this page for an example of a "pre" tag that is self-closed, but to which the error category has not been applied:

https://en.wikipedia.org/w/index.php?title=User:Jonesey95/sandbox2&oldid=743233850

[edited to add:] Can someone please point us to a complete list of tags that can be listed on the category page? Thanks.

See T134423#2301066.

The only valid self-closed HTML tags are: area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, wbr. In addition, <pre> is treated as an extension tag in MediaWiki and is also exempt. So, all other HTML tags besides these should be fixed if they use the invalid self-closed form. To be very clear and to repeat what I said elsewhere, extension tags (like ref, references, gallery, syntaxhighlight, nowiki, etc.) aren't affected.

@tstarling would you please run a new report run of P3012 ?

Wiki div span
arwiki1316
cawiki56
cebwiki00
commonswiki20
dewiki12
enwiki1122
enwikinews83
enwikisource01
enwiktionary00
eswiki43
fawiki1246
fiwiki31
frwiki10
frwikisource10
frwiktionary10
huwiki07
idwiki533
incubatorwiki1522
itwiki70
jawiki58
kowiki321
metawiki422
mgwiktionary04
nlwiki00
nowiki11
plwiki240
ptwiki58
rowiki15
ruwiki01
ruwiktionary33
shwiki330
srwiki09
svwiki10
trwiki21
ukwiki821
viwiki415
warwiki17
wikidatawiki11
zhwiki510
zhwiktionary12
This comment was removed by SamanthaNguyen.
ssastry moved this task from Backlog to In Progress on the MediaWiki-Parser board. Jan 4 2017, 7:30 PM
Liuxinyu970226 added a subscriber: liangent. Edited Jan 12 2017, 5:34 AM

For zhwiki, per the discussion under Tech News: 2016-20 (@liangent), most of the remaining ones are jQuery('<div/>') (and jQuery('<span/>')?) and don't need "such fixing"...

Note, though, that for creating a single element with jQuery the MW coding conventions prefer jQuery( '<div>' ) without the trailing slash, so if you want to follow them in on-wiki gadgets/user scripts, you could change these occurrences, too.

Fito added a subscriber: Fito. Feb 13 2017, 4:15 AM

Change 350901 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Fix self-closed HTML tag test.

https://gerrit.wikimedia.org/r/350901

Change 350901 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Fix self-closed HTML tag test.

https://gerrit.wikimedia.org/r/350901

Quiddity removed a subscriber: Quiddity. May 4 2017, 11:47 PM
ssastry closed this task as Resolved. Jul 19 2017, 9:29 PM
ssastry claimed this task.

We have finished deprecating these tags and identifying them via the tracking category as well as via the Linter extension. Editors have been fixing pages and addressing this issue. So, there is nothing more to do here.

Elitre updated the task description. Jul 20 2017, 10:51 AM
Liuxinyu970226 moved this task from Backlog to Closed on the Chinese-Sites board. Jul 23 2017, 9:46 AM
Verdy_p added a subscriber: Verdy_p. Edited Aug 10 2017, 8:13 PM

This cleanup is clearly invalid.

HTML5 just says that void elements cannot have an end tag and must then be self-closed or implicitly closed immediately without parsing any content in them (they can only have attributes).

Other elements that ''may'' have contents (but are not required to) are perfectly valid when they are self-closed, such as <span id="example"/>. It is frequent for such tags to have empty content, notably at their initial creation, where only attributes may be used and are in fact enough (the visible content will be generated elsewhere).
Note that <span id="example"/> is used in MediaWiki instead of <a id="example"/> to insert anchors at an isolated position independently of what's around (which may also be empty).
I clearly don't see any interest in forcing us to write it as <span id="example"></span>. And note that MediaWiki should not blindly strip that empty span, as it has important attributes.

You have completely misinterpreted what HTML5 says! In fact HTML5 still supports the XML syntax, where explicit closure of all tags (including self-closing) is required, even if the SGML/HTML syntax allows tags to be implicitly closed depending on surrounding contents. For example, a <p> is implicitly closed by the next implicitly closed <p>, or by the next explicitly closed <p>...</p> or <p/> appearing when parsing its child elements, or by the closure of already opened container elements: in <div>.....<p>....</div> a missing </p> is implied just before </div>, so that <div>.....<p>....</div></p> is effectively invalid, because the </p> after </div> does not match any pending <p>, which has already been closed inside the <div>.

<elementname/> has always been equivalent to <elementname></elementname> for all elements that may have contents (HTML does not require any element to have contents, not even HTML5, so using self-closing tags is perfectly valid and causes no ambiguity at all in parsers). I don't know which kind of parsing difficulties you want to resolve by restricting self-closing tags ONLY to void tags; there's no such requirement in HTML5, and in fact HTML5 in XML syntax still requires self-closing tags for all void elements.

So HTML5 will forbid using <br>content</br> (the second tag is recognized as a second break by TidyHTML but is invalid in HTML5), but the "content" is in fact not the content of the first break; it is its next sibling. That's the only thing you will want to remove in MediaWiki, i.e. by rewriting the wikicode as <br>content<br> or <br/>content<br/>, both being equivalent, but with the second form still being required in XML syntax.

Restricted Application added a subscriber: Danmichaelo. Aug 10 2017, 8:13 PM
Verdy_p added a comment. Edited Aug 10 2017, 8:26 PM

So in summary, you will NOT stop supporting all self-closing tags, but will stop supporting self-closing tags on elements that are not void elements: </br> will become invalid (a common mistake in MediaWiki).

And I wonder what you will gain: during parsing, you just need to treat </br> not as an end tag but as a self-closing start/end tag, as if it were just <br> or <br/> (the second one being what most contributors expected when they used that invalid close tag). It is very frequent in talk pages (and nobody will fix them, we don't care): this was a useful feature with implicit autocorrection that did not impact the rest of the parsing. Dropping this basic fix when dropping TidyHTML will just make things worse.

However I approve deprecating the support for partially spanning elements such as:

  • <b><i>AAA</b>BBB</i>, which should be rewritten in wikicode as <b><i>AAA</i></b><i>BBB</i> (or more simply as <i><b>AAA</b>BBB</i> but the content model is different)
  • <b>000<i>AAA</b>BBB</i>, which should be rewritten in wikicode as <b>000<i>AAA</i></b><i>BBB</i>
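The rewriting described in the two bullets above can be sketched with a stack of open elements. This is only an illustration under simplifying assumptions: it handles attribute-less tags, drops end tags that match nothing, does not close elements left open at the end, and is neither Tidy's nor the HTML5 spec's actual algorithm. The function name `renest` is hypothetical.

```python
import re

# Simple start tags, end tags (no attributes), and runs of plain text.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")


def renest(source):
    """Rewrite partially overlapping tags into properly nested ones."""
    stack = []   # names of currently open elements, outermost first
    out = []
    for closing, name, text in TOKEN.findall(source):
        if text:
            out.append(text)
        elif not closing:
            stack.append(name)
            out.append(f"<{name}>")
        elif name in stack:
            # Close everything opened after `name`, innermost first.
            reopened = []
            while stack[-1] != name:
                inner = stack.pop()
                out.append(f"</{inner}>")
                reopened.append(inner)
            stack.pop()
            out.append(f"</{name}>")
            # Reopen the elements that were cut short, in their original order.
            for inner in reversed(reopened):
                stack.append(inner)
                out.append(f"<{inner}>")
        # An end tag with no matching open element is dropped.
    return "".join(out)
```

With this sketch, `<b><i>AAA</b>BBB</i>` becomes `<b><i>AAA</i></b><i>BBB</i>` and `<b>000<i>AAA</b>BBB</i>` becomes `<b>000<i>AAA</i></b><i>BBB</i>`, matching the rewrites proposed above.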

I am having trouble distilling meaning from this long comment, but I think you're wrong on at least one point. <elementname/> has only been equivalent to <elementname></elementname> in XHTML; in HTML, <elementname/> means the same as <elementname> (it's just an opening tag and the "stray" slash is ignored). As some tags like <br> do not require a closing tag, this is not a problem for them, but for tags that require contents it is.

No, ever since it was defined in HTML, it inherited what was initially defined in SGML as a shortcut to close elements that have empty content. Except in very old HTML4 browsers (that did not comply with the HTML4 standard in their quirks mode) it has always been accepted as valid. HTML5 explicitly references XML as one of its fully supported syntaxes (the others being the SGML/HTML4 syntax, and possibly JSON or something else that can represent the normative DOM).
I don't know any browser used today (and compatible with what MediaWiki generates) that will break on a <span id="..."/>. And anyway we are speaking about the wiki syntax, which can be much more liberal and will still generate XML-compatible content (so any "<br>" will still be converted to "<br/>" to allow strict XML parsing)!
I don't know why you want to remove the old SGML feature, which is still supported in HTML5 (which also promotes the use of compact code with the HTML syntax, where self-closing tags like "<span/>" are smarter than "<span id='...'></span>", which is needlessly long and only required for strict conformance with XML parsers).

Also, there are no elements in HTML5 that REQUIRE contents. Contents are optional everywhere (including in <table>, where a missing <tbody> is implicitly inferred, or in <html>, where a missing <head> or <body> is also implicitly inferred).
For HTML5 the document <html> is perfectly valid, just like <html/>; it has no content at all (the missing elements are inferred in the DOM after parsing, and there will be a default empty <title> in its default empty <head> child). So the empty document, or a simple word, is also a valid HTML5 document (this is not syntactically the same as the HTML inferred from the DOM after parsing, where elements are also canonicalized and may also have their element names capitalized, but this alternate canonicalized syntax gives the same DOM).

I am sorry but you're wrong. Here's a simple test file you can use to verify that <b/> is parsed as an opening tag. It is so in modern browsers, and it has always been (I tested with Firefox 3.6, I don't have older browsers readily available.)

@Verdy_p: Please read https://html.spec.whatwg.org/multipage/parsing.html#parse-error-non-void-html-element-start-tag-with-trailing-solidus
In HTML5, a slash at the end of a start tag

  • is ignored for void elements like <br>
  • is a parser error for non-void elements like <div> (but is just ignored, too)
  • only is treated as an actual self-closing element for foreign elements (i.e. for <svg> and <math>, which aren't allowed in wikitext anyway).
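The three cases above can be written down as a small classifier. This is a sketch for illustration only, with hypothetical names (`SlashEffect`, `trailing_slash_effect`); it does not model the cases where an HTML tag name inside <svg> or <math> breaks out of foreign content, so the `in_foreign_content` flag is assumed to be set only for genuinely foreign elements.

```python
from enum import Enum

# Void elements as listed in the task description.
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
})

# Elements that start foreign (SVG / MathML) content.
FOREIGN_ROOTS = frozenset({"svg", "math"})


class SlashEffect(Enum):
    IGNORED = "ignored: void element"
    PARSE_ERROR_IGNORED = "parse error, then ignored: treated as a start tag"
    SELF_CLOSING = "honoured: element is self-closing"


def trailing_slash_effect(tag_name, in_foreign_content=False):
    """Classify what a trailing slash on a start tag does, per the list above."""
    name = tag_name.lower()
    if in_foreign_content or name in FOREIGN_ROOTS:
        return SlashEffect.SELF_CLOSING
    if name in VOID_TAGS:
        return SlashEffect.IGNORED
    return SlashEffect.PARSE_ERROR_IGNORED
```

So `trailing_slash_effect("br")` gives `IGNORED`, `trailing_slash_effect("div")` gives `PARSE_ERROR_IGNORED`, and only `svg`, `math`, and elements inside them give `SELF_CLOSING`.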

2017-08-11 9:25 GMT+02:00 Schnark <no-reply@phabricator.wikimedia.org>:

In HTML5, a slash at the end of a start tag

  • is ignored for void elements like <br>

Yes, I do not oppose that. But there's no value in rejecting it in MediaWiki (note also that </br> is a common error that is currently accepted, where the slash is simply ignored as if it were a self-closing tag and not an end tag).

  • is a parser error for non-void elements like <div> (but is just ignored, too)

Absolutely not. It is only an error if you drop it from MediaWiki. And it is not an error in HTML5 with the XML syntax that it refers to normatively. It will become an error in the MediaWiki parser if you drop this common use. We are not talking about what HTML5 parsers are doing (with what MediaWiki will generate in the HTML page), but about what MediaWiki will recognize, and I don't see any interest in dropping it.

I don't see any motivation for now rejecting <span id="..." /> in the MediaWiki syntax, even if the generated HTML will not emit it and will expand it to <span id="..."></span>.

  • only is treated as an actual self-closing element for foreign elements (i.e. for <svg> and <math>, which aren't allowed in wikitext anyway).

Yes, but MediaWiki has its own support for svg and math tags via hooks, via content models, or in the image renderers. They add some limits to disallow unsafe elements or attributes, just as the MediaWiki parser does not actually parse HTML but rejects the <a> element, and also rejects too many safe elements that are part of HTML5 or even of HTML4, such as <col>, <colgroup>, <thead>, <tbody>, and <tfoot>.

So there's a fundamental confusion here: you are mixing up HTML parsers and MediaWiki parsers. They do not handle the same languages and do not operate at the same level.

You are advocating a change in MediaWiki parsers that creates more problems than it solves, and MediaWiki syntax will NEVER conform to HTML5's SGML/HTML syntax (due to restrictions for security), nor to its XML syntax (which has other restrictions too).

Your test is also not conclusive: it is just about what browsers render when they parse HTML, not about what they render from the MediaWiki language, and wiki pages are NOT HTML pages.

Verdy_p added a comment. Edited Aug 11 2017, 4:09 PM

So, in conclusion: you are completely WRONG. This is not the correct scope of MediaWiki versus HTML.

Verdy_p reopened this task as Open. Aug 11 2017, 4:10 PM
matmarex closed this task as Resolved. Aug 11 2017, 4:26 PM

Yes, MediaWiki's parsing is different from normal HTML parsing, but the very point of this task was to bring them closer for consistency. This has been accomplished, so this task is resolved. If you disagree with the premise then I'm afraid you're the only one.

I'm not alone. You are adding unnecessary work in pages and many plugins for absolutely no benefit, just because some browsers have difficulty interpreting and rendering some HTML tricks or selecting the appropriate parser to use (HTML4 quirks, SGML, XML, HTML5...) and the behavior to adopt.
These are now a very small minority of browsers, and anyway this does not depend at all on MediaWiki parsers, but only on what MediaWiki generates from the wikicode.
So you're pushing the burden of this parsing from MediaWiki onto the contributors of pages, and you will break millions of pages, causing unnecessary work that will need to be done in so many places. Most contributors will not understand what was wrong, or will refuse to do this very disappointing maintenance of the content to match what you desire, which is extremely easy to support in MediaWiki itself.

OK, TidyHTML will be internally replaced by another tool, but I don't see why you want to break compatibility with something that has proven useful, and refuse to port what was implemented and could be used safely without causing any problem (notably self-closing tags) in pages or MediaWiki extensions.

On the other hand, you still do nothing to support basic HTML features that people should be able to use as they are safe (notably the safe elements "col", "colgroup", "thead", "tfoot", "tbody", "caption", and the semantic elements added in HTML5). Instead we are still forced to use complex CSS everywhere in pages and templates, and cannot develop content that is as accessible as it could be (notably for tables).

Let me chime in here for a bit.

  • Right now we have a backward-compatibility fix in both Parsoid and the PHP parser to handle self-closing tags. But we don't want to keep that fix indefinitely and keep accumulating unneeded cruft in the code. So, yes, ideally, we want parsing of HTML tags to match how they are handled by an HTML parser.
  • While it is theoretically true that wikitext is not HTML, for all practical purposes, editors use HTML tags in wikitext markup and expect them to behave as HTML tags. Exceptions to that are a source of confusion and edge cases in parsing.
  • The number of affected pages is definitely not in the millions, and most wikis have already done the bulk of the fixup. Bots can probably do the bulk of the remaining work.
  • The issue of col, colgroup, thead, etc. is somewhat orthogonal to this discussion. I understand the comparison you are making here, but that can be advocated for and discussed on its own merits. Let us not bring in that into the discussion here.

But the TL;DR summary is that @Verdy_p wants us to treat HTML tags in wiki markup as their own thing and not tie them to the HTML5 spec. That argument is not entirely without merit, and we have in fact considered a more restricted variant of it in https://www.mediawiki.org/wiki/Parsing/Notes/HTML5_Compliance#Fixing_non-compliance and https://www.mediawiki.org/wiki/User:Legoktm/HTML%2BMediaWiki. We plan to use this approach for other use cases, like using <figure-inline> for inline images in our output. But note that this is still *building* upon the HTML5 spec by extending it, not introducing more liberal (vs. more restrictive) exceptions to HTML5-recognized tags.

So, given that self-closing tags do not fit that HTML5 + MediaWiki extension model at all, given that consistency with the base spec is desirable (MediaWiki recently added support for HTML5 ids), and given that fixing the remaining pages that rely on this behavior is not a big burden and can be done automatically (even by Parsoid as a one-time fix by replacing self-closing tags with the <span></span> equivalent), I am not convinced there are strong use-case arguments for preserving this self-closing tag inconsistency.
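The automatic fix mentioned above (replacing a self-closed tag with its open/close pair) could look roughly like the following. This is a minimal sketch and not Parsoid's implementation: the function name is hypothetical, the `EXTENSION_TAGS` set is an assumed sample taken from the tags named earlier in this task, and the regex handles only simple tags in raw wikitext without expanding templates.

```python
import re

# Void elements as listed in the task description.
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
})

# Assumed sample of exempt extension tags (plus <pre>, which is exempt too).
EXTENSION_TAGS = frozenset({
    "ref", "references", "gallery", "syntaxhighlight", "nowiki", "pre",
})

# Matches a self-closed tag such as <b/> or <span id="x" />.
SELF_CLOSED = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)(\s[^<>]*?)?\s*/>")


def expand_self_closed(wikitext):
    """Replace non-void self-closed HTML tags with an explicit open/close pair."""

    def rewrite(match):
        name = match.group(1)
        if name.lower() in VOID_TAGS or name.lower() in EXTENSION_TAGS:
            return match.group(0)          # valid or exempt: leave as written
        attrs = (match.group(2) or "").rstrip()
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSED.sub(rewrite, wikitext)
```

For example, `<span id="example"/>` becomes `<span id="example"></span>`, while `<br />` and `<ref name="a" />` are left unchanged.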

If the goal were consistency, then you would also invalidate the inconsistent extension tags (notably <tvar>...</>). It's a fact that the notation <span id="..."/> is not inconsistent, given that it is supported by a well-known standard (XML).
And contributors to MediaWiki cannot rely only on the HTML5 standard; they are also used to legacy HTML4, XML, SVG and many other syntaxes that are partly supported (with restrictions and non-conforming extensions that are specific to MediaWiki, which selects what to support or not and constantly modifies the syntax used everywhere).

There was no inconsistency in the MediaWiki specifications as long as they were stable and did not require re-editing millions of articles or templates (not just in Wikimedia wikis but also in many others). HTML has been developed as a standard that preserves the legacy and provides upward compatibility as much as possible, but you want to deviate from this trend when changing the supported MediaWiki markup. To gain what? Nothing, except creating a huge workload to fix so many pages for no added content value: wikis don't want this unnecessary workload to pile up and finally remain unfixed, creating many more long-term problems than the temporary problem you may have in replacing TidyHTML with something that you don't want to adapt, even if this adaptation is extremely minor.

If you were developing an OS API and suddenly changed its spec, you would receive complaints from many developers and users that their existing apps are no longer compatible and don't work the way that was documented and massively used (and this is the case here). Maybe bots could solve some of these problems, but relying on bots to fix things is a bad decision and completely in opposition to what the vast majority of wiki contributors want. This technical change will just harass them when someone suddenly comes and says that what they did correctly in the past, and what was even documented, is now considered bad. They want to create content, and don't want that content to be suddenly broken by an inconsistent decision using the false argument that this documentation was "inconsistent", even though MediaWiki does not follow this rule for various things.

They will also not like it if bots trying to fix such pseudo-issues actually make things even worse with tons of massive edits whose value is in fact completely void, and which just stress the servers with more workload and fill the history with tons of edits that will be hard to follow.

Wikis need stability if you don't want contributors to abandon the projects and stop creating actual content that will be constantly "fixed" for no added value. This is then only an unmotivated decision by some MediaWiki developers who are deciding what is good for the project without consulting the community (but in fact, given how technical this change is, many contributors will not understand the issue: you will treat them as if they were too stupid to learn, when in fact they could oppose the limited vision of a few MediaWiki developers).

The statistics about this project are clear: most of the changes needed are massive. Most contributors will receive notifications of changes in pages they created, which will just produce a lot of noise for them; people will stop listening to notifications, will stop monitoring pages, and finally the content will just stall as is (and even the issues found in the statistics will remain unsolved for years, long after you have released and deployed a version using your change). And I bet that many non-WM wikis will choose to revert this change or will create their own patches to support again what you have removed. This means that you open the door to forks, splitting the communities instead of joining them in a common effort.

matmarex removed a subscriber: matmarex. Aug 11 2017, 8:20 PM

There are help pages but there is no wikitext specification. Wikitext behavior is defined by what the implementation currently does. And, the implementation has changed over the years, and will continue to change to meet new needs and demands, some of which will require tweaking wikitext on pages and templates. We want to migrate the use of HTML tags in wikitext to be consistent with HTML5 standards. Do note that enwiki has discussed the HTML5 question even before we considered it. To repeat what I said in T134423#3519293, "millions" of pages will not need fix up because of the changes we are introducing. The substantial changes require fixing templates. The closest that comes to your repeated claim about millions of pages is if we started requiring that obsolete tags (<center>, <big>, <font>, etc) should be fixed up. But, we are not requiring that. Right now, it is up to individual wikis what they want to do with it. As I indicated above, enwiki has discussed that, independent of us.

Wikis have always had bots and other changes made for MOS reasons or other style-related reasons. And, such changes to pages will continue to be made. The changes being made because of replacing Tidy are not necessarily different -- except that they are coming from MediaWiki devs, not wiki editors. In order to reduce the changes necessary, we have added Tidy compatibility code in RemexHTML to avoid fixes that can be handled in code. So, we are definitely not introducing work for editors inconsiderately. But, beyond categorical assertions about statistics being clear, if you have evidence of lots of editor complaints about changes being made to pages because of the deprecation in this task, please point me to those complaints. As far as I am aware, editors have been making these changes readily.

As for your complaints about tvar, we do have plans to fix the Translate extension. As for other extension tags, code in extension tags is extension-specific and is not always wikitext. So, the fact that they use svg, latex, xml, html4, bash, or whatever else they may rely on is not relevant to the discussion of whether the HTML tags used in wikitext should be HTML4 or HTML5.

As for non-WMF wikis, yes, we do not have any way of providing them support for making these kinds of fixes. For that and other reasons, the Tidy setting will continue to be configurable. So, if they choose to, they can continue to use Tidy4 and not make any of these changes to their wikis. But, Tidy4 is unmaintained at this point. The replacement is html5-tidy. So, there really is no long-term path outside of adopting the HTML5 standard.

But, coming back to this specific task, note that so far, we have only deprecated the use of these self-closing tags and have encouraged editors to fix them. We have not broken that behaviour. But yes, we could break that behaviour in the future if we find that the usage on wikis has dropped sufficiently. Ideally, we would not support self-closed tag behaviour indefinitely, but in the scheme of things that need fixing, this is relatively minor, so if there is pushback from wikis against moving from deprecation to breakage, we can reconsider this. But, without sufficient evidence that fixing self-closing tags to adhere to the HTML5 standard is unduly burdening editors, we will continue down this path of increased consistency and compatibility with HTML5 standards.

Restricted Application added a subscriber: jeblad. Sep 4 2017, 9:41 PM
Dvorapa added a comment. Edited Sep 4 2017, 11:17 PM

Side note: I thought we were moving off Tidy to plain PHP libraries, and I also thought Tidy's successor does not support self-closing tags (at least it is in MediaWiki-extensions-Linter's high priority group) and that we are therefore going to break that behavior.

Yes to both. We are replacing Tidy and Tidy's successor does not support self-closing tags. But, we have a workaround in the parsers before it hits (Tidy or) RemexHTML to prevent breakage. So, self-closing tags won't break if we never remove the workarounds in the parsers. We would like to remove these workarounds and hence we are encouraging editors to fix them. See first question in https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy/FAQ#Other_FAQs
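To make the idea of a pre-parser workaround concrete, here is a minimal Python sketch. It is not the MediaWiki code: the thread does not spell out what the workaround does, so this sketch assumes it expands a self-closed non-void HTML tag into an empty open/close pair, so that an HTML5 parser sees a balanced element instead of a dangling start tag. The function name `expand_self_closed` and the tag list `HTML_NON_VOID` are hypothetical and deliberately incomplete.

```python
import re

# Illustrative, incomplete set of non-void HTML tags (an assumption for this
# sketch, not MediaWiki's actual allowlist). Void tags such as <br/> and
# extension tags such as <nowiki/> are not listed, so they are left untouched.
HTML_NON_VOID = frozenset({
    "b", "i", "s", "u", "p", "span", "div", "sup", "sub", "small", "big",
})

# Matches <name/> or <name attrs />. Attribute values that themselves contain
# "<" or ">" are not handled by this simplified pattern.
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*?)?)\s*/>")


def expand_self_closed(wikitext: str) -> str:
    """Rewrite <b/> as <b></b> for the non-void tags listed above."""

    def repl(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if name.lower() not in HTML_NON_VOID:
            return match.group(0)  # void or extension tag: keep as written
        return f"<{name}{attrs}></{name}>"

    return SELF_CLOSED.sub(repl, wikitext)


print(expand_self_closed('foo<b/> bar <span class="x" /> <br /> <nowiki/>'))
```

Under this assumption, pages keep rendering as long as the rewrite step stays in place, which is why removing it depends on editors cleaning up the wikitext first.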

@ssastry Cool, I understand the concerns of wikis about breakage, but I personally support it. I watch both the Linter group and the maintenance category on Czech Wikipedia, and almost daily I fix new issues where an editor has added something like this: <sup>some note<sup/>. It happens to both experienced editors (typo) and beginners (lack of HTML knowledge). I'm in favor of breakage because the editor would immediately notice that something is wrong. If I didn't fix errors from the Linter/category, nobody would notice for weeks, months, or even years.
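The kind of mistake described above can be caught mechanically. The following Python sketch is not the Linter extension; it only illustrates the check under simplifying assumptions: the function name `find_invalid_self_closed` and the tag set `HTML_NON_VOID` are hypothetical, the tag set is incomplete, and the regular expression ignores attribute values that contain angle brackets.

```python
import re

# Illustrative, incomplete set of non-void HTML tags (an assumption for this
# sketch). Void tags (<br/>) and extension tags (<nowiki/>) are not flagged.
HTML_NON_VOID = frozenset({
    "b", "i", "s", "u", "p", "span", "div", "sup", "sub", "small", "big",
})

# Matches <name/> or <name attrs />.
SELF_CLOSED = re.compile(r"<([A-Za-z][A-Za-z0-9]*)(?:\s[^<>]*?)?\s*/>")


def find_invalid_self_closed(wikitext: str) -> list[tuple[int, str]]:
    """Return (line number, tag text) for each self-closed non-void HTML tag."""
    hits = []
    for lineno, line in enumerate(wikitext.splitlines(), start=1):
        for match in SELF_CLOSED.finditer(line):
            if match.group(1).lower() in HTML_NON_VOID:
                hits.append((lineno, match.group(0)))
    return hits


sample = "First line<br />\n<sup>some note<sup/>\n<nowiki/>"
for lineno, tag in find_invalid_self_closed(sample):
    print(f"line {lineno}: invalid self-closed tag {tag}")
```

A check like this only reports the location; deciding whether the editor meant `</sup>` or an empty element still needs a human look.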

Dvorapa added a comment. Edited Sep 4 2017, 11:58 PM

But my opinion about breakage applies only to self-closing tags. I don't like fixing Tidy whitespace bugs, nowrap and block-on-one-line bugs, redundant tables, and so on, because they are not syntactically wrong; at most they are wrong only semantically. For them, workarounds would perhaps be sufficient...

This discussion is tangential to this ticket. I'll respond here for now, but we should move additional discussion to the talk page on mediawiki so others can see it as well.

We don't have workarounds for the other categories since the right fix is not automatically available right now -- it requires editors to look at the error and fix it. Note that the redundant table category also hides many syntax errors (e.g., a missing closing table tag), and they are real syntactic errors when we move to HTML5. You cannot nest a table directly inside a table row - it has to be nested inside a table cell. But, in reality, wrapping it in a cell is not the right automatic fix because Tidy does something different.

As for the nowrap and tidy whitespace bug, I think they are found on only a small set of pages (usually 10s and 100s) and in most cases, fixing a few templates will fix them.

Dvorapa added a comment. Edited Sep 5 2017, 12:22 AM

This discussion is tangential to this ticket. I'll respond here for now, but we should move additional discussion to the talk page on mediawiki so others can see it as well.

Sure

As for the nowrap and tidy whitespace bug, I think they are found on only a small set of pages (usually 10s and 100s) and in most cases, fixing a few templates will fix them.

The Tidy whitespace bug in particular is really annoying in templates with ifs around <span>s and <br>s. Currently I'm thinking about how to fix this:

<span style="white-space:nowrap; display:inline;">'''{{ #if: {{{skóre|}}} | {{{skóre}}} | v }}''' {{#if:{{{prodl|}}}|<span style="font-size: 85%">([[Prodloužení|prodl.]])</span> }}</span>{{ #if: {{{celkově|}}} | <br /> <div style="font-size: 85%">('''{{{celkově}}}''' [[Playoff|celkově]])</div> | }}{{ #if: {{{penaltyskóre|}}} | <br /> <div  style="font-size: 85%">('''{{{penaltyskóre}}}''' [[Penaltový rozstřel|pen]])</div> | }}

I think I have to change the whole logic, but I'm not sure how

The redundant table group contained thousands of pages, and I don't think it will ever be reduced to zero on Czech Wikipedia.

See https://www.mediawiki.org/wiki/Topic:Txk8zb0ba4g3zsdi