
id attributes for Unicode code points start with "." and break validation
Closed, Declined (Public)

Description

Author: mgharish

Description:
Consider the markup:

{|
| <h1>ಅ</h1>
|-
| ಆ
|}

This is converted to HTML as:

<table> <tbody><tr> <td> <h1><span id=".E0.B2.85" class="mw-headline">ಅ</span></h1> </td> </tr> <tr> <td>ಆ</td> </tr> </tbody></table>

Look at the <span> tag, which is not present in the actual wiki markup. MediaWiki takes the UTF-8 bytes of the text ಅ (Kannada letter A), writes each byte as "." followed by two hex digits, and uses the result as the value of the "id" attribute. Because the value starts with ".", it breaks XHTML validation of the page. Why does MediaWiki introduce this extra <span> tag? This happens only for the heading tags <h1> to <h6>, and not for any other tag.
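For illustration, here is a rough Python sketch of how an id like ".E0.B2.85" can be derived from the heading text. It assumes the id is simply the percent-encoded UTF-8 bytes with "%" replaced by "."; the function name legacy_heading_id is hypothetical, and this is not MediaWiki's actual code.

```python
from urllib.parse import quote


def legacy_heading_id(text: str) -> str:
    """Hypothetical helper: approximate a dot-encoded heading id.

    Assumption: spaces become underscores, every byte that is not an
    unreserved ASCII character is percent-encoded from its UTF-8 form,
    and "%" is then swapped for ".".
    """
    with_underscores = text.strip().replace(" ", "_")
    percent_encoded = quote(with_underscores, safe="")  # e.g. "%E0%B2%85"
    return percent_encoded.replace("%", ".")


if __name__ == "__main__":
    # Kannada letter A (U+0C85) is the three UTF-8 bytes E0 B2 85.
    print(legacy_heading_id("ಅ"))
```

A plain ASCII heading such as "x" passes through unchanged under this sketch, so only non-ASCII headings end up with a leading ".".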

Thanks & Regards,
Harish


Version: unspecified
Severity: blocker
URL: http://kn.wikipedia.org/w/index.php?title=%E0%B2%B5%E0%B2%BF%E0%B2%95%E0%B2%BF%E0%B2%AA%E0%B3%80%E0%B2%A1%E0%B2%BF%E0%B2%AF:%E0%B2%AA%E0%B3%8D%E0%B2%B0%E0%B2%AF%E0%B3%8B%E0%B2%97_%E0%B2%B6%E0%B2%BE%E0%B2%B2%E0%B3%86&oldid=197048

Details

Reference
bz28164

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:27 PM
bzimport set Reference to bz28164.
bzimport added a subscriber: Unknown Object (MLST).

The extra span tag isn't what is messing up the validation; rather, it is the contents of the id attribute.

The span is there because "<h1>x</h1>" is treated identically to "= x =".

mgharish wrote:

Yeah, right. Why is that id needed?

ayg wrote:

Because the heading will be added to the table of contents, if there is one, and the table of contents will then link to it. In theory we could avoid emitting the id if there's no TOC, but I don't see the gain. It just reduces consistency.

I'm resolving WONTFIX because:

  1. This markup is valid in HTML5. $wgHtml5 = false is still supported for now, but it won't be supported forever and it's not the default, so the motivation for fixing it is limited.
  2. We don't want to break existing ids without good reason, and there's no good reason to break them in non-HTML5 mode if we're going to eventually remove support for it anyway and re-break them.
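
To make the first point concrete, here is a small Python sketch comparing the two id rules. It assumes the HTML 4 rule (an id must begin with an ASCII letter, followed by letters, digits, "-", "_", ":" or ".") as a stand-in for what the XHTML validator rejects, and the HTML5 rule (at least one character, no space characters). The function names are hypothetical and the checks are simplified.

```python
import re

# Simplified HTML 4 ID token: a letter, then letters, digits, "-", "_", ":", ".".
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")

# Space characters as defined by HTML5: space, tab, LF, FF, CR.
HTML5_SPACE_CHARS = " \t\n\f\r"


def is_valid_html4_id(value: str) -> bool:
    """Hypothetical check against the simplified HTML 4 rule above."""
    return HTML4_ID.fullmatch(value) is not None


def is_valid_html5_id(value: str) -> bool:
    """Hypothetical check: non-empty and free of HTML5 space characters."""
    return len(value) > 0 and not any(ch in HTML5_SPACE_CHARS for ch in value)


if __name__ == "__main__":
    heading_id = ".E0.B2.85"
    print(is_valid_html4_id(heading_id))  # leading "." fails the HTML 4 rule
    print(is_valid_html5_id(heading_id))  # accepted under the HTML5 rule
```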