⚓ T11530 Section heading anchors shouldn't begin with invalid characters

		Status	Subtype	Assigned	Task
		Resolved		None	T2209 [DO NOT USE] HTML validity (tracking)
		Declined		None	T11530 Section heading anchors shouldn't begin with invalid characters

• bzimport raised the priority of this task to Low. Nov 21 2014, 9:41 PM

• bzimport added a project: MediaWiki-Parser.

• bzimport set Reference to bz9530.

• bzimport added a subscriber: Unknown Object (MLST).

• bzimport created this task. Apr 8 2007, 6:40 PM

ayg wrote:

Every a element in the URL in question that has a name attribute appears to also have a matching id attribute.

Every id in the URL in question begins with either a letter or underscore.

The URL you linked to appears to be completely correct.

pixeltoo wrote:

1) The name attribute is deprecated. If every anchor has both an id and a name, how do
you explain this warning at line 275?

#line 275 column 4 - Warning: <a> cannot copy name attribute to id

source html: <p><a name=".C3.82ge_d.E2.80.99or_.281890-1909.29"></a></p>

Why do you use an underscore at the beginning, when the id attribute must begin with a letter ([A-Za-z])?

example:
line 1365 column 1 - Warning: <li> ID "_note-effetpanneaux" uses XML ID syntax

source html:
<li id="_note-effetpanneaux"><a href="#_ref-effetpanneaux_0" title="">↑</a>
<span style="cursor:help"><span
style="font-family:monospace;font-weight:bold;font-size:small"
title="Langue : français">(fr)</span></span>

ayg wrote:

id's in XML may begin with letters, colons, or underscores. See: <http://www.w3.org/TR/2006/REC-xml-20060816/#NT-Name>. It's true that this is not permitted in HTML 4.0, which is presumably where the warning comes in, but we aren't using HTML 4.0, so I don't see an issue. The compatibility problems would probably be minor.

You're correct that we aren't careful to avoid headings that don't start with valid characters.
Repurposing to that.

pixeltoo wrote:

ok thanks

ayg wrote:

This is now fixed if $wgEnforceHtmlIds is set to false. If all goes well, it should be fixed by default sometime this week.

ayg wrote:

*** Bug 4515 has been marked as a duplicate of this bug. ***

Bug 10218 is more general, should this be duped to it?

ayg wrote:

No, it's better to have more specific bugs so we can close them fixed when each one is fixed, and generally maintain their status separately.

validate --xml says
Error at line 85, character 34: value of attribute "id" invalid: "." cannot start a name

<a name=".E8.B7.AF.E7.B7.9A" id=".E8.B7.AF.E7.B7.9A"></a><h2><span class="editsection">[<a href="/index
.php?title=%E5%B7%A8%E6%A5%AD_%E5%BD%B0%E5%8C%96-%E5%A4%A7%E8%82%9A-%E6%B2%99%E9%B9%BF&action=edit&sect

*** Bug 18548 has been marked as a duplicate of this bug. ***

*** Bug 18838 has been marked as a duplicate of this bug. ***

herd wrote:

*** Bug 19890 has been marked as a duplicate of this bug. ***

*** Bug 20225 has been marked as a duplicate of this bug. ***

*** Bug 22418 has been marked as a duplicate of this bug. ***

*** Bug 22693 has been marked as a duplicate of this bug. ***

ayg wrote:

Note that this is moot if $wgHtml5 is true, since HTML5 permits id's to begin with any character they can contain. I'll leave the bug open because we still do support XHTML1 as an output format for now, but I don't see it as likely that anyone will fix this bug.

(Also:

(In reply to comment #5)

This is now fixed if $wgEnforceHtmlIds is set to false. If all goes well, it
should be fixed by default sometime this week.

I love how optimistic I always am about things getting enabled. :) )

http://www.mediawiki.org/wiki/Manual:$wgEnforceHtmlIds
$wgEnforceHtmlIds was renamed to $wgExperimentalHtmlIds in r61691 (1.16) and converted to use laxer HTML 5 syntax.

http://www.mediawiki.org/wiki/Manual:$wgExperimentalHtmlIds
This option is for testing -- when the functionality is ready, it will be on by default with no option.

ayg wrote:

. . . yes, and?

*** Bug 25907 has been marked as a duplicate of this bug. ***

What's the history of the practice of trying to encode the name of the section in the anchor for that section?

It seems to get very messy and unpredictable unless the heading text is written in Latin characters without any punctuation. And even in those cases, it's still possible that the heading text will be encoded as the same ID as other IDs on the page, such as those used by the skin or other software that renders user interfaces.

The IDs seem to be effectively unpredictable in a couple of ways:

The encoding does not follow a standard encoding algorithm, so any string that's not [a-zA-Z0-9 ] is converted to something that only someone who knows the algorithm well would expect.
The anchor names could intersect with IDs used on the page for other things. An effort has been made to make skin IDs use the mw-* namespace, but that's still not a guarantee, just a bit less likely.

If anchor names were encoded in a predictable way, such as id="section-1.2" the anchors would be able to correspond to the table of contents, which is pretty simple and straightforward, plus we could know for sure that there would never be collisions with IDs so long as we never use the section-* namespace in skins or other software. Since we have more control over the software than the content, this seems like a superior approach.

The point is to be able to link to a specific section of an article from another article (or even externally). Yes, it is possible that the name will change, but often that does not happen, especially for articles using a specific template. Naming it "section-1.2" would not really provide anything useful, and would be even more likely to change than the name of the header.

"but often that does not happen"

What is this based on?

Personal experience? Maybe you edit different articles than I do, but a lot of them are relatively static, especially in terms of section headers. And that's just on Wikipedia.

Section header links are used a lot. Maybe they shouldn't be, in their current state, but they are. Getting rid of them entirely is not a good idea, nor is replacing them with something that would be useless for their main purpose: linking to a specific part of an article.

Jumping to named sections is very common. Numbered sections change, so cannot be used for section jumping.

And it's not that difficult. A section anchor is basically: section title -> uri encode -> replace ( '%', '.' )
I do agree we have a lot of issues with id clashing. Anchors really should have their own prefix, but changing that now would be rather disruptive I fear...
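
For readers following along, the three steps described above (section title, URI-encode, replace '%' with '.') can be sketched in Python. This is only an illustration and not MediaWiki's code: the function name `legacy_anchor` is made up here, and turning spaces into underscores first is an assumption inferred from the ids quoted earlier in this thread.

```python
from urllib.parse import quote


def legacy_anchor(title: str) -> str:
    """Sketch of the legacy scheme: spaces to '_', percent-encode, '%' to '.'."""
    # Assumption: spaces become underscores before encoding, as in the quoted ids.
    underscored = title.strip().replace(" ", "_")
    # Percent-encode the UTF-8 bytes; quote() leaves letters, digits and "_.-~" alone.
    encoded = quote(underscored, safe="")
    # Swap each percent sign for a dot, which is what produces a leading dot
    # whenever the heading starts with a non-ASCII character.
    return encoded.replace("%", ".")


print(legacy_anchor("Résumé"))  # R.C3.A9sum.C3.A9
print(legacy_anchor("路線"))     # .E8.B7.AF.E7.B7.9A
```

The second call shows the case this bug is about: the heading starts with a non-ASCII character, so the id starts with a dot.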

ayg wrote:

(In reply to comment #20)

What's the history of the practice of trying to encode the name of the section
in the anchor for that section?

The same as the practice of trying to encode the name of the article in the URL for the article, I imagine. Pretty URLs are nice.

It seems to get very messy and unpredictable unless the heading text is written
in latin characters without any punctuation.

$wgExperimentalHtmlIds is enabled in trunk, so this is no longer the case -- non-Latin scripts and punctuation will work fine. (Although there are still a bunch of other problems.)

The encoding does not follow a standard encoding algorithm, so any string
that's not [a-zA-Z0-9 ] is converted to something that only someone who
knows the algorithm well would expect.

This is no longer the case on trunk. (Actually, legacy id's are just urlencoded as UTF-8, but with "%" replaced by ".", so that's not really nonstandard. But it is ugly.)

The anchor names could intersect with IDs used on the page for other things.
An effort has been made to make skin IDs use the mw-* namespace, but that's
still not a guarantee, just a bit less likely.

Yes, this is a problem.

If anchor names were encoded in a predictable way, such as id="section-1.2" the
anchors would be able to correspond to the table of contents, which is pretty
simple and straightforward, plus we could know for sure that there would never
be collisions with IDs so long as we never use the section-* namespace in skins
or other software. Since we have more control over the software than the
content, this seems like a superior approach.

We could also make our URLs use page_id instead of the article title, but I don't think it's desirable. The section name is more stable than the number, because it doesn't change when sections are added or removed, and adding/removing sections is more common than renaming them. (No, I have no stats on this, but it's clear to me from personal experience.)

The section name can also be typed manually or copy-pasted from the rendered page, not just copy-pasted from the URL, so it's more convenient. You could type a #section-1.2 type anchor manually too, but only if you count the sections, which isn't worth it on large articles.

And the section name gives you an idea of what section you're being linked to before you click the URL. The section number is opaque.

There are indeed some problems with the way we do things in trunk. Overall, IMO, they're not enough to offset the (modest) advantages we get from using section names instead of numbers. It would be pretty easy to more or less eliminate anchor collisions from headers by just making a big pattern of reserved anchors, including unprefixed ones like "content" and "top", and tweaking header id's if they matched -- we wouldn't get them all, but we'd make the problem really negligible.
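
The reserved-anchor idea in the last paragraph could look roughly like the sketch below. Everything specific in it is an assumption for illustration: the reserved pattern (the mw-* prefix plus "content" and "top"), the function name `avoid_reserved`, and the choice of appending "_2" as the tweak are not taken from MediaWiki.

```python
import re

# Hypothetical reserved pattern: the mw-* prefix plus a few unprefixed skin ids.
RESERVED = re.compile(r"^(mw-.*|content|top)$")


def avoid_reserved(anchor: str) -> str:
    """Return the anchor unchanged unless it matches a reserved id."""
    if RESERVED.match(anchor):
        # Assumed tweak: add a suffix so the heading id no longer collides.
        return anchor + "_2"
    return anchor


for heading_id in ["History", "content", "mw-head"]:
    print(heading_id, "->", avoid_reserved(heading_id))
```

A real list of reserved ids would have to be collected from the skins and extensions in use, which is why this would reduce collisions rather than rule them out.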

Maybe the solution would be to use Punycode encoding? (And really, you should avoid dots everywhere, because CSS selectors do not handle them well.)
Punycode (used in IDN) solves all these problems. We just don't have to restrict dots (we don't need nameprep and its very complex internal character-equivalence mappings); they can be Punycoded like the rest.

The good thing about Punycode is that it uses only letters and digits and is case-insensitive; hyphen-minus characters are used to separate "words" made of case-insensitive letters and digits, and the result can be mapped to JavaScript properties (as in the HTML5 dataset).

Note that this won't necessarily make the IDs built from section headers unique (there is still the frequent case where multiple headers for distinct sections have identical text content); if this ever happens, some suffix should be appended to the duplicate section headers only.

Oh, my... changing the algorithm will force all preparsed pages to have their HTML flushed from the server cache, and could break existing URLs that are inserted in discussions as is. But that is not a major problem, as the URLs themselves would still load; only the anchors would not be found. This already happens frequently, independently of this bug, simply because not enough people know how to use the MediaWiki parser functions for computing anchor links. We have utility templates on all wikis for this; it is a matter of training, but it is not critical in discussions.

In main articles, however, we occasionally find links to other article sections: a quality check of these pages should really use the URL-building parser functions. Or the articles should contain manually inserted (and predictable) anchors (using <span id=""></span>), independently of the text used in section headings.

ayg wrote:

(In reply to comment #26)

May be the solution would be to use Punycode encoding ?

Punycode is gibberish. We want readable anchors, if it's not too problematic. And it's not.

(and really, you should
avoid dots everywhere because they are not liked in CSS selectors.

People are not likely to refer to section id's in CSS. If they do, they can escape special characters with backslashes.

ho94949 wrote:

(In reply to comment #26)

May be the solution would be to use Punycode encoding ?

Using Punycode may cause problems.
First, it can produce duplicates between titles.

For example, take a title consisting of è.
In Punycode that is xn--8ca.
If another section is then given the literal title xn--8ca,
we cannot tell the two apart.

Also, think about span tags that appear in the article.

Yes I know, but the id duplication is another problem (also for HTML5 conformance and for having autogenerated summaries to link to the appropriate section when we click on them).

Yes, there's an extra need to make these IDs unique (required in XHTML) by adding suffixes to duplicate section headings when they exist in a page, but this is another issue, independent of this one, that should be handled automatically without any additional markup in the edited pages. This duplication is extremely frequent in vote pages (with standardized subsection headings like "Approve" or "Oppose" or "Neutral"). Adding a span tag will not resolve the issue with the standard summaries, which completely ignore this markup in the autogenerated anchors.

Here we were speaking about invalid characters, and it is clear that a valid ID must not contain any dot (and at least must not start with it), and that converting them using ".XX" hex sequences for each non-ASCII UTF-8-encoded character is also not needed in most cases (an ID can perfectly well accept non-ASCII letters without this extra encoding to ASCII on top of UTF-8). Really, the generated IDs should be the same and compatible for direct use in URLs, or in CSS selectors, or for the XML syntax. This is possible, but it will require a better encoding than the bogus current one, plus the general need to make them unique by adding some suffixes for duplicates.

ayg wrote:

(In reply to comment #29)

Yes I know, but the id duplication is another problem (also for HTML5
conformance and for having autogenerated summaries to link to the appropriate
section when we click on them).

The algorithm already accounts for this. E.g.,

Foo

"Foo"

will give the latter an anchor of #Summary_2 in current trunk. First the anchor is generated, then a number is appended if it's the same as a previous anchor. You have to have this code anyway, to handle cases like

Foo

So punycode doesn't gain anything for uniqueness.
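
The "generate the anchor, then append a number if it repeats an earlier one" step described above can be sketched as follows. This is a minimal illustration, not the parser's code: the function name `dedupe_anchors` is invented, and the exact suffix format ("_2", "_3", ...) is inferred from the #Summary_2 example in this comment.

```python
def dedupe_anchors(anchors):
    """Append _2, _3, ... to any anchor that repeats an earlier one."""
    seen = set()
    result = []
    for anchor in anchors:
        candidate = anchor
        n = 2
        # Keep bumping the suffix until the candidate is unused on this page.
        while candidate in seen:
            candidate = f"{anchor}_{n}"
            n += 1
        seen.add(candidate)
        result.append(candidate)
    return result


print(dedupe_anchors(["Summary", "Summary"]))
# ['Summary', 'Summary_2']
print(dedupe_anchors(["Oppose", "Oppose", "Oppose"]))
# ['Oppose', 'Oppose_2', 'Oppose_3']
```

Because the check runs on the final anchor string, it applies whatever encoding produced that string, which is the point being made about Punycode not helping with uniqueness.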

Here we were speaking about invalid characters, and it is clear that a valid ID
must not contain any dot (and at least must not start with it)

Valid ID's in XHTML 1.0 and HTML5 may contain a dot. Valid ID's in HTML5 may start with a dot.

and that
converting them using ".XX" hex sequences for each non-ASCII UTF-8-encoded
character

We no longer do this in trunk. We just convert runs of whitespace and other bad characters to a single underscore, and otherwise output as-is (possibly with a number appended).
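
As a rough sketch of that behaviour: the comment does not say which "other bad characters" are replaced, so the example below handles only whitespace runs and labels that as an assumption. The name `html5_anchor` is made up for illustration and is not a MediaWiki function.

```python
import re


def html5_anchor(title: str) -> str:
    """Collapse whitespace runs to one underscore; leave other characters as-is."""
    # Assumption: only whitespace is treated as "bad" here, since the set of
    # other replaced characters is not spelled out in the thread.
    return re.sub(r"\s+", "_", title.strip())


print(html5_anchor("Résumé"))        # Résumé
print(html5_anchor("See   also\t!"))  # See_also_!
```

Unlike the legacy scheme, non-ASCII letters pass through untouched, so the id stays readable.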

Really, the generated IDs should be the same and compatible for direct use in
URLs, or in CSS selectors, or for the XML syntax. This is possible, but it will
require a better encoding than the bogus current one, plus the general need to
make them unique by adding some suffixes for duplicates.

The id's being output by trunk in HTML5 mode can be used directly in URLs (in reasonably recent browsers), can be used in CSS selectors with proper escaping (although I doubt much of anyone does), and can be used in XML just as in any other markup language.

I really *did not* say that Punycode would solve the duplicates. In fact you're restating exactly what I said (I even said that this was a separate issue).

I never intended that Punycode would solve duplicates. It was just possible to use it as an alternative to the incorrect syntax of existing id's that contain dots.

You can argue anything you want but anything generated like:
id=".C2.BF"
is completely invalid. There MUST not be any dot in id's generated from non-ASCII characters, and the current encoding exposes each UTF-8 byte in its hex form, which is really inefficient (in terms of length) and really unreadable (no more readable than Punycode), when many more letters outside ASCII are perfectly valid in HTML and XML id's.

Why can't we have ids like:

id="Résumé" (which is perfectly valid)

and we still have to see things like:

id="R.C2.E9sum.C2.E9" (which is completely invalid)

???

That's ALL I was commenting on (and I did not myself introduce the separate problem of duplicates).

You misunderstood or simply did not read my own statements. Punycode was ONLY a suggestion for the first problem. It is of course based on a framework where we absolutely don't need to keep the additional "xn--" prefix, which is definitely not part of Punycode itself but part of its use in IDNA (which has a much more restricted subset of valid characters than the set of valid characters in XML id's). But Punycode still offers a good encoding framework for building valid XML id's for the case where some characters are restricted (we'll still need to encode in some way the presence of dots in NON-encoded section headings).

See this reference for the syntax of names (and the restricted set) in XML:

http://www.w3.org/TR/REC-xml/#NT-Name

You'll immediately see that the ONLY ASCII characters valid everywhere in IDs are the ASCII letters and the underscore; and plenty of other non-ASCII characters are accepted which don't need any UTF-8 byte-based ".NN" hex encoding like it is done today (in an overlong form).

Additional ASCII characters are accepted ONLY after the initial position (this includes the dot, the minus-hyphen, and ASCII digits); additional NON-ASCII characters are also accepted after the first position (notably the combining characters).
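
To make the Name rules concrete, here is a small Python check built from the NameStartChar and NameChar ranges of the XML 1.0 (Fifth Edition) Name production, which is what the linked reference currently describes. It is a sketch for experimenting with the ids discussed in this thread; `is_xml_name` is an illustrative name, and the production also allows a leading colon, which the code includes.

```python
import re

# NameStartChar ranges from the XML 1.0 (Fifth Edition) Name production.
NAME_START = (
    r":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits and a few combining/extender ranges.
NAME_CHAR = NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

XML_NAME = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_xml_name(value: str) -> bool:
    """True if the whole string matches the XML Name production."""
    return XML_NAME.fullmatch(value) is not None


for candidate in ["Résumé", "_note-effetpanneaux", ".C2.BF", "R.C3.A9sum.C3.A9"]:
    print(candidate, is_xml_name(candidate))
```

Under these rules a dot fails only in the first position, so a dot-encoded id is rejected when the heading begins with an encoded character and accepted otherwise.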

Note that this is most problematic with non-Latin wikis (see the Chinese, Korean, Japanese, Arabic, Hebrew, Thai wikis!), where almost all characters are hex-encoded (in overlong sequences) and expose an invalid leading dot, when no hex encoding at all was even necessary in most cases.

Do you still argue that these hex-encoded ID's are "readable" ??? They're definitely not !

OK, so leading dots are invalid.

Simple fix: add a prefix to all section headers, such as section_ to ensure the id always starts with a valid character. MediaWiki dot encodes UTF characters as needed, which are valid in any position other than the first.

The cite.php citation extension does this in the same manner, using cite_ref- and cite_note- as id prefixes to ensure valid HTML output.

ayg wrote:

(In reply to comment #31)

You can argue anything you want but anything generated like:
id=".C2.BF"

MediaWiki trunk does not create such id's.

is completely invalid.

Not in HTML5.

Why can't we have ids like:
id="Résumé" (which is perfectly valid)
and we still have to see things like:
id="R.C2.E9sum.C2.E9" (which is completely invalid)
???

In trunk, the wikitext

== Résumé ==

produces

<div id="R.C3.A9sum.C3.A9"></div>
<h2><span class="mw-headline" id="Résumé">Résumé</span></h2>

The link from the table of contents is <a href="#Résumé">. (The extra div is kept so old links don't break -- it can be removed eventually.)

(In reply to comment #32)

See this reference for the syntax of names (and the restricted set) in XML:

http://www.w3.org/TR/REC-xml/#NT-Name

HTML5 does not define the id attribute to be an XML Name. In fact, it has no DTD at all. The restrictions on the id attribute in HTML5 (in both text/html and XML syntax) are stated here:

"The value must be unique amongst all the IDs in the element's home subtree and must contain at least one character. The value must not contain any space characters."
http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute
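
The quoted rule is short enough to check mechanically. The sketch below is only an illustration of the three conditions (non-empty, no space characters, unique); the function name `invalid_html5_ids` is invented, and the set of "space characters" used (space, tab, LF, FF, CR) is the HTML definition as I understand it rather than something stated in this thread.

```python
# HTML "space characters": space, tab, line feed, form feed, carriage return.
HTML_SPACE_CHARS = " \t\n\f\r"


def invalid_html5_ids(ids):
    """Return (id, reason) pairs for ids that break the quoted HTML5 rule."""
    problems = []
    seen = set()
    for value in ids:
        if value == "":
            problems.append((value, "empty"))
        elif any(ch in HTML_SPACE_CHARS for ch in value):
            problems.append((value, "contains a space character"))
        elif value in seen:
            problems.append((value, "duplicate"))
        seen.add(value)
    return problems


print(invalid_html5_ids([".", "Résumé", "foo bar", "", "."]))
# [('foo bar', 'contains a space character'), ('', 'empty'), ('.', 'duplicate')]
```

Note that the first "." passes: nothing in the quoted rule restricts which character an id starts with.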

As further evidence that id's in HTML5 are allowed to start with dots in XML as well as text/html format, see this link:

http://validator.nu/?doc=data%3Aapplication%2Fxhtml%2Bxml%2C%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+id%3D%22.%22%3E%3Chead%3E%3Ctitle%3E%3C%2Ftitle%3E%3C%2Fhead%3E%3Cbody%3E%3C%2Fbody%3E%3C%2Fhtml%3E

This is not permitted by XHTML 1.0, which we still technically support, but that's obsolete and I don't expect us to care if we break XHTML 1.0 validation in some cases in the future. Eventually we're likely to remove the mode entirely. This bug will be INVALID as soon as $wgHtml5 is removed and set always to true.

(In reply to comment #34)

Do you still argue that these hex-encoded ID's are "readable" ??? They're
definitely not !

The id's *in trunk* are readable. Example heading taken from he.wikipedia.org:

== איפה נמצא העמוד ראשי של מחר? ==

id generated in trunk:

id="איפה_נמצא_העמוד_ראשי_של_מחר?"

This is valid in HTML5 and works in the (recent-enough) browsers it's been tested in.

Yes, XHTML 1.0 will be deprecated, but only its "modular" design, which depends heavily on validation with external schemas that have to be declared in the document.

But HTML5 does NOT deprecate compatibility with XML, and it explicitly says that it will support TWO syntaxes for its serialization: the historical SGML-based HTML syntax AND the XML syntax. The only difference is that it will not be modular and the schema for validating the content model will be inferred. This means that you'll still be able to use an XML parser, but the schema will not be specified by the document itself, so the document can be processed in "standalone" mode using a schema provided by the application using the XML parser, instead of by the document or by the site producing it.

In other words, the XML constraints on names continue to apply to HTML5 for strict conformance; otherwise it will not be possible to use an XML parser for it, or to embed HTML5 in an XML document.

XHTML 1.0 is also being abandoned because it created a fork from HTML in a separate branch. HTML5 remerges the two branches and deprecates the modular design and free extensions based on non-standard schemas.

Note also that XML documents do not need to validate, but must still observe the conformance rules. Even a non-validating XML parser will choke on invalid IDs if they are presented with the "xml:id" pseudo-attribute. But I agree with you: id's in HTML5 are not expressed with an "xml:id" pseudo-attribute, but with a plain "id" attribute; if the XML parser does not validate the schema, it will accept an "id" attribute containing anything. But as soon as you want to build a schema for your XML-parsed document, it will be impossible to use anything other than the XML Name type that restricts its value, if you want compatibility with schemas built for XHTML 1.0, unless you make this "id" attribute an unrestricted text type.

ayg wrote:

(In reply to comment #37)

But HTML5 does NOT deprecate compatibility with XML, and it explicitly says
that it will support TWO syntaxes for its serialization: the historical
SGML-based HTML syntax AND the XML syntax.

Correct. However, the id attribute is not defined as an XML Name in HTML5's XML syntax.

In other words, the XML constraints on names continue to apply to HTML5 for
strict conformance; otherwise it will not be possible to use an XML parser for
it, or to embed HTML5 in an XML document.

You're mistaken. There is no conformance mode in HTML5 that prohibits id=".", nor any notion of "strict conformance" defined anywhere in HTML5. And XML parsers can handle such id's just fine. It will not validate in a DTD that makes id="" a Name, but HTML5 has no DTD, so this is fine.

But as soon as you want to
build a schema for your XML-parsed document, it will be impossible to use
anything other than the XML Name type that restricts its value, if you want
compatibility with schemas built for XHTML 1.0, unless you make this "id"
attribute an unrestricted text type.

Correct. Any DTD that's compatible with HTML5's requirements will make "id" an unrestricted text type. Any DTD that restricts it to Name will incorrectly declare valid HTML5 documents to be invalid.

*** Bug 29843 has been marked as a duplicate of this bug. ***

With a dot in such id, it won't be possible to reference the element by Id using DOM (getElementById() can only return a single element, unlike getElementsByTagName()). And CSS selectors won't work with those id's if they contain a dot.

We still need a correct implementation of anchorEncode() that operates correctly with CSS selectors, and with the DOM API.

Note: Uniqueness of generated Id's is still needed, so beside anchorencode() we also need a suffix generation for duplicates (Facebook does that in its framework, look at how the application-generated HTML is transformed when the page is generated: all application-generated Id's are dynamically suffixed, as well as the generated CSS class names, to create a correct sandboxing isolating them in their own namespace, and enforcing the uniqueness to avoid garbling the content generated outside the framework or from another app, and this is an excellent initiative ; Google does that also for application gadgets for iGoogle, using its framework).

ayg wrote:

(In reply to comment #40)

With a dot in such id, it won't be possible to reference the element by Id
using DOM

getElementById() will accept id's that have dots in them just fine.

And CSS selectors won't work with those id's if they
contain a dot.

You can escape the dot, like #foo\.bar { color: red }.

We still need a correct implementation of anchorEncode() that operates
correctly with CSS selectors, and with the DOM API.

We have one. Try viewing a page like this, e.g., by inputting it at http://software.hixie.ch/utilities/js/live-dom-viewer/:

<!DOCTYPE html>
<span id=foo.bar>Hello</span>
<style>#foo\.bar { color: red }</style>
<script>document.getElementById("foo.bar").innerHTML += "!"</script>

You'll see "Hello!", red and with an exclamation point.

Note: Uniqueness of generated Id's is still needed, so beside anchorencode() we
also need a suffix generation for duplicates

We've always done this for headings, although there are various bugs, including ignoring non-heading id's.

*** Bug 29877 has been marked as a duplicate of this bug. ***

Marking as WONTFIX as we have completely removed XHTML 1.0 support from core. Its rules are no longer relevant and hence this bug is invalid.

You've stated that you support HTML5, which includes XHTML5. It will work as long as the HTML5/XHTML5 parsers do not attempt to map its type to a name or to an XML id. As long as the schema validator used keeps this attribute as an unrestricted text type, and the HTML DOM accepts this (including through JavaScript), we can live with it.

But ensuring the uniqueness of id values is still a problem when you use document.getElementById() and you don't know which element will be returned. Apparently, browsers have implemented this JavaScript API so that they will return an array of elements if this ever occurs (and it's up to JavaScript applications to be aware that a single element *may* not be returned by this call, just like with document.getElementsByName()...). This means that the id attribute now duplicates the function of the name attribute in (X)HTML5, and we can ignore the non-working "validity" restrictions of XHTML1 and HTML4 or earlier in their schemas.

But we still need a way to create unique anchors which will remain readable and more or less stable when linking between different articles. For now MediaWiki does not track anchors (id's) that are referenced between articles, these anchors are modified in articles without notice, and links from other articles no longer work as expected. And MediaWiki still does not warn editors when we have two sections showing the same heading in the same article, so we can fix them to have working links, with readable anchors usable in other articles.

(X)HTML5 does not define id as a name/XML id, any code handling it as such is a non-conforming parser and there is no reason to support it.

The parser already ensures the uniqueness of ids for headers within a page. ID uniqueness in other locations is covered by bug 7356 and bug 35371. If you think we should add extra processing to track user-specified id="" values and reject them when a user writes bad markup duplicating ids, open a new bug.
