Do not allow #, %, [, ], nbsp in fragment identifiers
Open, Public

Description

Testcase

The characters "#", "%", "[" and "]", as well as any Unicode whitespace characters (no-break space etc.), should be banned in HTML5 IDs because they trigger a validation error if used in an href attribute. At least "#" and "%" also cause problems in practice (and not just in IE6, as the comment from r62134 suggests); see the attached testcase.
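As a rough illustration of the proposed ban, the following Python sketch lists the characters in a candidate ID that fall into the banned set. The function name `banned_characters` is hypothetical, and `str.isspace()` is only an approximation of "Unicode whitespace"; this is not MediaWiki's or the validator's actual check.

```python
from typing import List

# The four ASCII characters named in this report.
BANNED_ASCII = set("#%[]")


def banned_characters(html_id: str) -> List[str]:
    """Return the characters of html_id that the report proposes to ban.

    Whitespace detection relies on Python's str.isspace(), which is an
    assumption standing in for "any Unicode whitespace character".
    """
    return [ch for ch in html_id if ch in BANNED_ASCII or ch.isspace()]


# A no-break space (U+00A0) and a "#" are both flagged.
print(banned_characters("a\u00a0b#c"))
```

An empty result means the ID contains none of the characters discussed here.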


Version: unspecified
Severity: enhancement

attachment fragments.html ignored as obsolete

bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz24918.
Entlinkt created this task. Via Legacy, Aug 24 2010, 4:27 AM
Entlinkt added a comment. Via Conduit, Aug 24 2010, 4:58 AM

Created attachment 7649
Extended testcase

It seems that percent-encoding (the only way to avoid the validation error) does not work at all in any IE version and is implemented inconsistently in other browsers.

Attached: fragments.html

Entlinkt added a comment. Via Conduit, Aug 24 2010, 5:59 AM

It seems that the disallowed characters are based on section 2.2 of RFC 3987: ":", "/", "?", "#", "[", "]", "@" (gen-delims) minus "/" and "?" (explicitly allowed for ifragment) minus ":" and "@" (explicitly allowed for ipchar) plus "%" (special case) gives "#", "%", "[" and "]" in the end.
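The derivation above can be sketched as set arithmetic. This is only an illustration of the reasoning in this comment; the variable names are made up for the example.

```python
# gen-delims as listed in section 2.2 of RFC 3987
GEN_DELIMS = set(":/?#[]@")

# "/" and "?" are explicitly allowed for ifragment
ALLOWED_FOR_IFRAGMENT = set("/?")

# ":" and "@" are explicitly allowed for ipchar
ALLOWED_FOR_IPCHAR = set(":@")

# "%" is the special case: it may only appear as the start of a percent-escape
disallowed = (GEN_DELIMS - ALLOWED_FOR_IFRAGMENT - ALLOWED_FOR_IPCHAR) | {"%"}

print(sorted(disallowed))
```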

I don't know why Unicode whitespace characters aren't allowed, but the HTML5 validator complains about <a href="#&nbsp;"></a>, <a href="#&thinsp;"></a> and the like.

Entlinkt added a comment. Via Conduit, Aug 24 2010, 7:19 AM

See also section 3.1 of RFC 3987: Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs [...]. Please note that the number sign ("#"), the percent sign ("%"), and the square bracket characters ("[", "]") are not part of the above list and MUST NOT be converted. [...]

bzimport added a comment. Via Conduit, Aug 25 2010, 5:58 PM

ayg wrote:

I'm not really worried about us not following the spec, since the spec can be changed if it's unreasonable, or ignored. If we can't reliably link to these characters, though, we should strip them. Your attachment only illustrates behavior with "#", which is already stripped -- do "%", "[", and "]" also not work in practice? I tested with [[mw:User:Simetrical/Id test]] and they seemed to work fine. I didn't test percent-escaping exhaustively, though -- in particular, since you point it out, things like "%3F" are likely to be interpreted differently by different browsers, and I didn't test that in all browsers.

Stripped "%" in r71636. Does anything else cause browsers to misbehave? If not, I'll look into filing spec or validator bugs where possible.

Entlinkt added a comment. Via Conduit, Aug 25 2010, 10:15 PM

Attachment 7649 also shows inconsistent behaviour with "%". IE does not seem to support percent-encoding in fragments at all; it takes the "%" sign literally even if the two characters that follow could be hex digits. (If it did support percent-encoding in fragments, this would all be moot, since we could just percent-encode these characters.)

Other browsers seem to try to guess how "%" was meant, but do it differently: Chrome prefers to take it literally, Mozilla and Opera prefer taking it as a hex number. location.hash is different again: Mozilla decodes it, but Chrome and Opera don't.
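To make the two readings concrete, here is a small Python sketch of the "literal" and the "hex number" interpretation of the same fragment. The function names are hypothetical and the code does not model any particular browser; it only shows why the same link can point at two different IDs.

```python
from urllib.parse import unquote


def literal_target(fragment: str) -> str:
    """Take the fragment as written: "%" is an ordinary character."""
    return fragment


def decoded_target(fragment: str) -> str:
    """Treat "%XX" as a percent-escape and decode it before matching IDs."""
    return unquote(fragment)


fragment = "%41"
print(literal_target(fragment))  # looks for an element with id="%41"
print(decoded_target(fragment))  # looks for an element with id="A"
```

A page that contains both id="%41" and id="A" would therefore be scrolled to different elements depending on which reading is applied.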

I have not found any practical issues with "[" and "]" so far.

Entlinkt added a comment. Via Conduit, Aug 26 2010, 12:27 AM

MediaWiki has a funny handling of these characters in external links that is exactly the other way round. The wikitext

[http://example.com/#&#x23;&#x25;&#x5B;&#x5D;]

gives this HTML:

<a href="http://example.com/##%%5B%5D">

So it lets the more problematic characters through unencoded and encodes the less problematic ones. Why that?
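The observed output can be reproduced with a short Python sketch: decode the character references, then percent-encode everything except "#" and "%". This only mimics the result shown above; it is an assumption about the effect, not MediaWiki's actual code path.

```python
import html
from urllib.parse import quote

# Fragment part of the wikitext link, written with character references.
wikitext_fragment = "&#x23;&#x25;&#x5B;&#x5D;"

# Decoding the references yields the raw characters "#%[]".
decoded = html.unescape(wikitext_fragment)

# Keeping "#" and "%" as "safe" leaves them unencoded, while "[" and "]"
# are percent-encoded, which matches the href reported in this comment.
href = "http://example.com/#" + quote(decoded, safe="#%")
print(href)
```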

bzimport added a comment. Via Conduit, Aug 26 2010, 7:43 PM

ayg wrote:

So the only remaining problem is that the validator complains about things like <a href="#&nbsp;"></a>? If so, I'll look into reporting that as a spec or validator bug, and mark this FIXED.

(In reply to comment #6)

> MediaWiki has a funny handling of these characters in external links that is
> exactly the other way round. The wikitext
>
> [http://example.com/#&#x23;&#x25;&#x5B;&#x5D;]
>
> gives this HTML:
>
> <a href="http://example.com/##%%5B%5D">
>
> So it lets the more problematic characters through unencoded and encodes the
> less problematic ones. Why that?

I don't know. I glanced at the code but didn't see an obvious reason. It's a separate bug.

Entlinkt added a comment. Via Conduit, Aug 26 2010, 11:44 PM

> So the only remaining problem is that the validator complains about things like
> <a href="#&nbsp;"></a>?

Not quite. It's unclear why the HTML5 validator complains about Unicode whitespace like nbsp etc.; the RFCs give no clue. But unencoded "[" and "]" are clearly non-compliant. RFC 3987 says "... square bracket characters ... MUST NOT be converted" and then RFC 3986 says "A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax."

> I don't know. I glanced at the code but didn't see an obvious reason. It's a
> separate bug.

Separate, but related. There is apparently no way to write links to sections with "[" and "]" in the title as external links (this includes permalinks) without getting them percent-encoded (more than that, it's hard to write them at all, as they clash with wiki markup).

Other than that, I'm not sure if stripping the most problematic characters is the right approach at all. It doesn't solve all compatibility issues. I've just noticed the following: Paste http://example.com/#< into Firefox's address bar. Copy from there and paste into an arbitrary text editor. You'll get http://example.com/#%3C (tested in a current Firefox 4.0 nightly), which doesn't work in IE. This happens with some funny ASCII characters like "<" and ">", but also (and that's far worse) with non-ASCII characters that occur in natural language.

So Firefox users will create links that don't work in IE as long as IE doesn't understand percent encoding. Maybe we should therefore allow all characters in IDs, percent-encode where necessary (that is, just 4 ASCII characters which rarely occur in natural language anyway) and accept that this minor detail doesn't work in IE. That's at least compliant; the whole attempt to allow arbitrary Unicode characters isn't interoperable with Firefox enforcing percent encoding and IE not supporting it.
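A minimal sketch of that alternative, in Python: leave every character of the ID alone except the four problematic ASCII characters, which are percent-encoded when the ID is used as a link target. The name `encode_fragment` is hypothetical and this is not an existing MediaWiki function.

```python
# Percent-escapes for the four characters that must not appear raw in a
# fragment: "#" -> %23, "%" -> %25, "[" -> %5B, "]" -> %5D.
ESCAPES = {ch: "%{:02X}".format(ord(ch)) for ch in "#%[]"}


def encode_fragment(section_id: str) -> str:
    """Percent-encode only "#", "%", "[" and "]"; keep everything else.

    Working character by character avoids double-encoding the "%" of an
    escape that was produced for an earlier character.
    """
    return "".join(ESCAPES.get(ch, ch) for ch in section_id)


print(encode_fragment("Größe_[cm]"))
print(encode_fragment("50%_#1"))
```

Non-ASCII letters such as "ö" and "ß" pass through unchanged, so only IDs that contain one of the four characters would be affected by IE's lack of percent-decoding.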

bzimport added a comment. Via Conduit, Aug 27 2010, 5:22 PM

ayg wrote:

(In reply to comment #8)

> But unencoded "[" and "]" are
> clearly non-compliant. RFC 3987 says "... square bracket characters ... MUST
> NOT be converted" and then RFC 3986 says "A host identified by an Internet
> Protocol literal address, version 6 [RFC3513] or later, is distinguished by
> enclosing the IP literal within square brackets ("[" and "]"). This is the
> only place where square bracket characters are allowed in the URI syntax."

Hmm. We could strip those too, but it seems silly if all browsers accept them. If the spec requires something that not all browsers support, and prohibits something equivalent that all browsers do support, the spec is broken.

> Separate, but related. There is apparently no way to write links to sections
> with "[" and "]" in the title as external links (this includes permalinks)
> without getting them percent-encoded (more than that, it's hard to write them
> at all, as they clash with wiki markup).

The sensible thing would be to urldecode() anchors automatically in external links, if that's what it takes for IE to accept them . . . if that's necessary for the links to actually work but specs prohibit it, the specs are wrong. But that's a separate issue from a development perspective, as I said, although conceptually related.
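As a sketch of what decoding the anchor could look like, the following Python example splits a URL, percent-decodes only the fragment, and puts the URL back together. Python's `unquote` stands in for PHP's urldecode() here (the two differ in how they treat "+"), and the name `decode_anchor` is hypothetical; this is not the MediaWiki implementation.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def decode_anchor(url: str) -> str:
    """Return url with percent-escapes in its fragment decoded.

    Scheme, host, path and query are left exactly as they were.
    """
    parts = urlsplit(url)
    return urlunsplit(parts._replace(fragment=unquote(parts.fragment)))


print(decode_anchor("http://example.com/#%5B%5D"))
```

With this approach an external link written with "%5B" and "%5D" in the anchor would be emitted with literal square brackets, which is the form the browsers tested in this thread appear to accept.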

> Other than that, I'm not sure if stripping the most problematic characters is
> the right approach at all. It doesn't solve all compatibility issues. I've just
> noticed the following: Paste http://example.com/#< into Firefox' address bar.
> Copy from there and paste into an arbitrary text editor. You'll get
> http://example.com/#%3C (tested in a current Firefox 4.0 nightly), which
> doesn't work in IE. This happens with some funny ASCII characters like "<" and
> ">", but also - and that's far worse - non-ASCII characters that occur in
> natural language.
>
> So Firefox users will create links that don't work in IE as long as IE doesn't
> understand percent encoding.

This seems like a minor enough failure. At worst, the very small number of people who this happens to won't make it to the right section. Not the end of the world.

I've reported the issue to Microsoft, after verifying that it still exists in IE9PP4:

https://connect.microsoft.com/IE/feedback/details/590087/percent-encoding-fragments-hashes-anchors-does-not-work

(A [free] Microsoft Live account is needed to view.)
