Page MenuHomePhabricator

The Sanitizer doesn't validate the contents of id attributes
Closed, ResolvedPublic

Description

Author: avarab

Description:
id attributes MUST (RFC 2119) have a value that begins with a letter [A-Za-z],
MediaWiki doesn't enforce this.

Steps to reproduce

  1. Save the text "<div id=9></div>" on a page

Actual output

<div id="9"></div>

Expected output

No idea how we should properly go about this...


Version: unspecified
Severity: minor
URL: http://www.w3.org/TR/html401/types.html#type-name

Details

Reference
bz4515

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 9:01 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz4515.
bzimport added a subscriber: Unknown Object (MLST).
bzimport created this task.Jan 7 2006, 9:04 AM

avarab wrote:

This bug has a parsertest called "Sanitizer: Validating the contents of the id
attribute"

zigger wrote:

(In reply to comment #0)

...
No idea how we should properly go about this...

What would Tidy do?

hashar added a comment.Jan 7 2006, 1:26 PM

<div id="nine"></div> ? :o)

avarab wrote:

(In reply to comment #2)

(In reply to comment #0)

...
No idea how we should properly go about this...

What would Tidy do?

$ echo Parser::Tidy( "<div id=9>" );
<div id="9"></div>

I.e. it has the same issue.

brion added a comment.Jun 14 2006, 6:37 AM
  • Bug 6301 has been marked as a duplicate of this bug. ***

ayg wrote:

As I understand it, Tidy explicitly does not fix invalid ids, because that
wouldn't necessarily fix the issue; invalid ids indicate a systematic error that
will inherently screw up, e.g., anchors, no matter what Tidy does. It has to be
fixed with intelligence.

That said, a hack to kill multiple ids after Tidy tidies might be a good idea.
If multiple or invalid ids are detected, any id after the first/any invalid id
should perhaps just be commented out with an error message in the comment. If
someone tries to save something that explicitly (rather than via template)
contains tags that collectively contain one or more ids, a warning should be
thrown up on attempting to save.

There are rather more pressing things to worry about. (On the non-dev side,
stub templates should use classes, not ids.)

From 6301 : should also enforce id uniqueness

wulf.bugmail wrote:

I agree with Simetrical, stubs should use classes, not ids…

why not fixing it by adding a default compliant prefix for such ids?
So just prepend "id-" and you're done!
-> id="9" becomes id="id-9"

Hmm. in fact the standard is a bit more restrictive about which Unicode
characters are usable as identifiers, either at a start position, or at a
continuing position.

Note that there are currently *two* sets of Unicode properties (starting by
"ID_" or "XID_") but Unicode is about to make them equivalent : the only
difference is whever the MIDDLE DOT should be managed identically as a possible
continuing character, because there's an issue with the compatibility
equivalence of Catalan character L WITH MIDDLE DOT, which is a letter and valid
both as a start character or a continuation character in ids, but not its in its
equivalent when encoded as a valid L start charter, and a MIDDLE DOT continue
character).

See the ongoing Unicode Public Review about this issue and the proposed change
that would make the existing two sets of character properties for identifiers
completely equivalent.

The area where it could cause problems is possibly with HTML and XML strict
conformance for identifier names, or other programming languages (e.g. Java or
C#) that uses one of the affected set of character properties, and this
justifies the ongoing Unicode public review (because this may requiring those
language definitions to be adjusted to preserve the compatibility with their
strict conformance verifiers)

But even if we forget this very limited case, we can at least adopt full
conformance for the other characters usable in identifiers, without causing
strict HTML/XML parsers to fail and adopt automatically the old legacy
compatiblity profile support instead (which affects the rendering of pages).

Currently, it does not seem that either IE or Mozilla/Firefox do enforce the
strict validity of identifier names, and not even their uniqueness in the
document (both can return an array of elements from GetElementById(), if there
are multiple occurences of the same id, instead of failing, or returning null,
or returning a single element chosen arbitrarily), so this is not a demonstrated
major issue... for now...

(I don't know what other browsers are doing to enforce identifiers validity)

ayg wrote:

(In reply to comment #9)

why not fixing it by adding a default compliant prefix for such ids?
So just prepend "id-" and you're done!
-> id="9" becomes id="id-9"

Fine by me. We also have to fix anchor link generators (like for the TOC, section edit redirect,
history, etc.), of course.

ayg wrote:

*** Bug 4687 has been marked as a duplicate of this bug. ***

david.sledge wrote:

Why not just run it through static method escapeId() in the Sanitizer class (includes/Sanitizer.php)? If you set the $flag argument to Sanitizer::NONE, it will make sure the ID starts with a letter. In fact, the header anchors are created with this method (though the $flag argument is defaulted to Sanitizer::INITIAL_NONLETTER).

ayg wrote:

Doesn't help with id uniqueness, which is a much trickier problem. It would be a start, though.

david.sledge wrote:

I know. I've thrown in my $0.02 about that, too. See bug #7356.

david.sledge wrote:

One thing to keep in mind is that whatever mechanism is used to ensure valid id attribute values, the wiki section links must be parsed using a similar method. If the code converts

<span id="0 9"/>

into

<span id="x0_9"/>

then the wiki markup

[[Page#0 9|text]]

must result in

<a href="/wiki/Page#x0_9" title="Page">text</a>

ayg wrote:

Which will only work if the built-in id's aren't broken. So if we ditch id's starting with numbers, we'd better do it everywhere at once, including in hardcoded id's (although I can't think of any of those that begin with numbers).

david.sledge wrote:

Since the built-in ids can't be set by some random user like the ones in an article body, I think it would be reasonable the code to throw an exception whenever an invalid id is encountered. Just have to make sure that the code can distinguish the built-in ids from the user-supplied ones. Put the onus of fixing them on the admin, and in fact if the built-in ones are in violation, they can be caught before an official release.

ayg wrote:

Hardcoded id's don't get run through the Sanitizer anyway (why would they? check them by eye . . .), so there's nowhere to throw an exception. I think they're all valid, though, thinking about it, except that maybe I recall the Cite.php ones aren't.

david.sledge wrote:

Hardcoded id's don't get run through the Sanitizer anyway (why would they?
check them by eye . . .)

Ah, but you're forgetting the discussion on bug 7356, to which this bug is closely related.

Additionally, though all the built-in IDs are valid out-the-box, someone who skins their own wiki might make the mistake of using an invalid ID, in which case it'd be useful to have some sort of notification.

  • Bug 13832 has been marked as a duplicate of this bug. ***

ayg wrote:

The current summary, "The Sanitizer doesn't validate the contents of id attributes", is misleading. The Sanitizer does validate id attributes, just not fully. There are several cases I know of where we might have id validity problems:

  1. Id's can start with invalid characters, like digits. This is bug 9530.
  1. The software itself produces duplicate id's, say in the interface. I'm not aware of anywhere this happens in core. Any case where this occurs should be filed as a separate bug.
  1. Section id's and manually-specified id's ("manually-specified" here means something like <span id="foo">, as opposed to == foo ==) can both conflict with interface id's. This is bug 13926 and can be dealt with there.
  1. Manually-specified id's can conflict with each other and with section id's. This is covered by bug 7356.
  1. It just occurred to me that section id's can conflict with each other in obscure cases, e.g.,

a

a

a_2

This is fairly unlikely to occur, but might be worth fixing anyway. I've filed it as bug 16891.

All the individual cases are covered, so I don't see any reason for a general-purpose bug to remain open. I don't think we need a tracking bug here. I'll mark this duplicate of bug 9530 (which is hopefully about to be FIXED), since that's what the original bug was about.

*** This bug has been marked as a duplicate of bug 9530 ***