
User-specified HTML IDs can be the same as interface IDs
Open, Low, Public

Description

If any header or subheader is given as == content ==, Firefox 1.5.0.7 draws
a semi-complete dashed box next to it.

Repro:
Create a page with the following text:

== content ==

Preview or save, and observe the result.


Version: unspecified
Severity: normal
URL: http://en.wikipedia.org/wiki/User:Simetrical/7356

Details

Reference
bz7356

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:24 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz7356.
bzimport added a subscriber: Unknown Object (MLST).
Yurik created this task. Sep 17 2006, 7:48 PM

ayg wrote:

I don't see anything. Does it happen if you log out? Does it happen at the URL
I just added to this bug?

Yurik added a comment. Sep 18 2006, 5:31 AM

That's because you capitalized the word "Content". It must be all lower case.

dto wrote:

The heading generates an anchor with name=id=content, which collides with the
id=content div. :(

ayg wrote:

Ouch. That's nasty. The only solution I can see would be to move all header
id's to stuff like #h-content instead of #content. (You could also special-case
the few bad id's, but that will a) lead to confusion and b) be hard to maintain.)

dto wrote:

*** Bug 7662 has been marked as a duplicate of this bug. ***

ayg wrote:

(In reply to comment #4)

Ouch. That's nasty. The only solution I can see would be to move all header
id's to stuff like #h-content instead of #content. (You could also special-case
the few bad id's, but that will a) lead to confusion and b) be hard to maintain.)

Better solution: prefix all interface id's with "mw-" and then ban that from
non-interface id's. Should be pretty simple to fix, although it will
unfortunately be slightly disruptive.

david.sledge wrote:

Even if the aforementioned solutions are applied, someone could just as easily edit/create a page with the following:

==content==

<span id="content">text</span>

and the same problem would exist. Also, if you don't allow user-supplied ids/anchor names (or derived ids/anchor names from user-supplied content) to have the prefix "mw-", how would you deal with the following:

==mw-content==

Let's not forget templates. If a page includes a template, it's possible that both pages use the same id/anchor name, even though within each page individually, the ids/anchor names are unique. And I've found a similar problem with extensions that generate their own ids/anchor names like Cite. (see bug #11625)

One thing I've noticed is that if a tag is created with an ID that has characters not allowed, the parser is smart enough to single out the id and swap out the invalid characters with valid ones.

What if the parser kept a running list of all the ids and anchor names already in use? When it replaces the invalid id/anchor name characters, it can check against the list to make sure the id/anchor name in question is not already in use. Duplicates would be resolved the same way headers with the same text are resolved.

The only issue I can see at the moment is when extensions create links to destination anchors yet to be rendered. Let's take Cite for example. Given the following:

I like cheese<ref>It's true!</ref>.

...

<references/>

when the "ref" tag gets rendered, a link must be created to a destination anchor that doesn't yet exist, so two things have to happen: (a) an id/anchor name must be created on the spot, so it can be linked to the footnote (even though the footnote itself has not been created yet), and (b) all other destination anchors must be prevented from using the generated id/anchor name, without preventing the "references" tag from using it, too.
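
For illustration only, the running-list idea and the Cite case above could be sketched roughly as follows. This is Python rather than MediaWiki's PHP, and the class, its method names and the "cite_note-1" id are all hypothetical, not the parser's or Cite's actual API:

```python
class IdRegistry:
    """Running list of ids in use on one page (hypothetical helper)."""

    def __init__(self):
        self._used = set()      # ids already emitted
        self._reserved = {}     # id -> owner that may emit it later

    def _first_free(self, wanted):
        # Same scheme as duplicate headings: wanted, wanted_2, wanted_3, ...
        candidate, n = wanted, 1
        while candidate in self._used or candidate in self._reserved:
            n += 1
            candidate = f"{wanted}_{n}"
        return candidate

    def claim(self, wanted):
        """Emit an anchor now; returns the id actually used."""
        candidate = self._first_free(wanted)
        self._used.add(candidate)
        return candidate

    def reserve(self, wanted, owner):
        """Pick an id for an anchor that `owner` will only render later."""
        candidate = self._first_free(wanted)
        self._reserved[candidate] = owner
        return candidate

    def redeem(self, reserved_id, owner):
        """Let `owner` emit the anchor it reserved; others are refused."""
        if self._reserved.get(reserved_id) != owner:
            raise KeyError(f"{reserved_id!r} is not reserved for {owner!r}")
        del self._reserved[reserved_id]
        self._used.add(reserved_id)
        return reserved_id


registry = IdRegistry()
# The "ref" tag is rendered first and needs a link target immediately.
note_id = registry.reserve("cite_note-1", owner="references")
# A later heading that wants the same id is pushed to the next free one.
heading_id = registry.claim("cite_note-1")
# The "references" tag finally emits the anchor it was promised.
footnote_id = registry.redeem(note_id, owner="references")
```

Here the heading ends up as "cite_note-1_2" while the footnote keeps "cite_note-1", which covers points (a) and (b) under the assumption that everything producing ids goes through one registry.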

*** Bug 11625 has been marked as a duplicate of this bug. ***

ayg wrote:

(In reply to comment #7)

What if the parser kept a running list of all the ids and anchor names already
in use? When it replaces the invalid id/anchor name characters, it can check
against the list to make sure the id/anchor name in question is not already in
use. Duplicates would be resolved the same way headers with the same text are
resolved.

Something broadly like that is, of course, the only way to fix this bug. To begin with, though, much of the interface isn't run through the Sanitizer, so we'd have to manually (!) keep track of every single one of the hundreds of id's used in the software, which tend not to follow any rhyme or reason. It's still doable, certainly.

david.sledge wrote:

Sounds like it might be a tedious task, but not necessarily a difficult one. Worst case scenario is that all the IDs and anchor names outside the actual article body are hard-coded into the list. A better option is to have the surrounding HTML completely assembled before the article body is, and pass it into a method that extracts every id and anchor name and adds it to the list.
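
A rough sketch of such an extraction method, in Python for illustration (MediaWiki itself is PHP). The names IdCollector and collect_ids and the sample skin markup are made up for this example:

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id attribute and every <a name="..."> value."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values are kept as-is.
        for name, value in attrs:
            if not value:
                continue
            if name == "id" or (tag == "a" and name == "name"):
                self.ids.add(value)


def collect_ids(html):
    """Return the set of ids and anchor names found in an HTML string."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


# Hypothetical skin output assembled before the article body is parsed.
skin_html = '<div id="content"><a name="top"></a><div id="bodyContent"></div></div>'
interface_ids = collect_ids(skin_html)
```

The resulting set could then seed the list of ids that user-supplied content is not allowed to reuse.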

ayg wrote:

Patches are appreciated.

david.sledge wrote:

*** Bug 13926 has been marked as a duplicate of this bug. ***

demon added a comment. Jul 29 2009, 7:00 PM

*** Bug 17650 has been marked as a duplicate of this bug. ***

*** Bug 21440 has been marked as a duplicate of this bug. ***

*** Bug 21856 has been marked as a duplicate of this bug. ***

Merl added a comment. Dec 16 2009, 12:36 AM

Because a heading can start with a non-ASCII letter, an invalid id is created which starts with a period.
According to the XHTML 1.0 specification, an id has to start with [A-Za-z]. Digits and some other characters (e.g. the period) are only allowed in the following characters.

== Überschrift ==

creates
<span class="mw-headline" id=".C3.9Cberschrift">Überschrift</span>

So a prefix to the id should solve this problem because mw-.C3.9Cberschrift would be a valid id.
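
To make the example concrete, here is a small Python approximation of the encoding shown above (UTF-8 bytes percent-encoded, then "%" replaced by "."), together with a check against the XHTML 1.0 id syntax. The function names are invented for this sketch and the real parser's escaping may differ in details:

```python
import re
from urllib.parse import quote

# XHTML 1.0 / HTML 4 rule: a letter first, then letters, digits, "-", "_", ":" or ".".
XHTML1_ID = re.compile(r"[A-Za-z][A-Za-z0-9_:.\-]*")


def dot_encode(heading_text):
    """Approximate the legacy anchor encoding: spaces to "_", then "%XX" to ".XX"."""
    percent_encoded = quote(heading_text.strip().replace(" ", "_"), safe="")
    return percent_encoded.replace("%", ".")


def is_valid_xhtml1_id(candidate):
    """True if the whole string matches the XHTML 1.0 id syntax."""
    return XHTML1_ID.fullmatch(candidate) is not None


raw_id = dot_encode("Überschrift")   # the id from the example above
prefixed_id = "mw-" + raw_id         # the suggested prefixed form
```

With these definitions, raw_id fails the validity check because it starts with a period, while prefixed_id passes it.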

ayg wrote:

MediaWiki no longer outputs XHTML1 by default, but HTML5. id's in HTML5 can be any nonempty string that doesn't contain whitespace:

http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute

(In reply to comment #17)

MediaWiki no longer outputs XHTML1 by default, but HTML5. id's in HTML5 can be
any nonempty string that doesn't contain whitespace:
http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute

But MediaWiki still can (and on WMF wikis does) output XHTML1, so the solution must account for that DTD.

*** Bug 22587 has been marked as a duplicate of this bug. ***

demon added a comment. Jul 6 2010, 3:00 PM

*** Bug 24285 has been marked as a duplicate of this bug. ***

theevilipaddress wrote:

Can't we do it here the way we do it with duplicate sections? For example,

== Heading ==

bla bla...

== Heading ==

bla bla...

becomes

id="Heading"
bla bla bla...
id="Heading_2"
bla bla bla...

In this case,

== content ==

should simply become id="content_2".

ayg wrote:

Basically, yes. What we have to do is make a list of all the id's used by the software and blacklist them for section titles and other user-provided id's. This is feasible to maintain if we adopt a strict policy of prefixing all software-generated id's with "mw-", which we often do already, although we're not very strict about it. Then we can just blacklist the "mw-" prefix, in addition to a hopefully-not-expanding list of legacy unprefixed id's.

We can't feasibly check the list of interface id's used on the current page on the fly, while parsing. This works for things the parser generates, but parser output can't depend on UI output. The same cached parser output is stuck into a variety of skins, plus no skin at all (action=raw, API output, etc.). So we need to get a list of all id's used anywhere in the software and ban them in all pages.
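
As a rough illustration of that static check, in Python rather than MediaWiki's PHP: the constant and function names are hypothetical, and apart from "content" the legacy entries are only assumed examples, not an actual inventory of the software's ids.

```python
# Prefix reserved for software-generated ids under the proposed policy.
INTERFACE_PREFIX = "mw-"

# Legacy unprefixed interface ids. "content" comes from this bug; the other
# two entries are illustrative assumptions only.
LEGACY_INTERFACE_IDS = frozenset({"content", "bodyContent", "footer"})


def is_reserved_for_interface(candidate_id):
    """True if a user-provided id would collide with the interface namespace."""
    return (
        candidate_id.startswith(INTERFACE_PREFIX)
        or candidate_id in LEGACY_INTERFACE_IDS
    )
```

Because the check depends only on the id string and never on which skin is rendering the page, the same cached parser output stays valid everywhere. The comparison is case-sensitive, so a heading "Content" would not be affected, matching the behaviour reported earlier in this bug.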

Both sound necessary: the "mw-" interface prefix, and upcounting in the headings.

By upcounting I mean what The Evil IP address mentioned above, so that "mw-content" would be treated like a duplicate heading.

So that the following

== something ==

== something ==

== content ==

== mw-content ==

would become

id="something"
id="something_2"
id="content_2"
id="mw-content_2"
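
One way to read the example above, sketched in Python for illustration only: an interface id is treated as if it were already present on the page, so the first heading that asks for it starts at the "_2" suffix. The function names and the one-entry legacy set are assumptions made for this sketch.

```python
def is_interface_id(candidate_id):
    # Assumed rule: the "mw-" prefix plus one legacy unprefixed id from this bug.
    return candidate_id.startswith("mw-") or candidate_id in {"content"}


def assign_heading_ids(heading_texts):
    """Give each heading a unique id, treating interface ids as already taken."""
    used = set()
    assigned = []
    for text in heading_texts:
        if is_interface_id(text):
            # The bare id belongs to the interface, so start counting at 2.
            n, candidate = 2, f"{text}_2"
        else:
            n, candidate = 1, text
        while candidate in used:
            n += 1
            candidate = f"{text}_{n}"
        used.add(candidate)
        assigned.append(candidate)
    return assigned


ids = assign_heading_ids(["something", "something", "content", "mw-content"])
```

This yields the four ids listed above. Note that "mw-content_2" still begins with "mw-", so under this reading only the exact bare id is protected; a strict ban on the prefix would need a different renaming scheme.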

brion added a comment. Jun 2 2011, 8:02 PM

*** Bug 29049 has been marked as a duplicate of this bug. ***

We also have the problem that with section editing, we get ids in previews which differ from the ids in the full page. That is at least bewildering, and at worst may lead to wrong ids being copied and used elsewhere.

Editing a page closer to the beginning may lead to ids further down being renumbered. References to ids from elsewhere, e.g. via links with a fragment identifier, should ideally not break in such cases.

In bug 29049, it has been suggested that editors be warned when a page is
saved with duplicate id values, and that duplicates simply be accepted on a
second save, as is done for empty "Summary" fields. A toggle in
Special:Preferences, similar to the one for the handling of empty "Summary"
fields, might even be considered for the id= value checking.

A warning on Save does not seem like the right approach. The ID problem is an internal, technical shortcoming of MediaWiki. Exposing this to non-technical editors would just be confusing to them.

*** Bug 29480 has been marked as a duplicate of this bug. ***

(In reply to comment #18)

(In reply to comment #17)

MediaWiki no longer outputs XHTML1 by default, but HTML5. id's in HTML5 can be
any nonempty string that doesn't contain whitespace:
http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#the-id-attribute

But MediaWiki still can (and on WMF wikis does) output XHTML1, so the solution
must account for that DTD.

WMF only uses XHTML because of some bots and scripts that haven't been updated yet. Eventually WMF WILL be using HTML5. And as this is purely a validation matter (browsers are not going to care if you use an XHTML doctype but actually follow HTML5's rules), we don't care about XHTML rules.

Restricted Application added a subscriber: Aklapper. Sep 16 2015, 3:25 PM
Meno25 removed a subscriber: Meno25. Feb 22 2016, 7:10 PM