
Can't have <, >, {, or } in page titles
Closed, Declined · Public

Description

Author: dbenbenn

Description:
MediaWiki doesn't allow < or > in page titles. Perhaps it has
something to do with shell interpretation. But I don't see why this
should be a technical limitation. File names in Unix can have < and
>, after all. You just have to be very careful about properly
quoting.


Version: 1.5.x
Severity: minor

Details

Reference
bz2908

Event Timeline

bzimport raised the priority of this task from to Lowest. Nov 21 2014, 8:41 PM
bzimport set Reference to bz2908.
bzimport added a subscriber: Unknown Object (MLST).

Quoting RFC 1738 http://www.faqs.org/rfcs/rfc1738.html :

The characters "<" and ">" are unsafe because they are used
as the delimiters around URLs in free text;

They could be escaped to %3C and %3E though.
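
For illustration only, the escaping mentioned above can be reproduced with Python's standard library. This is not MediaWiki code, and `encode_title` is a hypothetical helper name:

```python
from urllib.parse import quote, unquote


def encode_title(title: str) -> str:
    """Percent-encode every character outside the unreserved URL set."""
    # safe="" also encodes "/", which quote() would otherwise leave alone.
    return quote(title, safe="")


encoded = encode_title("<>")
print(encoded)           # %3C%3E
print(unquote(encoded))  # <>
```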

dbenbenn wrote:

The same paragraph of rfc1738 lists ^ as also being unsafe, yet we have [[^]] as
a redirect to [[circumflex]]. (Also [["]] redirects to [[quotation mark]],
[[%]] to [[percentage]], [[\]] to [[backslash]], [[~]] to [[tilde]], and [[`]]
to [[grave accent]].)

The only characters listed as unsafe in that RFC that we don't allow in page
titles are <, >, [, ], {, }, and |. The |, [, and ] are because of wiki-syntax
limitations. <, >, {, and }, though, should be allowed in page titles, possibly
via %-escaping.

rowan.collins wrote:

Well, it seems sensible to disallow '{' and '}' for the same reason as '[' and
']' - how would you include a page called "}"? Sure, we could make the user
escape them by hand, but that's arguably more ugly than just making them choose
a different name, and an invitation for bugs to come and nest in our code.

'<' and '>', meanwhile, have the potential to generate malformed output which
includes unfiltered HTML tags. Obviously, this is perfectly avoidable, and
wouldn't require any mangling by users, but we would have to be very careful to
get this right, and the benefits (slightly nicer titles) may not outweigh the
risks, and the effort required to avoid them.

Just my €0.02, of course...

{ and } are markup used in links and cannot ever be part of page titles for this
reason.

< and > are disallowed for safety.

dbenbenn wrote:

Trivial edit to Title.php:legalChars

I see no reason why [[}]] shouldn't link to the page called }. Similarly,
[[a]b]] should link to the page called a]b.

I don't know about the safety of < and > (not having audited the entire code!),
but I would think that &, ;, and ! are just as dangerous.

It appears that Parser.php is written in such a way that it can handle all
those characters ([]{}<>) without modification. All that is necessary is to
edit Title.php:legalChars in the obvious way (see the patch). Then all of the
following Wiki code works as you'd expect:

[[]]] links to ]
[[a]b]] links to a]b
[[a]]]] links to a]]
[[}}]] links to }}
{{}}}} includes Template:}}
[[>]] links to >

And yes, having a redirect on Wikipedia from [ and ] to Bracket would be
useful.

Attached:

If you think about this for half a second you'll see why that doesn't work:
[[This is a]] long page title which is still in the link [[and why not?]]

&, ;, and ! are not dangerous in any way. ; and ! have no special meaning at all, and &
is merely annoying if output incorrectly (invalid (X)HTML or unexpected character
entity).

rowan.collins wrote:

(In reply to comment #5)

> following Wiki code works as you'd expect
>
> [[]]] links to ]
> [[a]b]] links to a]b
> [[a]]]] links to a]]
> [[}}]] links to }}
> {{}}}} includes Template:}}
> [[>]] links to >

But *are* these always the expected behaviours?

  • What about using a template or template parameter to determine what title
    to use (e.g. "[[Wikiquote:{{PAGENAME}}|{{PAGENAME}}]]" or
    "[[{{{1}}}{{{month}}} 1{{{4}}}|1]]" or
    "{{SeptemberCalendar{{CURRENTYEAR}}}}"; all real examples)?

  • And what about images with links in their caption? - e.g.
    "[[Image:Foo.jpeg|thumb|this is a [[photo]] of [[foo]]]]"; since your patch
    also allows "[" in titles, this syntax is extremely ambiguous.

  • Or even just a mix of links and punctuation, like "[See [[foo]]]"

While some of these things appear to still work with your patch, because of the
order things are processed in the existing code, making them *reliably* do so
would be a nightmare.

dbenbenn wrote:

> But *are* these always the expected behaviours?
>
>   • What about using a template or template parameter to determine what
>     title to use (e.g. "[[Wikiquote:{{PAGENAME}}|{{PAGENAME}}]]" or
>     "[[{{{1}}}{{{month}}} 1{{{4}}}|1]]" or
>     "{{SeptemberCalendar{{CURRENTYEAR}}}}"; all real examples)?
>
>   • And what about images with links in their caption? - e.g.
>     "[[Image:Foo.jpeg|thumb|this is a [[photo]] of [[foo]]]]"; since your
>     patch also allows "[" in titles, this syntax is extremely ambiguous.
>
>   • Or even just a mix of links and punctuation, like "[See [[foo]]]"

Thanks for the constructive comments, Rowan! That's a good point, "[See
[[foo]]]" no longer works the same way. (Your other examples do still work.)

Note that MediaWiki is currently a bit inconsistent. "[See [[foo]]]" displays
as "[See <a>foo</a>]", whereas "[See [[foo|bar]]]" displays as "[See
<a>bar]</a>". Also, the alt text of "[[Image:Barnstar.png|[[foo|bar]]] hey
[[foo|bar]]]]]" is "bar] hey bar"---the second "[[foo|bar]]]" is interpreted
differently.

The patch exacerbates this inconsistency. Perhaps Parser.php should be changed
so that links end at the ''beginning'' of the first string of two or more ],
instead of at the end. Then "[See [[foo|bar]]]" would display as "[See
<a>bar</a>]".
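
The two closing rules can be compared with a toy model. This is only a sketch: `split_link` is a hypothetical helper, not how Parser.php works, and it models only the first link in a string (the piped case above).

```python
import re


def split_link(text: str, close_at: str):
    """Return (before, link_text, after) for the first [[...]] link.

    close_at="start": the first two ] of the run close the link (proposed).
    close_at="end":   the last two ] of the run close the link.
    """
    start = text.index("[[")
    run = re.search(r"\]{2,}", text[start + 2:])
    if run is None:
        raise ValueError("no closing brackets found")
    run_start = start + 2 + run.start()
    run_end = start + 2 + run.end()
    if close_at == "start":
        inner_end, after_start = run_start, run_start + 2
    else:
        inner_end, after_start = run_end - 2, run_end
    return text[:start], text[start + 2:inner_end], text[after_start:]


print(split_link("[See [[foo|bar]]]", "end"))    # ('[See ', 'foo|bar]', '')
print(split_link("[See [[foo|bar]]]", "start"))  # ('[See ', 'foo|bar', ']')
```

Under the "start" rule the stray ] stays outside the link, which matches the "[See <a>bar</a>]" rendering suggested above.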

I understand now what Brion was referring to above about < and > being unsafe.
Check out

  1. http://en.wikipedia.org/wiki/&lt; (which doesn't exist, but is moderately
     broken)

  2. http://en.wikipedia.org/wiki/Special:Movepage/User:Dbenbenn/%26lt%3B (the
     "Move page:" field displays wrong)

Thus, even without < and > in titles, it's important to escape characters correctly.
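
A minimal sketch of the escaping step in question, using Python's standard library rather than MediaWiki's own code. The function name and the `title` form field are hypothetical:

```python
from html import escape
from urllib.parse import unquote


def render_title_field(url_path_segment: str) -> str:
    """Decode a percent-encoded title, then escape it for an HTML attribute."""
    title = unquote(url_path_segment)  # "%26lt%3B" becomes "&lt;"
    # Without escape(), a browser would show the value "&lt;" as "<".
    return '<input name="title" value="{}">'.format(escape(title, quote=True))


print(render_title_field("%26lt%3B"))
# <input name="title" value="&amp;lt;">
```

The title here is the four characters `&lt;`, so the ampersand itself has to be escaped for the field to display the title unchanged.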

dbenbenn wrote:

>   1. http://en.wikipedia.org/wiki/&lt; (which doesn't exist, but is
>      moderately broken)

Oops, the link above wasn't parsed correctly. Try
http://en.wikipedia.org/wiki/%26lt%3B instead.

dbenbenn wrote:

(In reply to comment #7)

> because of the order things are processed in the existing code,
> making them *reliably* do so would be a nightmare.

Perhaps Rowan is right. For example, currently "[[a [[test]]" links
to "test", whereas with the patch it would link to "a [[test". It's
somewhat evil to break existing pages.

Fortunately, it isn't necessary! You can link to "A" with [[a]] or
[[&#97;]]. Similarly, to link to [ with the current parser syntax,
you'd expect to use "[[&#91;]]".

That doesn't actually work. The reason is that the notions
of "characters that can go within a wiki link" and "characters that
can be in a page title" are conflated---they're both defined by
Title.php:legalChars. If we separate the two concepts, then we can
allow page titles with [, ], {, }, and |, without having to modify
the parser at all. (And once the safety issues are worked out, we
can allow < and > too.)
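
A rough sketch of what separating the two concepts could look like. The two character sets and the function name are assumptions made for illustration; they are not the contents of Title.php:legalChars.

```python
from html import unescape

# Hypothetical sets, for illustration only.
LINK_SYNTAX_FORBIDDEN = set("[]{}|<>")  # may not appear literally inside [[...]]
TITLE_FORBIDDEN = set("<>")             # may not appear in a stored page title


def resolve_link_target(raw: str) -> str:
    """Check the raw link text, decode character references, check the title."""
    if any(ch in LINK_SYNTAX_FORBIDDEN for ch in raw):
        raise ValueError("character not allowed inside link syntax")
    title = unescape(raw)  # "&#91;" becomes "["
    if any(ch in TITLE_FORBIDDEN for ch in title):
        raise ValueError("character not allowed in page titles")
    return title


print(resolve_link_target("&#97;"))  # a
print(resolve_link_target("&#91;"))  # [
```

In this model a literal [ inside a link is still rejected, so the parser's view of link syntax is unchanged, while the decoded title is checked against a separate, looser rule.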

By the way, see bug 3243 for a list of at least 14 places where &
isn't correctly HTML-sanitized.

rowan.collins wrote:

(In reply to comment #10)

> Fortunately, it isn't necessary! You can link to "A" with [[a]] or
> [[&#97;]]. Similarly, to link to [ with the current parser syntax,
> you'd expect to use "[[&#91;]]".

Well, to link to an article about '[' now, you could type "[[left bracket]]" -
so what would we have gained? OK, the page might look a bit nicer when you get
there (although a heading of just '[' might look weird anyway, so you'd redirect
to something more verbose; in which case, it amounts to being able to type
[[&#91;]] and get redirected to [[left bracket]] anyway!), but this kind of
change is frankly a lot of headaches for a very small improvement in the actual
software.

As for your comments about existing inconsistencies, some of those may well be
considered bugs - the "parser", so called, is widely considered extremely ugly,
and is "designed" (i.e. hacked together) to work mostly as expected, most of the
time. And the fact that such inconsistencies *already* exist should demonstrate
just how much trouble would be unleashed by making it any *more* complicated -
you'd have to be pretty sure the benefits outweighed the risks!

dbenbenn wrote:

(In reply to comment #11)

> Well, to link to an article about '[' now, you could type "[[left
> bracket]]" - so what would we have gained?

I don't expect one would ever want to link to [. But it would be a useful
redirect for the go/search box, for anyone who didn't know it was
called "bracket".

But that specific page isn't really the issue, anyway. How about a music
album that uses [ and ] in the title? Or a book title with { and }? Do
you want to personally guarantee that no one will ever have a legitimate
use for any of these characters?

> And the fact that such inconsistencies *already* exist should
> demonstrate just how much trouble would be unleashed by making it any
> *more* complicated - you'd have to be pretty sure the benefits outweighed
> the risks!

That's why it's so lucky that this bug can be fixed (I think) without
touching the parser at all!

dbenbenn wrote:

A related issue (which I won't bother listing as a separate bug, since it will
merely be resolved to "wontfix" regardless) involves % in page titles. For
example, [[%2542]] doesn't produce a link when parsed. (Presumably it should
link to the page entitled "%42", since MediaWiki strangely supports
[[percent-encoding]] in wiki links.) Note that the URL

http://en.wikipedia.org/wiki/%2542

---the percent-encoded URL for "%42"---returns [[Bad title]].

Such titles are forbidden because they can't be round-tripped -- when written in
wikitext the chars are decoded and the original page becomes inaccessible.
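
The round-trip problem can be shown with a toy model, assuming (as described above) that a link target is percent-decoded once before lookup. The helper names are hypothetical and this is not MediaWiki code:

```python
from urllib.parse import unquote


def title_from_wikilink(link_text: str) -> str:
    """Toy model: percent-decode a wikilink target once before lookup."""
    return unquote(link_text)


def round_trips(title: str) -> bool:
    """True if writing the title verbatim in a link resolves to the same title."""
    return title_from_wikilink(title) == title


print(round_trips("Tilde"))          # True
print(round_trips("%42"))            # False
print(title_from_wikilink("%42"))    # B
print(title_from_wikilink("%2542"))  # %42
```

A page stored as "%42" would be reached by [[%42]] only if the decoding step were skipped; with decoding, the same link text lands on "B".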

dbenbenn wrote:

Thanks for the explanation; that kind of makes sense. Note that [[''foo'']]
can't be round-tripped, either---to get the parser to link to that page, you
have to use something like [[<nowiki>''foo''</nowiki>]].

It seems to me that if people really need a page named %42 (album title,
perhaps?), they'll be willing to learn how to link to it.

The cleanest solution would be if we could turn off percent-encoding in wiki
syntax. (I know, I know, we can't do that because people insist on copying URLs
instead of page titles into wiki text.) Alternatively, perhaps
[[<nowiki>%2542</nowiki>]] should work like the ''foo'' example above.