
Internal links with URL-encoded square brackets are not parsed into wiki links
Closed, Duplicate (Public)

Description

URL-encoded square brackets in an internal link are not correctly parsed into a link. External links ARE correctly parsed.

These work, and are correctly parsed into external links:

[http://someurl.com/SomePage?&q=someParameter]

[http://someurl.com/SomePage?&q=%5BsomeParameter%3A]

[http://someurl.com/SomePage?&q=%5B%5BsomeParameter%3A%3A]

These work, and are correctly parsed into internal links:

[[SomePage?&q=someParameter]]

These do NOT work, and are not parsed into internal links:

[[SomePage?&q=%5BsomeParameter%3A]]

[[SomePage?&q=%5B%5BsomeParameter%3A%3A]]

The only ones that do not work are internal links with URL-encoded square brackets. The culprits are the left bracket [ encoded as %5B and the right bracket encoded as %3A.

It does not seem to matter how many, in what order, etc. If %5B or %3A exist in the internal link URL, the link will not be parsed into an internal wiki link. The problem does not occur for external links.

I marked this bug as "major", but not "critical", because the bug can be worked around by changing internal links to external links (with the full URL) until the bug is fixed. However, since only a few of the most proactive site administrators are likely to read this bug report, this should be considered a critical bug for ordinary MediaWiki users, and casual administrators that don't read bug reports.


Version: 1.17.x
Severity: normal

Details

Reference
bz30883

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 11:55 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz30883.
bzimport added a subscriber: Unknown Object (MLST).

Whoops, %3A should be %5D.

Also, I found a link that seems to be employing a workaround, but I haven't yet figured out how to get it to work on my own test wiki:

[http://semanticweb.org/wiki/Special:Ask/-5B-5BCategory:Person-5D-5D-20-5B-5B:%2B-7C-7CUser:%2B-5D-5D/-3FAffiliation/sort%3D/order%3DASC query for all persons on semanticweb.org]

From here:

http://semantic-mediawiki.org/wiki/Help:Semantic_search

The URL uses hyphens in place of % for problematic characters. It is an external link, but maybe it would work as an internal link. It's a clue I'll follow to see if there's a better workaround (for site admins), so I'll keep testing.
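To see what that link actually carries, here is a minimal Python sketch that reverses the hyphen escapes. It assumes that every `-XX` pair of hex digits stands for the character that `%XX` would, which is inferred only from the example URL above. The helper name `decode_hyphen_escapes` is hypothetical, and this is not Semantic MediaWiki's own decoder.

```python
import re
from urllib.parse import unquote


def decode_hyphen_escapes(segment: str) -> str:
    """Turn '-XX' hex escapes into characters, then percent-decode the rest.

    Assumption: '-XX' means the same as '%XX'. A literal hyphen followed by
    two hex digits would be misread by this simple sketch.
    """
    dehyphenated = re.sub(
        r"-([0-9A-Fa-f]{2})",
        lambda match: chr(int(match.group(1), 16)),
        segment,
    )
    # Remaining '%XX' escapes (such as %2B) are ordinary percent-encoding.
    return unquote(dehyphenated)


# Query portion copied from the semanticweb.org link quoted above.
query = "-5B-5BCategory:Person-5D-5D-20-5B-5B:%2B-7C-7CUser:%2B-5D-5D"
decoded = decode_hyphen_escapes(query)
print(decoded)
```

Under that assumption the query contains literal square brackets once decoded, which is why it only survives as part of an external link.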

[ and ] are not allowed characters in MediaWiki page titles; escaping them in the link doesn't actually turn them valid.

Note also that ? and & are valid title characters -- so eg [[SomePage?&q=foo]] links to a page called "SomePage?&q=foo" -- it does not link to a page "SomePage" with query parameters added to the URL.

There are many reasons why an internal URL will have URL encoded brackets that don't involve a page being titled with brackets. The example I already gave shows this in parameters (that are passed to an extension, probably).

It sounds like this functionality has been deliberately blocked, and there's no good reason for it. MediaWiki apparently has no problem with brackets in parameters, so there's no reason to block it in internal links.

I noticed that FF does not URI-encode the brackets when you cut-n-paste a URL from the location bar. Is this bug possibly related to that cut-n-paste problem?

Opera, IE, Chrome, etc do not seem to have this problem and there is an upstream bug for it.

In short, if you are copying a URL from your browser into the wikitext, it is the responsibility of the browser to URI encode the brackets. FF has a bug that causes this not to happen. Other browsers do not.

tagging upstream since otherwise this is WONTFIX

I'm using Opera, so that's not the problem.

I wouldn't expect this to work if the brackets weren't URL-encoded, because that would be inviting the parser to parse the in-URL brackets, which is not what we want it to do. We only want it to parse the wiki link, and ignore the URL.

Right now, it's not ignoring the URL, it's apparently parsing the URL-encoded brackets to produce a decision to not parse the whole thing, which is where the problem lies. It should not be messing with the URL at all, it should just produce the link, and assume the user made a link because they wanted a link.

According to Brion Vibber, it is assuming the user is trying to link to an invalid title page, which is wrong, and then it is assuming that it's in the best interest of the user to fail to do as it is commanded, which is wrong again.

What does "upstream" mean?

I'm confused, bug report seems to be:
*Square brackets are never allowed in an internal link
*The subpage part of a special page name could potentially contain [ or ] (because it is not bound by the normal legal title character rules). So could an interwiki link, I suppose, if you're abusing that feature. (Are there use cases I missed here? Please speak up if there are.)

Well, Firefox might not properly encode such characters. (Are such characters even illegal in URLs? You obviously can't use them in the host part for compat with IPv6, but for the rest of the URL, I don't see why they necessarily need to be encoded.) I don't see that as being particularly relevant to this bug.

However, at the same time, I think this is a bug that could reasonably be wontfixed or at least "lowest" priority. Any solution should make sure that such links are only valid for special pages and possibly interwikis(?) imo. But if it works for interwikis, then it should work for normal pages. I suppose it could just link to the "Invalid page title" page if it's not a special page. (The more I think, the more this sounds like a wontfix to me.)

Anyways, with that in mind, removing keyword upstream.

> What does "upstream" mean?

Basically "upstream" is a dev way of saying "somebody else's problem" (i.e. software we use in making MediaWiki is up the stream, and software that uses MediaWiki would be "downstream").

For use cases, I discovered the bug when trying to make an internal link directly to Semantic MediaWiki's Semantic Search form, so users could easily find dynamically generated lists of stuff. For example, here's a link to a list of cities and their populations:

http://semantic-mediawiki.org/wiki/Special:Ask?title=Special%3AAsk&q=%5B%5BCategory%3ACity%5D%5D&po=%3FPopulation&sort_num=&order_num=ASC&eq=yes&p%5Bformat%5D=broadtable&p%5Blimit%5D=20&p%5Boffset%5D=0&p%5Bheaders%5D=show&p%5Bmainlabel%5D=&p%5Blink%5D=all&p%5Bintro%5D=&p%5Boutro%5D=&p%5Bdefault%5D=&eq=yes

The query parameter strings in that URL contain URL-encoded square brackets. There's no good reason why I should not be able to produce an internal link to such a common type of thing. Semantic MediaWiki just happens to be where I discovered the bug, but it could occur in any situation where the URL contains URL-encoded brackets, for a form, or whatever.
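As an illustration of what those encoded parameters contain, the following Python sketch splits a shortened copy of the Special:Ask URL above with the standard library. Only a handful of the original parameters are kept for brevity. Nothing here is MediaWiki code.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the Special:Ask URL above (a few parameters only).
url = (
    "http://semantic-mediawiki.org/wiki/Special:Ask"
    "?title=Special%3AAsk&q=%5B%5BCategory%3ACity%5D%5D&po=%3FPopulation"
    "&p%5Bformat%5D=broadtable&p%5Blimit%5D=20"
)

parts = urlsplit(url)
params = parse_qs(parts.query)  # percent-decodes both names and values

print(parts.path)
for name, values in params.items():
    print(name, "=", values[0])
```

Both the parameter values (`[[Category:City]]`) and the parameter names (`p[format]`) decode to text containing square brackets, so the encoded brackets in such URLs are data for the special page rather than part of a page title.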

Currently the parser is decoding URL-encoded characters to arrive at the incorrect conclusion that it is a wiki title with illegal characters, and it should not be doing that. It should just make the link.

With Bawolff's information, it sounds like instead of just making the link, it should more specifically ignore URL-encoded characters in URL parameters, special pages, etc, as he described. I don't understand all those details, so I can only report what's going wrong. I'm not sure what's going right :)

To be more clear, the parser should not decode URL-encoded characters in URL parameters. URL parameters are described here as occurring after the question mark character:

http://en.wikipedia.org/wiki/URI_scheme

Parameters are separated by ampersands, but the first parameter does not require it. On MediaWiki URLs, the first parameter is the page title, and is the only one that needs validity checking. Parameters after the page title, preceded by & do not need to be checked, because they could be anything, including brackets, as long as they're URL-encoded.

Again, please note that anything appearing after a "?" in a wiki link *IS NOT A QUERY STRING*, but is simply part of the title.

& is a valid title character
? is a valid title character
[ is not a valid title character
] is not a valid title character

I understand that you *want* to make a query string link, but that's not what you're doing as far as MediaWiki knows.

You can use [{{fullurl:Title|param=1|param=2}} link text ] or such, which will create a URL link that points to the same place and actually has a query string.

That is not correct. See:

http://en.wikipedia.org/wiki/URI_scheme

For example, various forms of the same page:

http://en.wikipedia.org/wiki/URI_scheme?&action=purge
http://en.wikipedia.org/wiki/URI_scheme?action=purge
http://en.wikipedia.org/w/index.php?title=URI_scheme&action=purge

None of those parameters are part of the "URI scheme" title. The MediaWiki parser has a bug, if it thinks they are, as I've already stated. The problem is not me, it isn't the URI standard, and it isn't even MediaWiki per se. It is the parser.

It is a valid bug.
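For reference, the URL-level view described in this comment can be sketched in Python as below. The helper `title_and_params` is hypothetical, and the `/wiki/` and `/w/index.php` path layout is assumed from the three URLs above. It shows how a web URL is split, not how the wikitext parser treats text inside [[ and ]], which is the point under dispute in this thread.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def title_and_params(url: str):
    """Return (page title, remaining query parameters) for a wiki URL.

    Assumption: titles appear either after '/wiki/' in the path or in a
    'title' query parameter, as in the example URLs.
    """
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # empty pairs such as '?&' are skipped
    if parts.path.startswith("/wiki/"):
        title = unquote(parts.path[len("/wiki/"):])
    else:
        title = params.pop("title", [""])[0]
    return title, params


urls = [
    "http://en.wikipedia.org/wiki/URI_scheme?&action=purge",
    "http://en.wikipedia.org/wiki/URI_scheme?action=purge",
    "http://en.wikipedia.org/w/index.php?title=URI_scheme&action=purge",
]
for url in urls:
    print(title_and_params(url))
```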

Short version: don't see a bug here, seems to (at least for my testcases) behave as Brion says.

This seems pretty simple to me:

this wikitext:

* [[foo?bar]]
* [[foo&bar]]
* [[foo&amp;bar]]
* [[foo%20bar22]]
* [[foo_bar22]]
* [[foo%26bar]]

is rendered into this <ul>:

<ul><li> <a href="/w/index.php?title=Foo%3Fbar&amp;action=edit&amp;redlink=1" class="new" title="Foo?bar (page does not exist)">foo?bar</a>
</li><li> <a href="/w/index.php?title=Foo%26bar&amp;action=edit&amp;redlink=1" class="new" title="Foo&amp;bar (page does not exist)">foo&amp;bar</a>
</li><li> <a href="/w/index.php?title=Foo%26bar&amp;action=edit&amp;redlink=1" class="new" title="Foo&amp;bar (page does not exist)">foo&amp;bar</a>
</li><li> <a href="/w/index.php?title=Foo_bar22&amp;action=edit&amp;redlink=1" class="new" title="Foo bar22 (page does not exist)">foo bar22</a>
</li><li> <a href="/w/index.php?title=Foo_bar22&amp;action=edit&amp;redlink=1" class="new" title="Foo bar22 (page does not exist)">foo_bar22</a>
</li><li> <a href="/w/index.php?title=Foo%26bar&amp;action=edit&amp;redlink=1" class="new" title="Foo&amp;bar (page does not exist)">foo&amp;bar</a>
</li></ul>

which my browser renders as:

  • foo?bar
  • foo&bar
  • foo&bar
  • foo bar22
  • foo_bar22
  • foo&bar

http://en.wikipedia.org/wiki/Special:ExpandTemplates?input=*+%5B%5Bfoo%3Fbar%5D%5D%0D%0A*+%5B%5Bfoo%26bar%5D%5D%0D%0A*+%5B%5Bfoo%26amp%3Bbar%5D%5D%0D%0A*+%5B%5Bfoo%2520bar22%5D%5D%0D%0A*+%5B%5Bfoo_bar22%5D%5D%0D%0A*+%5B%5Bfoo%2526bar%5D%5D
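The mapping from link target to href in that output can be approximated with a short Python sketch. It is a simplification inferred from the rendered HTML above, not MediaWiki's Title code: it assumes entity and percent escapes are decoded, underscores and spaces are interchangeable, and the first letter is capitalised. The function names are hypothetical.

```python
from html import unescape
from urllib.parse import quote, unquote


def normalize_title(target: str) -> str:
    """Approximate the displayed title for an internal link target."""
    text = unquote(unescape(target))  # &amp; -> &, %20 -> space, %26 -> &
    text = text.replace("_", " ").strip()
    return text[:1].upper() + text[1:]


def redlink_href(target: str) -> str:
    """Build the edit URL used for a link to a page that does not exist."""
    db_key = normalize_title(target).replace(" ", "_")
    return "/w/index.php?title=" + quote(db_key, safe="") + "&action=edit&redlink=1"


for target in ["foo?bar", "foo&bar", "foo&amp;bar",
               "foo%20bar22", "foo_bar22", "foo%26bar"]:
    print(target, "->", redlink_href(target))
```

In every case the `?` or `&` ends up percent-encoded inside the `title` parameter, which matches the observation that the whole link target is treated as a page name.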

> The problem is not me, it isn't the URI standard, and it isn't even MediaWiki per se. It is the parser.

The parser is doing the job it's supposed to. Internal links are not URIs, and thus aren't treated like URIs. Anything inside [[ and ]] should be a pagename, and only a pagename (well, plus a | to specify the alternate text for the link).

There might be a limited argument that special page titles don't have the same restrictions in their "titles", so one should be able to link to those extended titles in internal links, but it's not all that convincing an argument.

I'm going to go ahead and (re)-close this wontfix.

I see the issue now, thanks Jeremy. It appears that since ? and & are valid for page titles, there's currently no way for the MediaWiki parser to tell whether it's looking at a parameter, or a title, so it just assumes it's looking at a title.

The workaround for this being to use an external link to an internal page, with the plainlinks class:

<span class="plainlinks">[http://someurl.com/SomePage?&q=%5B%5BsomeParameter%5D%5D]</span>

and with $wgExternalLinkTarget = '_self'; or just left at the default (false, which adds no target attribute, so links open in the same window as with _self), so the behavior is identical to internal links, with the extra work of using a span class on each link.

That workaround is similar to the magic word workaround Brion suggested:

[{{fullurl:Title|param=1|param=2}} link text ]

though I haven't tested it, since it's more complicated than my workaround with the plainlinks class.
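For anyone generating many such links, the plainlinks workaround can be sketched in Python as follows. The helper `plainlink` and the link label are hypothetical. It only assembles wikitext as a string and assumes the label itself contains no `]`.

```python
from urllib.parse import quote


def plainlink(base_url: str, params: dict, label: str) -> str:
    """Build external-link wikitext wrapped in a plainlinks span.

    Every parameter name and value is percent-encoded, so brackets and
    spaces cannot terminate the external link early.
    """
    query = "&".join(
        f"{quote(name, safe='')}={quote(value, safe='')}"
        for name, value in params.items()
    )
    return f'<span class="plainlinks">[{base_url}?{query} {label}]</span>'


wikitext = plainlink(
    "http://someurl.com/SomePage",
    {"q": "[[someParameter]]"},
    "results",
)
print(wikitext)
```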

It appears the parser must decode URL-encoded characters in internal links in order to make characters that will trigger an invalid-title error. This seems like the wrong job for the parser, and instead MediaWiki itself should be doing that by sending users to an error page explaining that certain characters are not allowed in links.

That way, the parser can parse only what it needs to parse, so it won't screw up things it isn't designed to understand. MediaWiki itself understands what is a page title and what is a query string, so the parser should not hijack that role, because doing so causes problems related to the parser's inability to know what it is parsing.

Sound reasonable?

We present an error if people go to an illegal title, in addition to only linking internal links that contain characters from Title::legalChars() (as well as % for percent-encoding [not that it's needed] and # for fragments). MediaWiki is designed to use redlinks to help guide people to create new articles. If there were red links to invalid page titles, it'd be confusing.

> It appears the parser must decode URL-encoded characters in internal links in order to make characters that will trigger an invalid-title error

The parser needs to normalize things like [[%65]] -> [[e]], since such things are considered a perfectly valid Title. In fact, the parser generally needs to turn an internal link into a Title object. Invalid titles don't really have mappings to Title objects.
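The decode-then-validate order described here can be sketched in Python as below. The set of illegal characters is a simplified assumption for illustration, not the real Title::legalChars() definition, and `link_target_is_valid` is a hypothetical name.

```python
from urllib.parse import unquote

# Assumption: a simplified set of characters that cannot appear in a title.
ILLEGAL_TITLE_CHARS = set("[]{}|<>")


def link_target_is_valid(target: str) -> bool:
    """Percent-decode the link target first, then check the decoded title."""
    title = unquote(target)
    return title != "" and not (set(title) & ILLEGAL_TITLE_CHARS)


for target in ["%65",
               "SomePage?&q=someParameter",
               "SomePage?&q=%5BsomeParameter%5D"]:
    print(target, "->", unquote(target), link_target_is_valid(target))
```

Because decoding happens before validation, %5B and %5D are indistinguishable from literal brackets by the time the title is checked, which is the behaviour reported in this bug.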

OK, makes good sense, thank you.

I didn't check thoroughly, but it does appear that all the form extensions use special pages to process the forms:

http://www.mediawiki.org/wiki/Category:Form_extensions

Semantic MediaWiki, and probably many others, are not form extensions per se, but have a special page containing a form. I'm betting that all of these would trigger this bug if an internal link were made to the results of any of these forms. That's a lot of mysteriously non-working internal links.

Do you agree that internal links not working for an entire class of internal links is a problem? I don't think this should be dismissed. If the technical solution is for the parser to treat special pages internal links differently, then that's probably what should be done, right?

This bug is related:

https://bugzilla.wikimedia.org/show_bug.cgi?id=11477

Would solving that bug be a good solution to this one?

If there are no objections, I just changed this bug to Verified Duplicate because bug 11477 is an indirect fix for this bug. So, it's not a wontfix, and it has apparently been verified by many others.

Thanks to everyone for looking into this.

*** This bug has been marked as a duplicate of bug 11477 ***

I'm going to update the documentation to discuss that links with parameters need to contain the whole URL in an external link format, with the plainlinks class. I don't think making an exception here in this bug for just special pages will be nearly as useful as just fixing bug 11477, which solves both problems.