
PAGENAMEE magic word does not work properly when used in target links of files and pagename contains a semicolon
Closed, Declined · Public

Description

Links under files, like [[File:Some file|link=...]] are incorrect if created using PAGENAME or PAGENAMEE magic words and the page name used to create that link contains a semicolon in its name.

Instead of a non-encoded (;) or URL-encoded (%3B) semicolon, an HTML-entity-encoded one (&#59; further percent-encoded as %26%2359%3B) appears in the URL.
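For illustration, the doubly-encoded sequence can be reproduced with any generic percent-encoder (Python's urllib shown here; this is not MediaWiki's own code path):

```python
from urllib.parse import quote

# The magic word emits the semicolon as the HTML entity '&#59;'.
# Percent-encoding that entity, rather than the raw ';', yields the
# garbled sequence seen in the generated URLs:
print(quote("&#59;", safe=""))  # '%26%2359%3B'
print(quote(";", safe=""))      # '%3B' (the expected URL-encoded form)
```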

Testcase is shown here:

https://test.wikipedia.org/wiki/See;_or_not

See also:

Event Timeline

Ankry renamed this task from PAGENAMEE magic word does not work properly when used in target links of files to PAGENAMEE magic word does not work properly when used in target links of files and pagename contains a semicolon. Dec 11 2017, 9:44 PM

Change #1130262 had a related patch set uploaded (by 0xDeadbeef; author: 0xDeadbeef):

[mediawiki/core@master] wfUrlencode: properly escape semicolons

https://gerrit.wikimedia.org/r/1130262

Change #1130262 abandoned by 0xDeadbeef:

[mediawiki/core@master] wfUrlencode: properly escape semicolons

Reason:

dup change id, recreating changes

https://gerrit.wikimedia.org/r/1130262

Change #1130265 had a related patch set uploaded (by 0xDeadbeef; author: 0xDeadbeef):

[mediawiki/core@master] wfUrlencode: properly escape semicolons

https://gerrit.wikimedia.org/r/1130265

What is the problem exactly?

In Firefox:

wfUrlencode: properly escape semicolons
Per RFC 1738 (Section 3.3, HTTP), the ";" character is reserved. Even
though it may not cause issues in libraries that normally ignore
semicolons and parse as usual, at least one Python library (CherryPy)
treats semicolons as a query separator and causes issues. See also
https://en.wikipedia.org/wiki/Special:PermaLink/1281102234#Error_in_Toolforge_Code
where this came up (due to FULLPAGENAMEE usage)

It is okay for a library to have a feature to parse query parameters. Lots of libraries do, CherryPy is not unique or unusual in that regard.

If someone is calling that library with a string known to be a title, but asking it to interpret it as a URL, then the problem lies in that call, not in the shape of the input or the logic of the library.

I could be wrong, but please share a small reproducible example on the task, so that we can take a closer look.

See also https://wikitech.wikimedia.org/wiki/URL_path_normalization about the sensitive nature of canonical URL encodings and how these can cause widespread issues in browsers and caching if not handled well.

; is a reserved character in URL queries because it is unspecified whether something like https://www.example.org?something=something;else&foo=bar should parse as {something: "something;else", foo: "bar"}, as {something: "something", else: "", foo: "bar"}, or as {something: "something", foo: "bar"}. I have seen all three while crawling through this issue. The maintainer of CherryPy has put up a valid rationale for parsing it the second way: the HTTP RFCs explicitly allow parsers to treat ; the same as & in URLs. See https://github.com/cherrypy/cherrypy/issues/1860#issuecomment-640246780. I don't agree with that decision, but there is nothing I can do about it, and it is much better to just encode this properly anyway.
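The ambiguity can be demonstrated with Python's standard library (an illustrative sketch, not CherryPy itself):

```python
import re
from urllib.parse import parse_qs

qs = "something=something;else&foo=bar"

# Ampersand-only parsing (the modern default in most libraries):
# ';' stays a literal character inside the value.
print(parse_qs(qs, separator="&"))
# {'something': ['something;else'], 'foo': ['bar']}

# A parser that treats ';' the same as '&' (CherryPy-style) instead
# sees three fields, splitting the first value in half:
print(re.split(r"[&;]", qs))
# ['something=something', 'else', 'foo=bar']
```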

The original code wrongly assumes that PHP is merely being conservative in encoding ; and that the character can safely be passed through unencoded. That is false, which is why I'm putting up a fix.

I understand that this proposed change will solve your use case, but there are other considerations at play. It is not as simple or obvious as it may seem, and this particular approach would not scale in general.

To decide how to solve this, we first need to understand the problem, and identify potential other less-dangerous solutions. Let's zoom out to the high-level problem before we distract ourselves with IETF RFCs and CherryPy implementation details.

Why is it considered valid to pass a MediaWiki-encoded title to a CherryPy function intended for parsing URLs? Where does this happen? Where is the code using CherryPy? Where is it getting the MediaWiki-encoded title from?

You're misunderstanding the issue. I understand that this proposed change would solve your use case, but it is not a workable or scalable solution for the ecosystem in general, nor is it actually needed, because the problem at hand has a different cause and needs a different fix.

From git blame, an ancient change apparently believed, erroneously, that ; can be forwarded unencoded in URLs. I'm not sure why you'd think properly encoding it now would not be workable or scalable, but I guess that's the thing with legacy codebases: a correct change gets blocked because it has too much impact.

I'm not misunderstanding the issue. I am fully aware of what the issue is and even if the issue I'm referring to isn't related to this ticket, it is still an issue. Please take some time to read what I wrote.

If you want to see the problem solved, please ensure that another person can get a detailed understanding of the problem and can identify other possible solutions. Let's zoom out to first understand the high-level problem before we distract ourselves with IETF RFCs and CherryPy implementation details.

Fair. The specific situation is a manifestation of the general issue. It is the example I used to get an understanding of the problem. I'm going to elaborate it now.

Why is it considered valid to pass a MediaWiki-encoded title

Because FULLPAGENAMEE is currently used as a query parameter for external URLs. (https://sigma.toolforge.org/usersearch.py?page={{FULLPAGENAMEE}}&server=enwiki) Based on the Wikitech article about URL normalization, that seems like the best solution to do so.

to a CherryPy function intended for parsing URLs?

The entire URL is defined in the template. The FULLPAGENAMEE is used as a query parameter.

Where does this happen?

The link the template generates puts titles with ; as a query parameter without encoding. The user clicks on the link which is handled by the CherryPy tool server instance.

Where is the code using CherryPy?

The tool hosted on Toolforge.

Where is it getting the MediaWiki-encoded title from?

The URL of the HTTP request sent to the tool, which comes from the link the user follows; that template uses FULLPAGENAMEE to help create the URL.
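Putting the answers above together, the failure can be sketched roughly like this (the both-separators parsing is illustrative; this is not Sigma's actual code):

```python
import re

title_e = "See;_or_not"  # what {{FULLPAGENAMEE}} emits: ';' left unescaped
url = f"https://sigma.toolforge.org/usersearch.py?page={title_e}&server=enwiki"

# A CherryPy-style parser that treats ';' like '&' sees three pairs,
# truncating the 'page' parameter at the semicolon:
query = url.split("?", 1)[1]
pairs = dict(
    p.split("=", 1) if "=" in p else (p, "")
    for p in re.split(r"[&;]", query)
)
print(pairs)  # {'page': 'See', '_or_not': '', 'server': 'enwiki'}
```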

Thanks for explaining. Okay, so the actor here is the MediaWiki:Histlegend template on en.wikipedia.org, and it is seeking to transmit data to an external webservice (https://sigma.toolforge.org/usersearch.py) which accepts data in a particular URL format.

This is not uncommon. All sorts of services take in data in all sorts of formats. Nothing about this scenario requires the URL structure for MediaWiki itself to change, e.g. en.wikipedia.org/wiki/:title. The consumers of such URLs are:

  • For visitors, a web browser like Firefox, or http client like cURL or python-requests. They need to be able to make requests to such URLs, or follow them as redirects.
  • For MediaWiki sysadmins, a web server like Apache or a caching proxy like Varnish needs to be able to correctly cache, proxy, or serve such requests.
  • For bots such as a search engine crawler, or wget, they may need to extract and parse URLs from our HTML responses (e.g. <a href=>).

If one of these has a compatibility problem, regardless of what any IETF RFC says, that would justify (and in the past has justified) the kind of effort and analysis required for a high-risk change to MediaWiki's canonical URL format. So long as the characters we use are valid as literals in URL paths, their meaning is only relevant to MediaWiki and its web server(s). It is out of scope for consumers to need to decode or parse our query parameters. They are certainly free to do so, but then they would have to follow our patterns; it is unreasonable to expect our URLs to parse correctly in a random framework. Consumers perceive the title as part of a URL path, and the contents of that path are a black box passed back to MediaWiki.

I don't think this is incompatible with what you're seeing. After all, Sigma isn't actually encountering and trying to parse a MediaWiki URL. What Sigma and CherryPy do is perfectly reasonable as far as I'm concerned.

Any other encounter with a MediaWiki title would be the responsibility of the actor transmitting that data. For example, one would not expect a piece of JSON from /w/api.php, or HTML generated from {{PAGENAME}} or {{PAGENAMEE}}, to be safe and valid for use in a SQL query or HTML attribute. I get why something described as "URL encoded" seems like it should "just work", but there's more to it. For one, there isn't just one "URL encoding".

  • As an example, various REST APIs these days use slashes as argument separators. Imagine something like /rest.php/page/:title/revisions and /rest.php/page/:title/wikitext. This kind of API requires slashes within a title to be percent-encoded. That is the responsibility of the API client. It does not mean Wikipedia can't use slashes in its own URLs, like https://en.wikipedia.org/wiki/User_talk:Krinkle/Archive_1.
  • There are other examples one could find for every "special" (but valid) URL path character, including !, (), :, ?, & and ;.
  • Some services require + for spaces, others %20.

The requirements of other services are the responsibility of the consumers of those services.

The humble semicolon ;, too, is a reasonable delimiter for an external service to use. I do find it questionable that CherryPy and Sigma would be unwilling to choose one separator style and stick to it for a given site or tool (I get why CherryPy may not want a semver-major change forcing admins to pick one, but it seems reasonable to offer it as an option to standardize within one instance of CherryPy). For example, if Sigma mainly generated semicolon-style URLs itself, then no-one would try to create URLs to it with a function intended for ampersand-style URLs. Anyway, this doesn't change our question.

The question is: how may an editor on a MediaWiki site use wikitext to safely encode a text value for transmission to a service that uses this kind of format?

  • {{PAGENAME}} plain text
  • {{PAGENAMEE}} "encoded for use in MediaWiki URLs" (emphasis mine), which exists specifically for creating MediaWiki URLs.
  • {{urlencode:input}} or {{urlencode:input|QUERY}}, which maximally URL-encodes values to conform to a safe subset of RFC 1738, with spaces encoded as plus +.
  • {{urlencode:input|PATH}}, Idem, but without encoding tilde ~ and encoding spaces as %20.
  • {{urlencode:input|WIKI}}, MediaWiki's "pretty URL" encoding, effectively the same as {{PAGENAMEE}}, and exists solely for creating MediaWiki URLs.
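For a rough sense of how these modes differ, here are approximate Python analogues (an illustrative sketch; MediaWiki's actual implementations are in PHP and differ in edge cases such as the tilde):

```python
from urllib.parse import quote, quote_plus

title = "Foo bar; baz"

# QUERY: maximal encoding with spaces as '+' (akin to PHP's urlencode)
print(quote_plus(title))            # 'Foo+bar%3B+baz'

# PATH: maximal encoding with spaces as '%20' (akin to PHP's rawurlencode)
print(quote(title, safe=""))        # 'Foo%20bar%3B%20baz'

# WIKI: spaces become underscores, and most punctuation (including,
# per this task, the semicolon) is left unescaped
print(quote(title.replace(" ", "_"), safe=";"))  # 'Foo_bar;_baz'
```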

See also:

What you want is probably something like {{urlencode: data | QUERY}}, which does what you need today.

I do note that the specific case of passing {{PAGENAME}} does not work currently, because {{PAGENAME}} is intended for displaying text in markup, and thus outputs special characters as HTML entities to avoid being misinterpreted as wikitext syntax (e.g. to avoid creating italics, bullet points, and such after variable substitution has taken place in the Parser). This means passing it as input to {{urlencode:}} does not achieve the intended effect. The bug for that is: T15288: urlencode on variables get double-encoded.
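A sketch of the double-encoding, modelling what {{#titleparts:}} effectively does with html.unescape (illustrative Python, not the Parser's actual code):

```python
import html
from urllib.parse import quote_plus

# {{PAGENAME}} on the task's test page emits the ';' as an HTML entity:
pagename_output = "See&#59; or not"

# Feeding that straight to a QUERY-style encoder encodes the entity itself:
print(quote_plus(pagename_output))                 # 'See%26%2359%3B+or+not'

# {{#titleparts:}} effectively hands back the plain title, so the entity
# is resolved before encoding (modelled here with html.unescape):
print(quote_plus(html.unescape(pagename_output)))  # 'See%3B+or+not'
```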

[…]
What you want is probably something like {{urlencode: data | QUERY}}, which does what you need today.

I do note that the specific case of passing {{PAGENAME}} does not work currently, because {{PAGENAME}} is intended for displaying text in markup, and thus outputs special characters as HTML entities to avoid being misinterpreted as wikitext syntax […]. The bug for that is: T15288: urlencode on variables get double-encoded.

To pass the current title as-is to another parser function (i.e. without any encoding), you can use {{#titleparts:}} instead, like so:

{{urlencode:{{#titleparts:{{PAGENAME}}}}|QUERY}}

Or, in Lua as mw.uri.encode(mw.title.getCurrentTitle().prefixedText).

See also https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions##titleparts which talks about which magic words output text for display in markup, and which are for parser functions.

Thank you so much. This worked. Sorry for not being well-versed enough in parser functions to have known about #titleparts... I'll abandon that change now that it isn't needed. ^^

Change #1130265 abandoned by 0xDeadbeef:

[mediawiki/core@master] wfUrlencode: properly escape semicolons

Reason:

no longer needed (original issue resolved through parser functions)

https://gerrit.wikimedia.org/r/1130265

Pppery closed this task as Declined. Edited Mar 30 2025, 6:41 PM
Pppery subscribed.

Per Krinkle.

Presumably whatever use case led to the creation of this task in 2017 could, and should, have been solved in the same way as the 2025 resurgence.