{{PAGENAME}} must not escape special chars, otherwise it makes {{#ifeq:}} unusable
Closed, Declined (Public)

Description

{{PAGENAME}} must not escape special chars, otherwise it makes {{#ifeq:}} unusable.

{{#ifeq:{{PAGENAME}}|Q & A|true|false}} returns false on page with title "Q & A" because & is converted to &#38;

Obviously the same wrong behavior occurs with ' and " in page names.

The same goes for {{FULLPAGENAME}}.
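The mismatch can be reproduced outside MediaWiki. The Python sketch below is a simplified stand-in, not MediaWiki's code: `escape_pagename` and `ifeq` are hypothetical helpers, and the assumption that only &, ' and " are replaced by numeric character references is taken from this report rather than from the parser source.

```python
# Hypothetical stand-in for the escaping applied to {{PAGENAME}} output.
# Assumption: only &, ' and " are replaced, each by a numeric character reference.
ESCAPES = {"&": "&#38;", "'": "&#39;", '"': "&#34;"}


def escape_pagename(title: str) -> str:
    """Replace each special character with its numeric character reference."""
    return "".join(ESCAPES.get(ch, ch) for ch in title)


def ifeq(left: str, right: str) -> bool:
    """Plain string comparison after trimming, loosely like {{#ifeq:}}."""
    return left.strip() == right.strip()


title = "Q & A"
print(escape_pagename(title))               # Q &#38; A
print(ifeq(escape_pagename(title), title))  # False
```

The comparison fails only because one side went through the escaping step and the other did not.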


Version: unspecified
Severity: normal

bzimport set Reference to bz35746.
Danny_B created this task. Via Legacy, Apr 6 2012, 1:49 AM
Beta16 added a comment. Via Conduit, Apr 6 2012, 7:27 AM

See also bug 16474 and bug 35628

Nikerabbit added a comment. Via Conduit, Apr 6 2012, 7:34 AM

Would you rather have broken text when '' in a page name triggers italics in the middle of a message? That is the reason it does the escaping.

Not critical, because there are easy workarounds, starting with {{PAGENAME:Q & A}}.

MarkAHershberger added a comment. Via Conduit, Apr 6 2012, 2:30 PM

Already discussed in bug 35628.

*** This bug has been marked as a duplicate of bug 35628 ***
Danny_B added a comment. Via Conduit, Apr 7 2012, 11:27 PM

Although discussed in bug 35628, this is a bit different.

That bug wants to escape parser functions; this bug wants to unescape magic words.

Bawolff added a comment. Via Conduit, Apr 7 2012, 11:42 PM

(In reply to comment #0)
> {{PAGENAME}} must not escape special chars, otherwise it makes {{#ifeq:}}
> unusable.
>
> {{#ifeq:{{PAGENAME}}|Q & A|true|false}} returns false on page with title "Q &
> A" because & is converted to &#38;
>
> Obviously same wrong behavior with ' and " in page names.
>
> Same goes with {{FULLPAGENAME}}.

I disagree. I think #ifeq et al should unescape their args.

(I suppose that would make &amp; == & but I don't entirely think that is a bad thing).
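A minimal sketch of that idea in Python, assuming the comparison simply decodes HTML character references on both sides before comparing. `ifeq_unescaped` is a hypothetical name, and `html.unescape` stands in for whatever decoding the parser would actually use.

```python
import html


def ifeq_unescaped(left: str, right: str) -> bool:
    """Decode HTML character references on both sides, then compare."""
    return html.unescape(left).strip() == html.unescape(right).strip()


print(ifeq_unescaped("Q &#38; A", "Q & A"))  # True
print(ifeq_unescaped("&amp;", "&"))          # True: the side effect noted above
```

The second call shows the trade-off: an entity and the character it stands for can no longer be told apart by the comparison.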

Danny_B added a comment. Via Conduit, Apr 7 2012, 11:53 PM

(In reply to comment #5)
> I disagree. I think #ifeq et al should unescape their args.

Well, that's a third approach. Wanna submit a new bug about it, so that later on it can be decided which approach is to be taken and the other bugs can be closed in favour of that one?

Bawolff added a comment. Via Conduit, Apr 8 2012, 12:31 AM

(In reply to comment #6)
> (In reply to comment #5)
> > I disagree. I think #ifeq et al should unescape their args.
>
> Well, that's third approach. Wanna submit a new bug about it so later on it can
> be decided which approach is to be taken and other bugs can be closed in favour
> of that one?

I'd prefer we just kept discussion on one bug. Bugs should be about problems, not the solutions imho.

The reason I prefer to keep the escaping in {{PAGENAME}} is that it was introduced to work around the problem of a page named "*foo" starting a list when you put {{PAGENAME}} in a page.

Verdy_p added a comment. Via Conduit, Jan 29 2014, 1:22 PM

The safest way to compare page names is to pass them BOTH through {{PAGENAME:pagename}}, or BOTH through {{PAGENAMEE:pagename}}. If you also want to compare their namespaces, pass both page names as parameters to {{FULLPAGENAME:pagename}} so that the namespace of the given page name is not parsed and removed.

Note that these functions will also resolve relative paths in subpages and FULLPAGENAME(E) will also resolve the namespace.

So:

{{#ifeq: {{PAGENAME}}|Q & A|true|false}}

will always be false on every page, but the following will work:

{{#ifeq: {{PAGENAME}}|{{PAGENAME:Q & A}}|true|false}}

as it will return "true" on the expected page.

With full page names where you also check the namespace:

{{#ifeq: {{FULLPAGENAME}}|{{FULLPAGENAME:Q & A}}|true|false}}

will also return true, but only in the main namespace (it will be false on a Category page named "Category:Q & A", because the second parameter of "#ifeq" gets the full page name of the page "Q & A" in the main namespace).
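The rule "normalize both sides the same way" can be sketched in Python. Everything here is hypothetical and heavily simplified: `pagename` and `full_pagename` only imitate the behavior described in this comment, the namespace set is an illustrative subset, and only & is escaped for brevity.

```python
# Illustrative subset of namespaces; a real wiki defines many more.
NAMESPACES = {"Category", "Template", "Talk"}


def escape(text: str) -> str:
    """Hypothetical HTML-style escaping of & only, for brevity."""
    return text.replace("&", "&#38;")


def split_namespace(title: str) -> tuple[str, str]:
    """Split 'Category:Q & A' into ('Category', 'Q & A'); main namespace is ''."""
    prefix, sep, rest = title.partition(":")
    if sep and prefix in NAMESPACES:
        return prefix, rest
    return "", title


def pagename(title: str) -> str:
    """Drop the namespace, then escape (imitates PAGENAME with an argument)."""
    return escape(split_namespace(title)[1])


def full_pagename(title: str) -> str:
    """Keep the namespace, then escape (imitates FULLPAGENAME with an argument)."""
    namespace, name = split_namespace(title)
    return escape(f"{namespace}:{name}" if namespace else name)


current = "Category:Q & A"
print(pagename(current) == pagename("Q & A"))            # True
print(full_pagename(current) == full_pagename("Q & A"))  # False
print(pagename(current) == "Q & A")                      # False
```

Both sides of the first comparison pass through the same function, so the escaping cancels out; the last line compares an escaped name with a raw string and fails.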


In summary:

  • {{(FULL|BASE|SUB)PAGENAMEE:...}} return URL-encoded names
  • {{(FULL|BASE|SUB)PAGENAME:...}} return HTML-encoded names

There's NO function in MediaWiki that returns the raw pagename.


But note:

{{(FULL|BASE|SUB)PAGENAMEE:...}}

is also different from

{{URLENCODE:{{(FULL|BASE|SUB)PAGENAME:...}}}}

Because in the latter case, URLENCODE takes an HTML-encoded name as its parameter, so the result is double-encoded: the HTML entities (containing the characters & # ;) and the SPACEs are URL-encoded using %nn and +.

In the first case, the MediaWiki-specific URL-encoding performed by PAGENAMEE differs from standard URL-encoding (it does not generate "+" for spaces, but generates underscores).

So:

  1. "{{PAGENAMEE:Q & A}}" in fact returns "Q_%26_A"
  2. "{{PAGENAME:Q & A}}" in fact returns "Q &#38; A"
  3. "{{URLENCODE:{{PAGENAME:Q & A}}}}" in fact returns at least this: "Q+%26%2338;+A". I don't know if URLENCODE also recodes the semicolon; if so, the result will instead be "Q+%26%2338%3B+A". In all cases this will be different from the result of case 1!
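The double-encoding can be checked with Python's standard library. This is only an approximation: `quote_plus` stands in for URLENCODE, `wiki_style` is a hypothetical helper imitating the underscore-based encoding described for PAGENAMEE, and the HTML-encoded input string is assumed rather than produced by MediaWiki.

```python
from urllib.parse import quote, quote_plus

# Assumed output of the HTML-escaping step for the title "Q & A".
html_encoded = "Q &#38; A"

# URL-encoding the already HTML-encoded name encodes the entity itself.
print(quote_plus(html_encoded))  # Q+%26%2338%3B+A


def wiki_style(title: str) -> str:
    """Spaces become underscores, then everything else is percent-encoded."""
    return quote(title.replace(" ", "_"), safe="")


# Encoding the raw title once, wiki-style, gives a different string.
print(wiki_style("Q & A"))  # Q_%26_A
```

In this stand-in the semicolon is percent-encoded as %3B, so the two results differ in both the space handling and the leftover entity.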

This strange behavior means that some characters "permitted" in URLs to MediaWiki sites are transformed in a very strange way, such as:

  1. http://www.mediawiki.org/wiki/Q & A

    This is not directly a valid URL, but the browser transforms it to the URL-encoding of UTF-8 and requests:

    http://www.mediawiki.org/wiki/Q%20&%20A

    The server will accept it and load the page named "Q & A".
  2. http://www.mediawiki.org/wiki/Q+%26%2338%3B+A

    The server parses this URL as containing a URL-encoded page name, so it first URL-decodes it as:

    Q &#38; A

    The server will then parse the result and think it contains an anchor; it will attempt to load a page named only "Q &", with the anchor "38; A" dropped!
  3. Valid page names may contain isolated ampersands as valid characters (internally they are HTML-encoded if you query their {{PAGENAME}}), but some sequences will generate errors, such as "&amp;", while "a amp;" will be accepted...
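The second case in the list above can be traced step by step in Python. This is a sketch of the behavior as the comment describes it, not of MediaWiki's actual request handling; `split_title_and_anchor` is a hypothetical helper.

```python
from urllib.parse import unquote_plus


def split_title_and_anchor(path_segment: str) -> tuple[str, str]:
    """URL-decode the segment, then treat the first '#' as an anchor separator."""
    decoded = unquote_plus(path_segment)
    title, _, anchor = decoded.partition("#")
    return title.rstrip(), anchor


print(unquote_plus("Q+%26%2338%3B+A"))            # Q &#38; A
print(split_title_and_anchor("Q+%26%2338%3B+A"))  # ('Q &', '38; A')
```

Once the decoded text contains a literal "#", anything that splits on it will cut the character reference in half.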

All this is completely inconsistent, but this time it does not occur in parser functions; it occurs at the server API level when handling incoming HTTP(S) requests that may or may not be HTML-encoded, whereas the HTTP standard says that URLs should be ONLY URL-encoded! The server also performs such double-decoding when resolving requests.

Verdy_p added a comment. Via Conduit, Jan 29 2014, 1:51 PM

See also bug 35628 about the weird way the various parser functions interpret (or not) their input (URL-decoding, HTML-decoding, sometimes mixed up!), and how they may or may not reencode their output.

As if this were not already complex within ASCII only, it becomes a nightmare with non-ASCII characters. This is not because they are UTF-8 encoded (that is a convention), but because non-ASCII bytes may or may not represent UTF-8 sequences of a single character: MediaWiki accepts invalid Unicode characters such as U+FFFF when they are pseudo-encoded as UTF-8 and then URL-encoded using %nn hex sequences! At the API level, any %xx-encoded byte is accepted, and the UTF-8 encoding is in fact not enforced.

The server just treats *raw* sequences of bytes, filtering only some ASCII characters, but not restricting at all the range of bytes in 0x80 to 0xFF, and not restricting later the range of 16-bit code units in the full range 0x0020 to 0xFFFF (when they are used in various libraries working with UTF-16 instead of real 21-bit code points).
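To make the byte-level concern concrete, here is a small Python check of whether a percent-encoded string decodes as UTF-8 at all. It says nothing about what MediaWiki itself accepts; `decodes_as_utf8` is a hypothetical helper, and Python's decoder is only one possible definition of "valid".

```python
from urllib.parse import unquote_to_bytes


def decodes_as_utf8(percent_encoded: str) -> bool:
    """Percent-decode to raw bytes and report whether they form valid UTF-8."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(decodes_as_utf8("Q%20%26%20A"))  # True
print(decodes_as_utf8("%FF%FE"))       # False: 0xFF never occurs in UTF-8
print(decodes_as_utf8("%EF%BF%BF"))    # True: U+FFFF is well-formed as bytes
```

The last line illustrates the gap raised here: a noncharacter such as U+FFFF passes a plain UTF-8 well-formedness check, so rejecting it needs a separate code point filter.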

I wonder how this inconsistency could defeat some security restrictions, such as violating access rights on blocked pages. It is possible that one could create some weird page names via the HTTP API that will later not be accessible from any other MediaWiki page, or by wiki administrators with their online tools, and someone could maliciously create those weird page names to fill in a category or some generated MediaWiki pages that list pages in categories.

Possibly a user could also create a user account with such a weird name and have his user page name inaccessible from standard blocking tools.

And CheckUser admins may have difficulty reading logs and finding the relevant users.

Bawolff added a comment. Via Conduit, Feb 15 2014, 1:47 AM
*** Bug 61407 has been marked as a duplicate of this bug. ***
kaldari added a comment. Via Conduit, Feb 15 2014, 1:57 AM

Philippe, is there any workaround for:
{{#ifeq:{{{1}}}|{{FULLPAGENAME}}|..}}

This is currently broken for https://en.wikipedia.org/wiki/Template:Clickable_button_2 and I haven't come up with any way to fix it. {{URLENCODE}} doesn't always work since URL encoding isn't the same as the escaping that {{FULLPAGENAMEE}} does (apparently).

MZMcBride added a comment. Via Conduit, Feb 15 2014, 1:59 AM

{{urlencode:}} has various options, as I recall. One of them probably works.

Bawolff added a comment. Via Conduit, Feb 15 2014, 2:04 AM

(In reply to Ryan Kaldari from comment #11)
> Philippe, is there any workaround for:
> {{#ifeq:{{{1}}}|{{FULLPAGENAME}}|..}}

{{#ifeq:{{FULLPAGENAME:{{{1}}}}}|{{FULLPAGENAME}}|...}}


/me is working on a proper patch for this bug

Bawolff added a comment. Via Conduit, Feb 15 2014, 2:47 AM

(In reply to Bawolff (Brian Wolff) from comment #13)
> (In reply to Ryan Kaldari from comment #11)
> > Philippe, is there any workaround for:
> > {{#ifeq:{{{1}}}|{{FULLPAGENAME}}|..}}
>
> {{#ifeq:{{FULLPAGENAME:{{{1}}}}}|{{FULLPAGENAME}}|...}}
>
>
> /me is working on a proper patch for this bug

Ok, so these bugs are kind of convoluted. I submitted a fix for bug 35628 (unencode the arguments to #ifeq:). This bug is technically asking for {{PAGENAME}} to not output encoded stuff (whatever happened to "bugs are for problems, not solutions"?), which is not going to happen per comment 2. So closing this as WONTFIX.

Verdy_p added a comment (edited). Via Conduit, Feb 19 2014, 1:57 PM

{{URLENCODE:...}} supports three styles of encoding.

{{PAGENAMEE}} uses the deprecated "WIKI" style; but still with its own differences!

See [[mw:Manual:PAGENAMEE encoding]] for extensive details.
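As a rough illustration of how three encoding styles can differ, the Python sketch below varies only the treatment of spaces. The style names QUERY, WIKI and PATH are an assumption recalled from the MediaWiki help pages, the `urlencode` function is hypothetical, and the exact set of characters each real style leaves unencoded is not reproduced here.

```python
from urllib.parse import quote, quote_plus


def urlencode(text: str, style: str = "QUERY") -> str:
    """Approximate three URL-encoding styles that differ in how spaces are encoded."""
    if style == "QUERY":
        return quote_plus(text)                         # space -> +
    if style == "WIKI":
        return quote(text.replace(" ", "_"), safe="")   # space -> _
    if style == "PATH":
        return quote(text, safe="")                     # space -> %20
    raise ValueError(f"unknown style: {style}")


for style in ("QUERY", "WIKI", "PATH"):
    print(style, urlencode("Q & A", style))
# QUERY Q+%26+A
# WIKI Q_%26_A
# PATH Q%20%26%20A
```

Two strings encoded with different styles never compare equal, which is why the style has to match on both sides of a comparison.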

What a mess !

And yes, Bawolff (Brian Wolff) is correct about the way to fix things when comparing page names: you have to consistently use {{PAGENAME:...}} or {{PAGENAMEE:...}} on *all* source texts to compare with #ifeq: and #switch; otherwise the result is unpredictable due to possible differences in their HTML-encoding (or non-encoding, which is even worse, as this creates possible collisions between distinct names!).

This trick should also continue working after the proposed patch of #ifeq: and #switch to decode HTML entities (in addition to trimming them) in their parameters before comparing strings, even if they continue to return strings with HTML entities.
