Page MenuHomePhabricator

$wgLegalTitleChars is hard to use outside PHP
Open, Needs TriagePublic

Description

$wgLegalTitleChars via Title::legalChars() is exposed in action=query&meta=siteinfo but because it's a regex class designed for PHP, it's really hard to use outside of PHP.

There's a significant amount of workarounds for this, including Title::convertByteClassToUnicodeClass() in PHP and the same in JavaScript: https://github.com/wikimedia/mediawiki-title/blob/master/lib/utils.js

I'm currently working on a Rust equivalent to mediawiki-title and this is a problem. It would be easier if instead of using regexes, they were referenced by some abstraction that each client could transform. For example:

$wgLegalTitleCharacters = [
	'characters' => " %!\"$&'()*,-.:;=?@^_`~",
	// +
	'plus' => true,
	// / and \
	'slashes' => true,
	// A-z, 0-9
	'alphanumeric' => true,
	// \x80-\xFF+
	'non-ascii' => true,
];

Event Timeline

Change 745365 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/Translate@master] Don't use $wgLegalTitleChars directly

https://gerrit.wikimedia.org/r/745365

Change 745386 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/core@master] Make set of legal title characters easier to reuse outside PHP

https://gerrit.wikimedia.org/r/745386

Change 745365 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Don't use $wgLegalTitleChars directly

https://gerrit.wikimedia.org/r/745365

Change 745553 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/extensions/Interwiki@master] Replace $wgLegalTitleChars with Title::legalChars()

https://gerrit.wikimedia.org/r/745553

Change 745553 merged by jenkins-bot:

[mediawiki/extensions/Interwiki@master] Replace $wgLegalTitleChars with Title::legalChars()

https://gerrit.wikimedia.org/r/745553

My proposed patch has at least one breaking change, and we might break that even further, so I've emailed wikitech-l per policy: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/ASODV6622T4YUAY3JO5ZVBL3B5ZQDX2U/

I very much support the idea of getting rid of language-specific details in configuration. However:

  • What's the chance anybody will ever disallow basic alphanumeric characters? And even if, isn't a TitleBlacklist filter much better then? It feels like we should consider this hard-coded and actually do this: hard-code 0-9A-Za-z\x80-\xFF.
    • This alone might solve your problem as the only remaining characters are just that: characters. No need for ranges any more.
  • It also feels like it's much easier to specify disallowed characters. That would be \x00-\x1F<>[]{|}.
    • As above I think we should simply hard-code the \x00-\x1F range and only make <>[]{|} configuration. That's effectively what all this means: allow all valid Unicode characters except these.

Specifically responding to the proposed code:

  • Why is the plus not part of characters?
  • Can we separate forward and backward slash? I can think of situations where one is legal but the other isn't.
  • I like the simple definition of "alphanumeric". Goes well with e.g. ctype_alnum().
  • I find non-ascii ambiguous.
    • Does it translate to 8-bit \x80-\xFF or Unicode \u0080-\uFFFF?
    • As far as I understand \x00-\x1F is not strictly specified in the ASCII standard. ASCII just reserves the range. Does this mean it's included in "non-ascii"?
    • Is \x7F included or not?

@thiemowmde: ASCII is defined by Unicode to include code points U+0000-U+007F, and so here non-ASCII is referring to code points U+0080 and above.

\x80-\xFF in the comment in the patch is not very clear. [\x80-\xFF]+ in PHP matches bytes 0x80-0xFF and therefore matches the UTF-8 encodings of code points U+0080-U+10FFFF. But it has a different meaning in JavaScript (RegExp) and Python (re), where only matches code points U+0080-U+00FF. So it would be better if the comment said U+0080-U+10FFFF.

I agree that the list of invalid characters is much easier to understand. \x00-\x1F is clearly control characters (at least to programmers), and <>[]{|} is clearly (for people who write wikitext) characters that have special meanings in HTML and wikitext, whereas the list of valid ASCII punctuation characters looks random.

In principle an allow list is safer than a denylist because if the outside environment changes you don't need to update stuff to block things, you only need to make changes to enable new things. But given that we allow nearly all Unicode characters, which keeps adding things we presumably should be blocking that's not really working. So with that consideration out of the way, I think just keeping it as an allowlist is just easier because it's one less thing to change.

Specifically responding to the proposed code:

  • Why is the plus not part of characters?

Because of the comment which says In some rare cases you may wish to remove + for compatibility with old links.

So the primary use cases are that you will either want to add a new character, or remove +. Both are straightforward now:

$wgLegalTitleCharacters['characters'] .= '?';
$wgLegalTitleCharacters['plus'] = false;
  • Can we separate forward and backward slash? I can think of situations where one is legal but the other isn't.

My idea for a followup after looking at the use in extensions is where people are adjusting the regex and providing a better interface for it. Example:

$cfg = Title::legalCharacters();

$cfg['characters'] = str_replace( '/', '', $cfg['characters'] );
$regex = Title::buildLegalRegex( $cfg );

Is that what you meant or something else?

Change 942710 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Deprecate $wgLegalTitleChars and $wgIllegalFileChars

https://gerrit.wikimedia.org/r/942710

Change 942710 merged by jenkins-bot:

[mediawiki/core@master] Deprecate $wgLegalTitleChars and $wgIllegalFileChars

https://gerrit.wikimedia.org/r/942710