$wgLegalTitleChars is hard to use outside PHP
Open, Needs TriagePublic
Actions

Assigned To

None

Authored By

	Legoktm
	Dec 9 2021, 5:33 AM

Description

$wgLegalTitleChars via Title::legalChars() is exposed in action=query&meta=siteinfo but because it's a regex class designed for PHP, it's really hard to use outside of PHP.

There's a significant amount of workarounds for this, including Title::convertByteClassToUnicodeClass() in PHP and the same in JavaScript: https://github.com/wikimedia/mediawiki-title/blob/master/lib/utils.js

I'm currently working on a Rust equivalent to mediawiki-title and this is a problem. It would be easier if instead of using regexes, they were referenced by some abstraction that each client could transform. For example:

$wgLegalTitleCharacters = [
	'characters' => " %!\"$&'()*,-.:;=?@^_`~",
	// +
	'plus' => true,
	// / and \
	'slashes' => true,
	// A-z, 0-9
	'alphanumeric' => true,
	// \x80-\xFF+
	'non-ascii' => true,
];

Details

Subject	Repo	Branch	Lines +/-
Make set of legal title characters easier to reuse outside PHP	mediawiki/core	master	+77 -9
Deprecate $wgLegalTitleChars and $wgIllegalFileChars	mediawiki/core	master	+21 -0
Replace $wgLegalTitleChars with Title::legalChars()	mediawiki/extensions/Interwiki	master	+2 -2
Don't use $wgLegalTitleChars directly	mediawiki/extensions/Translate	master	+4 -6

Customize query in gerrit

Related Objects

Mentioned In: T343047: mediawiki-title package (likely) does not support $wgIllegalFileChars
T343046: Parsoid's title library does not support $wgIllegalFileChars
P18116 Rust in my MediaWiki? no way

Event Timeline

Legoktm created this task.Dec 9 2021, 5:33 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 9 2021, 5:33 AM

Change 745365 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/extensions/Translate@master] Don't use $wgLegalTitleChars directly

https://gerrit.wikimedia.org/r/745365

gerritbot added a project: Patch-For-Review.Dec 9 2021, 5:35 AM

Change 745386 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/core@master] Make set of legal title characters easier to reuse outside PHP

https://gerrit.wikimedia.org/r/745386

Legoktm updated the task description. (Show Details)Dec 9 2021, 6:57 AM

Change 745365 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Don't use $wgLegalTitleChars directly

https://gerrit.wikimedia.org/r/745365

ReleaseTaggerBot added a project: MW-1.38-notes (1.38.0-wmf.13; 2021-12-13).Dec 9 2021, 1:00 PM

Change 745553 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/extensions/Interwiki@master] Replace $wgLegalTitleChars with Title::legalChars()

https://gerrit.wikimedia.org/r/745553

Erutuon subscribed.Dec 9 2021, 11:25 PM

Zabe subscribed.Dec 10 2021, 1:09 AM

Change 745553 merged by jenkins-bot:

[mediawiki/extensions/Interwiki@master] Replace $wgLegalTitleChars with Title::legalChars()

https://gerrit.wikimedia.org/r/745553

My proposed patch has at least one breaking change, and we might break that even further, so I've emailed wikitech-l per policy: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/ASODV6622T4YUAY3JO5ZVBL3B5ZQDX2U/

Legoktm mentioned this in P18116 Rust in my MediaWiki? no way.Dec 13 2021, 7:04 AM

I very much support the idea of getting rid of language-specific details in configuration. However:

What's the chance anybody will ever disallow basic alphanumeric characters? And even if, isn't a TitleBlacklist filter much better then? It feels like we should consider this hard-coded and actually do this: hard-code 0-9A-Za-z\x80-\xFF.
- This alone might solve your problem as the only remaining characters are just that: characters. No need for ranges any more.
It also feels like it's much easier to specify disallowed characters. That would be \x00-\x1F<>[]{|}.
- As above I think we should simply hard-code the \x00-\x1F range and only make <>[]{|} configuration. That's effectively what all this means: allow all valid Unicode characters except these.

Specifically responding to the proposed code:

Why is the plus not part of characters?
Can we separate forward and backward slash? I can think of situations where one is legal but the other isn't.
I like the simple definition of "alphanumeric". Goes well with e.g. ctype_alnum().
I find non-ascii ambiguous.
- Does it translate to 8-bit \x80-\xFF or Unicode \u0080-\uFFFF?
- As far as I understand \x00-\x1F is not strictly specified in the ASCII standard. ASCII just reserves the range. Does this mean it's included in "non-ascii"?
- Is \x7F included or not?

@thiemowmde: ASCII is defined by Unicode to include code points U+0000-U+007F, and so here non-ASCII is referring to code points U+0080 and above.

\x80-\xFF in the comment in the patch is not very clear. [\x80-\xFF]+ in PHP matches bytes 0x80-0xFF and therefore matches the UTF-8 encodings of code points U+0080-U+10FFFF. But it has a different meaning in JavaScript (RegExp) and Python (re), where only matches code points U+0080-U+00FF. So it would be better if the comment said U+0080-U+10FFFF.

I agree that the list of invalid characters is much easier to understand. \x00-\x1F is clearly control characters (at least to programmers), and <>[]{|} is clearly (for people who write wikitext) characters that have special meanings in HTML and wikitext, whereas the list of valid ASCII punctuation characters looks random.

In principle an allow list is safer than a denylist because if the outside environment changes you don't need to update stuff to block things, you only need to make changes to enable new things. But given that we allow nearly all Unicode characters, which keeps adding things we presumably should be blocking that's not really working. So with that consideration out of the way, I think just keeping it as an allowlist is just easier because it's one less thing to change.

In T297340#7566345, @thiemowmde wrote:

Specifically responding to the proposed code:

Why is the plus not part of characters?

Because of the comment which says In some rare cases you may wish to remove + for compatibility with old links.

So the primary use cases are that you will either want to add a new character, or remove +. Both are straightforward now:

$wgLegalTitleCharacters['characters'] .= '?';
$wgLegalTitleCharacters['plus'] = false;

Can we separate forward and backward slash? I can think of situations where one is legal but the other isn't.

My idea for a followup after looking at the use in extensions is where people are adjusting the regex and providing a better interface for it. Example:

$cfg = Title::legalCharacters();

$cfg['characters'] = str_replace( '/', '', $cfg['characters'] );
$regex = Title::buildLegalRegex( $cfg );

Is that what you meant or something else?

Change 942710 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Deprecate $wgLegalTitleChars and $wgIllegalFileChars

https://gerrit.wikimedia.org/r/942710

cscott mentioned this in T343046: Parsoid's title library does not support $wgIllegalFileChars.Jul 28 2023, 8:50 PM

cscott mentioned this in T343047: mediawiki-title package (likely) does not support $wgIllegalFileChars.Jul 28 2023, 8:54 PM

Change 942710 merged by jenkins-bot: