
Expose phpCharToUpper map for title normalization via the API
Open, Needs Triage, Public

Description

PHP's mb_strtoupper has some oddities around characters that are supposed to be transformed into characters with a different length. This is covered in T141723#5057472 and T219279.

MediaWiki ships a map of these characters in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/resources/src/mediawiki.Title/phpCharToUpper.json for use in JavaScript. It's also copied into mediawiki-title (JavaScript) at https://github.com/wikimedia/mediawiki-title/blob/master/lib/mediawiki.Title.phpCharToUpper.js, and I have a port to Rust as well: https://gitlab.com/mwbot-rs/mwbot/-/merge_requests/18
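To illustrate why client libraries end up carrying this map, here is a minimal Python sketch of first-letter capitalization with an override table. The names `SAMPLE_OVERRIDES` and `uc_first` are hypothetical, the single entry is illustrative rather than copied from phpCharToUpper.json, and the real file's value encoding may differ.

```python
# Hypothetical excerpt of an override table: maps a character to the
# uppercase form the server is assumed to produce for it. This entry is
# illustrative only and is not copied from phpCharToUpper.json.
SAMPLE_OVERRIDES = {
    "ß": "ß",  # assumed: server keeps it as-is, while Python yields "SS"
}


def uc_first(title: str, overrides: dict[str, str]) -> str:
    """Capitalize the first character of a title.

    The override table is consulted first, so the result follows the
    server's behaviour instead of the local language's Unicode tables.
    """
    if not title:
        return title
    first, rest = title[0], title[1:]
    # Fall back to the local uppercase mapping when there is no override.
    return overrides.get(first, first.upper()) + rest


if __name__ == "__main__":
    print(uc_first("ßeta", SAMPLE_OVERRIDES))  # follows the override
    print(uc_first("ßeta", {}))                # local mapping changes length
```

Without the table, the local mapping turns one character into two, so the client would compute a different title than the server. That mismatch is what each library currently avoids by shipping its own copy of the map.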

This should be exposed via the API so external libraries don't have to copy the map, preferably in action=query&meta=siteinfo, because that already has all the other information needed to normalize and validate titles. It should also dump $wgOverrideUcfirstCharacters, if it's set.

Ideally this map would be generated based on the ICU/Unicode version that the server is using, rather than being whatever Wikimedia production uses and is checked into core, but that's a separate issue.