Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11
Closed, ResolvedPublic

Description

PHP's Unicode capitalization functions changed behaviour between PHP 7.2 and PHP 7.4 due to PHP's migration to Unicode 11.

Using PHP 7.4's mb_strtoupper() to capitalize the first letter of an article title turns out to be inadvisable, most notably because it would map Georgian characters to their Mtavruli equivalents. Mtavruli is not used for the first character of a word in Georgian; rather, it is used for emphasis, like italics.

Unicode's concept of "title case" is much closer to what we want. It doesn't map the Georgian characters, and it maps ligatures in an appropriate way for first-letter capitalization, for example dž becomes ǅ instead of Ǆ. So, we will use that instead.

Title case would map ß to Ss, which breaks some existing Wikipedia articles and user names without any apparent benefit, so we'll permanently override that so that ß can continue to be used as the first character of a page title.
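
As a rough illustration of the difference between upper-casing and title-casing the first character, here is a minimal Python sketch. It uses Python's built-in Unicode tables rather than PHP's mbstring functions, so details may differ from what MediaWiki produces. The helper names `ucfirst_upper` and `ucfirst_title` and the `OVERRIDES` table are hypothetical and are not the actual Language::ucfirst() implementation.

```python
# Hypothetical override table: keep ß unchanged instead of title-casing it to "Ss".
OVERRIDES = {"ß": "ß"}


def ucfirst_upper(title: str) -> str:
    """Upper-case only the first character (the behaviour being replaced)."""
    return title[:1].upper() + title[1:]


def ucfirst_title(title: str, overrides: dict = OVERRIDES) -> str:
    """Title-case only the first character, honouring the override table."""
    if not title:
        return title
    first, rest = title[0], title[1:]
    if first in overrides:
        return overrides[first] + rest
    # On a single character, str.title() applies the Unicode title case mapping.
    return first.title() + rest


# \u01c6 is the "dž" digraph letter, \ufb01 is the "fi" ligature.
for sample in ["\u01c6em", "ßeta", "\ufb01sh"]:
    print(repr(sample),
          "upper:", repr(ucfirst_upper(sample)),
          "title:", repr(ucfirst_title(sample)))
```

With upper case, the digraph and the ligature expand to all-capital forms and ß becomes SS. With title case plus the override, only the first letter of the digraph or ligature is capitalized and ß stays as it is.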

Migration plan:

  • Deploy a backwards-compatible override, so that PHP 7.2 capitalization is used despite PHP 7.4 being fully deployed.
  • Run foreachwiki uppercaseTitlesForUnicodeTransition.php --charmap ucfirst-72-to-title.php --userlist /tmp/user_renames.txt --suffix ' (technical rename)' where ucfirst-72-to-title.php is P35451.
  • Provide the community with a list of pages which will be renamed. Most affected pages should be deleted rather than automatically renamed.
  • Notify users who will be renamed.
  • Wait for a week.
  • Rerun uppercaseTitlesForUnicodeTransition.php, then rename users: mwscript extensions/WikimediaMaintenance/renameInvalidUsernames.php --wiki metawiki --list /tmp/user_renames.txt
  • Wait a while for global renames to take effect
  • Rerun uppercaseTitlesForUnicodeTransition.php with the --run option
  • Deploy the new override map gerrit 842243. This will prevent further creation of pages or users with initial lowercase letters.

Event Timeline

I think it's reasonable to run the rename only once we've fully migrated to PHP 7.4. We will need to install the conversion map anyway, so that those pages and usernames are still reachable before we actually start sending traffic to PHP 7.4.

I also seem to remember that we need a similar conversion table for JS -> PHP for VE. @Esanders, can you confirm the details?

Not just VE, but mw.Title.js in general has a phpCharToUpper method which is generated by a maintenance script (GeneratePhpCharToUpperMappings) that outputs the result of Language::ucfirst.

Change 811875 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Add ucfirst overrides for PHP 7.2 -> 7.4

https://gerrit.wikimedia.org/r/811875

Change 811875 merged by jenkins-bot:

[operations/mediawiki-config@master] Add ucfirst overrides for the PHP 7.4 migration

https://gerrit.wikimedia.org/r/811875

Before deployment of 811875

$ PHP=php7.4 mwscript eval.php --wiki=enwiki
> print Language::factory('en')->ucfirst('ß');
SS

After deployment

$ PHP=php7.4 mwscript eval.php --wiki=enwiki
> print Language::factory('en')->ucfirst('ß');
ß
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Given we're doing the final parts of this after we've fully migrated, I've switched the task dependency around.

The migration is now complete. We should proceed with running the necessary scripts to rename users and remove the conversion table.

I'm trying a dry run. But the proposed suffix "former Unicode character" seems weird. I think I would prefer something more technical and technically correct.

Some people are going to want to choose their own new name, for example all the users who used ß as a stylized B (ßlackHeart etc.). So we should do a notification round and then wait for manual rename requests.

There are 11,584 global users to be renamed, which is a lot. Most of them are in the Georgian script. The change is apparently not desirable there and may need to be permanently overridden. Wikipedia says "Nowadays, Mtavruli is typically used in all-caps text in titles or to emphasize a word, though in the late 19th and early 20th centuries it was occasionally used, as in Latin and Cyrillic scripts, to capitalize proper nouns or the first word of a sentence." Using these letters for title case is not correct.

The script would rename 306,244 out of 462,201 pages on kawiki. I'm pretty sure we shouldn't go ahead with that.

Log sample:

kawiki:  Would rename მანდურაჰი → Მანდურაჰი
kawiki:  Would rename მონე, კლოდ → Მონე, კლოდ
kawiki:  Would rename მონების დიდი ტბა → Მონების დიდი ტბა
kawiki:  Would rename მონების მდინარე → Მონების მდინარე

In Georgian we don't have lowercase and uppercase. Mtavruli and Mkhedruli are not mixed: text is written either entirely in Mkhedruli or entirely in Mtavruli. The modern Georgian language uses only Mkhedruli. Mtavruli is one of the types of Mkhedruli. Please stop discussing Georgian.

Change 842019 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Migrate to PHP 7.4 case mapping, but retain Georgian overrides

https://gerrit.wikimedia.org/r/842019

I removed the Georgian characters from Pchelolo's case map, and I'm running the dry run script again.

That reduced the number of user renames to 441, of which 370 are for decorative Eszett (ß) characters (ßlackHeart, ßrandon, etc.). The number of page renames was reduced to 912, of which 124 begin with Eszett.

We should consider permanently overriding Eszett. Most wikis have an article named for the character, e.g. https://en.wikipedia.org/wiki/%C3%9F, and renaming it to [[SS]] would be unhelpful and it usually conflicts with an existing article. The only other usage of ß in the first character of main namespace page titles is as a typo correction for beta (β), for example https://en.wikipedia.org/wiki/%C3%9F_caroten . In no case is the transformation ß -> SS actually helpful, since like Georgian, it is not a valid title casing transformation.

I added the Eszett override to the proposed Gerrit change.

Really, all the ligatures are broken, not just ß. ucfirst is trying to produce title case; we don't literally want to convert the first character to upper case. For example, the ligature ﬁ (U+FB01) should become Fi, not FI. Now that we have PHP 7.4 we could use mb_convert_case() with MB_CASE_TITLE. That would solve Georgian without the need for an override table. It also fixes incorrect mappings of Greek iota subscripts, e.g. ᾀ has title case mapping ᾈ and upper case mapping ἈΙ. But it would produce ß -> Ss, which we would presumably override.

P35451 is the charmap for uppercaseTitlesForUnicodeTransition.php which would prepare the site for a transition to MB_CASE_TITLE excluding ß -> Ss. It's the difference between the PHP 7.2 upper case table and the PHP 7.4 title case table.
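
To make the "difference between two tables" idea concrete, here is a small Python sketch. The two dictionaries are tiny hand-written stand-ins, not the real PHP 7.2 or PHP 7.4 tables, and `build_charmap` is a hypothetical helper, not the code that produced P35451.

```python
def build_charmap(old_map: dict, new_map: dict) -> dict:
    """Return {char: new_result} for every character whose mapping changes."""
    changed = {}
    for char in sorted(set(old_map) | set(new_map)):
        old = old_map.get(char, char)  # unmapped characters map to themselves
        new = new_map.get(char, char)
        if old != new:
            changed[char] = new
    return changed


# Illustrative stand-ins only: an "old upper case" table and a
# "new title case" table in which the ß override is retained.
old_upper = {"a": "A", "\u01c6": "\u01c4", "ß": "ß"}
new_title = {"a": "A", "\u01c6": "\u01c5", "ß": "ß"}

charmap = build_charmap(old_upper, new_title)
print(charmap)  # only the entries that differ between the two tables
```

Only characters whose result differs end up in the charmap, so pages and users whose first character is unaffected are never touched by the rename script.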

Change 842028 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] In Language::ucfirst(), use title case instead of upper case

https://gerrit.wikimedia.org/r/842028

Change 842030 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Maintenance script updates to support ucfirst() title case

https://gerrit.wikimedia.org/r/842030

Change 842019 abandoned by Tim Starling:

[operations/mediawiki-config@master] Migrate to PHP 7.4 case mapping, but retain Georgian and Eszett overrides

Reason:

I'm going to use title case instead

https://gerrit.wikimedia.org/r/842019

Change 842242 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Remove PHP 7.4 version check and prepare for title case

https://gerrit.wikimedia.org/r/842242

Change 842243 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Migrate to PHP 7.4 title case mapping, but retain Eszett override

https://gerrit.wikimedia.org/r/842243

Change 842030 merged by jenkins-bot:

[mediawiki/core@master] Maintenance script updates to support ucfirst() title case

https://gerrit.wikimedia.org/r/842030

Change 842242 merged by jenkins-bot:

[operations/mediawiki-config@master] Remove PHP 7.4 version check and prepare for title case

https://gerrit.wikimedia.org/r/842242

Mentioned in SAL (#wikimedia-operations) [2022-10-14T01:20:48Z] <tstarling@deploy1002> Synchronized wmf-config/UcfirstOverrides.php: for T292552, should have no effect at this stage (duration: 03m 46s)

tstarling renamed this task from Rename articles and users to prepare for PHP 7.3 unicode changes to Rename articles and users to prepare for PHP 7.4 unicode changes.Oct 14 2022, 2:12 AM
tstarling renamed this task from Rename articles and users to prepare for PHP 7.4 unicode changes to Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11.
tstarling updated the task description. (Show Details)

The list of affected pages and users is at https://meta.wikimedia.org/wiki/Unicode_11_case_map_migration

For the user accounts, I've added public links to global account info (in addition to the private rename action link that was already there). I went through them and most are blocked or deleted accounts, or accounts with zero non-deleted edits. I identified 8 accounts as possibly active, and listed that subset in a new section for easy reference.

I notified those 8 users via their user talk pages. I also sent ⓝⓘⓙⓦⓜ an email.

Change 843588 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] [DNM] Quick hack for [[m:Unicode 11 case map migration]]

https://gerrit.wikimedia.org/r/843588

Mentioned in SAL (#wikimedia-operations) [2022-10-24T23:00:25Z] <TimStarling> on mwmaint1002 running renameInvalidUsernames.php for T292552

The script is still running.

One problem is that it tries to do a global rename on every local account, with no deduplication. We have 275 local accounts for ⓝⓘⓙⓦⓜ, and for each of these accounts, it also tries to rename the other 274.

The user list text file doesn't specify the old name, just the old ID, so after the first global rename has succeeded, it ends up trying to rename the user to itself, Ⓝⓘⓙⓦⓜ -> Ⓝⓘⓙⓦⓜ.

The no-op LocalRenameUser jobs proceed anyway, performing write queries. The script waits for all the jobs to complete.

I deduplicated the user rename list by destination name, although this misses some users due to different users being renamed to the same name.

I resolved some users whose normalized name was taken, by prefixing with ~. Then I ran the script again.
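
The deduplication problem can be sketched roughly as follows. This is a hypothetical Python illustration, not the script that was run: it assumes each row carries an old name and a new name (the real list only had user IDs), and `dedupe_renames` is an invented helper.

```python
from collections import defaultdict


def dedupe_renames(rows):
    """Collapse repeated (old_name, new_name) rows and flag name collisions.

    Returns (unique, collisions): `unique` lists renames whose destination is
    claimed by exactly one old name; `collisions` maps each contested
    destination to the distinct old names that want it.
    """
    olds_by_new = defaultdict(list)
    for old_name, new_name in rows:
        if old_name not in olds_by_new[new_name]:
            olds_by_new[new_name].append(old_name)

    unique, collisions = [], {}
    for new_name, olds in olds_by_new.items():
        if len(olds) == 1:
            unique.append((olds[0], new_name))
        else:
            collisions[new_name] = olds
    return unique, collisions


# Placeholder names: "a1" appears once per local account, "a2" is a different
# user whose normalized name collides with it.
rows = [("a1", "A"), ("a1", "A"), ("a2", "A"), ("b1", "B")]
unique, collisions = dedupe_renames(rows)
print(unique)
print(collisions)
```

Deduplicating by destination alone would silently drop "a2" here. Tracking the old names keeps the collision visible so it can be resolved by hand, for example with a ~ prefix.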

Many jobs failed with an exception like Tried to promote '~ff' to a global account except it doesn't exist locally. This is apparently because the script uses the promotetoglobal job parameter when it should have used reattach. The affected accounts were probably detached by a previous run of the script.

I ran fixStuckGlobalRename.php for enwiki ﬅ -> ~ﬅ. But this left the account unattached. There is a remaining local account on viwiki and gu_name still has the old name.

I reinserted the 14 ⓝⓘⓙⓦⓜ -> Ⓝⓘⓙⓦⓜ rows into renameuser_status that I previously deleted, and I ran fixStuckGlobalRename.php for all of them. Now there are no local ⓝⓘⓙⓦⓜ accounts remaining, but there are still 239 unattached Ⓝⓘⓙⓦⓜ accounts.

I used attachAccount.php to reattach the remaining Ⓝⓘⓙⓦⓜ accounts.

Trying to clean up the remaining stuck renames with

sql centralauth -- -B -e "select concat('mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=', ru_wiki, ' ',ru_oldname,' ',ru_newname,' --logwiki=metawiki') from renameuser_status"
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki Ǉ ~Lj --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki Ǉ no. Ƽ ~Lj no. Ƽ --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=cswiki ẗercasek T̈ercasek --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=dewiki ὒλη Υ̓̀λη --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki ⅽⅽⅽⅽombobreaker! Ⅽⅽⅽⅽombobreaker! --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki ⓖⓑⓀⓒⓣⓝⓐⓨⓐ ⓕⓔⓔⓗⓖⓐ ⒼⓑⓀⓒⓣⓝⓐⓨⓐ ⓕⓔⓔⓗⓖⓐ --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki ⓗⓐⓖⓖⓔⓑ? Ⓗⓐⓖⓖⓔⓑ? --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki ⓙⓗⓐⓨ-β“‘ Ⓙⓗⓐⓨ-β“‘ --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki ⓛⓞⓛⓟⓔⓝⓘⓢ Ⓛⓞⓛⓟⓔⓝⓘⓢ --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki β“œβ“β“’β““β“™β“˜β““β“§β““β“–β“₯ β“‚β“β“’β““β“™β“˜β““β“§β““β“–β“₯ --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki β“£β“β“β“™β“˜β“œβ“β“–β“ž β“‰β“β“β“™β“˜β“œβ“β“–β“ž --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki ff ~ff --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki fi ~fi --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki fireworkeaterr Fireworkeaterr --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki fl ~fl --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki ffi ~ffi --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki ffl ~ffl --logwiki=metawiki
mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki ﬅinky ~ﬅinky --logwiki=metawiki
  • Ǉ: done with Special:GlobalRenameUser
  • Ǉ no. Ƽ: renamed to ~~Lj no. Ƽ since there was already a ~Lj no. Ƽ on enwiki, not sure where it came from
  • ẗercasek: I ran fixStuckGlobalRename.php, but this only renamed the local account. I renamed the global account with Special:GlobalRenameUser, and then it showed up as admin attached.
  • ὒλη: renamed with Special:GlobalRenameUser
  • ⅽⅽⅽⅽombobreaker!: deleted renameuser_status row, wiped cache with eval.php and renamed with Special:GlobalRenameUser
  • ⓖⓑⓀⓒⓣⓝⓐⓨⓐ ⓕⓔⓔⓗⓖⓐ: ditto
  • ⓗⓐⓖⓖⓔⓑ?: ditto
  • ⓙⓗⓐⓨ-β“‘: ditto
  • ⓛⓞⓛⓟⓔⓝⓘⓢ: ditto
  • β“œβ“β“’β““β“™β“˜β““β“§β““β“–β“₯: ditto
  • β“£β“β“β“™β“˜β“œβ“β“–β“ž: ditto
  • ff: ditto
  • fi: ditto
  • fireworkeaterr: ditto
  • fl: ditto
  • ffi: ditto
  • ffl: ditto
  • ﬅinky: ditto

Note that fixStuckGlobalRename.php was generally not appropriate because the global user had not been renamed. The whole operation had to be restarted, not just one job.

  • ῦΰῑΏ, β…΅5lasevilsion, ⅹシンバⅹ, β“•β“€β“’β“š β“¨β“žβ“€, β“˜β“’β“‘β“”β“£β“— β“Ÿβ“”β“‘β“”β“©, ⓙⓀⓐⓝ β“œ β“₯β“”β“‘β“£β“”β“›, β“›β“˜β“β“- -ⓒⓐⓓⓔⓝⓐ, fiamma86, fiammettina, filomena85, fittesaft, st~rowiki: manually renamed
  • st, ο¬…: manually renamed

I reviewed the list of proposed page moves. In some cases a redirect would have been moved to an inappropriate title, such as a redirect to the article about ligatures being moved to a title that is no longer a ligature. So I deleted those redirects. The list is at https://meta.wikimedia.org/wiki/Unicode_11_case_map_migration#Manual_deletes .

Change 842028 merged by jenkins-bot:

[mediawiki/core@master] In Language::ucfirst(), use title case instead of upper case

https://gerrit.wikimedia.org/r/842028

Change 849670 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@wmf/1.40.0-wmf.6] In Language::ucfirst(), use title case instead of upper case

https://gerrit.wikimedia.org/r/849670

Change 849671 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@wmf/1.40.0-wmf.7] In Language::ucfirst(), use title case instead of upper case

https://gerrit.wikimedia.org/r/849671

The script is now running with --run, i.e. the automatic renames and deletes are in progress.

Change 849670 merged by jenkins-bot:

[mediawiki/core@wmf/1.40.0-wmf.6] In Language::ucfirst(), use title case instead of upper case

https://gerrit.wikimedia.org/r/849670

Change 849671 merged by jenkins-bot:

[mediawiki/core@wmf/1.40.0-wmf.7] In Language::ucfirst(), use title case instead of upper case

https://gerrit.wikimedia.org/r/849671

Change 849724 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Temporary identity mappings for title case ligatures

https://gerrit.wikimedia.org/r/849724

Mentioned in SAL (#wikimedia-operations) [2022-10-27T02:06:51Z] <tstarling@deploy1002> Synchronized php-1.40.0-wmf.6/includes/language/Language.php: T292552 (duration: 03m 40s)

Mentioned in SAL (#wikimedia-operations) [2022-10-27T02:10:30Z] <tstarling@deploy1002> Synchronized php-1.40.0-wmf.7/includes/language/Language.php: T292552 (duration: 03m 39s)

Change 849724 merged by jenkins-bot:

[operations/mediawiki-config@master] Temporary identity mappings for title case ligatures

https://gerrit.wikimedia.org/r/849724

Mentioned in SAL (#wikimedia-operations) [2022-10-27T02:30:57Z] <tstarling@deploy1002> Synchronized wmf-config/UcfirstOverrides.php: T292552 allow title case ligatures (duration: 03m 36s)

Change 842243 merged by jenkins-bot:

[operations/mediawiki-config@master] Migrate to PHP 7.4 title case mapping, but retain Eszett override

https://gerrit.wikimedia.org/r/842243

Mentioned in SAL (#wikimedia-operations) [2022-10-27T02:56:11Z] <tstarling@deploy1002> Synchronized wmf-config/UcfirstOverrides.php: T292552 final configuration (duration: 03m 54s)

tstarling claimed this task.
tstarling updated the task description. (Show Details)
tstarling updated the task description. (Show Details)

Change 843588 abandoned by Tim Starling:

[mediawiki/core@master] [DNM] Quick hack for [[m:Unicode 11 case map migration]]

Reason:

https://gerrit.wikimedia.org/r/843588

Well, the script has already done the renames, and there are a couple of bugs.

  1. I can't remove renaming log entries from the watchlist by any way I know. It just stuck there. What can I do?
  2. Open this. Click on the last one, "Ϊ́". You arrive here. There is no page there.

Thank you.

  1. I can't remove renaming log entries from the watchlist by any way I know. It just stuck there. What can I do?

Can you be more specific? It's unclear to me whether you are referring to watchlist entries (watchlist rows) or changes (recentchanges rows).

  2. Open this. Click on the last one, "Ϊ́". You arrive here. There is no page there.

The problem here is that the hewiki page with page_id 1467608 has a title which is not in NFC form. The title is U+0399 U+0308 U+0301 but the NFC form is U+03AA U+0301.
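
The normalization mismatch can be checked with a short Python sketch using the standard `unicodedata` module. The code points come from the comment above; the `codepoints` helper is just an illustrative formatter.

```python
import unicodedata


def codepoints(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Title as stored: capital iota + combining diaeresis + combining acute.
stored = "\u0399\u0308\u0301"
nfc = unicodedata.normalize("NFC", stored)

print("stored:", codepoints(stored))
print("NFC:   ", codepoints(nfc))
print("already NFC?", stored == nfc)
```

Because the stored title and its NFC form are different strings, a link built from the normalized form does not match the row in the page table.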

Well, it's too late. The problem was never fixed; it just expired, since watchlist entries are removed after one month.