Some pages will become completely unreachable after PHP7 update due to Unicode changes
Open, HighPublic

Description

As detailed in T141723#5057472, mb_strtoupper, which we use to normalise titles, changes slightly in PHP7 with the Unicode update. As a result, certain titles will have their normalised forms changed, and will therefore be unreachable if nothing is done.

For example, https://en.wikipedia.org/w/index.php?title=%C7%85&redirect=no takes you to article ID 7074938 in PHP5 HHVM, but if you enable the PHP7 beta feature, it takes you to 7074928, and the old article is now completely inaccessible.

Here are the changes (a removed line means the right-hand side is no longer the result of mb_strtoupper; an added line means the right-hand side is a new result of mb_strtoupper):

--- a/resources/src/mediawiki.Title/phpCharToUpper.js
+++ b/resources/src/mediawiki.Title/phpCharToUpper.js
@@ -6,15 +6,8 @@
 	var toUpperMapping = {
 		'ß': 'ß',
 		'ʼn': 'ʼn',
-		'Dž': 'Dž',
-		'dž': 'Dž',
-		'Lj': 'Lj',
-		'lj': 'Lj',
-		'Nj': 'Nj',
-		'nj': 'Nj',
 		'ǰ': 'ǰ',
-		'Dz': 'Dz',
-		'dz': 'Dz',
+		'ɪ': 'Ɪ',
 		'ʝ': 'Ʝ',
 		'ͅ': 'ͅ',
 		'ΐ': 'ΐ',
@@ -26,6 +19,15 @@
 		'ᏻ': 'Ᏻ',
 		'ᏼ': 'Ᏼ',
 		'ᏽ': 'Ᏽ',
+		'ᲀ': 'В',
+		'ᲁ': 'Д',
+		'ᲂ': 'О',
+		'ᲃ': 'С',
+		'ᲄ': 'Т',
+		'ᲅ': 'Т',
+		'ᲆ': 'Ъ',
+		'ᲇ': 'Ѣ',
+		'ᲈ': 'Ꙋ',
 		'ẖ': 'ẖ',
 		'ẗ': 'ẗ',
 		'ẘ': 'ẘ',
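
To make the failure mode concrete, here is a minimal sketch of why a changed uppercase mapping makes a page unreachable. It assumes the page table is keyed by the normalised title and that the page served under PHP7 sits at the U+01C4 form; `pagesByTitle`, `legacyUpper`, `currentUpper` and `lookup` are hypothetical names, not MediaWiki code. The page IDs are the ones quoted above.

```javascript
// Illustrative stand-in for the page table, keyed by normalised title.
const pagesByTitle = new Map([
	['\u01C5', 7074938], // titlecase digraph, the form the old mapping produced
	['\u01C4', 7074928], // uppercase digraph, assumed to be the form the new mapping produces
]);

// Hypothetical stand-in for the old mb_strtoupper result:
// U+01C5 and U+01C6 both map to U+01C5 (the removed lines in the diff).
function legacyUpper(chr) {
	const legacy = { '\u01C5': '\u01C5', '\u01C6': '\u01C5' };
	return legacy[chr] || chr.toUpperCase();
}

// Stand-in for the new behaviour: plain Unicode uppercasing.
function currentUpper(chr) {
	return chr.toUpperCase();
}

// Normalise the requested title, then use the result as the lookup key.
function lookup(requested, upper) {
	return pagesByTitle.get(upper(requested));
}

console.log(lookup('\u01C5', legacyUpper));  // 7074938
console.log(lookup('\u01C5', currentUpper)); // 7074928
```

Under the new mapping no request normalises to U+01C5 any more, so the row stored under that key can never be selected.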

Event Timeline

kchapman added a subscriber: kchapman.

@Joe did you get what you needed from CPT in IRC? If not please feel free to move back to our Inbox.

Joe added a comment.Apr 9 2019, 6:50 AM

So for the record, in terms of impact:

anomie>	_joe_: After filtering out *wiktionary ns0 and ns1, looks like 917 live pages across all wikis. enwiki has the most (160), followed by ruwiki (141), commonswiki (109), and kawiki (107). 77 distinct local usernames, only one is not SUL (I didn't check if any others were unattached though). 92 SUL accounts, apparently some have no local account (locally renamed for being harassing, maybe?). Also 249 deleted page titles (2208 deleted revisions)

This doesn't seem like a huge deal to me, and I guess we can fix the situations on a case-by-case basis if these are the numbers.

@Esanders is it a huge deal if, in light of the blast radius of this, we proceed with the php7 transition and a fix is applied afterwards?

Joe added a comment.Apr 9 2019, 6:52 AM

@Joe did you get what you needed from CPT in IRC? If not please feel free to move back to our Inbox.

Yes, we (SRE) will need the following things:

  1. An opinion on whether it's ok to live with the issue on the pages @Anomie found to be affected
  2. An implementation of a solution for such pages, or at least a proposed workflow around that.

Change 502546 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[mediawiki/core@master] [WiP] Add ability to override mb_strtoupper in Language::ucfirst

https://gerrit.wikimedia.org/r/502546

@Esanders is it a huge deal if, in light of the blast radius of this, we proceed with the php7 transition and a fix is applied afterwards?

I will defer to your judgement on the impact. Ideally we will synchronise any PHP fix with an equivalent fix in the JS (T141723).

Joe added a comment.Apr 10 2019, 8:33 AM

My current plan would be:

  • At first, add a "backward compatibility" conversion table so that php7 can behave like HHVM
  • Once we know how to proceed in terms of changing page / resources names, if the transition to php7 is not completed, we can make another conversion table that fixes the behaviour of HHVM

We'd only need to sync that with the JS fix if we remove the compat layer from php7.

For now, I'd just like to be sure we can continue the php7 rollout.
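
A rough sketch of what such a "backward compatibility" conversion table could look like, written in JavaScript for illustration only. The table name is modelled on the wgOverrideUcfirstCharacters setting mentioned later in this task, but the function and its signature are hypothetical and are not the Language::ucfirst implementation. The two entries are taken from the removed lines of the diff in the description.

```javascript
// Hypothetical override table: first character => forced uppercase form.
const overrideUcfirstCharacters = {
	'\u01C5': '\u01C5', // keep the titlecase digraph as-is
	'\u01C6': '\u01C5', // lowercase digraph goes to titlecase, as before the Unicode update
};

// Uppercase the first character of a title, consulting the override table first.
function ucfirst(title, overrides) {
	if (title === '') {
		return title;
	}
	// Take the first code point rather than the first UTF-16 unit,
	// so characters outside the BMP are not split.
	const first = String.fromCodePoint(title.codePointAt(0));
	const rest = title.slice(first.length);
	const upper = Object.prototype.hasOwnProperty.call(overrides, first)
		? overrides[first]
		: first.toUpperCase();
	return upper + rest;
}
```

With the table in place, `ucfirst('\u01C6abc', overrideUcfirstCharacters)` keeps the old normalised form; passing an empty table gives the new behaviour. An inverse table for HHVM would follow the same shape with the new results as values.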

Esanders added a comment.EditedApr 10 2019, 1:30 PM

Sounds good to me. We can validate the compatibility layer by re-running the scripts added in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/499196/ and checking there are no changes.
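
The "no changes" check can be pictured as a diff between the stored mapping and a freshly generated one. The sketch below is an assumption about the shape of that comparison, not the script from the linked change; `diffMappings` is a hypothetical helper and the sample tables are trimmed to a few entries from the diff in the description.

```javascript
// Compare a previously stored mapping with a freshly generated one.
// An entry whose value changed is reported in both lists, like a textual diff.
function diffMappings(stored, generated) {
	const removed = [];
	const added = [];
	for (const [chr, upper] of Object.entries(stored)) {
		if (generated[chr] !== upper) {
			removed.push([chr, upper]);
		}
	}
	for (const [chr, upper] of Object.entries(generated)) {
		if (stored[chr] !== upper) {
			added.push([chr, upper]);
		}
	}
	return { removed, added };
}

const stored = { '\u00DF': '\u00DF', '\u01C6': '\u01C5' };
const generated = { '\u00DF': '\u00DF', '\u026A': '\uA7AE' };
const result = diffMappings(stored, generated);
console.log(result.removed.length, result.added.length); // 1 1
```

The compatibility layer would count as validated when both lists come back empty.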

Change 502800 had a related patch set uploaded (by Esanders; owner: Esanders):
[mediawiki/core@master] Make generatePhpCharToUpperMappings.php a proper maintenance script

https://gerrit.wikimedia.org/r/502800

Change 504584 had a related patch set uploaded (by Reedy; owner: Giuseppe Lavagetto):
[mediawiki/core@wmf/1.34.0-wmf.1] Add ability to override mb_strtoupper in Language::ucfirst

https://gerrit.wikimedia.org/r/504584

Change 504584 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.1] Add ability to override mb_strtoupper in Language::ucfirst

https://gerrit.wikimedia.org/r/504584

Mentioned in SAL (#wikimedia-operations) [2019-04-17T16:18:47Z] <jforrester@deploy1001> Synchronized php-1.34.0-wmf.1/includes/DefaultSettings.php: T219279 Ability to set wgOverrideUcfirstCharacters part 1b (duration: 01m 03s)

Mentioned in SAL (#wikimedia-operations) [2019-04-17T16:20:58Z] <jforrester@deploy1001> Synchronized php-1.34.0-wmf.1/languages/Language.php: T219279 Ability to set wgOverrideUcfirstCharacters part 1 try two (duration: 01m 00s)

Change 502546 merged by jenkins-bot:
[mediawiki/core@master] Add ability to override mb_strtoupper in Language::ucfirst

https://gerrit.wikimedia.org/r/502546

@Joe did you get what you needed from CPT in IRC? If not please feel free to move back to our Inbox.

Yes, we (SRE) will need the following things:

  1. An opinion on whether it's ok to live with the issue on the pages @Anomie found to be affected
  2. An implementation of a solution for such pages, or at least a proposed workflow around that.

@Anomie or @tstarling could you respond to the above points?

Change 505487 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Add Language::ucfirst overrides for php 7.2

https://gerrit.wikimedia.org/r/505487

Joe added a comment.Apr 22 2019, 6:27 AM

@kchapman regarding point 1 above - I've prepared various patches, including https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/505487 that should act as a stopgap solution for now.

We still need a way to fix the titles of those pages, then we can fix the behaviour of HHVM there.

  1. An opinion on whether it's ok to live with the issue on the pages @Anomie found to be affected
  2. An implementation of a solution for such pages, or at least a proposed workflow around that.

Corey asked me now to work on this task, so I can now get started on doing #2.

For #1, it's hard to say since I don't know anything about the languages that might actually be affected. I can say that on wikis like enwiki it seems the articles affected are mostly about the letters themselves; the main question would be whether in any of these cases enwiki has the article at the lowercase title with a redirect from uppercase rather than vice versa. And I don't see many multi-character article titles in the list from other wikis either.

Joe added a comment.Apr 29 2019, 7:55 AM

Regarding #1, today I'm going to merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/505487, which should make those pages reachable from php7 until we have a script to do #2. Once we have it, we can fix the affected pages/users and apply the inverse mapping to HHVM instead.

Change 505487 merged by jenkins-bot:
[operations/mediawiki-config@master] Add Language::ucfirst overrides for php 7.2

https://gerrit.wikimedia.org/r/505487

Mentioned in SAL (#wikimedia-operations) [2019-04-29T08:23:40Z] <oblivian@deploy1001> Synchronized wmf-config/Php72ToUpper.php: Adding unicode overrides table for php 7.2 T219279 (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2019-04-29T08:25:23Z] <oblivian@deploy1001> Synchronized wmf-config/CommonSettings.php: Enable unicode overrides table for php 7.2 T219279 (duration: 00m 53s)

Change 507596 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] maintenance: Script to rename titles for Unicode uppercasing changes

https://gerrit.wikimedia.org/r/507596

Change 502800 merged by jenkins-bot:
[mediawiki/core@master] Make generatePhpCharToUpperMappings.php a proper maintenance script

https://gerrit.wikimedia.org/r/502800

CPT needs to review regarding long term fixes for this.

Anomie added a comment.May 8 2019, 3:04 PM

The long-term plan, as I understand it, is that we'll run maintenance scripts to rename pages (see gerrit:507596) and users (see rEWMA9122f6c) that are affected by the change, then reverse rOMWC713a20a0f2dd: Add Language::ucfirst overrides for php 7.2 to have HHVM use PHP 7.2's uppercasing table. Note we may have to follow the same process in the future whenever we upgrade to a new version of PHP, as it seems upstream is intending to do a better job of tracking new versions of Unicode.

Current status is that gerrit:507596 needs review and I need to look at the script from rEWMA9122f6c to see what changes it needs to be used for this task.

Krinkle updated the task description. (Show Details)EditedMay 23 2019, 12:37 AM
Krinkle added a subscriber: Krinkle.

for example https://en.wikipedia.org/w/index.php?title=%C7%85&redirect=no takes you to article ID 7074938 in PHP5 HHVM, but if you enable the PHP7 beta feature, it takes you to 7074928, and the old article is now completely inaccessible.

This now works as expected via both HHVM and PHP 7 (serves page ID 7074938).

That's thanks to the (temporary) workaround applied in T219279#5142875. We still need to fix things properly, as described in T219279#5167380.

Joe moved this task from Backlog to Externally Blocked on the serviceops board.Jun 20 2019, 9:48 AM
Joe removed Joe as the assignee of this task.Jun 21 2019, 7:07 AM

Change 507596 merged by jenkins-bot:
[mediawiki/core@master] maintenance: Script to rename titles for Unicode uppercasing changes

https://gerrit.wikimedia.org/r/507596

Change 522489 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/WikimediaMaintenance@master] RenameInvalidUsernames: Make more generic

https://gerrit.wikimedia.org/r/522489

Change 522489 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] RenameInvalidUsernames: Make more generic

https://gerrit.wikimedia.org/r/522489

Next steps here:

For the User-notice, perhaps something like:

Wikimedia wikis are going to use a newer version of the Unicode standard. The new Unicode version adds new uppercase mappings for some uncommon characters, for example "ʞ" will now uppercase to "Ʞ". This will make some existing pages and user names inaccessible. The affected pages and users will be renamed by a maintenance script. A list of current titles that will be renamed is at LINK.

Dalba added a subscriber: Dalba.Aug 5 2019, 5:35 AM
NicoV added a subscriber: NicoV.Aug 5 2019, 6:43 AM
Gorobay added a subscriber: Gorobay.Oct 3 2019, 2:29 PM

Many articles beginning with lowercase letters are redirects to articles about the letters themselves, in which case either the target is named with the capital letter or the capital letter also redirects to the same target. Giving those redirects the suffix “ (former Unicode lowercase)” is not useful, because they are only currently useful as workarounds for the outdated Unicode version in PHP. I propose letting such redirects become inaccessible without moving them anywhere.

And in some cases the actual article is at the lowercase-letter title, IIRC. The trick would be in determining which is which. Rather than do that for many different languages, we'll just move the pages to accessible titles and let each wiki's community decide whether they want to keep them, further rename them, or delete them.

Krinkle removed a subscriber: Krinkle.Oct 29 2019, 6:22 PM
Ottomata assigned this task to Anomie.Oct 29 2019, 7:40 PM
Ottomata added a subscriber: Ottomata.

@Anomie, assigning to you as it seems you are working on this. Feel free to undo or re-assign if I'm wrong.

Anomie removed Anomie as the assignee of this task.Oct 30 2019, 5:28 PM
Anomie added a subscriber: WDoranWMF.

I'm not actively working on it at the moment, so I'm going to unlick the cookie for the moment in case someone else wants to pick it up. If not, I'll try to pick it up again once I've taken care of my current projects.

Feel free to ping us if you think we should raise the priority. On that note, @WDoranWMF seems to be the current arbiter of priorities in Core Platform Team Workboards (Clinic Duty Team) so I'll ping him now.

After filtering out *wiktionary ns0 and ns1, looks like 917 live pages across all wikis...

@Anomie - Any idea how many pages on Wiktionary would be affected?

Anomie added a comment.EditedOct 30 2019, 7:42 PM

Close to 0. Wiktionaries (and jbowiki) have $wgCapitalLinks set false, so the problem here only applies in the User and MediaWiki namespaces (and also Template and Module on zhwiktionary due to $wgCapitalLinkOverrides on that wiki). Possibly I should have filtered out more namespaces when I figured that 917.

The maintenance script handles this automatically: it uses NamespaceInfo to identify non-capitalized namespaces so it can ignore them.
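
The filtering can be sketched roughly as follows. This is a JavaScript illustration under stated assumptions, not the maintenance script itself: `findUnreachable`, `newUcfirst` and `isCapitalized` are hypothetical names, and the namespace predicate stands in for what NamespaceInfo provides. The IDs 2 and 8 are the standard User and MediaWiki namespaces.

```javascript
// Uppercase the first code point of a title using the new (native) mapping.
function newUcfirst(title) {
	if (title === '') {
		return title;
	}
	const first = String.fromCodePoint(title.codePointAt(0));
	return first.toUpperCase() + title.slice(first.length);
}

// A stored title is unreachable if its namespace capitalises first letters
// and the new normalisation no longer reproduces the stored form.
function findUnreachable(pages, ucfirst, isCapitalized) {
	return pages.filter(
		(page) => isCapitalized(page.namespace) && ucfirst(page.title) !== page.title
	);
}

// Hypothetical Wiktionary-like setup: only User (2) and MediaWiki (8) are capitalised.
const isCapitalized = (ns) => ns === 2 || ns === 8;

const pages = [
	{ namespace: 0, title: '\u01C5' },  // main namespace, not capitalised: ignored
	{ namespace: 2, title: '\u01C5x' }, // user page stored under the old form: affected
	{ namespace: 2, title: 'Apple' },   // unaffected
];

console.log(findUnreachable(pages, newUcfirst, isCapitalized).length); // 1
```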