Page MenuHomePhabricator

Some pages will become completely unreachable after PHP7 update due to Unicode changes
Open, HighPublic

Description

As detailed in T141723#5057472, mb_strtoupper, which we use to normalise titles, changes slightly in PHP7 with the Unicode update. As a result certain titles will have their normalised forms changed, and therefore will be unreachable if nothing is changed,

for example https://en.wikipedia.org/w/index.php?title=%C7%85&redirect=no takes you to article ID 7074938 in PHP5 HHVM, but if you enable the PHP7 beta feature, it takes you to 7074928, and the old article is now completely inaccessible.

Here are the changes (removed lines means the right hand side is no longer the result of mb_strtoupper, added lines are where the right hand side is a new result of mb_strtoupper):

--- a/resources/src/mediawiki.Title/phpCharToUpper.js
+++ b/resources/src/mediawiki.Title/phpCharToUpper.js
@@ -6,15 +6,8 @@
 	var toUpperMapping = {
 		'ß': 'ß',
 		'ʼn': 'ʼn',
-		'Dž': 'Dž',
-		'dž': 'Dž',
-		'Lj': 'Lj',
-		'lj': 'Lj',
-		'Nj': 'Nj',
-		'nj': 'Nj',
 		'ǰ': 'ǰ',
-		'Dz': 'Dz',
-		'dz': 'Dz',
+		'ɪ': 'Ɪ',
 		'ʝ': 'Ʝ',
 		'ͅ': 'ͅ',
 		'ΐ': 'ΐ',
@@ -26,6 +19,15 @@
 		'ᏻ': 'Ᏻ',
 		'ᏼ': 'Ᏼ',
 		'ᏽ': 'Ᏽ',
+		'ᲀ': 'В',
+		'ᲁ': 'Д',
+		'ᲂ': 'О',
+		'ᲃ': 'С',
+		'ᲄ': 'Т',
+		'ᲅ': 'Т',
+		'ᲆ': 'Ъ',
+		'ᲇ': 'Ѣ',
+		'ᲈ': 'Ꙋ',
 		'ẖ': 'ẖ',
 		'ẗ': 'ẗ',
 		'ẘ': 'ẘ',

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Joe added a comment.Apr 8 2019, 9:18 AM

So, I think Joe's proposal it to patch/revert the behaviour on PHP7 source code. I would like to propose the alternative, which is maintain it on mediawiki, for a short period of time, while upgrading the library. I think the new behaviour is something we want long term, so I propose to add a patch to mediawiki, that works < 7.0, to detect those usernames and titles with the behaviour change and apply some exceptions, or just reimplement mb_toupper/to_lower with the current behaviour, log the problems. Once the new and the old behaviour is the same, revert the function to the default behaviour.
This has the issue of potentially be a hit on performance (to be checked), but it could be more flexible with a config change rather than having to recompile php every time we want to change it (and will serve in future cases). We already handle collation, character set and other things on app side due to limitations of 3rd party software, so it wouldn't be a new thing we never do. Specially given it is supposed to go away with a particular deadline. Doing the workaround on mw may also help 3rd party users.
I am just giving an opinion, would need to check actual code to see if that is feasible or desireable.

I like this idea, in particular the part about logging problematic pages. However:

  • I expect a full monkey-patching of mb_toupper to be unfeasible. Although AFAICS there is very sparse usage of mb_strtoupper in core. But we'd need to patch all extensions as well?
  • We could just fix the normalized tiltes/usernames, but that would make mb_toupper() behave differently there and in the rest of the code, which could cause unexpected consequences

I'd like to hear the opinion of someone with more experience with MediaWiki's own code on this.

jcrespo added a comment.EditedApr 8 2019, 9:19 AM

I'd like to hear the opinion of someone with more experience with MediaWiki's own code on this.

+1

In an ideal world, I would expect there is a normalize_title() method everybody is using...

Joe claimed this task.Apr 8 2019, 10:35 AM
Joe added a project: serviceops.

I looked into the option of reverting the fixes in php 7.2, and it would mean basically reverting the following patches:

they revert cleanly and I am trying a build of php 7.2 with those patches removed.
I think this is ok as long as we don't keep this floating for too long.

FTR, I have rebuilt the php packages with this revert applied, they rebuild correctly and I have ran a smoke-test which seems to show the old behaviour is indeed restored. I am now running a more complete test re-purposing the quick test script @Esanders mentioned in https://phabricator.wikimedia.org/T141723#2701096

I'll prepare a build on boron with this patch and ask @MoritzMuehlenhoff for a review. I don't think creating a repository for this would particularly help, but I'm open to dissenting opinions :)

I wonder also if collation or other character-related updates will be affected by PHP update, such as T146341 CC @kaldari

Joe added a comment.Apr 8 2019, 11:23 AM

I wonder also if collation or other character-related updates will be affected by PHP update, such as T146341 CC @kaldari

No, ICU is unchanged with respect to HHVM.

Joe added a comment.Apr 8 2019, 1:57 PM

Even when backporting the above patches, I tried to run a simple script against HHVM and php7.2:

<?php

echo '"Codepoint", "Character", "mb_strtoupper", "String.toUpperCase"'."\n";

for ( $i = 0; $i < 65536; $i++ ) {
    $char = mb_convert_encoding( '&#' . $i . ';', 'UTF-8', 'HTML-ENTITIES' );
    $php = mb_strtoupper( $char );
    if ( $char == $php ) {
       echo '"\\u', str_pad( dechex( $i ), 4, '0', STR_PAD_LEFT), '", "' , $char, '", "', $php, '"', "\n";
    } 
}

this shows a somewhat large set of characters is still converted in the wrong way:

# total diffs with the current php7.2 package:

$ diff -u0 -ar unicode.php7 unicode.hhvm | grep -v ^@@  | wc -l
413

# diffs with the patched php7.2 package:
$ diff -u0 -ar unicode.php7.patched unicode.hhvm | grep -v ^@@  | wc -l
399

so we would still be left with a long list of characters to fix. This makes @jcrespo's suggestion of going the software way more enticing.

@Anomie suggested on IRC:

I like Jaime's proposal too; I said something similar in the #wikimedia-cpt channel to eprodromou the other day. In MediaWiki the patch would likely go in Language::ucfirst(), conceptually similar to what we already do in e.g. LanguageTr::ucfirst() to handle dotless-I in Turkish.
Joe added a comment.Apr 8 2019, 3:40 PM

I tried to write a simplistic implementation of such a function:

<?php
function my_strtoupper( $bytes ) {
    $not_capitalized = [0x0180, ...]; // this is the list of characters to test against
    $result = "";
    for ($i=0; $i<strlen($bytes)  - 1; $i++) {
        $char = mb_substr($bytes, $i, 1);
        $result .= ( in_array( mb_ord( $char ), $not_capitalized) ) ? $char : mb_strtoupper( $char );
    }
    return $result;
}

this is 2L slower than the native mb_strtoupper (where L is string length), but I think we can make it optional. It should be noted that if we can limit it to only act on the first character of the string, it's just 2x slower, and there is ample space to make this faster.

Joe added a comment.Apr 8 2019, 3:47 PM

It should be noted that if this seems too slow, we can still "fix" the behaviour in the C code of mbstring. I'd prefer avoiding that.

We can also limit this behaviour to php7, and only to non-ascii strings.

@Joe did you get what you needed from CPT in IRC? If not please feel free to move back to our Inbox.

Joe added a comment.Apr 9 2019, 6:50 AM

So for the record, in terms of impact:

anomie>	_joe_: After filtering out *wiktionary ns0 and ns1, looks like 917 live pages across all wikis. enwiki has the most (160), followed by ruwiki (141), commonswiki (109), and kawiki (107). 77 distinct local usernames, only one is not SUL (I didn't check if any others were unattached though). 92 SUL accounts, apparently some have no local account (locally renamed for being harassing, maybe?). Also 249 deleted page titles (2208 deleted revisions)

This doesn't seem like a huge deal to me, and I guess we can fix the situations on a case-by-case basis if these are the numbers.

@Esanders is it a huge deal if, in light of the blast radius of this, we proceed with the php7 transition and a fix is applied afterwards?

Joe added a comment.Apr 9 2019, 6:52 AM

@Joe did you get what you needed from CPT in IRC? If not please feel free to move back to our Inbox.

Yes, we (SRE) will need the following things:

  1. An opinion on wether it's ok to live with the issue on the pages @Anomie found to be affected
  2. An implementation of a solution for such pages, or at least a proposed workflow around that.

Change 502546 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[mediawiki/core@master] [WiP] Add ability to override mb_strtoupper in Language::ucfirst

https://gerrit.wikimedia.org/r/502546

@Esanders is it a huge deal if, in light of the blast radius of this, we proceed with the php7 transition and a fix is applied afterwards?

I will defer to your judgement on the impact. Ideally we will synchronise any PHP fix with an equivalent fix in the JS (T141723).

Joe added a comment.Apr 10 2019, 8:33 AM

@Esanders is it a huge deal if, in light of the blast radius of this, we proceed with the php7 transition and a fix is applied afterwards?

I will defer to your judgement on the impact. Ideally we will synchronise any PHP fix with an equivalent fix in the JS (T141723).

My current plan would be:

  • At first, add a "backward compatibility" conversion table so that php7 can behave like HHVM
  • Once we know how to proceed in terms of changing page / resources names, if the transition to php7 is not completed, we can make another conversion table that fixes the behaviour of HHVM

Only if we remove the compat layer from php7 we'd need to sync that with the js fix, sure.

For now, I'd just like to be sure we can continue the php7 rollout.

Esanders added a comment.EditedApr 10 2019, 1:30 PM

Sounds good to me. We can validate the compatibility layer by re-running the scripts added https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/499196/ and checking there are no changes.

Change 502800 had a related patch set uploaded (by Esanders; owner: Esanders):
[mediawiki/core@master] Make generatePhpCharToUpperMappings.php a proper maintenance script

https://gerrit.wikimedia.org/r/502800

Change 504584 had a related patch set uploaded (by Reedy; owner: Giuseppe Lavagetto):
[mediawiki/core@wmf/1.34.0-wmf.1] Add ability to override mb_strtoupper in Language::ucfirst

https://gerrit.wikimedia.org/r/504584

Change 504584 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.1] Add ability to override mb_strtoupper in Language::ucfirst

https://gerrit.wikimedia.org/r/504584

Mentioned in SAL (#wikimedia-operations) [2019-04-17T16:18:47Z] <jforrester@deploy1001> Synchronized php-1.34.0-wmf.1/includes/DefaultSettings.php: T219279 Ability to set wgOverrideUcfirstCharacters part 1b (duration: 01m 03s)

Mentioned in SAL (#wikimedia-operations) [2019-04-17T16:20:58Z] <jforrester@deploy1001> Synchronized php-1.34.0-wmf.1/languages/Language.php: T219279 Ability to set wgOverrideUcfirstCharacters part 1 try two (duration: 01m 00s)

Change 502546 merged by jenkins-bot:
[mediawiki/core@master] Add ability to override mb_strtoupper in Language::ucfirst

https://gerrit.wikimedia.org/r/502546

@Joe did you get what you needed from CPT in IRC? If not please feel free to move back to our Inbox.

Yes, we (SRE) will need the following things:

  1. An opinion on wether it's ok to live with the issue on the pages @Anomie found to be affected
  2. An implementation of a solution for such pages, or at least a proposed workflow around that.

@Anomie or @tstarling could you respond to the above points?

Change 505487 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Add Language::ucfirst overrides for php 7.2

https://gerrit.wikimedia.org/r/505487

Joe added a comment.Apr 22 2019, 6:27 AM

@kchapman regarding point 1 above - I've prepared various patches, including https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/505487 that should act as a stopgap solution for now.

We still need a way to fix the titles of those pages, then we can fix the behaviour of HHVM there.

  1. An opinion on wether it's ok to live with the issue on the pages @Anomie found to be affected
  2. An implementation of a solution for such pages, or at least a proposed workflow around that.

Corey asked me now to work on this task, so I can now get started on doing #2.

For #1, it's hard to say since I don't know anything about the languages that might actually be affected. I can say that on wikis like enwiki it seems the articles affected are mostly about the letters themselves, the main question would be whether in any of these cases enwiki has the article at the lowercase title with a redirect from uppercase rather than vice versa. And I don't see many multi-character article titles in the list from other wikis either.

Joe added a comment.Apr 29 2019, 7:55 AM
  1. An opinion on wether it's ok to live with the issue on the pages @Anomie found to be affected
  2. An implementation of a solution for such pages, or at least a proposed workflow around that.

Corey asked me now to work on this task, so I can now get started on doing #2.
For #1, it's hard to say since I don't know anything about the languages that might actually be affected. I can say that on wikis like enwiki it seems the articles affected are mostly about the letters themselves, the main question would be whether in any of these cases enwiki has the article at the lowercase title with a redirect from uppercase rather than vice versa. And I don't see many multi-character article titles in the list from other wikis either.

Regarding #1, I'm going to merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/505487 today that should allow to reach those pages from php7 while we have a script to do #2. Once we do, we can fix the affected pages/users and do the inverse mapping for HHVM instead.

Change 505487 merged by jenkins-bot:
[operations/mediawiki-config@master] Add Language::ucfirst overrides for php 7.2

https://gerrit.wikimedia.org/r/505487

Mentioned in SAL (#wikimedia-operations) [2019-04-29T08:23:40Z] <oblivian@deploy1001> Synchronized wmf-config/Php72ToUpper.php: Adding unicode overrides table for php 7.2 T219279 (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2019-04-29T08:25:23Z] <oblivian@deploy1001> Synchronized wmf-config/CommonSettings.php: Enable unicode overrides table for php 7.2 T219279 (duration: 00m 53s)

Change 507596 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] maintenance: Script to rename titles for Unicode uppercasing changes

https://gerrit.wikimedia.org/r/507596

Change 502800 merged by jenkins-bot:
[mediawiki/core@master] Make generatePhpCharToUpperMappings.php a proper maintenance script

https://gerrit.wikimedia.org/r/502800

CPT needs to review regarding long term fixes for this.

Anomie added a comment.May 8 2019, 3:04 PM

The long-term plan, as I understand it, is that we'll run maintenance scripts to rename pages (see gerrit:507596) and users (see rEWMA9122f6c) that are affected by the change, then reverse rOMWC713a20a0f2dd: Add Language::ucfirst overrides for php 7.2 to have HHVM use PHP 7.2's uppercasing table. Note we may have to follow the same process in the future whenever we upgrade to a new version of PHP, as it seems upstream is intending to do a better job of tracking new versions of Unicode.

Current status is that gerrit:507596 needs review and I need to look at the script from rEWMA9122f6c to see what changes it needs to be used for this task.

Krinkle updated the task description. (Show Details)EditedMay 23 2019, 12:37 AM
Krinkle added a subscriber: Krinkle.

for example https://en.wikipedia.org/w/index.php?title=%C7%85&redirect=no takes you to article ID 7074938 in PHP5 HHVM, but if you enable the PHP7 beta feature, it takes you to 7074928, and the old article is now completely inaccessible.

This now work as expected both via HHVM and PHP 7 (serves page ID 7074938).

This now work as expected both via HHVM and PHP 7 (serves page ID 7074938).

That's thanks to the (temporary) workaround applied in T219279#5142875. We still need to fix things properly, as described in T219279#5167380.

Joe moved this task from Backlog to Externally Blocked on the serviceops board.Jun 20 2019, 9:48 AM
Joe removed Joe as the assignee of this task.Jun 21 2019, 7:07 AM

Change 507596 merged by jenkins-bot:
[mediawiki/core@master] maintenance: Script to rename titles for Unicode uppercasing changes

https://gerrit.wikimedia.org/r/507596

Change 522489 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/WikimediaMaintenance@master] RenameInvalidUsernames: Make more generic

https://gerrit.wikimedia.org/r/522489

Change 522489 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] RenameInvalidUsernames: Make more generic

https://gerrit.wikimedia.org/r/522489

Next steps here:

  • 1. Determine the schedule to do these next steps.
  • 2. Generate a list of affected titles, and announce the upcoming change on User-notice.
  • 3. Run uppercaseTitlesForUnicodeTransition.php and renameInvalidUsernames.php on all wikis.
    • My current plan would be to use --suffix ' (former Unicode lowercase)' for handling of collisions.
  • 4. Reverse https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/505487 so HHVM uses PHP 7.2's mappings.
  • 5. Run uppercaseTitlesForUnicodeTransition.php and renameInvalidUsernames.php on all wikis again, in case someone created new problem titles while the scripts were running the first time.

For the User-notice, perhaps something like:

Wikimedia wikis are going to use a newer version of the Unicode standard. The new Unicode version adds new uppercase mappings for some uncommon characters, for example "ʞ" will now uppercase to "Ʞ". This will make some existing pages and user names inaccessible. The affected pages and users will be renamed by a maintenance script. A list of current titles that will be renamed is at LINK.

Dalba added a subscriber: Dalba.Aug 5 2019, 5:35 AM
NicoV added a subscriber: NicoV.Aug 5 2019, 6:43 AM