Some pages will become completely unreachable after PHP7 update due to Unicode changes
Open, High, Public

Description

As detailed in T141723#5057472, mb_strtoupper, which we use to normalise titles, behaves slightly differently in PHP7 because of its Unicode update. As a result, certain titles will have different normalised forms, and the pages stored under the old forms will become unreachable unless something is changed.

For example, https://en.wikipedia.org/w/index.php?title=%C7%85&redirect=no takes you to article ID 7074938 in PHP5/HHVM, but if you enable the PHP7 beta feature, it takes you to 7074928, and the old article is now completely inaccessible.

Here are the changes (removed lines mean the right-hand side is no longer the result of mb_strtoupper; added lines are where the right-hand side is a new result of mb_strtoupper):

--- a/resources/src/mediawiki.Title/phpCharToUpper.js
+++ b/resources/src/mediawiki.Title/phpCharToUpper.js
@@ -6,15 +6,8 @@
 	var toUpperMapping = {
 		'ß': 'ß',
 		'ŉ': 'ŉ',
-		'ǅ': 'ǅ',
-		'ǆ': 'ǅ',
-		'ǈ': 'ǈ',
-		'ǉ': 'ǈ',
-		'ǋ': 'ǋ',
-		'ǌ': 'ǋ',
 		'ǰ': 'ǰ',
-		'ǲ': 'ǲ',
-		'ǳ': 'ǲ',
+		'ɪ': 'Ɪ',
 		'ʝ': 'Ʝ',
 		'ͅ': 'ͅ',
 		'ΐ': 'ΐ',
@@ -26,6 +19,15 @@
 		'ᏻ': 'Ᏻ',
 		'ᏼ': 'Ᏼ',
 		'ᏽ': 'Ᏽ',
+		'ᲀ': 'В',
+		'ᲁ': 'Д',
+		'ᲂ': 'О',
+		'ᲃ': 'С',
+		'ᲄ': 'Т',
+		'ᲅ': 'Т',
+		'ᲆ': 'Ъ',
+		'ᲇ': 'Ѣ',
+		'ᲈ': 'Ꙋ',
 		'ẖ': 'ẖ',
 		'ẗ': 'ẗ',
 		'ẘ': 'ẘ',
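
To illustrate why a change in the uppercasing table makes a page unreachable, here is a minimal sketch rather than MediaWiki's actual code. The function names, the two tables and the page store are hypothetical. The mapping of U+01C5 to U+01C4 under the new behaviour is an assumption based on the example URL above (%C7%85 is U+01C5) and on the removed lines of the diff.

```javascript
// Hypothetical model of first-letter title normalisation; not MediaWiki's code.
// Each table maps a first character to what server-side uppercasing returns.
const oldTable = { '\u01C5': '\u01C5' }; // assumption: old behaviour leaves U+01C5 unchanged
const newTable = { '\u01C5': '\u01C4' }; // assumption: new behaviour maps U+01C5 to U+01C4

function normaliseTitle(title, table) {
	const chars = Array.from(title); // split by code point, not by UTF-16 unit
	if (chars.length === 0) {
		return title;
	}
	const first = chars[0];
	const upper = Object.prototype.hasOwnProperty.call(table, first) ? table[first] : first;
	return upper + chars.slice(1).join('');
}

// Pages are stored under the title that was normalised when they were created.
const pages = new Map([
	['\u01C5', 'page stored under U+01C5'],
	['\u01C4', 'page stored under U+01C4'],
]);

// A request is normalised with the current table before the page is looked up.
function lookup(requested, table) {
	return pages.get(normaliseTitle(requested, table));
}
```

With `oldTable`, a request for U+01C5 finds the page stored under U+01C5. With `newTable`, the same request is normalised to U+01C4 first, so the page stored under U+01C5 can no longer be requested at all.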

Event Timeline

Dalba added a subscriber: Dalba. Aug 5 2019, 5:35 AM
NicoV added a subscriber: NicoV. Aug 5 2019, 6:43 AM
Gorobay added a subscriber: Gorobay. Oct 3 2019, 2:29 PM

Many articles beginning with lowercase letters are redirects to articles about the letters themselves, in which case either the target is named with the capital letter or the capital letter also redirects to the same target. Giving those redirects the suffix “ (former Unicode lowercase)” is not useful, because they are only currently useful as workarounds for the outdated Unicode version in PHP. I propose letting such redirects become inaccessible without moving them anywhere.

And in some cases the actual article is at the lowercase-letter title, IIRC. The trick would be in determining which is which. Rather than do that for many different languages, we'll just move the pages to accessible titles and let each wiki's community decide whether they want to keep them, further rename them, or delete them.

Krinkle removed a subscriber: Krinkle. Oct 29 2019, 6:22 PM
Ottomata assigned this task to Anomie. Oct 29 2019, 7:40 PM
Ottomata added a subscriber: Ottomata.

@Anomie, assigning to you as it seems you are working on this. Feel free to undo or re-assign if I'm wrong.

Anomie removed Anomie as the assignee of this task. Oct 30 2019, 5:28 PM
Anomie added a subscriber: WDoranWMF.

I'm not actively working on it at the moment, so I'm going to unlick the cookie for the moment in case someone else wants to pick it up. If not, I'll try to pick it up again once I've taken care of my current projects.

Feel free to ping us if you think we should raise the priority. On that note, @WDoranWMF seems to be the current arbiter of priorities in Platform Team Workboards (Clinic Duty Team) so I'll ping him now.

After filtering out *wiktionary ns0 and ns1, looks like 917 live pages across all wikis...

@Anomie - Any idea how many pages on Wiktionary would be affected?

Anomie added a comment (edited). Oct 30 2019, 7:42 PM

Close to 0. Wiktionaries (and jbowiki) have $wgCapitalLinks set to false, so the problem here only applies in the User and MediaWiki namespaces (and also Template and Module on zhwiktionary, due to $wgCapitalLinkOverrides on that wiki). Possibly I should have filtered out more namespaces when I came up with that 917 figure.

The maintenance script handles this automatically: it uses NamespaceInfo to identify non-capitalized namespaces so it can ignore them.
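
The namespace check described above can be sketched roughly as follows. This is a simplified, hypothetical model and not the NamespaceInfo implementation: the option names mirror $wgCapitalLinks and $wgCapitalLinkOverrides, the list of always-capitalized namespaces is an assumption drawn from the comment above, and the example configurations are illustrative.

```javascript
// Simplified, hypothetical model of the namespace check; not MediaWiki's code.
const NS_MAIN = 0;
const NS_USER = 2;
const NS_MEDIAWIKI = 8;
const NS_TEMPLATE = 10;
const NS_MODULE = 828;

// Assumption: these namespaces are capitalized regardless of configuration.
const ALWAYS_CAPITALIZED = [NS_USER, NS_MEDIAWIKI];

function isCapitalized(ns, config) {
	// Treat a talk namespace (odd number) like its subject namespace.
	const subject = ns % 2 === 1 ? ns - 1 : ns;
	if (ALWAYS_CAPITALIZED.includes(subject)) {
		return true;
	}
	// Per-namespace overrides take precedence over the wiki-wide default.
	const overrides = config.capitalLinkOverrides || {};
	if (Object.prototype.hasOwnProperty.call(overrides, subject)) {
		return overrides[subject];
	}
	return config.capitalLinks;
}

// Illustrative configurations modelled on the wikis mentioned above.
const wiktionaryLike = { capitalLinks: false };
const zhwiktionaryLike = {
	capitalLinks: false,
	capitalLinkOverrides: { [NS_TEMPLATE]: true, [NS_MODULE]: true },
};
```

Under this model, a script would only need to consider titles in namespaces where `isCapitalized` returns true, which on a Wiktionary-like configuration leaves User and MediaWiki (plus their talk namespaces).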

Suggestion for User-notice

MediaWiki is upgrading to a newer version of Unicode. Some characters that did not have an uppercase form before now do. Titles beginning with one of these characters will be moved. A list of these titles can be seen at https://phabricator.wikimedia.org/P10817. The titles will be moved by user "Maintenance script" beginning 13 April 2020. You may rename them ahead of time if you wish, and the new title can be different from the one the script would rename it to.

Change 585898 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] uppercaseTitlesForUnicodeTransition.php: Output moves with --run too

https://gerrit.wikimedia.org/r/585898

Change 585899 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] uppercaseTitlesForUnicodeTransition.php: Improve handling of non-moves

https://gerrit.wikimedia.org/r/585899

Change 585900 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] uppercaseTitlesForUnicodeTransition.php: Delete useless redirects

https://gerrit.wikimedia.org/r/585900

The titles will be moved by user "Maintenance script" beginning 13 April 2020.

If someone can update the global user page of this account with up-to-date information, that would be awesome :)

The user notice was just sent out, and there are a number of Commons files that will be renamed. Since the old title will not work anymore, we need to ensure that links are also updated when the files are moved; Commons' move-and-replace does this. I'm going to start moving the Commons files now, to the same target titles, so that global uses are replaced, but help would be appreciated :)

Anomie added a comment. Apr 6 2020, 7:14 PM

Since the old title will not work anymore,

Don't file redirects work?

But please do go ahead and rename the files anyway, the more that can be done by humans the better.

I assumed that since the title would be unreachable, the redirect wouldn't work.

{{done}} for the English version of the global user page; the translations are out of date but better than nothing.

Anomie added a comment. Apr 6 2020, 7:52 PM

I assumed since the title would be unreachable it wouldn't work

Hmm. The file redirect would probably work, except the maintenance script won't actually leave a redirect behind at the soon-to-be-unreachable title because it will soon be unreachable. So yeah, do the moves and cleanup of references manually if you can.

The planned process is like this:

  1. We move all the pages to the new uppercased titles. No redirects are left behind, so links to the lowercased titles will no longer work.
  2. We remove the override that's preventing MediaWiki from uppercasing these letters. Now all the links start working again, just like links to "example" automatically target "Example".
  3. We run the script again, which will hopefully find nothing to do (but might if someone did something in between when we did #1 and #2).
  4. Produce a list of any "former Unicode lowercase" titles actually created, for communities to clean up.

We hope that the time spent between #1 and #2 will be only a few minutes. We could just as well do #2 first and then #1, but links would still be broken in between the two steps.
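
The effect of steps #1 and #2 on existing links, in either order, can be sketched with a small simulation. This is an illustrative model, not the maintenance script: the function names and the example title are hypothetical, and the single-letter mapping (U+026A to U+A7AE) is taken from the diff in the task description.

```javascript
// Hypothetical simulation of steps #1 and #2; not the maintenance script's code.
const LOWER = '\u026A';
const UPPER = '\uA7AE';

// While the override is active the letter is left alone (old behaviour);
// once it is removed, the first letter is uppercased (new behaviour).
function normalise(title, overrideActive) {
	if (!overrideActive && title.startsWith(LOWER)) {
		return UPPER + title.slice(LOWER.length);
	}
	return title;
}

// Runs the two steps in the given order and records, after each state,
// whether a link to the lowercase title still resolves to a page.
function simulate(order) {
	const pages = new Set([LOWER + 'example']);
	let overrideActive = true;
	const link = LOWER + 'example';
	const works = () => pages.has(normalise(link, overrideActive));
	const history = [works()];
	for (const step of order) {
		if (step === 'move') {
			// Step #1: move without leaving a redirect behind.
			pages.delete(LOWER + 'example');
			pages.add(UPPER + 'example');
		} else if (step === 'removeOverride') {
			// Step #2: let the software uppercase the letter.
			overrideActive = false;
		}
		history.push(works());
	}
	return history;
}
```

In this model both `simulate(['move', 'removeOverride'])` and `simulate(['removeOverride', 'move'])` give a link that works, then breaks, then works again, which matches the point that the order does not remove the broken window, only its cause.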

DannyS712 added a comment (edited). Apr 6 2020, 7:54 PM

Other than the files used at https://ja.wikipedia.org/wiki/五色台送信所, which is fully protected, none of the Commons files have any other global usage. But what happens to the redirects I left behind (including those where I moved the file to a name other than the one the bot would have used)? Are they deleted?

I have moved all of the files on commonswiki.

Anomie added a comment. Apr 6 2020, 8:44 PM

But, what happens to the redirects I left behind (including when I moved the files to a name other than the bot would have) - are they deleted?

If the uppercased target name doesn't exist, the redirect will be moved to that name.

If the uppercased target name does exist, the script will rename it to the name suffixed with "(former Unicode lowercase)".[1] If https://gerrit.wikimedia.org/r/c/mediawiki/core/+/585900 gets merged in time and the "(former Unicode lowercase)" would wind up pointing to the unsuffixed name or the unsuffixed name is a redirect to the same target as the suffixed name, the script will then delete the suffixed redirect.

[1]: That should be fixed for the file namespace, where extensions matter...
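
The rules above can be sketched as a small decision function. This is a hypothetical illustration, not the code of uppercaseTitlesForUnicodeTransition.php: the data shapes, function names and action labels are invented, and placing the suffix before a file's extension is an assumption about what the fix for the file namespace should look like.

```javascript
// Hypothetical sketch of the decision rules; not the maintenance script's code.
const SUFFIX = ' (former Unicode lowercase)';

// Assumption: for files, the suffix goes before the extension.
function suffixedName(name, isFile) {
	if (isFile) {
		const dot = name.lastIndexOf('.');
		if (dot > 0) {
			return name.slice(0, dot) + SUFFIX + name.slice(dot);
		}
	}
	return name + SUFFIX;
}

// page: { upperName, isFile, redirectTarget } where redirectTarget is a
//       title string for redirects and null for ordinary pages.
// existing: Map from title to { redirectTarget } for pages that already exist.
function planAction(page, existing) {
	if (!existing.has(page.upperName)) {
		// The uppercased title is free, so the page simply moves there.
		return { action: 'move', to: page.upperName };
	}
	const to = suffixedName(page.upperName, page.isFile);
	const other = existing.get(page.upperName);
	// A suffixed redirect is useless if it would point at the unsuffixed title
	// or if both titles redirect to the same target.
	const useless = page.redirectTarget !== null &&
		(page.redirectTarget === page.upperName ||
			page.redirectTarget === other.redirectTarget);
	return { action: useless ? 'move-then-delete' : 'move', to };
}
```

For example, a redirect left behind by a manual move to the uppercased title would come out as `move-then-delete` in this model, because it would end up pointing at the unsuffixed name.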

Change 586440 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] uppercaseTitlesForUnicodeTransition.php: Fix suffixing for files

https://gerrit.wikimedia.org/r/586440

I found the explanation above a bit hard to follow. https://commons.wikimedia.org/wiki/User:DannyS712/sandbox shows all of the Commons renames needed. Some of the target names are redlinked, either because the extensions were tweaked (JPG -> jpg, PNG -> png, ogg -> ogg) or because the initial page was already a redirect (https://commons.wikimedia.org/w/index.php?title=File:%C9%A1ozaisyoyama0%C9%90.jpg&redirect=no). Otherwise, the renames that the script would have done have been carried out, leaving a redirect behind.

Change 585898 merged by jenkins-bot:
[mediawiki/core@master] uppercaseTitlesForUnicodeTransition.php: Output moves with --run too

https://gerrit.wikimedia.org/r/585898

Change 585899 merged by jenkins-bot:
[mediawiki/core@master] uppercaseTitlesForUnicodeTransition.php: Improve handling of non-moves

https://gerrit.wikimedia.org/r/585899

Change 586440 merged by jenkins-bot:
[mediawiki/core@master] uppercaseTitlesForUnicodeTransition.php: Fix suffixing for files

https://gerrit.wikimedia.org/r/586440

Change 585900 merged by jenkins-bot:
[mediawiki/core@master] uppercaseTitlesForUnicodeTransition.php: Delete useless redirects

https://gerrit.wikimedia.org/r/585900

Mentioned in SAL (#wikimedia-operations) [2020-04-16T14:30:46Z] <hknust> holger@mwmaint1002 Starting uppercaseTitlesForUnicodeTransition.php as part of T219279

Mentioned in SAL (#wikimedia-operations) [2020-04-16T14:51:17Z] <hknust> holger@mwmaint1002 END (Fail) uppercaseTitlesForUnicodeTransition.php as part of T219279

DannyS712 added a comment (edited). Apr 16 2020, 2:56 PM

So on Commons I found watchlist entries for a bunch of page moves that never occurred, e.g. one that https://commons.wikimedia.org/w/index.php?title=File%3A%C9%A1obyounohasi.jpg&redirect=no moved, even though the script didn't move it

Edit: perhaps they simply didn't create log entries?

@DannyS712 - I can't parse what you're saying. What do watchlist entries have to do with these page moves? And why are you saying that https://commons.wikimedia.org/w/index.php?title=File%3A%C9%A1obyounohasi.jpg&redirect=no never occurred? I'm probably misunderstanding you, but your first sentence is very hard to understand.

On my watchlist, I saw entries for pages being moved, including the page at https://commons.wikimedia.org/w/index.php?title=File%3A%C9%A1obyounohasi.jpg&redirect=no being moved, by the maintenance script. However, I was unable to find any move log for that file being moved by the script, only the one for my move of it last week. Is this clearer?

Oh yes, now I understand what you mean. That's very interesting. I wonder if unsuccessful move attempts trigger watchlist updates but not log entries.

@DannyS712 Look at https://commons.wikimedia.org/w/index.php?title=File%3A%EA%9E%ACobyounohasi%20(former%20Unicode%20lowercase).jpg&redirect=no. Since you had already moved the file, the script's rename produced a redirect to the existing, renamed file, and the script then deleted that redirect. The ticket will remain open while I go through the logs and track down a number of failed moves.

@holger.knust told me by email that most page moves on enwiki and frwiki failed due to AbuseFilter rate limiting. The simple way to fix that is to temporarily disable enwiki filter 68 and frwiki filter 332 while the script is executing. The filters are only hit once every few days, so there is not much risk of allowing actual vandalism.

Can I suggest, instead, just not having abuse filters apply to system users on the wikis in question?
If this isn't desired in the long run (maybe someone wants to tag massmessages based on content? who knows), then a quick hook handler in wmf's configs should be enough. CommonSettings.php already includes a number of direct hook handlers.

if ( $wgDBname === 'enwiki' || $wgDBname === 'frwiki' ) {
	$wgHooks['AbuseFilterShouldFilterAction'][] = function ( $vars, $title, $user, &$skipReasons ) {
		if ( $user->isSystemUser() ) {
			$skipReasons[] = 'System user';
			return false;
		}
		return true;
	};
}

isSystemUser() is not used for authorization, and I think it would be insecure to use it in that way. The implementation doesn't provide a strong guarantee that a user is not malicious.

Mentioned in SAL (#wikimedia-operations) [2020-04-27T20:28:40Z] <hknust> holger@mwmaint1002 Restarting uppercaseTitlesForUnicodeTransition.php as part of T219279 for 2 wikis

Mentioned in SAL (#wikimedia-operations) [2020-04-27T20:55:56Z] <hknust> holger@mwmaint1002 END (enwiki=success, frwiki=fail) uppercaseTitlesForUnicodeTransition.php as part of T219279

@tstarling enwiki worked. frwiki failed. I don't have permissions to view the 332 filter rule.

Hi, this frwiki filter is a throttle for renames. I've made it temporarily public and added an exception for this account, so it should not block the script again.

[unrelated] @Framawiki the condition & !('Page automatiquement déplacée lors du renommage de l’utilisateur' in summary) is probably no longer needed, since filters are skipped automatically when users are being renamed

Mentioned in SAL (#wikimedia-operations) [2020-04-28T13:30:58Z] <hknust> Restarting uppercaseTitlesForUnicodeTransition.php as part of T219279 for frwiki

Mentioned in SAL (#wikimedia-operations) [2020-05-01T13:06:27Z] <hknust> holger@mwmaint1002 Starting renameInvalidUsernames.php as part of T219279

Dalba removed a subscriber: Dalba. May 1 2020, 1:10 PM

Mentioned in SAL (#wikimedia-operations) [2020-05-01T14:18:13Z] <hknust> holger@mwmaint1002 finished renameInvalidUsernames.php (fail) as part of T219279

Where is this now? Did the maintenance script actually run?

@holger.knust : Could you please answer the last comment(s)? Thanks in advance!

Joe added a subscriber: AMooney. Tue, Nov 10, 8:50 AM

Gentle nudge, this really needs to be completed.

@WDoranWMF @AMooney do you have a timeline for completion of this task?

Mentioned in SAL (#wikimedia-operations) [2020-11-10T18:31:43Z] <hknust> holger mwmaint1002 Start T219279

holger.knust added a comment (edited). Tue, Nov 10, 6:33 PM

Script execution failed with:

holger@mwmaint1002:~/T219279_Unicode$ mwscript extensions/WikimediaMaintenance/renameInvalidUsernames.php loginwiki --list=userlist-2020-04-01.txt --reason="Unicode update. See T219279 for details" 
Reading from userlist-2020-04-01.txt
ɋ       Ɋ
Wikimedia\Rdbms\DBConnectionError from line 1419 of /srv/mediawiki/php-1.36.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: Unknown error (10.64.48.35)
#0 /srv/mediawiki/php-1.36.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(932): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#1 /srv/mediawiki/php-1.36.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(899): Wikimedia\Rdbms\LoadBalancer->getServerConnection(0, '\xC9\x8B', 4)
#2 /srv/mediawiki/php-1.36.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1045): Wikimedia\Rdbms\LoadBalancer->getConnection(-2, Array, '\xC9\x8B', 4)
#3 /srv/mediawiki/php-1.36.0-wmf.16/includes/GlobalFunctions.php(2460): Wikimedia\Rdbms\LoadBalancer->getMaintenanceConnectionRef(-2, Array, '\xC9\x8B')
#4 /srv/mediawiki/php-1.36.0-wmf.16/extensions/WikimediaMaintenance/renameInvalidUsernames.php(75): wfGetDB(-2, Array, '\xC9\x8B')
#5 /srv/mediawiki/php-1.36.0-wmf.16/extensions/WikimediaMaintenance/renameInvalidUsernames.php(50): RenameInvalidUsernames->rename('\xC9\x8A', '\xC9\x8B', NULL)
#6 /srv/mediawiki/php-1.36.0-wmf.16/maintenance/doMaintenance.php(106): RenameInvalidUsernames->execute()
#7 /srv/mediawiki/php-1.36.0-wmf.16/extensions/WikimediaMaintenance/renameInvalidUsernames.php(182): require_once('/srv/mediawiki/...')
#8 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#9 {main}

Mentioned in SAL (#wikimedia-operations) [2020-11-10T19:06:26Z] <hknust> holger mwmaint1002 Stop T219279

pc2010 seems to be lagging behind. This is a non-issue for production, given that it is in codfw, but I am noting it here because it may alert if the write trends continue. This maintenance is the most likely explanation, as it started at the exact time the log indicates (but I am not 100% sure about it).

From looking at the code, it seems like the user list ought to have three fields, the first one being the name of the wiki. That appears to be missing. Someone can correct me on that later if they know better.
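
If that reading of the code is right, a check along the following lines would have rejected the two-field line shown in the log before any database connection was attempted. This is a hypothetical sketch, not part of renameInvalidUsernames.php: the tab separator, the function name, and the order of the second and third fields (old name, then new name) are all assumptions.

```javascript
// Hypothetical validation of one user-list line; not the script's actual parser.
// Assumptions: fields are tab-separated and ordered wiki, old name, new name.
function parseUserListLine(line) {
	const fields = line.split('\t');
	if (fields.length !== 3 || fields.some((field) => field.length === 0)) {
		throw new Error(
			'expected 3 non-empty tab-separated fields, got ' +
			fields.length + ': ' + JSON.stringify(line)
		);
	}
	const [wiki, oldName, newName] = fields;
	return { wiki, oldName, newName };
}
```

A line with only two fields, like the one echoed in the log, would fail this check immediately instead of having its first field interpreted as a wiki name.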