Page MenuHomePhabricator

IcuCollation sort keys depend on PHP/HHVM version
Closed, ResolvedPublic

Description

The sort key for most category collations other than "identity" or "uppercase" currently depends on which PHP or HHVM version is in use, not just on which ICU version is in use. Here is an explanation.

MediaWiki relies on the Collator::getSortKey() function provided by PHP's intl extension, a binding to the ICU library. The function was added by PHP developer @Smalyshev and is available since 5.3.2 (php/php-src@65c1421). Until 5.3.15 and 5.4.5, PHP Bug #62070 caused a "String is not zero-terminated" warning in debug builds. (Until rMW8fc4dea67935fb0f, MediaWiki worked around this bug by appending the empty string while suppressing warnings.)

However, the bug was fixed not by increasing the buffer size by one byte and manually terminating the string, but rather by returning a string length one byte shorter than before, relying on the fact that to allow comparison using strcmp(), all sort keys end with a null terminator byte. This is a backwards incompatible change that affects comparisons to sort keys stored in the database, when multiple pages have the same sort key.

Whether PHP should include the null byte or not is unclear. Though PHP generally does not count a null terminator as part of a string's length or make it user-accessible, ICU documentation includes a hex "00" at the end of each example sort key. And while PyICU (for Python) does include the null byte in the binary string it returns, Unicode::ICU::Collator (for Perl) does not. So I decided it would be better not to break backwards compatibility a second time, and instead file PHP Doc Bug #72271 and HHVM Issue #7106. An HHVM patch to make behavior consistent with recent PHP versions was proposed, then merged on May 31 (facebook/hhvm@b049baa).

MediaWiki should choose one or the other though. It could follow what recent versions of PHP do, by stripping off the null byte if there is one, so we don't have to keep the workaround forever. This would also minimize the impact on those who have already upgraded to PHP 5.5 or newer. However, Wikimedia runs HHVM, so would have to fix their sort keys. A temporary migration flag could be added for Wikimedia's use, and turned off only once Wikimedia is ready to do the migration.

Event Timeline

Change 293910 had a related patch set uploaded (by PleaseStand):
IcuCollation: Remove null terminator from sort key if present

https://gerrit.wikimedia.org/r/293910

ICU docs say:

ICU sort key is a pre-processed sequence of bytes generated from a Unicode string. The weights for each comparison level are concatenated, separated by a "0x01" byte. The entire sequence is terminated with a 0x00 byte for convenience in C APIs.

So the null byte is there not as part of the key, but just as a convenience measure. Thus, I think it's OK to not show it to the end user. From the same, it probably follows 0x0 can not be inside the proper sort key.

Stripping it may be ok for newer , but may present an issue to those running older PHP/HHVM versions. Depends on how exactly the keys are used - what the lifecycle is, I guess.

Stripping it may be ok for newer , but may present an issue to those running older PHP/HHVM versions. Depends on how exactly the keys are used - what the lifecycle is, I guess.

The sort keys are stored in the categorylinks database table (as cl_sortkey) and can be regenerated by running php maintenance/updateCollation.php --force. The --force option causes sort keys to be regenerated even if the cl_collation value matches the current collation in use.

Following an ICU upgrade (T86096), the script was recently run on all Wikimedia sites using a UCA collation and finished in not much longer than a day. My patch includes a temporary migration flag to allow the database update to be scheduled separately from MediaWiki updates, after it is reasonably certain that rolling back to a previous branch is not necessary.

Though category pages did not work correctly in the meantime, this update should not be as disruptive. Specifically, CategoryViewer does not include sort keys in pagination links but rather the strings they are based on (sort key prefix, then %0A, then page title). It uses >= and < SQL operators, which should work reasonably OK when older sort keys end with null bytes and new ones do not, given the lack of proper handling of duplicate sort keys (provided that cl_sortkey is in fact VARBINARY).

However, the "categorymembers" API module, which uses hex-encoded sort keys for continuation when sort=sortkey, would be affected. It might be possible to add some hacks there to reduce the disruption if necessary.

(FlaggedRevs's Special:UnreviewedPages looks like it also would be affected when filtering by category, except that it's already broken because sort keys, as stored in the database, are not well-formed UTF-8 strings that are in NFC and contain only characters allowed by MediaWiki. And as for Category::getMembers(), no caller I know of provides an offset.)

(FlaggedRevs's Special:UnreviewedPages [...] it's already broken because sort keys, as stored in the database, are not well-formed UTF-8 strings that are in NFC and contain only characters allowed by MediaWiki. [...])

Filed that as T138042: Special:UnreviewedPages, when filtering by category, is broken on wikis using UCA category collations.

Could you not just do the update with some replication-safe equivalent of

UPDATE categorylinks SET cl_sortkey=LEFT(cl_sortkey, LENGTH(cl_sortkey)-1) WHERE cl_sortkey LIKE '%\0'

Also, is it even necessary have the backwards compatibility flag or to do a database update? Like PleaseStand says, it doesn't affect category ordering (except perhaps when the keys are equal). API clients will see a different sort key, but those clients presumably use an strcmp-like algorithm and so will be similarly unaffected in practice.

Could you not just do the update with some replication-safe equivalent of

UPDATE categorylinks SET cl_sortkey=LEFT(cl_sortkey, LENGTH(cl_sortkey)-1) WHERE cl_sortkey LIKE '%\0'

Yes, that would be an option, though would probably require a new maintenance script.

Also, is it even necessary have the backwards compatibility flag or to do a database update? Like PleaseStand says, it doesn't affect category ordering (except perhaps when the keys are equal). API clients will see a different sort key, but those clients presumably use an strcmp-like algorithm and so will be similarly unaffected in practice.

My main reason for a $wgAppendNullToUcaSortKeys flag is that if WMF has to roll back to an earlier MediaWiki version (one that, under HHVM 3.12, does include the null byte), CategoryViewer may skip over pages for which the sort keys do not end with a null byte, because it regenerates a sort key from the (cl_sortkey_prefix, page_title) pair provided in the HTTP request. However, the patch could be applied to the previous release branch if necessary, so the flag is probably not strictly necessary.

Also, duplicate sort keys do happen sometimes, so they probably should not be ignored entirely. For example, on frwiki, there are 1736 cases in which the same (cl_to, cl_type, cl_sortkey) occurs more than once.

Though interestingly enough, on that wiki, most of the sort keys shorter than the 230 byte maximum already do not end with a null byte. As part of the recent ICU upgrade, was updateCollation.php --force run on frwiki? If so, was it run using HHVM or using PHP 5.5? If category pagination is in fact already broken on Wikimedia sites using UCA collations (and it is on frwiki, compare API pagination to web pagination), there may be no good reason to include the flag at all, and more reason to create a script that will quickly fix the issue.

[...] more reason to create a script that will quickly fix the issue.

Or rather, trimming the null byte off in IcuCollation should address the major problem with CategoryViewer's pagination on frwiki. No maintenance script would need to be run for that particular issue. The script would be to ensure that the duplicate sort key case is handled consistently throughout a site.

[...] The script would be to ensure that the duplicate sort key case is handled consistently throughout a site.

And reasonably correctly, if CategoryViewer is fixed as described in T138031.

Or rather, trimming the null byte off in IcuCollation should address the major problem with CategoryViewer's pagination on frwiki.

As I stated, this is assuming that updateCollation.php --force was recently run on frwiki, though using PHP 5.5 or 5.6 instead of HHVM. If the script was recently run using HHVM, there may be a different cause.

Or rather, trimming the null byte off in IcuCollation should address the major problem with CategoryViewer's pagination on frwiki.

As I stated, this is assuming that updateCollation.php --force was recently run on frwiki, though using PHP 5.5 or 5.6 instead of HHVM. If the script was recently run using HHVM, there may be a different cause.

Yes, all scripts running with mwscript are being run with php 5.5as there is a bug with mwscript and HHVM that has not been fixed AFAIK, so all maintenance scripts are run via zend at the moment.

@PleaseStand I think we can simply backport the patch to our current version of HHVM, by the way, in order to make the behaviour consistent. We won't need to remove the null byte then, AIUI.

@PleaseStand I think we can simply backport the patch to our current version of HHVM, by the way, in order to make the behaviour consistent. We won't need to remove the null byte then, AIUI.

Yes, if Wikimedia servers get this patch, there would be no null byte to trim off. Then any workaround in MediaWiki would only be for non-Wikimedia sites, which may use a version of HHVM that does not have the fix. I don't know whether any use UCA category collations, which are not enabled by default.

The issue of whether/how to clean up the affected sort keys in the SQL databases still remains. I suggested running updateCollation.php --force again, because that script already exists. @tstarling suggested a different query (presumably using a new maintenance script), or even just leaving the affected sort keys unchanged (which for current/existing MediaWiki code is not expected to pose a major problem).

Joe triaged this task as Low priority.Jul 14 2016, 7:47 AM

I've built new HHVM packages based on 3.12.7 and also merged the upstream commit b049baada6722c839337012e80661738888e0806.

With 3.12.1+dfsg-1~wmf2 :

jmm@mw2239:~$ hhvm -f coll.php
int(1)

With 3.12.7+dfsg-1+wmf1 :

jmm@mw2239:~$ hhvm -f coll.php
int(0)

This is not yet rolled out (still needs more tests and a build for our trusty servers)

Mentioned in SAL [2016-08-16T17:46:18Z] <moritzm> uploaded hhvm 3.12.7+dfsg+wmf1 for jessie-wikimedia to carbon (also includes a fix for T137642)

Mentioned in SAL [2016-08-17T09:31:30Z] <moritzm> uploaded hhvm 3.12.7+dfsg+wmf1~trusty1 for trusty-wikimedia to carbon (also includes a fix for T137642)

That change is now live on mw1017 if anyone wants to test.

Change 293910 abandoned by PleaseStand:
IcuCollation: Remove null terminator from sort key if present

Reason:
HHVM 3.15.0 contains the fix, and currently, we only support 3.18.5 or later, so abandoning this change.

https://gerrit.wikimedia.org/r/293910