As part of the HHVM/trusty upgrade, we had to rebuild precise's libicu (4.8) for trusty and build HHVM against it, for the whole duration where we were mixing appservers/jobrunners from both the old and the new stack. Once that is completed, we should rebuild HHVM against trusty's libicu (5.2), simultaneously upgrade the whole fleet, then (I believe) execute the updateCollation.php maintenance script.
Description
Details
Project | Branch | Lines +/- | Subject | |
---|---|---|---|---|
mediawiki/extensions/Scribunto | master | +206 -2 K | Update ustring data tables | |
utfnormal | master | +1 K -120 | Update to Unicode 6.3.0 |
Event Timeline
Yup, updateCollation.php does need running when we've upgraded :)
I think we need to run it everywhere... Can you confirm @tstarling ?
Also, we will probably need to get @Springle involved for the larger wikis too
You may want to give wikis some warning before doing this, there may be a short window where categories will be messed up.
updateCollation.php --force (the --force is important) only needs to be run on wikis that have $wgCategoryCollation set to something with 'uca' in its name (about 75 wikis). Wikis using the default collation don't need it.
ICU 54 (5.4) has already been released, would be nice if we could go straight to that and skip some updateCollation.php the next time we upgrade. There were no existing Debian/Ubuntu packages for it yet last I checked, though :(
Since I have no time to work on the imagescalers/videoscalers and I don't think I'll be able to work on this for the foreseeable future, I'll release the ticket in the hope that someone else will have time to tackle this.
Since we're using precise's version of icu, working on this is part of ops's quarterly goal to get rid of precise ;-)
I don't really understand how it's linked to icu 48, though:
The version of hhvm that we currently have on the appservers (3.6.5+dfsg1-1+wm7) is linked against icu 48:
jmm@mw1045:~$ ldd /usr/bin/hhvm | grep icu
libicui18n.so.48 => /usr/lib/x86_64-linux-gnu/libicui18n.so.48 (0x00007fa429fdb000)
libicuuc.so.48 => /usr/lib/x86_64-linux-gnu/libicuuc.so.48 (0x00007fa429c6f000)
libicudata.so.48 => /usr/lib/x86_64-linux-gnu/libicudata.so.48 (0x00007fa4218d4000)
But the source package only build-depends on libicu-dev, which when building on trusty would be satisfied by the 52.1 package from trusty. What am I missing here?
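A minimal sketch of how the check above could be automated: it extracts the ICU soname versions from `ldd` output so that a host still linked against libicu48 stands out. The sample text is the output quoted in this comment; the function name `icu_versions` is made up for illustration.

```python
import re

# Sample `ldd /usr/bin/hhvm | grep icu` output, copied from the comment above.
SAMPLE_LDD = """\
libicui18n.so.48 => /usr/lib/x86_64-linux-gnu/libicui18n.so.48 (0x00007fa429fdb000)
libicuuc.so.48 => /usr/lib/x86_64-linux-gnu/libicuuc.so.48 (0x00007fa429c6f000)
libicudata.so.48 => /usr/lib/x86_64-linux-gnu/libicudata.so.48 (0x00007fa4218d4000)
"""


def icu_versions(ldd_output):
    """Return the set of ICU soname major versions mentioned in ldd output."""
    # Matches e.g. "libicui18n.so.48" and captures "48".
    return set(re.findall(r"\blibicu[a-z0-9]+\.so\.(\d+)", ldd_output))


print(sorted(icu_versions(SAMPLE_LDD)))
```

A binary rebuilt against trusty's ICU would be expected to report 52 here instead; more than one version in the set would suggest mixed linkage.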
After every ICU upgrade, we need to run a long-running (takes a few days on largest wikis, IIRC) maintenance script on a couple dozen wikis (see task description), which is why we avoided upgrading ICU in the past and why we'd probably want to do it in the biggest steps possible (sorry if that's obvious and not at all the question you were asking).
Since we are at the point where there are no precise machines left running php, we should really build HHVM with libicu52 and make the conversion now.
The process will be:
- Removing libicu48 from the trusty repository
- Rebuilding HHVM (and possibly all extensions) linked to the newer libicu
- Installing it fleet-wide (during this phase, what problems can we expect?)
- Run the updateCollation script (specifics about this would be welcome)
I have no idea what the user impact could be and if it would be advisable to send out a notification of sorts, but this is really long overdue.
After ICU is upgraded, but before the updateCollation script finishes, articles newly added to categories may appear out-of-order on category listing pages. The headings on them might be wrong in funny ways, too. Nothing else should be affected.
Tagging with User-notice to have this included in the Tech News when it's about to happen. Note that this will only affect wikis using UCA category collations (notably, not English Wikipedia; biggest affected wikis are French and Polish Wikipedias, and also Russian Wikipedia but their categories are all messed up already due to T88088). Full list: https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/InitialiseSettings.php;dc2fbfc7adb455c3031601d61fd33835f8ede9cd$13601 (warning, large page).
- Run the updateCollation script (specifics about this would be welcome)
According to @jcrespo, running this script on large wikis (and we have some wikis in the queue to switch to UCA collations :/ ) would currently put unacceptable load on the databases (I might be paraphrasing his words wrong, but the gist is that we can't do it). This is blocked on T130692 right now, which in turn is blocked on T128353.
As an aside, I have a patch which would change the format of cl_collation (https://gerrit.wikimedia.org/r/#/c/272419/ ), which would require running updateCollation. It would be good if we could synchronize getting that patch merged (assuming it does get merged) with the updateCollation.php run needed for this bug.
I am now building a package linked to libicu52 for trusty, as the preparation work seems to be done.
I completed the first two steps. On Monday I'll install the newer package in the inactive datacenter to get a first idea of how things go; at the same time, I would like to upgrade beta and run updateCollation there. It is reasonable to guess that we'll be able to do the rolling upgrade in 2-3 days and then run the script.
@Bawolff as soon as I have a firm date I'll let you know, so that your patch can be merged when we run the updateCollation script.
Mentioned in SAL [2016-05-23T06:55:12Z] <_joe_> uploaded a new hhvm package for trusty linked to libicu52, T86096
Mentioned in SAL [2016-05-23T07:00:02Z] <_joe_> installed the new hhvm package on mw2017, T86096
Mentioned in SAL [2016-05-23T08:56:50Z] <_joe_> deployment-prep: starting upgrade of HHVM to a version linked to libicu52, T86096
Mentioned in SAL [2016-05-23T09:12:49Z] <_joe_> deployment-prep: all hhvm hosts in beta upgraded to run on the newer libicu; now running updateCollation.php (T86096)
So, today I upgraded beta and the test hosts in codfw (mw2017 and mw2099);
I would like to perform the rolling upgrade of production on Thursday May 26th at 7:00 UTC and start the updateCollation.php script as soon as I am done.
This of course depends on T58041 being resolved before then.
If I'm reading the chart at http://site.icu-project.org/download correctly, it seems we're going from Unicode 6.0 in libicu48 to Unicode 6.3 in libicu52. That means we probably need to update utfnormal to match. @Legoktm, @brion, is that right?
I should also update Scribunto's versions of the Unicode data tables, even though it takes a bit of hoop jumping to use them (instead of callbacks into PHP) on Wikimedia wikis.
Related, I have switched the CI instances to HHVM just after @Joe switched deployment-prep.
@Anomie is this a blocker? I thought Unicode standards didn't break anything between minor versions, and both this page:
https://en.wikibooks.org/wiki/Unicode/Versions
and this one:
http://unicode.org/versions/Unicode6.3.0/ (although most of it is beyond my knowledge on the matter)
seem to suggest that no breaking change should have happened as far as the standard is concerned.
Also, looking at the ICU changelog I don't see any obvious breakage that should happen as far as collation is concerned, but I am going to test that anyways.
Change 290446 had a related patch set uploaded (by Anomie):
Update ustring data tables
To tell the truth, I have no idea if it should be a blocker or not. But I went ahead and figured out how to update utfnormal, so the patches are there now.
I'd guess it's probably not a blocker since we probably use the intl extension rather than those data tables on the cluster, and the same for the Scribunto bits. So it's probably more of a cleanliness thing.
Yes, the utfnormal code generally uses ICU (through the intl extension) when available. However, I notice that UtfNormal\Validator::cleanUp() makes a bad assumption: that ICU normalizes according to the same or older version of Unicode as the pure PHP implementation does. This might result in inconsistent normalization when some inputs contain invalid UTF-8 sequences yet do not contain any character having, according to Unicode 6.0, a non-zero canonical combining class or an NFC_QC value of No or Maybe.
Keeping the Unicode data up to date (to at least the newest released version used by any version of ICU that may be in use) would avoid this, and so would always normalizing using ICU, regardless of whether the PHP implementation thinks the string is in NFC (at least when the string contains an "unassigned" code point). Since this branch is only taken when the intl extension is loaded and MediaWiki now requires PHP 5.5, we probably could use $string = UConverter::transcode( $string, 'UTF-8', 'UTF-8' ); instead of the slow quickIsNFCVerify() code (or just ignore the return value if compatibility with older PHP versions needs to be retained).
More specifically, those inputs, in order for normalization to differ, would have to contain at least one newly added character that meets either of the following conditions (see code of quickIsNFCVerify()):
- The character has a non-zero canonical combining class. (47 added between 6.0 and 6.3)
- The character has an NFC_QC value of No or Maybe. (3 added between 6.0 and 6.3)
This combined set of 50 characters is, in PCRE regex format: [\x{8e4}-\x{8fe}\x{1bab}\x{1cf4}\x{a674}-\x{a67b}\x{a69f}\x{aaf6}\x{fa2e}\x{fa2f}\x{11100}-\x{11102}\x{11127}\x{11133}\x{11134}\x{111c0}\x{116b6}\x{116b7}]
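As a sanity check on the character class above, here is a small sketch that transcribes the same ranges as code points, counts them, and tests whether a string contains any of them. The names `RANGES`, `AFFECTED` and `contains_affected` are made up for illustration. Note that this only covers the "newly added character" half of the condition; the inconsistency described above additionally requires the input to contain invalid UTF-8 sequences.

```python
# Inclusive code point ranges transcribed from the PCRE character class above.
RANGES = [
    (0x8E4, 0x8FE), (0x1BAB, 0x1BAB), (0x1CF4, 0x1CF4),
    (0xA674, 0xA67B), (0xA69F, 0xA69F), (0xAAF6, 0xAAF6),
    (0xFA2E, 0xFA2F), (0x11100, 0x11102), (0x11127, 0x11127),
    (0x11133, 0x11134), (0x111C0, 0x111C0), (0x116B6, 0x116B7),
]

# Expand the ranges into a flat set of code points.
AFFECTED = {cp for lo, hi in RANGES for cp in range(lo, hi + 1)}


def contains_affected(text):
    """True if the text has at least one character from the set above."""
    return any(ord(ch) in AFFECTED for ch in text)


print(len(AFFECTED))                   # matches the count quoted above: 50
print(contains_affected("a\u08e4b"))   # True, U+08E4 is in the first range
print(contains_affected("Category"))   # False
```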
@PleaseStand thanks for the detailed analysis.
So, although this inconsistency doesn't frankly seem like a blocker for upgrading libicu at the moment, it would be great if @Anomie's patches get merged before we perform the upgrade.
Or did I miss something?
We just need to run it on the wikis which use an ICU collation, which is all 95 wikis which have a non-default $wgCategoryCollation in InitialiseSettings.php. Thankfully that does not include enwiki or a few other large wikis.
I would suggest splitting the list up by section (i.e. master) and running one thread per section.
Run it with --force and no other options.
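The suggestion above (one thread per section, wikis within a section handled one after another, `--force` and nothing else) could be sketched roughly as follows. This is only an illustration under assumptions: the `php maintenance/updateCollation.php --wiki ...` invocation stands in for whatever wrapper is actually used on the maintenance host, and the section and wiki names in the example are placeholders. The `run` callable is injected so the plan can be dry-run before anything touches a database.

```python
from concurrent.futures import ThreadPoolExecutor


def build_commands(wikis):
    """One updateCollation.php invocation per wiki, with --force only."""
    # The exact command line is an assumption; adapt it to the local wrapper.
    return [
        ["php", "maintenance/updateCollation.php", "--wiki", wiki, "--force"]
        for wiki in wikis
    ]


def run_section(section, wikis, run):
    """Process the wikis of one section serially, in the given order."""
    return [(section, run(cmd)) for cmd in build_commands(wikis)]


def run_all(sections, run):
    """Run every section in its own thread; sections proceed in parallel."""
    workers = max(1, len(sections))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_section, section, wikis, run)
            for section, wikis in sections.items()
        ]
        return [item for future in futures for item in future.result()]


# Dry run with placeholder names: print the commands instead of executing them.
plan = run_all(
    {"sA": ["examplewiki1", "examplewiki2"], "sB": ["examplewiki3"]},
    run=lambda cmd: " ".join(cmd),
)
for section, command in plan:
    print(section, command)
```

For a real run, `run` could be replaced by something like `lambda cmd: subprocess.run(cmd, check=True).returncode`, so that a failure in one section stops only that section's queue.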
Thanks @tstarling, this is effectively my plan. I extracted the lists with the following (shameful) bash script, which I paste for future convenience:
#!/bin/bash
set -e
set -u
workdir=$(mktemp -d)
# Download InitializeSettings
echo "Downloading files in the working directory ${workdir}"
filename="${workdir}/settings.php"
wget https://noc.wikimedia.org/conf/InitialiseSettings.php.txt -O "${filename}" -o /dev/null
#download the dblists
for shardfile in `seq -f s%g.dblist 1 7`; do
    wget "https://noc.wikimedia.org/conf/${shardfile}" -O "${workdir}/${shardfile}" -o /dev/null
done
start_line=$(grep -hn wgCategoryCollation ${filename} | sed 's/:.*//')
end_line=$(( $start_line+200 ))
wikis=$(awk "(NR>=${start_line}) && (\$3 ~ /uca/){print \$1} NR==${end_line}{exit}" "${filename}" | sed s/\'//g)
echo "Writing output to ${PWD}/icu"
mkdir -p ${PWD}/icu
for wiki in $wikis; do
    shardfile=$(cd $workdir && grep -l ^$wiki\$ s?.dblist );
    echo $wiki >> icu/${shardfile};
done
rm -rf $workdir
and the result is that, as of this morning, we need to run updateCollation.php as follows:
- 7 wikis in s2
- 81 wikis in s3
- 2 wikis in s6
- 4 wikis in s7
so the list is pretty unbalanced towards s3, where we might be constrained to run serially anyways.
I will discuss options with the DBAs to see if we can somewhat parallelize the work on s3, and to assess the size of the work.
As usual, one shouldn't base one's estimates on the number of wikis alone. I extracted the total number of rows in the categorylinks tables for the various shards.
Here are the results:
s2 - 46.0 M rows
s3 - 31.6 M rows
s6 - 46.7 M rows
s7 - 18.8 M rows
so it's quite well balanced already if we run the script serially within one shard.
Given a conservative expected speed of 1.4 M rows/hour (the observed speed in my test was around 1.8 M rows/hour, see T58041) we would be done with the transition in ~ 33 hours.
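The arithmetic behind that estimate can be reproduced with a few lines. This is just a sketch using the row counts and speeds quoted above, assuming the shards run in parallel and each shard is processed serially, so the largest shard determines the wall-clock time.

```python
# categorylinks rows per shard, in millions (figures from the list above).
ROWS_MILLIONS = {"s2": 46.0, "s3": 31.6, "s6": 46.7, "s7": 18.8}


def hours_per_shard(rows, rate):
    """Hours each shard needs at `rate` million rows per hour."""
    return {shard: count / rate for shard, count in rows.items()}


def wall_clock_hours(rows, rate):
    """Shards run in parallel, so the slowest one sets the total duration."""
    return max(hours_per_shard(rows, rate).values())


conservative = wall_clock_hours(ROWS_MILLIONS, 1.4)  # conservative speed
observed = wall_clock_hours(ROWS_MILLIONS, 1.8)      # speed seen in the test
print(f"at 1.4 M rows/hour: {conservative:.1f} h")
print(f"at 1.8 M rows/hour: {observed:.1f} h")
```

At the conservative speed the bound comes from s6 (46.7 M rows), which is where the figure of roughly 33 hours comes from.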
@matmarex I have based my evaluation on the latest version of InitialiseSettings.php, so I think my figures are pretty accurate.
Mentioned in SAL [2016-05-26T05:50:14Z] <_joe_> starting upgrades of hhvm to newer libicu in codfw (T86096)
Mentioned in SAL [2016-05-26T06:41:32Z] <_joe_> upgrading hhvm on the eqiad canaries, T86096
Mentioned in SAL [2016-05-26T07:29:00Z] <_joe_> upgrading hhvm on the eqiad imagescalers, T86096
Mentioned in SAL [2016-05-26T07:36:48Z] <_joe_> upgrading hhvm on eqiad jobrunners, tin + terbium (T86096)
Mentioned in SAL [2016-05-26T07:50:09Z] <_joe_> upgrading hhvm on eqiad's api cluster, (T86096)
Mentioned in SAL [2016-05-26T08:15:33Z] <_joe_> upgrading hhvm on eqiad's appserver cluster, (T86096)
Mentioned in SAL [2016-05-26T09:28:11Z] <_joe_> all traffic serving appservers are now running with libicu52 (T86096)
Upgrade is done and scripts are running. Sadly, while some are exceeding my conservative evaluation of performance, frwiki is running around 1.4 M rows/hour, so I would expect the scripts to be done by tomorrow afternoon at the earliest.
s7 was completed at 21:50, first as expected, since it is the smallest shard.
It went at a decent speed of ~ 1.6 M rows/hour, so we're well in line with our projections.
Status update - only the following wikis are still being converted/waiting for conversion:
- frwiki
- svwiki
- thwiki
- ruwiki
Any anomaly on another wiki would be unexpected at this point.