
Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table)
Open, High, Public

Description

Right now, upgrading the libicu version we link to in production is a long process, which goes as follows:

  • Warn all communities that they will see some form of sorting weirdness in categories for some time
  • Change the version of ICU we link to everywhere (this is very tricky without changing Linux distribution versions, since we need to rebuild quite a lot of packages)
  • Run updateCollation.php on all wikis with a collation defined. Last time (see T189295) it took a week to run on enwiki.

There are two problems with the current approach:

  • It's a lot of toil for the SRE teams, who need to do specialized rebuilds of a lot of packages, and to run and monitor scripts on hundreds of wikis
  • More importantly, it's a disservice to users who will see badly-sorted categories for a week

We need a smarter way to do this.

A couple ideas I had:

  1. One simple approach that would reduce the disservice to users would be to add additional columns to the categorylinks table, precompute the new values before we perform the switch, and then just switch which columns we read when we start using the new ICU version.
  2. To improve on the preceding idea, whenever we have to recalculate the collation for categorylinks we could spawn a job that asynchronously fills in the values in the additional columns by running a small executable we can prepare.
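The first idea can be sketched as follows. This is a minimal illustration only, assuming an in-memory SQLite table in place of the real categorylinks table, a hypothetical extra column named `cl_sortkey_new`, a `page_title` column stored inline for simplicity, and two placeholder functions (`sortkey_old`, `sortkey_new`) standing in for Collation::getSortKey() under two ICU versions.

```python
import sqlite3

# Placeholder sort key functions. Real sort keys would come from ICU;
# these only mimic "two versions that order titles differently".
def sortkey_old(title: str) -> bytes:
    return title.encode("utf-8")

def sortkey_new(title: str) -> bytes:
    return title.casefold().encode("utf-8")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorylinks ("
    " cl_from INTEGER PRIMARY KEY, cl_to TEXT NOT NULL,"
    " page_title TEXT NOT NULL,"  # simplification: the real table has no title column
    " cl_sortkey BLOB NOT NULL DEFAULT '')"
)
pages = [(1, "Fruit", "apple"), (2, "Fruit", "Banana"), (3, "Fruit", "cherry")]
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?, ?)",
    [(pid, cat, title, sortkey_old(title)) for pid, cat, title in pages],
)

# Step 1: add the extra column (hypothetical name) while readers still use cl_sortkey.
conn.execute(
    "ALTER TABLE categorylinks ADD COLUMN cl_sortkey_new BLOB NOT NULL DEFAULT ''"
)

# Step 2: precompute the new values ahead of the switch.
for pid, title in conn.execute(
    "SELECT cl_from, page_title FROM categorylinks"
).fetchall():
    conn.execute(
        "UPDATE categorylinks SET cl_sortkey_new = ? WHERE cl_from = ?",
        (sortkey_new(title), pid),
    )

# Step 3: the switch itself is only a change in which column readers sort by.
ALLOWED_COLUMNS = {"cl_sortkey", "cl_sortkey_new"}

def category_members(conn, category, sortkey_column):
    if sortkey_column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown sortkey column: {sortkey_column}")
    query = (
        "SELECT page_title FROM categorylinks "
        f"WHERE cl_to = ? ORDER BY {sortkey_column}"
    )
    return [row[0] for row in conn.execute(query, (category,))]

before = category_members(conn, "Fruit", "cl_sortkey")
after = category_members(conn, "Fruit", "cl_sortkey_new")
print(before, after)
```

The sketch leaves out indexing: for the switch to be cheap in production, the new column would need its own index before readers start sorting by it.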

Event Timeline

One simple approach that would reduce the disservice to users would be to add additional columns to the categorylinks table, precompute the new values before we perform the switch, and then just switch which columns we read when we start using the new ICU version.

That approach sounds good to me. We should prefix the columns with the ICU major version (currently we need to move from 57 to 63). Then, the next time we need to migrate to a new ICU release, we can simply spin up a bullseye image and pre-generate the collation data for the new ICU, and bullseye MediaWiki images can then simply use the 67 collation data.

There are numerous tasks for things like this, including, more or less, T37378: Support multiple collations at the same time. It's not exactly the same, but it could help in similar situations, such as the transition from one collation to another on a wiki (again, the script takes a long time to run, so the wrong collation can be in use for a while), or even allowing user preferences (though that is a different use case, and probably a different implementation too).

One thing to note is the sortkey is used in an index, so both variants would have to be indexed too, if we wanted to swap between them in a dynamic way.

One simple approach that would reduce the disservice to users would be to add additional columns to the categorylinks table, precompute the new values before we perform the switch, and then just switch which columns we read when we start using the new ICU version.

That approach sounds good to me. We should prefix the columns with the ICU major version (currently we need to move from 57 to 63). Then, the next time we need to migrate to a new ICU release, we can simply spin up a bullseye image and pre-generate the collation data for the new ICU, and bullseye MediaWiki images can then simply use the 67 collation data.

If we did that, then we'd be asking for schema changes every time, causing DBA work (as well as some amount of schema drift), and then we'd need a dynamic column prefix or similar in the MW code to know which one to use...

And we'd still have the index problem too.

And obviously the column and index would then need dropping at a later date...

It's not a small table on many wikis! :)

So at least in one case (in the API from a quick search), we'd have to be able to switch the index being used too

-- A binary string obtained by applying a sortkey generation algorithm
-- (Collation::getSortKey()) to page_title, or cl_sortkey_prefix . "\n"
-- . page_title if cl_sortkey_prefix is nonempty.
cl_sortkey varbinary(230) NOT NULL default '',

-- A prefix for the raw sortkey manually specified by the user, either via
-- [[Category:Foo|prefix]] or {{defaultsort:prefix}}.  If nonempty, it's
-- concatenated with a line break followed by the page title before the sortkey
-- conversion algorithm is run.  We store this so that we can update
-- collations without reparsing all pages.
-- Note: If you change the length of this field, you also need to change
-- code in LinksUpdate.php. See T27254.
cl_sortkey_prefix varchar(255) binary NOT NULL default '',

and

-- We always sort within a given category, and within a given type.  FIXME:
-- Formerly this index didn't cover cl_type (since that didn't exist), so old
-- callers won't be using an index: fix this?
CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to,cl_type,cl_sortkey,cl_from);

Joe raised the priority of this task from Medium to High. Sep 28 2020, 6:58 AM

The way it worked in T37378 was to include cl_collation in the primary key. So the data in the table was duplicated, but with cl_sortkey depending on cl_collation. I think it would have worked, it's just that the ball was dropped during code review. The catch with this is that LinksUpdate will drop rows for collations it doesn't know about. The work of the script would be partially undone by edits between the start of the script execution and the PHP version switch. We would either have to put up with that, or have a special-case hack in LinksUpdate::doIncrementalUpdate(), or PHP would have to be linked against multiple versions of ICU so that MediaWiki can insert rows for all collation versions simultaneously.
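The catch described above can be reproduced in a small sketch. It is an illustration under stated assumptions, not MediaWiki's actual LinksUpdate logic: the `links_update` and `backfill` functions, the collation names `icu57` and `icu63`, and the placeholder sort key functions are all hypothetical, and SQLite stands in for the production database.

```python
import sqlite3

# Placeholder sort key functions for two ICU versions (not real ICU output).
def sortkey_57(title: str) -> bytes:
    return title.encode("utf-8")

def sortkey_63(title: str) -> bytes:
    return title.casefold().encode("utf-8")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorylinks ("
    " cl_from INTEGER NOT NULL, cl_to TEXT NOT NULL,"
    " cl_collation TEXT NOT NULL, cl_sortkey BLOB NOT NULL,"
    " PRIMARY KEY (cl_from, cl_to, cl_collation))"
)

def links_update(conn, page_id, title, categories, known_collations):
    """Simplified stand-in for an edit: rewrite the page's rows for the
    collations this code knows about, dropping all other rows for the page."""
    conn.execute("DELETE FROM categorylinks WHERE cl_from = ?", (page_id,))
    for name, sortkey_fn in known_collations.items():
        for category in categories:
            conn.execute(
                "INSERT INTO categorylinks VALUES (?, ?, ?, ?)",
                (page_id, category, name, sortkey_fn(title)),
            )

def backfill(conn, titles, name, sortkey_fn):
    """Simplified stand-in for the migration script: add rows for one collation."""
    pairs = conn.execute(
        "SELECT DISTINCT cl_from, cl_to FROM categorylinks"
    ).fetchall()
    for page_id, category in pairs:
        conn.execute(
            "INSERT OR REPLACE INTO categorylinks VALUES (?, ?, ?, ?)",
            (page_id, category, name, sortkey_fn(titles[page_id])),
        )

titles = {1: "apple", 2: "Banana"}
old_code = {"icu57": sortkey_57}  # the running code only knows the old collation

for page_id, title in titles.items():
    links_update(conn, page_id, title, ["Fruit"], old_code)

backfill(conn, titles, "icu63", sortkey_63)            # script pre-generates new rows
links_update(conn, 1, titles[1], ["Fruit"], old_code)  # page 1 is edited before the switch

# Pages that no longer have a row for the new collation.
missing = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT cl_from FROM categorylinks WHERE cl_from NOT IN "
        "(SELECT cl_from FROM categorylinks WHERE cl_collation = 'icu63') "
        "ORDER BY cl_from"
    )
]
print(missing)
```

In this sketch the edit to page 1 removes the row the script had just written for the new collation, which is the partial undoing of the script's work mentioned above.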

A simpler idea, closer to what @Joe is suggesting, is to duplicate the whole categorylinks table. You can't add or drop an index in O(1) time but you can rename a table.

We could use Shellbox RPC (plus a cache) to provide the sort key from a different version of PHP/ICU. That would make the T37378 approach more feasible. MediaWiki would write both collations on edit/refreshlinks, and there would also be a script running that writes the same thing.

A simpler idea, closer to what @Joe is suggesting, is to duplicate the whole categorylinks table. You can't add or drop an index in O(1) time but you can rename a table.

[...]

We could use Shellbox RPC (plus a cache) to provide the sort key from a different version of PHP/ICU.

Based on the conversation we had yesterday in the Platform Engineering team, these two ideas seem to provide the easiest and safest option. MediaWiki will only have to know about writing to two tables, and about using Shellbox to get the alternative sort key.
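A rough sketch of that dual-write shape follows. Everything specific in it is an assumption made for illustration: `remote_sortkey` is a hypothetical stand-in for a call to a service linked against the other ICU version (it does not use or imitate the real Shellbox API), `categorylinks_new` is a hypothetical name for the duplicate table, the cache is an in-process `functools.lru_cache`, and SQLite stands in for the production database.

```python
import sqlite3
from functools import lru_cache

remote_calls = {"count": 0}

def local_sortkey(text: str) -> bytes:
    # Placeholder for the sort key produced by the locally linked ICU.
    return text.encode("utf-8")

@lru_cache(maxsize=10_000)
def remote_sortkey(text: str) -> bytes:
    # Hypothetical stand-in for an RPC to a service using the other ICU
    # version. The cache avoids repeating the call for the same input.
    remote_calls["count"] += 1
    return text.casefold().encode("utf-8")

conn = sqlite3.connect(":memory:")
for table in ("categorylinks", "categorylinks_new"):
    conn.execute(
        f"CREATE TABLE {table} ("
        " cl_from INTEGER NOT NULL, cl_to TEXT NOT NULL,"
        " cl_sortkey BLOB NOT NULL, PRIMARY KEY (cl_from, cl_to))"
    )

def dual_write(conn, page_id, title, categories):
    """Rewrite a page's category rows in both tables in one transaction."""
    targets = (
        ("categorylinks", local_sortkey),
        ("categorylinks_new", remote_sortkey),
    )
    with conn:  # commits on success, rolls back on error
        for table, sortkey_fn in targets:
            conn.execute(f"DELETE FROM {table} WHERE cl_from = ?", (page_id,))
            conn.executemany(
                f"INSERT INTO {table} VALUES (?, ?, ?)",
                [(page_id, cat, sortkey_fn(title)) for cat in categories],
            )

dual_write(conn, 1, "Banana", ["Fruit", "Yellow things"])
dual_write(conn, 1, "Banana", ["Fruit"])  # a later edit removes one category

new_rows = conn.execute(
    "SELECT cl_to, cl_sortkey FROM categorylinks_new"
).fetchall()
print(new_rows, remote_calls["count"])
```

Both tables stay in step across the two edits, and the placeholder remote call runs only once for the repeated title because of the cache.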

Perhaps we can even write a little extension to take care of the extra work in LinksUpdate, so we don't drag this special case into core: the LinksUpdateAfterInsert core hook seems nearly sufficient; we would just have to add a LinksUpdateAfterDelete hook.

One concern that came up was whether the tables can really be swapped out "hot" while under load using a simple rename. Perhaps it would be best to go read-only for a minute, disable the dual-write logic, pause the job queue workers, and then do the swap.
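The cautious sequence suggested here could look roughly like the sketch below. It is an assumption-laden illustration: the `state` flags are hypothetical placeholders for whatever mechanisms actually control read-only mode, dual writes and the job queue, the table names `categorylinks_new` and `categorylinks_old` are invented, and SQLite needs two ALTER TABLE statements inside a transaction where MySQL and MariaDB accept several renames in a single RENAME TABLE statement.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction around the swap is controlled explicitly with BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
for table, key in (("categorylinks", b"Banana"), ("categorylinks_new", b"banana")):
    conn.execute(
        f"CREATE TABLE {table} (cl_from INTEGER, cl_to TEXT, cl_sortkey BLOB)"
    )
    conn.execute(f"INSERT INTO {table} VALUES (1, 'Fruit', ?)", (key,))

# Hypothetical operational switches, for illustration only.
state = {"read_only": False, "dual_write": True, "jobs_paused": False}

def swap_tables(conn, state):
    # 1. Stop everything that could write during the swap.
    state["read_only"] = True
    state["dual_write"] = False
    state["jobs_paused"] = True
    try:
        # 2. Swap both names inside one transaction, so readers see either
        #    the old arrangement or the new one, never a missing table.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("ALTER TABLE categorylinks RENAME TO categorylinks_old")
        conn.execute("ALTER TABLE categorylinks_new RENAME TO categorylinks")
        conn.execute("COMMIT")
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        # A failed swap leaves the old table live, so keep both tables in sync.
        state["dual_write"] = True
        raise
    finally:
        # 3. Resume normal operation. After a successful swap, dual writes
        #    stay disabled because the old table is no longer the live one.
        state["read_only"] = False
        state["jobs_paused"] = False

swap_tables(conn, state)
live = conn.execute("SELECT cl_sortkey FROM categorylinks").fetchall()
old = conn.execute("SELECT cl_sortkey FROM categorylinks_old").fetchall()
print(live, old, state)
```

Keeping the old table around under a different name leaves a way back if the new sort keys turn out to be wrong; it can be dropped later.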

daniel renamed this task from Allow easier ICU transitions in MediaWiki to Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table). Nov 30 2020, 12:28 PM

This might sound like self-promotion, but I think that before getting this done (in any way: duplicating the table, dynamic schema changes on the fly, etc.), categorylinks should be normalized first (T222224: RFC: Normalize MediaWiki link tables). This table is 200GB on Commons (and pretty big on other large wikis), and duplicating it (and updating both copies in real time) is going to be pretty costly if it is not normalized. This could mean normalizing just the title+ns, or other columns of categorylinks too. The current update is done now (am I wrong?), and I hope the next one is a fair way off.