
Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table)
Closed, ResolvedPublic

Description

Right now, whenever we want to upgrade the libicu version we link against in production, it involves a long process that goes as follows:

  • Warn all communities that they will see some form of sorting weirdness in categories for some time
  • Change the version of ICU we link against everywhere (this is very tricky without changing Linux distribution versions, as we need to rebuild quite a lot of packages)
  • Run updateCollation.php on all wikis with a collation defined. Last time (see T189295) it took a week to run on enwiki.

There are two problems with the current approach:

  • It's a lot of toil for the SRE teams, who need to do specialized rebuilding of a lot of packages, and to run and monitor scripts on hundreds of wikis
  • More importantly, it's a disservice to users who will see badly-sorted categories for a week

We need a smarter way to do this.

A couple ideas I had:

  1. One simple approach that would reduce the disservice to users would be to add additional columns to the categorylinks table, precompute the new values before we perform the switch, and then switch which columns we read once we start using the new ICU version (see the sketch after this list).
  2. To improve on the preceding idea, we could spawn a job whenever we have to recalculate the collation for categorylinks, which would asynchronously fill in the values in the additional columns by running a small executable we can prepare.
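
A minimal sketch of idea 1, assuming we tag the extra column and index with the target ICU major version; the names cl_sortkey_63 and the matching index are purely illustrative, not an agreed schema:

-- Illustrative only: a second, versioned sortkey column that can be
-- precomputed in the background while reads still use cl_sortkey.
ALTER TABLE /*_*/categorylinks
  ADD COLUMN cl_sortkey_63 varbinary(230) NOT NULL DEFAULT '';

-- The new column needs its own copy of the sort index so that reads can be
-- switched over to it once the new ICU version is deployed.
CREATE INDEX /*i*/cl_sortkey_63 ON /*_*/categorylinks (cl_to,cl_type,cl_sortkey_63,cl_from);

-- After the switch, the old column and index would have to be dropped again.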

Event Timeline

One simple approach that would reduce the disservice to users would be to add additional columns to the categorylinks table, precompute the new values before we perform the switch, and then switch which columns we read once we start using the new ICU version.

That approach sounds good to me. We should prefix it with the ICU major version (like currently we need to move from 57 to 63); then the next time we need to migrate to a new ICU release we can simply spin up a bullseye image, pre-generate the collation data for the new ICU, and then bullseye MediaWiki images can simply use the ICU 67 collation data.

There are numerous tasks for things like this, which kind of includes T37378: Support multiple collations at the same time (not exactly the same problem, but it could help in similar situations), including transitioning between one collation and another on a wiki (again, the script takes a long time to run, so there can be a period where the wrong collation is used), or even allowing user preferences etc. (though that's a different use case, and probably a different implementation too).

One thing to note is that the sortkey is used in an index, so both variants would have to be indexed too if we wanted to swap between them in a dynamic way.

One simple approach that would reduce the disservice to users would be to add additional columns to the categorylinks table, precompute the new values before we perform the switch, and then switch which columns we read once we start using the new ICU version.

That approach sounds good to me. We should prefix it with the ICU major version (like currently we need to move from 57 to 63); then the next time we need to migrate to a new ICU release we can simply spin up a bullseye image, pre-generate the collation data for the new ICU, and then bullseye MediaWiki images can simply use the ICU 67 collation data.

If we did that, then we'd be asking for schema changes every time, causing DBA work (as well as some amount of schema drift), and then having to have a dynamic column prefix or similar in the MW code to know which one to use...

And we'd still have the index problem too.

And obviously the column and index would then need dropping at a later date...

It's not a small table on many wikis! :)

So at least in one case (in the API, from a quick search), we'd have to be able to switch the index being used too.

-- A binary string obtained by applying a sortkey generation algorithm
-- (Collation::getSortKey()) to page_title, or cl_sortkey_prefix . "\n"
-- . page_title if cl_sortkey_prefix is nonempty.
cl_sortkey varbinary(230) NOT NULL default '',

-- A prefix for the raw sortkey manually specified by the user, either via
-- [[Category:Foo|prefix]] or {{defaultsort:prefix}}.  If nonempty, it's
-- concatenated with a line break followed by the page title before the sortkey
-- conversion algorithm is run.  We store this so that we can update
-- collations without reparsing all pages.
-- Note: If you change the length of this field, you also need to change
-- code in LinksUpdate.php. See T27254.
cl_sortkey_prefix varchar(255) binary NOT NULL default '',

and

-- We always sort within a given category, and within a given type.  FIXME:
-- Formerly this index didn't cover cl_type (since that didn't exist), so old
-- callers won't be using an index: fix this?
CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to,cl_type,cl_sortkey,cl_from);
Joe raised the priority of this task from Medium to High. Sep 28 2020, 6:58 AM

The way it worked in T37378 was to include cl_collation in the primary key (sketched below). So the data in the table was duplicated, but with cl_sortkey depending on cl_collation. I think it would have worked; it's just that the ball was dropped during code review. The catch with this is that LinksUpdate will drop rows for collations it doesn't know about. The work of the script would be partially undone by edits between the start of the script execution and the PHP version switch. We would either have to put up with that, or have a special-case hack in LinksUpdate::doIncrementalUpdate(), or PHP would have to be linked against multiple versions of ICU so that MediaWiki could insert rows for all collation versions simultaneously.
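
For illustration, the T37378-style schema change would look roughly like this; this is a sketch only, and it assumes the current primary key is (cl_from, cl_to) as in maintenance/tables.sql:

-- Sketch: make the collation part of the primary key, so each category link
-- is stored once per collation, each row with its own cl_sortkey.
ALTER TABLE /*_*/categorylinks
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (cl_from, cl_to, cl_collation);
-- e.g. one row with cl_collation = 'uca-default' built against the old ICU,
-- and a second row with a new collation name built against the new ICU.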

A simpler idea, closer to what @Joe is suggesting, is to duplicate the whole categorylinks table. You can't add or drop an index in O(1) time, but you can rename a table.

We could use Shellbox RPC (plus a cache) to provide the sort key from a different version of PHP/ICU. That would make the T37378 approach more feasible. MediaWiki would write both collations on edit/refreshlinks, and there would also be a script running that writes the same thing.

A simpler idea, closer to what @Joe is suggesting, is to duplicate the whole categorylinks table. You can't add or drop an index in O(1) time, but you can rename a table.

[...]

We could use Shellbox RPC (plus a cache) to provide the sort key from a different version of PHP/ICU.

Based on the conversation we had yesterday in the Platform Engineering team, these two ideas seem to provide the easiest and safest option. MediaWiki will only have to know about writing to two tables, and about using Shellbox to get the alternative sort key.

Perhaps we can even write a little extension to take care of the extra work in LinksUpdate, so we don't drag this special case into core: the LinksUpdateAfterInsert core hook seems nearly sufficient, we would just have to add a LinksUpdateAfterDelete hook.

One concern that came up was the question of whether the tables can really be swapped out "hot" while under load using a simple rename. Perhaps it would be best to go read-only for a minute, disable the dual-write logic, pause the job queue workers, and then do the swap.

daniel renamed this task from Allow easier ICU transitions in MediaWiki to Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table). Nov 30 2020, 12:28 PM

It might sound like a promotion, but I think before getting this done (in any way: duplicating the table, dynamic schema changes on the fly, etc.), categorylinks should be normalized first (T222224: RFC: Normalize MediaWiki link tables). This table is 200GB on Commons (and pretty big on other large wikis), and duplicating it (and updating both in real time) is going to be pretty costly if they are not normalized. This could be just normalizing the title+ns, or other columns of categorylinks too. The current update is done now (am I wrong?) and I hope the next one is a little bit further in the future.

After a bit more discussion with Tim and Daniel, here's how it will work.

First, some assumptions:

  1. We don't want to support multiple collations at once, since that would at least double the size of the categorylinks table, and we don't want that.
  2. This process is only needed in WMF production; third parties are OK with running updateCollation.php after upgrading.

Changes that need to be made:

  1. We will implement everything in the WikimediaMaintenance extension.
  2. We will have a RemoteCollation class which will use the Shellbox RPC feature to shell out and execute the collation remotely.
  3. WikimediaMaintenance will implement the LinksUpdateComplete hook, so whenever category links are added or removed in the categorylinks table, we will add/remove the same links in the future_categorylinks table, but collated using RemoteCollation, so we'd have the new PHP version's collation.
  4. We will update the updateCollation.php script to be able to read from the old collation table and write to the new collation table.

So, the proposed process:

  1. We duplicate the structure of the categorylinks table, so now we have a future_categorylinks table. It's empty.
  2. We enable our LinksUpdate hook. This is an mw-config deploy. Now all changes to the category links are recorded in both the old and the new table.
  3. We run the updateCollation.php script, reading from the old table and writing to the new table. This is looooooong.
  4. Now we have 2 tables, categorylinks and future_categorylinks. These 2 tables hopefully contain the same links. There might be some drift between them, but it's probably not critical and we can reconcile it.
  5. Now we can upgrade the PHP version. Once done, we can swap the future_categorylinks and categorylinks tables, e.g. rename categorylinks to future_categorylinks and future_categorylinks to categorylinks. This can be done in a single command (see the SQL sketch after this list).
  6. A script can be written to reconcile the drift between the two tables. For all pages which have differences between the two tables we can schedule a refreshLinks job, making sure the state ends up being totally consistent.
  7. We disable the LinksUpdate hook and drop the second table.
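
A rough sketch of the table juggling in steps 1, 5 and 6, assuming MariaDB semantics; the intermediate name old_categorylinks and the drift query are illustrative only:

-- Step 1: create an empty table with the same structure as categorylinks.
CREATE TABLE future_categorylinks LIKE categorylinks;

-- Step 5: swap the two tables in one atomic statement. RENAME TABLE applies
-- the renames left to right, so an intermediate name is needed for the swap.
RENAME TABLE categorylinks TO old_categorylinks,
             future_categorylinks TO categorylinks,
             old_categorylinks TO future_categorylinks;

-- Step 6 (drift check, illustrative): find links present in one table but not
-- the other, so the affected pages can be queued for refreshLinks jobs.
SELECT c.cl_from, c.cl_to
FROM categorylinks c
LEFT JOIN future_categorylinks f
  ON f.cl_from = c.cl_from AND f.cl_to = c.cl_to
WHERE f.cl_from IS NULL;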

Problems/questions with this approach:

  1. On step 5, during the upgrade and while the cluster is running mixed PHP versions, we might still have some sorting weirdness, since core still uses the local collation. How long do we usually run mixed versions? If this period is relatively brief, that's probably fine. If it's a prolonged period of time, we might consider using RemoteCollation in core as well, but that would have a potential performance price and definitely a complexity price.
  2. I'm not sure about the swapping-tables trick. Have we ever done that?
  3. There are still a lot of manual steps, but the user-visible weirdness is minimized.

Problems/questions with this approach:

  1. With the current size of categorylinks (T222224: RFC: Normalize MediaWiki link tables), you basically can't duplicate the table on Commons.

Swapping would be fine (I leave it to Manuel to fully confirm it, though), but the problem is that since the new table is not being read, just swapping it in within a second will bring everything down, because the system will start reading from disk (which is ten times slower) and we can't have both tables in memory, as they are too big for the InnoDB buffer pool. Especially given that InnoDB works per row.

One rather easy way to do it is through temp_ tables that hold only the collation value. That way you don't need to duplicate that massive table. We might even be able to normalize the collation if there is a lot of duplication happening (I need to measure it).

You basically can't duplicate the table on Commons.

Not sure I understand why not? We're starting 'fresh' with the new table. The script to populate it will take a very long time, but it would take long anyway. Are we gonna run out of disk?

One rather easy way to do it is through temp_ tables that hold only the collation value. That way you don't need to duplicate that massive table. We might even be able to normalize the collation if there is a lot of duplication happening (I need to measure it).

I've been thinking about it, but if we were doing it for the sake of supporting multiple collations, we'd need to either add an ID to the category link, or have cl_from, cl_to, cl_collation and cl_sortkey in the categorylinks_sort table. I don't think there's a lot of duplication, since the sortkey is basically a hash of the category member's title, so it's all the pages that belong to some category. There are way more pages than categories, so I'd think there are a lot of sort keys. So in the end it seems like the normalized sort keys will not be too much smaller than the original categorylinks. If we try to store multiple collations, the normalized sorting table will also become too big.
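
For concreteness, the side table being discussed here might look roughly like this; the table and column names are hypothetical, not an agreed design:

-- Hypothetical sort-key side table: one row per (page, category, collation),
-- holding only the data that depends on the collation.
CREATE TABLE /*_*/categorylinks_sort (
  cl_from int unsigned NOT NULL,
  cl_to varbinary(255) NOT NULL,
  cl_collation varbinary(32) NOT NULL,
  cl_sortkey varbinary(230) NOT NULL DEFAULT '',
  PRIMARY KEY (cl_from, cl_to, cl_collation)
);
-- Sorting within a category still needs an index equivalent to cl_sortkey.
CREATE INDEX /*i*/cls_sortkey ON /*_*/categorylinks_sort (cl_to,cl_collation,cl_sortkey,cl_from);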

You basically can't duplicate the table on Commons.

Not sure I understand why not? We're starting 'fresh' with the new table. The script to populate it will take a very long time, but it would take long anyway. Are we gonna run out of disk?

Yup, I don't know if all of s4 uses the new spec (8TB), but even if one node uses the old spec (2TB), we would very likely run out of space; it was 200GB years ago, and now it's even bigger. It's also not just about running out of disk, it's about getting dangerously close to it.

One rather easy way to do it is through temp_ tables that hold only the collation value. That way you don't need to duplicate that massive table. We might even be able to normalize the collation if there is a lot of duplication happening (I need to measure it).

I've been thinking about it, but if we were doing it for the sake of supporting multiple collations, we'd need to either add an ID to the category link, or have cl_from, cl_to, cl_collation and cl_sortkey in the categorylinks_sort table. I don't think there's a lot of duplication, since the sortkey is basically a hash of the category member's title, so it's all the pages that belong to some category. There are way more pages than categories, so I'd think there are a lot of sort keys. So in the end it seems like the normalized sort keys will not be too much smaller than the original categorylinks. If we try to store multiple collations, the normalized sorting table will also become too big.

Adding an ID for the collation to categorylinks sounds good to me, but now that I look at the table, this masterpiece of design needs a lot of love. cl_type and cl_collation must get normalized (or just replaced with a numerical value) before we can even touch this table.

I'm not sure about cl_sortkey not being duplicated. We had a similar situation with the comment table. At least on enwiki, lots of sort keys are the name of a country, so I expect a lot of duplication there, but tbh that's way down the list of categorylinks issues.

Let me think a bit in more depth about this issue and how it entangles with the rest of categorylinks table issues. I will come back with some ideas.
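
For illustration, the normalization of cl_collation mentioned above might look roughly like this; the names and types are hypothetical, not a worked-out proposal:

-- Hypothetical lookup table mapping collation names to small numeric ids.
CREATE TABLE /*_*/collation (
  collation_id smallint unsigned NOT NULL AUTO_INCREMENT,
  collation_name varbinary(64) NOT NULL,
  PRIMARY KEY (collation_id),
  UNIQUE INDEX collation_name (collation_name)
);

-- categorylinks would then store the id instead of repeating the name in
-- every row; the same could be done for cl_type.
ALTER TABLE /*_*/categorylinks
  ADD COLUMN cl_collation_id smallint unsigned NOT NULL DEFAULT 0;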

Problems/questions with this approach:

  1. On step 5, during the upgrade and while the cluster is running mixed PHP versions, we might still have some sorting weirdness, since core still uses the local collation. How long do we usually run mixed versions? If this period is relatively brief, that's probably fine. If it's a prolonged period of time, we might consider using RemoteCollation in core as well, but that would have a potential performance price and definitely a complexity price.

With the switch to Kubernetes, the time should be on the order of minutes at most, while we do a deployment.

  1. I'm not sure about the swapping-tables trick. Have we ever done that?
  2. There are still a lot of manual steps, but the user-visible weirdness is minimized.

The manual steps can mostly be automated (like running updateCollation.php efficiently on all wikis). It's the user-visible weirdness that we need to avoid in the first place, so this solution seems ok to me.

It's the user-visible weirdness that we need to avoid in the first place, so this solution seems ok to me.

We don't have too big an issue on-wiki while stuff is shuffling around. A couple of questions, and a response of "yes, this was known, and no, it will be fine in a week; let us know if otherwise" is usually enough for most people.

Change 732450 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/core@master] WIP: add ability to do collation remotely using shellbox

https://gerrit.wikimedia.org/r/732450

Change 732997 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

https://gerrit.wikimedia.org/r/732997

Change 733018 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/core@master] ShellboxClientFactory: add RPCClient getters

https://gerrit.wikimedia.org/r/733018

Change 732997 abandoned by Ppchelko:

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

Reason:

https://gerrit.wikimedia.org/r/732997

Change 733082 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

https://gerrit.wikimedia.org/r/733082

Change 732997 restored by Ppchelko:

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

https://gerrit.wikimedia.org/r/732997

Change 733082 abandoned by Ppchelko:

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

Reason:

https://gerrit.wikimedia.org/r/733082

Change 732997 merged by jenkins-bot:

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

https://gerrit.wikimedia.org/r/732997

Change 733018 merged by jenkins-bot:

[mediawiki/core@master] ShellboxClientFactory: add RPCClient getters

https://gerrit.wikimedia.org/r/733018

Change 743278 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] [WIP] LinksUpdate refactor

https://gerrit.wikimedia.org/r/743278

Change 745752 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] RemoteIcuCollation

https://gerrit.wikimedia.org/r/745752

Change 745754 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Write to multiple categorylinks tables on update

https://gerrit.wikimedia.org/r/745754

Change 745752 merged by jenkins-bot:

[mediawiki/core@master] RemoteIcuCollation

https://gerrit.wikimedia.org/r/745752

Change 747712 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Add table copying feature to updateCollation.php

https://gerrit.wikimedia.org/r/747712

Change 732450 abandoned by Ppchelko:

[mediawiki/core@master] WIP: add ability to do collation remotely using shellbox

Reason:

https://gerrit.wikimedia.org/r/732450

Change 743278 merged by jenkins-bot:

[mediawiki/core@master] LinksUpdate refactor

https://gerrit.wikimedia.org/r/743278

Change 745754 merged by jenkins-bot:

[mediawiki/core@master] Write to multiple categorylinks tables on update

https://gerrit.wikimedia.org/r/745754

Change 747712 merged by jenkins-bot:

[mediawiki/core@master] maintenance: Add table copying feature to updateCollation.php

https://gerrit.wikimedia.org/r/747712

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).
Krinkle subscribed.

(Triaging on the perf board as a priority, given this is a subtask of the PHP 7.4 Objective from SRE; we don't have it under an OKR for perf yet.)

I think this was completed in January, except for deployment, which is not covered by this task.