
Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table)
Closed, ResolvedPublic

Description

Right now, whenever we want to upgrade the libicu version we link against in production, it involves a long process that goes as follows:

  • Warn all communities that they will see some form of sorting weirdness in categories for some time
  • Change the version of ICU we link against everywhere (this is very tricky without changing Linux distribution versions, as we need to rebuild quite a lot of packages)
  • Run updateCollation.php on all wikis with a collation defined. Last time (see T189295) it took a week to run on enwiki.

There are two problems with the current approach:

  • It's a lot of toil for the SRE teams, who need to do specialized rebuilding of a lot of packages, and to run and monitor scripts on hundreds of wikis
  • More importantly, it's a disservice to users who will see badly-sorted categories for a week

We need a smarter way to do this.

A couple ideas I had:

  1. One simple approach that would reduce the disservice to users would be to add additional columns to the categorylinks table, precompute the new values before we perform the switch, and then switch which columns we read once we start using the new ICU version (see the sketch after this list).
  2. To improve on the preceding idea, we could spawn a job whenever we have to recalculate the collation for categorylinks, which would asynchronously fill in the values in the additional columns by running a small executable we can prepare.
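
A minimal sketch of idea 1, assuming we tag the extra column and index with the target ICU major version; the names cl_sortkey_63 and the matching index are purely illustrative, not an agreed schema:

-- Illustrative only: a second, versioned sortkey column that can be
-- precomputed in the background while reads still use cl_sortkey.
ALTER TABLE /*_*/categorylinks
  ADD COLUMN cl_sortkey_63 varbinary(230) NOT NULL DEFAULT '';

-- The new column needs its own copy of the sort index so that reads can be
-- switched over to it once the new ICU version is deployed.
CREATE INDEX /*i*/cl_sortkey_63 ON /*_*/categorylinks (cl_to,cl_type,cl_sortkey_63,cl_from);

-- After the switch, the old column and index would have to be dropped again.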

Event Timeline

One simple approach that would reduce the disservice to users would be to add additional columns to the categorylinks table, precompute the new values before we perform the switch, and then switch which columns we read once we start using the new ICU version.

That approach sounds good to me. We should prefix it with the ICU major version (like currently we need to move from 57 to 63); then the next time we need to migrate to a new ICU release we can simply spin up a bullseye image, pre-generate the collation data for the new ICU, and then bullseye MediaWiki images can simply use the ICU 67 collation data.

There are numerous tasks for things like this, which kind of includes T37378: Support multiple collations at the same time (not exactly the same problem, but it could help in similar situations), including transitioning between one collation and another on a wiki (again, the script takes a long time to run, so there can be a period where the wrong collation is used), or even allowing user preferences etc. (though that's a different use case, and probably a different implementation too).

One thing to note is that the sortkey is used in an index, so both variants would have to be indexed too if we wanted to swap between them in a dynamic way.

One simple approach that would reduce the disservice to users would be to add additional columns to the categorylinks table, precompute the new values before we perform the switch, and then switch which columns we read once we start using the new ICU version.

That approach sounds good to me. We should prefix it with the ICU major version (like currently we need to move from 57 to 63); then the next time we need to migrate to a new ICU release we can simply spin up a bullseye image, pre-generate the collation data for the new ICU, and then bullseye MediaWiki images can simply use the ICU 67 collation data.

If we did that, then we'd be asking for schema changes every time, causing DBA work (as well as some amount of schema drift), and then having to have a dynamic column prefix or similar in the MW code to know which one to use...

And we'd still have the index problem too.

And obviously the column and index would then need dropping at a later date...

It's not a small table on many wikis! :)

So at least in one case (in the API, from a quick search), we'd have to be able to switch the index being used too.

-- A binary string obtained by applying a sortkey generation algorithm
-- (Collation::getSortKey()) to page_title, or cl_sortkey_prefix . "\n"
-- . page_title if cl_sortkey_prefix is nonempty.
cl_sortkey varbinary(230) NOT NULL default '',

-- A prefix for the raw sortkey manually specified by the user, either via
-- [[Category:Foo|prefix]] or {{defaultsort:prefix}}.  If nonempty, it's
-- concatenated with a line break followed by the page title before the sortkey
-- conversion algorithm is run.  We store this so that we can update
-- collations without reparsing all pages.
-- Note: If you change the length of this field, you also need to change
-- code in LinksUpdate.php. See T27254.
cl_sortkey_prefix varchar(255) binary NOT NULL default '',

and

-- We always sort within a given category, and within a given type.  FIXME:
-- Formerly this index didn't cover cl_type (since that didn't exist), so old
-- callers won't be using an index: fix this?
CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to,cl_type,cl_sortkey,cl_from);
Joe raised the priority of this task from Medium to High. Sep 28 2020, 6:58 AM

The way it worked in T37378 was to include cl_collation in the primary key (sketched below). So the data in the table was duplicated, but with cl_sortkey depending on cl_collation. I think it would have worked; it's just that the ball was dropped during code review. The catch with this is that LinksUpdate will drop rows for collations it doesn't know about. The work of the script would be partially undone by edits between the start of the script execution and the PHP version switch. We would either have to put up with that, or have a special-case hack in LinksUpdate::doIncrementalUpdate(), or PHP would have to be linked against multiple versions of ICU so that MediaWiki could insert rows for all collation versions simultaneously.
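
For illustration, the T37378-style schema change would look roughly like this; this is a sketch only, and it assumes the current primary key is (cl_from, cl_to) as in maintenance/tables.sql:

-- Sketch: make the collation part of the primary key, so each category link
-- is stored once per collation, each row with its own cl_sortkey.
ALTER TABLE /*_*/categorylinks
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (cl_from, cl_to, cl_collation);
-- e.g. one row with cl_collation = 'uca-default' built against the old ICU,
-- and a second row with a new collation name built against the new ICU.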

A simpler idea, closer to what @Joe is suggesting, is to duplicate the whole categorylinks table. You can't add or drop an index in O(1) time, but you can rename a table.

We could use Shellbox RPC (plus a cache) to provide the sort key from a different version of PHP/ICU. That would make the T37378 approach more feasible. MediaWiki would write both collations on edit/refreshlinks, and there would also be a script running that writes the same thing.

A simpler idea, closer to what @Joe is suggesting, is to duplicate the whole categorylinks table. You can't add or drop an index in O(1) time, but you can rename a table.

[...]

We could use Shellbox RPC (plus a cache) to provide the sort key from a different version of PHP/ICU.

Based on the conversation we had yesterday in the Platform Engineering team, these two ideas seem to provide the easiest and safest option. MediaWiki will only have to know about writing to two tables, and about using Shellbox to get the alternative sort key.

Perhaps we can even write a little extension to take care of the extra work in LinksUpdate, so we don't drag this special case into core: the LinksUpdateAfterInsert core hook seems nearly sufficient, we would just have to add a LinksUpdateAfterDelete hook.

One concern that came up was the question of whether the tables can really be swapped out "hot" while under load using a simple rename. Perhaps it would be best to go read-only for a minute, disable the dual-write logic, pause the job queue workers, and then do the swap.

daniel renamed this task from Allow easier ICU transitions in MediaWiki to Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table). Nov 30 2020, 12:28 PM

It might sound like a promotion, but I think before getting this done (in any way: duplicating the table, dynamic schema changes on the fly, etc.), categorylinks should be normalized first (T222224: RFC: Normalize MediaWiki link tables). This table is 200GB on Commons (and pretty big on other large wikis), and duplicating it (and updating both in real time) is going to be pretty costly if they are not normalized. This could be just normalizing the title+ns, or other columns of categorylinks too. The current update is done now (am I wrong?) and I hope the next one is a little bit further in the future.

After a bit more discussion with Tim and Daniel, here's how it will work.

First, some assumptions:

  1. We don't want to support multiple collations at once, since that would at least double the size of the categorylinks table, and we don't want that.
  2. This process is only needed in WMF production; third parties are OK with running updateCollation.php after upgrading.

Changes that need to be made:

  1. We will implement everything in the WikimediaMaintenance extension.
  2. We will have a RemoteCollation class which will use the Shellbox RPC feature to shell out and execute the collation remotely.
  3. WikimediaMaintenance will implement the LinksUpdateComplete hook, so whenever category links are added or removed in the categorylinks table, we will add/remove the same links in the future_categorylinks table, but collated using RemoteCollation, so we'd have the new PHP version's collation.
  4. We will update the updateCollation.php script to be able to read from the old collation table and write to the new collation table.

So, the proposed process:

  1. We duplicate the structure of the categorylinks table, so now we have a future_categorylinks table. It's empty.
  2. We enable our LinksUpdate hook. This is an mw-config deploy. Now all changes to the category links are recorded in both the old and the new table.
  3. We run the updateCollation.php script, reading from the old table and writing to the new table. This is looooooong.
  4. Now we have 2 tables, categorylinks and future_categorylinks. These 2 tables hopefully contain the same links. There might be some drift between them, but it's probably not critical and we can reconcile it.
  5. Now we can upgrade the PHP version. Once done, we can swap the future_categorylinks and categorylinks tables, e.g. rename categorylinks to future_categorylinks and future_categorylinks to categorylinks. This can be done in a single command (see the SQL sketch after this list).
  6. A script can be written to reconcile the drift between the two tables. For all pages which have differences between the two tables we can schedule a refreshLinks job, making sure the state ends up being totally consistent.
  7. We disable the LinksUpdate hook and drop the second table.
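
A rough sketch of the table juggling in steps 1, 5 and 6, assuming MariaDB semantics; the intermediate name old_categorylinks and the drift query are illustrative only:

-- Step 1: create an empty table with the same structure as categorylinks.
CREATE TABLE future_categorylinks LIKE categorylinks;

-- Step 5: swap the two tables in one atomic statement. RENAME TABLE applies
-- the renames left to right, so an intermediate name is needed for the swap.
RENAME TABLE categorylinks TO old_categorylinks,
             future_categorylinks TO categorylinks,
             old_categorylinks TO future_categorylinks;

-- Step 6 (drift check, illustrative): find links present in one table but not
-- the other, so the affected pages can be queued for refreshLinks jobs.
SELECT c.cl_from, c.cl_to
FROM categorylinks c
LEFT JOIN future_categorylinks f
  ON f.cl_from = c.cl_from AND f.cl_to = c.cl_to
WHERE f.cl_from IS NULL;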

Problems/questions with this approach:

  1. On step 5, during the upgrade and while the cluster is running mixed PHP versions, we might still have some sorting weirdness, since core still uses the local collation. How long do we usually run mixed versions? If this period is relatively brief, that's probably fine. If it's a prolonged period of time, we might consider using RemoteCollation in core as well, but that would have a potential performance price and definitely a complexity price.
  2. I'm not sure about the swapping-tables trick. Have we ever done that?
  3. There are still a lot of manual steps, but the user-visible weirdness is minimized.

Problems/questions with this approach:

  1. With the current size of categorylinks (T222224: RFC: Normalize MediaWiki link tables), you basically can't duplicate the table on Commons.

Swapping would be fine (I leave it to Manuel to fully confirm it, though), but the problem is that since the new table is not being read, just swapping it in within a second will bring everything down, because the system will start reading from disk (which is ten times slower) and we can't have both tables in memory, as they are too big for the InnoDB buffer pool. Especially given that InnoDB works per row.

One rather easy way to do it is through temp_ tables that hold only the collation value. That way you don't need to duplicate that massive table. We might even be able to normalize the collation if there is a lot of duplication happening (I need to measure it).

You basically can't duplicate the table on Commons.

Not sure I understand why not? We're starting 'fresh' with the new table. The script to populate it will take a very long time, but it would take long anyway. Are we gonna run out of disk?

One rather easy way to do it is through temp_ tables that hold only the collation value. That way you don't need to duplicate that massive table. We might even be able to normalize the collation if there is a lot of duplication happening (I need to measure it).

I've been thinking about it, but if we were doing it for the sake of supporting multiple collations, we'd need to either add an ID to the category link, or have cl_from, cl_to, cl_collation and cl_sortkey in the categorylinks_sort table. I don't think there's a lot of duplication, since the sortkey is basically a hash of the category member's title, so it's all the pages that belong to some category. There are way more pages than categories, so I'd think there are a lot of sort keys. So in the end it seems like the normalized sort keys will not be too much smaller than the original categorylinks. If we try to store multiple collations, the normalized sorting table will also become too big.
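
For concreteness, the side table being discussed here might look roughly like this; the table and column names are hypothetical, not an agreed design:

-- Hypothetical sort-key side table: one row per (page, category, collation),
-- holding only the data that depends on the collation.
CREATE TABLE /*_*/categorylinks_sort (
  cl_from int unsigned NOT NULL,
  cl_to varbinary(255) NOT NULL,
  cl_collation varbinary(32) NOT NULL,
  cl_sortkey varbinary(230) NOT NULL DEFAULT '',
  PRIMARY KEY (cl_from, cl_to, cl_collation)
);
-- Sorting within a category still needs an index equivalent to cl_sortkey.
CREATE INDEX /*i*/cls_sortkey ON /*_*/categorylinks_sort (cl_to,cl_collation,cl_sortkey,cl_from);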

You basically can't duplicate the table on Commons.

Not sure I understand why not? We're starting 'fresh' with the new table. The script to populate it will take a very long time, but it would take long anyway. Are we gonna run out of disk?

Yup, I don't know if all of s4 uses the new spec (8TB), but even if one node uses the old spec (2TB), we would very likely run out of space; it was 200GB years ago, and now it's even bigger. It's also not just about running out of disk, it's about getting dangerously close to it.

One rather easy way to do it is through temp_ tables that hold only the collation value. That way you don't need to duplicate that massive table. We might even be able to normalize the collation if there is a lot of duplication happening (I need to measure it).

I've been thinking about it, but if we were doing it for the sake of supporting multiple collations, we'd need to either add an ID to the category link, or have cl_from, cl_to, cl_collation and cl_sortkey in the categorylinks_sort table. I don't think there's a lot of duplication, since the sortkey is basically a hash of the category member's title, so it's all the pages that belong to some category. There are way more pages than categories, so I'd think there are a lot of sort keys. So in the end it seems like the normalized sort keys will not be too much smaller than the original categorylinks. If we try to store multiple collations, the normalized sorting table will also become too big.

Adding an ID for the collation to categorylinks sounds good to me, but now that I look at the table, this masterpiece of design needs a lot of love. cl_type and cl_collation must get normalized (or just replaced with a numerical value) before we can even touch this table.

I'm not sure about cl_sortkey not being duplicated. We had a similar situation with the comment table. At least on enwiki, lots of sort keys are the name of a country, so I expect a lot of duplication there, but tbh that's way down the list of categorylinks issues.

Let me think a bit in more depth about this issue and how it entangles with the rest of categorylinks table issues. I will come back with some ideas.
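
For illustration, the normalization of cl_collation mentioned above might look roughly like this; the names and types are hypothetical, not a worked-out proposal:

-- Hypothetical lookup table mapping collation names to small numeric ids.
CREATE TABLE /*_*/collation (
  collation_id smallint unsigned NOT NULL AUTO_INCREMENT,
  collation_name varbinary(64) NOT NULL,
  PRIMARY KEY (collation_id),
  UNIQUE INDEX collation_name (collation_name)
);

-- categorylinks would then store the id instead of repeating the name in
-- every row; the same could be done for cl_type.
ALTER TABLE /*_*/categorylinks
  ADD COLUMN cl_collation_id smallint unsigned NOT NULL DEFAULT 0;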

Problems/questions with this approach:

  1. On step 5, during the upgrade and while the cluster is running mixed PHP versions, we might still have some sorting weirdness, since core still uses the local collation. How long do we usually run mixed versions? If this period is relatively brief, that's probably fine. If it's a prolonged period of time, we might consider using RemoteCollation in core as well, but that would have a potential performance price and definitely a complexity price.

With the switch to Kubernetes, the time should be on the order of minutes at most, while we do a deployment.

  1. I'm not sure about the swapping-tables trick. Have we ever done that?
  2. There are still a lot of manual steps, but the user-visible weirdness is minimized.

The manual steps can mostly be automated (like running updateCollation.php efficiently on all wikis). It's the user-visible weirdness that we need to avoid in the first place, so this solution seems ok to me.

It's the user-visible weirdness that we need to avoid in the first place, so this solution seems ok to me.

We don't have too big an issue on-wiki while stuff is shuffling around. A couple of questions, and a response of "yes, this was known, and no, it will be fine in a week; let us know if otherwise" is usually enough for most people.

Change 732450 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/core@master] WIP: add ability to do collation remotely using shellbox

https://gerrit.wikimedia.org/r/732450

Change 732997 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

https://gerrit.wikimedia.org/r/732997

Change 733018 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/core@master] ShellboxClientFactory: add RPCClient getters

https://gerrit.wikimedia.org/r/733018

Change 732997 abandoned by Ppchelko:

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

Reason:

https://gerrit.wikimedia.org/r/732997

Change 733082 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

https://gerrit.wikimedia.org/r/733082

Change 732997 restored by Ppchelko:

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

https://gerrit.wikimedia.org/r/732997

Change 733082 abandoned by Ppchelko:

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

Reason:

https://gerrit.wikimedia.org/r/733082

Change 732997 merged by jenkins-bot:

[mediawiki/libs/Shellbox@master] Add RPCClient interface and local fallback implementation

https://gerrit.wikimedia.org/r/732997

Change 733018 merged by jenkins-bot:

[mediawiki/core@master] ShellboxClientFactory: add RPCClient getters

https://gerrit.wikimedia.org/r/733018

Change 743278 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] [WIP] LinksUpdate refactor

https://gerrit.wikimedia.org/r/743278

Change 745752 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] RemoteIcuCollation

https://gerrit.wikimedia.org/r/745752

Change 745754 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Write to multiple categorylinks tables on update

https://gerrit.wikimedia.org/r/745754

Change 745752 merged by jenkins-bot:

[mediawiki/core@master] RemoteIcuCollation

https://gerrit.wikimedia.org/r/745752

Change 747712 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Add table copying feature to updateCollation.php

https://gerrit.wikimedia.org/r/747712

Change 732450 abandoned by Ppchelko:

[mediawiki/core@master] WIP: add ability to do collation remotely using shellbox

Reason:

https://gerrit.wikimedia.org/r/732450

Change 743278 merged by jenkins-bot:

[mediawiki/core@master] LinksUpdate refactor

https://gerrit.wikimedia.org/r/743278

Change 745754 merged by jenkins-bot:

[mediawiki/core@master] Write to multiple categorylinks tables on update

https://gerrit.wikimedia.org/r/745754

Change 747712 merged by jenkins-bot:

[mediawiki/core@master] maintenance: Add table copying feature to updateCollation.php

https://gerrit.wikimedia.org/r/747712

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).
Krinkle subscribed.

(Triaging on the perf board as a priority, given this is a subtask of the PHP 7.4 Objective from SRE; we don't have it under an OKR for perf yet.)

I think this was completed in January, except for deployment, which is not covered by this task.