updateCollation.php script prohibitively slow for very large wikis
Closed, Resolved · Public

Description

The updateCollation.php script is awfully slow. It took over a week for
fr.wp (T56680), and it'll probably take months if we ever decide to run
it on en.wp. That kinda sucks.

I'm not sure what can be done, whether it's just a problem with the WMF
configuration, or what, so I'm just filing this and asking for comments.
Please resolve as INVALID if we in fact can't do anything about this.

Possible causes:

  • The workaround from T47970, which makes the script use an index that's not entirely perfect for the task (but likely good enough; no idea how much slower that makes it).
  • Slave synchronisation, in which case maybe we can do something with ops involvement? Don't ask me.
  • There's just too much data being sent back and forth between PHP and the database, in which case we can't do anything (unless we implement collating entirely database-side, which I've been told is a bad idea). A simplified sketch of the script's batch query follows this list.
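
For context, the script processes categorylinks in batches: it reads a batch of rows ordered by the collation sort columns, recomputes each sortkey in PHP, and writes the rows back. Below is a simplified sketch of the batch read, with made-up offset values; the real query, quoted later in this task, expands the tuple comparison into an OR chain so that MySQL can use an index for it.

-- Simplified sketch only; 'uca-default', 'Example', 1 and 12345 are
-- hypothetical offset values marking where the previous batch left off.
SELECT cl_from, cl_to, cl_sortkey_prefix, cl_collation,
       page_namespace, page_title
FROM categorylinks
JOIN page ON cl_from = page_id
WHERE (cl_collation, cl_to, cl_type, cl_from)
      > ('uca-default', 'Example', 1, 12345)
ORDER BY cl_collation, cl_to, cl_type, cl_from
LIMIT 100;
-- PHP then recomputes each row's sortkey under the new collation and
-- issues one UPDATE per row, remembering the last tuple as the next offset.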

I'm CC-ing competent people. Help?

Event Timeline

There are a very large number of changes, so older changes are hidden.

matmarex added a subscriber: gerritbot.

The patch should be approved and merged first, then I would apply the index live to the databases (on a different task, which may take some time), then we would run the script.

Change 272416 merged by jenkins-bot:
Add new index to make updateCollation.php painless

https://gerrit.wikimedia.org/r/272416

Thanks Jaime, the patch is merged now. I filed T130692 about adding the new indexes.
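
For reference, here is a sketch of the index that change adds. The key name and column order are inferred from the EXPLAIN output quoted further down; see the Gerrit change for the authoritative DDL.

-- Inferred from the cl_collation_ext key name and the query's ORDER BY
-- columns; https://gerrit.wikimedia.org/r/272416 has the real definition.
ALTER TABLE categorylinks
  ADD INDEX cl_collation_ext (cl_collation, cl_to, cl_type, cl_from);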

Testwiki has been updated with the new index. We should run updateCollation there and see how fast it is.

> Testwiki has been updated with the new index. We should run updateCollation there and see how fast it is.

It's probably not representative, given the number of categorylinks rows it has in comparison to other big wikis.

Yeah, looks like testwiki only has 40K rows.

jcrespo claimed this task.

I am no longer a blocker for large-scale testing. Thanks to the new MariaDB 10 servers, this change is now easy (easy =/= fast; this may take days to apply to all wikis when that is required).

But you should be able to at least check its correctness. If I am around, please ping me for the maintenance so I can evaluate potential lag issues.

jcrespo removed jcrespo as the assignee of this task.

Sorry, resolved by mistake.

The following queries were found to be slow (>~10 seconds to execute) on the s3 master:

Hits 	Tmax (s) 	Tavg (s) 	Tsum (s) 	Hosts 	Users 	Schemas
546	28	21	11,687	db1075	wikiadmin	ruwiktionary
SELECT /* UpdateCollation::execute */ cl_from, cl_to, cl_sortkey_prefix, cl_collation, cl_sortkey, cl_type+0 AS "cl_type_numeric", page_namespace, page_title FROM `categorylinks`, `page` WHERE (cl_collation > 'uca-ru' OR (cl_collation = 'uca-ru' AND cl_to > 'Nomina_feminina') OR (cl_collation = 'uca-ru' AND cl_to = 'Nomina_feminina' AND cl_type > 1) OR (cl_collation = 'uca-ru' AND cl_to = 'Nomina_feminina' AND cl_type = 1 AND cl_from > '180587')) AND (cl_from = page_id) ORDER BY cl_collation, cl_to, cl_type, cl_from LIMIT 100 /* 41bc78be50cf89cb95f818d642787e15 db1075 ruwiktionary 19s */
1	26	26	26	db1075	wikiadmin	ruwiktionary
SELECT /* UpdateCollation::execute */ cl_from, cl_to, cl_sortkey_prefix, cl_collation, cl_sortkey, cl_type+0 AS "cl_type_numeric", page_namespace, page_title FROM `categorylinks`, `page` WHERE (cl_from = page_id) ORDER BY cl_collation, cl_to, cl_type, cl_from LIMIT 100 /* 3f310ddb21c2f635a6fd8c551c76f3ee db1075 ruwiktionary 26s */

The new index is not being used:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: index
possible_keys: PRIMARY
          key: name_title
      key_len: 261
          ref: NULL
         rows: 1100761
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: categorylinks
         type: ref
possible_keys: cl_from,cl_timestamp,cl_sortkey,cl_collation_ext
          key: cl_from
      key_len: 4
          ref: ruwiktionary.page.page_id
         rows: 4
        Extra: Using where
2 rows in set (0.00 sec)

Doing this query, however, takes 0 seconds:

SELECT STRAIGHT_JOIN /* UpdateCollation::execute */ cl_from, cl_to, cl_sortkey_prefix, cl_collation, cl_sortkey, cl_type+0 AS "cl_type_numeric", page_namespace, page_title FROM `categorylinks` JOIN `page` WHERE (cl_collation > 'uca-ru' OR (cl_collation = 'uca-ru' AND cl_to > 'Nomina_feminina/ru') OR (cl_collation = 'uca-ru' AND cl_to = 'Nomina_feminina/ru' AND cl_type > 1) OR (cl_collation = 'uca-ru' AND cl_to = 'Nomina_feminina/ru' AND cl_type = 1 AND cl_from > '186127')) AND (cl_from = page_id) ORDER BY cl_collation, cl_to, cl_type, cl_from LIMIT 100;

MariaDB  db1075 ruwiktionary > EXPLAIN SELECT STRAIGHT_JOIN /* UpdateCollation::execute */ cl_from, cl_to, cl_sortkey_prefix, cl_collation, cl_sortkey, cl_type+0 AS "cl_type_numeric", page_namespace, page_title FROM `categorylinks` JOIN `page` WHERE (cl_collation > 'uca-ru' OR (cl_collation = 'uca-ru' AND cl_to > 'Nomina_feminina/ru') OR (cl_collation = 'uca-ru' AND cl_to = 'Nomina_feminina/ru' AND cl_type > 1) OR (cl_collation = 'uca-ru' AND cl_to = 'Nomina_feminina/ru' AND cl_type = 1 AND cl_from > '186127')) AND (cl_from = page_id) ORDER BY cl_collation, cl_to, cl_type, cl_from LIMIT 100\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: categorylinks
         type: range
possible_keys: cl_from,cl_timestamp,cl_sortkey,cl_collation_ext
          key: cl_collation_ext
      key_len: 291
          ref: NULL
         rows: 4992443
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: ruwiktionary.categorylinks.cl_from
         rows: 1
        Extra: 
2 rows in set (0.00 sec)

We are literally talking about more than a 2600x speedup. Please make sure you throttle updates at that speed (I saw lag happening even at slow speeds); otherwise, I will have to abort the maintenance. Also, shouldn't you SELECT FOR UPDATE, to prevent the selected rows from being deleted before you edit them?
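
For readers unfamiliar with the suggestion, here is a minimal sketch of what a locking read would look like in this loop; it is not what the script currently does.

-- Hypothetical locking variant of the batch read. FOR UPDATE holds row
-- locks on the selected rows until COMMIT, so a concurrent deletion cannot
-- remove them between the SELECT and the follow-up UPDATEs.
BEGIN;
SELECT cl_from, cl_to, cl_sortkey_prefix
FROM categorylinks JOIN page ON cl_from = page_id
-- (keyset WHERE clause as in the real query, omitted here for brevity)
ORDER BY cl_collation, cl_to, cl_type, cl_from
LIMIT 100
FOR UPDATE;
-- ...recompute sortkeys and issue the UPDATEs...
COMMIT;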

Is anyone looking into @jcrespo's recommendations? We will need to run this script on all wikis sooner rather than later, as we're still running an unmaintained (upstream) version of libicu on all of our appservers, and we've been waiting to do that for a long time (see T86096).

@jcrespo to clarify, you are recommending we always add STRAIGHT_JOIN to the query?

> Please make sure you throttle updates at that speed

The script calls wfWaitForSlaves() regularly. Are you saying it should do so more often? Or should we have additional throttling beyond waiting for slaves?

> SELECT FOR UPDATE

is presumably not being used in order to avoid locking things. Very minor referential integrity issues are considered acceptable in this case, since the data being updated all functionally depends on other data (so it's not canonical). If you advise me that there would be no performance or lock-contention concern with SELECT FOR UPDATE, we can add it.

In T58041#2266715, @Joe wrote:

> Is anyone looking into @jcrespo's recommendations? We will need to run this script on all wikis sooner rather than later, as we're still running an unmaintained (upstream) version of libicu on all of our appservers, and we've been waiting to do that for a long time (see T86096).

Presumably that is me and matmarex (with my volunteer hat on; I assume matmarex is wearing his volunteer hat too). CommTech is also sort of interested in this feature, but I don't think they are taking responsibility for it. Personally, the amount of time I have to devote to this is rather limited right now (yadda yadda... why is there no MW core team...).

Yes, Kaldari is planning to work on this. He's away for a week, but I think he'll pick this up late next week.

> The script calls wfWaitForSlaves() regularly. Are you saying it should do so more often? Or should we have additional throttling beyond waiting for slaves?

I am saying that, despite that, there was a 10-second lag at the start of the process. Why, I do not know, but if it is a one-time thing I suppose it is ok, as long as the rest of the process creates no lag.

> Very minor referential integrity issues are considered acceptable in this case

If the potential issues are known and accepted on purpose, then it is ok.

> you are recommending we always add STRAIGHT_JOIN to the query

It is a workaround (index hints are evil for many reasons), but if that allows faster execution, so be it.

Change 288032 had a related patch set uploaded (by Brian Wolff):
Use STRAIGHT_JOIN on updateCollation.php per jcrespo

https://gerrit.wikimedia.org/r/288032

Change 288036 had a related patch set uploaded (by Brian Wolff):
use slave for row estimate in updateCollation.php

https://gerrit.wikimedia.org/r/288036

> I am saying that, despite that, there was a 10-second lag at the start of the process. Why, I do not know, but if it is a one-time thing I suppose it is ok, as long as the rest of the process creates no lag.

I took another look at the script; it seems to do a COUNT(*) right at the beginning to generate a progress bar. I'm going to guess that's the culprit.
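
For the record, that initial count is what the "use slave for row estimate" patch above targets. A rough sketch of the difference; the EXPLAIN-based approach is approximately what MediaWiki's estimateRowCount() does, so treat the details as an assumption.

-- Exact count: walks an entire index; on a multi-million-row categorylinks
-- table this can take many seconds, and it was running on the master just
-- to size a progress bar.
SELECT COUNT(*) FROM categorylinks;

-- Cheap alternative: read the optimizer's row estimate instead, and do it
-- on a slave so the master never sees the query.
EXPLAIN SELECT COUNT(*) FROM categorylinks;  -- use the "rows" column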

>> Very minor referential integrity issues are considered acceptable in this case

> If the potential issues are known and accepted on purpose, then it is ok.

OTOH, it's entirely possible the code was originally written that way in error. All things being equal, it would certainly be better to use SELECT FOR UPDATE if that won't cause any problems. It's just not horrible if minor referential integrity issues happen.

> It is a workaround (index hints are evil for many reasons), but if that allows faster execution, so be it.

On my local test wiki I actually had to use FORCE INDEX to make it use the index in all cases (sometimes it decides that it's cheaper to do a full table scan and filesort the results). I'm assuming that's due to my local wiki (a) having a tiny categorylinks table and (b) running a really old version of MySQL. I assume that something like FORCE INDEX is not needed here.
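
For comparison, the heavier hint mentioned above would look something like this; a sketch only, not something proposed for the actual patch.

-- FORCE INDEX leaves the optimizer no choice of access path for
-- categorylinks, whereas STRAIGHT_JOIN only pins the join order (which
-- turned out to be sufficient on the production servers).
SELECT /* UpdateCollation::execute */ cl_from, cl_to, cl_sortkey_prefix
FROM categorylinks FORCE INDEX (cl_collation_ext)
JOIN page ON cl_from = page_id
ORDER BY cl_collation, cl_to, cl_type, cl_from
LIMIT 100;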

Anyhow, see the above two patches.

Change 288032 merged by jenkins-bot:
Use STRAIGHT_JOIN on updateCollation.php per jcrespo

https://gerrit.wikimedia.org/r/288032

Change 288036 merged by jenkins-bot:
use slave for row estimate in updateCollation.php

https://gerrit.wikimedia.org/r/288036

Language-related folks: This was brought up in today's Scrum of Scrums by Operations. Help if you can, please (they were unsure of who to ask specifically).

> Language-related folks: This was brought up in today's Scrum of Scrums by Operations. Help if you can, please (they were unsure of who to ask specifically).

What remains to be done? Since the patches were merged this morning, the next step at this stage would be to test again, which falls on Operations' shoulders.

@greg To me this looks like database usage optimization work, which isn't our area of expertise.

> Language-related folks: This was brought up in today's Scrum of Scrums by Operations. Help if you can, please (they were unsure of who to ask specifically).

> What remains to be done? Since the patches were merged this morning, the next step at this stage would be to test again, which falls on Operations' shoulders.

I am happy to help, since I'm the person working on what is blocked by this; how do you suggest we should test the script? I am pretty sure it's safe to run on any wiki without causing disruption even if I have to stop it abruptly; is that correct?

@Joe Yeah. I suggest trying ruwikisource, ruwiktionary and ruwiki per T129411, which need this script to be run anyway.

(The recently merged patches would have to be deployed first if you want to do this soon.)

In T58041#2288418, @Joe wrote:

>> Language-related folks: This was brought up in today's Scrum of Scrums by Operations. Help if you can, please (they were unsure of who to ask specifically).

>> What remains to be done? Since the patches were merged this morning, the next step at this stage would be to test again, which falls on Operations' shoulders.

> I am happy to help, since I'm the person working on what is blocked by this; how do you suggest we should test the script? I am pretty sure it's safe to run on any wiki without causing disruption even if I have to stop it abruptly; is that correct?

Yes, that's correct; the script can be killed at any point and it will not cause problems.

Basically, we just need to verify that the current version uses the correct cl_collation_ext index (note that the current version was committed just yesterday, so it's not yet in the currently deployed wmf branch). jcrespo was previously using the script on ruwiktionary when he encountered issues, so that would be the ideal project to test on.

Just run updateCollation.php --force on ruwiktionary to verify that:

  • Slave lag does not spike
  • Things don't seem super slow (e.g. it shouldn't be more than a second between the script outputting "Selecting next 100 rows..." and "processing...")

When I was reading the script earlier, I misread the code and thought it was waiting for slaves every 20 UPDATEs. It's actually doing it every 2000 UPDATEs (r97146). So if it's causing slave lag, we might need to increase the frequency of the wait-for-slaves call.

> jcrespo was previously using the script on ruwiktionary

I've never run this script anywhere; you are probably confusing me with someone else?

Change 288364 had a related patch set uploaded (by Brian Wolff):
Make updateCollation wait for slaves every 500 (instead of 2000)

https://gerrit.wikimedia.org/r/288364

>> jcrespo was previously using the script on ruwiktionary

> I've never run this script anywhere; you are probably confusing me with someone else?

My apologies. I misinterpreted T58041#2262583 as saying you were running it, but T129411 clarifies that it was rkaldari who was running it, and you were monitoring.

So sorry, looks like I made a real mess of this bug...

Anyway, @jcrespo indicated last time around that the process was causing some slave lag. I originally thought that didn't make sense, because when I read the script I thought it was calling wfWaitForSlaves() every 20 writes, which seemed extremely conservative. However, I fail at reading, and it was actually doing it every 2000, which actually seems kind of high. Hence yet another patch to the script, to bring that down to every 500 writes. (From what I understand, slave lag can also be caused by filesorting large tables, which was happening due to MySQL not using the index, so maybe it wasn't related to the number of updates. I don't really know enough about MySQL replication to say, but that patch definitely shouldn't hurt.)

So one more patch before being ready to test again.

@Bawolff thanks for the work on this; when you're done with the last patch, I'd say we should cherry-pick the changes to the current wmf branch and go on with SWATting them: the script in its current form is broken and needs fixing.

Reads on the master cannot cause lag. Large updates can, where large means "slow" (even if only a few records are updated); so can updates that generate too many writes at once. I am not really concerned by processes that take days to apply (I do those every day), but they must minimize disruption to users' writes. Lag is impossible to avoid (it exists by definition when you have asynchronous slaves), but we should check for it every few steps; 1000 updates or 1 second are the typical answers I give when asked how frequently it should be checked.

I think the index created is a good step towards making things faster as well, which is why I am giving all the support I can with the schema changes and recommendations, but I am not very involved with MediaWiki code itself. But feel free to ask and send reviews. I also offered to be present during the run of the script; those with an NDA should be familiar with tools such as https://tendril.wikimedia.org/chart?hosts=db1047&vars=seconds_behind_master&mode=value and https://tendril.wikimedia.org/report/slow_queries?host=^db1057&user=wikiuser&schema=wik&qmode=eq&query=&hours=1 . I know these tools are awkward, but I am working on making (some of) them available on graphite soon.
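
As an aside, this is roughly what the script's wfWaitForSlaves() call boils down to at the SQL level; the binlog file name and position below are made up for illustration.

-- On the master: capture the current binlog coordinates.
SHOW MASTER STATUS;  -- e.g. File: db1075-bin.002537, Position: 447821 (hypothetical)

-- On each replica: block until it has replayed up to those coordinates,
-- or until the timeout (here 10 seconds) expires.
SELECT MASTER_POS_WAIT('db1075-bin.002537', 447821, 10);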

Change 288364 merged by jenkins-bot:
Make updateCollation wait for slaves every 500 (instead of 2000)

https://gerrit.wikimedia.org/r/288364

Change 288384 had a related patch set uploaded (by Brian Wolff):
Make updateCollation wait for slaves every 500 (instead of 2000)

https://gerrit.wikimedia.org/r/288384

Change 288385 had a related patch set uploaded (by Brian Wolff):
Use STRAIGHT_JOIN on updateCollation.php per jcrespo

https://gerrit.wikimedia.org/r/288385

Change 288386 had a related patch set uploaded (by Brian Wolff):
use slave for row estimate in updateCollation.php

https://gerrit.wikimedia.org/r/288386

Change 288385 merged by jenkins-bot:
Use STRAIGHT_JOIN on updateCollation.php per jcrespo

https://gerrit.wikimedia.org/r/288385

Change 288386 merged by jenkins-bot:
use slave for row estimate in updateCollation.php

https://gerrit.wikimedia.org/r/288386

Change 288384 merged by jenkins-bot:
Make updateCollation wait for slaves every 500 (instead of 2000)

https://gerrit.wikimedia.org/r/288384

Change 288411 had a related patch set uploaded (by Brian Wolff):
Use STRAIGHT_JOIN on updateCollation.php per jcrespo

https://gerrit.wikimedia.org/r/288411

Change 288412 had a related patch set uploaded (by Brian Wolff):
Make updateCollation wait for slaves every 500 (instead of 2000)

https://gerrit.wikimedia.org/r/288412

Change 288413 had a related patch set uploaded (by Brian Wolff):
use slave for row estimate in updateCollation.php

https://gerrit.wikimedia.org/r/288413

Change 288411 merged by jenkins-bot:
Use STRAIGHT_JOIN on updateCollation.php per jcrespo

https://gerrit.wikimedia.org/r/288411

Change 288412 merged by jenkins-bot:
Make updateCollation wait for slaves every 500 (instead of 2000)

https://gerrit.wikimedia.org/r/288412

Change 288413 merged by jenkins-bot:
use slave for row estimate in updateCollation.php

https://gerrit.wikimedia.org/r/288413

In T58041#2288620, @Joe wrote:

> @Bawolff thanks for the work on this; when you're done with the last patch, I'd say we should cherry-pick the changes to the current wmf branch and go on with SWATting them: the script in its current form is broken and needs fixing.

@Joe OK, so all patches are merged and SWATted. The next step is to test the script again and see if it works quickly this time.

Just got back from vacation. I see you guys have been busy! I can go ahead and try running this on ruwiktionary and see if we have better luck this time...

Running on ruwiktionary now. It's flying through it!

@kaldari nice! So we can finally run it for every wiki once I upgrade the libicu version HHVM is linked to, I guess!

Remember to add a new request or reopen T130692 for the pending reindexing work.

Nemo_bis subscribed.

What is left to do in MediaWiki core? T130692 is a Wikimedia-servers task; it can't block a MediaWiki task.

In core, we now need to verify that the script is no longer prohibitively slow for large wikis. The easiest way to do that is to run it on a few large Wikimedia wikis, which is blocked by that task.

@jcrespo which wiki should we test the script on? I see from T130692 that the schema change has been applied to quite a few shards already.

I spoke with @jcrespo on IRC and we agreed on testing how long a forced run of the script will take on some databases in s2; I will start testing on ptwiki.

Mentioned in SAL [2016-05-23T10:32:29Z] <_joe_> running updateCollations.php --force on ptwiki, T58041

The script ran on ptwiki (around 9 M rows) with --force in less than 5 hours, and with no database impact.

In T58041#2318772, @Joe wrote:

> The script ran on ptwiki (around 9 M rows) with --force in less than 5 hours, and with no database impact.

Not bad. Assuming it scales linearly, enwiki with 105 M rows would take around two days, and frwiki with 33 M rows around 15 hours (and per T56680, it took over a week back in the day). I think that's no longer prohibitively slow, and we could close this bug?

kaldari claimed this task.
kaldari moved this task from Ready to Q1 2018-19 on the Community-Tech-Sprint board.