
Issue with maintenance script: SELECTing revisions with high rev_id is painfully slow
Closed, ResolvedPublic5 Estimated Story Points

Description

For T163562 we need to run the populateIpChanges.php maintenance script (T175105). The query it issues typically goes reasonably fast:

mysql:wikiadmin@db1084 [commonswiki]> EXPLAIN SELECT /* PopulateIpChanges::doDBUpdates */ rev_id, rev_timestamp, rev_user_text FROM `revision` WHERE rev_id >= 1000000 AND rev_user = 0 ORDER BY rev_id ASC LIMIT 200;
+------+-------------+----------+-------+------------------------+---------+---------+------+------+-------------+
| id   | select_type | table    | type  | possible_keys          | key     | key_len | ref  | rows | Extra       |
+------+-------------+----------+-------+------------------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | revision | index | PRIMARY,user_timestamp | PRIMARY | 4       | NULL | 7336 | Using where |
+------+-------------+----------+-------+------------------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)


mysql:wikiadmin@db1084 [commonswiki]> SELECT /* PopulateIpChanges::doDBUpdates */ rev_id, rev_timestamp, rev_user_text FROM `revision` WHERE rev_id >= 1000000 AND rev_user = 0 ORDER BY rev_id ASC LIMIT 200;
...
200 rows in set (0.32 sec)

But it runs very, very slowly once we reach revisions with a high rev_id:

mysql:wikiadmin@db1084 [commonswiki]> EXPLAIN SELECT /* PopulateIpChanges::doDBUpdates */ rev_id, rev_timestamp, rev_user_text FROM `revision` WHERE rev_id >= 75000000 AND rev_user = 0 ORDER BY rev_id ASC LIMIT 200;
+------+-------------+----------+-------+------------------------+---------+---------+------+------+-------------+
| id   | select_type | table    | type  | possible_keys          | key     | key_len | ref  | rows | Extra       |
+------+-------------+----------+-------+------------------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | revision | index | PRIMARY,user_timestamp | PRIMARY | 4       | NULL | 7335 | Using where |
+------+-------------+----------+-------+------------------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)


mysql:wikiadmin@db1084 [commonswiki]> SELECT /* PopulateIpChanges::doDBUpdates */ rev_id, rev_timestamp, rev_user_text FROM `revision` WHERE rev_id >= 75000000 AND rev_user = 0 ORDER BY rev_id ASC LIMIT 200;
...
200 rows in set (44.86 sec)

The EXPLAINs, however, are nearly identical.

See also https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=ab089fbcbcaaff33193f518ca8298ad4&host=family%3Adb1097&user=&schema=&hours=1 (not sure if this is a permalink)

The commonswiki script will probably finish within the next 12-18 hours, and the other group 1 wikis probably won't take forever. However, as you can imagine, this isn't going to work for enwiki. I'm not sure how to improve the query, or what a better approach would be.

Event Timeline

I'm not a developer, so my question may be stupid, but still: you need to collect the revisions for particular IP ranges. Why didn't you make the script check range by range instead of revision by revision, using revision_userindex? The selection should be much faster that way, and revisions made by registered users would not be checked at all, which would speed the process up even more.

@IKhitron revision_userindex is a made-up view that exists on the cloud wikireplicas only. Production has very similar content, but different objects in the database (columns, indexes, etc.), so it needs a bit of a separate assessment.

@MusikAnimal I will do some profiling and will get back to you.

I didn't understand almost a word. But if what you said was that revision_userindex doesn't have all the needed data, you could at least get the needed revision numbers from there, and then fetch the data from revision, indexing by those revision numbers.
Because if the problem is indeed high indexes, the indexes of IPs in revision_userindex are very low, as I remember.

@MusikAnimal So there are a couple of things here. MySQL may be confusing, but it is not the one at fault: it is executing the queries in the fastest way possible, and the rev_id used has little effect on the performance (although, indirectly, there could be a correlation between high and low numbers due to edit patterns); it is not the issue here.

The first thing is that rows in EXPLAIN is approximate (a guesstimate made before the query gets executed), intended to be good enough for index selection but not accurate for other measures. The estimate is exact when good indexes can be used, but as we will see, that is not the case here. We can get the real number of rows read (post-execution) by doing:

root@db1084[commonswiki]> FLUSH STATUS\G       
Query OK, 0 rows affected (0.00 sec)

root@db1084[commonswiki]> SELECT /* PopulateIpChanges::doDBUpdates */ rev_id, rev_timestamp, rev_user_text FROM `revision` WHERE rev_id >= 1000000 AND rev_user = 0 ORDER BY rev_id ASC LIMIT 200;
200 rows in set (0.31 sec)

root@db1084[commonswiki]> nopager
PAGER set to stdout
root@db1084[commonswiki]> SHOW STATUS like 'Hand%';
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 0      |
| Handler_icp_attempts       | 0      |
| Handler_icp_match          | 0      |
| Handler_mrr_init           | 0      |
| Handler_mrr_key_refills    | 0      |
| Handler_mrr_rowid_refills  | 0      |
| Handler_prepare            | 0      |
| Handler_read_first         | 1      |
| Handler_read_key           | 0      |
| Handler_read_last          | 0      |
| Handler_read_next          | 794383 |
| Handler_read_prev          | 0      |
| Handler_read_retry         | 0      |
| Handler_read_rnd           | 0      |
| Handler_read_rnd_deleted   | 0      |
| Handler_read_rnd_next      | 0      |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_tmp_update         | 0      |
| Handler_tmp_write          | 0      |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+
26 rows in set (0.18 sec)

(You may not have permissions to run FLUSH STATUS, but you can run the SHOW STATUS query twice, before and after the statement you are measuring, and subtract the two values, until we script it; a rough sketch of that is below.)
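
For illustration only, here is a minimal PHP/mysqli sketch of that before/after subtraction (the host, credentials and query are placeholders, not anything we run in production):

<?php
// Sketch only: estimate how many rows a query actually reads, without
// FLUSH STATUS, by sampling the session-level Handler_read_next counter
// before and after the query and subtracting the two values.
function handlerReadNext( mysqli $db ): int {
    $res = $db->query( "SHOW SESSION STATUS LIKE 'Handler_read_next'" );
    $row = $res->fetch_assoc();
    return (int)$row['Value'];
}

$db = new mysqli( 'db-host', 'user', 'password', 'commonswiki' );
$before = handlerReadNext( $db );
$db->query(
    "SELECT rev_id, rev_timestamp, rev_user_text FROM revision
     WHERE rev_id >= 75000000 AND rev_user = 0
     ORDER BY rev_id ASC LIMIT 200"
);
// The difference is roughly the rows the storage engine stepped through.
echo 'Rows read: ', handlerReadNext( $db ) - $before, "\n";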

That means roughly 800K rows were read to return just 200 results (not a good query).

For the second query we have:

root@db1084[commonswiki]> FLUSH STATUS;
Query OK, 0 rows affected (0.00 sec)

root@db1084[commonswiki]> pager cat > /dev/null
PAGER set to 'cat > /dev/null'
root@db1084[commonswiki]> SELECT /* PopulateIpChanges::doDBUpdates */ rev_id, rev_timestamp, rev_user_text FROM `revision` WHERE rev_id >= 75000000 AND rev_user = 0 ORDER BY rev_id ASC LIMIT 200;
200 rows in set (25.38 sec)

root@db1084[commonswiki]> nopager
PAGER set to stdout
root@db1084[commonswiki]> SHOW STATUS like 'Hand%';
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| Handler_commit             | 1        |
| Handler_delete             | 0        |
| Handler_discover           | 0        |
| Handler_external_lock      | 0        |
| Handler_icp_attempts       | 0        |
| Handler_icp_match          | 0        |
| Handler_mrr_init           | 0        |
| Handler_mrr_key_refills    | 0        |
| Handler_mrr_rowid_refills  | 0        |
| Handler_prepare            | 0        |
| Handler_read_first         | 1        |
| Handler_read_key           | 0        |
| Handler_read_last          | 0        |
| Handler_read_next          | 67197515 |
| Handler_read_prev          | 0        |
| Handler_read_retry         | 0        |
| Handler_read_rnd           | 0        |
| Handler_read_rnd_deleted   | 0        |
| Handler_read_rnd_next      | 0        |
| Handler_rollback           | 0        |
| Handler_savepoint          | 0        |
| Handler_savepoint_rollback | 0        |
| Handler_tmp_update         | 0        |
| Handler_tmp_write          | 0        |
| Handler_update             | 0        |
| Handler_write              | 0        |
+----------------------------+----------+
26 rows in set (0.13 sec)

Over 67 million rows scanned!

Why does this happen? Your query uses the PRIMARY index, which means it can filter efficiently by rev_id, but not by user: it has to walk the rows one by one to check the other condition (rev_user = 0). What is the problem with that? Your query is not deterministic in how many rows it will read to produce its results. It could read only 200 rows, or it could read the full table (as anonymous edits are very common, broadly half of the table). There is probably not going to be a better index: creating an index on (rev_user, rev_id), while possible, would not be very efficient, as it would only filter out about 50% of the rows. The way to solve this is to iterate over smaller, more predictable chunks, for example:

root@db1084[commonswiki]> SELECT /* PopulateIpChanges::doDBUpdates */ rev_id, rev_timestamp, rev_user_text FROM `revision` WHERE rev_id BETWEEN 1000000 AND 1500000 AND rev_user = 0 ORDER BY rev_id ASC LIMIT 200;
200 rows in set (0.01 sec)

root@db1084[commonswiki]> SELECT /* PopulateIpChanges::doDBUpdates */ rev_id, rev_timestamp, rev_user_text FROM `revision` WHERE rev_id BETWEEN 75000000 AND 75500000 AND rev_user = 0 ORDER BY rev_id ASC LIMIT 200;
200 rows in set (0.02 sec)

You may get fewer than 200 results, in which case you jump to the next chunk; if you get 200, then you use the last id returned (+1) as the new starting point, since BETWEEN is inclusive. End when you reach max(rev_id). This will take more work, but each query will have a bounded cost. (A rough sketch of the loop follows.)
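
In maintenance-script PHP, the loop could look roughly like this (just a sketch, not the actual script code; the chunk size, batch size and the copy into ip_changes are placeholders):

<?php
// Sketch of the chunked iteration (MediaWiki maintenance-script context assumed).
$dbr = $this->getDB( DB_REPLICA, 'vslow' ); // read from the vslow group, per the reminder below
$maxRevId = (int)$dbr->selectField( 'revision', 'MAX(rev_id)', '', __METHOD__ );

$chunkSize = 500000; // width of each rev_id window; tune as needed
$batchSize = 200;
$start = 0;
while ( $start <= $maxRevId ) {
    $end = $start + $chunkSize - 1;
    $res = $dbr->select(
        'revision',
        [ 'rev_id', 'rev_timestamp', 'rev_user_text' ],
        [ "rev_id BETWEEN $start AND $end", 'rev_user' => 0 ],
        __METHOD__,
        [ 'ORDER BY' => 'rev_id ASC', 'LIMIT' => $batchSize ]
    );
    $lastId = null;
    foreach ( $res as $row ) {
        // ... copy $row into ip_changes here ...
        $lastId = (int)$row->rev_id;
    }
    if ( $res->numRows() < $batchSize ) {
        $start = $end + 1;     // chunk exhausted: move to the next window
    } else {
        $start = $lastId + 1;  // full batch: resume just after the last id seen
    }
}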

There are other things we could do by playing with the contributions slaves and their partitioning/slightly different indexes, but this would be my first recommendation, as it works without schema changes and on all slaves.

As a reminder, if you are planning to run maintenance scripts, remember to read from the "vslow" slaves so that main query traffic is not affected.

Change 378480 had a related patch set uploaded (by MusikAnimal; owner: MusikAnimal):
[mediawiki/core@master] Speed up populateIpChanges maintenance script.

https://gerrit.wikimedia.org/r/378480

Many thanks for your help, @jcrespo! That approach is actually what I started with, but I saw it wasn't doing a full 200 revisions at a time so I reworked it. Always go with your gut instinct, so they say =P

@IKhitron Thanks for your input :) To make sure it's clear, we're not querying for ranges with this maintenance script. Instead we're copying every single revision made by an IP on a given wiki to the new ip_changes table. This is on production, not the replicas, so there is no revision_userindex. Instead we have the core tables as defined in tables.sql. Also the term "index" here refers to database indexing, as opposed to rev_id which serves as the primary key for the revision table.

@MaxSem Could we kill the currently running script? It's now taking over a minute to copy each block of 200 edits, and we still have another 400,000 or so edits to go :/ I'm guessing running these super slow queries on DB_MASTER isn't so great.

I don't know how the deployment script you used works, but hopefully you can tell it to start deploying to wikis following Commons? Then for Commons, we can just run it manually and pass in the rev-id of wherever it left off.

Not really... But as you've subclassed LoggedUpdateMaintenance, it should very quickly skip the wikis it's already finished on...

For example

reedy@tin:~$ mwscript populateIpChanges.php aawiki
Copying IP revisions to ip_changes, from rev_id 0 to rev_id 4447
...checking 200 revisions for IP edits that need copying, starting with rev_id 0
...checking 200 revisions for IP edits that need copying, starting with rev_id 4059
135 IP revisions copied.
reedy@tin:~$ mwscript populateIpChanges.php aawiki
...Update 'populate ip_changes' already logged as completed.

Got it. Well maybe we should kill the current script, run it just on Commons passing in rev-id (and max-rev-id, probably), then run the group-wide deployment script thing. That way it will simply skip Commons.
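
I'm thinking something along these lines (option names assumed from the above; <last-rev-id> being wherever the current run left off):

mwscript populateIpChanges.php commonswiki --rev-id=<last-rev-id> --max-rev-id=<commons-max-rev-id>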

Change 378480 merged by jenkins-bot:
[mediawiki/core@master] Speed up populateIpChanges maintenance script.

https://gerrit.wikimedia.org/r/378480

Change 378736 had a related patch set uploaded (by Reedy; owner: MusikAnimal):
[mediawiki/core@wmf/1.30.0-wmf.18] Speed up populateIpChanges maintenance script.

https://gerrit.wikimedia.org/r/378736

Change 378736 merged by jenkins-bot:
[mediawiki/core@wmf/1.30.0-wmf.18] Speed up populateIpChanges maintenance script.

https://gerrit.wikimedia.org/r/378736

Mentioned in SAL (#wikimedia-operations) [2017-09-18T16:40:04Z] <reedy@tin> Synchronized php-1.30.0-wmf.18/maintenance/populateIpChanges.php: perf improvements T175962 (duration: 00m 46s)

Change 378798 had a related patch set uploaded (by MusikAnimal; owner: MusikAnimal):
[mediawiki/core@master] Attempt to fix populateIpChanges maintenance script

https://gerrit.wikimedia.org/r/378798

Change 378798 merged by jenkins-bot:
[mediawiki/core@master] Attempt to fix populateIpChanges maintenance script

https://gerrit.wikimedia.org/r/378798

Change 379057 had a related patch set uploaded (by MaxSem; owner: MusikAnimal):
[mediawiki/core@wmf/1.30.0-wmf.19] Attempt to fix populateIpChanges maintenance script

https://gerrit.wikimedia.org/r/379057

Change 379057 merged by jenkins-bot:
[mediawiki/core@wmf/1.30.0-wmf.19] Attempt to fix populateIpChanges maintenance script

https://gerrit.wikimedia.org/r/379057

Change 379142 had a related patch set uploaded (by MusikAnimal; owner: MusikAnimal):
[mediawiki/core@master] Yet another attempt to fix the populateIpChanges script

https://gerrit.wikimedia.org/r/379142

Hi, @MusikAnimal, can I ask you something? You're trying again and again. Good luck, Commons is a very large wiki indeed. But can you skip it for now and run the rest of the wikis? Our wiki should run a couple of hours after Commons, but waiting for Commons is taking a lot of time. Thank you.

It wasn't working for any wiki, except those in group 0. But not to worry, I think we've got it figured out this time! Thanks for your patience.

I meant that running the small wikis slowly could take a lot of time, but they would still finish eventually. Great, I hope your ideas work out. Thank you very much.

Change 379142 merged by jenkins-bot:
[mediawiki/core@master] Yet another attempt to fix the populateIpChanges script

https://gerrit.wikimedia.org/r/379142

Change 379271 had a related patch set uploaded (by MaxSem; owner: MusikAnimal):
[mediawiki/core@wmf/1.30.0-wmf.19] Yet another attempt to fix the populateIpChanges script

https://gerrit.wikimedia.org/r/379271

Change 379271 merged by jenkins-bot:
[mediawiki/core@wmf/1.30.0-wmf.19] Yet another attempt to fix the populateIpChanges script

https://gerrit.wikimedia.org/r/379271

Mentioned in SAL (#wikimedia-operations) [2017-09-20T19:02:15Z] <catrope@tin> Finished scap: Patches for T176302, T175962, T173533 (duration: 21m 29s)

Change 379346 had a related patch set uploaded (by Jforrester; owner: MusikAnimal):
[mediawiki/core@REL1_30] Attempt to fix populateIpChanges maintenance script

https://gerrit.wikimedia.org/r/379346

Hi, @MusikAnimal, FYI: today hewiki, the first group 1 wiki, has at least part of the data ready. I hope it has all of it, but I can't check this.

Yes, it should all be there. Let me know if anything looks wrong. Thanks

Thank you very much for your work.

Btw, are you going to add this table to labs?

kaldari claimed this task.
kaldari subscribed.

This seems to be resolved now. Script is running on group2.

kaldari set the point value for this task to 5 (Sep 26 2017, 11:22 PM).
kaldari moved this task from Ready to Q1 2018-19 on the Community-Tech-Sprint board.

Change 381387 had a related patch set uploaded (by Legoktm; owner: MusikAnimal):
[mediawiki/core@REL1_30] Yet another attempt to fix the populateIpChanges script

https://gerrit.wikimedia.org/r/381387

Change 379346 merged by jenkins-bot:
[mediawiki/core@REL1_30] Attempt to fix populateIpChanges maintenance script

https://gerrit.wikimedia.org/r/379346

Change 381387 merged by jenkins-bot:
[mediawiki/core@REL1_30] Yet another attempt to fix the populateIpChanges script

https://gerrit.wikimedia.org/r/381387

Btw, are you going to add this table to labs?

See T173891