DBA review of purgeOldLogIPData.php
Closed, ResolvedPublic

Description

I'd like the DBAs to have a look at https://phabricator.wikimedia.org/diffusion/EABF/browse/master/maintenance/purgeOldLogIPData.php and check if it is safe to run on production. We're in the process of making IP data avalaible to CheckUsers, but data older than 90 days has to be removed per our data retention policies. Granted, AbuseFilter has been nonetheless storing that information indefinitelly since it got installed in the abuse_filter_log table (afl_ip field) so we need to clean those rows before enabling access and, later, have the script run periodically which can be done via a puppet cron job.

The first run Wikimedia wiki will involve tens of thousands of rows to update, so I'm guessing that a deployment window should be scheduled on Wikitech to do the first run. However, I'd like to check with you if special precautions should be taken prior and during the running of this script. Would DB overload, do you need to backup the data first, etc?

Thanks.

Huji triaged this task as High priority.Feb 11 2018, 12:11 AM

Considering that we have been storing IP data in abuse_filter_log for a very long time (i.e. WMF has violated its own retention policy), I am assigning a high priority here.

Please also see my comment at T186870#3960970 which, if calculated correctly, indicates we will be dealing with an update of hundreds of millions of rows. Yet another reason for this to be High priority.

Yeah. I guess I got short in my counter:

MariaDB [enwiki_p]> select count(*) from abuse_filter_log;
+----------+
| count(*) |
+----------+
| 20402866 |
+----------+
1 row in set (23.06 sec)

of those, are older than 90 days:

MariaDB [enwiki_p]> select count(*) from abuse_filter_log where afl_timestamp<='20171113000000';                                                                                                                  +----------+
| count(*) |
+----------+
| 19714424 |
+----------+
1 row in set (6.70 sec)

Data increases on enwiki some sort between 8 and 9 thousands of log entries per day.

Spanish WIkipedia data:

MariaDB [eswiki_p]> select count(*) from abuse_filter_log;
+----------+
| count(*) |
+----------+
|  8485964 |
+----------+
1 row in set (5.30 sec)

MariaDB [eswiki_p]> select count(*) from abuse_filter_log where afl_timestamp<='20171113000000';
+----------+
| count(*) |
+----------+
|  8103122 |
+----------+
1 row in set (9.83 sec)
jcrespo added a subscriber: jcrespo.EditedFeb 11 2018, 10:00 PM

Please set an ORDER BY, the LIMIT without an order by can lead to different results on masters and replicas- while you can argue that the same thing will be eventually deleted even if the query has a different order on different slaves, but there is always the risk of, in combination with other queries, to break replication. This idea of "undeterministic queries are in general banned from mediawiki" is a policy: https://www.mediawiki.org/wiki/Development_policy#Database_policy unless it can be justified for exceptional reasons.

The following is just advice, but it will help speed up the process- getting a sample query at the beginning and the end for the worse case scenario and running EXPLAIN on it will make sure it has the right performance; but you probably can do it faster-- tell me if you need help. We should make sure each write is fast (e.g. it only has to read $batch rows , not the whole table on each query. This is not a problem for us- if a transaction takes more than X amount of time, it is killed- but simply a heads up, which takes very little to be checked in advance. If the whole run is going to take more than 1 hour, please schedule it on Deployments ("week of" section) so it doesn't overlap with other ongoing maintenance. Our preference would be, for write to be run on Wednesdays, just after the weekly full backup is done, but that is a minor issue.

Last but not least- there has was las week some issues with wfWaitForSlaves(); lately (I think, I am not 100% sure), I please make sure the issues with that have been fully resolved, don't add them here, but Demon or Aaron may have more idea on that- if it has been fully fixed.

In addition to what Jaime said, I would suggest we run it (if possible) on s5 or s6 just for start, to try to isolate things in case there are issues before running it on big wikis.

@Marostegui You mean foreachwikiindblist s5 extensions/AbuseFilter/maintenance/purgeOldLogIPData.php, then s6 and so on?

@Marostegui You mean foreachwikiindblist s5 extensions/AbuseFilter/maintenance/purgeOldLogIPData.php, then s6 and so on?

Yeah, at least for the first run. If it goes fine on s5 and s6, we can probably go ahead and run it across all the wikis.
Maybe I am being to careful here, but with the amount of rows we have to delete, better be safe than sorry I guess :)

Reedy added a subscriber: Reedy.Feb 12 2018, 10:48 AM

Please set an ORDER BY, the LIMIT without an order by can lead to different results on masters and replicas- while you can argue that the same thing will be eventually deleted even if the query has a different order on different slaves, but there is always the risk of, in combination with other queries, to break replication. This idea of "undeterministic queries are in general banned from mediawiki" is a policy: https://www.mediawiki.org/wiki/Development_policy#Database_policy unless it can be justified for exceptional reasons.

The script doesn't actually do a delete, it updates the row and blanks the column.

So it's getting a batch (of 200 as defined initially in the script), and then setting afl_ip = '' on the rows with those ids that are pulled...

I'm presuming the ORDER BY is still wanted, so patch incoming for that

Change 409839 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/extensions/AbuseFilter@master] Add ORDER BY to purgeOldLogIPData select query

https://gerrit.wikimedia.org/r/409839

MarcoAurelio closed this task as Resolved.Feb 12 2018, 11:16 PM

Given the recent events I think this review is now done. Closing. If mistaken, please reopen.

Teles added a subscriber: Teles.Feb 15 2018, 7:02 PM

Change 409839 abandoned by Reedy:
Add ORDER BY to purgeOldLogIPData select query

Reason:
However, the index isn't there on all WMF wikis (especially not enwiki, which was the slooooowest). But htat's a different task that's been filed

https://gerrit.wikimedia.org/r/409839