AbuseFilter should actively prune old IP data
Closed, ResolvedPublic

Description

In CheckUser we have a config variable called $wgCUDMaxAge which is both used in the maintenance script maintenance/purgeOldData.php to remove old IP data in batch (once they are beyond the maximum age based on data retention policy). It is also used in CheckUser.hooks.php inside the maybePruneIPData() method, where in 1 out of every 50 queries it is used to also prune old CU data in an active fashion.

A similar config variable already exists for AbuseFilter, called $wgAbuseFilterLogIPMaxAge and it is used maintenance/purgeOldLogIPData.php. However, a mechanism similar to maybePruneIPData() does not exist for AbuseFilter.

There has not been a huge demand for it because AbuseFilter's IP data is not available to any user group on WMF wikis as of now. But once T152934 and T160357 are resolved, it will become more important to have an active purging mechanism.

At the same time, while CheckUser is rarely executed, AbuseFilter is very frequently executed; on one hand, it means that active pruning could take care of almost all of the purging for AbuseFilter IP data, but on the other hand, it could mean that the impact of it on processing time can be more pronounced.

Huji created this task.Oct 27 2017, 2:09 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 27 2017, 2:09 AM

Okay some sort of update after the comments here and there:

  • maintenance/purgeOldData.php has been fixed, it apparently works.
  • cron on Wikimedia Puppet running daily was already there, now that the script is fixed it should work again (purge_abusefilter.pp)
  • we've started logging the runs of this script from now on to avoid not noticing when the script breaks again, cf. https://gerrit.wikimedia.org/r/#/c/410072/

@Dzahn Hi. Can we check the logs and see how the run of the script for today was? Thanks!

This seems to be ongoing, but probably will take a long time to finish for its first execution. This is actually desirable- we want this to happen, but not as fast that affects other writes.

Glad to know it's happening. I expected that this would take some time. Two years of old data to purge on 700+ wikis needs its time :)

The selects seem to have bad performance >30 seconds to execute each batch of 200- this probably will need optimization, but I will wait to see if it is only due to the large amount of rows pending or something else.

Hits 	Tmax 	Tavg 	Tsum 	Hosts 	Users 	Schemas
16	42	38	608	db1052	wikiadmin	enwiki
SELECT /* PurgeOldLogIPData::execute www-data@terbiu... */ afl_id FROM `abuse_filter_log` WHERE (afl_ip <> "") AND (afl_timestamp < '20171115041336') LIMIT 200 /* 3c556baa448ba4b4d671c3107e574c95 db1052 enwiki 33s */

We should also defragment those tables after purge to recover filesystem space.

Bawolff added a subscriber: Bawolff.

It was fast on Beta. Probably due to the smaller db and data to clean...

Dzahn has reduced the frequency of the script to run weekly instead of daily for now, until the script is optimized; otherwise the script won't even finish. Once the script is optimized, we can consider enabling it again to run daily.

The selects seem to have bad performance

Should I open a wew task for this?

Reedy added a subscriber: Reedy.EditedFeb 14 2018, 8:45 AM

It seems to be getting through enwiki at a rate of 1M per every 12-24H (wasn't paying exact attention). It's upto early November 2016 atm

I've cleared the backlog on the rest of the wikis; running through a reversed all.dblist as far as enwiki. They're all done

CREATE TABLE /*$wgDBprefix*/abuse_filter_log (
	afl_id BIGINT unsigned NOT NULL AUTO_INCREMENT,
	afl_filter varchar(64) binary NOT NULL,
	afl_user BIGINT unsigned NOT NULL,
	afl_user_text varchar(255) binary NOT NULL,
	afl_ip varchar(255) not null,
	afl_action varbinary(255) not null,
	afl_actions varbinary(255) not null,
	afl_var_dump BLOB NOT NULL,
	afl_timestamp binary(14) NOT NULL,
	afl_namespace int NOT NULL,
	afl_title varchar(255) binary NOT NULL,
	afl_wiki varchar(64) binary NULL,
	afl_deleted tinyint(1) NOT NULL DEFAULT 0,
	afl_patrolled_by int unsigned NULL,
	afl_rev_id int unsigned,
	afl_log_id int unsigned,

	PRIMARY KEY (afl_id),
	KEY filter_timestamp (afl_filter,afl_timestamp),
	KEY user_timestamp (afl_user,afl_user_text,afl_timestamp),
	KEY (afl_timestamp),
	KEY page_timestamp (afl_namespace, afl_title, afl_timestamp),
	KEY ip_timestamp (afl_ip, afl_timestamp),
	KEY (afl_rev_id),
	KEY (afl_log_id),
	KEY wiki_timestamp (afl_wiki, afl_timestamp)
) /*$wgDBTableOptions*/;

So, we have an index explicitly matched for this... KEY ip_timestamp (afl_ip, afl_timestamp), but at least on one tested host, it's picking afl_ip... Which doesn't exit as an index above.

mysql:wikiadmin@db1066 [enwiki]> EXPLAIN SELECT afl_id FROM abuse_filter_log WHERE afl_ip <> "" AND afl_timestamp < '20171116083956' LIMIT 200;
+------+-------------+------------------+-------+----------------------+--------+---------+------+---------+--------------------------------------------------------+
| id   | select_type | table            | type  | possible_keys        | key    | key_len | ref  | rows    | Extra                                                  |
+------+-------------+------------------+-------+----------------------+--------+---------+------+---------+--------------------------------------------------------+
|    1 | SIMPLE      | abuse_filter_log | range | afl_timestamp,afl_ip | afl_ip | 257     | NULL | 8189781 | Using index condition; Using where; Rowid-ordered scan |
+------+-------------+------------------+-------+----------------------+--------+---------+------+---------+--------------------------------------------------------+
1 row in set (0.12 sec)

So if we have a look what indexes *do* exist on enwiki at least...

mysql:wikiadmin@db1066 [enwiki]> show indexes from abuse_filter_log;
+------------------+------------+----------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table            | Non_unique | Key_name       | Seq_in_index | Column_name   | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------------+------------+----------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| abuse_filter_log |          0 | PRIMARY        |            1 | afl_id        | A         |    19022925 |     NULL | NULL   |      | BTREE      |         |               |
| abuse_filter_log |          1 | afl_timestamp  |            1 | afl_timestamp | A         |    19022925 |     NULL | NULL   |      | BTREE      |         |               |
| abuse_filter_log |          1 | afl_namespace  |            1 | afl_namespace | A         |        4264 |     NULL | NULL   |      | BTREE      |         |               |
| abuse_filter_log |          1 | afl_namespace  |            2 | afl_title     | A         |     6340975 |     NULL | NULL   |      | BTREE      |         |               |
| abuse_filter_log |          1 | afl_ip         |            1 | afl_ip        | A         |     2377865 |     NULL | NULL   |      | BTREE      |         |               |
| abuse_filter_log |          1 | afl_rev_id     |            1 | afl_rev_id    | A         |     9511462 |     NULL | NULL   | YES  | BTREE      |         |               |
| abuse_filter_log |          1 | afl_log_id     |            1 | afl_log_id    | A         |           2 |     NULL | NULL   | YES  | BTREE      |         |               |
| abuse_filter_log |          1 | user_timestamp |            1 | afl_user      | A         |     6340975 |     NULL | NULL   |      | BTREE      |         |               |
| abuse_filter_log |          1 | user_timestamp |            2 | afl_user_text | A         |    19022925 |     NULL | NULL   |      | BTREE      |         |               |
| abuse_filter_log |          1 | user_timestamp |            3 | afl_timestamp | A         |    19022925 |     NULL | NULL   |      | BTREE      |         |               |
| abuse_filter_log |          1 | afl_filter     |            1 | afl_filter    | A         |       56615 |     NULL | NULL   |      | BTREE      |         |               |
| abuse_filter_log |          1 | afl_filter     |            2 | afl_timestamp | A         |    19022925 |     NULL | NULL   |      | BTREE      |         |               |
+------------------+------------+----------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
12 rows in set (0.00 sec)

Looks like https://github.com/wikimedia/mediawiki-extensions-AbuseFilter/blob/master/db_patches/patch-fix-indexes.sql hasn't been applied...

filter_timestamp, user_timestamp, page_timestamp and ip_timestamp should be added. afl_filter, afl_user, afl_namespace and afl_ip should've been dropped.

Will file a task

Reedy closed this task as Resolved.Feb 15 2018, 9:41 AM
Reedy claimed this task.

Even enwiki is up to date now!

And what about the >30 secs thing Jaime mentioned above? T179131#3966793 -- would that continue to happen? I don't think it should take such a large time...

And what about the >30 secs thing Jaime mentioned above? T179131#3966793 -- would that continue to happen? I don't think it should take such a large time...

When I ran the script again this morning, it pruned a few thousand rows, and it wasn't waiting 30s+ per batch

I'll run it again in about 24 hours and time how long the whole thing takes

Goat Great 👍🏻

Ok, so the first batch was a bit slow (not really a surprise), but... 7190 updates in 1 minute and 10 seconds doesn't seem bad going to me.

And that's about the needed amount of purging for 24 hours on enwiki

reedy@terbium:~$ date & time mwscript extensions/AbuseFilter/maintenance/purgeOldLogIPData.php --wiki=enwiki && date
[1] 28759
Fri Feb 16 10:08:35 UTC 2018
Purging old IP Address data from abuse_filter_log...
200
400
600
800
1000
1200
1400
1600
1800
2000
2200
2400
2600
2800
3000
3200
3400
3600
3800
4000
4200
4400
4600
4800
5000
5200
5400
5600
5800
6000
6200
6400
6600
6800
7000
7190
7190 rows.
Done.
[1]+  Done                    date

real    1m9.710s
user    0m0.404s
sys     0m0.088s
Fri Feb 16 10:09:45 UTC 2018
reedy@terbium:~$

I'd say this is a pretty good working, yes.