|Resolved||Jalexander||T160357 Allow those with CheckUser right to access AbuseLog private information on WMF projects|
|Resolved||Reedy||T179131 AbuseFilter should actively prune old IP data|
|Resolved||MarcoAurelio||T186870 Purge old IP data from AbuseFilter on the Beta Cluster|
I chose beta eswiki since it is closed, so I guess I can test more safely there. The test plan consisted of:
First: query how many rows we have with private data:
wikiadmin@deployment-db04[eswiki]> select count(afl_ip) from abuse_filter_log;
+---------------+
| count(afl_ip) |
+---------------+
|          1695 |
+---------------+
1 row in set (0.00 sec)
Second: look at the oldest and newest AbuseFilter log entries with
wikiadmin@deployment-db04[eswiki]> select afl_id, afl_timestamp, afl_ip from abuse_filter_log order by afl_timestamp desc;
The oldest is from 20140718015832 and the newest is from 20170425131421. That means every row is older than 90 days, so all afl_ip data should go.
Third: run the script:
maurelio@deployment-tin:~$ mwscript extensions/AbuseFilter/maintenance/purgeOldLogIPData.php --wiki=eswiki
Purging old IP Address data from abuse_filter_log...
200 400 600 800 1000 1200 1400 1600 1695
1695 rows.
Done.
Fourth: check whether the data is really gone:
wikiadmin@deployment-db04[eswiki]> select afl_id, afl_timestamp, afl_ip from abuse_filter_log order by afl_timestamp desc;
This shows no data in the afl_ip field.
So I guess the script works as expected.
Is that method fine?
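The 90-day reasoning in the second step can be double-checked mechanically. The following is only a rough sketch: the helper names `cutoff_ts` and `is_purgeable` are made up for illustration, the 90-day figure is simply the one quoted above, and `cutoff_ts` assumes GNU `date` (the `-d` option is not available in BSD `date`). It relies on the fact that the 14-digit timestamps shown above (YYYYMMDDHHMMSS) can be compared as plain integers.

```shell
# Hypothetical helpers, not part of AbuseFilter or the WMF tooling.

# Print the cutoff, N days back from now, in the 14-digit
# YYYYMMDDHHMMSS format used by afl_timestamp. Assumes GNU date.
cutoff_ts() {
    date -u -d "$1 days ago" +%Y%m%d%H%M%S
}

# Succeed when timestamp $1 is strictly older than cutoff $2.
# Both arguments are 14-digit timestamps, so an integer comparison works.
is_purgeable() {
    [ "$1" -lt "$2" ]
}
```

For example, `is_purgeable 20170425131421 "$(cutoff_ts 90)" && echo "newest row is past the window"` checks the newest entry found above; if the newest row qualifies, every older row does too.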
Also, I'm not sure what is possible here: running this manually on all Beta Cluster wikis is more or less affordable, but doing so on all WMF production wikis would be a pain. Apparently foreachwikiindblist 'all-labs.dblist' <script here> expects a maintenance script from the MediaWiki core maintenance folder, not one from an extension...
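If foreachwikiindblist really cannot take an extension script, one fallback would be a plain loop over the dblist. This is an untested sketch under assumptions: `print_purge_commands` is a made-up helper, and it assumes the dblist is a plain text file with one database name per line. It only prints the commands instead of running them, so the list can be reviewed first.

```shell
# Hypothetical helper, not part of MediaWiki or the WMF tooling.
# Assumption: the dblist is a text file with one database name per line.
SCRIPT='extensions/AbuseFilter/maintenance/purgeOldLogIPData.php'

print_purge_commands() {
    dblist=$1
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r wiki || [ -n "$wiki" ]; do
        # Skip blank lines and '#' comment lines, in case the list has any.
        case $wiki in
            ''|'#'*) continue ;;
        esac
        # Print the same invocation used for eswiki above, one per wiki.
        printf 'mwscript %s --wiki=%s\n' "$SCRIPT" "$wiki"
    done < "$dblist"
}
```

Running `print_purge_commands all-labs.dblist` would print one mwscript line per wiki. Printing rather than executing is deliberate: the output can be split into batches or handed to whoever schedules the run.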
I'm not a Wikimedia sysadmin, so I might be wrong. For production wikis, I assume the first run should be scheduled with @greg and done in batches of wikis (start with small.dblist and so on, or some other method). In the first run we'll have tens of thousands of entries to clear Wikimedia-wide (AF has been logging there since 2012?), which will take some time and might disrupt the DBs. Further runs should be scheduled as a cron job in Puppet IMHO, and they should not be that heavy even if we run it daily. To be safe, before running the script in production I'd get a DBA review of the script and co-schedule a deployment window so we don't trap our fingers.
However, we should either disable afl_ip logging on the Beta Cluster, restrict who can access that info, or both. In any case, a Puppet cron job should be set up there to do this regularly.
That said, WMF production has 2 long years of cleaning to do now that we have fixed the $this->requireExtension thing. If we were to run this manually, I'd say run it on terbium, since it'll be a long-running script. However, the cron job should now be fixed, so it should run today at 01:15 UTC. I'll ask around and see whether, at least for this round, they could keep the logs for double-checking.