Page MenuHomePhabricator

Purge all old data from EventLogging master
Closed, ResolvedPublic13 Story Points

Description

T156933 is about intelligently purging/redacting PII data from the EventLogging MySQL slave databases. We also need to ensure that old data is removed from the master database (db1046). This is more for storage purposes than for privacy ones.

I see that the eventlogging_sync.sh replication script used to have a very crude way of doing this. We should make a simple but comprehensive version that is separate from the replication script.

This should work for EventLogging Analytics tables, as well as the eventbus ones that I am about to start importing into MySQL for T150369.

Event Timeline

Ottomata created this task.Jun 20 2017, 3:57 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 20 2017, 3:57 PM

eventlogging_cleaner.py should be capable of purging db1046 (simply not doing any whitelisting/update-with-nulls) but we can have another script if simpler.

+1. Seems that we should be able to use the same script ignoring whitelisting.

Just added to the eventlogging_cleaner script the --no-whitelist parameter to support this use case.

Nuria edited projects, added Analytics-Kanban; removed Analytics.Jun 23 2017, 4:04 PM
mforns renamed this task from Purge all old data from master to Purge all old data from EventLogging master.Jul 11 2017, 12:31 PM
elukey edited projects, added Analytics; removed Analytics-Kanban.Aug 3 2017, 3:52 PM
Nuria moved this task from Incoming to Dashiki on the Analytics board.Aug 7 2017, 4:16 PM
fdans moved this task from Dashiki to Backlog (Later) on the Analytics board.Oct 2 2017, 4:05 PM
elukey edited projects, added Analytics-Kanban; removed Analytics.Dec 20 2017, 1:28 PM

Logged activity on the parent task instead of this one :)

https://gerrit.wikimedia.org/r/398869
https://gerrit.wikimedia.org/r/399149
https://gerrit.wikimedia.org/r/399153

Mentioned in SAL (#wikimedia-operations) [2017-12-19T16:34:39Z] <elukey> manually started eventlogging cleaner on db1107 to purge/sanitize data up to 90 days ago (tmux is running for user eventlogcleaner) - T108850

So the first clean up run (Jan 1st 2009 to 90 days ago) is in progress now and it will last some days. The daily purging cron has been installed.

elukey claimed this task.Dec 20 2017, 1:31 PM
elukey moved this task from Next Up to Ready to Deploy on the Analytics-Kanban board.

First cleanup completed! From now on the cron script should purge data daily as it happens on db1108.

The only registered warnings are the following:

eventlogcleaner@db1107:~$ /usr/bin/flock --verbose -n /var/lock/eventlogging_cleaner /usr/local/bin/eventlogging_cleaner --whitelist /etc/eventlogging/whitelist.tsv --older-than 90 --start-ts-file /var/run/eventlogging_cleaner --batch-size 10000 --sleep-between-batches 2>> /var/log/eventlogging/eventlogging_cleaner.log
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'event_userId' cannot be null")
  result = self._query(query)
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'event_recipientUserId' cannot be null")
  result = self._query(query)
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'event_pageTitle' cannot be null")
  result = self._query(query)
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'event_readMoreList' cannot be null")
  result = self._query(query)
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'event_findText' cannot be null")
  result = self._query(query)
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'event_pageID' cannot be null")
  result = self._query(query)
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'event_appInstallID' cannot be null")
  result = self._query(query)
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'event_source' cannot be null")
  result = self._query(query)
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'event_imageTitle' cannot be null")
  result = self._query(query)
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'event_fromSaved' cannot be null")
  result = self._query(query)
/usr/lib/python3/dist-packages/pymysql/cursors.py:166: Warning: (1048, "Column 'userAgent' cannot be null")
  result = self._query(query)

@mforns before closing the task let's discuss the above warnings :)

elukey added a comment.Jan 1 2018, 2:48 PM

I think that some tables don't have the fields that need to be NULLED with IS_NULLABLE = yes, so these warnings pop up. IIRC we had the same issue with the log db when we tried to purge a slave db in the past.

Example:

MariaDB [(none)]> SELECT DISTINCT TABLE_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME IN ('event_userId') AND IS_NULLABLE='NO' AND TABLE_SCHEMA='log';
+---------------------------------------------------+
| TABLE_NAME                                        |
+---------------------------------------------------+
| CentralAuth_5690875                               |
| EchoInteraction_15065746
[..cut..]

MariaDB [(none)]> describe log.EchoInteraction_15065746;
+------------------------+--------------+------+-----+---------+-------+
| Field                  | Type         | Null | Key | Default | Extra |
+------------------------+--------------+------+-----+---------+-------+
| uuid                   | char(32)     | NO   | PRI | NULL    |       |
| clientIp               | varchar(191) | YES  |     | NULL    |       |
| timestamp              | varchar(14)  | NO   | MUL | NULL    |       |
| userAgent              | varchar(191) | YES  |     | NULL    |       |
| webHost                | varchar(191) | YES  |     | NULL    |       |
| wiki                   | varchar(191) | NO   |     | NULL    |       |
| event_action           | varchar(191) | NO   |     | NULL    |       |
| event_context          | varchar(191) | YES  |     | NULL    |       |
| event_editCount        | bigint(20)   | YES  |     | NULL    |       |
| event_eventId          | bigint(20)   | YES  |     | NULL    |       |
| event_mobile           | tinyint(1)   | YES  |     | NULL    |       |
| event_notifWiki        | varchar(191) | YES  |     | NULL    |       |
| event_notificationType | varchar(191) | YES  |     | NULL    |       |
| event_userId           | bigint(20)   | NO   |     | NULL    |       |
| event_version          | varchar(191) | NO   |     | NULL    |       |
+------------------------+--------------+------+-----+---------+-------+

The whitelist for EchoInteraction is:

EchoInteraction	clientValidated
EchoInteraction	event_action
EchoInteraction	event_context
EchoInteraction	event_mobile
EchoInteraction	event_notificationType
EchoInteraction	event_version
EchoInteraction	isTruncated
EchoInteraction	webHost
EchoInteraction	wiki

event_userId is not part of the whitelist (so going to be NULLED) but is not nullable. This will probably require a series of alter tables and a re-run of the script.

Following what done in the ticket linked above, I created the following list: https://phabricator.wikimedia.org/P6511

@Marostegui, @mforns: we should announce downtime for the Eventlogging service (only the mysql consumer/insert part) and then perform the above ALTER tables on db1107. Thoughts?

I have not reviewed the full list of ALTERs, you have way more knowledge than me about what is needed on those tables :-)
But yes, announcing some downtime and performing all the alters sounds good to me.

elukey added a comment.Jan 2 2018, 9:59 AM

I have not reviewed the full list of ALTERs, you have way more knowledge than me about what is needed on those tables :-)
But yes, announcing some downtime and performing all the alters sounds good to me.

Sure! If you didn't see anything weird and the format looks good I'll review them with Marcel and then schedule the maintenance for this week.

@elukey @Marostegui
Oh yes, I didn't remember the non-nullable fields.... Good catch, Luca.
The list of alter tables looks good to me!
I'd say we can move forward.

Sent an email to announce maintenance for tomorrow (Jan 03).

Mentioned in SAL (#wikimedia-operations) [2018-01-03T08:59:13Z] <elukey> stop eventlogging mysql insertion on eventlog1001 to allow db1107 maintenance - T168414

elukey added a comment.Jan 3 2018, 2:44 PM

Maintenance is ongoing and it will probably last for a couple of days.

Mentioned in SAL (#wikimedia-operations) [2018-01-06T08:09:45Z] <elukey> re-enable eventlogging mysql consumers after database maintenance - T168414

Mentioned in SAL (#wikimedia-operations) [2018-01-07T08:52:02Z] <elukey> re-enabled puppet on db110[78] - eventlogging_sync restarted on db1108 (analytics-slave) - T168414

elukey added a comment.Jan 7 2018, 8:57 AM

Wanted to sanity check the data on db1107 (el master) after the mysql consumers added the missing data from the past days:

MariaDB [log]> select count(*) from Edit_17541122 where timestamp like '20180103%';
+----------+
| count(*) |
+----------+
|   346559 |
+----------+
1 row in set (0.24 sec)

MariaDB [log]> select count(*) from Edit_17541122 where timestamp like '20180104%';
+----------+
| count(*) |
+----------+
|   349567 |
+----------+
1 row in set (0.26 sec)

MariaDB [log]> select count(*) from Edit_17541122 where timestamp like '20180105%';
+----------+
| count(*) |
+----------+
|   333856 |
+----------+
1 row in set (0.24 sec)

MariaDB [log]> select count(*) from Edit_17541122 where timestamp like '20180106%';
+----------+
| count(*) |
+----------+
|   324058 |
+----------+
1 row in set (0.23 sec)

MariaDB [log]> select count(*) from NavigationTiming_17216284 where timestamp like '20180103%';
+----------+
| count(*) |
+----------+
|   372561 |
+----------+
1 row in set (0.25 sec)

MariaDB [log]> select count(*) from NavigationTiming_17216284 where timestamp like '20180104%';
+----------+
| count(*) |
+----------+
|   370729 |
+----------+
1 row in set (0.35 sec)

MariaDB [log]> select count(*) from NavigationTiming_17216284 where timestamp like '20180105%';
+----------+
| count(*) |
+----------+
|   359509 |
+----------+
1 row in set (0.30 sec)

MariaDB [log]> select count(*) from NavigationTiming_17216284 where timestamp like '20180106%';
+----------+
| count(*) |
+----------+
|   366302 |
+----------+
1 row in set (0.26 sec)

For the record, before restarting eventlogging sync on db1108 (el slave) this was the max(timestamp) for the above tables: 20180103085656.

@mforns I think that we are good after this maintenance, let me know if you have other ideas and/or want to verify the db's consistency!

elukey set the point value for this task to 13.Jan 17 2018, 2:56 PM

The first run completed without any errors, and then another one (cleaning up only daily data) ran as well setting the following:

INFO: line 617: Update /var/run/eventlogging_cleaner with the current end_ts 20171019110001

I also checked the min(timestamp) of each table and compared some of them with the whitelist, everything seems consistent. I'd call this task done :)

elukey moved this task from Ready to Deploy to Done on the Analytics-Kanban board.Jan 17 2018, 2:58 PM
Nuria closed this task as Resolved.Jan 18 2018, 4:59 PM