Page MenuHomePhabricator

Set up auto-purging after 90 days {tick}
Closed, ResolvedPublic0 Estimated Story Points

Description

We need to set up an automatic purging system in EventLogging databases:

  1. By default, all records older than 90 days should be immediately deleted for all tables. This should automatically include tables created in the future.
  2. The system must support some kind of white-listing (we still need to decide what and how). If a table is white-listed, keep the white-listed fields and auto-purge the others normally.
  3. clientIp and userAgent fields (eventCapsule) can not be white-listed, they must be always deleted after 90 days.

We need to decide how to implement this in both master and analytics-storage replica.
The schema+fields whitelist needs still to be created from the EL audit spreadsheed, column N:
https://docs.google.com/spreadsheets/d/1mQtbsbGHLbGsHeNaYFCdg6X4K1VSxjUStBpx4GmVbhk/edit#gid=1610723354

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mforns moved this task from Next Up to Paused on the Analytics-Kanban board.
jcrespo raised the priority of this task from High to Needs Triage.Sep 9 2015, 10:47 AM

@jcrespo

Sorry for the confusion, but I've added some lines to the white-list file.


This is the valid version. The other version is now incomplete.

If this happens again, I'll think of another way of storing this white-llist, to avoid more confusion.

Just wanted to add this note here, too.
The activation in production of this auto-purging should wait until we notify all schema owners with the date this will happen.

@jcrespo
I know it's early to know when this can happen. But as soon as you have an estimate, could you please pass it to me, so I can start reaching out to the schema owners to notify that the auto-purging will be activated?
Thanks a lot!

We should implement the auto-purging per schema, not per revision. Because otherwise, every time someone edits a schema, we'd revert it back to the default purging strategy.

So the white-list, instead of:

TABLE_NAME                COLUMN_NAME
EchoInteraction_5539940   clientValidated
EchoInteraction_5539940   event_action
...

Should be:

SCHEMA_NAME       COLUMN_NAME
EchoInteraction   clientValidated
EchoInteraction   event_action
...

Here's the updated white-list TSV:

When auto-purging i.e. the table EchoInteraction_5539940, we should query the white-list for EchoInteraction and keep the data for all white-listed columns. Note that the white-list may contain column names that do not belong to EchoInteraction_5539940, because they belong to other revisions of the same schema. But this should be OK, those can be ignored.

@Aklapper I am not working on this, but people that said they are not going to work on this (Analytics: Radar) keeps assigning it to me. Please mediate.

jcrespo triaged this task as Medium priority.Apr 22 2016, 4:10 PM
jcrespo moved this task from Triage to Pending comment on the DBA board.

we still need to decide what and how

If it is very dynamic, I would recommend setting a table with table names, that way an event could read it and actuate depending on that. Alternatively, and maybe a better option would be to maintain a list of tables on puppet for a cron job- that would have better visibility.

@jcrespo
I think you're right, the latter seems to have better visibility and also permits to CR changes, which will be good, given that we'll be changing security settings. Let me know if I can help you with that.

@mforns Please send an RC to operations/puppet with a plan-text list of tables, or any other info that will be needed to create a cron job to perform the purge.

I will then use that list to change the purging system from an event to a db maintenance job that keeps the analytics hosts in a good shape.

I removed myself from the task, because I created another task in our kanban for that.
Nevertheless, I am working right now in putting together that white-list of schema/fields.
I just need some time to contact the creators of new schemas that were not discussed in the audit last year. When I'm done, I will update this task. I will add my task as a blocker of this one. Thanks!

@jcrespo

I finished updating the white-list. Sorry for the delay, I needed to make sure that changes to schemas between 2015's audit and today were included and that the respective schema owners were notified and agreed with the purging.

The file is here:

How to understand this file:

  • It's a TSV file.
  • It has 2 columns: schema and field.
  • It is a white-list, meaning that all schema-field pairs listed are to be kept indefinitely in the database. The rest of schema-field pairs not listed are to be auto-purged after 90 days.
  • The schema column does not include the revision, just the schema name. This means the white-list applies to all revisions of a given schema. This is done on purpose, so that new revisions of a white-listed schema will continue to be considered. Otherwise, every time a schema got modified, we would need to add its fields to the white-list.
  • Note that new schemas or new fields in existing schemas will be automatically auto-purged after 90 days by default. If a schema owner wants their schema or fields to be kept indefinitely, they can contact analytics and we'll update the white-list.

@jcrespo
Please, let me know if the white-list is what you expected and what can I do to help you in the next steps. Thanks!

Please create a change review on puppet/operations (anywhere, I will move it somewhere else) instead of attaching it here, I will take it from here.

Change 298721 had a related patch set uploaded (by Mforns):
Add white-list for EventLogging auto-purging

https://gerrit.wikimedia.org/r/298721

@jcrespo

Please, see above the patch I created. When the file has a permanent location in puppet, I will add some documentation on EventLogging's Wikitech page. Thanks!

@jcrespo
I did some additions to the white list in the gerrit change set. But I still need the confirmation of 1 schema owner. Please do not activate the purging until I get that confirmation. I will post on this task and let you know.

@jcrespo
Ok, it's confirmed. We can proceed :]

@Nuria, @mforns, I have updated the documentation at wikitech:Analytics/EventLogging to reflect these changes. It would also be good to update the more detailed docs at wikitech:Analytics/EventLoging/Data retention and auto-purging.

One question: the main event logging page says that all EL data in Hadoop is purged after 90 days. Is this correct, and if so, are there any plans to apply the whitelist to that data?

@Neil_P._Quinn_WMF @Nuria
Thanks Neil for spotting this. I updated the docs on data retention and auto-purging.

One question: the main event logging page says that all EL data in Hadoop is purged after 90 days. Is this correct, and if so, are there any plans to apply the whitelist to that data?

Yes, the autopurging scripts for Hadoop are in place already and they purge all the data after 90 days. About applying the white list to Hadoop, this has been discussed a couple times, but not prioritized. If there is a use case for it, we definitely can consider it.

Nuria edited projects, added Analytics; removed Analytics-Kanban.
Nuria moved this task from Radar to Operational Excellence Future on the Analytics board.

Change 356383 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py

https://gerrit.wikimedia.org/r/356383

Change 356383 merged by Elukey:
[operations/puppet@production] role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py

https://gerrit.wikimedia.org/r/356383

Change 364691 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mariadb::analytics::custom_repl_slave: deploy the EL whitelist

https://gerrit.wikimedia.org/r/364691

Change 364691 merged by Elukey:
[operations/puppet@production] role::mariadb::analytics::custom_repl_slave: deploy the EL whitelist

https://gerrit.wikimedia.org/r/364691

Change 298721 abandoned by Elukey:
Add white-list for EventLogging auto-purging

https://gerrit.wikimedia.org/r/298721

Hi!

What's the status of this?

Going to be done before EOQ :)

elukey moved this task from In Progress to Done on the User-Elukey board.
elukey moved this task from Done to In Progress on the User-Elukey board.

Change 398869 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::el::master: apply data sanitization policies

https://gerrit.wikimedia.org/r/398869

Change 398869 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::el::master: apply data sanitization policies

https://gerrit.wikimedia.org/r/398869

Change 399149 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::eventlogging: fix group/user dependencies

https://gerrit.wikimedia.org/r/399149

Change 399149 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::eventlogging: fix group/user dependencies

https://gerrit.wikimedia.org/r/399149

Change 399153 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_purging_whitelist.tsv: remove old table

https://gerrit.wikimedia.org/r/399153

Change 399153 merged by Elukey:
[operations/puppet@production] eventlogging_purging_whitelist.tsv: remove old table

https://gerrit.wikimedia.org/r/399153

Mentioned in SAL (#wikimedia-operations) [2017-12-19T16:34:39Z] <elukey> manually started eventlogging cleaner on db1107 to purge/sanitize data up to 90 days ago (tmux is running for user eventlogcleaner) - T108850

elukey set the point value for this task to 0.