
Sanitize Hive EventLogging
Closed, Resolved · Public · 21 Estimate Story Points

Description

Refactor some refinery Python utils and scripts to be smarter about inferring Hive table and HDFS path partitions, and to automatically purge data after N days.

This should also work with many tables at once.
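A minimal sketch of the idea in the description, not the refinery implementation: it infers a partition spec from Hive-style `key=value` path segments and selects the paths older than N days. The function names, the example base path, and the `year`/`month`/`day`/`hour` partition keys are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

# Hive-style partition directories are key=value path segments.
PARTITION_RE = re.compile(r"(?P<key>\w+)=(?P<value>[^/]+)")


def partition_from_path(path):
    """Return the partition spec encoded in a path as a dict of strings."""
    return {m.group("key"): m.group("value") for m in PARTITION_RE.finditer(path)}


def partition_datetime(spec):
    """Build a datetime from year/month/day/hour keys (assumed key names).

    Missing month/day default to 1 and a missing hour defaults to 0.
    """
    return datetime(
        int(spec["year"]),
        int(spec.get("month", 1)),
        int(spec.get("day", 1)),
        int(spec.get("hour", 0)),
    )


def paths_to_purge(paths, now, older_than_days=90):
    """Return the paths whose partition time is before now - older_than_days.

    Paths without a year partition are skipped rather than purged.
    """
    cutoff = now - timedelta(days=older_than_days)
    selected = []
    for path in paths:
        spec = partition_from_path(path)
        if "year" in spec and partition_datetime(spec) < cutoff:
            selected.append(path)
    return selected


# Hypothetical paths for two tables, handled in one pass.
example_paths = [
    "/example/event/table_a/year=2017/month=11/day=21/hour=4",
    "/example/event/table_b/year=2018/month=2/day=10/hour=0",
]
old_paths = paths_to_purge(example_paths, now=datetime(2018, 3, 1))
```

Because the partition spec is read from the path itself, the same function can be applied to a list of paths from many tables at once; actually dropping the Hive partitions and deleting the HDFS directories is left out of the sketch.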

Details

Related Gerrit Patches:
analytics/refinery/source : master : Add EL and whitelist sanitization
analytics/refinery/source : master : Factor out RefineTarget from JsonRefine for use with other jobs

Event Timeline

Ottomata created this task. Nov 21 2017, 4:13 PM

From https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging#Work_in_progress I understand that this will implement the existing purging whitelist. I'll clarify the task description accordingly.

Tbayer updated the task description. Nov 21 2017, 4:31 PM

This task is just to purge 90 days. Implementing the intelligent whitelist based refining will be a different task.

Ottomata updated the task description. Nov 21 2017, 5:46 PM

I see, but that will be a problem in case of the Popups schema (and possibly others too which are no longer stored in MySQL), as the advice in the documentation doesn't work for them: "If you want to access EL historical data (that has been kept for longer than 90 days), you'll find it in the MariaDB hosts".
So we should exempt that table until the proper purging strategies are implemented on Hive too. Is there already a task for that BTW?

Is there already a task for that BTW?

Not yet! I just made this one today :) The default is delete after 90 days so we need at least that. Smarter purging is much more complicated. Hopefully we'll be able to re-use the work Luca and Marcel did for that.

that will be a problem in case of the Popups schema (and possibly others too which are no longer stored in MySQL),

If the Popups experiment is over and the volume of events will remain low, we can re-enable MySQL imports for it.

It's also about the data from the experiment that just ended (tbayer.popups / event.popups). Are you suggesting we should reimport these tables into MySQL?

Naw, tbayer.popups isn’t gonna be touched.

Nuria edited projects, added Analytics; removed Analytics-Kanban. Dec 19 2017, 10:39 PM
Nuria moved this task from Incoming to Operational Excellence Future on the Analytics board.
Nuria edited projects, added Analytics-Kanban; removed Analytics. Jan 3 2018, 10:53 PM

Change 408435 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Factor out RefineTarget from JsonRefine for use with other jobs

https://gerrit.wikimedia.org/r/408435

mforns claimed this task. Feb 6 2018, 5:31 PM

Change 408435 merged by Ottomata:
[analytics/refinery/source@master] Factor out RefineTarget from JsonRefine for use with other jobs

https://gerrit.wikimedia.org/r/408435

mforns moved this task from Next Up to In Progress on the Analytics-Kanban board. Feb 7 2018, 3:51 PM

Change 412939 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] [WIP] Add EL and whitelist sanitization

https://gerrit.wikimedia.org/r/412939

Ottomata renamed this task from Purge refined JSON data after 90 days to Sanitize Hive EventLogging. Mar 1 2018, 6:27 PM
Ottomata added subscribers: Nuria, mforns, JAllemandou.

Change 412939 merged by Ottomata:
[analytics/refinery/source@master] Add EL and whitelist sanitization

https://gerrit.wikimedia.org/r/412939

mforns changed the point value for this task from 8 to 21. Mar 14 2018, 4:04 PM
mforns moved this task from Ready to Deploy to Done on the Analytics-Kanban board.

This task is just to purge 90 days. Implementing the intelligent whitelist based refining will be a different task.

To follow up, it looks from https://gerrit.wikimedia.org/r/412939 that this task does now cover whitelist-based sanitization :)

mforns added a comment. Edited · Mar 15 2018, 6:34 PM

@Neil_P._Quinn_WMF
Yes, it does. And that code is already merged.
The next steps in this project include:

  • Translating the current TSV whitelist into the new YAML-formatted whitelist: T189690
  • Putting the new whitelist in place with puppet: T189691
  • Modifying the MySQL purging script to use the new whitelist format: T189692
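
As a rough illustration of the first step, here is a sketch of turning a TSV whitelist into YAML text. The layout of both files is not given in this task, so the sketch assumes a two-column TSV (table name, field name) and a YAML mapping from each table to a list of kept fields; the function names and sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def whitelist_from_tsv(tsv_text):
    """Group (table, field) rows from an assumed two-column TSV by table."""
    whitelist = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank lines, short rows, and comment lines.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        table, field = row[0].strip(), row[1].strip()
        if field not in whitelist[table]:
            whitelist[table].append(field)
    return dict(whitelist)


def whitelist_to_yaml(whitelist):
    """Render the mapping as YAML: one key per table, one list item per field."""
    lines = []
    for table in sorted(whitelist):
        lines.append(f"{table}:")
        for field in whitelist[table]:
            lines.append(f"  - {field}")
    return "\n".join(lines) + "\n"


# Hypothetical whitelist rows, including a duplicate and a comment line.
sample_tsv = (
    "# table\tfield\n"
    "ExampleSchemaB\tfield_x\n"
    "ExampleSchemaA\tfield_1\n"
    "ExampleSchemaA\tfield_2\n"
    "ExampleSchemaA\tfield_1\n"
)
sample_yaml = whitelist_to_yaml(whitelist_from_tsv(sample_tsv))
```

The YAML is written by hand here only to keep the sketch free of third-party dependencies; a real conversion would more likely use a YAML library and whatever structure the new whitelist format actually defines.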
Nuria closed this task as Resolved. Mar 20 2018, 2:59 PM