
Sanitize Hive EventLogging
Closed, Resolved · Public · 21 Estimated Story Points

Description

Refactor some refinery python utils and scripts to be smarter about inferring Hive table and HDFS path partitions, and to automatically purge data after N days.

This should also work with many tables at once.
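As a rough illustration only (not the actual refinery code), the idea is to parse time-bucketed partition directories out of HDFS paths and select the ones older than the retention threshold for deletion; the path layout and function below are assumptions made for the sake of the example:

  import re
  from datetime import datetime, timedelta

  # Hypothetical partition layout, e.g.:
  #   /wmf/data/event/popups/year=2018/month=1/day=15/hour=0
  PARTITION_RE = re.compile(
      r"year=(?P<year>\d{4})/month=(?P<month>\d{1,2})/"
      r"day=(?P<day>\d{1,2})(?:/hour=(?P<hour>\d{1,2}))?"
  )

  def partitions_older_than(paths, days=90, now=None):
      """Return the subset of partition paths older than `days` days."""
      cutoff = (now or datetime.utcnow()) - timedelta(days=days)
      old = []
      for path in paths:
          match = PARTITION_RE.search(path)
          if not match:
              continue  # not a time-bucketed partition path, leave it alone
          parts = {k: int(v or 0) for k, v in match.groupdict().items()}
          partition_time = datetime(
              parts["year"], parts["month"], parts["day"], parts["hour"])
          if partition_time < cutoff:
              old.append(path)
      return old

  # The caller would then drop the corresponding Hive partitions and delete
  # the HDFS directories returned by partitions_older_than(...).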

Event Timeline

From https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging#Work_in_progress I understand that this will implement the existing purging whitelist. I'll clarify the task description accordingly.

This task is just to purge after 90 days. Implementing the intelligent whitelist-based refining will be a different task.

I see, but that will be a problem in the case of the Popups schema (and possibly others that are no longer stored in MySQL), as the advice in the documentation doesn't work for them: "If you want to access EL historical data (that has been kept for longer than 90 days), you'll find it in the MariaDB hosts".
So we should exempt that table until the proper purging strategies are implemented on Hive too. Is there already a task for that BTW?

Is there already a task for that BTW?

Not yet! I just made this one today :) The default is to delete after 90 days, so we need at least that. Smarter purging is much more complicated. Hopefully we'll be able to re-use the work Luca and Marcel did for that.

that will be a problem in the case of the Popups schema (and possibly others that are no longer stored in MySQL),

If the Popups experiment is over and the volume of events will remain low, we can re-enable MySQL imports for it.

It's also about the data from the experiment that just ended (tbayer.popups / event.popups). Are you suggesting we should reimport these tables into MySQL?

Naw, tbayer.popups isn’t gonna be touched.

Change 408435 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Factor out RefineTarget from JsonRefine for use with other jobs

https://gerrit.wikimedia.org/r/408435

Change 408435 merged by Ottomata:
[analytics/refinery/source@master] Factor out RefineTarget from JsonRefine for use with other jobs

https://gerrit.wikimedia.org/r/408435

Change 412939 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] [WIP] Add EL and whitelist sanitization

https://gerrit.wikimedia.org/r/412939

Ottomata renamed this task from "Purge refined JSON data after 90 days" to "Sanitize Hive EventLogging". Mar 1 2018, 6:27 PM
Ottomata added subscribers: Nuria, mforns, JAllemandou.

Change 412939 merged by Ottomata:
[analytics/refinery/source@master] Add EL and whitelist sanitization

https://gerrit.wikimedia.org/r/412939
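To make the "whitelist sanitization" part concrete: the general idea is that fields listed in the whitelist are kept and everything else is nulled out once the retention window has passed. The real implementation is the Spark/Scala job in analytics/refinery/source; the Python below is only an illustrative sketch, with made-up field names:

  def sanitize_event(event, whitelisted_fields):
      """Keep whitelisted fields as-is and null out everything else."""
      return {
          field: (value if field in whitelisted_fields else None)
          for field, value in event.items()
      }

  # Example: only 'wiki' and 'action' survive past the retention window.
  sanitize_event(
      {"wiki": "enwiki", "action": "click", "user_text": "SomeUser"},
      whitelisted_fields={"wiki", "action"},
  )
  # -> {'wiki': 'enwiki', 'action': 'click', 'user_text': None}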

mforns changed the point value for this task from 8 to 21. Mar 14 2018, 4:04 PM
mforns moved this task from Ready to Deploy to Done on the Analytics-Kanban board.

This task is just to purge after 90 days. Implementing the intelligent whitelist-based refining will be a different task.

To follow up: judging from https://gerrit.wikimedia.org/r/412939, it looks like this task does now cover whitelist-based sanitization :)

@Neil_P._Quinn_WMF
Yes, it does. And that code is already merged.
The next steps in this project include:

  • Translating the current TSV whitelist into the new YAML-formatted whitelist: T189690 (see the sketch after this list)
  • Putting the new whitelist in place with Puppet: T189691
  • Modifying the MySQL purging script to use the new whitelist format: T189692
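
A hedged sketch of what the TSV-to-YAML translation in T189690 could look like; the actual whitelist formats are defined in that task, so the column layout and YAML shape below are assumptions rather than the real formats:

  import csv
  from collections import defaultdict

  import yaml  # PyYAML

  def tsv_whitelist_to_yaml(tsv_path, yaml_path):
      """Group assumed (table, field) TSV rows into a per-table YAML mapping."""
      tables = defaultdict(list)
      with open(tsv_path, newline="") as tsv_file:
          for row in csv.reader(tsv_file, delimiter="\t"):
              if not row or row[0].startswith("#"):
                  continue  # skip blank lines and comments
              table, field = row[0], row[1]
              tables[table].append(field)
      with open(yaml_path, "w") as yaml_file:
          yaml.safe_dump(
              {table: sorted(fields) for table, fields in tables.items()},
              yaml_file, default_flow_style=False)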