Refactor some refinery Python utils and scripts to be smarter about inferring Hive table and HDFS path partitions, and to automatically purge data after N days.
This should also work with many tables at once.
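To make the intent concrete, here is a minimal sketch of how partition inference and age-based selection could work. It is not the refinery implementation: the function names, the example paths, and the assumption that partitions follow the Hive-style `year=/month=/day=/hour=` layout are all illustrative. It only selects paths; the actual dropping of Hive partitions and deletion of HDFS directories is left out.

```python
import re
from datetime import datetime, timedelta

# Assumption: partition directories use Hive-style key=value segments,
# e.g. /some/table/year=2018/month=2/day=5/hour=13
PARTITION_RE = re.compile(r"(year|month|day|hour)=(\d+)")


def partition_datetime(path):
    """Infer the start datetime of a partition path, or None if it has no year."""
    parts = {key: int(value) for key, value in PARTITION_RE.findall(path)}
    if "year" not in parts:
        return None
    # Missing finer-grained keys default to the start of the enclosing period,
    # so a coarse partition is judged by its start time. A real tool would
    # probably want to be more conservative here.
    return datetime(
        parts["year"],
        parts.get("month", 1),
        parts.get("day", 1),
        parts.get("hour", 0),
    )


def partitions_to_purge(paths, older_than_days, now):
    """Return the paths whose inferred datetime is older than the cutoff."""
    cutoff = now - timedelta(days=older_than_days)
    selected = []
    for path in paths:
        dt = partition_datetime(path)
        if dt is not None and dt < cutoff:
            selected.append(path)
    return selected


if __name__ == "__main__":
    # Hypothetical paths from two different tables, handled in one call.
    example_paths = [
        "/data/event/popups/year=2018/month=1/day=5/hour=3",
        "/data/event/popups/year=2018/month=5/day=20/hour=0",
        "/data/event/other_table/year=2017/month=12/day=31/hour=23",
    ]
    for p in partitions_to_purge(example_paths, 90, datetime(2018, 6, 1)):
        print(p)
```

Because the selection works on plain paths, the same call can be fed partitions from many tables at once, which is the behaviour the description asks for.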
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Ottomata | T159170 Sunset MySQL data store for eventlogging
Resolved | | Ottomata | T162610 Implement EventLogging Hive refinement
Resolved | | mforns | T181064 Sanitize Hive EventLogging
Resolved | | Ottomata | T185237 Lookout for duplicates in EL refine, implement pluggable transform method config in JSONRefine
From https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging#Work_in_progress I understand that this will implement the existing purging whitelist. I'll clarify the task description accordingly.
This task is just to purge data older than 90 days. Implementing the intelligent whitelist-based refining will be a different task.
I see, but that will be a problem in case of the Popups schema (and possibly others too which are no longer stored in MySQL), as the advice in the documentation doesn't work for them: "If you want to access EL historical data (that has been kept for longer than 90 days), you'll find it in the MariaDB hosts".
So we should exempt that table until the proper purging strategies are implemented on Hive too. Is there already a task for that BTW?
Is there already a task for that BTW?
Not yet! I just made this one today :) The default is delete after 90 days so we need at least that. Smarter purging is much more complicated. Hopefully we'll be able to re-use the work Luca and Marcel did for that.
that will be a problem in case of the Popups schema (and possibly others too which are no longer stored in MySQL),
If the Popups experiment is over and the volume of events will remain low, we can re-enable MySQL imports for it.
It's also about the data from the experiment that just ended (tbayer.popups / event.popups). Are you suggesting we should reimport these tables into MySQL?
Change 408435 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Factor out RefineTarget from JsonRefine for use with other jobs
Change 408435 merged by Ottomata:
[analytics/refinery/source@master] Factor out RefineTarget from JsonRefine for use with other jobs
Change 412939 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] [WIP] Add EL and whitelist sanitization
Change 412939 merged by Ottomata:
[analytics/refinery/source@master] Add EL and whitelist sanitization
To follow up, it looks from https://gerrit.wikimedia.org/r/412939 that this task does now cover whitelist-based sanitization :)
@Neil_P._Quinn_WMF
Yes, it does. And that code is already merged.
The next steps in this project include: