
Geoeditors_private deletion scripts scheduled day conflicts with retention period
Closed, ResolvedPublic

Description

The deletion scripts for the 2 geoeditors_private data sets (cu_changes, geoeditors_daily)
are scheduled to run on the 16th of each month, and have a retention period of 80 days.
But the data sets are monthly, so the following edge case arises:

Imagine we're storing May, Jun, July and August, and today's date is August 16.
The script will run and calculate Aug 16 - 80 days = May 28.
It will not delete May, because May still has a couple of days within the valid period.
Then days will pass, and on Sept 15 we'll be storing May, Jun, Jul, Aug and 15 days of Sept.
That is around 138 days (31 + 30 + 31 + 31 + 15).
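The scenario above can be checked with a few lines of Python. This is only a sketch: the year 2019 is an assumption (the description gives no year), and so is the rule that a monthly partition is kept as long as its last day is on or after the cutoff. The real deletion script may treat the boundary differently.

```python
from calendar import monthrange
from datetime import date, timedelta

RETENTION_DAYS = 80  # retention period quoted in the description


def month_is_kept(year, month, run_day, retention_days=RETENTION_DAYS):
    """True if the monthly partition still has a day on or after the cutoff.

    Assumption: a month is deleted only when all of its days are older
    than the cutoff (run day minus retention period).
    """
    cutoff = run_day - timedelta(days=retention_days)
    last_day = date(year, month, monthrange(year, month)[1])
    return last_day >= cutoff


run_day = date(2019, 8, 16)  # the year is an assumption for illustration
print("cutoff:", run_day - timedelta(days=RETENTION_DAYS))
print("May kept:", month_is_kept(2019, 5, run_day))
# Days on disk the day before the next run, counting May 1 through Sept 15.
print("days stored on Sept 15:", (date(2019, 9, 15) - date(2019, 5, 1)).days + 1)
```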

At the very least, I would reschedule the timer to run on the 22nd of the month.
This way on the 21st, we'd be storing a maximum of around 111 days.
And on the 22nd of the month, we'd be storing only 81 days.
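To compare the two schedules, here is a rough sketch that computes how many days are on disk right after a run and on the day before the next run. The helper names (`oldest_kept_month`, `stored_window`) are made up for illustration, the year 2019 is an assumption, and the model assumes a month survives while any of its days is on or after the cutoff. Exact counts shift by a few days depending on month lengths.

```python
from datetime import date, timedelta


def oldest_kept_month(run_day, retention_days):
    """First day of the oldest monthly partition that survives a run.

    Assumption: the month containing the cutoff date is the oldest one kept.
    """
    cutoff = run_day - timedelta(days=retention_days)
    return cutoff.replace(day=1)


def stored_window(year, month, run_dom, retention_days=80):
    """Return (days stored right after the run, days stored the day before the next run)."""
    run = date(year, month, run_dom)
    if month == 12:
        next_run = date(year + 1, 1, run_dom)
    else:
        next_run = date(year, month + 1, run_dom)
    oldest = oldest_kept_month(run, retention_days)
    after_run = (run - oldest).days + 1
    # Counting inclusively up to the day before the next run.
    before_next_run = (next_run - oldest).days
    return after_run, before_next_run


for run_dom in (16, 22):
    windows = [stored_window(2019, m, run_dom) for m in range(1, 13)]
    worst = max(before for _, before in windows)
    print(f"run on day {run_dom}: worst case {worst} days stored before the next run")
```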

Event Timeline

fdans moved this task from Incoming to Ops Week on the Analytics board.

We should ensure that we keep at least the last 90 days,
and delete the data as soon as possible after that.

Some thoughts before taking actions:

  • Datasets are monthly, so when we delete a full month there is at most a 31-day discrepancy between the first and last day deleted.
  • IIRC the retention policy is about keeping AT MOST 90 days, so I'd rather keep 65 days, making sure we always have 2 months of data when the geoeditors job runs, and try not to go over 90, instead of guaranteeing 90 days and deleting only when we hold up to 90+31 = 121 days.
  • Shouldn't we execute the script every day? It'd be a no-op on most days, and would delete data as soon as we go over the stated period.
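A quick way to see what a daily run would imply is to simulate a year of runs and record how many days of data are on disk after each one. This is a sketch under assumptions, not the actual purge logic: `days_stored` and `daily_run_bounds` are hypothetical helpers, the year 2019 is arbitrary, and a monthly partition is assumed to be dropped only once every day in it is older than the cutoff.

```python
from datetime import date, timedelta


def days_stored(today, retention_days):
    """Days of data on disk right after a purge run on `today`.

    Assumption: the oldest surviving partition is the month that
    contains the cutoff date (today minus the retention period).
    """
    cutoff = today - timedelta(days=retention_days)
    oldest_kept = cutoff.replace(day=1)
    return (today - oldest_kept).days + 1


def daily_run_bounds(year, retention_days):
    """Min and max days stored when the purge runs every day of `year`."""
    day = date(year, 1, 1)
    counts = []
    while day.year == year:
        counts.append(days_stored(day, retention_days))
        day += timedelta(days=1)
    return min(counts), max(counts)


if __name__ == "__main__":
    for retention in (65, 90):
        low, high = daily_run_bounds(2019, retention)
        print(f"retention={retention}: between {low} and {high} days stored")
```

Under this model the amount stored is the retention period plus the day-of-month of the cutoff, which is where the "+31" in the 90+31 figure comes from.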

@JAllemandou

  • IIRC the retention policy is about keeping AT MOST 90 days, so I'd rather keep 65 days, making sure we always have 2 months of data when the geoeditors job runs, and try not to go over 90, instead of guaranteeing 90 days and deleting only when we hold up to 90+31 = 121 days.

Yes, I lean towards that as well.

  • Shouldn't we execute the script every day? It'd be a no-op on most days, and would delete data as soon as we go over the stated period.

Definitely, easier, cleaner :]

Shouldn't we execute the script every day? It'd be a no-op on most days, and would delete data as soon as we go over the stated period.

+1 seems a lot less error prone

Change 532684 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::data_purge.pp: fix geoeditors retention period

https://gerrit.wikimedia.org/r/532684

@mforns: I wonder whether we should document the deletion strategy on a wikitech page, to give us a clear picture of how much data we have and when deletion happens (it's not complicated, but since it changes with the number of days in the month, it's not super intuitive). Maybe not though...

Change 532684 merged by Ottomata:
[operations/puppet@production] analytics::refinery::job::data_purge.pp: fix geoeditors retention period

https://gerrit.wikimedia.org/r/532684

@JAllemandou
I created this page in Wikitech; it explains a bit how data_purge.pp works and how the retention period and the timer interval interact.
Please feel free to modify it!

To keep the archive happy, this is the page: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Data_deletion_and_sanitization

Thanks a lot @marcel, I think this is very useful :)