
Growth: implement wider data purge window
Closed, ResolvedPublic

Description

Because it takes months to build up enough data via Growth features to facilitate analysis, we will be widening the window of data retention for our features from 90 days to 270 days. This task is to figure out how to implement this for Growth EventLogging schemas. Some open questions:

  • Is Analytics Engineering able to apply a different timeline to some schemas and not others?
  • Can parts of the schema be purged at 90 days, and the rest at 270 days?
  • How can we apply this to our existing schemas, and not just future ones?

Event Timeline

This is in Upcoming Work for @nettrom_WMF to start thinking about and for us to discuss. We should have this sorted out by December.

MMiller_WMF set Due Date to Dec 12 2019, 8:00 AM. (Nov 1 2019, 6:26 PM)

Is Analytics Engineering able to apply a different timeline to some schemas and not others?
Can parts of the schema be purged at 90 days, and the rest at 270 days?

The answer to both is no. Our sanitization pipeline can apply one sanitization at day 2 and another at, say, 45 days; this is how we correct sanitization errors. We have an early and a later sanitization job, and both run every day. Both work within the 90-day period that data is supposed to exist. We do not have a way at this time to make time-based exceptions on retention per schema.

Ottomata subscribed.

We can't do this now. If you need to retain data longer, you'll have to do it yourselves manually by copying the data away from the normal location. Please be careful and don't keep data forever!

We'd like to do a bit of refactoring of our sanitization pipeline, but that work isn't currently planned. If we get around to it we'll consider this use case.

MMiller_WMF removed a project: Analytics.

I am reopening this because this is the task to get the work done, whichever method we choose to use. I am removing the Analytics tag.

Let's leave the Analytics tag, because we need to know where that data is should we get a data request.

Ottomata moved this task from Incoming to Data Quality on the Analytics board.

Hey all,
Would it be OK to keep original data for 270 days (can you keep all fields?), or do you need some fields to be sanitized to be able to keep it for 270 days?
If it's OK to keep all fields unsanitized for 270 days, we found an interim solution that we can quickly implement for this particular schema.
It's not scalable to more than a couple of schemas, though, so we'd still plan to work on this task in the future.

The solution would keep the unsanitized data in the event database for 270 days, without the need of copying it over to somewhere else.
And the data would be completely deleted after 270 days (rolling window). The sanitized data would continue to be stored in the event_sanitized database indefinitely.

If you're OK with that, then I can start implementing it this week.
Cheers!

@mforns -- I think that @nettrom_WMF can answer you on this.

@nettrom_WMF and I decided today that we want to have a strategy for this by mid-December, and we want to implement it in January.

After discussing with @nettrom_WMF we concluded that the solution described above is not a fit.
The raw (unsanitized) data does not have the approval from legal to be kept for 270 days.
For the data to be kept for 270 days, some fields need to be sanitized.
It wouldn't be a "full" sanitization, meaning some fields would still be privacy-sensitive to a certain level,
but the "half-sanitization" would be enough to keep the data for 270 days.
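
To illustrate what a white-list based "half-sanitization" could look like, here is a minimal sketch in Python. It is not the actual EventLogging sanitization code: the `half_sanitize` helper and the example field names are hypothetical, and the only behavior assumed is that fields on the white-list are kept as-is while all other fields are dropped.

```python
from typing import Any, Dict, Set


def half_sanitize(event: Dict[str, Any], keep_fields: Set[str]) -> Dict[str, Any]:
    """Return a copy of the event containing only white-listed fields.

    Hypothetical helper for illustration; the input event is not modified.
    """
    return {name: value for name, value in event.items() if name in keep_fields}


if __name__ == "__main__":
    # Field names below are made up for the example.
    raw_event = {"action": "open", "user_id": 42, "ip": "203.0.113.7"}
    print(half_sanitize(raw_event, {"action", "user_id"}))
```

In this sketch, a field that is still somewhat privacy-sensitive (such as the made-up `user_id`) can stay on the white-list, which is why the result would only be suitable for a limited retention period rather than indefinite storage.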

Now, here's another thing we could do:
Implement the "half-sanitization" on the regular EventLogging white-list, as if we were to keep the "half-sanitized" data indefinitely in the event_sanitized table (together with all other EL sanitized data sets),
but then add a deletion timer in puppet, that would remove the data specific to the affected growth schemas from the event_sanitized database after 270 days (sliding window).
IIUC the Growth team is not interested yet in keeping the data for more than 270 days (even if fully sanitized).
This should be pretty quick to do. But again, it would be a temporary solution that wouldn't scale to many schemas...
@Nuria, what do you think?

that would remove the data specific to the affected growth schemas from the event_sanitized database after 270 days (sliding window).

This would remove every event that was received 270 days ago (or more), that is, all data for the event. Is that what we want?

@Nuria, I believe, for now, that would be OK for them.
@nettrom_WMF explained that they are aiming to make short term analyses of 270 days,
and that they have no interest so far in keeping a fully-sanitized version of the data for longer.

In the future, though, this might change, he said.

@Nuria : I can confirm what @mforns mentions. During my conversations with him yesterday, it became clear to me that the way the Growth team uses EventLogging is an in-between case. Since we're running fairly long experiments, we need data for longer than the default 90 days, but we also need richer data than what we'd limit ourselves to if we were to store data indefinitely. Hence a 270-day sliding window for our sanitized data would work well for us. (By the way, this is also why we asked for deletion of sanitized data in T234870 as we completed the Help Panel experiment: we could no longer keep that data around.)

@nettrom_WMF which schemas are subject to this 270-day window?

@nettrom_WMF OK, then!

I will implement a deletion timer specific to those 3 schemas,
that will delete all their data from the event_sanitized database after 270 days of collection.

On your side, please make sure the white-list includes all fields to be kept for 270 days for those schemas :]
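
For illustration, the 270-day sliding window can be sketched in Python as below. This is not the refinery drop-older-than script or the puppet timer; the `partitions_to_drop` helper and the assumption that data is organized as one partition per day are hypothetical. The sketch only shows the cutoff arithmetic: on each run, anything strictly older than 270 days is selected for deletion.

```python
from datetime import date, timedelta
from typing import Iterable, List

RETENTION_DAYS = 270  # retention window discussed in this task


def partitions_to_drop(
    partition_dates: Iterable[date],
    today: date,
    retention_days: int = RETENTION_DAYS,
) -> List[date]:
    """Return the partition dates strictly older than the retention window.

    Hypothetical helper: assumes one partition per calendar day.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


if __name__ == "__main__":
    today = date(2020, 9, 1)
    existing = [today - timedelta(days=n) for n in (300, 271, 270, 30)]
    # Only the partitions 300 and 271 days old fall outside the window.
    print(partitions_to_drop(existing, today))
```

Running such a selection once a day gives a rolling window: each day the oldest partition crosses the cutoff and is dropped, while nothing is deleted until the first data reaches 270 days of age.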

Change 556231 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Allow drop-older-than to delete under event_sanitized

https://gerrit.wikimedia.org/r/556231

Change 556232 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::data_purge: Add growth deletion timers

https://gerrit.wikimedia.org/r/556232

Change 556231 merged by Mforns:
[analytics/refinery@master] Allow drop-older-than to delete under event_sanitized

https://gerrit.wikimedia.org/r/556231

Change 556232 merged by Ottomata:
[operations/puppet@production] analytics::refinery::job::data_purge: Add growth deletion timers

https://gerrit.wikimedia.org/r/556232

@nettrom_WMF
We enabled the deletion of the data for the 3 specified schemas: HelpPanel, HomepageVisit, HomepageModule.
No data has been deleted yet because all events are still less than 270 days old.
So, provided you have everything you want to keep in the sanitization white-list, I guess this task can be marked as done!
Cheers

Thank you! Now that this is running, I filed T249666 (Growth: validate that data is purged after 270 days) so that we remember to validate that the purging is happening correctly once August rolls around.