
Growth: implement wider data purge window
Open, Low, Public

Description

Because it takes months to build up enough data via Growth features to facilitate analysis, we will be widening the window of data retention for our features from 90 days to 270 days. This task is to figure out how to implement this for Growth EventLogging schemas. Some open questions:

  • Is Analytics Engineering able to apply a different timeline to some schemas and not others?
  • Can parts of the schema be purged at 90 days, and the rest at 270 days?
  • How can we apply this to our existing schemas, and not just future ones?

Details

Due Date
Thu, Dec 12, 8:00 AM
Related Gerrit Patches:

Event Timeline

Restricted Application added a subscriber: Aklapper. Nov 1 2019, 6:25 PM
MMiller_WMF moved this task from Inbox to Upcoming Work on the Growth-Team board. Nov 1 2019, 6:26 PM

This is in Upcoming Work for @nettrom_WMF to start thinking about and for us to discuss. We should have this sorted out by December.

MMiller_WMF set Due Date to Thu, Dec 12, 8:00 AM. Nov 1 2019, 6:26 PM
Nuria added a subscriber: Nuria. Nov 4 2019, 5:19 PM

Is Analytics Engineering able to apply a different timeline to some schemas and not others?
Can parts of the schema be purged at 90 days, and the rest at 270 days?

The answer to both is no. Our sanitization pipeline can apply one sanitization at day 2 and another at, say, day 45; this is how we correct sanitization errors. We have an early and a later sanitization job that run every day, and both work within the 90-day period that data is supposed to exist. At this time we have no way to make time-based retention exceptions per schema.

Ottomata closed this task as Declined. Nov 11 2019, 4:37 PM
Ottomata added a subscriber: Ottomata.

We can't do this now. If you need to retain data longer, you'll have to do it yourselves manually by copying the data away from the normal location. Please be careful and don't keep data forever!

We'd like to do a bit of refactoring of our sanitization pipeline, but that work isn't currently planned. If we get around to it we'll consider this use case.

MMiller_WMF reopened this task as Open. Nov 12 2019, 11:05 PM
MMiller_WMF removed a project: Analytics.

I am reopening this because this is the task to get the work done, whichever method we choose to use. I am removing the Analytics tag.

Let's leave the Analytics tag, because we need to know where that data is should we get a data request.

Ottomata triaged this task as Low priority. Mon, Nov 18, 4:45 PM
Ottomata moved this task from Incoming to Data Quality on the Analytics board.
mforns added a subscriber: mforns. Fri, Nov 22, 4:07 PM

Hey all,
Would it be OK to keep original data for 270 days (can you keep all fields?), or do you need some fields to be sanitized to be able to keep it for 270 days?
If it's OK to keep all fields unsanitized for 270 days, we found an interim solution that we can quickly implement for this particular schema.
It's not scalable to more than a couple schemas, though, so we'd continue to plan this task in the future.

The solution would keep the unsanitized data in the event database for 270 days, without the need of copying it over to somewhere else.
And the data would be completely deleted after 270 days (rolling window). The sanitized data would continue to be stored in the event_sanitized database indefinitely.

If you're OK with that, then I can start implementing it this week.
Cheers!

@mforns -- I think that @nettrom_WMF can answer you on this.

@nettrom_WMF and I decided today that we want to have a strategy for this by mid-December, and we want to implement it in January.

Ping @nettrom_WMF again: let us know if @mforns's proposed plan works. cc @kzimmerman

mforns added a comment. Edited Tue, Dec 3, 10:38 PM

After discussing with @nettrom_WMF we concluded that the solution described above is not a fit.
The raw (unsanitized) data does not have the approval from legal to be kept for 270 days.
For the data to be kept for 270 days, some fields need to be sanitized.
It wouldn't be a "full" sanitization, meaning some fields would still be privacy-sensitive to a certain level,
but the "half-sanitization" would be enough to keep the data for 270 days.
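
As a rough illustration of what a whitelist-driven "half-sanitization" means, here is a minimal Python sketch. It is not the actual EventLogging sanitization code or whitelist format; the schema name, field names, and the `half_sanitize` helper are all hypothetical. It only shows the idea that fields not listed for a schema are dropped while the listed ones, including some that may still be privacy-sensitive, are kept.

```python
from typing import Any, Dict, Mapping

# Hypothetical whitelist: schema name -> fields kept after "half-sanitization".
# The schema and field names are invented for illustration only.
WHITELIST: Mapping[str, frozenset] = {
    "ExampleGrowthSchema": frozenset({"dt", "wiki", "action", "user_id"}),
}


def half_sanitize(schema: str, event: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event containing only whitelisted fields.

    Schemas that are absent from the whitelist keep no fields at all.
    """
    allowed = WHITELIST.get(schema, frozenset())
    return {key: value for key, value in event.items() if key in allowed}


if __name__ == "__main__":
    raw_event = {"dt": "2019-12-03", "wiki": "cswiki", "ip": "198.51.100.7", "action": "open"}
    # The "ip" field is not whitelisted, so it is dropped.
    print(half_sanitize("ExampleGrowthSchema", raw_event))
```

In this sketch a field such as `user_id` stays in the output, which mirrors the point above: the result is not fully sanitized, so it would still need a bounded retention period.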

Now, here's another thing we could do:
Implement the "half-sanitization" on the regular EventLogging white-list, as if we were to keep the "half-sanitized" data indefinitely in the event_sanitized table (together with all other EL sanitized data sets),
but then add a deletion timer in puppet, that would remove the data specific to the affected growth schemas from the event_sanitized database after 270 days (sliding window).
IIUC the Growth team is not interested yet in keeping the data for more than 270 days (even if fully sanitized).
This should be pretty quick to do. But again, it would be a temporary solution that wouldn't scale to many schemas...
@Nuria, what do you think?

Nuria added a comment. Tue, Dec 3, 10:42 PM

that would remove the data specific to the affected growth schemas from the event_sanitized database after 270 days (sliding window).

This would remove every event that was received 270 days ago (or more), that is, all data for the event. Is that what we want?

@Nuria, I believe, for now, that would be OK for them.
@nettrom_WMF explained that they are aiming to do short-term analyses over 270 days,
and that so far they have no interest in keeping a fully sanitized version of the data for longer.

In the future, though, this might change, he said.

@Nuria: I can confirm what @mforns mentions. During my conversations with him yesterday, it became clear to me that how the Growth team is using EventLogging is an in-between case. Since we're running fairly long experiments, we need data for longer than the default 90 days, but we also need richer data than what we'd limit ourselves to if we were to store data indefinitely. Hence a 270-day sliding window for our sanitized data would work well for us. (This is also why we asked for deletion of sanitized data in T234870 once we completed the Help Panel experiment: we could no longer keep that data around.)

Nuria added a comment. Wed, Dec 4, 5:57 PM

@nettrom_WMF which schemas are subject to this 270-day window?

Nuria added a comment. Wed, Dec 4, 8:25 PM

I see, +1 to @mforns's idea.

mforns added a comment. Wed, Dec 4, 8:34 PM

@nettrom_WMF OK, then!

I will implement a deletion timer specific to those 3 schemas,
that will delete all their data from the event_sanitized database after 270 days of collection.
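
To make the sliding-window behaviour concrete, here is a small Python sketch of the selection logic only. It is not the refinery drop-older-than script or the puppet timer. It assumes, purely for illustration, that data is stored in one partition per day, and the `partitions_to_drop` helper is hypothetical. On each run, every day that is 270 or more days old is selected for deletion, so whole events disappear once they age out.

```python
from datetime import date, timedelta
from typing import Iterable, List

RETENTION_DAYS = 270  # retention window discussed in this task


def partitions_to_drop(
    partition_days: Iterable[date],
    today: date,
    retention_days: int = RETENTION_DAYS,
) -> List[date]:
    """Return the daily partitions that are retention_days old or older.

    Assumes one partition per calendar day (an assumption of this sketch).
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(day for day in partition_days if day <= cutoff)


if __name__ == "__main__":
    today = date(2019, 12, 4)
    # Partitions that are 0, 269, 270 and 271 days old.
    existing = [today - timedelta(days=n) for n in (0, 269, 270, 271)]
    for day in partitions_to_drop(existing, today):
        print("would drop partition for", day.isoformat())
```

Run daily, a job like this keeps the window rolling: each day the oldest partition crosses the cutoff and is selected, while newer data stays available for analysis.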

On your side, please make sure the white-list includes all fields to be kept for 270 days for those schemas :]

Change 556231 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Allow drop-older-than to delete under event_sanitized

https://gerrit.wikimedia.org/r/556231

Change 556232 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::data_purge: Add growth deletion timers

https://gerrit.wikimedia.org/r/556232

nettrom_WMF moved this task from Upcoming Work to External on the Growth-Team board.