
Drop data from Prefupdate schema that is older than 90 days
Open, High, Public

Event Timeline

Nuria created this task. Apr 13 2020, 12:40 AM
Dsharpe added a subscriber: Dsharpe.

reviewed in Clinic

Milimetric assigned this task to fdans. Apr 13 2020, 3:18 PM
Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Ops Week on the Analytics board.
Milimetric added a project: Analytics-Kanban.
LGoto moved this task from Triage to Tracking on the Product-Analytics board. Apr 13 2020, 4:36 PM

On the Product Analytics team, we'd like to understand more about this task. Is background information available in the parent task? If so, I'd appreciate getting access to it so I can review it and update the team. Is the plan to drop PrefUpdate from the schema whitelist?

I'm also adding @MMiller_WMF since I'm using the PrefUpdate schema data in analysis of Growth Team experiments in order to remove users who self-select in/out of experiment groups. If PrefUpdate is removed from the whitelist, we might have to engineer a workaround until we're able to run experiments that don't require access to data beyond 90 days.

Is background information available in the parent task? If so, I'd appreciate getting access to it so I can review and update the team on it.

Added.

Ottomata reassigned this task from fdans to Milimetric. Aug 17 2020, 3:29 PM
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.
Ottomata added a subscriber: fdans.

Apologies this took so long. My plan is to apply the whitelist to the old data. I will do so by the end of the week; let me know if there are any outstanding concerns. Ping @nettrom_WMF.

@Milimetric: I've gone through the various subtasks and the changes we made before the tracking list was implemented, and I haven't found any cause for concern. The properties we want to track are listed, and that covers all current analysis needs. So applying the list to the old data should be fine.

One question I had: what do we do with the data that's in Druid? Is it in a format that requires us to do something?

Oof, good point, I hadn't thought of the Druid version of this. I suppose I'll have to wipe and reload all of that as well, to apply the whitelist.

Nuria added a comment (edited). Sep 11 2020, 6:26 PM

I do not think the Druid data has anything that needs deletion, does it?

Nuria added a comment. Sep 11 2020, 9:00 PM

Correction: while Druid only stores counts of occurrences of some properties, since https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/8f7594ac506cf398a8b8294d4a863077cbb2a3c6/includes/PrefUpdateInstrumentation.php#68 we are only storing a few properties, and for clarity the others should no longer be present even if no user data is associated with them.

I had to tackle some performance problems with the actual sanitization. Parking this here for the record. It's finishing up running now; I'll pair with someone to swap the data tomorrow.

Ok, data is swapped. @nettrom_WMF, if you don't mind, could you check out both event.PrefUpdate and event_sanitized.PrefUpdate to triple-check that they look OK and that the data you want is there?

I parked the original data on HDFS at /wmf/data/archive/backup/tmp/event_... so we can restore anything that doesn't look right. The methodology is basically what's in the Scala file I uploaded above. In short, I filtered out everything but the PHP whitelist and nulled out fields in accordance with the EventLogging whitelist.
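
The actual code is in the Scala file referenced above; as a rough illustration of the shape of that approach, here is a minimal Spark sketch. The whitelist contents, the exact field names, and the target table write are assumptions for illustration, not the real job.

```scala
// Minimal sketch of the sanitization approach described above.
// NOT the actual uploaded Scala file: whitelist contents, field names,
// and the target table are assumptions for illustration only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.StructType

object PrefUpdateSanitizeSketch {
  // Hypothetical stand-ins for the two whitelists mentioned above.
  val phpKeptProperties = Seq("skin", "mf_amc_optin")                    // PHP-side property whitelist
  val elKeptEventFields = Set("property", "isDefault", "saveTimestamp")  // EventLogging whitelist

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PrefUpdateSanitizeSketch")
      .enableHiveSupport()
      .getOrCreate()

    val raw = spark.table("event.prefupdate")
    val eventType = raw.schema("event").dataType.asInstanceOf[StructType]

    // 1) Keep only rows whose event.property is in the PHP-side whitelist.
    val filtered = raw.filter(col("event.property").isin(phpKeptProperties: _*))

    // 2) Rebuild the event struct, nulling out every field the EventLogging
    //    whitelist does not retain (original types preserved via cast).
    val sanitizedEvent = struct(eventType.fields.map { f =>
      if (elKeptEventFields.contains(f.name)) col(s"event.${f.name}").as(f.name)
      else lit(null).cast(f.dataType).as(f.name)
    }: _*)

    filtered
      .withColumn("event", sanitizedEvent)
      .write
      .insertInto("event_sanitized.prefupdate") // hypothetical target table
  }
}
```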

@Milimetric: I inspected the sanitized data by looking at the event structs of random partitions and aggregating some random months across various years from 2017 onwards, and in all cases the sanitized data looks correct to me.

I did notice that there are a lot of null values in the non-sanitized data, as you mentioned to me in our chat. If that's easy to fix, I think we should do it so nobody stumbles onto it in a report or analysis. Let me know if it's time-consuming, though, and I'll check with the PA team whether we can ignore it, since we know the problem will go away as time passes and the old data gets deleted.
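
As a rough illustration of the kind of spot check described above, a spark-shell session could inspect the sanitized struct and aggregate one month; the partition chosen and the column names here are assumptions.

```scala
// Spot-check sketch (spark-shell): inspect the sanitized event struct and
// aggregate one arbitrary month. Table and column names are assumed.
spark.sql("DESCRIBE event_sanitized.prefupdate").show(100, truncate = false)

spark.sql("""
  SELECT event.property, COUNT(*) AS events
  FROM event_sanitized.prefupdate
  WHERE year = 2018 AND month = 6
  GROUP BY event.property
  ORDER BY events DESC
""").show(50, truncate = false)
```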

Milimetric added a comment (edited). Fri, Oct 9, 5:30 PM

@nettrom_WMF it is kind of time-intensive; it would take me about two days of work. I know it seems silly, but it's just very manual, because a lot of my usual tools fail when working with so many partitions. If it would cost you or other analysts more than a day or so of work, then let's think of a solution. The other thing I would mention is that I can keep the original non-sanitized data in parallel for the normal period (until it gets more than 90 days old). It's fairly easy to query with Spark or Hive.
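
For reference, reading the parked copy back with Spark would look roughly like the snippet below. It assumes the backup kept the original Parquet files and directory layout; the path placeholder stands in for the actual backup location mentioned above.

```scala
// Sketch: query the parked, non-sanitized backup from a spark-shell session.
// Assumes the backup is Parquet; substitute the real HDFS path noted above.
// Partition columns (year/month/...) are only available if the backup
// preserved the year=/month= directory layout.
val backupPath = "/wmf/data/archive/backup/tmp/event_..." // placeholder

val parked = spark.read.parquet(backupPath)
parked.createOrReplaceTempView("prefupdate_backup")

spark.sql("""
  SELECT event.property, COUNT(*) AS events
  FROM prefupdate_backup
  WHERE year = 2020 AND month = 9
  GROUP BY event.property
""").show(truncate = false)
```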

@Milimetric: Not a problem, I definitely understand that this would be a non-standard request! I've reached out to the PA team and will report back, probably some time on Tuesday.

@Milimetric: It looks like there's no data in event_sanitized.prefupdate for 2020-09-19 through 2020-09-21, and there's partial data on 2020-09-22. Would it be possible to re-sanitize that date range, or will we need to wait for the re-sanitization script to come around again?
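
For reference, a check along these lines would show which daily partitions in that range are missing or only partially filled, assuming the usual year/month/day partitioning on the sanitized table.

```scala
// Count events per day around the swap window to spot missing or partial
// daily partitions. Partition column names are assumed.
spark.sql("""
  SELECT year, month, day, COUNT(*) AS events
  FROM event_sanitized.prefupdate
  WHERE year = 2020 AND month = 9 AND day BETWEEN 18 AND 23
  GROUP BY year, month, day
  ORDER BY day
""").show(truncate = false)
```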

@nettrom_WMF, yeah, we'll have to wait to re-sanitize, but that'll happen automatically in early November. These dates are around when I started swapping data and it didn't align perfectly. It's technically possible to re-sanitize now but it's kind of a pain and I'd rather just let the process take care of it.

(sorry I missed this last week)