|Open||Milimetric||T250049 Drop data from Prefupdate schema that is older than 90 days|
|Resolved||nettrom_WMF||T250857 Identify pending analyses needing access to data older than 90 days|
In the Product Analytics team we'd like to understand more about this task. Is background information available in the parent task? If so, I'd appreciate getting access to it so I can review and update the team on it. Is the plan to drop PrefUpdate from the schema whitelist?
I'm also adding @MMiller_WMF since I'm using the PrefUpdate schema data in analysis of Growth Team experiments in order to remove users who self-select in/out of experiment groups. If PrefUpdate is removed from the whitelist, we might have to engineer a workaround until we're able to run experiments that don't require access to data beyond 90 days.
@Milimetric : I've gone through the various subtasks and the changes we made before the tracking list was implemented, and I haven't found any cause for concern. The properties we want to track are on the list, and that covers all current analysis needs. So applying the list to the old data should be fine.
One question I had: what do we do with the data that's in Druid? Is that in a format that requires us to do something?
Correction: while Druid only stores counts of occurrences of some properties, since https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/8f7594ac506cf398a8b8294d4a863077cbb2a3c6/includes/PrefUpdateInstrumentation.php#68 we are only storing a few properties. For clarity, the others should no longer be present, even if no user data is associated with them.
I had to tackle some performance problems with the actual sanitization; parking this here for the record. It's finishing up running now, and I'll pair with someone to swap the data tomorrow.
Ok, data is swapped. @nettrom_WMF if you don't mind, could you check out both event.PrefUpdate and event_sanitized.PrefUpdate to triple check that they look ok and the data you want is there?
I parked the original data on HDFS at /wmf/data/archive/backup/tmp/event_... so we can restore anything that doesn't look right. The methodology is basically described in the Scala file I uploaded above. In short, I filtered out everything but the PHP whitelist and nulled out fields in accordance with the EventLogging whitelist.
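For readers following along, the two-step sanitization described above can be sketched roughly like this. This is a minimal plain-Python illustration, not the actual Spark/Scala job; the whitelist contents and field names are hypothetical:

```python
# Sketch of the sanitization: (1) drop events whose property is not on the
# PHP (instrumentation) whitelist, (2) null out any remaining field that is
# not on the EventLogging whitelist. All names here are hypothetical.

PHP_WHITELIST = {"skin", "language"}                 # properties still instrumented
EVENTLOGGING_WHITELIST = {"property", "isDefault"}   # fields allowed to keep values

def sanitize(event):
    if event.get("property") not in PHP_WHITELIST:
        return None  # event filtered out entirely
    # keep whitelisted fields, null out everything else
    return {k: (v if k in EVENTLOGGING_WHITELIST else None)
            for k, v in event.items()}

events = [
    {"property": "skin", "value": "vector", "isDefault": "false"},
    {"property": "secret-pref", "value": "x", "isDefault": "true"},
]
sanitized = [s for s in (sanitize(e) for e in events) if s is not None]
# only the whitelisted event survives, with its "value" field nulled out
```

In the real job the same logic runs as Spark transformations over the Hive partitions, which is why performance tuning was needed.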
@Milimetric : I inspected the sanitized data by looking at the event structs of random partitions and aggregating some random months across various years from 2017 onwards, and in all cases the sanitized data looks correct to me.
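The kind of spot check described above can be sketched as follows. This is plain Python over a small sample rather than the actual Spark aggregation over random partitions, and the property names are hypothetical:

```python
from collections import Counter

# Spot check: aggregate per-property counts from a sample of sanitized events
# and confirm that no unexpected (non-whitelisted) property survived.
ALLOWED = {"skin", "language"}  # hypothetical whitelist

def unexpected_properties(events):
    counts = Counter(e.get("property") for e in events)
    return {p: n for p, n in counts.items() if p not in ALLOWED}

sample = [{"property": "skin"}, {"property": "language"}, {"property": "skin"}]
assert unexpected_properties(sample) == {}  # sanitized data looks correct
```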
I did notice that there are a lot of null values in the non-sanitized data, as you mentioned in our chat. If that's easy to fix, I think we should, so nobody stumbles onto it in a report or analysis. Let me know if it's time-consuming, though, and I'll check with the PA team whether we can ignore it, since we know the problem will go away as time passes and the old data gets deleted.
@nettrom_WMF it is kind of time-intensive; it would take me about two days of work. I know it seems silly, but it's just very manual because a lot of my usual tools fail when working with so many partitions. If it would cost you or other analysts more than a day or so of work, then let's think of a solution. The other thing I'd mention is that I can keep the original non-sanitized data in parallel for the normal retention period (until it gets > 90 days old). It's fairly easy to query with Spark or Hive.
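For anyone relying on the parked copy, the 90-day cutoff mentioned above is simple to compute; here is a minimal sketch of the retention check (the dates used are hypothetical):

```python
from datetime import date, timedelta

def is_past_retention(partition_date, today, retention_days=90):
    """True once a partition is older than the retention window and
    would no longer be kept in the raw (non-sanitized) tables."""
    return (today - partition_date) > timedelta(days=retention_days)

# a partition from four months ago is past retention...
assert is_past_retention(date(2020, 6, 1), today=date(2020, 10, 1))
# ...but one from two weeks ago is not
assert not is_past_retention(date(2020, 9, 15), today=date(2020, 10, 1))
```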
@nettrom_WMF, yeah, we'll have to wait to re-sanitize, but that'll happen automatically in early November. The dates are off because of when I started swapping data; it didn't align perfectly. It's technically possible to re-sanitize now, but it's kind of a pain and I'd rather just let the process take care of it.
(sorry I missed this last week)