Page MenuHomePhabricator

[EventLogging Sanitization] Enable older-than-90-day purging of unsanitized EL database (event) in Hive
Closed, ResolvedPublic0 Estimated Story Points

Description

The script that will purge unsanitized EventLogging Hive data (event database) after 90 days is ready, see T199836.
We'd like to activate it as soon as possible to comply with the data retention guidelines.
So, we'd like to confirm with Hive EventLogging users that nobody is still depending on the unsanitized version of data.
Remind you that the sanitized version lives in Hive's event_sanitized database, is periodically populated according to the EL sanitization white-list and has data since Nov 2017.

Event Timeline

mforns triaged this task as Medium priority.Nov 14 2018, 4:34 PM
mforns created this task.

@mpopov @Tbayer
Hi! I'd like to confirm that you guys are OK with us activating the script that will delete all events older than 90 days in Hive's 'event' database (unsanitized data).
As we discussed in earlier threads:

  • All instances of the app_install_id field are being kept indefinitely in a sanitized form: salted hash with rotating salt every 3 months, see: T198426, T199902.
  • As requested per @mpopov, the old salt is being kept for 2 extra weeks after salt rotation (end of quarter) to allow for consistent backfilling of the event_sanitized database in case of issues: T199899, T199900.

@leila @bmansurov
Hi! I'd also like to confirm with you guys that it's OK to activate the script that will delete all events older than 90 days from Hive's event database (unsanitized), so you'll be left only with the sanitized version of it in event_sanitized database. I believe the main point we wan to discuss here is how we keep unsanitized CitationUsage events while we figure out a way to handle those with Legal. On my end, I'd be happy to white-list all fields temporarily, given that we continue an active conversation with Legal to solve this in the short term. Would that be OK with you? Do you see any other concerns in activating the script regarding EL data belonging to the Research team?

@mpopov @Tbayer @leila @bmansurov
Oh forgot... Please, feel free to subscribe other people that you think might be interested in participating in this discussion to this task. Thanks!

@mforns the best option to move forward on our end is to make a copy of the specific parts of the table we're interested in and keep those in HDFS. As you know, mistakes happen at the time of purging and we can't afford data loss in this case. Once we do this, you can purge without white-listing. Given that there is a discussion with Legal about this schema, we can then come back to the data in HDFS and purge as needed. Does this sound good to you?

@leila Yes, makes sense to me! When you say "copy specific parts of the table" you mean specific time ranges, no? Sure, let's do that.

Also, @mpopov, we should probably fix the white-list to include the recent (or any other) renames to EL schema fields T209087, and backfill sanitization before we activate the purging script. Otherwise, data will be lost for those renamed fields.

Also, @Neil_P._Quinn_WMF, @nettrom_WMF and @chelsyx, please check out this task. I don't recall there was any pending issue on your side before we can proceed, but just in case. Thanks!

Just to double-check: The information in the documentation that "Sanitization happens right after events are generated (with a couple hours lag)" is still current, right? In that case I don't think this will be a concern (although we will need to update some queries - CCing @Groceryheist regarding ReadingDepth).

@leila Yes, makes sense to me! When you say "copy specific parts of the table" you mean specific time ranges, no? Sure, let's do that.

Yup, keeping time range as a filter, but also potentially dropping other fields which we may not need, if any. I will coordinate with Miriam off-this-task and we will give you the clear signal soon.

@leila

Yup, keeping time range as a filter, but also potentially dropping other fields which we may not need, if any. I will coordinate with Miriam off-this-task and we will give you the clear signal soon.

Great :] thank you!

@Tbayer

Just to double-check: The information in the documentation that "Sanitization happens right after events are generated (with a couple hours lag)" is still current, right? In that case I don't think this will be a concern (although we will need to update some queries - CCing @Groceryheist regarding ReadingDepth).

Yes, it still happens this way. We are considering to shift if back, but will let you know if we do.

@Tbayer

Just to double-check: The information in the documentation that "Sanitization happens right after events are generated (with a couple hours lag)" is still current, right? In that case I don't think this will be a concern (although we will need to update some queries - CCing @Groceryheist regarding ReadingDepth).

Yes, it still happens this way. We are considering to shift if back, but will let you know if we do.

I agree with Tilman—as long as the sanitized version is available more or less immediately, this won't be a problem, because we won't have to write two separate queries to access all the available whitelisted data. On the other hand, delaying the sanitization significantly sounds like it would be a big annoyance.

@Tbayer @Neil_P._Quinn_WMF

On the other hand, delaying the sanitization significantly sounds like it would be a big annoyance.

Makes, sense. We'll take this into account.

Hi @leila! Have you guys copied the data you need? Thanks

Hey @mforns! Not yet, we will be done by next week. Thanks!

@leila and @Miriam, can you please leave me a couple days to execute the deletion script before the Christmas vacation kicks in (end of quarter)? Thanks!

Oh, and if you guys need any help copying/formatting that data, you can ping me and I'll try to help.

@mforns thanks so much! Would starting executing the script next Monday be ok? Would this leave you enough time? We are doing the last checks and we should be done by the end of the week!

@mforns
Did we updated docs to note that whitelist changes have to go in when schema does? Otherwise data will get purged in the absence of a whitelist. We should make sure to advertise this fact.

I've been doing some data vetting for the last 2 days. I've found 2 minor issues with the data:

  • Some fields, that should be copied over to the event_sanitized database, are being nullified. Those correspond to field renames on EL schemas that were done in the past, without mirroring the changes on the EL sanitization whitelist. Since the renames, the fields with new names are not in the whitelist and are not persisted in event_sanitized. We pinged the schema owners in T209087 a couple months ago, but the problem persists. I pinged them again now.
  • From 2018-12-01T00 until 2018-12-14T17 the 3 capsule fields represented originally with camelCase (webHost, seqId, recvFrom) were not copied over to the event_sanitized database. The problem seems to be a change to Refine that was deployed on 2018-11-29. Note: userAgent field was also affected, but it is usually not whitelisted, so it was nullified anyway. The cases that parts of the userAgent are whitelisted, were previously fixed in T211833. A backfill of all tables for that period would fix the problem. But do we need to backfill? Those 3 fields are rarely used: webHost has an analogous field - wiki; seqId doesn't provide any other advantage than uuid; recvFrom is barely used.

Change 489817 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Lowercase capsule fields in EL sanitization whitelist

https://gerrit.wikimedia.org/r/489817

Change 489817 merged by Mforns:
[analytics/refinery@master] Lowercase capsule fields in EL sanitization whitelist

https://gerrit.wikimedia.org/r/489817

@Tbayer @Neil_P._Quinn_WMF @Miriam @leila @mpopov @bmansurov @nettrom_WMF @chelsyx @Nuria

Sorry for the mass ping.

After last round of bug-fixes, white-list corrections and backfillings to the event_sanitized database, the (unsanitized) event database is ready to be purged.
I will proceed now to productionize the deletion script that will periodically delete data older than 90 days from the event database.
It will run for the first time on Feb 28th.

Cheers!

@mforns I may have missed this: where is the list of whitelisted event databases? I'll double-check it before you start purging. @Miriam FYI.

@leila it is not a whitelist of databases but rather schema fields per schema, every schema owner sends code patches for list and @bmansurov has been doing that for research when it pertained.

@leila, +1 to Nuria, but I think you're just confused by my latest message.
We already discussed this 14th of November onwards.
As I understood it from @Miriam's comments, you guys copied the CitationUsage schemas over to another location, right?
So, the script won't affect that copy.
I was just giving a last ping after final deletion.
Please, confirm :]

Here's the plan for tomorrow:

  1. Collect the names of all tables in the event database that belong to the EL pipeline.
  2. For each table T, delete /wmf/data/event/T/year=2017 and /wmf/data/event/T/year=2018/month=M with M in (1,2,...10).
  3. For each table T, execute an msck repair table T command in Hive.
  4. Execute the deletion script (refinery-drop-older-than) once with --older-than=90 to delete the last due days.
  5. Productionize a systemd timer that calls the deletion script periodically (every day) in puppet.

Ok, starting the deletion now.

The deletion is finished now. The event database only contains now the last 90 days of data.
We're still deleting the corresponding Hive partitions (meta-data), maybe this causes some warnings when querying data.
The estimated time of completion of the partition meta-data synch'ing is in 24 hours.

@GoranSMilovanovic I found 2 schemas that I wasn't aware of, that I believe belong to WMDE:
WMDEBannerEvents and WMDEBannerSizeIssue.
I didn't delete them yet, in case their owner wasn't aware of the 90-day deletion policy.
Are you the owner of those schemas?

@mforns Hi. No, I do not own WMDEBannerEvents of WMDEBannerSizeIssue, but maybe @kai.nissen does?

Change 493687 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Add timer to delete analytics EL unsanitized events after 90d

https://gerrit.wikimedia.org/r/493687

Change 493687 merged by Elukey:
[operations/puppet@production] Add timer to delete analytics EL unsanitized events after 90d

https://gerrit.wikimedia.org/r/493687

Change 494495 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Fix small bug in refinery-drop-older-than

https://gerrit.wikimedia.org/r/494495

Change 494495 merged by Ottomata:
[analytics/refinery@master] Fix small bug in refinery-drop-older-than

https://gerrit.wikimedia.org/r/494495