
[EventLogging Sanitization] Update EL sanitization white-list for field renames in EL schemas
Closed, Resolved | Public | 3 Story Points

Description

Today we were reviewing the EL sanitization white-list (for other reasons) and we saw that the schema MobileWikiAppShareAFact had some fields renamed a couple of months ago. However, the corresponding white-list was not updated with the new field names. So, 90 days after the renames, the data in those fields will be purged (some may have been purged already).

Every time there's a rename in an EL schema, the whitelist should be updated accordingly; otherwise data loss might happen. See: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging#What_happens_when_I_rename_fields_in_a_schema?

Please, can you make sure that all your schemas that have had fields renamed are updated in the EL sanitization whitelist?

I'm subscribing other data analysts to this task so that they can check as well, in case they have schemas in this situation.
I also added Reading-analysis and Product-Analytics (sorry if those tags are not accurate).

Event Timeline

mforns created this task.Nov 8 2018, 6:51 PM
mforns added a subscriber: Nuria.
Tbayer removed Tbayer as the assignee of this task.Nov 9 2018, 4:54 PM
Tbayer added a subscriber: Tbayer.

@mforns: I assume "you" in the task description refers to me (since you assigned the task to me). I didn't have anything to do with the original creation of the schema or the field renames in question, and am not among the schema's maintainers.
We'll likely discuss this in our team meeting later today - it's probably best if the involved analysts determine the precise list of field names to be added, although I'll be happy to help submit the resulting whitelist patch, as I did earlier this week in the case of the Popups schema.

PS: and (in the name of the team) thanks for catching this!

mforns added a comment.Nov 9 2018, 4:59 PM

@Tbayer
I should have checked who that schema's maintainer was, sorry for that.
I intended it just as a heads-up, and thought of you, given that you've been our main point of contact in the past regarding EL Reading schemas in general.
Please feel free to reassign the task! I also added other analysts to the task so that they can chime in.

Nuria added a comment.Nov 9 2018, 6:46 PM

Reassigning to @bearloga, who is working with the Android team.

Nuria assigned this task to mpopov.Nov 9 2018, 6:47 PM
nettrom_WMF moved this task from Triage to Next Up on the Product-Analytics board.Nov 9 2018, 7:37 PM
mforns renamed this task from [EventLogging Sanitization] Update EL sanitization whit-elist for field renames in EL schemas to [EventLogging Sanitization] Update EL sanitization white-list for field renames in EL schemas.Nov 12 2018, 5:22 PM
fdans moved this task from Incoming to Radar on the Analytics board.Nov 12 2018, 5:23 PM
mforns added a comment.Feb 8 2019, 4:31 PM

@mpopov Just a heads-up that we'll be turning on the deletion script that will delete unsanitized EL data very soon.
If there are no changes to the EL sanitization whitelist, the data that this task refers to will be irrecoverable after the script runs.
If you want me to backfill (thus fix) any renamed fields that are missing today, now would be the time! :]
If so, please send a patch to https://github.com/wikimedia/analytics-refinery/blob/master/static_data/eventlogging/whitelist.yaml adding (not replacing) the new names of the fields that were renamed.
Thanks!
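
To illustrate why the new names should be added rather than swapped in, here is a minimal sketch of whitelist-based filtering. It is not the refinery implementation: the `sanitize` helper, the set-based whitelist layout, and the field names `old_name` and `new_name` are all hypothetical, chosen only to show the effect of a rename.

```python
from typing import Any, Dict, Optional, Set

# Hypothetical whitelists: schema name -> set of field names to keep.
# The real whitelist.yaml layout and the real field names may differ.
OLD_ONLY: Dict[str, Set[str]] = {
    "MobileWikiAppShareAFact": {"action", "old_name"},
}
OLD_AND_NEW: Dict[str, Set[str]] = {
    "MobileWikiAppShareAFact": {"action", "old_name", "new_name"},
}


def sanitize(
    schema: str, event: Dict[str, Any], whitelist: Dict[str, Set[str]]
) -> Optional[Dict[str, Any]]:
    """Keep only whitelisted fields; return None if the schema is not listed."""
    allowed = whitelist.get(schema)
    if allowed is None:
        return None
    return {key: value for key, value in event.items() if key in allowed}


# An event produced before the rename and one produced after it.
before_rename = {"action": "share", "old_name": "abc"}
after_rename = {"action": "share", "new_name": "abc"}

# With only the old name whitelisted, the renamed field is dropped.
lossy = sanitize("MobileWikiAppShareAFact", after_rename, OLD_ONLY)

# With both names whitelisted, old and new events both keep the field.
kept_old = sanitize("MobileWikiAppShareAFact", before_rename, OLD_AND_NEW)
kept_new = sanitize("MobileWikiAppShareAFact", after_rename, OLD_AND_NEW)
```

Under these assumptions, replacing `old_name` with `new_name` would instead drop the field from events collected before the rename, which is why both names stay listed.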

@mpopov Just a heads-up that we'll be turning on the deletion script that will delete unsanitized EL data very soon.
If there are no changes to the EL sanitization whitelist, the data that this task refers to will be irrecoverable after the script runs.
If you want me to backfill (thus fix) any renamed fields that are missing today, now would be the time! :]

Can you make sure my VisualEditorFeatureUse whitelist patch (T212588) is merged and deployed before you do that? Thanks!

@Neil_P._Quinn_WMF

Can you make sure my VisualEditorFeatureUse whitelist patch (T212588) is merged and deployed before you do that? Thanks!

Sure, we plan to deploy refinery this week.
However, this will only guarantee that events collected from now on will be copied over to the event_sanitized database and kept indefinitely*.
The events already collected between 2018-10-24 and now would be lost*.
Do you want me to backfill the schema to event_sanitized since the start of collection for this schema?

(*) Soon, we'll be adding a new sanitization pass 45 days after collection, so that teams will have 45 days to write a whitelist patch and get it merged after the start of event collection. In this hopefully near future :] we'd like teams to submit their whitelist patches within that time range, to avoid having to backfill, because it's time-consuming.

Change 490212 had a related patch set uploaded (by Chelsyx; owner: Chelsyx):
[analytics/refinery@master] Add updated field names of MobileWikiAppShareAFact to EventLogging whitelist

https://gerrit.wikimedia.org/r/490212

Since MobileWikiAppShareAFact may be useful for iOS in the future, I submitted a patch to add the updated field names: https://gerrit.wikimedia.org/r/490212
@mpopov Please feel free to make changes if you think differently.

Change 490212 merged by Mforns:
[analytics/refinery@master] Add updated field names of MobileWikiAppShareAFact to EventLogging whitelist

https://gerrit.wikimedia.org/r/490212

Thanks @chelsyx!
Please, can you tell me from what date I should backfill MobileWikiAppShareAFact?
Meaning, when did events with renamed fields start flowing in for this schema?

@chelsyx, or did you add those fields only for events that will flow in from now on?

Nuria added a comment.Feb 13 2019, 4:33 PM

@mforns, @chelsyx When fields are renamed, I think it would be useful to keep both the old and new names in the whitelist, so that no backfilling is required. Can we change the whitelist for this schema such that it does not require backfilling? If both names are present, the sweeper process for sanitization should take care of newly added fields.

Let me know if this makes sense.

@Nuria the change @chelsyx posted does indeed add renamed fields and keeps the old ones as you suggest.
The backfilling would still be needed: since some of the schema fields were renamed, those fields have not been copied over to the event_sanitized database, because their new names were not in the whitelist.

@mforns According to https://meta.wikimedia.org/w/index.php?title=Schema:MobileWikiAppShareAFact&diff=prev&oldid=18118760, it looks like we should backfill since 11 June 2018, but I don't know exactly when the data started flowing in...

Nuria added a comment.Feb 13 2019, 6:40 PM

@chelsyx we will backfill as much as we can but we might not have as much data.

Found (and fixed) an oversight regarding ReadingDepth: T216096

@chelsyx @mpopov @Neil_P._Quinn_WMF (cc @Nuria)
I backfilled both schemas since the discussed dates: event_sanitized.VisualEditorFeatureUse since 2018-10-24 and event_sanitized.MobileWikiAppShareAFact since 2018-06-21. I vetted the resulting data and it all looks good to me, but please give a quick check to confirm.
Thanks for the changes you guys did! Now I will proceed to productionize the purging script that will delete events older than 90 days from the raw events database (event). This will happen on Feb 28th.
Cheers!
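
For reference, the 90-day retention arithmetic for the Feb 28th run can be sketched as follows. This is only an illustration: the `purge_cutoff` and `is_purgeable` helpers are hypothetical, and treating "older than 90 days" as a strict comparison on calendar dates is an assumption, not a description of the actual purging script.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # retention period mentioned in this task


def purge_cutoff(run_date: date, retention_days: int = RETENTION_DAYS) -> date:
    """Return the date that lies exactly retention_days before the run date."""
    return run_date - timedelta(days=retention_days)


def is_purgeable(event_date: date, run_date: date) -> bool:
    """Assumption: an event is purged only if strictly older than the cutoff."""
    return event_date < purge_cutoff(run_date)


first_run = date(2019, 2, 28)
cutoff = purge_cutoff(first_run)
print(cutoff.isoformat())  # 2018-11-30
```

Under that assumption, raw events from before 2018-11-30 (including the early VisualEditorFeatureUse and MobileWikiAppShareAFact data) would only survive in event_sanitized, which is why the backfill had to land first.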

mforns claimed this task.Feb 22 2019, 4:07 PM
mforns added a project: Analytics-Kanban.
mforns moved this task from Next Up to Done on the Analytics-Kanban board.

I just checked event_sanitized.MobileWikiAppShareAFact, and it looks good to me. Thanks @mforns !

Nuria closed this task as Resolved.Feb 25 2019, 10:54 PM
Nuria set the point value for this task to 3.

@chelsyx thanks for the check!

Change 493424 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[analytics/refinery@master] Update whitelisting for Android-related schemas

https://gerrit.wikimedia.org/r/493424

Change 493424 merged by Mforns:
[analytics/refinery@master] Update whitelisting for Android-related schemas

https://gerrit.wikimedia.org/r/493424