Page MenuHomePhabricator

Whitelist sample flags and page/rev ID fields for ReadingDepth schema
Closed, ResolvedPublic1 Story Points


It looks like after extending this schema to support and distinguish events from different samples (experiments), we forgot to update the whitelist with the corresponding boolean field fields: default_sample, page-issues-a_sample, page-issues-b_sample.

Also, after the decision in T209051: ReadingDepth schema is whitelisting both session ids and page ids (to keep page information and drop session information) we did not yet whitelist the actual pageID and revisionID in addition to pageTitle field, probably because these two had been added in a separate effort during that task.

(Lastly, the purge policy on the schema talk page was not updated with T209051, I can take care of that.)

Event Timeline

Nuria moved this task from Incoming to Radar on the Analytics board.Feb 14 2019, 12:56 AM

PS: patch is at (seems @gerritbot is lagging a bit currently)

NB: The names of these sample field names are spelled with underscores in Hive (e.g. page_issues_b_sample, see below) but with dashes in the schema page (e.g. page-issues-b_sample ). Which version does the whitelist require?

hive (default)> DESCRIBE event.readingdepth;
col_name	data_type	comment
dt                  	string              	                    
event               	struct<action:string,domInteractiveTime:bigint,firstPaintTime:bigint,isAnon:boolean,namespaceId:bigint,pageTitle:string,pageToken:string,sessionToken:string,skin:string,totalLength:bigint,visibleLength:bigint,default_sample:boolean,page_issues_a_sample:boolean,page_issues_b_sample:boolean>	                    
recvfrom            	string

@Jdrewniak points out that in the spelling is still page-issues-b_sample/ page-issues-a_sample (i.e. like on the schema page, not like in Hive).

Tbayer moved this task from Triage to Blocked on the Product-Analytics board.Feb 14 2019, 7:29 PM
Tbayer added a subscriber: mforns.

Blocked on code review and an answer to T216096#4953210 from someone familiar with the whole EL pipeline and the purging mechanism (@mforns?).

@Tbayer Thanks for spotting this!

Even if the events arrive to HDFS with the page-issues-X_sample hyphen notation, the EL Refine process normalizes them to all underscores before writing them to Hive. So the notation that the EL sanitization whitelist needs is all underscores. I will modify your patch and merge it (except for that detail, it LGTM!). I also will sanitize this schema retroactively for a couple months, since I'm already doing some maintenance work on EL sanitization pipeline. Hope this helps. Will ping you in this task when finished. Cheers!

Great, thanks a lot! The sample fields were introduced in September, so no need to go further back. (CC @Groceryheist )

@Tbayer event_sanitized.readingdepth is backfilled using the new whitelist.
I have vetted the resulting data and it looks good to me, but please do a quick check.
Note that the 28th of this month (in one week) we'll execute the purging script to delete data older than 90 days on the event database, and backfillings like this will not be possible any more (after 90 days).

mforns claimed this task.Feb 22 2019, 3:59 PM
mforns added a project: Analytics-Kanban.
mforns moved this task from Next Up to Done on the Analytics-Kanban board.
Nuria closed this task as Resolved.Feb 25 2019, 10:53 PM
Nuria set the point value for this task to 1.

@Tbayer event_sanitized.readingdepth is backfilled using the new whitelist.
I have vetted the resulting data and it looks good to me, but please do a quick check.

I spot-checked by comparing the daily number of events with each sample field set between the sanitized and unsanitized version, and they matched. Thanks again!

@Tbayer thanks for the check!