
Whitelist sample flags and page/rev ID fields for ReadingDepth schema
Closed, Resolved · Public · 1 Estimated Story Point


It looks like after extending this schema to support and distinguish events from different samples (experiments), we forgot to update the whitelist with the corresponding boolean fields: default_sample, page-issues-a_sample, page-issues-b_sample.

Also, after the decision in T209051: ReadingDepth schema is whitelisting both session ids and page ids (to keep page information and drop session information), we have not yet whitelisted the actual pageID and revisionID fields in addition to the pageTitle field, probably because these two had been added in a separate effort during that task.
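For illustration, the missing entries in the EventLogging sanitization whitelist might look roughly like this. This is a hypothetical fragment: the exact file layout, the `keep` keyword, and the Hive spellings of pageID/revisionID are assumptions (see the underscore discussion below).

```yaml
# Hypothetical whitelist fragment for the ReadingDepth schema.
# Nested keys mirror the Hive table structure; the sample flags use
# the underscore (Hive) spelling.
ReadingDepth:
    event:
        pageTitle: keep            # already whitelisted
        pageID: keep               # to add (Hive identifiers are case-insensitive)
        revisionID: keep           # to add
        default_sample: keep
        page_issues_a_sample: keep
        page_issues_b_sample: keep
```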

(Lastly, the purge policy on the schema talk page was not updated with T209051, I can take care of that.)

Event Timeline

NB: These sample fields are spelled with underscores in Hive (e.g. page_issues_b_sample, see below) but with dashes on the schema page (e.g. page-issues-b_sample). Which version does the whitelist require?

hive (default)> DESCRIBE event.readingdepth;
col_name	data_type	comment
dt                  	string              	                    
event               	struct<action:string,domInteractiveTime:bigint,firstPaintTime:bigint,isAnon:boolean,namespaceId:bigint,pageTitle:string,pageToken:string,sessionToken:string,skin:string,totalLength:bigint,visibleLength:bigint,default_sample:boolean,page_issues_a_sample:boolean,page_issues_b_sample:boolean>	                    
recvfrom            	string
Tbayer added a subscriber: mforns.

Blocked on code review and an answer to T216096#4953210 from someone familiar with the whole EL pipeline and the purging mechanism (@mforns?).

@Tbayer Thanks for spotting this!

Even if the events arrive in HDFS with the page-issues-X_sample hyphen notation, the EL Refine process normalizes them to all underscores before writing them to Hive. So the notation that the EL sanitization whitelist needs is all underscores. I will modify your patch and merge it (except for that detail, it LGTM!). I will also sanitize this schema retroactively for a couple of months, since I'm already doing some maintenance work on the EL sanitization pipeline. Hope this helps. Will ping you in this task when finished. Cheers!
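The normalization described above can be sketched as a simple string transform. This is a minimal illustration of the renaming rule, not the actual Refine code:

```python
def normalize_field_name(name: str) -> str:
    """Normalize an EventLogging field name to its Hive column spelling:
    hyphens are not valid in Hive identifiers, so they become underscores."""
    return name.replace("-", "_")


# Sample-flag names as spelled on the schema page...
schema_spellings = ["default_sample", "page-issues-a_sample", "page-issues-b_sample"]

# ...and as they end up in the event.readingdepth Hive table.
hive_spellings = [normalize_field_name(name) for name in schema_spellings]
print(hive_spellings)
# ['default_sample', 'page_issues_a_sample', 'page_issues_b_sample']
```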

Great, thanks a lot! The sample fields were introduced in September, so no need to go further back. (CC @Groceryheist )

@Tbayer event_sanitized.readingdepth is backfilled using the new whitelist.
I have vetted the resulting data and it looks good to me, but please do a quick check.
Note that on the 28th of this month (in one week) we'll execute the purging script that deletes data older than 90 days from the event database, so backfills like this will no longer be possible (beyond 90 days).

mforns added a project: Analytics-Kanban.
mforns moved this task from Next Up to Done on the Analytics-Kanban board.
Nuria set the point value for this task to 1.


I spot-checked by comparing the daily number of events with each sample field set between the sanitized and unsanitized version, and they matched. Thanks again!
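A spot check like this amounts to running the same per-day aggregation against both tables and diffing the results. A sketch of the query pair (the template and the year/month/day partition columns are assumptions; only the table and field names come from this task):

```python
# Illustrative sketch: the same aggregation is run against the unsanitized
# and the sanitized table; since sanitization drops fields, not events,
# the two result sets should match row for row.
QUERY_TEMPLATE = """
SELECT year, month, day,
       SUM(IF(event.default_sample, 1, 0))       AS default_sample_events,
       SUM(IF(event.page_issues_a_sample, 1, 0)) AS page_issues_a_events,
       SUM(IF(event.page_issues_b_sample, 1, 0)) AS page_issues_b_events
FROM {table}
GROUP BY year, month, day
"""


def spot_check_queries() -> tuple:
    """Return the pair of queries whose results should be identical."""
    return (QUERY_TEMPLATE.format(table="event.readingdepth"),
            QUERY_TEMPLATE.format(table="event_sanitized.readingdepth"))
```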