Whitelist sample flags and page/rev ID fields for ReadingDepth schema
Closed, Resolved | Public | 1 Estimated Story Points

Authored By

	• Tbayer
	Feb 14 2019, 12:52 AM

Description

It looks like after extending this schema to support and distinguish events from different samples (experiments), we forgot to update the whitelist with the corresponding boolean fields: default_sample, page-issues-a_sample, page-issues-b_sample.

Also, following the decision in T209051: ReadingDepth schema is whitelisting both session ids and page ids (to keep page information and drop session information), we have not yet whitelisted the actual pageID and revisionID fields in addition to the pageTitle field, probably because these two were added in a separate effort during that task.

(Lastly, the purge policy on the schema talk page was not updated with T209051, I can take care of that.)

Related Objects

Mentioned In: T214093: Modern Event Platform: Schema Guidelines and Conventions
T209051: ReadingDepth schema is whitelisting both session ids and page ids
T209087: [EventLogging Sanitization] Update EL sanitization white-list for field renames in EL schemas
Mentioned Here: T209051: ReadingDepth schema is whitelisting both session ids and page ids

Event Timeline

• Tbayer created this task. Feb 14 2019, 12:52 AM

• Nuria moved this task from Incoming to Radar on the Analytics board. Feb 14 2019, 12:56 AM

• Tbayer mentioned this in T209087: [EventLogging Sanitization] Update EL sanitization white-list for field renames in EL schemas. Feb 14 2019, 1:08 AM

• Tbayer mentioned this in T209051: ReadingDepth schema is whitelisting both session ids and page ids. Feb 14 2019, 1:17 AM

PS: patch is at https://gerrit.wikimedia.org/r/490514 (seems @gerritbot is lagging a bit currently)

NB: The names of these sample fields are spelled with underscores in Hive (e.g. page_issues_b_sample, see below) but with dashes on the schema page (e.g. page-issues-b_sample). Which version does the whitelist require?

hive (default)> DESCRIBE event.readingdepth;
OK
col_name	data_type	comment
dt                  	string              	                    
event               	struct<action:string,domInteractiveTime:bigint,firstPaintTime:bigint,isAnon:boolean,namespaceId:bigint,pageTitle:string,pageToken:string,sessionToken:string,skin:string,totalLength:bigint,visibleLength:bigint,default_sample:boolean,page_issues_a_sample:boolean,page_issues_b_sample:boolean>	                    
recvfrom            	string
...
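
To compare spellings without reading the long struct type by eye, the field names can be pulled out of the `event` column type. This is only a sketch: `struct_field_names` is a hypothetical helper, not part of any EL tooling, and it assumes a flat struct with no nested or parameterized types, as in the DESCRIBE output above.

```python
from typing import List


def struct_field_names(hive_type: str) -> List[str]:
    """Return the field names of a flat Hive struct<...> type string.

    Assumption: the struct is flat (no nested struct/array/map types),
    so splitting on commas is safe.
    """
    prefix, suffix = "struct<", ">"
    if not (hive_type.startswith(prefix) and hive_type.endswith(suffix)):
        raise ValueError("expected a flat struct<...> type")
    body = hive_type[len(prefix):-len(suffix)]
    # Each part looks like "name:type"; keep only the name.
    return [part.split(":", 1)[0] for part in body.split(",") if part]


def missing_from_struct(candidates: List[str], hive_type: str) -> List[str]:
    """Return the candidate names that do not appear in the struct."""
    present = set(struct_field_names(hive_type))
    return [name for name in candidates if name not in present]
```

Passing the `event` type string together with both the dashed and the underscored spellings to `missing_from_struct` shows which spelling Hive actually stores.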

@Jdrewniak points out that in https://github.com/wikimedia/mediawiki-skins-MinervaNeue/blob/f07985c6dee5106da8f381a47214e7349fcd147e/resources/skins.minerva.scripts/pageIssuesLogger.js#L65 the spelling is still page-issues-b_sample/ page-issues-a_sample (i.e. like on the schema page, not like in Hive).

ovasileva added a project: Web-Team-Backlog (Tracking). Feb 14 2019, 11:05 AM

Blocked on code review and an answer to T216096#4953210 from someone familiar with the whole EL pipeline and the purging mechanism (@mforns?).

@Tbayer Thanks for spotting this!

Even if the events arrive in HDFS with the hyphenated page-issues-X_sample notation, the EL Refine process normalizes them to all underscores before writing them to Hive. So the EL sanitization whitelist needs the all-underscore notation. I will modify your patch and merge it (except for that detail, it LGTM!). I will also sanitize this schema retroactively for a couple of months, since I'm already doing some maintenance work on the EL sanitization pipeline. Hope this helps. I'll ping you in this task when finished. Cheers!
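
For illustration, the hyphen-to-underscore mapping described above can be sketched as follows. This is not the actual Refine code: `normalize_field_name` is a hypothetical helper, and it assumes that only hyphens need replacing; the real normalization may apply further rules.

```python
def normalize_field_name(name: str) -> str:
    """Map a schema-page field name to the spelling expected in Hive.

    Assumption: replacing hyphens with underscores is the only change
    needed for the sample fields discussed in this task.
    """
    return name.replace("-", "_")


def whitelist_names(schema_fields):
    """Return the normalized names to use as whitelist entries."""
    return [normalize_field_name(field) for field in schema_fields]
```

Applied to the three sample fields from the description, this yields default_sample, page_issues_a_sample and page_issues_b_sample, matching the DESCRIBE output.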

Great, thanks a lot! The sample fields were introduced in September, so no need to go further back. (CC @Groceryheist )

• Tbayer mentioned this in T214093: Modern Event Platform: Schema Guidelines and Conventions. Feb 20 2019, 10:41 PM

@Tbayer event_sanitized.readingdepth is backfilled using the new whitelist.
I have vetted the resulting data and it looks good to me, but please do a quick check.
Note that on the 28th of this month (in one week) we'll run the purging script to delete data older than 90 days from the event database, so backfills like this will no longer be possible for data older than 90 days.
Cheers!

mforns claimed this task. Feb 22 2019, 3:59 PM

mforns added a project: Analytics-Kanban.

mforns moved this task from Next Up to Done on the Analytics-Kanban board.

ovasileva added a project: Reading Depth. Feb 25 2019, 4:31 PM

• Nuria closed this task as Resolved. Feb 25 2019, 10:53 PM

• Nuria set the point value for this task to 1.

In T216096#4976316, @mforns wrote:

@Tbayer event_sanitized.readingdepth is backfilled using the new whitelist.
I have vetted the resulting data and it looks good to me, but please do a quick check.

I spot-checked by comparing the daily number of events with each sample field set between the sanitized and unsanitized version, and they matched. Thanks again!
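
A spot check of this kind could be scripted roughly as below, once the daily counts have been exported from both tables. The input shape is an assumption: each dict maps a hypothetical (day, sample_field) key to the number of events with that field set, and `find_count_mismatches` is an illustrative helper, not an existing tool.

```python
def find_count_mismatches(unsanitized, sanitized):
    """Compare two {(day, sample_field): count} dicts.

    Returns {key: (unsanitized_count, sanitized_count)} for every key
    whose counts differ; a key missing on one side counts as 0 there.
    """
    mismatches = {}
    for key in sorted(set(unsanitized) | set(sanitized)):
        before = unsanitized.get(key, 0)
        after = sanitized.get(key, 0)
        if before != after:
            mismatches[key] = (before, after)
    return mismatches
```

An empty result means the sanitized and unsanitized counts agree for every day and sample field that was exported.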

@Tbayer thanks for the check!

kzimmerman moved this task from Epics to Blocked on the Product-Analytics board. Aug 9 2019, 5:55 PM

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM
