
Add is_pageview as a dimension to the 'webrequest_sampled_128' Druid dataset
Closed, Resolved | Public | 1 Estimated Story Points

Description

It would be useful to be able to limit analysis of this data in Turnilo and Superset to pageviews only (it offers several dimensions that are not available in the pageviews-specific Druid datasets).
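To illustrate the kind of analysis this would enable, here is a minimal sketch of a Druid native timeseries query restricted to pageviews. The helper name `build_pageview_query`, the example interval, and the assumption that the flag is stored as the string "true" are illustrative and not taken from the actual dataset configuration.

```python
import json


def build_pageview_query(interval, datasource="webrequest_sampled_128"):
    """Build a Druid native timeseries query limited to pageview rows.

    `interval` is an ISO-8601 interval string such as
    "2019-01-08T00:00:00Z/2019-01-09T00:00:00Z".
    """
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "hour",
        "intervals": [interval],
        # Assumption: the boolean is ingested as the string "true".
        "filter": {
            "type": "selector",
            "dimension": "is_pageview",
            "value": "true",
        },
        # The "count" aggregator counts stored Druid rows,
        # which is not the same as the number of original requests.
        "aggregations": [{"type": "count", "name": "rows"}],
    }


# Hypothetical one-day interval, for illustration only.
query = build_pageview_query("2019-01-08T00:00:00Z/2019-01-09T00:00:00Z")
print(json.dumps(query, indent=2))
```

In Turnilo and Superset the same restriction would simply be a filter on the is_pageview dimension; the query above only shows what that filter looks like at the Druid level.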

Event Timeline

This is not super high priority, but per a brief discussion with @Nuria some weeks ago it should be fairly easy to do, considering that the data appears to come from the refined webrequest table which already contains is_pageview.

Milimetric triaged this task as Medium priority. Jan 3 2019, 6:13 PM
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Change 482277 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update druid-webrequest jobs adding is_pageview

https://gerrit.wikimedia.org/r/482277

JAllemandou added a project: Analytics-Kanban.
JAllemandou set the point value for this task to 1.
JAllemandou moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Change 482277 merged by Nuria:
[analytics/refinery@master] Update druid-webrequest jobs adding is_pageview

https://gerrit.wikimedia.org/r/482277

is_pageview does not appear as a dimension in either Turnilo or Superset. I think we might need a job restart.

Hmm, both jobs were restarted on 1/7:

Job Name : webrequest-druid-hourly-coord
App Path : hdfs://analytics-hadoop/wmf/refinery/2019-01-07T21.16.01+00.00--scap_sync_2019-01-07_0001/oozie/webrequest/druid/hourly/coordinator.xml
Status : RUNNING
Start Time : 2019-01-07 20:00 GMT
End Time : 3000-01-01 00:00 GMT

Job Name : webrequest-druid-daily-coord
App Path : hdfs://analytics-hadoop/wmf/refinery/2019-01-07T21.16.01+00.00--scap_sync_2019-01-07_0001/oozie/webrequest/druid/daily/coordinator.xml
Status : RUNNING
Start Time : 2019-01-07 00:00 GMT
End Time : 3000-01-01 00:00 GMT
Pause Time : -
Concurrency : 2

And hdfs://analytics-hadoop/wmf/refinery/2019-01-07T21.16.01+00.00--scap_sync_2019-01-07_0001/oozie/webrequest/druid/daily/load_webrequests_daily.json.template has the is_pageview changes

Change 485002 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Add is_pageview to webrequest turnilo datasource

https://gerrit.wikimedia.org/r/485002

Change 485002 merged by Elukey:
[operations/puppet@production] turnilo: add is_pageview to webrequest datasource

https://gerrit.wikimedia.org/r/485002

Nice to know that the Druid admin interface displayed all dimensions: webrequest_source hostname time_firstbyte ip http_status response_size http_method uri_host uri_path uri_query content_type referer user_agent x_cache continent country_code isp as_number is_pageview

Turnilo needed a patch (the webrequest_sampled_128 datasource had introspection disabled), and I manually updated the columns in Superset (the automagic column scan failed...)
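For future checks of this kind, the columns Druid itself reports for a datasource can be inspected with a segmentMetadata query, independently of Turnilo or Superset. The sketch below is only an illustration: the helper names, the interval, and the trimmed sample response are assumptions, not output from the actual cluster.

```python
def build_segment_metadata_query(datasource, interval):
    """Build a Druid segmentMetadata query for one datasource and interval."""
    return {
        "queryType": "segmentMetadata",
        "dataSource": datasource,
        "intervals": [interval],
        # Merge the per-segment analyses into a single result.
        "merge": True,
    }


def reported_columns(response):
    """Collect column names from a parsed segmentMetadata response."""
    columns = set()
    for segment_analysis in response:
        columns.update(segment_analysis.get("columns", {}))
    return columns


# Hypothetical interval, for illustration only.
query = build_segment_metadata_query(
    "webrequest_sampled_128",
    "2019-01-08T00:00:00Z/2019-01-09T00:00:00Z",
)

# Hypothetical, heavily trimmed response; a real one lists every column.
sample_response = [
    {
        "columns": {
            "__time": {"type": "LONG"},
            "uri_host": {"type": "STRING"},
            "is_pageview": {"type": "STRING"},
        }
    }
]

has_dimension = "is_pageview" in reported_columns(sample_response)
print(has_dimension)
```

If the dimension shows up here but not in a UI, the problem is more likely on the UI side (disabled introspection or a stale column list) than in the ingestion job.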