Add page_id to pageview_hourly when present in webrequest x_analytics header
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	JAllemandou
	Oct 20 2015, 3:29 PM

Description

Search & Discovery needs page_id to use pageview in search ranking algo.

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		EBernhardson	T116055 Build out hadoop job to calculate average page views over time for cirrussearch scoring purposes
		Resolved		EBernhardson	T116023 Add page_id to pageview_hourly when present in webrequest x_analytics header

Event Timeline

JAllemandou created this task.Oct 20 2015, 3:29 PM

JAllemandou raised the priority of this task from to Needs Triage.

JAllemandou updated the task description. (Show Details)

JAllemandou added a project: Analytics-Backlog.

JAllemandou added subscribers: JAllemandou, EBernhardson.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 20 2015, 3:29 PM

EBernhardson added a parent task: T116055: Build out hadoop job to calculate average page views over time for cirrussearch scoring purposes.Oct 20 2015, 5:58 PM

JAllemandou edited projects, added Analytics-Kanban; removed Analytics-Backlog.Oct 30 2015, 9:29 AM

JAllemandou set Security to None.

JAllemandou claimed this task.Oct 30 2015, 9:32 AM

JAllemandou removed JAllemandou as the assignee of this task.Nov 5 2015, 6:10 PM

JAllemandou edited projects, added Analytics-Backlog; removed Analytics-Kanban.

Restricted Application added a subscriber: StudiesWorld. · View Herald TranscriptNov 5 2015, 6:10 PM

• Tbayer subscribed.Nov 17 2015, 12:25 AM

Change 255318 had a related patch set uploaded (by EBernhardson):
Add page_id to intermediate pageview

https://gerrit.wikimedia.org/r/255318

gerritbot added a project: Patch-For-Review.Nov 25 2015, 2:40 AM

Ran a few exploratory queries which eventually led me to:

SELECT * FROM (select pageview_info['project'] as project, pageview_info['page_title'] as page_title, collect_set(x_analytics_map['page_id']) as page_id from wmf.webrequest WHERE page_title <> '-' AND webrequest_source IN ('text', 'mobile') AND year=2015 AND month=11 AND day=11 AND hour=11 AND is_pageview = TRUE AND COALESCE(pageview_info['project'], '') <> '' GROUP BY pageview_info['project'], pageview_info['language_variant'], pageview_info['page_title'], access_method, x_analytics_map['zero'], agent_type, referer_class, geocoded_data['continent'], geocoded_data['country_code'], geocoded_data['country'], geocoded_data['subdivision'], geocoded_data['city'], user_agent_map) x where size(page_id) > 1 limit 100;

This is basically the request that builds pageview_hourly, but stripping page_title == '-' and selecting 100 rows with more than one page_id. A few selected results:

| en.wikipedia       | Annie_Parker                                                                             | ["36287267","33687379"]
| en.wikipedia       | Anti-sexualism                                                                           | ["1289723","41235850"]
| en.wikipedia       | David_copperfield                                                                        | ["63110","16087448"]
| en.wikipedia       | Facebook_messenger                                                                       | ["36974751","36250682"]
| en.wikipedia       | Infrastructure_as_a_service                                                              | ["19541494","12103809"]
| en.wikipedia       | Love_Dose                                                                                | ["43578247","44987655"]

I ran a few hundred through a script to query these page id's out of mysql, and they look to always be redirects. The page title output looks to always be the redirect itself, but might be worthwhile to find out for sure. I couldn't find where this page_id is generated, it doesn't seem to be included by the varnish config in analytics.inc.vcl.erb
in operations/puppet.

A few examples:

Annie_Parker | ["36287267","33687379"]

+------------------+-----------------------+
| page_is_redirect | page_title            |
+------------------+-----------------------+
|                0 | Decoding_Annie_Parker |
|                1 | Annie_Parker          |
+------------------+-----------------------+

Anti-sexualism | ["1289723","41235850"]

+------------------+----------------+
| page_is_redirect | page_title     |
+------------------+----------------+
|                0 | Antisexualism  |
|                1 | Anti-sexualism |
+------------------+----------------+

David_copperfield | ["63110","16087448"]

+------------------+-------------------+
| page_is_redirect | page_title        |
+------------------+-------------------+
|                0 | David_Copperfield |
|                1 | David_copperfield |
+------------------+-------------------+

I think the most appropriate thing to do for the pageview_hourly table will be to store the array of page_id's rather than trying to resolve this. Users of the data can do as they will.

I could see there being an argument to instead put x_analytics_map['page_id'] into the group by clause, any thoughts/opinions?

Thanks a lot Erik for the investigation !
I think I'd rather go for adding page_id as a single field and group by it.
It makes a little bit more data, but it's easier to query at the end for users.
Would that work for you ?

yea that can work, i suppose i just lean towards keeping data small. But it's easier to query and no harm to have it separated out, can be grouped on later.

And an interesting point is that, since we are using columnar format (parquet), data won't grow that much :)

JAllemandou assigned this task to EBernhardson.Dec 1 2015, 6:23 PM

JAllemandou edited projects, added Analytics-Kanban; removed Analytics-Backlog.

JAllemandou moved this task from Next Up to Done on the Analytics-Kanban board.

• Nuria closed this task as Resolved.Dec 2 2015, 6:46 PM

Add page_id to pageview_hourly when present in webrequest x_analytics headerClosed, ResolvedPublicActions

Description

Related ObjectsSearch...

Event Timeline

Add page_id to pageview_hourly when present in webrequest x_analytics header
Closed, ResolvedPublic
Actions

Related Objects
Search...