Page MenuHomePhabricator

Add page_id to pageview_hourly when present in webrequest x_analytics header
Closed, ResolvedPublic

Description

Search & Discovery needs page_id to use pageview in search ranking algo.

Event Timeline

JAllemandou raised the priority of this task from to Needs Triage.
JAllemandou updated the task description. (Show Details)
JAllemandou added a project: Analytics-Backlog.
JAllemandou added subscribers: JAllemandou, EBernhardson.

Change 255318 had a related patch set uploaded (by EBernhardson):
Add page_id to intermediate pageview

https://gerrit.wikimedia.org/r/255318

Ran a few exploratory queries which eventually led me to:

SELECT * FROM (select pageview_info['project'] as project, pageview_info['page_title'] as page_title, collect_set(x_analytics_map['page_id']) as page_id from wmf.webrequest WHERE page_title <> '-' AND webrequest_source IN ('text', 'mobile') AND year=2015 AND month=11 AND day=11 AND hour=11 AND is_pageview = TRUE AND COALESCE(pageview_info['project'], '') <> '' GROUP BY pageview_info['project'], pageview_info['language_variant'], pageview_info['page_title'], access_method, x_analytics_map['zero'], agent_type, referer_class, geocoded_data['continent'], geocoded_data['country_code'], geocoded_data['country'], geocoded_data['subdivision'], geocoded_data['city'], user_agent_map) x where size(page_id) > 1 limit 100;

This is basically the request that builds pageview_hourly, but stripping page_title == '-' and selecting 100 rows with more than one page_id. A few selected results:

| en.wikipedia       | Annie_Parker                                                                             | ["36287267","33687379"]
| en.wikipedia       | Anti-sexualism                                                                           | ["1289723","41235850"]
| en.wikipedia       | David_copperfield                                                                        | ["63110","16087448"]
| en.wikipedia       | Facebook_messenger                                                                       | ["36974751","36250682"]
| en.wikipedia       | Infrastructure_as_a_service                                                              | ["19541494","12103809"]
| en.wikipedia       | Love_Dose                                                                                | ["43578247","44987655"]

I ran a few hundred through a script to query these page id's out of mysql, and they look to always be redirects. The page title output looks to always be the redirect itself, but might be worthwhile to find out for sure. I couldn't find where this page_id is generated, it doesn't seem to be included by the varnish config in analytics.inc.vcl.erb
in operations/puppet.

A few examples:

Annie_Parker | ["36287267","33687379"]

+------------------+-----------------------+
| page_is_redirect | page_title            |
+------------------+-----------------------+
|                0 | Decoding_Annie_Parker |
|                1 | Annie_Parker          |
+------------------+-----------------------+

Anti-sexualism | ["1289723","41235850"]

+------------------+----------------+
| page_is_redirect | page_title     |
+------------------+----------------+
|                0 | Antisexualism  |
|                1 | Anti-sexualism |
+------------------+----------------+

David_copperfield | ["63110","16087448"]

+------------------+-------------------+
| page_is_redirect | page_title        |
+------------------+-------------------+
|                0 | David_Copperfield |
|                1 | David_copperfield |
+------------------+-------------------+

I think the most appropriate thing to do for the pageview_hourly table will be to store the array of page_id's rather than trying to resolve this. Users of the data can do as they will.

I could see there being an argument to instead put x_analytics_map['page_id'] into the group by clause, any thoughts/opinions?

Thanks a lot Erik for the investigation !
I think I'd rather go for adding page_id as a single field and group by it.
It makes a little bit more data, but it's easier to query at the end for users.
Would that work for you ?

yea that can work, i suppose i just lean towards keeping data small. But it's easier to query and no harm to have it separated out, can be grouped on later.

And an interesting point is that, since we are using columnar format (parquet), data won't grow that much :)