
Wikimedia "top" pageviews API weirdness with the "Paul_Elio" article [5 pts] {slug}
Closed, ResolvedPublic

Description

Looking at https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2015/11/11 and similar URLs, I'm noticing some weirdness with the "Paul_Elio" article on the English Wikipedia. The page itself is just a redirect to the "Elio_Motors" article.

"Paul_Elio" does not appear in the top 1,000 viewed pages for 2015-11-01 through 2015-11-12.

It appears on 2015-11-13 with 182,479 views.

It does not appear on 2015-11-14.

It appears on 2015-11-15 with 4,069,710 views.

It appears on 2015-11-16 with 4,832,338 views.

It does not appear on 2015-11-17.

It's possible that these numbers are accurate, but they seem pretty wild.
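The day-by-day check described above can be scripted. This is a minimal sketch, not an official client: it assumes the top endpoint returns JSON shaped like `{"items": [{"articles": [{"article": ..., "views": ..., "rank": ...}]}]}`. The helper names (`top_url`, `find_in_top`, `fetch_top`), the User-Agent string, and the sample payload are hypothetical; only the URL pattern and the Paul_Elio view count come from this task.

```python
import json
from urllib.request import Request, urlopen

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/top"


def top_url(project, access, year, month, day):
    """Build the URL for one day's top list (same pattern as the URL above)."""
    return f"{BASE}/{project}/{access}/{year:04d}/{month:02d}/{day:02d}"


def find_in_top(payload, title):
    """Return (rank, views) for `title` in a parsed top response, or None."""
    for item in payload.get("items", []):
        for entry in item.get("articles", []):
            if entry.get("article") == title:
                return entry.get("rank"), entry.get("views")
    return None


def fetch_top(project, access, year, month, day):
    """Fetch and parse one day's top list. Performs a network request."""
    req = Request(
        top_url(project, access, year, month, day),
        headers={"User-Agent": "pageview-check-sketch/0.1"},  # hypothetical UA
    )
    with urlopen(req) as resp:
        return json.load(resp)


# Offline illustration: the response shape is assumed, and the rank and the
# Example_Page entry are placeholders. Only 182479 is taken from this task.
sample = {"items": [{"articles": [
    {"article": "Example_Page", "views": 200000, "rank": 1},
    {"article": "Paul_Elio", "views": 182479, "rank": 2},
]}]}

print(find_in_top(sample, "Paul_Elio"))
```

Calling `find_in_top(fetch_top("en.wikipedia", "all-access", 2015, 11, d), "Paul_Elio")` for each day `d` would reproduce the appears/does-not-appear list above, provided the response shape matches the assumption.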

Event Timeline

MZMcBride raised the priority of this task from to Needs Triage.
MZMcBride updated the task description.
MZMcBride added projects: Services, Analytics.
MZMcBride subscribed.

Other recent pageview weirdness tickets, for reference (they may or may not be related, but in any case they illustrate that this might not be the API's fault):
T117945 (https://en.wikipedia.org/wiki/Angelsberg, in the top 200 list there with 74 million views from May to October 2015, is also worth a look)
T117343

Quick data check:

  • Elio Motors page seems OK
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Elio_Motors/daily/2015110100/2015113000

{"items":[{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015110100","access":"all-access","agent":"all-agents","views":169},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015110200","access":"all-access","agent":"all-agents","views":273},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015110300","access":"all-access","agent":"all-agents","views":310},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015110400","access":"all-access","agent":"all-agents","views":384},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015110500","access":"all-access","agent":"all-agents","views":336},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015110600","access":"all-access","agent":"all-agents","views":251},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015110700","access":"all-access","agent":"all-agents","views":233},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015110800","access":"all-access","agent":"all-agents","views":216},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015110900","access":"all-access","agent":"all-agents","views":290},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015111000","access":"all-access","agent":"all-agents","views":264},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015111100","access":"all-access","agent":"all-agents","views":252},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015111200","access":"all-access","agent":"all-agents","views":273},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015111300","access":"all-access","agent":"all-agents","views":306},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015111400","access":"all-access","agent":"all-agents","views":229},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015111500","access":"all-access","agent":"all-agents","views":265},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015111600","access":"all-access","agent":"all-agents","views":269},
{"project":"en.wikipedia","article":"Elio_Motors","granularity":"daily","timestamp":"2015111700","access":"all-access","agent":"all-agents","views":312}]}
  • Paul Elio page has 3 odd days
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Paul_Elio/daily/2015110100/2015113000

{"items":[{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110100","access":"all-access","agent":"all-agents","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110200","access":"all-access","agent":"all-agents","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110300","access":"all-access","agent":"all-agents","views":5},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110400","access":"all-access","agent":"all-agents","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110500","access":"all-access","agent":"all-agents","views":2},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110600","access":"all-access","agent":"all-agents","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110700","access":"all-access","agent":"all-agents","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111000","access":"all-access","agent":"all-agents","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111200","access":"all-access","agent":"all-agents","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111300","access":"all-access","agent":"all-agents","views":182479},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111400","access":"all-access","agent":"all-agents","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111500","access":"all-access","agent":"all-agents","views":4069710},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111600","access":"all-access","agent":"all-agents","views":4832338},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111700","access":"all-access","agent":"all-agents","views":3}]}
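One way to pick out the odd days mechanically is to compare each day against the series median. The sketch below is only an illustration: the function name `odd_days` and the factor of 100 are arbitrary assumptions, and the items are reduced to the `timestamp` and `views` fields of the Paul_Elio response above.

```python
from statistics import median


def odd_days(items, factor=100):
    """Return timestamps whose views exceed `factor` times the series median."""
    views = [it["views"] for it in items]
    if not views:
        return []
    # Floor the baseline at 1 so a very quiet page does not get a zero threshold.
    baseline = max(median(views), 1)
    return [it["timestamp"] for it in items if it["views"] > factor * baseline]


# Daily all-agents views for Paul_Elio, copied from the response above.
paul_elio = [{"timestamp": ts, "views": v} for ts, v in [
    ("2015110100", 1), ("2015110200", 1), ("2015110300", 5),
    ("2015110400", 1), ("2015110500", 2), ("2015110600", 1),
    ("2015110700", 1), ("2015111000", 1), ("2015111200", 1),
    ("2015111300", 182479), ("2015111400", 1), ("2015111500", 4069710),
    ("2015111600", 4832338), ("2015111700", 3),
]]

print(odd_days(paul_elio))  # ['2015111300', '2015111500', '2015111600']
```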

More investigation is under way.

It seems that this page was heavily crawled on those days:

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/spider/Paul_Elio/daily/2015110100/2015113000

{"items":[{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110100","access":"all-access","agent":"spider","views":0},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110200","access":"all-access","agent":"spider","views":0},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110300","access":"all-access","agent":"spider","views":2},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110400","access":"all-access","agent":"spider","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110500","access":"all-access","agent":"spider","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110600","access":"all-access","agent":"spider","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110700","access":"all-access","agent":"spider","views":0},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111000","access":"all-access","agent":"spider","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111200","access":"all-access","agent":"spider","views":0},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111300","access":"all-access","agent":"spider","views":182479},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111400","access":"all-access","agent":"spider","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111500","access":"all-access","agent":"spider","views":4069709},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111600","access":"all-access","agent":"spider","views":4832335},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111700","access":"all-access","agent":"spider","views":0}]}

The user bit of it seems normal though:

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Paul_Elio/daily/2015110100/2015113000

{"items":[{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110100","access":"all-access","agent":"user","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110200","access":"all-access","agent":"user","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110300","access":"all-access","agent":"user","views":3},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110400","access":"all-access","agent":"user","views":0},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110500","access":"all-access","agent":"user","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110600","access":"all-access","agent":"user","views":0},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015110700","access":"all-access","agent":"user","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111000","access":"all-access","agent":"user","views":0},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111200","access":"all-access","agent":"user","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111300","access":"all-access","agent":"user","views":0},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111400","access":"all-access","agent":"user","views":0},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111500","access":"all-access","agent":"user","views":1},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111600","access":"all-access","agent":"user","views":3},
{"project":"en.wikipedia","article":"Paul_Elio","granularity":"daily","timestamp":"2015111700","access":"all-access","agent":"user","views":3}]}
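The comparison between the all-agents and spider series can be sketched as a per-day spider share. This is a rough illustration, not how the analytics pipeline does it: the function names and both thresholds are assumptions, and only the 2015-11-13 to 2015-11-17 values from the responses above are included.

```python
def by_day(items):
    """Map timestamp -> views for a list of per-article items."""
    return {it["timestamp"]: it["views"] for it in items}


def spider_dominated(all_items, spider_items, min_views=1000, min_share=0.99):
    """Return {timestamp: spider share} for busy days that are almost all spider traffic."""
    total = by_day(all_items)
    spiders = by_day(spider_items)
    flagged = {}
    for ts, views in total.items():
        if views < min_views:
            continue  # ignore quiet days, where one hit swings the share
        share = spiders.get(ts, 0) / views
        if share >= min_share:
            flagged[ts] = share
    return flagged


# all-agents and spider views for Paul_Elio, copied from the responses above.
days = ["2015111300", "2015111400", "2015111500", "2015111600", "2015111700"]
all_agents = [{"timestamp": d, "views": v}
              for d, v in zip(days, [182479, 1, 4069710, 4832338, 3])]
spider = [{"timestamp": d, "views": v}
          for d, v in zip(days, [182479, 1, 4069709, 4832335, 0])]

for ts, share in spider_dominated(all_agents, spider).items():
    print(ts, f"{share:.6f}")
```

On this data the three spike days are flagged, and the user series accounts for the remaining 0, 1, and 3 views respectively.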

And according to Hive, all those calls were made from a single Ruby program.

I have sent an email to the analytics team to discuss whether we should remove spiders from the top data.

I'm leaning towards removing spiders from top data. It doesn't seem of import to people looking for the "popular" pages.

It's of absolutely no import; in fact, it makes things less usable. I'd already filed a task on this: T117343

I'm leaning towards removing spiders from top data. It doesn't seem of import to people looking for the "popular" pages.

Known spiders yes, I agree, they are hardly relevant on "top" data.
Now, keep in mind that all our pageview data is affected by robot traffic that is not marked as such. We only tag as robots the ones that identify themselves as such. That being said, let's go ahead and remove them.

I would like to let the API bake a bit before we start fixing bugs there. As it gets more use, issues like this one will surface, and I would rather wait a couple of weeks before we do triage.

Should we read that as "we're working on this but generally we'll wait a while" or "we'll wait a while to fix this one until other bugs have the opportunity to surface"?

JAllemandou subscribed.

I'll start filling new data with user-only top tonight (meaning data from 2015-11-18 onward will be user-only).
We'll then backfill older data later.
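To illustrate what a user-only top list means, here is a minimal sketch that ranks pages after dropping non-user rows. It is not the actual job: the row layout `(article, agent, views)` and the function `top_pages` are assumptions, and `Example_Page` with its counts is made up. The Paul_Elio counts are the 2015-11-16 values quoted earlier in this task.

```python
from collections import Counter


def top_pages(rows, n=1000, agents=("user",)):
    """Rank articles by summed views, counting only the given agent types."""
    counts = Counter()
    for article, agent, views in rows:
        if agent in agents:
            counts[article] += views
    return counts.most_common(n)


rows = [
    ("Paul_Elio", "spider", 4832335),  # from this task (2015-11-16)
    ("Paul_Elio", "user", 3),          # from this task (2015-11-16)
    ("Example_Page", "user", 500),     # hypothetical
    ("Example_Page", "spider", 20),    # hypothetical
]

print(top_pages(rows, n=2))
print(top_pages(rows, n=1, agents=("user", "spider")))
```

With spiders included, Paul_Elio leads with 4,832,338 views; with users only, it drops to 3 views and falls behind the hypothetical page.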

Job restarted without spiders (or at least without what we identify as spiders).
Restart day: 2015-11-17 (testing day).
Backfilling is tracked using https://phabricator.wikimedia.org/T118991.

Nemo_bis subscribed.

Please remember to add a specific blue project to all tasks related to pageviews data.

JAllemandou renamed this task from Wikimedia "top" pageviews API weirdness with the "Paul_Elio" article to Wikimedia "top" pageviews API weirdness with the "Paul_Elio" article [5 pts] {slug}. Nov 19 2015, 3:00 PM

@Nemo_bis: you need us to add Datasets-Webstatscollector on all pageview data tasks? I'm not familiar...