Too few page views for June/July 2015
Open, Normal, Public

Description

It seems we missed part of the page views in June/July (data stream error?).
A sharp drop starts around June 16, and an (only partial) recovery happens two weeks later.

First indication:
A 15% MoM drop in page views for all Wikipedias combined is extreme [1].

Second indication:
A peak for some very general subject can always happen (e.g. due to related media coverage).
A sharp drop is much less likely, especially when it affects many articles and lasts for an extended period (more than one or two days) that is still too short to be seasonal.

Both stats.grok.se and vitribyte.com show that same extended drop, e.g. https://www.vitribyte.com/insight/topictrends?p=London_Bridge&w=3m
stats.grok.se also shows a full day of missing data near the start of the drop.
And in the weeks that follow, the old level is never reached again.

[1] http://stats.wikimedia.org/EN/TablesPageViewsMonthlyOriginalCombined.htm
[2] I'll try to attach screenshots

ezachte created this task. Jul 16 2015, 3:10 PM
ezachte updated the task description.
ezachte raised the priority of this task to Needs Triage.
ezachte assigned this task to kevinator.
ezachte added a project: Analytics.
ezachte added subscribers: ezachte, DarTar.
Restricted Application added a subscriber: Aklapper. Jul 16 2015, 3:10 PM

Hi @ezachte,
I don't think this is an error in the data. Here are some of the combined factors that are causing the drop:

  • Seasonality is causing a downward trend
  • The HTTPS rollout affected bots crawling pages
  • The links changing to https (rel=canonical) may have resulted in crawlers indexing fewer pages

@ezachte, I second @kevinator – we just discussed this briefly – I don't think there's any data loss or anything suggesting we should regenerate PV dumps, but let me know if you believe otherwise. Page-level traffic data should be replaced by the end of Q1 with data based on the new PV definition and generated from hadoop.

  • Seasonality

Granted, there is a seasonal component, but the 2015 drop is about twice the largest earlier May->June drop.

MoM May->June for all Wikipedias combined
2009 -6%
2010 -7%
2011 -6%
2012 -3%
2013 -2%
2014 -5%
2015 -15%
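
To put the table above in numbers, here is a minimal sketch that compares the 2015 figure with the earlier years. The percentages are copied from the table; the variable names are only illustrative.

```python
# May->June month-over-month change in page views (percent),
# all Wikipedias combined, copied from the table above.
mom_may_june = {
    2009: -6, 2010: -7, 2011: -6, 2012: -3,
    2013: -2, 2014: -5, 2015: -15,
}

# Every year before 2015 serves as the baseline.
earlier = {year: pct for year, pct in mom_may_june.items() if year < 2015}

largest_earlier_drop = min(earlier.values())              # most negative value
mean_earlier_drop = sum(earlier.values()) / len(earlier)  # average of 2009-2014

ratio_to_largest = mom_may_june[2015] / largest_earlier_drop
ratio_to_mean = mom_may_june[2015] / mean_earlier_drop

print(f"largest earlier drop: {largest_earlier_drop}%")
print(f"2015 vs largest earlier drop: {ratio_to_largest:.1f}x")
print(f"2015 vs mean of 2009-2014: {ratio_to_mean:.1f}x")
```

With these figures the 2015 drop comes out at roughly 2.1 times the largest earlier drop and roughly 3.1 times the 2009-2014 average.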

Also, seasonal effects are usually gradual and spread over many weeks (except around Christmas).

For some topics, like London Bridge, I would expect more requests in summer, not a very steep drop to ~1/5th for two weeks (see chart above). Even with bot pollution in these view stats, bots don't account for 80% of requests for such a popular topic (though they can account for 100% on rarely visited topics).

  • HTTPS rollout

@kevinator Could you give a pointer to what happened in June? I only know of an HTTPS rollout from longer ago. Thx

  • Hadoop

@DarTar these data are already from hadoop, but indeed using an old definition, compatible with Domas' initial choices

To expand on that example:

London Bridge (http://stats.grok.se/en/201507/London_Bridge) receives between 500 and 800 views per day. Last time I checked, about 10 of those were crawler requests. It's only because crawlers do this indiscriminately for every article, no matter how obscure, that it adds up to a significant percentage of overall views. But not so for popular articles.
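
The arithmetic behind that example, as a rough sketch. The 500-800 views and ~10 crawler requests per day are the figures quoted above; the obscure article with 12 views per day is a hypothetical value chosen only for contrast, and `crawler_share` is an illustrative helper name.

```python
def crawler_share(daily_views: int, crawler_requests: int) -> float:
    """Fraction of an article's daily views that come from crawlers."""
    return crawler_requests / daily_views


CRAWLER_REQUESTS_PER_DAY = 10  # figure quoted above for London Bridge

# Popular article: 500 to 800 views per day (figures quoted above).
for views in (500, 800):
    share = crawler_share(views, CRAWLER_REQUESTS_PER_DAY)
    print(f"{views} views/day: {share:.2%} crawler traffic")

# Hypothetical obscure article: 12 views per day, assuming crawlers
# request it just as often as they request a popular one.
obscure_share = crawler_share(12, CRAWLER_REQUESTS_PER_DAY)
print(f"12 views/day: {obscure_share:.0%} crawler traffic")
```

For the popular article the crawler share stays between about 1% and 2%, far from the ~80% that would be needed to explain the drop, while for the hypothetical obscure article it dominates.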

@ezachte see T102431 for more context on the recent HTTPS rollout. I don't think there has been a public-facing report of this data yet.
For some countries we suspect the observed drop in traffic might be explained as miscategorized bot traffic.

these data are already from hadoop, but indeed using an old definition, compatible with Domas' initial choices

yes, I'm referring to T44259

I'm completely ignorant regarding computer stats. However, I'd like to point out something regarding the suggestion of fewer crawler bots. If you look at the stats for the Wikisources here (http://stats.wikimedia.org/wikisource/EN/TablesPageViewsMonthly.htm), for the current month of July you'll see a strange thing. Besides the fact that the Hebrew, Russian, and Czech sites are barely reaching half of their average trend of the past half year, the English site has a whopping estimated 106 million, more than 4x the average of the past half year. How can the drastic increase on the English site be explained alongside the simultaneous drop on the other sites mentioned above?

DarTar triaged this task as Normal priority. Jul 24 2015, 4:34 PM
Tbayer added a subscriber: Tbayer. Jul 27 2015, 9:15 PM

comScore provides another data point about low pageviews in June.
We know comScore has its own issues and doesn't track all our traffic, but it is still remarkable:

comScore uniques also saw an unusually large drop in June (375M, the lowest since August 2010).

@kevinator can you share on this thread a link to the summary of the HTTPS analysis, if it has been published? I can't comment on comScore data but the results I saw suggested that the HTTPS rollout blocked not only a huge amount of scripted requests but also a sizeable proportion of requests from various countries which are potentially due to bots unidentified by ua-parser.

kevinator moved this task from Next Up to Radar on the Analytics-Kanban board. Aug 1 2015, 12:40 AM
Tbayer added a comment. Aug 1 2015, 2:19 AM

Since I happened to have been looking at them for other reasons (T106502), I made a quick and dirty chart of daily total pageviews (new definition, from Hive) for May-July:


(A similar chart can be generated here.)

As Dario mentioned, a more thorough examination of the effect of the HTTPS switchover has been done elsewhere.

DarTar moved this task from Staged to Radar on the Research board. Aug 6 2015, 10:18 PM
Milimetric moved this task from Incoming to Radar on the Analytics-Backlog board.

Some time ago I captured all requests for the article London Bridge on the English Wikipedia in June.

Alas, I did not capture the referer field (and the data are gone), so I can't see the percentage of requests coming via Google per day.
The dip occurred only for external requests (blue).
I did capture regional data: filtering for Asia or North America shows roughly the same dip as for the whole world.

Misreported bots have nothing to do with it.
That said, there is/was definitely an issue with the official stats catching hardly any bots in June (at least for this article).
Compare:
0.08% bots for the new definition
3.14% bots for the legacy test (catching all user agent strings containing bot/crawl/spider/http)
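
A minimal sketch of what such a legacy test could look like. Only the four tokens come from the comment above; the case-insensitive matching, the function names, and the sample user agent strings are assumptions made for illustration, not the actual production check.

```python
import re

# Tokens from the legacy test described above. Matching case-insensitively
# anywhere in the string is an assumption of this sketch.
LEGACY_BOT_PATTERN = re.compile(r"bot|crawl|spider|http", re.IGNORECASE)


def is_legacy_bot(user_agent: str) -> bool:
    """Return True if the user agent contains any of the legacy tokens."""
    return bool(LEGACY_BOT_PATTERN.search(user_agent))


def bot_share(user_agents: list[str]) -> float:
    """Fraction of requests whose user agent is flagged by the legacy test."""
    if not user_agents:
        return 0.0
    flagged = sum(is_legacy_bot(ua) for ua in user_agents)
    return flagged / len(user_agents)


# Hypothetical sample of user agent strings, for illustration only.
sample = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0",
    "ExampleSpider/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 "
    "(KHTML, like Gecko) Version/8.0.7 Safari/600.7.12",
]

print(f"legacy bot share: {bot_share(sample):.0%}")
```

A substring test this broad is cheap but crude: it flags any user agent that embeds a URL, which is common for self-identifying crawlers and rare for browsers.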

In both charts, 30 June is an exceptional peak in traffic (the bridge was opened on June 30, many years ago).
The extra traffic is mostly internal (= from another Wikipedia page), maybe a 'Did you know' page or similar?

@kevinator Should this task be assigned to you when it is in the Radar column?

Milimetric moved this task from Incoming to Radar on the Analytics board. Jan 12 2016, 7:32 PM
Luke081515 removed kevinator as the assignee of this task. Sep 23 2016, 2:29 PM
Luke081515 added a subscriber: kevinator.