
Total page view numbers on Wikistats do not match new page view definition
Closed, ResolvedPublic

Description

Despite the recent switch of Wikistats to the new page view definition, the monthly total page view numbers for all Wikimedia projects on Wikistats still differ a lot from the total numbers recorded on Hive. (And the updated pageview report card shows discrepancies too, although they are much smaller.)

(table updated 2016-04-20)

month          | Reference (Hive) | WMF report card (February 10 data) | Wikistats ("Wikimedia, All Projects")
September 2015 | 15,899,101,083   | 15,849,097,228                     | 15,578M
October 2015   | 15,712,762,995   | 15,661,827,570                     | 15,344M
November 2015  | 16,094,462,540   | 16,032,155,741                     | 15,707M
December 2015  | 14,998,280,451   | 14,949,113,557                     | 14,647M
January 2016   | 16,627,007,075   | N/A                                | 16,220M
February 2016  | 16,553,391,875   | N/A                                | 16,178M
March 2016     | 15,941,196,508   | N/A                                | 15,539M

(all normalized to 30 days/month)
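For reference, a minimal sketch (not from the original task, and assuming wmf.projectview_hourly exposes a day column alongside the fields used in the query quoted below) of how that 30 days/month normalization could be reproduced on the Hive side:

-- Hedged sketch: monthly totals plus a 30-day-normalized version.
-- Assumes the columns year, month, day, agent_type and view_count.
SELECT
  year,
  month,
  SUM(view_count)                                     AS total_views,
  ROUND(SUM(view_count) * 30.0 / COUNT(DISTINCT day)) AS total_views_30day
FROM wmf.projectview_hourly
WHERE year = 2015
  AND agent_type = 'user'
GROUP BY year, month
ORDER BY year, month
LIMIT 1000;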

The Hive (projectview_hourly) numbers are the standard that has been used for a while in e.g. the WMF Quarterly Reports, the Vital Signs dashboard, the Product page on mediawiki.org and the readership metrics reports.

During the planning for the conversion of Wikistats (T114379: Feed Wikistats traffic reports with aggregated hive data {lama} [21 pts]), there was agreement that the "total" numbers on Wikistats should match this standard (specifically the results of the query below), with the possible exception of ad-hoc corrections in case of irrecoverable data losses:

After discussion with @Tnegrin and @JKatzWMF, I wanted to briefly chime in just to make sure that we will be using consistent definitions (consistent with the new def pageview data we are already publishing e.g. in the quarterly report scorecard, the Vital Signs dashboard and the weekly reading metrics report). I'm like 90% sure that's the plan already, but to spell out the assumption concretely for the monthly "all Wikimedia projects" numbers:
PageViewsPerMonthAllTotalled.csv and https://reportcard.wmflabs.org/graphs/pageviews will (apart from 30day normalization) contain the same numbers as generated by this query:

hive (default)> SELECT year, month, SUM(view_count) AS total_views FROM wmf.projectview_hourly WHERE year=2015 AND agent_type = 'user' GROUP BY year, month ORDER BY year, month LIMIT 1000;

[...]

@Tbayer absolutely, being consistent is important.
The only inherent complication I see is if wmf.projectview_hourly doesn't cater for unrecoverable data mishaps. [...]
So I felt compelled to add some complexity: if we have a bad-data hour (or days) and we can't fix that at the source (e.g. the data weren't collected), Wikistats will come up with a best estimate rather than a number known to be way off, by extrapolating from the good hours in that month. [...]
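For illustration only (this is not the actual Wikistats code, which works on the hourly dump files): one hedged way to express such an extrapolation directly against wmf.projectview_hourly is to scale the sum over the hours that do have data up to the full month:

-- Hedged sketch: estimate a full month from the hours that have data.
-- 2016-03 is a hypothetical example month; 31 is its number of days.
SELECT
  year,
  month,
  SUM(view_count)           AS views_in_good_hours,
  COUNT(DISTINCT day, hour) AS hours_with_data,
  ROUND(SUM(view_count) / COUNT(DISTINCT day, hour) * 24 * 31) AS estimated_full_month
FROM wmf.projectview_hourly
WHERE year = 2016
  AND month = 3
  AND agent_type = 'user'
GROUP BY year, month;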

I don't think there have been any data losses of this sort in recent months, so that explanation seems unlikely. Perhaps the "All Projects" Wikistats page is still using a private, non-standard pageview definition that e.g. reflects a different opinion about which projects should be counted. But that would contradict the documentation on that page (it refers to https://wikitech.wikimedia.org/wiki/Analytics/Pageviews which in turn refers to https://meta.wikimedia.org/wiki/Research:Page_view ). In any case, we need to be consistent and transparent about what we consider as "total" pageviews - publishing different numbers in different places without explanation should be avoided.
(CCing @Wwes who recently pointed out these discrepancies too.)

Event Timeline

Tbayer assigned this task to ezachte.
Tbayer raised the priority of this task to Needs Triage.
Tbayer updated the task description.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Feb 11 2016, 4:07 AM

Not much. I did some consistency checks, but nothing conclusive yet. My approach is to compare Wikistats counts with ad hoc aggregated webstatscollector 3.0 counts. If those match, the problem is out of my hands and the mismatch should be found in the Hive scripts. If they don't match, hopefully it will become apparent what constitutes the difference. BTW, I may be mostly offline for the rest of the week (moving).

Tbayer updated the task description. · Apr 21 2016, 12:20 AM
Restricted Application added a project: Internet-Archive. · Apr 21 2016, 12:20 AM

I have updated the Hive / Wikistats comparison in the task description with two more months' worth of data. Wikistats continues to miss several hundred million pageviews per month.

If the reason can't be found and fixed soon, we should add a warning message to the "All Projects" page and consider it as deprecated for the moment.

I don't see why such a small (alleged) underreporting would be a problem. What I read in your table is that the numbers are directly proportional/monotonic/follow the same trend, so that one could say they do in fact match (I suggest a more precise bug summary).

We could add a warning, to all those numbers, that probably only the first two digits or so are meaningful.

I don't see why such a small (alleged) underreporting would be a problem.

I don't consider 400 million missing views (for the most recent month) to be "small". The Analytics team frequently fixes issues in the pageview definition with much smaller impact, and rightfully so.
(Or maybe there is a misunderstanding here because the report card data is included too. That was just for comparison and as an additional indicator of where the problem might lie on the Wikistats side. This bug is not about the difference between the report card data and the Hive/Hadoop source, which is indeed small. It's about the difference between these two and the Wikistats data.)

What I read in your table is that the numbers are directly proportional/monotonic/follow the same trend, so that one could say they do in fact match

That sort of handwaving (what "trend", exactly?) isn't good data analysis. The numbers don't match, and because we don't know the reason for the error, we also can't be confident it will stay confined, even if you happen to use the data for purposes where half a billion more or less doesn't matter.

(I suggest a more precise bug summary).

What's not precise about it? There is a large discrepancy between the Wikistats numbers and the Hive numbers, even though there was an explicit goal set last year that they should be the same. The exact data sources are clearly stated for both.

We could add a warning, to all those numbers, that probably only the first two digits or so are meaningful.

Wikistats has been using five significant digits in this table for many years.

The discrepancy between [1] and [2] is as follows:
the rightmost column in [2] is calculated from the other columns, whose numbers were already rounded to millions.

The Perl code in WikiReportsOutputTables says:
"#very Q&D: parse javascript macro's, extract counts, build new macro for overall total", and
"# extremely Q&D (saves a few days restructuring)"
There are more Q&Ds (most of them trivial) in Wikistats, but this one is *by far* the dirtiest.

Adding Wikimedia-wide totals was done on express request, but it was really an afterthought and runs counter to the overall structure of Wikistats, which treats each project as totally independent from the others.
The other columns are just recycled HTML from project-specific reporting steps.

Now, as then, it feels like a waste of time to restructure the code, which is quite complicated because it does too many things at once (building the bar chart, several percentages, normalized and non-normalized versions, etc.), so I'd rather look at how I can read the output used in [1] directly to generate the rightmost column with overall totals.

First I'll look further into the discrepancy between [1] and Hive.

I should have said *part of the discrepancy* is as follows [..].

There seems to be more than a rounding error.
So it looks like some special project is included in [1], but not in [2].

I've followed along so far, Erik and Tilman; let me know if I can help.

So my post from two days ago was a false lead. The issue of the rounding error is real, but also really small. More like 1 or 2 M rather than 400 M.

Today I made some real progress (but more to do).

The counts for mobile traffic in page view reports [1] did not include zero traffic.

I added a step to combine the counts for .m and .z[ero].

(If you see an almost empty page, there are issues with stat1001; sometimes GIFs and JS files don't load. Try Ctrl-F5.)

[1] e.g. https://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm
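As a cross-check on the Hive side, the size of the mobile slices can be gauged with a breakdown by access method (a hedged sketch, assuming an access_method field in wmf.projectview_hourly; it does not isolate the Zero traffic that appears as .z[ero] in the dump files, but it shows how much each access method contributes to the monthly total):

-- Hedged sketch: monthly totals per access method.
SELECT
  year,
  month,
  access_method,
  SUM(view_count) AS views
FROM wmf.projectview_hourly
WHERE year = 2016
  AND agent_type = 'user'
GROUP BY year, month, access_method
ORDER BY year, month, access_method
LIMIT 1000;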

Nuria moved this task from Incoming to Radar on the Analytics board. · May 16 2016, 4:53 PM
Nuria added a subscriber: Nuria. · May 23 2016, 8:56 PM

Any more updates in this regard?

After adding Wikipedia Zero earlier, this week I added two more categories that were missing: mobile traffic to projects other than Wikipedia, and 'Other Projects' other than Commons: wikidata, foundation, meta, species, incubator (desktop only, mobile to follow).

Also missing but negligible are mediawiki (6M/mon) and www.wikisource (0.5M/mon, the precursor of language-specific wikisource wikis).

So here are the current results:

Changes are live and documented in the report: https://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProjects.htm

I temporarily disabled URL validation: Wikistats only counts existing wikis (a white list), whereas Hive accepts any URL people type in the address bar. This is probably a very minor discrepancy.
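One hedged way to compare the two sets of wikis (assuming a project field in wmf.projectview_hourly): a per-project breakdown from Hive that can be diffed against the Wikistats white list, and that also quantifies small projects such as mediawiki or www.wikisource:

-- Hedged sketch: per-project monthly totals, for comparison with the
-- white list of wikis that Wikistats counts. 2016-04 is an example month.
SELECT
  project,
  SUM(view_count) AS views
FROM wmf.projectview_hourly
WHERE year = 2016
  AND month = 4
  AND agent_type = 'user'
GROUP BY project
ORDER BY views DESC
LIMIT 1000;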

There is also a systematic shift between the Hive query and the input for Wikistats (the hourly projectviews files), which follows a simple pattern: in order to get the daily totals from Hive, I need to total the files for hours 1-23 plus the file for hour 0 of the next day.
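To make that pattern concrete, a hedged sketch for one hypothetical example day (2016-04-15):

-- A plain Hive daily total for 2016-04-15:
SELECT SUM(view_count) AS hive_daily_total
FROM wmf.projectview_hourly
WHERE year = 2016
  AND month = 4
  AND day = 15
  AND agent_type = 'user';
-- To reproduce this number from the hourly projectviews files, total the files
-- for hours 1-23 of 2016-04-15 plus the hour-0 file of 2016-04-16, i.e. the
-- file labels run one hour ahead of the Hive hour partition.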

To do: include mobile traffic to 'other projects'

Nuria added a comment. · May 24 2016, 3:56 PM

Also missing but negligible are mediawiki (6M/mon) and www.wikisource (0.5M/mon, the precursor of language-specific wikisource wikis).

These two are not counted by the pageview definition. We try to count "knowledge wikis"; the list of wikis for which pageviews are counted is here:
https://github.com/wikimedia/analytics-refinery/blob/master/static_data/pageview/whitelist/whitelist.tsv

OK, from what I see in your corrections, the differences are between 1 and 4%, correct?

Ah, good to see there is a white list.

differences are between 1 and 4%, correct?

Actually about 1/10 of that: the average difference from May 2015 to April 2016 is 0.267%, or 0.3% rounded.
With mobile added for special projects like Commons (to do), this will drop further.

ezachte closed this task as Resolved. · May 28 2016, 7:28 PM

With mobile traffic added for 'other projects' (Commons, etc.), the average difference over 12 months is now 0.118%, or about 1/10 of a percent.