
Investigate sample cube pageview_count vs unsampled log pageview count [13 pts] {hawk}
Closed, Resolved · Public

Authored By: JAllemandou, Aug 13 2015, 9:34 AM
Referenced Files
F2645160: diff_percent_hive_cube_no_outliers.png
Sep 29 2015, 8:36 PM
F2645179: boxplot_by_hour.png
Sep 29 2015, 8:36 PM
F2624415: diff_percent_hive_cube_spider.png
Sep 21 2015, 7:55 PM
F2624243: diff_percent_hive_cube.png
Sep 21 2015, 7:55 PM
F2624241: diff_hive_cube.png
Sep 21 2015, 7:55 PM
F2624413: diff_hive_cube_no_spider.png
Sep 21 2015, 7:55 PM

Description

As per Tilman Bayer:

Hi Kevin and Dan,

thanks again for taking the time last week to explain how to retrieve pageviews from Hive! It worked out well and I was able to publish the quarterly report on schedule on July 30. (And BTW I also got my access to Hue sorted out this week with the help of Andrew O.)

As envisaged, I did some spot checks to see if the new-definition sampled PV data I had used for the Q3 report (cube 0.5, back then via Pentaho) matches the new-definition unsampled PV data on Hive. Picking two days during the overlap in April 2015:

Example #1:

mysql:research@analytics-store.eqiad.wmnet [staging]> select sum(pageviews) from pageviews05 where timestamp between '2015-04-15 00:00:00' and '2015-04-16 00:00:00' and is_spider = 0;
+----------------+
| sum(pageviews) |
+----------------+
|      634431000 |
+----------------+

BUT:

hive (wmf)> select sum(view_count) from projectview_hourly where agent_type <> 'spider' and year=2015 and month=04 and day=15;
622049519

Example #2:

mysql:research@analytics-store.eqiad.wmnet [staging]> select sum(pageviews) from pageviews05 where timestamp between '2015-04-16 00:00:00' and '2015-04-17 00:00:00' and is_spider = 0;
+----------------+
| sum(pageviews) |
+----------------+
|      626001000 |
+----------------+

BUT

hive (wmf)> select sum(view_count) from projectview_hourly where agent_type <> 'spider' and year=2015 and month=04 and day=16;
616117283

(I also added these examples to our Etherpad at https://etherpad.wikimedia.org/p/analytics-notes , but feel free to move it to a more appropriate place - not sure if it merits a Phabricator task.)

I didn't have time to look further into these discrepancies before the publication of the report, which is why the Q3 and y-o-y trend data in the quarterly scorecard is marked as "N/A" in the pageviews section.

But it would still be good to find out the reason. IIRC you said that these two databases are based on the same (iteration of the new) pageview definition. Are we certain that both use the same timezone? And what was the sampling method - was it realized via counters, or did it rely on random number generators?

Wikidata does not account for the difference (200k view_count vs. 10M missing), so it would be good to investigate differences in spider tagging or across projects.

Event Timeline

JAllemandou raised the priority of this task from to Needs Triage.
JAllemandou updated the task description. (Show Details)
JAllemandou added a project: Analytics-Backlog.
JAllemandou subscribed.
Milimetric moved this task from Incoming to Radar on the Analytics-Backlog board.
Milimetric set Security to None.

As suggested in the email thread, here are some comparisons to help narrow down the source of the discrepancies.

First, the overlap turns out to be April 1 to April 19 (if one only considers full days):

Sampled (old):

mysql:research@analytics-store.eqiad.wmnet [staging]> SELECT MAX(timestamp) FROM pageviews05;
+---------------------+
| MAX(timestamp)      |
+---------------------+
| 2015-04-20 06:00:00 |
+---------------------+

vs. unsampled (new):

hive (wmf)> SELECT MIN(concat(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")," ",LPAD(hour,2,"0"))) FROM projectview_hourly WHERE year=2015;
...
2015-04-01 00

Daily differences

date             | sampled (pageviews05) | unsampled (projectview_hourly) | sampled/unsampled
4/1/2015         | 707994000             | 703754393                      | 100.60%
4/2/2015         | 695838000             | 691980785                      | 100.56%
4/3/2015         | 710488000             | 685283454                      | 103.68%
4/4/2015         | 675997000             | 675118881                      | 100.13%
4/5/2015         | 701604000             | 699149260                      | 100.35%
4/6/2015         | 762081000             | 759857386                      | 100.29%
4/7/2015         | 766895000             | 764488513                      | 100.31%
4/8/2015         | 760840000             | 758943357                      | 100.25%
4/9/2015         | 739881000             | 737156321                      | 100.37%
4/10/2015        | 686692000             | 685136394                      | 100.23%
4/11/2015        | 657947000             | 656584400                      | 100.21%
4/12/2015        | 710316000             | 708321238                      | 100.28%
4/13/2015        | 751991000             | 748829169                      | 100.42%
4/14/2015        | 751630000             | 747917929                      | 100.50%
4/15/2015        | 734909000             | 732286176                      | 100.36%
4/16/2015        | 735386000             | 733084619                      | 100.31%
4/17/2015        | 693636000             | 691541783                      | 100.30%
4/18/2015        | 648870000             | 649612632                      | 99.89%
4/19/2015        | 712319000             | 714113734                      | 99.75%
Sum (April 1-19) | 13605314000           | 13543160424                    | 100.46%

Kind of weird how the sampled views are mostly overcounting and only undercounting on the last two days. One could do an hourly plot for, say, the last four days to see if there was a sudden drop of some kind. Then again, finding the reason for that drop would likely not resolve the differences entirely.

Data obtained via:

tbayer@stat1003:~$ mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -hanalytics-store.eqiad.wmnet --database staging --batch --execute "SELECT LEFT(timestamp, 10) AS date, SUM(pageviews) FROM pageviews05 WHERE timestamp>='2015-04' GROUP BY date;"

(sampled) and (unsampled):

hive (wmf)> SELECT CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) as date, SUM(view_count), year, month, day FROM wmf.projectview_hourly WHERE (year=2015 AND month=4 AND day<20) GROUP BY year, month, day;
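For reference, a minimal R sketch of how the two daily outputs can be joined to produce the ratio column in the table above (the file and column names are assumptions for illustration, not something produced by the queries as written):

# hypothetical exports of the two queries above, one row per day
sampled   <- read.delim("cube_daily.tsv")    # assumed columns: date, views
unsampled <- read.delim("hive_daily.tsv")    # assumed columns: date, views
daily <- merge(sampled, unsampled, by = "date", suffixes = c("_cube", "_hive"))
daily$ratio_pct <- round(100 * daily$views_cube / daily$views_hive, 2)
daily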

Desktop/mobile web/apps
Sampled:

mysql:research@analytics-store.eqiad.wmnet [staging]> SELECT access_method, SUM(pageviews) AS views FROM pageviews05 WHERE timestamp BETWEEN '2015-04-01' AND '2015-04-20' GROUP BY access_method;
...
+---------------+------------+
| access_method | views      |
+---------------+------------+
| desktop       | 9000440000 |
| mobile app    |  110292000 |
| mobile web    | 4494582000 |
+---------------+------------+

Unsampled:

hive (wmf)> SELECT access_method, SUM(view_count) AS views FROM wmf.projectview_hourly WHERE year=2015 AND month=4 AND day<20 GROUP BY access_method;
access_method | views
desktop       | 8916696055
mobile app    | 138151343
mobile web    | 4488313026

So in this timespan the sampled and unsampled views differ for all three categories - but much more so for the apps than for desktop and mobile web. Again, it would be nice to know more about the sampling method before investigating further.

Wikidata

Sampled:

mysql:research@analytics-store.eqiad.wmnet [staging]> SELECT SUM(pageviews) FROM pageviews05 WHERE timestamp BETWEEN '2015-04-01' AND '2015-04-20' and project = 'wikidata';
+----------------+
| SUM(pageviews) |
+----------------+
|       34844000 |
+----------------+

Unsampled:

hive (wmf)> SELECT SUM(view_count) FROM wmf.projectview_hourly WHERE year=2015 AND month=4 And day<20 AND project = 'wikidata';

Result: 34762904

This was just for double-checking - as Joseph said above, we already ruled out the possibility that Wikidata views are missing entirely (as they apparently were in an earlier version of the cube data).

Interesting analysis!
It would also be nice to know if there are differences among projects (projects listed in only one of the datasets, for instance, or big discrepancies between the two datasets for the same project).

Yes, there are - I already did a cursory check on that last week, and some smaller projects with only a few hundred or thousand pageviews were missing. That's not terribly surprising, though, with a 1:1000 sample. I'm now going to check which are the biggest missing ones.

In the meantime, Dario mentioned https://trello.com/c/Xvp93j8p/446-high-data-qa-of-legacy-vs-new-pv-via-udf to me, where @Ironholds had examined sampled vs. unsampled discrepancies in January (the cyan/purple lines in the graph). Oliver, was that the up to 6% difference example you had mentioned last week?

On the other hand, I noticed that Oliver's recent report for T108895 (figure 1) compared sampled and unsampled total pageviews from May to August 2015, and judged the differences not substantial for the purposes of that report. From eyeballing the graph there, it seems that the differences reached up to about half a percent, too, but perhaps less often than here in April.
In any case, I understand that the sampled and unsampled numbers there come from two other databases, namely pageview_hourly (unsampled, since May) and another one that's not on Hive (for the sampled data covering a longer timespan, I guess that comes directly from Kafka). pageview_hourly seems to agree exactly with projectview_hourly which was used here (I spot-checked for some days). So it might be interesting to know if there are differences between the two sampled ones (pageviews05 and the one that's not on Hive).

OK, it seems we have the first two culprits: mediawiki.org and wikimediafoundation.org didn't make it into the sample very often:

project             | sampled (pageviews05) | unsampled (projectview_hourly)
mediawiki           | 1000                  | 7459727
wikimediafoundation | 0                     | 5771280

Data obtained via:

hive (wmf)> SELECT project, SUM(view_count) AS views FROM wmf.projectview_hourly WHERE year=2015 AND month=4 And day<20 AND (project = 'wikimediafoundation' OR project = 'mediawiki') GROUP BY project ORDER BY views DESC LIMIT 1000;
tbayer@stat1003:~$ mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -hanalytics-store.eqiad.wmnet --database staging --batch --execute "SELECT project, SUM(pageviews) AS views FROM pageviews05 WHERE timestamp BETWEEN '2015-04-01' AND '2015-04-20' AND (project = 'wikimediafoundation' OR project = 'mediawiki') GROUP BY project ORDER BY views DESC"

And outreach.wikimedia is another project that is missing in the sampled views (pageviews05) but present in the unsampled projectview_hourly.
There was a discussion on the Analytics mailing list earlier this week about the fact that this domain was added to the pageview definition, which appears to have happened on April 17, 2015. I first thought that this corresponds to the observation above (T108925#1542170 ) that the sampled views were overcounting from April 1-17 and only undercounting on April 18-19. But actually outreach.wikimedia is present earlier too; I guess because the data was regenerated retroactively after the change.

date      | unsampled views (projectview_hourly) for outreach
4/1/2015  | 12672
4/2/2015  | 11152
4/3/2015  | 12985
4/4/2015  | 8837
4/5/2015  | 7464
4/6/2015  | 12942
4/7/2015  | 13947
4/8/2015  | 12884
4/9/2015  | 14027
4/10/2015 | 17783
4/11/2015 | 11245
4/12/2015 | 12465
4/13/2015 | 9400
4/14/2015 | 10767
4/15/2015 | 16051
4/16/2015 | 15265
4/17/2015 | 11982
4/18/2015 | 11056
4/19/2015 | 9762

Data obtained via

hive (wmf)> SELECT CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) as date, SUM(view_count), year, month, day FROM wmf.projectview_hourly WHERE (year=2015 AND month=4 AND day<20)  AND project='outreach.wikimedia' GROUP BY year, month, day;

and

tbayer@stat1003:~$ mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -hanalytics-store.eqiad.wmnet --database staging --batch --execute "SELECT SUM(pageviews) FROM pageviews05 WHERE project LIKE 'outreach%'"

SUM(pageviews)
NULL

Unfortunately, the discovery of these three differing domains does not resolve the sampled/unsampled discrepancy - on the contrary, it is now even a bit larger than assumed (recall that the sampled number was already higher in that timespan). It's quite a mess ...

> OK, it seems we have the first two culprits: mediawiki.org and wikimediafoundation.org didn't make it into the sample very often:
>
> project             | sampled (pageviews05) | unsampled (projectview_hourly)
> mediawiki           | 1000                  | 7459727
> wikimediafoundation | 0                     | 5771280
>
> Data obtained via:
>
> hive (wmf)> SELECT project, SUM(view_count) AS views FROM wmf.projectview_hourly WHERE year=2015 AND month=4 And day<20 AND (project = 'wikimediafoundation' OR project = 'mediawiki') GROUP BY project ORDER BY views DESC LIMIT 1000;
> tbayer@stat1003:~$ mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -hanalytics-store.eqiad.wmnet --database staging --batch --execute "SELECT project, SUM(pageviews) AS views FROM pageviews05 WHERE timestamp BETWEEN '2015-04-01' AND '2015-04-20' AND (project = 'wikimediafoundation' OR project = 'mediawiki') GROUP BY project ORDER BY views DESC"

PS: On the other hand, in the previous iteration (0.4) of the Cube data, wikimediafoundation.org was present, albeit with implausibly low numbers: T51266#1045802 . FWIW.

Also, it seems the difference on mediawiki.org and wikimediafoundation.org doesn't account for the total of missing data: it sums up to ~12M for the 2015-04-01 -> 2015-04-20 period, while the original discrepancy is ~10M a day.
I still wonder :(

As I said, the sign is wrong too, i.e. including the pageviews from these two domains during this timespan would increase rather than decrease the discrepancies anyway.

Have you considered looking at the codebase used to retrieve this, identifying discrepancies, and, in the absence of discrepancies, just rerunning a sample and comparing against /that/?

@Ironholds that seems like a good idea. From a product perspective, I am okay with <5% random discrepancy, but if it's non-random that presents more of a problem. @JAllemandou can you help us move forward so that we can send out weekly and monthly trend reports that include approximately accurate historical data?

@Tbayer, what projects are missing? mediawiki.org and wikimediafoundation.org don't fit into Reading's objectives enough for their absence to be a blocker, I think.

> @Ironholds that seems like a good idea. From a product perspective, I am okay with <5% random discrepancy, but if it's non-random that presents more of a problem. @JAllemandou can you help us move forward so that we can send out weekly and monthly trend reports that include approximately accurate historical data?

Yes, if the discrepancies could be explained by the randomness of the sampling method, that wouldn't be an issue. However, since asking about that method above, I've been told that sampling is implemented by a counter on the Varnish servers (correct?). And in that case, the difference between sampled and unsampled numbers for total pageviews during a certain period (like for the "Daily differences" table above) should never exceed the number of servers times the sampling factor, and would be much smaller in most cases. Here, this theoretical limit would be less than a million (1000 servers x counting every 1000th view). But the discrepancies in the "Daily differences" table are considerably larger than that.

If that's not true and the sampling is done via a random number generator instead (i.e. every pageview has the same a priori probability of being counted in the sample), the expected sampling inaccuracy can be estimated as a binomial proportion confidence interval - I could calculate that for the example discrepancies above, but they already seem way too large in that case too.
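For illustration, a back-of-the-envelope version of that binomial estimate in R (the ~7e8 daily total is taken from the table above; the rest is a plain textbook calculation, not something from this thread):

# Expected error of a daily total under independent 1:1000 random sampling
N <- 7e8                                   # approximate daily total pageviews (table above)
p <- 1/1000                                # sampling rate
rel_sd <- sqrt(N * p * (1 - p)) / (N * p)  # relative standard deviation of the scaled estimate
c(rel_sd = rel_sd, ci95_pct = 1.96 * rel_sd * 100)
# roughly 0.0012 and 0.23%: under purely random sampling a daily total should rarely be
# off by more than ~0.2-0.3%, yet most days above differ by 0.2-0.6% and April 3 by 3.7%.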

> @Tbayer, what projects are missing? mediawiki.org and wikimediafoundation.org don't fit into Reading's objectives enough for their absence to be a blocker, I think.

Certainly, but the problem is rather that they are present in one dataset (projectview_hourly, the new unsampled one) but not the other (pageviews05, which we would like to use for historical comparison). I'm fine working with or without these smaller non-content wikis, but it should be consistent.

Well if the goal is perfect consistency you're never getting it; the pageview definition changes over time and some of those changes control what projects are included.

@JKatzWMF I can try to help :)
I am currently computing the daily pageview / webrequest rate for August, to see if there could be big daily differences (I think there could be, actually).
Let's discuss the path to follow and how I can help.

@JAllemandou, @Tbayer is running this analysis. Can the two of you figure out what is necessary to resolve any meaningful, directional/biased discrepancies? In the short term the previous year is required, but I believe the longer-term goal is that at any given time we should be able to generate (or append to an existing) table of pageviews by project, by country and by platform since at least 2010.

Regarding the addition or subtraction of a property, I suggest that @Tbayer remove the projects that are not in the historical analysis from your current analysis (provided they are small projects).

As mentioned above, adding or removing these small projects does not resolve the discrepancies.

We still have not understood the reason for these differences at all, and while we may find out that perfect consistency can't be achieved, I don't think that we should use the pageviews05 table for historical comparisons without a better understanding, one that goes beyond a vague conjecture that maybe the pageview definition changed.

@JAllemandou: thanks, let us know what you find out!
Alternatively, some days ago @Milimetric kindly shared sampled UDP pageviews on Hive in the milimetric.webrequest_sampled table, which goes back to June 30, 2014. If that's compatible, we could investigate using it instead, for y-o-y comparisons if not longer term ones.

@Tbayer: nothing great from the pageview / webrequest ratio. It is pretty stable, with bumps every now and then, but nothing big enough to explain a difference in sampling, I think.
@Tbayer: Please ping me today when you're online, to have a chat about the issue. Thanks!

@Tbayer: that table was just loaded to answer some questions from Legal. I won't be keeping it updated and will probably delete it after you're done with it. But I think we should indeed find whatever data we have that reaches the furthest back in history, and load it into a table that will be updated going forward. That seems like a good approach for the lower-quality but still valuable historical data.

> I've been told that sampling is implemented by a counter on the Varnish servers (correct?).

Nope! Not correct.

> If that's not true and the sampling is done via a random number generator instead (i.e. every pageview has the same a priori probability of being counted in the sample),

Currently, Hive is sampling across all webrequests via:

TABLESAMPLE(BUCKET 1 OUT OF 1000 ON rand())

This means that every webrequest in a given hour has the same likelihood of being sampled, since this works across all webrequests. Sampling is not done on Varnish hosts.

(As I just said on IRC too:)

If sampling is done hourly, that also means that the sampling errors will be larger. Above I had estimated that this effect can't explain the discrepancies, but I was looking at the daily numbers as one sample, which now turns out not to be true. Will check again for hourly.
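A quick simulation sketch one could use for that hourly check (the per-hour volume is a rough assumption of ~7e8/24, and the code is purely illustrative):

# Simulate 1:1000 random sampling applied per hour, then aggregated to a day
set.seed(42)
hours_per_day <- 24
N_hour <- 2.9e7                                  # assumed views per hour (~7e8 / 24)
p <- 1/1000
daily_estimate <- function() sum(rbinom(hours_per_day, N_hour, p)) * 1000
est <- replicate(10000, daily_estimate())
c(hourly_rel_sd = sd(rbinom(10000, N_hour, p) * 1000) / N_hour,
  daily_rel_sd  = sd(est) / (hours_per_day * N_hour))
# Under these assumptions a single hourly estimate wobbles by roughly 0.6%, but the
# aggregated daily total still only wobbles by roughly 0.1%.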

@JKatzWMF, @Tbayer : Please let me know if this explains enough :)

Some data on differences between hive and pageviews05 (I call it cube) data for April (1st to 20th, because pageviews05 doesn't contain data after the 20th).

First, plots of the hourly datasets, counts and percent differences:

diff_hive_cube.png (612×812 px, 86 KB)

diff_percent_hive_cube.png (612×812 px, 38 KB)

They show almost no difference between hive and cube, except for three hours (2015-04-03, hours 17/18/19) where the cube data diverges strongly.

From a project perspective, the differences concur with what @Tbayer has found:

  • the mediawiki project is much more heavily represented in hive than in cube
  • some mis-transformations in cube, like m.wikisource or m.wikidata not being transformed into [wikisource + access_method = mobile]

Now for something more interesting: spider tagging.

diff_hive_cube_no_spider.png (612×812 px, 92 KB)

Here we can see a difference between hive and cube when we only consider non-spiders.
The chart below shows the ratio (hive/cube) considering all agent types, and considering non-spider agent types only.

diff_percent_hive_cube_spider.png (612×812 px, 68 KB)

We can see a clear difference between hive and cube: hive removes fewer spiders.

An update to the code should reduce this difference starting September 17th (see here).

@Tbayer, @JKatzWMF: I am moving this task to done since you have not commented negatively :)

Milimetric renamed this task from Investigate sample cube pageview_count vs unsampled log pageview count to Investigate sample cube pageview_count vs unsampled log pageview count [5 pts].Sep 24 2015, 3:39 PM
Milimetric added a subscriber: ggellerman.

> @JKatzWMF, @Tbayer : Please let me know if this explains enough :)

Thanks Joseph! I spent some time looking into this, and to be honest, I am still a bit confused about what exactly you found and changed, so I appreciate your offer to explain it a bit further. Remarks inline.

> Some data on differences between hive and pageviews05 (I call it cube) data for April (1st to 20th, because pageviews05 doesn't contain data after the 20th).
>
> First, plots of the hourly datasets, counts and percent differences:
>
> diff_hive_cube.png (612×812 px, 86 KB)
>
> diff_percent_hive_cube.png (612×812 px, 38 KB)
>
> They show almost no difference between hive and cube, except for three hours (2015-04-03, hours 17/18/19) where the cube data diverges strongly.

But that's the same information as in the "daily differences" table posted above on Aug 15, just presented in graphical form - correct?

Even disregarding the 3.7% outlier on April 3, for which we still have no explanation (or do we?), I would not characterize a 0.5% difference on other days as "almost nothing". Sure, if we could establish that the difference is at most 0.5% and understand its reasons fairly well, that would be tolerable for some of the analyses we want to do. But that doesn't seem to be the case now - rather, these observed differences from a small overlap window should be seen as indicators of a potentially larger problem, especially since they don't seem to follow a consistent pattern (e.g. Cube suddenly turning from over- to undercounting after April 17).

> From a project perspective, the differences concur with what @Tbayer has found:
>
>   • the mediawiki project is much more heavily represented in hive than in cube

As discussed, including mediawiki.org in cube would have increased rather than decreased the discrepancy.

>   • some mis-transformations in cube, like m.wikisource or m.wikidata not being transformed into [wikisource + access_method = mobile]

That's interesting! But how does it translate into the observed differences in the total numbers, where all access methods are counted? It seems that a wrong access method classification for m.* domains would not have affected that (but would cause additional problems elsewhere).

> Now for something more interesting: spider tagging.
>
> diff_hive_cube_no_spider.png (612×812 px, 92 KB)
>
> Here we can see a difference between hive and cube when we only consider non-spiders.
> The chart below shows the ratio (hive/cube) considering all agent types, and considering non-spider agent types only.
>
> diff_percent_hive_cube_spider.png (612×812 px, 68 KB)
>
> We can see a clear difference between hive and cube: hive removes fewer spiders.

Again, great find, but: It seems that this has unearthed an additional problem rather than resolved the previously discussed problem, since the "daily differences" posted above concerned total pageviews regardless of spider/human classification.

> An update to the code should reduce this difference starting September 17th (see here).

That's great! Does this retroactively affect Hive data from before September 17, or not?

In any case, I assume it does not change the total numbers (i.e. disregarding spider classification) on Hive, whose discrepancies with the Cube data we discussed above.

PS: sorry about the late answer - getting reliable historical comparison data is very important for us in Reading, but Jon and I were prioritizing some more timely issues last week.

PPS: For the record, I noticed that these total numbers on Hive have been changing very slightly retroactively in any case, e.g. decreasing by less than 0.01% for each of the first four days of April (see below). While this increases the observed discrepancies a tiny bit, it doesn't seem relevant here. I assume it reflects the other modifications recorded in the change log, e.g. removing arbcom wikis.

date     | Cube (pageviews05) | Hive (projectview_hourly), accessed Aug 15 | Hive, accessed Sept 29
4/1/2015 | 707994000          | 703754393                                  | 703740811
4/2/2015 | 695838000          | 691980785                                  | 691968820
4/3/2015 | 710488000          | 685283454                                  | 685269980
4/4/2015 | 675997000          | 675118881                                  | 675109384

(queries: see above)

Hey Tilman :)
Thanks for your comments.

First, you are right, we have backfilled data removing arbcom wikis and outreach.wikimedia; this explains the change between now and last August.

About the discrepancy, I think the relevant chart is:

diff_percent_hive_cube.png (612×812 px, 38 KB)

It shows the hourly ratio between cube and hive. We see that the discrepancy moves randomly around 1.
Here is the same chart with some outliers removed (2015-04-03), and then the same data plotted by hour of day.

diff_percent_hive_cube_no_outliers.png (612×812 px, 66 KB)

boxplot_by_hour.png (612×812 px, 39 KB)

While these charts show some good randomness, it is true that the median is more frequently between 0.99 and 1 than between 1 and 1.01.

Just to be sure, I ran a Kolmogorov-Smirnov test on the hive and cube distributions, and obviously there is no statistical difference between the two distributions (KS statistic = 0.027713625866050806, p-value = 0.99585181741233186).
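(For reference, the test was presumably applied along these lines in R; the object names and input file are illustrative assumptions, not the actual code used:)

# Two-sample Kolmogorov-Smirnov test on the two hourly distributions
hourly <- read.delim("hourly_counts.tsv")   # assumed columns: cube, hive (one row per hour)
ks.test(hourly$cube, hourly$hive)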

If you think there is more to it, please let me know and let's discuss!

Joseph

Thanks Joseph - if you could also answer the other questions (in the comment before the PPS about backfilling data), that would be great.

I will spend some more time trying to understand this application of the Kolmogorov-Smirnov test (feel free to share the underlying spreadsheet or code), but right now I am a bit puzzled about how it relates to the question at hand. We already said above that we expect some random noise in the Cube data from the sampling method. But its size will obviously depend on the sampling rate (and also on the aggregation level - hourly vs. daily - as discussed above on Sept 18). If we were to sample 1:100000, say, we'd expect the Cube data to differ much more from the unsampled data on Hive than with 1:1000, and a 1:1 rate would mean that the Cube and Hive data have to be identical (and that any discrepancies mean that there have been errors).

However, unless I'm missing something about how you applied this test, your statement that "obviously there is no statistical difference" does not depend on the sampling rate. I.e. it would apply even to the situation where the pageviews recorded in Cube would need to correspond 1:1 to the pageviews recorded in Hive. This is of course absurd. Let me know if I'm misunderstanding something.
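To illustrate the concern with a toy R example (entirely made-up numbers, not the real data): a two-sample KS test on hourly totals is dominated by the large hour-to-hour variation, so it stays non-significant even when one series is systematically half a percent higher everywhere.

# Toy illustration: hourly totals that vary a lot from hour to hour,
# plus a copy of the same series inflated by a systematic 0.5%
set.seed(1)
hive_like <- runif(456, 2e7, 4e7)       # 19 days x 24 hours of made-up hourly counts
cube_like <- hive_like * 1.005          # every single hour is 0.5% higher
ks.test(cube_like, hive_like)
# The KS statistic stays far below the rejection threshold, so the test reports no
# significant difference even though the two series disagree systematically everywhere.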

Hey Tilman,
About bots, no backfill was done. It could have been done from July 17th onward (2 months of refined webrequest data are kept), but we try to avoid backfilling when there is no critical issue in the data, as there was with the arbcom wikis.

About the statistical test, I just wanted to express that the difference we find between the cube and hive data is not statistically significant. It certainly doesn't explain the difference that exists between the two.
I am sorry for having aimed at the wrong target...

I have reviewed the pageview definition history again, and nothing strikes me.

What I will do now is recompute is_pageview for the sampled data for April. I expect the numbers to be closer to the hive ones (within the interval explainable by sampling).
That would then mean that there was a difference between the way pageviews were computed by @Ironholds using R and the way it has been implemented in Java.

Tilman: I confirm the discrepancy we observe comes from a difference in the computation of the pageview boolean, and not from the sampling (see below).
Understanding the difference more precisely would require @Ironholds to provide his code for the April computation.

day        | Hive-unsampled | Cube      | Hive-sampled | Cube / H-unsampled | H-sampled / H-unsampled
2015-04-01 | 703740811      | 707994000 | 703160000    | 1.0060436867       | 0.9991746805
2015-04-02 | 691968820      | 695838000 | 691090000    | 1.0055915525       | 0.9987299717
2015-04-03 | 685269980      | 710488000 | 685080000    | 1.0368001237       | 0.9997227662
2015-04-04 | 675109384      | 675997000 | 675020000    | 1.0013147736       | 0.9998676007
2015-04-05 | 699141380      | 701604000 | 697714000    | 1.0035223491       | 0.9979583815
2015-04-06 | 759843712      | 762081000 | 760236000    | 1.0029444055       | 1.0005162746
2015-04-07 | 764473566      | 766895000 | 763493000    | 1.0031674529       | 0.9987173317
2015-04-08 | 758929629      | 760840000 | 757112000    | 1.0025171912       | 0.9976050098
2015-04-09 | 737141651      | 739881000 | 736318000    | 1.0037161772       | 0.9988826422
2015-04-10 | 685117893      | 686692000 | 683786000    | 1.002297571        | 0.9980559652
2015-04-11 | 656572577      | 657947000 | 656189000    | 1.0020933299       | 0.9994157889
2015-04-12 | 708307921      | 710316000 | 706448000    | 1.0028350368       | 0.997374135
2015-04-13 | 748818828      | 751991000 | 748555000    | 1.0042362343       | 0.9996476745
2015-04-14 | 747906245      | 751630000 | 746890000    | 1.0049789061       | 0.9986412134
2015-04-15 | 732268940      | 734909000 | 732189000    | 1.0036053147       | 0.9998908325
2015-04-16 | 733068657      | 735386000 | 731231000    | 1.0031611541       | 0.9974931993
2015-04-17 | 691529083      | 693636000 | 690759000    | 1.0030467511       | 0.9988864055
2015-04-18 | 649600948      | 648870000 | 649650000    | 0.9988747738       | 1.000075511
2015-04-19 | 714103028      | 712319000 | 712790000    | 0.9975017218       | 0.9981612905

The sampling is done daily, and the version of the pageview definition I used for the Hive-sampled set is the most recent one (I have tested the previous ones as well; the difference is really, really small).
Let's get in touch and decide how we move forward.

> Tilman: I confirm the discrepancy we observe comes from a difference in the computation of the pageview boolean, and not from the sampling (see below).

Knowing that means a big step forward; thanks for all the time you are putting into this!

> Understanding the difference more precisely would require @Ironholds to provide his code for the April computation.

I think @Ironholds already recommended this approach himself on August 24 above, so I hope he will be happy to provide a pointer to that code.

> day | Hive-unsampled | Cube | Hive-sampled | Cube / H-unsampled | H-sampled / H-unsampled
>
> [...]
>
> The sampling is done daily,

OK, we need to be aware that per Andrew above the real sampling is done hourly instead of daily (and presumably was done this way for the Cube data until April 2015 too). Let's assume for now that this doesn't affect the conclusion from your simulation.

> and the version of the pageview definition I used for the Hive-sampled set is the most recent one (I have tested the previous ones as well; the difference is really, really small).
> Let's get in touch and decide how we move forward.

Great - I may try to catch you on IRC early tomorrow morning PDT (from say 15h UTC); we actually have a quite timely need right now for comparable historical data because we are finalizing the slides for next week's quarterly review of the Reading team and want to include reliable YOY comparisons there. I'll follow up here on another issue regarding that.

> [...]
>
> Now for something more interesting: spider tagging.
>
> diff_hive_cube_no_spider.png (612×812 px, 92 KB)
>
> Here we can see a difference between hive and cube when we only consider non-spiders.
> The chart below shows the ratio (hive/cube) considering all agent types, and considering non-spider agent types only.
>
> diff_percent_hive_cube_spider.png (612×812 px, 68 KB)
>
> We can see a clear difference between hive and cube: hive removes fewer spiders.
>
> An update to the code should reduce this difference starting September 17th (see here).

This is a new and quite important finding. I should say that for the purposes of the Reading team (such as the aforementioned quarterly review) we are mostly interested in human pageviews only. I.e. in that use case, the discrepancies with the historical data are now much larger than assumed. To check, I have re-generated the earlier "daily discrepancies" table for non-spider views below; I guess the percentages correspond to the green line in your second chart above.

date      | pageviews05 (Cube) | projectview_hourly (Hive) | Cube/Hive
4/1/2015  | 592674000          | 602546281                 | 98.36%
4/2/2015  | 583892000          | 592559951                 | 98.54%
4/3/2015  | 599624000          | 588147712                 | 101.95%
4/4/2015  | 568067000          | 577420845                 | 98.38%
4/5/2015  | 601047000          | 609998035                 | 98.53%
4/6/2015  | 650968000          | 662694399                 | 98.23%
4/7/2015  | 648567000          | 660615908                 | 98.18%
4/8/2015  | 636195000          | 646392437                 | 98.42%
4/9/2015  | 615877000          | 626270629                 | 98.34%
4/10/2015 | 575352000          | 585260517                 | 98.31%
4/11/2015 | 549285000          | 558154195                 | 98.41%
4/12/2015 | 598148000          | 605843822                 | 98.73%
4/13/2015 | 647194000          | 653249285                 | 99.07%
4/14/2015 | 635839000          | 645641185                 | 98.48%
4/15/2015 | 614221000          | 622041702                 | 98.74%
4/16/2015 | 605629000          | 616110887                 | 98.30%
4/17/2015 | 579308000          | 588273161                 | 98.48%
4/18/2015 | 542755000          | 553661403                 | 98.03%
4/19/2015 | 602462000          | 622498436                 | 96.78%
Sum       | 11447104000        | 11617380790               | 98.53%

While this error is much larger than the one we are still investigating for the total pageviews, it is at least better understood, if we assume that the changes in the spider definition (your difflink) explain most of it.

Data via

mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -hanalytics-store.eqiad.wmnet --database staging --batch --execute "SELECT LEFT(timestamp, 10) AS date, SUM(pageviews) FROM pageviews05 WHERE timestamp>='2015-04' AND is_spider = FALSE AND is_automata = FALSE GROUP BY date;"

and

hive (default)> SELECT CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) as date, SUM(view_count), year, month, day FROM wmf.projectview_hourly WHERE (year=2015 AND month=4 AND day<20) AND agent_type = 'user' GROUP BY year, month, day;

For the record, Oliver has shared the reference implementation of the pageview definition underlying pageviews05: https://github.com/wikimedia-research/pageviews ("you want (particularly) read_sampled_log and to_pageviews"), and Joseph is going to compare it with the Java code underlying projectview_hourly/pageview_hourly.

Tilman,
I found that the definition-matching issue is, while small, less simple than I expected.
I have not managed to get closer to the R definition than 0.07% on average, even with the closest fit I could get.
Also, while investigating, I found a bug in my previous analysis comparing hive-unsampled and hive-sampled --> they now reconcile really well.

There are 3 differences with an impact in the filtering logic between the R and Java pageview definitions, in addition to the projects (mediawiki and wikimediafoundation) already discussed (an illustrative sketch follows the table below):

  1. No filter on action=edit in R, while present in Java (~ +3416000 / day)
  2. Search pages not included in R, while they are in Java (~ +464000 / day)
  3. uri hosts are not converted to lowercase in R, while they are in Java (~ -614000 / day)
day        | Hive-unsampled | Hive-sampled | H-sampled / H-unsampled | Cube      | Hive-sampled-modified | Cube / Hive-sampled-modified
2015-04-01 | 703740811      | 704571000    | 100.118%                | 707994000 | 708167000             | 99.98%
2015-04-02 | 691968820      | 692450000    | 100.070%                | 695838000 | 696339000             | 99.93%
2015-04-03 | 685269980      | 686220000    | 100.139%                | 710488000 | 689739000             | 103.01%
2015-04-04 | 675109384      | 676260000    | 100.170%                | 675997000 | 678905000             | 99.57%
2015-04-05 | 699141380      | 698936000    | 99.971%                 | 701604000 | 702029000             | 99.94%
2015-04-06 | 759843712      | 761692000    | 100.243%                | 762081000 | 764045000             | 99.74%
2015-04-07 | 764473566      | 765310000    | 100.109%                | 766895000 | 769045000             | 99.72%
2015-04-08 | 758929629      | 758685000    | 99.968%                 | 760840000 | 762219000             | 99.82%
2015-04-09 | 737141651      | 737644000    | 100.068%                | 739881000 | 741532000             | 99.78%
2015-04-10 | 685117893      | 685191000    | 100.011%                | 686692000 | 688346000             | 99.76%
2015-04-11 | 656572577      | 657446000    | 100.133%                | 657947000 | 660008000             | 99.69%
2015-04-12 | 708307921      | 707789000    | 99.927%                 | 710316000 | 710537000             | 99.97%
2015-04-13 | 748818828      | 749873000    | 100.141%                | 751991000 | 753670000             | 99.78%
2015-04-14 | 747906245      | 748232000    | 100.044%                | 751630000 | 752032000             | 99.95%
2015-04-15 | 732268940      | 733471000    | 100.164%                | 734909000 | 737092000             | 99.70%
2015-04-16 | 733068657      | 732575000    | 99.933%                 | 735386000 | 735999000             | 99.92%
2015-04-17 | 691529083      | 692118000    | 100.085%                | 693636000 | 695300000             | 99.76%
2015-04-18 | 649600948      | 650982000    | 100.213%                | 648870000 | 653855000             | 99.24%
2015-04-19 | 714103028      | 714020000    | 99.988%                 | 712319000 | 717074000             | 99.34%
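For illustration only, the first and third differences can be paraphrased as simple R predicates on a raw request; this is not the actual Java PageviewDefinition or R pageviews code, and the exact patterns are assumptions:

# Illustrative paraphrase of two of the differences listed above (not the real code)
java_filters_edit <- function(uri_query) grepl("(^|&)action=edit(&|$)", uri_query)  # Java drops these requests; R did not
java_host_form    <- function(uri_host)  tolower(uri_host)                          # Java lowercases the host; R did not
# The search-page rule is omitted here; see the change-ids quoted below for the exact Java logic.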

Have you checked when those four Java differences were added to the Java?

@Ironholds: I corrected my previous comment: three diffs, not four.

> 1. No filter on action=edit in R, while present in Java (~ +3416000 / day)

Committed on the 27th of February 2015, change-id I0ec309ebf62eaeaec0d3e612ecf0d0af831b1e1d

> 2. Search pages not included in R, while they are in Java (~ +464000 / day)

Committed on the 3rd of March 2015, change-id I9312cbfcc3987c6a70c3c95a288ce4cefc91dc31

> 3. uri hosts are not converted to lowercase in R, while they are in Java (~ -614000 / day)

Committed on the 5th of January 2015 (original commit for the file), change-id Idf32e1f94ee267d5cb0d90f15a9afbef32e669c2


I have found a (hopefully) last bug in my attempt to recompute R-like pageviews in Hive: in R, the search for query parameters (curid, oldid, title) is done only if the query parameter comes first (just after the ?). It makes the result fit less well, but it's more correct.
The last two things I could imagine creating the difference are:

day        | Cube      | Hive-sampled-modified | Hive-sampled-modified / Cube
2015-04-01 | 707994000 | 708690000             | 100.10%
2015-04-02 | 695838000 | 696815000             | 100.14%
2015-04-03 | 710488000 | 689968000             | 97.11%
2015-04-04 | 675997000 | 679124000             | 100.46%
2015-04-05 | 701604000 | 702264000             | 100.09%
2015-04-06 | 762081000 | 764555000             | 100.32%
2015-04-07 | 766895000 | 769940000             | 100.40%
2015-04-08 | 760840000 | 762926000             | 100.27%
2015-04-09 | 739881000 | 742005000             | 100.29%
2015-04-10 | 686692000 | 688846000             | 100.31%
2015-04-11 | 657947000 | 660337000             | 100.36%
2015-04-12 | 710316000 | 710869000             | 100.08%
2015-04-13 | 751991000 | 754157000             | 100.29%
2015-04-14 | 751630000 | 752552000             | 100.12%
2015-04-15 | 734909000 | 737567000             | 100.36%
2015-04-16 | 735386000 | 736446000             | 100.14%
2015-04-17 | 693636000 | 695827000             | 100.32%
2015-04-18 | 648870000 | 654245000             | 100.83%
2015-04-19 | 712319000 | 717343000             | 100.71%

@Ironholds: I have re-run the R code on stat1002, having installed the GitHub versions of the R packages --> I got quite different results from the ones in the DB.
I did (sorry for the poor code, I am not an R person):

library(pageviews)
pvcount <-lapply(get_files(earliest="2015-04-01", latest="2015-04-19"), function(l) nrow(to_pageviews(read_sampled_log(l))))

And got that result:

721702
704113
689490
717606
669698
716917
766752
772423
751207
732034
672779
669549
720380
752698
750538
732099
728555
686288
648094

Any idea ?

Don't apologise, that looks like totally serviceable code (heck, it's pretty much precisely how I'd do it). Tried multiplying the results by 1,000? ;p

Even there, there's variation - my hypothesis would be ts filtering. So, some log lines are simultaneously screwy and invalidly formatted - as a simple heuristic against that I used to filter out any row without a valid timestamp field (since that's a standardised format storing server-side information).

So try, for the anonymous function:

data <- read_sampled_log(l)
data$timestamp <- convert_timestamp(data$timestamp)
data <- data[!is.na(data$timestamp),]
nrow(to_pageviews(data))

Timestamp conversion tested on one day, 2015-04-01 (heavy computation, so keep it small) --> still 721702.

I have another experiment on the go: the files contain partial days starting at 06:25 in the morning --> the data needs to be re-aggregated instead of taken by file.

Will know more tomorrow.

kevinator renamed this task from Investigate sample cube pageview_count vs unsampled log pageview count [5 pts] to Investigate sample cube pageview_count vs unsampled log pageview count [5 pts] {hawk}.Oct 6 2015, 1:26 AM
JAllemandou renamed this task from Investigate sample cube pageview_count vs unsampled log pageview count [5 pts] {hawk} to Investigate sample cube pageview_count vs unsampled log pageview count [13 pts] {hawk}.Oct 6 2015, 10:13 AM

Confirmed: the files don't contain whole days, so recombination across files is needed.
With that done, I got back the same numbers as the ones in the DB for the first few days.
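For reference, a sketch of that recombination, reusing the same pageviews-package functions as above (the extra day of files, the shape of what to_pageviews returns, and the final scaling by 1000 are assumptions, and pooling this many sampled logs in memory may be heavy):

library(pageviews)
# Pool all rows first, then bucket by calendar day instead of by file
files <- get_files(earliest = "2015-04-01", latest = "2015-04-20")   # one extra file so the last day is complete
logs  <- do.call(rbind, lapply(files, read_sampled_log))
logs$timestamp <- convert_timestamp(logs$timestamp)
logs  <- logs[!is.na(logs$timestamp), ]
pv    <- to_pageviews(logs)                        # assumed to return the filtered rows
table(as.Date(pv$timestamp)) * 1000                # scale counts by the 1:1000 sampling factor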

> This is a new and quite important finding. I should say that for the purposes of the Reading team (such as the aforementioned quarterly review) we are mostly interested in human pageviews only. I.e. in that use case, the discrepancies with the historical data are now much larger than assumed. To check, I have re-generated the earlier "daily discrepancies" table for non-spider views below; I guess the percentages correspond to the green line in your second chart above.

It would be more correct to say "pageviews not tagged as bots". We know there is a bunch of automated crawler traffic that we are not catching, thus you cannot assume that pageviews not tagged as spider are "human" pageviews. A side note, but an important one.

Right, but what I said was that "for the purposes of the Reading team (such as the aforementioned quarterly review) we are mostly interested in human pageviews only" and that's still correct ;) Pageviews not tagged as bots/spiders are the metric we use for that because it's the best proxy we have. Good to keep its limitations in mind, of course.

@JAllemandou : I see you moved this task to "done" on the board, but left its status as open. In our October 6 IRC discussion (log) we had resolved to collect the various findings from this task into an overview table, ideally with an estimate/upper bound for the discrepancy caused by each (even just orders of magnitude are helpful too), and dates where the change was introduced (if applicable). Like this:

description of the difference | pageviews05 / R | pageview_hourly / Java | size estimate    | change date, link
"action=edit" counted?        | yes             | no                     | ~ +3416000 / day | February 27, 2015
...                           | ...             | ...                    | ...              | ...

You suggested a wiki page for this, which makes great sense to me. Has this been started already? If you're still wondering about a suitable page name, how about something like https://meta.wikimedia.org/wiki/Research:Page_view/Dataset_discrepancies (to be linked from / to https://meta.wikimedia.org/wiki/Research:Page_view#Change_log )?

@Tbayer, our tasks just remain open as a matter of process, the product owner closes them as resolved to acknowledge that it's completed to everyone's satisfaction. As far as this task is concerned, it was going on for a long time so I suggested we wrap it up. If there's anything else to be done, please open separate tasks for each item of work. That makes it easier on us to track and stay predictable with our other work.

@Tbayer
Sorry, but the team does not agree that the differences between the old R code and the current pageview definition need to be kept up to date on a wiki page. We feel we have put quite a bit of work into documenting something that should be obvious, namely that the R pageview code written more than 6 months ago is outdated; it is to be expected that results using that code will differ from the pageview definition in Java, more so as time goes by. We understand that for some time period you only have data from the R definition and that is unfortunate; should you feel you need to keep an up-to-date list of differences, these are the resources you can use:

The changelog of the pageview definition is kept here:
https://meta.wikimedia.org/wiki/Research:Page_view#Change_log

Changes to the Java code are here:
https://github.com/wikimedia/analytics-refinery-source/commits/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java

The authoritative source as to changes to the definition is the Java code, but we try our best to keep the wikis up to date.

For what it's worth, as the author of both the R definition and the Java definition I absolutely agree with this approach - with the exception that the authoritative source should be the wikis. "The code documents itself!" has never been and will never be true of absolutely anything. Teams at the Wikimedia Foundation have a duty of transparency and the wiki page should always be up to date, because that's where the transparency comes from. If the wiki page and code disagree, the code is wrong.

You are right, this should be the approach for a definition with such high visibility. We have made an effort to keep those wikis shiny, so they should be the primary reference.

English really needs a word for "yay, we agree!" but in the absence of that I'll just say this :P

Update: So @JAllemandou was in SF last week for the Dev Summit and All Hands, and he and I took the opportunity on Friday to sit down in person and carry out the above mentioned plan to summarize the findings from this task (and clarify some that had not been fully clear to me from the comments here). The result is this table on Meta-wiki. Feel free to edit, and of course to add possible future findings. It took us about 40 minutes to compile this.

> @Tbayer, our tasks just remain open as a matter of process, the product owner closes them as resolved to acknowledge that it's completed to everyone's satisfaction. As far as this task is concerned, it was going on for a long time so I suggested we wrap it up. If there's anything else to be done, please open separate tasks for each item of work. That makes it easier on us to track and stay predictable with our other work.

In my understanding a task is done when the underlying objective has been achieved, not when it has stayed open for a certain amount of time. That said, I agree that the objective of "investigate ..." had not been spelled out very clearly in the original task description. From my perspective, it has been to understand the discrepancies between these two datasets well enough to be able to combine them for the purpose of doing historical comparisons beyond April 2015. That was not the case when the task was closed on October 15. With the new table we at least have a better grasp on what has been found so far, even though we may still not have explained every difference fully.

I was a bit surprised to see the importance of valid historical comparisons questioned a lot, so let me dwell on that a little. There is a lot of attention - internally at WMF, from our community, and in the general public - for how this metric (total pageviews) develops in the long term. E.g. just two days ago, the largest German weekly newspaper highlighted a supposed year-over-year drop in these total PVs while covering Wikipedia's 15th anniversary. The new Reading Product page, the WMF quarterly report and the Reading team's quarterly review all track its year-over-year changes (in each I had to link back to this bug as a caveat about the lack of reliable comparable data, but will now be able to refer to the new overview on Meta). For all these audiences, the difference between a +2% increase and a -2% decrease (say) year-over-year matters quite a bit. As another example, I recently resorted to the pageviews05 data to investigate a traffic drop in mid-December (to decide whether it could still be seasonal, or was worth ringing an alarm; will cover this in the upcoming readership metrics report).

It's true that work on this task dragged on a bit, but consider that the most important finding here (accompanied by a bug fix) came only after several weeks, and that it was only after over a month that work began on Oliver's suggestion to compare the code directly, leading to further important findings - e.g. regarding action=edit - more than six weeks after the task had been opened. There were also various false starts in the investigation that I spent time helping to correct. I take some responsibility for not highlighting clearly that we would already have needed this resolved at the end of September for the Q1 quarterly review/report - this was an oversight on my part, but in general I also wanted to be mindful of the Analytics team's workload at the time (pageview API etc.).

> @Tbayer
> Sorry, but the team does not agree that the differences between the old R code and the current pageview definition need to be kept up to date on a wiki page.

That's a misrepresentation of what I asked for, and of what this task had been about. The investigation above was focused on the discrepancies caused by a one-time change (from the R implementation to the Java implementation of the same page view definition). IMHO these discrepancies should have been documented back then already, instead of being uncovered months later by an analyst from another team who had been advised by the Analytics department to use this data for comparison.

> We feel we have put quite a bit of work into documenting something that should be obvious, namely that the R pageview code written more than 6 months ago is outdated; it is to be expected that results using that code will differ from the pageview definition in Java, more so as time goes by. We understand that for some time period you only have data from the R definition and that is unfortunate; should you feel you need to keep an up-to-date list of differences, these are the resources you can use:
>
> The changelog of the pageview definition is kept here:
> https://meta.wikimedia.org/wiki/Research:Page_view#Change_log
>
> Changes to the Java code are here:
> https://github.com/wikimedia/analytics-refinery-source/commits/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java
>
> The authoritative source as to changes to the definition is the Java code, but we try our best to keep the wikis up to date.

For the record, I fully agree with the outcome of the subsequent discussion stating that the definition as recorded on the wiki should be regarded as the authoritative specification. And ideally it should be specific enough to avoid differences between different implementations in the first place.