Page MenuHomePhabricator

[REQUEST] Language team's baseline metrics - correct CLL pageview data
Closed, ResolvedPublic

Description

There seems to be an issue causing pageview data from the API to differ from the underlying data in the Hadoop cluster. @Amire80 wants to use some of the data affected by this as a baseline for later evaluation of Compact Language Links(CLL), so he would like the "correct" data from the Hadoop cluster.

3 languages have been affected: RU, NL, and EN, hence the baseline for these 3 needs to be adjusted.

Event Timeline

HJiang-WMF renamed this task from [REQUEST] Language team's baseline metrics to [REQUEST] Language team's baseline metrics - correct CLL pageview data.Aug 10 2016, 4:19 PM
HJiang-WMF updated the task description. (Show Details)
Amire80 added subscribers: JAllemandou, Milimetric.

In more detail:

I am currently recording the number of interlanguage clicks between all languages every day, and I compare their number with the number of daily pageviews. There are several issues with how it's done:

  • I do it in a Google spreadsheet. It might make more sense to do it in our own data collecting and dashboarding technology, as @JAllemandou and @Milimetric suggested a couple of times. This will be generally more robust and easier to maintain and publicize.
  • When I started doing this, it was enough to correlate it with the pageviews data only for a few languages, but it makes sense to do it for all languages and not just a sample. Currently I use the https://tools.wmflabs.org/pageviews/ site to get the data for this sample of languages, but it won't scale for all languages.
  • The number of daily pageviews are seriously skewed for at least three languages (en, nl, ru) since mid-July—see T141506. Addressing this through the usual Pageviews API and the wmflabs pageviews website, and writing a manual HQL query that filters out the bogus data will be needed. @Milimetric hinted at how to write this query at T141506#2533974.

The first step would be to write this query so that we'll have non-skewed data pageviews for en, nl and ru ASAP. This is quite urgent because the Compact Language Links is already deployed at ru and is planned to be deployed to all languages soon, so we need to be ready with the data.

After that we should start transitioning all the metrics for this feature from a spreadsheet to a proper Wikimedia dashboard.

Below is the minimally formatted output from Hive on correct EN Wiki pageview starting July 20th to July 31st, the 1st day we start to see skewed pageview data.

year  month	day	    date	views
2016	7	20	2016-07-20	15566586
2016	7	21	2016-07-21	16760763
2016	7	22	2016-07-22	16217447
2016	7	23	2016-07-23	18445963
2016	7	24	2016-07-24	20203382
2016	7	25	2016-07-25	21975615
2016	7	26	2016-07-26	21380231
2016	7	27	2016-07-27	19596072
2016	7	28	2016-07-28	18448872
2016	7	29	2016-07-29	16099460
2016	7	30	2016-07-30	16352205
2016	7	31	2016-07-31	19848976

More to follow and will update @Amire80 on further details later.

All preliminary adjusted pageview counts for the three wikis:

EN wiki:

2016	7	20	2016-07-20	147591574
2016	7	21	2016-07-21	160737954
2016	7	22	2016-07-22	162500983
2016	7	23	2016-07-23	144436151
2016	7	24	2016-07-24	153682095
2016	7	25	2016-07-25	181401922
2016	7	26	2016-07-26	185191158
2016	7	27	2016-07-27	178341451
2016	7	28	2016-07-28	174283446
2016	7	29	2016-07-29	164215576
2016	7	30	2016-07-30	148321796
2016	7	31	2016-07-31	157467651
2016	8	1	2016-08-01	182779001
2016	8	2	2016-08-02	179841165
2016	8	3	2016-08-03	166476182
2016	8	4	2016-08-04	157751025
2016	8	5	2016-08-05	154370599
2016	8	6	2016-08-06	139515133
2016	8	7	2016-08-07	145583119
2016	8	8	2016-08-08	178171197
2016	8	9	2016-08-09	174795010
2016	8	10	2016-08-10	175266848
2016	8	11	2016-08-11	172442674
2016	8	12	2016-08-12	164145953
2016	8	13	2016-08-13	144264473
2016	8	14	2016-08-14	151224792
2016	8	15	2016-08-15	178504686
2016	8	16	2016-08-16	170633582

RU wiki:

2016	7	20	2016-07-20	18002751
2016	7	21	2016-07-21	20364572
2016	7	22	2016-07-22	23013788
2016	7	23	2016-07-23	22821666
2016	7	24	2016-07-24	26586786
2016	7	25	2016-07-25	31161525
2016	7	26	2016-07-26	31737641
2016	7	27	2016-07-27	32531652
2016	7	28	2016-07-28	32720562
2016	7	29	2016-07-29	31640460
2016	7	30	2016-07-30	28193439
2016	7	31	2016-07-31	29084307
2016	8	1	2016-08-01	32512884
2016	8	2	2016-08-02	32757354
2016	8	3	2016-08-03	30784116
2016	8	4	2016-08-04	28147983
2016	8	5	2016-08-05	25998700
2016	8	6	2016-08-06	22533367
2016	8	7	2016-08-07	23787873
2016	8	8	2016-08-08	27617700
2016	8	9	2016-08-09	28303708
2016	8	10	2016-08-10	29626509
2016	8	11	2016-08-11	30283348
2016	8	12	2016-08-12	30625244
2016	8	13	2016-08-13	27086723
2016	8	14	2016-08-14	29181607
2016	8	15	2016-08-15	33586892
2016	8	16	2016-08-16	31597300

NL wiki:

2016	7	20	2016-07-20	1834975
2016	7	21	2016-07-21	1833843
2016	7	22	2016-07-22	1997255
2016	7	23	2016-07-23	1886527
2016	7	24	2016-07-24	2000741
2016	7	25	2016-07-25	2484878
2016	7	26	2016-07-26	2491209
2016	7	27	2016-07-27	2472188
2016	7	28	2016-07-28	2412744
2016	7	29	2016-07-29	2431223
2016	7	30	2016-07-30	2028314
2016	7	31	2016-07-31	2021094
2016	8	1	2016-08-01	2367691
2016	8	2	2016-08-02	2469991
2016	8	3	2016-08-03	2427744
2016	8	4	2016-08-04	2419846
2016	8	5	2016-08-05	2458192
2016	8	6	2016-08-06	1964125
2016	8	7	2016-08-07	2159764
2016	8	8	2016-08-08	2632790
2016	8	9	2016-08-09	2563256
2016	8	10	2016-08-10	2623378
2016	8	11	2016-08-11	2702355
2016	8	12	2016-08-12	2348788
2016	8	13	2016-08-13	1604863
2016	8	14	2016-08-14	1692343
2016	8	15	2016-08-15	2047259
2016	8	16	2016-08-16	2444235

Since the outrageous spike happened starting on July 20th with Chrome 41 on Windows being the driver behind the spike, I also looked into pageview data before July 20th to find out the usual percentage of Chrome 41 on Windows. It turns out that it is roughly between 3.5% to 5% of normal pageview data. So if we would want further finetuning adjustment, probably that could be considered, too.

Thank you, @HJiang-WMF! When you get a chance, would you post the HQL you used to generate these numbers?

The query. This example is for getting data between August 1st and 16th of RU wiki pageview data in the formatted manner as posted above.

SELECT year, month, day, CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) AS date, sum(view_count) AS views 
FROM wmf.pageview_hourly
WHERE
  year = 2016
  AND month = 8
  AND day >=1 
  AND day <= 16
  AND project = 'ru.wikipedia' 
  AND agent_type = 'user'
  AND access_method = 'desktop'
  and NOT (user_agent_map['browser_family'] = 'Chrome'
      and user_agent_map['browser_major'] = '41'
      and user_agent_map['os_family'] = 'Windows'
      AND page_title = 'Main_Page')
 GROUP BY year, month, day ORDER BY year, month, day;

Hmm. Either I'm missing something, or this query, as well as the numbers posted above, still shows a rise in the number of daily pageviews starting mid-July. It's around 120 million per week until mid-july and around 200 million per week after that.

I thought that the issue is that 'Main_Page' is the English name, so I tried it with the Russian name 'Заглавная_страница', but I get the same result.

Hmm. Either I'm missing something, or this query, as well as the numbers posted above, still shows a rise in the number of daily pageviews starting mid-July. It's around 120 million per week until mid-july and around 200 million per week after that.

Actually, if one looks at it day-by-day, it seems quite plausible [edited to add: ... that this query simply yields the uncorrected views - now confirmed]. (Between 12 and 17 million/day for each day from July 1-19, and between 26 and 34 million for each day from July 24 to August 16, dropping back to 18 million on August 17-18.) Keep in mind that desktop traffic is usually much lower on weekends, or if aggregating views by week, that the incident fell in the middle of a calendar week.

(I've been looking at this result because I'm constructing a similar query to calculate the global pageviews with these corrections.)

I thought that the issue is that 'Main_Page' is the English name, so I tried it with the Russian name 'Заглавная_страница', but I get the same result.

Yeah, https://ru.wikipedia.org/wiki/Main_Page etc. are hardcoded redirects to the main page in the local language (like for many special pages), which is why this kind of query works cross-language. [edit: I misremembered, it seems these hardcoded redirects are actually treated like normal (wiki page) redirects in the webrequest/pageview tables. So changing 'Main_Page' to 'Заглавная_страница' was necessary, but it still did not give the correct result because there was another bug in the query, namely in the os_family condition - will post below.]

SELECT year, month, day, CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) AS date, sum(view_count) AS views 
FROM wmf.pageview_hourly
WHERE
  year = 2016
  AND month > 6
  AND project = 'ru.wikipedia' 
  AND agent_type = 'user'
  AND access_method = 'desktop'
  and NOT (user_agent_map['browser_family'] = 'Chrome'
      and user_agent_map['browser_major'] = '41'
      and user_agent_map['os_family'] = 'Windows'
      AND page_title = 'Main_Page')
GROUP BY year, month, day 
ORDER BY year, month, day
LIMIT 10000;

year	month	day	date	views
2016	7	1	2016-07-01	15146857
2016	7	2	2016-07-02	13309419
2016	7	3	2016-07-03	14231594
2016	7	4	2016-07-04	16257980
2016	7	5	2016-07-05	16512976
2016	7	6	2016-07-06	15782406
2016	7	7	2016-07-07	16213196
2016	7	8	2016-07-08	15090574
2016	7	9	2016-07-09	12992570
2016	7	10	2016-07-10	14291279
2016	7	11	2016-07-11	16831735
2016	7	12	2016-07-12	16146646
2016	7	13	2016-07-13	15796416
2016	7	14	2016-07-14	16053259
2016	7	15	2016-07-15	14893612
2016	7	16	2016-07-16	12672429
2016	7	17	2016-07-17	13761908
2016	7	18	2016-07-18	16173416
2016	7	19	2016-07-19	16106530
2016	7	20	2016-07-20	18002751
2016	7	21	2016-07-21	20364572
2016	7	22	2016-07-22	23013788
2016	7	23	2016-07-23	22821666
2016	7	24	2016-07-24	26586786
2016	7	25	2016-07-25	31161525
2016	7	26	2016-07-26	31737641
2016	7	27	2016-07-27	32531652
2016	7	28	2016-07-28	32720562
2016	7	29	2016-07-29	31640460
2016	7	30	2016-07-30	28193439
2016	7	31	2016-07-31	29084307
2016	8	1	2016-08-01	32512884
2016	8	2	2016-08-02	32757354
2016	8	3	2016-08-03	30784116
2016	8	4	2016-08-04	28147983
2016	8	5	2016-08-05	25998700
2016	8	6	2016-08-06	22533367
2016	8	7	2016-08-07	23787873
2016	8	8	2016-08-08	27617700
2016	8	9	2016-08-09	28303708
2016	8	10	2016-08-10	29626509
2016	8	11	2016-08-11	30283348
2016	8	12	2016-08-12	30625244
2016	8	13	2016-08-13	27086723
2016	8	14	2016-08-14	29181607
2016	8	15	2016-08-15	33586892
2016	8	16	2016-08-16	31597300
2016	8	17	2016-08-17	17690035
2016	8	18	2016-08-18	17737970
2016	8	19	2016-08-19	15697771

[edited to add]: The query below confirms that these numbers are identical to the uncorrected views, i.e. the correction didn't work. Right now I don't know either where the bug lies in the above query, will check further.

SELECT year, month, day, CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) AS date, sum(view_count) AS views 
FROM wmf.pageview_hourly
WHERE
  year = 2016
  AND month > 6
  AND project = 'ru.wikipedia' 
  AND agent_type = 'user'
  AND access_method = 'desktop'
GROUP BY year, month, day 
ORDER BY year, month, day
LIMIT 10000;
year	month	day	date	views
2016	7	1	2016-07-01	15146857
2016	7	2	2016-07-02	13309419
2016	7	3	2016-07-03	14231594
2016	7	4	2016-07-04	16257980
2016	7	5	2016-07-05	16512976
2016	7	6	2016-07-06	15782406
2016	7	7	2016-07-07	16213196
2016	7	8	2016-07-08	15090574
2016	7	9	2016-07-09	12992570
2016	7	10	2016-07-10	14291279
2016	7	11	2016-07-11	16831735
2016	7	12	2016-07-12	16146646
2016	7	13	2016-07-13	15796416
2016	7	14	2016-07-14	16053259
2016	7	15	2016-07-15	14893612
2016	7	16	2016-07-16	12672429
2016	7	17	2016-07-17	13761908
2016	7	18	2016-07-18	16173416
2016	7	19	2016-07-19	16106530
2016	7	20	2016-07-20	18002751
2016	7	21	2016-07-21	20364572
2016	7	22	2016-07-22	23013788
2016	7	23	2016-07-23	22821666
2016	7	24	2016-07-24	26586786
2016	7	25	2016-07-25	31161525
2016	7	26	2016-07-26	31737641
2016	7	27	2016-07-27	32531652
2016	7	28	2016-07-28	32720562
2016	7	29	2016-07-29	31640460
2016	7	30	2016-07-30	28193439
2016	7	31	2016-07-31	29084307
2016	8	1	2016-08-01	32512884
2016	8	2	2016-08-02	32757354
2016	8	3	2016-08-03	30784116
2016	8	4	2016-08-04	28147983
2016	8	5	2016-08-05	25998700
2016	8	6	2016-08-06	22533367
2016	8	7	2016-08-07	23787873
2016	8	8	2016-08-08	27617700
2016	8	9	2016-08-09	28303708
2016	8	10	2016-08-10	29626509
2016	8	11	2016-08-11	30283348
2016	8	12	2016-08-12	30625244
2016	8	13	2016-08-13	27086723
2016	8	14	2016-08-14	29181607
2016	8	15	2016-08-15	33586892
2016	8	16	2016-08-16	31597300
2016	8	17	2016-08-17	17690035
2016	8	18	2016-08-18	17737970
2016	8	19	2016-08-19	16310465
50 rows selected (266.667 seconds)

OK, found the bug in the query: Different major Windows variants (Windows 7, Windows 8.1, Windows 10, ...) actually carry different values in user_agent_map['os_family'] details below). So the WHERE condition needs to read user_agent_map['os_family'] LIKE 'Windows%'instead of user_agent_map['os_family'] = 'Windows'.

With that fix, and also (as already suggested by @Amire80 above) changing 'Main_Page' to 'Заглавная_страница', the numbers look plausibly fixed:

SELECT year, month, day, CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) AS date, sum(view_count) AS views 
FROM wmf.pageview_hourly
WHERE
  year = 2016
  AND month > 6
  AND project = 'ru.wikipedia' 
  AND agent_type = 'user'
  AND access_method = 'desktop'
  and NOT (user_agent_map['browser_family'] = 'Chrome'
      and user_agent_map['browser_major'] = '41'
      and user_agent_map['os_family'] LIKE 'Windows%'
      AND page_title = 'Заглавная_страница')
GROUP BY year, month, day 
ORDER BY year, month, day
LIMIT 10000;
year	month	day	date	views
2016	7	1	2016-07-01	15146484
2016	7	2	2016-07-02	13309166
2016	7	3	2016-07-03	14231352
2016	7	4	2016-07-04	16257596
2016	7	5	2016-07-05	16512584
2016	7	6	2016-07-06	15782088
2016	7	7	2016-07-07	16212871
2016	7	8	2016-07-08	15090272
2016	7	9	2016-07-09	12992330
2016	7	10	2016-07-10	14290971
2016	7	11	2016-07-11	16831199
2016	7	12	2016-07-12	16146296
2016	7	13	2016-07-13	15796101
2016	7	14	2016-07-14	16052987
2016	7	15	2016-07-15	14893343
2016	7	16	2016-07-16	12672240
2016	7	17	2016-07-17	13761712
2016	7	18	2016-07-18	16173134
2016	7	19	2016-07-19	16078752
2016	7	20	2016-07-20	16617712
2016	7	21	2016-07-21	15927561
2016	7	22	2016-07-22	15385765
2016	7	23	2016-07-23	13221429
2016	7	24	2016-07-24	14329528
2016	7	25	2016-07-25	15993665
2016	7	26	2016-07-26	15920161
2016	7	27	2016-07-27	15924852
2016	7	28	2016-07-28	15890992
2016	7	29	2016-07-29	15345762
2016	7	30	2016-07-30	13184307
2016	7	31	2016-07-31	14017981
2016	8	1	2016-08-01	16009843
2016	8	2	2016-08-02	16071622
2016	8	3	2016-08-03	15831568
2016	8	4	2016-08-04	15625088
2016	8	5	2016-08-05	15057452
2016	8	6	2016-08-06	13081192
2016	8	7	2016-08-07	14348028
2016	8	8	2016-08-08	16573789
2016	8	9	2016-08-09	16605814
2016	8	10	2016-08-10	16811483
2016	8	11	2016-08-11	16778428
2016	8	12	2016-08-12	16601648
2016	8	13	2016-08-13	14352198
2016	8	14	2016-08-14	15552892
2016	8	15	2016-08-15	18389558
2016	8	16	2016-08-16	18208280
2016	8	17	2016-08-17	17689696
2016	8	18	2016-08-18	17737760
2016	8	19	2016-08-19	17108040
2016	8	20	2016-08-20	705689
51 rows selected (468.525 seconds)
SELECT user_agent_map['os_family'], 
SUM(view_count) AS allviews
FROM wmf.pageview_hourly
WHERE
  year = 2016
  AND month = 7
  AND day = 31
  AND hour =0
  AND agent_type = 'user'
  AND access_method = 'desktop'
  AND user_agent_map['os_family'] LIKE 'Windows%' 
GROUP BY user_agent_map['os_family']
ORDER BY allviews DESC
LIMIT 10000;
_c0	allviews
Windows 10	2702132
Windows 7	2336013
Windows 8.1	586562
Windows XP	372937
Windows 8	152689
Windows Vista	120705
Windows 98	64057
Windows	22061
Windows 2000	6585
Windows RT 8.1	4748
Windows CE	4622
Windows 95	1179
Windows 3.1	1028
Windows Phone	839
Windows RT	589
Windows NT 4.0	559
Windows ME	52
Windows NT	11
Windows Mobile	3
19 rows selected (54.179 seconds)

In fact, even just from an aggregated point of view, it's fairly obvious that it doesn't matter whether the "Main page" in page_title is in English or local language. For example:

select page_title, sum(view_count) as views FROM wmf.pageview_hourly
WHERE
  year = 2016
  AND month = 7
  AND day >=10
  AND day <= 25
  AND project = 'ru.wikipedia' 
  and (page_title = 'Main_Page' or page_title = 'Заглавная_страница')
group by page_title
order by views DESC
limit 1000;

Results are:

page_title	views
Заглавная_страница	65973389
Main_Page	10495

So Main Page, even if in English, is redirected to main page in Russian, and the English "main page" has negligible pageview numbers even over 15 days, and compared to the redirected values of pageviews, so the language of "main page" is not an issue. @Tbayer @Amire80

In fact, even just from an aggregated point of view, it's fairly obvious that it doesn't matter whether the "Main page" in page_title is in English or local language. For example:

select page_title, sum(view_count) as views FROM wmf.pageview_hourly
WHERE
  year = 2016
  AND month = 7
  AND day >=10
  AND day <= 25
  AND project = 'ru.wikipedia' 
  and (page_title = 'Main_Page' or page_title = 'Заглавная_страница')
group by page_title
order by views DESC
limit 1000;

Results are:

page_title	views
Заглавная_страница	65973389
Main_Page	10495

So Main Page, even if in English, is redirected to main page in Russian,

True, and that was already pointed out above (T142505#2569038 ).

and the English "main page" has negligible pageview numbers even over 15 days, and compared to the redirected values of pageviews, so the language of "main page" is not an issue. @Tbayer @Amire80

I'm not sure I understand. Doesn't the fact that these two numbers differ so much underline the importance of choosing the right page title (i.e. localizing 'Main_Page') for the query?

Or to put it differently: Taking the query at T142505#2564671 and fixing only the aforementioned os_family bug (without also changing 'Main_Page' to ''Заглавная_страница') would still only have given basically the same result as the query without any correction for the TLS issue.

SELECT year, month, day, CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) AS date, sum(view_count) AS views 
FROM wmf.pageview_hourly
WHERE
  year = 2016
  AND month > 6
  AND project = 'ru.wikipedia' 
  AND agent_type = 'user'
  AND access_method = 'desktop'
  and NOT (user_agent_map['browser_family'] = 'Chrome'
      and user_agent_map['browser_major'] = '41'
      and user_agent_map['os_family'] LIKE 'Windows%'
      AND page_title = 'Main_Page')
GROUP BY year, month, day 
ORDER BY year, month, day
LIMIT 10000;

year	month	day	date	views
2016	7	1	2016-07-01	15146857
2016	7	2	2016-07-02	13309419
2016	7	3	2016-07-03	14231593
2016	7	4	2016-07-04	16257979
2016	7	5	2016-07-05	16512976
2016	7	6	2016-07-06	15782406
2016	7	7	2016-07-07	16213196
2016	7	8	2016-07-08	15090574
2016	7	9	2016-07-09	12992570
2016	7	10	2016-07-10	14291279
2016	7	11	2016-07-11	16831735
2016	7	12	2016-07-12	16146646
2016	7	13	2016-07-13	15796416
2016	7	14	2016-07-14	16053259
2016	7	15	2016-07-15	14893612
2016	7	16	2016-07-16	12672429
2016	7	17	2016-07-17	13761908
2016	7	18	2016-07-18	16173416
2016	7	19	2016-07-19	16106530
2016	7	20	2016-07-20	18002751
2016	7	21	2016-07-21	20364572
2016	7	22	2016-07-22	23013788
2016	7	23	2016-07-23	22821666
2016	7	24	2016-07-24	26586786
2016	7	25	2016-07-25	31161525
2016	7	26	2016-07-26	31737641
2016	7	27	2016-07-27	32531652
2016	7	28	2016-07-28	32720562
2016	7	29	2016-07-29	31640460
2016	7	30	2016-07-30	28193439
2016	7	31	2016-07-31	29084307
2016	8	1	2016-08-01	32512884
2016	8	2	2016-08-02	32757354
2016	8	3	2016-08-03	30784116
2016	8	4	2016-08-04	28147983
2016	8	5	2016-08-05	25998700
2016	8	6	2016-08-06	22533367
2016	8	7	2016-08-07	23787873
2016	8	8	2016-08-08	27617700
2016	8	9	2016-08-09	28303708
2016	8	10	2016-08-10	29626509
2016	8	11	2016-08-11	30283348
2016	8	12	2016-08-12	30625244
2016	8	13	2016-08-13	27086723
2016	8	14	2016-08-14	29181607
2016	8	15	2016-08-15	33586892
2016	8	16	2016-08-16	31597300
2016	8	17	2016-08-17	17690035
2016	8	18	2016-08-18	17737970
2016	8	19	2016-08-19	17108273
2016	8	20	2016-08-20	1687813
51 rows selected (498.566 seconds)
Amire80 triaged this task as Medium priority.Aug 22 2016, 7:45 PM