Page MenuHomePhabricator

[REQUEST] WMTW want to have zhwiki yearly-aggregate top 10 articles viewed in 2017.
Closed, ResolvedPublic

Description

What's requested:
Wikimedia Taiwan would like to request the year-aggregate(2017/1/1-12/31) top 10 most viewed pages on Chinese Wikipedia.

Why it's requested:
The statistics data related to Chinese Wikipedia is always hot topic to post on social media. It would be a nice piece of material if we can get the raw data and publish it in our Facebook live-program for making it viral.

When it's requested:

As soon as possible, it would be the best if that could be done before 2018/1/7 15:59 (UTC).

Other helpful information:

The live video WMTW facebook had last year (in Chinese, login FB needed), based on T154434

Event Timeline

Reke created this task.Jan 2 2018, 10:49 AM
Restricted Application added subscribers: Cosine02, Aklapper. · View Herald TranscriptJan 2 2018, 10:49 AM

@Tbayer : Would you mind help us again this year?

Reke added a comment.Jan 2 2018, 11:50 AM

Yes, thank you, that's a useful link, but I'm not sure that I can get the annual data by it or not. If some articles were hit in one month but cool down soon, could I get it's viewing count during cold month?

Tbayer updated the task description. (Show Details)
Tbayer closed this task as Resolved.Jan 4 2018, 8:01 AM
Tbayer claimed this task.

Below is the result, in the same format as last year (with the same caveats, e.g. it includes all namespaces, but it should be enough information for you to generate the actual top 10 articles list, by restricting to mainspace and also removing the entry for the minus sign page, which does not correspond to real views for that page).

SELECT CONCAT('https://zh.wikipedia.org/wiki/',page_title), SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE
   year = 2017
   AND project = 'zh.wikipedia'
   AND agent_type = 'user'
GROUP BY page_title
ORDER BY views DESC LIMIT 100;

https://zh.wikipedia.org/wiki/Wikipedia:首页 81781547
https://zh.wikipedia.org/wiki/Special:搜索 31819859
https://zh.wikipedia.org/wiki/- 17248875
https://zh.wikipedia.org/wiki/Wiki 9519196
https://zh.wikipedia.org/wiki/Wikipedia:分类索引 7962763
https://zh.wikipedia.org/wiki/Wikipedia:关于 7804979
https://zh.wikipedia.org/wiki/Wikipedia:分類索引 7726821
https://zh.wikipedia.org/wiki/維基媒體基金會 7606514
https://zh.wikipedia.org/wiki/Category:交通 7602289
https://zh.wikipedia.org/wiki/Category:交通组织 7596516
https://zh.wikipedia.org/wiki/Category:鐵路公司 7595696
https://zh.wikipedia.org/wiki/Category:路面電車公司 7595029
https://zh.wikipedia.org/wiki/Category:東京都交通局 7594617
https://zh.wikipedia.org/wiki/Category:東京都交通局路線模板 7594098
https://zh.wikipedia.org/wiki/Template:荒川線 7593737
https://zh.wikipedia.org/wiki/Special:链接搜索 6855227
https://zh.wikipedia.org/wiki/三生三世十里桃花_(电视剧) 4255793
https://zh.wikipedia.org/wiki/孤單又燦爛的神-鬼怪 4151930
https://zh.wikipedia.org/wiki/植劇場 3340421
https://zh.wikipedia.org/wiki/楚乔传 3030982
https://zh.wikipedia.org/wiki/防彈少年團 2864292
https://zh.wikipedia.org/wiki/TWICE 2625939
https://zh.wikipedia.org/wiki/Running_Man 2490787
https://zh.wikipedia.org/wiki/Special:Search 2466862
https://zh.wikipedia.org/wiki/郭文贵 2343382
https://zh.wikipedia.org/wiki/一家人 2284846
https://zh.wikipedia.org/wiki/习近平 2184497
https://zh.wikipedia.org/wiki/通靈少女_(電視劇) 2170936
https://zh.wikipedia.org/wiki/Special:监视列表 2132276
https://zh.wikipedia.org/wiki/那年花開月正圓 2109048
https://zh.wikipedia.org/wiki/射雕英雄传_(2017年电视剧) 2091843
https://zh.wikipedia.org/wiki/2017年Running_Man節目列表 2079085
https://zh.wikipedia.org/wiki/大力女子都奉順 2043132
https://zh.wikipedia.org/wiki/六四事件 2038266
https://zh.wikipedia.org/wiki/迪丽热巴 2037004
https://zh.wikipedia.org/wiki/歌手2017 1855932
https://zh.wikipedia.org/wiki/金正恩 1850388
https://zh.wikipedia.org/wiki/綜藝玩很大 1846060
https://zh.wikipedia.org/wiki/使徒行者2 1841232
https://zh.wikipedia.org/wiki/Running_Man節目列表 1753379
https://zh.wikipedia.org/wiki/Special:用户登录 1748578
https://zh.wikipedia.org/wiki/Wanna_One 1734988
https://zh.wikipedia.org/wiki/極品絕配 rELOO1718498540b8
https://zh.wikipedia.org/wiki/當你沉睡時 1631261
https://zh.wikipedia.org/wiki/锦绣未央 1619771
https://zh.wikipedia.org/wiki/甘味人生 1616056
https://zh.wikipedia.org/wiki/比特币 1600096
https://zh.wikipedia.org/wiki/藍色海洋的傳說 1569931
https://zh.wikipedia.org/wiki/刘晓波 1528097
https://zh.wikipedia.org/wiki/名偵探柯南動畫集數列表 1526143
https://zh.wikipedia.org/wiki/孔劉 1508495
https://zh.wikipedia.org/wiki/三流之路 1496622
https://zh.wikipedia.org/wiki/EXO 1495521
https://zh.wikipedia.org/wiki/進擊的巨人 1439025
https://zh.wikipedia.org/wiki/稍息立正我愛你 1438632
https://zh.wikipedia.org/wiki/鹿晗 1352242
https://zh.wikipedia.org/wiki/杨幂 1349636
https://zh.wikipedia.org/wiki/趙麗穎 1326042
https://zh.wikipedia.org/wiki/首页 1309289
https://zh.wikipedia.org/wiki/薛之谦 1307942
https://zh.wikipedia.org/wiki/吳亦凡 1297323
https://zh.wikipedia.org/wiki/同盟 1290224
https://zh.wikipedia.org/wiki/新垣結衣 1285653
https://zh.wikipedia.org/wiki/草榴社区 1273197
https://zh.wikipedia.org/wiki/PRODUCE_101_(第二季) 1259661
https://zh.wikipedia.org/wiki/賭城群英會 1221676
https://zh.wikipedia.org/wiki/臺灣 1210771
https://zh.wikipedia.org/wiki/焦急的羅曼史 1209701
https://zh.wikipedia.org/wiki/誇世代 1203299
https://zh.wikipedia.org/wiki/中国共产党第十九次全国代表大会 1194744
https://zh.wikipedia.org/wiki/朴炯植 1167889
https://zh.wikipedia.org/wiki/择天记_(电视剧) 1166703
https://zh.wikipedia.org/wiki/瑯琊榜_(電視劇) 1154464
https://zh.wikipedia.org/wiki/河伯的新娘_2017 1148481
https://zh.wikipedia.org/wiki/火影忍者疾風傳動畫集數列表 1148450
https://zh.wikipedia.org/wiki/火影忍者 1142517
https://zh.wikipedia.org/wiki/溏心風暴3 1141838
https://zh.wikipedia.org/wiki/2017年夏季世界大學運動會 1139626
https://zh.wikipedia.org/wiki/江泽民 1131651
https://zh.wikipedia.org/wiki/奇怪的搭檔 1130471
https://zh.wikipedia.org/wiki/AV女優列表 1129356
https://zh.wikipedia.org/wiki/浪漫醫生金師傅 1126524
https://zh.wikipedia.org/wiki/中華民國 1124189
https://zh.wikipedia.org/wiki/三上悠亞 1121516
https://zh.wikipedia.org/wiki/ONE_PIECE 1115454
https://zh.wikipedia.org/wiki/刀劍神域 1108627
https://zh.wikipedia.org/wiki/被告人_(韓國電視劇) 1106608
https://zh.wikipedia.org/wiki/麻醉風暴2 1105105
https://zh.wikipedia.org/wiki/中华人民共和国 1098579
https://zh.wikipedia.org/wiki/認識的哥哥 1076289
https://zh.wikipedia.org/wiki/Voice 1073743
https://zh.wikipedia.org/wiki/只為你停留 1066255
https://zh.wikipedia.org/wiki/隧道_(電視劇) 1042539
https://zh.wikipedia.org/wiki/我的愛情不平凡 1036910
https://zh.wikipedia.org/wiki/踩過界 1034821
https://zh.wikipedia.org/wiki/降魔的 1027083
https://zh.wikipedia.org/wiki/周杰倫 1026996
https://zh.wikipedia.org/wiki/不懂撒嬌的女人 1025694
https://zh.wikipedia.org/wiki/宋仲基 1023405
https://zh.wikipedia.org/wiki/幸福來了 1023101
100 rows selected (2932.635 seconds)

(NB: For convenience I included the link to the desktop version for each page, but the numbers refer to the aggregate pageviews for desktop, mobile web and apps.)

Yes, but as @Reke pointed out, this doesn't directly yield the yearly top 10. T154446: Provide a yearly "Data type" option for topviews was filed on this occasion a year ago, but is still open (CC @MusikAnimal ).

Shizhao moved this task from Backlog to Closed on the Chinese-Sites board.Jan 5 2018, 2:10 AM
Milimetric added a subscriber: Milimetric.EditedDec 12 2018, 9:24 PM

EDIT: Made a bunch of mistakes in my query here, fixed in T183903#4824489

Restricted Application added a project: Product-Analytics. · View Herald TranscriptDec 12 2018, 9:24 PM
MusikAnimal added a comment.EditedDec 14 2018, 12:35 AM

quickly adding an example for all wikis, in case it's needed in the future (I haven't tested it but I'll edit it if it breaks and you yell at me on IRC :)):

 SELECT CONCAT('https://', project, '.org/wiki/', page_title),
        SUM(view_count) AS views
   FROM wmf.pageview_hourly
  WHERE year = 2018
    AND project = 'zh.wikipedia'
    AND agent_type = 'user'
  GROUP BY project, page_title
  ORDER BY project, views DESC
  LIMIT 100
;

I'm guessing the project = 'zh.wikipedia' shouldn't be there? :)

The ORDER BY project also seems to not work, for whatever reason: Line 8:9 Invalid table alias or column reference 'project': (possible column names are: _c0, views) (state=42000,code=10004)

This is running without errors:

SELECT CONCAT('https://', project, '.org/wiki/', page_title), SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE
   year = 2018
   AND agent_type = 'user'
GROUP BY project, page_title
ORDER BY views DESC LIMIT 100;

I'm running this now. Who knows when/if it will finish! ;)

Milimetric added a comment.EditedDec 14 2018, 6:46 PM

I'm sorry! I messed up, and completely forgot this is a little trickier than a simple query. Here's the real query that's doing this in production for the daily and monthly tops: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/daily/pageview_top_articles.hql

And here's a version that would work in theory for yearly. However, I'm pretty sure this will need more resources than we have available on the cluster, and therefore won't ever finish. The 45 minute run was probably because of that limit 100 statement. But I could be wrong, may be worth a shot. I think to improve the performance, you could either join to a table of statistics compiled per-wiki to weed out pages with low view_counts (this would dramatically reduce the number of records in the counted subquery below), or use a different approach of iterating over the data and keeping bloom filter counts:

   WITH counted AS (
         SELECT project,
                page_title,
                SUM(view_count) as views
           FROM pageview_hourly
          WHERE year=2018
            AND agent_type = 'user'
            AND page_title != '-'
            AND month=12
            AND day=12
            AND hour=12
          GROUP BY project, page_title, year, month, day, hour
        ),

        ranked AS (
         SELECT project,
                page_title,
                views,
                rank() OVER (PARTITION BY project ORDER BY views DESC) as ranking
           FROM counted
        )

 SELECT project,
        CONCAT('https://', project, '.org/wiki/', page_title) AS article,
        views
   FROM ranked
  WHERE ranking <= 100
;

(note you should remove the month/day/hour filters if running this for real. For reference it took 90 seconds on 1 hour which if it scaled linearly (which it doesn't) would imply 10 days of processing (definitely do this on a screen if you plan on doing it))

...

And here's a version that would work in theory for yearly. However, I'm pretty sure this will need more resources than we have available on the cluster, and therefore won't ever finish. The 45 minute run was probably because of that limit 100 statement. But I could be wrong, may be worth a shot. I think to improve the performance, you could either join to a table of statistics compiled per-wiki to weed out pages with low view_counts (this would dramatically reduce the number of records in the counted subquery below), or use a different approach of iterating over the data and keeping bloom filter counts:

...

(note you should remove the month/day/hour filters if running this for real. For reference it took 90 seconds on 1 hour which if it scaled linearly (which it doesn't) would imply 10 days of processing (definitely do this on a screen if you plan on doing it))

Actually the query worked just fine when applied to (almost) an entire year's worth of data - it took less than an hour to complete even while using the nice queue. This was not too surprising, considering that 1. processing one year of pageview_hourly data is not too demanding per se (judging from previous examples) and 2. the complexity of the second and third query seems to mainly depend on the number of pages present in the dataset, not the number of rows that the first query has to process.

I modified and expanded the above query a bit (P7945 ), in particular with the following:

  • including the percentage of mobile view for that pages (considering that it is an often used criterion to weed out anomalies)
  • including both the page's name and its (desktop) URL, for easier processing
  • ranking pages only by mainspace views (i.e. views tagged with namespace_id =0), to avoid special pages etc. clogging the results
  • but still reporting all views for each page (including those where the namespace wasn't logged, which still happens due to some bugs)

The result is a TSV file with several MB in size. (It still needs some manual vetting for each project to weed out anomalies, see e.g. @MusikAnimal 's notes at T211827#4838895 .)
Below is a sample excerpt illustrating the format (top 10 pages for barwiki in November 2018).

@MusikAnimal, would you like to run this sometime in the next few days and distribute the result? (If not, I might be able to do it myself later this week.)

projectpagedesktopurlviews Nov 2018mobile_percentage
bar.wikipediaHoamseitnhttps://bar.wikipedia.org/wiki/Hoamseitn602637.76
bar.wikipediaWikipediahttps://bar.wikipedia.org/wiki/Wikipedia43080.93
bar.wikipediaBoarischhttps://bar.wikipedia.org/wiki/Boarisch308025.58
bar.wikipediaMingahttps://bar.wikipedia.org/wiki/Minga275658.67
bar.wikipediaInternational Standard Book Numberhttps://bar.wikipedia.org/wiki/International_Standard_Book_Number21240.89
bar.wikipediaThea Gottschalkhttps://bar.wikipedia.org/wiki/Thea_Gottschalk158664.44
bar.wikipediaEnzyklopädiehttps://bar.wikipedia.org/wiki/Enzyklopädie14930.67
bar.wikipediaSteanhttps://bar.wikipedia.org/wiki/Stean12890.47
bar.wikipedia1956https://bar.wikipedia.org/wiki/195612310.32
bar.wikipediaFinnlandhttps://bar.wikipedia.org/wiki/Finnland10971.09
bar.wikipediaEnglische Sprochhttps://bar.wikipedia.org/wiki/Englische_Sproch10821.39

Query for the entire year:

1SET mapred.job.queue.name=nice;
2
3WITH counted AS ( -- adapted from https://phabricator.wikimedia.org/T183903#4824489 :
4 SELECT project,
5 page_title,
6 SUM(view_count) as views, -- Some mainspace views are wrongly logged without namespace_id (NULL)
7 SUM(IF(namespace_id = 0,view_count,0)) AS ns0views,
8 SUM(IF(access_method != 'desktop', view_count, 0))/SUM(view_count) AS mobile_ratio
9 FROM wmf.pageview_hourly
10 WHERE year=2018
11 AND agent_type = 'user'
12 AND page_title != '-'
13 GROUP BY project, page_title
14 HAVING ns0views >= 100 -- Some small projects may have very low traffic pages in the top X
15 ),
16
17 ns0ranked AS (
18 SELECT project,
19 page_title,
20 views,
21 ns0views,
22 mobile_ratio,
23 rank() OVER (PARTITION BY project ORDER BY ns0views DESC) as ranking
24 FROM counted
25 )
26
27 SELECT project,
28 REGEXP_REPLACE(page_title,'_',' ') AS page,
29 CONCAT('https://', project, '.org/wiki/', page_title) AS desktopurl,
30 views,
31 ROUND(100 * mobile_ratio, 2) AS mobile_percentage
32 FROM ns0ranked
33 WHERE ranking <= 150 -- In case ranking by all views differs a bit (cf. above)
34 ORDER BY project ASC, views DESC LIMIT 1000000