[REQUEST] WMTW want to have zhwiki yearly-aggregate top 10 articles viewed in 2017.
Closed, Resolved · Public

Authored By

	Reke
	Jan 2 2018, 10:49 AM

Description

What's requested:
Wikimedia Taiwan would like to request the yearly aggregate (2017/1/1 to 2017/12/31) top 10 most viewed pages on Chinese Wikipedia.

Why it's requested:
Statistics about Chinese Wikipedia are always a hot topic on social media. The raw data would be a nice piece of material to present in our Facebook live program, and could help it go viral.

When it's requested:

As soon as possible; ideally before 2018/1/7 15:59 (UTC).

Other helpful information:

Last year's live video on the WMTW Facebook page (in Chinese, Facebook login needed), which was based on T154434.

Related Objects

Mentioned In: T211827: Request: Top articles of 2018 on all Wikipedias
T192360: WMFI want to have fiwiki aggregates top 5000 articles viewed in 2018,2017 and all-time-in-the-database
Mentioned Here: T211827: Request: Top articles of 2018 on all Wikipedias
P7945 Query: top 100+ viewed mainspace pages per wiki in 2018
T154446: Provide a yearly "Data type" option for topviews
T154434: [REQUEST] WMTW facebook page want to have zhwiki yearly-aggregate top 10 articles viewed in 2016.

Event Timeline

Reke created this task. Jan 2 2018, 10:49 AM

Restricted Application added subscribers: Stang, Aklapper. Jan 2 2018, 10:49 AM

@Tbayer: Would you mind helping us again this year?

Framawiki added a project: WMF-General-or-Unknown. Jan 2 2018, 10:55 AM

Framawiki subscribed.

Useful link, that shows per-month stats: https://tools.wmflabs.org/topviews/?project=zh.wikipedia.org&platform=all-access&date=last-month&excludes=

In T183903#3866690, @Framawiki wrote:

Useful link, that shows per-month stats: https://tools.wmflabs.org/topviews/?project=zh.wikipedia.org&platform=all-access&date=last-month&excludes=

Yes, thank you, that's a useful link, but I'm not sure whether I can get the annual data from it. If an article was popular in one month but cooled down soon after, could I still get its view counts for the quieter months?
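To make the concern concrete, here is a minimal Python sketch with invented numbers (the page names A, B, C, the counts, and the helper names are all hypothetical; this is not how the topviews tool works internally). It compares a yearly total built only from monthly top-N lists with the exact yearly total. A page that spikes in one month loses its views from the quieter months, and a steadily viewed page that narrowly misses a monthly list is undercounted.

```python
from collections import Counter

# Hypothetical monthly view counts; every number here is invented.
monthly_views = {
    "2017-01": {"A": 1200, "B": 500, "C": 480},  # A spikes in January
    "2017-02": {"A": 100, "B": 510, "C": 490},   # ...then cools down
    "2017-03": {"A": 100, "B": 505, "C": 495},
}

def monthly_top(views, n):
    """Return only the n most viewed pages of one month."""
    ranked = sorted(views.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])

def yearly_from_monthly_tops(monthly, n):
    """Sum views using only what each monthly top-n list exposes."""
    total = Counter()
    for views in monthly.values():
        total.update(monthly_top(views, n))
    return total

def yearly_exact(monthly):
    """Sum views over all pages and all months."""
    total = Counter()
    for views in monthly.values():
        total.update(views)
    return total

approx = yearly_from_monthly_tops(monthly_views, n=2)
exact = yearly_exact(monthly_views)
approx_order = [page for page, _ in approx.most_common()]
exact_order = [page for page, _ in exact.most_common()]
print(approx_order, exact_order)
```

With these invented numbers the two rankings disagree, which is why a query over the full year of data is needed for a reliable yearly top 10.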

Aklapper edited projects, added Reader-research; removed WMF-General-or-Unknown. Jan 2 2018, 12:00 PM

Shizhao added projects: Wikimedia Taiwan, Chinese-Sites. Jan 3 2018, 2:06 AM

Tbayer edited projects, added Reading-analysis; removed Reader-research. Jan 4 2018, 7:53 AM

Tbayer updated the task description.

Below is the result, in the same format as last year (with the same caveats, e.g. it includes all namespaces, but it should be enough information for you to generate the actual top 10 articles list, by restricting to mainspace and also removing the entry for the minus sign page, which does not correspond to real views for that page).

SELECT CONCAT('https://zh.wikipedia.org/wiki/',page_title), SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE
   year = 2017
   AND project = 'zh.wikipedia'
   AND agent_type = 'user'
GROUP BY page_title
ORDER BY views DESC LIMIT 100;

(NB: For convenience I included the link to the desktop version for each page, but the numbers refer to the aggregate pageviews for desktop, mobile web and apps.)
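The post-processing described above (restrict to mainspace, drop the minus sign page, keep the top 10) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the sample rows are invented, the function name is hypothetical, and the namespace prefix list is an incomplete guess, since zhwiki also uses localized namespace names that would have to be added.

```python
URL_PREFIX = "https://zh.wikipedia.org/wiki/"

# Assumed, incomplete list of non-mainspace prefixes (localized names omitted).
NON_MAIN_PREFIXES = (
    "Special:", "Wikipedia:", "File:", "Category:",
    "Template:", "User:", "Talk:", "Help:", "Portal:",
)

def top_articles(rows, n=10):
    """Filter (url, views) rows down to the n most viewed mainspace pages."""
    kept = []
    for url, views in rows:
        title = url[len(URL_PREFIX):] if url.startswith(URL_PREFIX) else url
        # '-' does not correspond to real views of a page; skip it.
        if title == "-" or title.startswith(NON_MAIN_PREFIXES):
            continue
        kept.append((title, views))
    kept.sort(key=lambda row: row[1], reverse=True)
    return kept[:n]

# Invented sample rows in the same two-column shape as the query output.
sample = [
    (URL_PREFIX + "Special:Search", 9000),
    (URL_PREFIX + "-", 8000),
    (URL_PREFIX + "Example_article_1", 700),
    (URL_PREFIX + "Example_article_2", 900),
]
result = top_articles(sample, n=10)
print(result)
```

A title that merely contains a colon but does not start with a known namespace prefix is kept, so the prefix list matters and the output still deserves a manual check.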

In T183903#3866690, @Framawiki wrote:

Useful link, that shows per-month stats: https://tools.wmflabs.org/topviews/?project=zh.wikipedia.org&platform=all-access&date=last-month&excludes=

Yes, but as @Reke pointed out, this doesn't directly yield the yearly top 10. T154446: Provide a yearly "Data type" option for topviews was filed on this occasion a year ago, but is still open (CC @MusikAnimal).

Shizhao moved this task from Backlog to Closed on the Chinese-Sites board. Jan 5 2018, 2:10 AM

Tbayer mentioned this in T192360: WMFI want to have fiwiki aggregates top 5000 articles viewed in 2018,2017 and all-time-in-the-database. Apr 17 2018, 5:27 PM

EDIT: Made a bunch of mistakes in my query here, fixed in T183903#4824489

Restricted Application added a project: Product-Analytics. Dec 12 2018, 9:24 PM

Quiddity mentioned this in T211827: Request: Top articles of 2018 on all Wikipedias. Dec 12 2018, 10:02 PM

In T183903#4818567, @Milimetric wrote:
quickly adding an example for all wikis, in case it's needed in the future (I haven't tested it but I'll edit it if it breaks and you yell at me on IRC :)):
 SELECT CONCAT('https://', project, '.org/wiki/', page_title),
        SUM(view_count) AS views
   FROM wmf.pageview_hourly
  WHERE year = 2018
    AND project = 'zh.wikipedia'
    AND agent_type = 'user'
  GROUP BY project, page_title
  ORDER BY project, views DESC
  LIMIT 100
;

I'm guessing the project = 'zh.wikipedia' shouldn't be there? :)

The ORDER BY project also seems to not work, for whatever reason: Line 8:9 Invalid table alias or column reference 'project': (possible column names are: _c0, views) (state=42000,code=10004)

This is running without errors:

SELECT CONCAT('https://', project, '.org/wiki/', page_title), SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE
   year = 2018
   AND agent_type = 'user'
GROUP BY project, page_title
ORDER BY views DESC LIMIT 100;

I'm running this now. Who knows when/if it will finish! ;)

I'm sorry! I messed up, and completely forgot this is a little trickier than a simple query. Here's the real query that's doing this in production for the daily and monthly tops: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/daily/pageview_top_articles.hql

And here's a version that would work in theory for yearly. However, I'm pretty sure this will need more resources than we have available on the cluster, and therefore won't ever finish. The 45 minute run was probably because of that limit 100 statement. But I could be wrong, may be worth a shot. I think to improve the performance, you could either join to a table of statistics compiled per-wiki to weed out pages with low view_counts (this would dramatically reduce the number of records in the counted subquery below), or use a different approach of iterating over the data and keeping bloom filter counts:

   WITH counted AS (
         SELECT project,
                page_title,
                SUM(view_count) as views
           FROM pageview_hourly
          WHERE year=2018
            AND agent_type = 'user'
            AND page_title != '-'
            AND month=12
            AND day=12
            AND hour=12
          -- NB: for a true yearly total, group by project and page_title only
          GROUP BY project, page_title, year, month, day, hour
        ),

        ranked AS (
         SELECT project,
                page_title,
                views,
                rank() OVER (PARTITION BY project ORDER BY views DESC) as ranking
           FROM counted
        )

 SELECT project,
        CONCAT('https://', project, '.org/wiki/', page_title) AS article,
        views
   FROM ranked
  WHERE ranking <= 100
;

(Note: you should remove the month/day/hour filters if running this for real. For reference, it took 90 seconds on 1 hour of data, which, if it scaled linearly (it doesn't), would imply roughly 10 days of processing. Definitely do this in a screen session if you plan on doing it.)
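The back-of-the-envelope estimate can be checked with a few lines of Python. The only inputs are the 90 seconds per hour of data mentioned in the comment and a 365-day year; the linear scaling is the assumption that the comment itself says does not hold.

```python
SECONDS_PER_HOUR_OF_DATA = 90   # measured runtime for one hourly partition
HOURS_IN_YEAR = 365 * 24        # 8760 hourly partitions in 2018
SECONDS_PER_DAY = 24 * 60 * 60

# Naive linear extrapolation from one hour of data to the whole year.
total_seconds = SECONDS_PER_HOUR_OF_DATA * HOURS_IN_YEAR
estimated_days = total_seconds / SECONDS_PER_DAY
print(estimated_days)
```

The result is a little over nine days, consistent with the "10 days" order of magnitude quoted in the comment.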

In T183903#4824489, @Milimetric wrote:

...

And here's a version that would work in theory for yearly. However, I'm pretty sure this will need more resources than we have available on the cluster, and therefore won't ever finish. The 45 minute run was probably because of that limit 100 statement. But I could be wrong, may be worth a shot. I think to improve the performance, you could either join to a table of statistics compiled per-wiki to weed out pages with low view_counts (this would dramatically reduce the number of records in the counted subquery below), or use a different approach of iterating over the data and keeping bloom filter counts:

...

(Note: you should remove the month/day/hour filters if running this for real. For reference, it took 90 seconds on 1 hour of data, which, if it scaled linearly (it doesn't), would imply roughly 10 days of processing. Definitely do this in a screen session if you plan on doing it.)

Actually the query worked just fine when applied to (almost) an entire year's worth of data - it took less than an hour to complete even while using the nice queue. This was not too surprising, considering that 1. processing one year of pageview_hourly data is not too demanding per se (judging from previous examples) and 2. the complexity of the second and third query seems to mainly depend on the number of pages present in the dataset, not the number of rows that the first query has to process.

I modified and expanded the above query a bit (P7945), in particular with the following:

including the percentage of mobile views for each page (considering that it is an often used criterion to weed out anomalies)
including both the page's name and its (desktop) URL, for easier processing
ranking pages only by mainspace views (i.e. views tagged with namespace_id = 0), to avoid special pages etc. clogging the results
but still reporting all views for each page (including those where the namespace wasn't logged, which still happens due to some bugs)
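The logic behind these changes can be illustrated on a handful of rows with a small Python sketch. It is not a replacement for the Hive query: the function name, the sample rows, and the counts are invented, and unlike SQL's rank() it simply cuts each project's list at a fixed position instead of keeping ties.

```python
from collections import defaultdict

def top_pages(rows, per_project=150, min_ns0_views=100):
    """rows: (project, page_title, namespace_id, access_method, view_count)."""
    totals = defaultdict(lambda: {"views": 0, "ns0": 0, "mobile": 0})
    for project, title, namespace_id, access_method, count in rows:
        if title == "-":
            continue
        t = totals[(project, title)]
        t["views"] += count              # all views, even with namespace NULL
        if namespace_id == 0:
            t["ns0"] += count            # views tagged as mainspace
        if access_method != "desktop":
            t["mobile"] += count         # mobile web and apps

    by_project = defaultdict(list)
    for (project, title), t in totals.items():
        if t["ns0"] >= min_ns0_views:    # drop very low traffic pages
            by_project[project].append((title, t))

    result = []
    for project in sorted(by_project):
        # Rank by mainspace views, then report sorted by all views.
        ranked = sorted(by_project[project],
                        key=lambda item: item[1]["ns0"], reverse=True)
        ranked = ranked[:per_project]
        ranked.sort(key=lambda item: item[1]["views"], reverse=True)
        for title, t in ranked:
            result.append({
                "project": project,
                "page": title.replace("_", " "),
                "desktopurl": f"https://{project}.org/wiki/{title}",
                "views": t["views"],
                "mobile_percentage": round(100 * t["mobile"] / t["views"], 2),
            })
    return result

# Invented rows: one mainspace page, one special page, one '-' entry.
sample_rows = [
    ("bar.wikipedia", "Example_page", 0, "desktop", 40),
    ("bar.wikipedia", "Example_page", 0, "mobile web", 50),
    ("bar.wikipedia", "Example_page", None, "desktop", 10),
    ("bar.wikipedia", "Spezial:Example", -1, "desktop", 500),
    ("bar.wikipedia", "-", 0, "desktop", 999),
]
report = top_pages(sample_rows, per_project=150, min_ns0_views=50)
print(report)
```

The special page is excluded because it has no mainspace views, while the views logged without a namespace still count toward the reported total of the mainspace page.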

The result is a TSV file several MB in size. (It still needs some manual vetting for each project to weed out anomalies, see e.g. @MusikAnimal's notes at T211827#4838895.)
Below is a sample excerpt illustrating the format (top 10 pages for barwiki in November 2018).

@MusikAnimal, would you like to run this sometime in the next few days and distribute the result? (If not, I might be able to do it myself later this week.)

project	page	desktopurl	views Nov 2018	mobile_percentage
bar.wikipedia	Hoamseitn	https://bar.wikipedia.org/wiki/Hoamseitn	60263	7.76
bar.wikipedia	Wikipedia	https://bar.wikipedia.org/wiki/Wikipedia	4308	0.93
bar.wikipedia	Boarisch	https://bar.wikipedia.org/wiki/Boarisch	3080	25.58
bar.wikipedia	Minga	https://bar.wikipedia.org/wiki/Minga	2756	58.67
bar.wikipedia	International Standard Book Number	https://bar.wikipedia.org/wiki/International_Standard_Book_Number	2124	0.89
bar.wikipedia	Thea Gottschalk	https://bar.wikipedia.org/wiki/Thea_Gottschalk	1586	64.44
bar.wikipedia	Enzyklopädie	https://bar.wikipedia.org/wiki/Enzyklopädie	1493	0.67
bar.wikipedia	Stean	https://bar.wikipedia.org/wiki/Stean	1289	0.47
bar.wikipedia	1956	https://bar.wikipedia.org/wiki/1956	1231	0.32
bar.wikipedia	Finnland	https://bar.wikipedia.org/wiki/Finnland	1097	1.09
bar.wikipedia	Englische Sproch	https://bar.wikipedia.org/wiki/Englische_Sproch	1082	1.39

Query for the entire year:

P7945 Query: top 100+ viewed mainspace pages per wiki in 2018

SET mapred.job.queue.name=nice;

WITH counted AS (
  -- adapted from https://phabricator.wikimedia.org/T183903#4824489 :
  SELECT project,
         page_title,
         SUM(view_count) as views, -- Some mainspace views are wrongly logged without namespace_id (NULL)
         SUM(IF(namespace_id = 0,view_count,0)) AS ns0views,
         SUM(IF(access_method != 'desktop', view_count, 0))/SUM(view_count) AS mobile_ratio
    FROM wmf.pageview_hourly
   WHERE year=2018
     AND agent_type = 'user'
     AND page_title != '-'
   GROUP BY project, page_title
  HAVING ns0views >= 100 -- Some small projects may have very low traffic pages in the top X
),

ns0ranked AS (
  SELECT project,
         page_title,
         views,
         ns0views,
         mobile_ratio,
         rank() OVER (PARTITION BY project ORDER BY ns0views DESC) as ranking
    FROM counted
)

SELECT project,
       REGEXP_REPLACE(page_title,'_',' ') AS page,
       CONCAT('https://', project, '.org/wiki/', page_title) AS desktopurl,
       views,
       ROUND(100 * mobile_ratio, 2) AS mobile_percentage
  FROM ns0ranked
 WHERE ranking <= 150 -- In case ranking by all views differs a bit (cf. above)
 ORDER BY project ASC, views DESC
 LIMIT 1000000

dungodung subscribed. Jan 2 2019, 2:49 PM

Stang unsubscribed. Nov 14 2021, 12:04 AM
