
WMFI: I want fiwiki aggregates of the top 5000 articles viewed in 2018, 2017, and all-time-in-the-database
Closed, Resolved, Public

Description

What's requested:

  • the 2017 aggregate (2017/1/1 to 12/31): the top 5000 most viewed articles on Finnish Wikipedia.
  • the all-time aggregate (mid-2015 to now): the top 5000 most viewed articles on Finnish Wikipedia (i.e. from 2015 to the current date).
  • the 2018 aggregate (2018/1/1 to now): the top 5000 most viewed articles on Finnish Wikipedia (i.e. the current year).
  • the 2007 to 2016 aggregate: the top 5000 most viewed articles on Finnish Wikipedia (e.g. from Pagecounts-raw).

Data should be in machine-readable form and include the page_title, page_id and read counts.

Why it's requested:
There was an information request from Wikimedia Suomi for a news article related to the Wikipedia exhibition and WikiGap.

When it's requested:
As soon as possible; it would be best if this could be done before April 21.

Event Timeline

I'm happy to run a query for the first two items, similar to what was done for T183903 and T154434. As noted there, the result may need some manual cleanup, in particular if you want to restrict it to mainspace pages. (For the same reason, 5000 may be a bit much - but technically, regarding the queries, it's not a problem.) The second request will need to cover the timespan from mid-2015 to now only, because that's where we have data available in an easily queryable format. I won't have time to work on request #4 in the foreseeable future. Regarding request #3, can we be more specific about the desired timespan?
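For reference, here is a rough sketch of how such a query could restrict to mainspace up front. It assumes wmf.pageview_hourly exposes a populated namespace_id field for the whole period in question (namespace 0 being the article namespace); if it does not, the restriction has to happen as manual cleanup on the result, as noted above.

SELECT page_title, page_id, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE
   year = 2017
   AND project = 'fi.wikipedia'
   AND agent_type = 'user'
   AND namespace_id = 0   -- assumption: namespace_id is present and populated; 0 is the article (main) namespace
GROUP BY page_title, page_id
ORDER BY views DESC LIMIT 5000;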

Hi, I am perfectly happy to do the manual cleanup :) It is 5000 because I need additional data in case I need to pick out specific topics, like the most read scientists, the most read articles about women, etc.

For request #3, the whole timespan of the dataset is fine, but I would like to get the specific start and end dates along with the result.

For #4 it is OK not to do it, but I would like some insight into the problems with the query, e.g. I may need to explain why it is hard to do.

Here is the result for the first request; I included the full (desktop) URL for convenience.

Since you said you needed both the page name and the page ID, I grouped the result by both. Note that this creates split entries for a small number of pages (all non-mainspace as far as I can see) which, due to shortcomings of the data collection, are sometimes recorded without a valid page ID in the pageview data; most notably the main page.

Data source:

SELECT page_title, SUM(view_count) AS views,
page_id, 
CONCAT('https://fi.wikipedia.org/wiki/',page_title) AS desktop_url
FROM wmf.pageview_hourly
WHERE
   year = 2017                      -- full calendar year 2017
   AND project = 'fi.wikipedia'
   AND agent_type = 'user'          -- human traffic only (excludes spider/bot requests)
GROUP BY page_title, page_id
ORDER BY views DESC LIMIT 5000;
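If the split entries described above get in the way, a variant along these lines merges them by grouping on the title only and keeping one non-null page ID per title. This is only a sketch; the results attached here were generated with the query above, not with this one.

SELECT page_title, SUM(view_count) AS views,
MAX(page_id) AS page_id,   -- aggregates ignore NULLs, so one valid page_id is kept per title where available
CONCAT('https://fi.wikipedia.org/wiki/',page_title) AS desktop_url
FROM wmf.pageview_hourly
WHERE
   year = 2017
   AND project = 'fi.wikipedia'
   AND agent_type = 'user'
GROUP BY page_title
ORDER BY views DESC LIMIT 5000;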

Here is the result for the second request. The exact timespan is May 1, 2015 to March 31, 2018.
I have no idea how @Nikerabbit's user page became the fifth most popular page during that time.

Data source:

SELECT page_title, SUM(view_count) AS views,
page_id, 
CONCAT('https://fi.wikipedia.org/wiki/',page_title) AS desktop_url
FROM wmf.pageview_hourly
WHERE
   -- May 1, 2015 through March 31, 2018; this simplified condition works because
   -- the table only contains data from May 2015 up to the query date
   ((year < 2018 OR month <= 3) AND (year > 2015 OR month >= 5))
   AND project = 'fi.wikipedia'
   AND agent_type = 'user'
GROUP BY page_title, page_id
ORDER BY views DESC LIMIT 5000;

Here is the result for the third request (covering January 1 to April 20, 2018).
For all three of the above, note that most of the views listed for "-" are artefacts rather than actual reads of https://fi.wikipedia.org/wiki/- or https://fi.wikipedia.org/wiki/Viivamerkit . (The software uses that as a fallback when the page name can't be parsed correctly for some reason; IIRC this is noted in the documentation somewhere already, but I can't find it right now.)

Data source:

SELECT page_title, SUM(view_count) AS views,
page_id, 
CONCAT('https://fi.wikipedia.org/wiki/',page_title) AS desktop_url
FROM wmf.pageview_hourly
WHERE
   -- January 1 to April 20, 2018; the day condition works because the query
   -- was run around April 20, so no later data exists in the table yet
   year = 2018 AND (month <= 3 OR day <= 20)
   AND project = 'fi.wikipedia'
   AND agent_type = 'user'
GROUP BY page_title, page_id
ORDER BY views DESC LIMIT 5000;
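To gauge how much the "-" fallback title mentioned above contributes to a given timespan before doing any cleanup, a quick check like the following can help (a sketch, using the same filters as the query above):

SELECT SUM(view_count) AS fallback_views
FROM wmf.pageview_hourly
WHERE
   year = 2018 AND (month <= 3 OR day <= 20)
   AND project = 'fi.wikipedia'
   AND agent_type = 'user'
   -- '-' is the fallback title recorded when the page name cannot be parsed
   AND page_title = '-';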
Tbayer claimed this task.

...

> For #4 it is OK not to do it, but I would like some insight into the problems with the query, e.g. I may need to explain why it is hard to do.

It's just that we don't have pageview data from before mid-2015 prepared in a unified database table or other form that would allow easy querying as above. It's entirely possible to download the relevant dataset(s) for those earlier years yourself (see e.g. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pagecounts-ez ) and process them; quite a few people have done so. But it would require more time than I, as a WMF data analyst, can justify spending on this task currently.

Thank you very much, the data was very useful and I was able to extract what I needed from the monthly Pagecounts-ez files.

The biggest problem for me was that the hourly data files were too large to download or to parse on Toollabs. The monthly files solved this. Thanks for the tip.