
Determine how to gather top-viewed article lists for use in generating ZIM files
Closed, Resolved · Public

Description

For our offline compilation prototyping, we'd like to generate a set of ZIM files, including one with the 5,000 most-viewed articles for a project in a given month and another with the 50,000 most-viewed articles.

It looks like the AQS Pageview API only gives us the top 1000 for this kind of query:

https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2017/07/all-days

What's the best way to get this information for greater numbers of articles? Can the API be updated to expose parameters for larger lists? (Is the data there in the backing tables?)

https://dumps.wikimedia.org/other/pagecounts-ez/merged/ (see the -totals files) looks like it has the pageview counts we need if we have to go the DIY route.

Event Timeline

Mholloway closed this task as Resolved. Edited Aug 3 2017, 12:03 AM
Mholloway claimed this task.

pagecounts-ez should work fine. For example, using pageviews for June 2017:

To generate a deduplicated list of pageviews by title for enwiki, reverse-sorted by pageview count (for reference and sanity-checking):

grep -E '^en\.[mz] |^en\.zero ' pagecounts-2017-06-views-ge-5-totals | awk '{arr[$2]+=$3} END {for (i in arr) {print i,arr[i]}}' | sort -t " " -nrk2 > pagecounts-2017-06-views-ge-5-totals-enwiki
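
To make the deduplication step easier to sanity-check, here is a minimal sketch of the same awk-and-sort stage run on a few made-up rows. The function name sum_views and the sample counts are illustrative only, and the rows assume the three-column shape the command above relies on (project code, title, count).

```shell
# sum_views (hypothetical name): read "project title count" lines on stdin,
# add up the counts per title, and print "title total" lines sorted by
# total, highest first.
sum_views() {
    awk '{ total[$2] += $3 } END { for (t in total) print t, total[t] }' |
        sort -t " " -k2,2nr
}

# Made-up sample rows, not real pageview data.
printf '%s\n' \
    'en.z Main_Page 100' \
    'en.m Main_Page 50' \
    'en.z Cat 30' \
    'en.m Cat 5' |
    sum_views
# prints:
# Main_Page 150
# Cat 35
```

The rows for each title collapse into a single line with the summed count, which is the shape the full command writes to pagecounts-2017-06-views-ge-5-totals-enwiki.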

And to get the first n titles from there, to feed to mwoffliner or a similar custom tool:

head -5000 pagecounts-2017-06-views-ge-5-totals-enwiki | cut -d " " -f1 > pagecounts-2017-06-views-ge-5-totals-enwiki-top-5000
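
Since the task asks for both a 5,000 and a 50,000 list, the same step could be wrapped in a small helper and looped over both sizes. This is only a sketch: top_titles is a hypothetical name, and it assumes the enwiki file generated above is in the current directory and already sorted by count, highest first.

```shell
# top_titles FILE N (hypothetical helper): print the first N titles from a
# "title count" file that is already sorted by count, highest first.
top_titles() {
    head -n "$2" "$1" | cut -d " " -f1
}

# Write one list per size, but only if the aggregated file exists.
src=pagecounts-2017-06-views-ge-5-totals-enwiki
if [ -f "$src" ]; then
    for n in 5000 50000; do
        top_titles "$src" "$n" > "$src-top-$n"
    done
fi
```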

Edit: we'll likely want to filter out Special pages (and "-") as well.
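
One possible way to do that filtering is an extra awk stage between the aggregated file and head. This is a rough sketch: filter_titles is a hypothetical name, and it assumes special pages show up with a literal "Special:" prefix in the title column, which may not cover encoded or differently cased variants.

```shell
# filter_titles (hypothetical name): read "title count" lines on stdin and
# drop the "-" entry plus any title that starts with "Special:".
filter_titles() {
    awk '$1 != "-" && $1 !~ /^Special:/'
}

# Apply it to the aggregated enwiki file before taking the top 5000 titles,
# but only if that file exists.
src=pagecounts-2017-06-views-ge-5-totals-enwiki
if [ -f "$src" ]; then
    filter_titles < "$src" | head -n 5000 | cut -d " " -f1 > "$src-top-5000"
fi
```

Filtering before head keeps the list at the full n titles instead of losing slots to entries that would be dropped later.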