
Make top pages for WP:MED articles
Closed, Resolved | Public | 5 Estimated Story Points

Description

It would be great to have a list of page views for the articles in WP:MED (~35K articles). I've put up this query, which is actually a modification of the Special:WhatLinksHere query, but it's fast and efficient.

Event Timeline

After deploying Extension:PageAssessments things will be even easier. Nice!

This is a great start. Here's the short term plan and the longer term plan.

Short
Join the results of Amir's query to pageview_hourly for a month, and write the results to another table. Publish the top 1000 pages, and the total views for the whole of WikiProject Medicine, on datasets.wikimedia.org.
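
A minimal sketch of that short-term step, not the query that was eventually run. It assumes the page list from Amir's query has been loaded into a hypothetical table wpmed_pages(page_title), that the destination is a hypothetical table wpmed_page_counts, and that pageview_hourly has project, page_title, agent_type, view_count, year and month columns. July 2016 is only an example month.

```sql
-- Hypothetical names: wpmed_pages (page list) and wpmed_page_counts (output).
-- The pageview_hourly columns and the 'en.wikipedia' project value are
-- assumptions; check them against the real schema before running.
CREATE TABLE wpmed_page_counts AS
SELECT pv.page_title,
       pv.year,
       pv.month,
       SUM(pv.view_count) AS view_count
  FROM pageview_hourly pv
  JOIN wpmed_pages p
    ON p.page_title = pv.page_title
 WHERE pv.project = 'en.wikipedia'
   AND pv.agent_type = 'user'      -- assumed way to leave out spider traffic
   AND pv.year = 2016
   AND pv.month = 7
 GROUP BY pv.page_title, pv.year, pv.month;

-- Top 1000 pages for that month, ready to publish.
SELECT page_title,
       view_count
  FROM wpmed_page_counts
 WHERE year = 2016
   AND month = 7
 ORDER BY view_count DESC
 LIMIT 1000;
```

Aggregating to one row per page per month keeps the output table small enough that both the top list and the project-wide total can be read from it without touching the hourly data again.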

Long
Add WikiProject:Medicine as a project modifier to en.wikipedia.org queries on the pageview API. Need to think this through and how it would work on projects like commons and wikidata, but basically queries could ask for /en.wikipedia.org|WikiProject:Medicine/ and get top(1000) and per-project totals. We can get data for other languages by using Wikidata inter-language links. We can assume enwiki is authoritative for this purpose, even though it's not strictly true and we'll miss some articles with this assumption. It's better than nothing and we can improve the assumption when we have better data.

Milimetric triaged this task as Medium priority. (Jul 7 2016, 4:57 PM)
Milimetric moved this task from Incoming to Operational Excellence Future on the Analytics board.

This task will be an ad hoc query to get these numbers.

There is a longer-term task, "adding top counts for wiki projects to pageview API": https://phabricator.wikimedia.org/T141010

Nuria set the point value for this task to 5. (Jul 21 2016, 5:16 PM)
Nuria edited projects, added Analytics-Kanban; removed Analytics.

Looking forward to the outcome. Andrew West does the top 5000, but we do not have a total for the entire project. We would also like similar totals for other languages.

Which articles pertain to medicine in other languages can be found through Wikidata language links.
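
One way that lookup might be sketched, assuming access to the Wikibase sitelinks table wb_items_per_site on the Wikidata database and a hypothetical table wpmed_pages(page_title) holding the enwiki WP:MED titles with underscores. It is written in MariaDB-style SQL rather than HiveQL, and joining across the two data sources is glossed over here.

```sql
-- wpmed_pages is hypothetical. Sitelink titles in ips_site_page are assumed
-- to use spaces, so underscores in the enwiki titles are converted first.
SELECT other.ips_site_id   AS wiki,
       other.ips_site_page AS page_title
  FROM wpmed_pages p
  JOIN wb_items_per_site en
    ON en.ips_site_id = 'enwiki'
   AND en.ips_site_page = REPLACE(p.page_title, '_', ' ')
  JOIN wb_items_per_site other
    ON other.ips_item_id = en.ips_item_id   -- same Wikidata item
 WHERE other.ips_site_id <> 'enwiki';
```

This treats enwiki as authoritative: an article appears in the result only if its Wikidata item also has an enwiki sitelink that is in WP:MED.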

Thanks Doc, yeah I'm looking to get some preliminary numbers by the end of the quarter and then to think about productionizing access to this kind of data in Q3 (starting January next year). We just have a long backlog of work. If others are interested in this in a volunteer capacity, I'm always happy to guide them.

@Ladsgroup, I'm getting numbers for August and July, unless you want something else.

Queries done. The data's in the milimetric.wikiproject_medicine_page_counts Hive table. I collected the steps I took in this gist: https://gist.github.com/milimetric/e77e22a736cef4c973a26667a3e94d8c

@Ladsgroup let's chat about how this works if you want. As I say at the bottom of that gist, here's a way to get top 100 pages and their views for August:

 select page_title,
        view_count
   from milimetric.wikiproject_medicine_page_counts
  where year=2016
    and month=8
  order by view_count desc
  limit 100
;

Just a teaser:

page_title                         view_count
Zika_virus                         771956
Leonardo_da_Vinci                  740377
MDMA                               493059
Tuberculosis                       427626
Sexual_intercourse                 402888
Project_MKUltra                    374635
Trypophobia                        348085
Diazepam                           329726
Diabetes_mellitus                  327654
Narcissistic_personality_disorder  321081
Asperger_syndrome                  316896
Malaria                            315221
Meningitis                         301670
Lyme_disease                       280941

Looking good. Does it generate pageviews for entire projects in a given month yet? About 33K pages for WPMED.

James

Well, so this is a one-off query. With it, I ran numbers for July 2016 and August 2016, for all 33K pages in WPMED. That's what the milimetric.wikiproject_medicine_page_counts table contains right now. If more months are required, I have the steps to get that data in my scripts. For now, we won't automate this and expose it through the Pageview API, but that's on our backlog. We're thinking we can get to it early next year (January) or sooner if someone wants to volunteer to help.

Regarding "That's what the milimetric.wikiproject_medicine_page_counts table contains right now": where do I find this table? Or can you provide the totals here?

The totals for all articles by month:

July 2016: 179253171
August 2016: 190445556

The table is not accessible without an NDA, so I think only Amir has access to it. It's a Hive table, accessible from stat1002.
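
For anyone with access, per-month totals like these could be recomputed along the following lines. This is a sketch that assumes the table holds one row per page per month, with the same page_title, view_count, year and month columns used in the top-100 query; it is not necessarily the query that produced the figures.

```sql
-- Total WP:MED page views per month, one output row per month.
SELECT year,
       month,
       SUM(view_count) AS total_views
  FROM milimetric.wikiproject_medicine_page_counts
 WHERE year = 2016
 GROUP BY year, month
 ORDER BY year, month
;
```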

Hey Milimetric, can we get data for September for WPMED? Not sure if it is available yet. Also, by NDA do you mean "non-disclosure agreement"?

@Doc_James: I think @Milimetric was going to do this as a one-off query; the intent was not to repeat it every month until we have a process for running these queries automatically. I might be wrong on this, and @Milimetric can correct me if so.

Hi @Doc_James, yes, I only intended to run this manually once. We have plans to prioritize the long-term version of this in Q3 (starting January 2017). Until then, we're busy with editing data infrastructure.

People who have signed the NDA (which, you're right, stands for non-disclosure agreement) can run the same query using the details I provide here. So basically any analyst at WMF, or any researcher who has that access. I'm happy to help with any such efforts if they get stuck using my code.

Okay thanks. @Ladsgroup do you think you could run September for us?