
Request for data: sites traffic by topics/ subject areas and geographies
Closed, ResolvedPublic

Description

Hello there,

I am looking for the following information and would really appreciate any help or leads to get hold of it:

  • what content (topics/subject areas) is most frequently accessed on WP?
  • by geographic region? and by language Wikipedias (probably looking at the top 10 only?)

Now, you might be asking, 'what do I need this for'... :)

The Strategic Partnerships team has been in discussions with large data and content-rich organizations (World Bank, OECD, National Geographic Society...) interested in supporting our open knowledge projects and communities.

We are discussing the design of a pilot program and hope to work with a small sample of selected content to help determine what the framework looks like in these types of institutional collaborations.

I would like to work with a sample of high-traffic pages in various areas so we can evaluate, iterate and validate more rapidly.

We are hoping to test with high-traffic pages as well as within specific topics that have a dedicated WikiProject, such as WikiProject Medicine (high demand: heart attack, flu virus…) or WikiProject Agriculture (livestock, crops, fisheries…), where there are a sufficient number of contributors/users to help us make these evaluations.

So knowing what topics/subject areas people/organizations are most interested in will help us make a strong wish list.

Please ping me if you have questions, on IRC – although I am not always on – or via email anytime sventura@wikimedia.org.

Thank you!!

Sylvia

Event Timeline

SVentura raised the priority of this task from to Needs Triage.
SVentura updated the task description.
SVentura added a subscriber: SVentura.

@dr0ptp4kt can we touch base on this next week? I'd like to understand if we can help you guys but have the Reading team own this. cc @kevinator

Hi all, I realize re-reading my request above that it sounds like the Strategic Partnerships team is solely responsible for driving these new cool opportunities -- I want to clarify that we/Partnerships are a very small portion of this and that I have been relying heavily on the Engineering Community's support and expertise, without which these conversations would never be possible, and also on Wikidata's team, who've been super supportive from the beginning. Credit should be given where credit is due. Thank you EC, Wikidata and everyone on here for your support in exploring these new areas.

Have a great weekend!
Sylvia

@DarTar, @SVentura, @Tbayer, @kevinator: I've set a meeting to discuss further.

@Tbayer (working on regular reports for Reading) and @kevinator, I marked you as Optional. @kevinator, please let me know if you need me to reschedule.

It might be useful to get a sense of what kind of information is already out there. To start with, here are a few links to existing academic studies and community/WMF data sources:

https://meta.wikimedia.org/wiki/Research:Newsletter/2015/April#Popularity_does_not_breed_quality_.28and_vice_versa.29

https://ironholds.shinyapps.io/WhereInTheWorldIsWikipedia/

https://meta.wikimedia.org/wiki/Research:Newsletter/2014/May#Chinese-language_time-zones_favor_Asian_pop_and_IT_topics_on_Wikipedia (results probably aren't very relevant here, but their topic classification method might be worth a look)

https://meta.wikimedia.org/wiki/Research:Newsletter/2013/May#Science_eight_times_more_popular_on_the_Spanish_Wikipedia_than_on_the_English_Wikipedia.3F

Work by Andrew West and others:

http://www.jmir.org/2015/3/e62/ "Wikipedia and Medicine: Quantifying Readership, Editors, and the Significance of Natural Language" (example of what kind of exploration can be done by focusing on one subject area)

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Medicine/Popular_pages /
https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_medical_pages , (cf. T107831: Generalize useful pageview tools)

https://en.wikipedia.org/wiki/User:West.andrew.g/2014_Popular_pages and the weekly top-x reports linked from there (also the source of the Signpost's weekly traffic report) - IMHO these lists of most popular articles can be a bit of a red herring though, as they focus attention on a small set of articles that capture just a small portion of the overall traffic

@ezachte: we met today and discussed the scope of this ask, and it looks like the work you've done for GLAM around media file stats and traffic by category would meet @SVentura's immediate needs. I recommended she get in touch with you so you can give her an overview of available tools.

@ezachte, good to meet you here, would you have time for a quick call/google hangout tomorrow or Thursday? I can explain what we are looking for. Thanks!
@DarTar, thanks for connecting!

One more link - this World Bank document shows what kind of stats they used previously:
"The Wikipedia-World Bank Pilot Project"; see the "Hits for selected articles (September 2010)" and "How many hits does the pilot project get?" sections

thanks for the link @Tbayer. this time around we are working with World Bank's data scientists (as opposed to Program Leads like in the 2006 pilot). They might be looking at slightly different data points. Will share updates as they come in.

@SVentura, awesome initiative, will contact you offline for call today

As for World Bank data, that one is close to my heart. Some Wikistats reports use demographics harvested from English Wikipedia (population per country, internet usage, ..), e.g. http://stats.wikimedia.org/archive/squid_reports/2014-06/SquidReportPageViewsPerCountryOverview.htm

But these Wikipedia pages tend to be updated infrequently, e.g. https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users

I would love to use World Bank data. They have a great API, but it just hasn't happened yet. Now Wikidata seems a perfect repository to store up-to-date demographics. Many Wikipedia pages could potentially benefit from it.

DarTar renamed this task from request for data: sites traffic by topics/ subject areas and geographies to Request for data: sites traffic by topics/ subject areas and geographies. Aug 6 2015, 2:30 PM
DarTar assigned this task to ezachte.
DarTar triaged this task as Low priority.
DarTar moved this task from Staged to In Progress on the Research board.

I made a small change to the script which collects pageviews per category hierarchy, such that if the root category is a WikiProject, the script will collect pageviews for the actual articles, even when the WikiProject category tag has been inserted on the talk page. So in 'WikiProject Agriculture', if the category tag is on page 'Talk:Apple', views for 'Apple' are still collected. The resulting reports are quite neat; no manual pruning of run-away subcategories seems needed.
See results at http://stats.wikimedia.org/wikimedia/pageviews/categorized/wp-en/2015-06/pageviews_wp-en_cat_WikiProject_Agriculture_2015-06.html, more at http://stats.wikimedia.org/wikimedia/pageviews/categorized/wp-en/2015-06/
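For readers who want a feel for the talk-page mapping described above, here is a rough sketch in Python. It is not the actual Wikistats script. The function names, the dictionary-based inputs, the subcategory name and the view counts are all made up for illustration, and it assumes that a talk page title is simply the article title with a "Talk:" prefix.

```python
# Illustrative sketch only; not the Wikistats script.
# All names and sample numbers below are hypothetical.

TALK_PREFIX = "Talk:"


def subject_title(page_title):
    """Map a talk page title to its article title; leave other titles unchanged."""
    if page_title.startswith(TALK_PREFIX):
        return page_title[len(TALK_PREFIX):]
    return page_title


def collect_views(root, subcats, members, pageviews):
    """Walk the category tree from root and look up views per tagged article.

    subcats:   category name -> list of subcategory names
    members:   category name -> list of page titles carrying the category tag
    pageviews: page title -> view count for the reporting period
    """
    seen = set()          # guards against cycles in the category graph
    totals = {}           # article title -> views (each article counted once)
    stack = [root]
    while stack:
        cat = stack.pop()
        if cat in seen:
            continue
        seen.add(cat)
        for title in members.get(cat, []):
            article = subject_title(title)
            totals[article] = pageviews.get(article, 0)
        stack.extend(subcats.get(cat, []))
    return totals


# Made-up sample data: the tag sits on talk pages, views are read for articles.
sample_subcats = {"WikiProject Agriculture": ["Agriculture articles by quality"]}
sample_members = {
    "WikiProject Agriculture": ["Talk:Apple"],
    "Agriculture articles by quality": ["Talk:Wheat", "Talk:Apple"],
}
sample_views = {"Apple": 1200, "Wheat": 800, "Talk:Apple": 5}

sample_totals = collect_views(
    "WikiProject Agriculture", sample_subcats, sample_members, sample_views
)
```

In this sketch 'Talk:Apple' is tagged in two categories but 'Apple' is only counted once, and the talk page's own views are ignored.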

@Qgil We have a short list of topics we can work with for the pilot project with the World Bank.
If there are additional requests on this pilot, I'll post them to the World Bank project page on Wiki Loves Open Data/WorldBank.

Does it sound good? Ok as far as process? Thanks!

Thank you @ezachte for the awesome work and responsiveness on this!!