Make domas' pageviews data available in semi-publicly queryable database format
Open, Public

Description

This doesn't seem to be tracked yet.
It's been discussed countless times in the past few years: for all sorts of GLAM initiatives and any other initiative to improve content on the projects, we currently rely on Henrik's stats.grok.se data in JSON format, e.g. https://toolserver.org/~emw/index.php?c=wikistats , http://toolserver.org/~magnus/glamorous.php etc.
The data in domas' logs should be available for easy querying on the Toolserver databases and elsewhere, but previous attempts to create such a DB led nowhere as far as I know.

I suppose this is already one of the highest priorities in the analytics team plans for the new infrastructure, but I wasn't able to confirm it by reading the public documents and it needs to be done anyway sooner or later.

(Not in "Usage statistics" aka "Statistics" component because that's only about raw pageviews data.)


Version: wmf-deployment
Severity: enhancement
Whiteboard: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/43657 https://intern.wikimedia.ch/lists/private/cultural-partners/2010-November/000281.html https://intern.wikimedia.ch/lists/private/cultural-partners/2011-July/001476.html https://intern.wikimedia.ch/lists/private/cultural-partners/2011-December/002477.html https://intern.wikimedia.ch/lists/private/cultural-partners/2012-December/005100.html http://lists.wikimedia.org/pipermail/analytics/2013-January/000351.html http://lists.wikimedia.org/pipermail/analytics/2013-May/000618.html https://intern.wikimedia.ch/lists/private/cultural-partners/2013-June/006163.html http://lists.wikimedia.org/pipermail/wikitech-l/2013-June/069692.html http://lists.wikimedia.org/pipermail/wikitech-l/2013-September/071714.html http://lists.wikimedia.org/pipermail/wikimedia-l/2013-September/128060.html http://lists.wikimedia.org/pipermail/analytics/2013-October/001062.html https://intern.wikimedia.ch/lists/private/cultural-partners/2013-October/006753.html http://magnusmanske.de/wordpress/?p=173 https://en.wikipedia.org/w/index.php?title=User_talk:Henrik&diff=600917917&oldid=600897425 http://comments.gmane.org/gmane.org.wikimedia.analytics/142 https://en.wikipedia.org/?curid=43853841
URL: http://lists.wikimedia.org/pipermail/analytics/2012-December/000266.html

bzimport set Reference to bz42259.
Nemo_bis created this task. Via Legacy, Nov 19 2012, 11:18 AM
drdee added a comment. Via Conduit, Nov 19 2012, 5:41 PM

This is totally on our roadmap, and the Analytics Team is working on this as part of Kraken.

bzimport added a comment. Via Conduit, Nov 20 2012, 2:23 AM

emw.wiki wrote:

Diederik, does the Analytics Team plan to make hourly data queryable? I think being able to see how hourly viewing patterns change over long time periods would be pretty valuable.

drdee added a comment. Via Conduit, Nov 20 2012, 2:27 AM

YES! we totally are planning on doing that.

bzimport added a comment. Via Conduit, Feb 11 2013, 3:05 PM

pf2k-wlkn wrote:

Henrik's Pageviews tool linked from the History tab on English Wikipedia seems buggy or broken, as mentioned [[User talk:Henrik#What article rank means exactly|here]]. I think that it would be trivial to fix it or replace it, as I mention [[User talk:West.andrew.g/Popular_pages/Archive 1#Possible WMF labs support for your good work|here]]. There already is [http://toolserver.org/~johang/wikitrends/english-most-visited-this-week.html this], but it's only for the top ten, and it's only linked to from English (and Japanese, for the Japanese version) Wikipedias (though it'd take a lot of looking to find it even there). I'd guess that maybe 10% of the articles get 90% of the traffic. If this is the case, it would be useful to have a list of the top 10% (in the past month or the past year) so as to determine which articles are most popular but badly need improvement (improving much-viewed pages has more effect on the perceived quality of Wikipedia than improving seldom-viewed pages). Such a list, done only once a month -- or even only once a year -- would be extremely useful.
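
The "10% of the articles get 90% of the traffic" guess could be checked once per-article counts are queryable. What follows is only a minimal sketch, assuming a hypothetical mapping from article title to view count for the chosen period; the function names `top_pages` and `top_share` and the sample numbers are invented for illustration and do not come from any existing tool.

```python
def top_pages(view_counts, fraction=0.10):
    """Return the most-viewed `fraction` of pages, highest first.

    `view_counts` is a hypothetical dict mapping page title -> views.
    """
    ranked = sorted(view_counts.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))  # always keep at least one page
    return ranked[:keep]


def top_share(view_counts, fraction=0.10):
    """Return the share of all views that go to the top `fraction` of pages."""
    total = sum(view_counts.values())
    if total == 0:
        return 0.0
    return sum(views for _, views in top_pages(view_counts, fraction)) / total


# Invented sample data: one very popular page and nine rarely viewed ones.
views = {"Popular_page": 91}
views.update({"Obscure_page_%d" % i: 1 for i in range(9)})

share = top_share(views)      # share of traffic going to the top 10% of pages
popular = top_pages(views)    # the list a WikiProject could work from
```

Run monthly or yearly over real counts, the `top_pages` list is the kind of output requested here, and `top_share` would show how close the 10%/90% guess is.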

bzimport added a comment. Via Conduit, Feb 12 2013, 1:59 AM

pf2k-wlkn wrote:

Some WikiProjects are compiling popularity data and using it to improve popular articles, see [[Wikipedia:WikiProject Medicine/Popular_pages]]. But popularity data really needs to be readily available to other projects and other (foreign-) language Wikipedias. Already one person has done a [http://toolserver.org/~johang/2012.html top 100 for 2012] (including other-language Wikipedias) but ideally this would be extended to the top 5000 or top 10% -- and also linked to from the other foreign-language Wikipedias, as few people seem to know about it.
Hourly data is surely only of commercial interest -- it would help people know which hours and days are best for paid advertising in search engines. [[User:LittleBenW]]

MZMcBride added a comment. Via Conduit, Feb 12 2013, 2:15 AM

(In reply to comment #3)

> YES! we totally are planning on doing that.

Is there a status update (or page on mediawiki.org) tracking this feature request?

drdee added a comment. Via Conduit, Mar 21 2013, 3:10 PM

See https://mingle.corp.wikimedia.org/projects/analytics/cards/113 for progress. Would love your input regarding In Scope / Out of Scope and User stories. Just add them to this thread in Bugzilla and I will add them to the mingle card.

scfc added a comment. Via Conduit, May 28 2013, 8:03 PM

(In reply to comment #7)

> See https://mingle.corp.wikimedia.org/projects/analytics/cards/113 for
> progress. Would love your input regarding In Scope / Out of Scope and User
> stories. Just add them to this thread in Bugzilla and I will add them to the
> mingle card.

In http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/67248 you planned a sprint at the Amsterdam hackathon. Was it successful?

drdee added a comment. Via Conduit, Jun 3 2013, 10:55 AM

I've got a first draft of the puppet manifest, it needs some more work.
@Nemo: I don't have access to the private conversations on the cultural wikimedia mailinglists. Can we have these discussions on wikimedia-analytics mailinglist?

Nemo_bis added a comment. Via Conduit, Jun 3 2013, 11:11 AM

(In reply to comment #9)

> @Nemo: I don't have access to the private conversations on the cultural
> wikimedia mailinglists. Can we have these discussions on wikimedia-analytics
> mailinglist?

You could ask for access to the archives. Most of those discussions happened before the analytics list existed; in any case, I don't control where the topic is raised. This is a widespread need, so it pops up everywhere; I just collect links.

Doc_James added a comment. Via Conduit, Sep 19 2013, 2:19 AM

We at Wikiproject Medicine are really interested to know how many medical articles there are and which are top viewed in other languages. https://meta.wikimedia.org/wiki/WikiProject_Med/Tech#Metrics_requests

Looking forward to seeing a tool that can do this. Please let me know if there is anything I can do to help. The other thing that may be needed is a bot to automatically tag articles in other languages, either by "Wikiproject" or as categories.

Nemo_bis added a comment. Via Conduit, Oct 21 2013, 2:50 PM

For the latest updates, see the thread "[Analytics] Back of the envelope data size" at http://lists.wikimedia.org/pipermail/analytics/2013-October/thread.html#1062 , where the requirements for this bug were discussed a bit.

Nemo_bis added a comment. Via Conduit, Nov 14 2013, 10:35 AM

According to corridor rumors :), Dario at Wikimania said that some sort of pageview data is going, at some point, to be integrated into [[mw:Wikimetrics]]. If true, where is this tracked/documented and is it a parallel effort or something depending on this bug?

Nemo_bis added a comment. Via Conduit, Jan 9 2014, 4:30 PM

Given the silence since October, I checked the project pages a bit. I can't find any real mention of pageviews under [[wikitech:Analytics]] and under [[mw:Analytics]] there are no actual mentions of them related to Kraken other than "examples of the sort of thing Kraken might store" at [[mw:Analytics/Kraken/Researcher analysis]].

As such, I believe that, in the current state of things, moving this bug under "Analytics" and specifically "Kraken" was premature cookie-licking. I'm moving it to the generic component for this sort of issue so that it can be picked up by whichever person or project wishes to. The "analytics" keyword stays.

bzimport added a comment. Via Conduit, Jan 13 2014, 8:48 PM

brassratgirl wrote:

Just a +1 that the stats that you can get from http://stats.grok.se (pageviews for a particular article, presented in a pretty little graph) are VERY helpful for all of us who do outreach work, presentations, education, work with GLAMs, etc. -- not to mention for simple curiosity :) I'd love to see a tool that could do this for all languages.

bzimport added a comment. Via Conduit, Apr 5 2014, 3:43 PM

emw.wiki wrote:

Any updates on this?

My Wikipedia traffic visualization tool (https://toolserver.org/~emw/wikistats/) was among those listed as motivation for this ticket, but I recently decommissioned it. I see that a pageview API is on the Analytics team's Q2 2014 priorities list, but it's last: https://www.mediawiki.org/w/index.php?title=Analytics/Prioritization_Planning&oldid=850355.

My reasons for ceasing development and eventually maintenance of that tool are mostly unrelated to the lack of progress on this issue, but it was a notable factor. For now I'm pointing folks to http://tools.wmflabs.org/wikiviewstats/. However, I'm not aware of any tool other than mine that enables users to, in a single graph, visualize daily page views for up to 5 years, compare such data for multiple articles in a language / one article in multiple languages, or to view that data in table format and download it as a CSV.

bzimport added a comment. Via Conduit, Apr 5 2014, 7:00 PM

mrjohncummings wrote:

I'd like to mention Magnus Manske's blog post about this, it includes a reply by Toby Negrin, Head of Analytics at the Wikimedia Foundation http://magnusmanske.de/wordpress/?p=173

Nemo_bis added a comment. Via Conduit, Apr 12 2014, 12:31 PM

(In reply to mrjohncummings from comment #17)

> I'd like to mention Magnus Manske's blog post about this, it includes a
> reply by Toby Negrin, Head of Analytics at the Wikimedia Foundation
> http://magnusmanske.de/wordpress/?p=173

I also asked a question at https://meta.wikimedia.org/wiki/Grants_talk:APG/Proposals/2013-2014_round2/Wikimedia_Foundation/Proposal_form#Multiplication_of_tools

Nemo_bis awarded a token. Via Web, Dec 12 2014, 8:20 AM
Ricordisamoa added a subscriber: Ricordisamoa. Via Web, Dec 18 2014, 4:41 PM
MZMcBride added a comment. Via Web, Feb 11 2015, 5:26 AM

I'm told there's now an internal Hive cluster populated with page view data that's being regularly used/queried by the Wikimedia Foundation and select researchers. The remaining piece for this task is then to expose this Hive cluster to the outside world.

Nemo_bis added a comment. Via Web, Feb 11 2015, 8:35 AM

"wmf.webrequest contains the 'refined' webrequest data. This table is currently considered experimental." https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive

Milimetric added a comment. Via Web, Feb 20 2015, 2:21 PM

For operational reasons, it's nearly impossible to expose a Hive cluster to the world. Someone could easily run a query with "where YEAR > 0" and make the cluster scan terabytes of data. I agree that the existence of wmf.webrequest is a fantastic step forward, and it took a lot of work to get it stable and monitored the way it is today. The remaining work is roughly:

  • sanitize any remaining operational data in the logs (IPs etc)
  • aggregate to the page and hour level
  • find a simple but spacious database from where to serve the data. RESTBase seems like a good candidate, but it may be tough to support this probably large new usage of it.
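
The first two items can be sketched roughly as follows. This is only an illustration under stated assumptions: the record fields `ip`, `project`, `page` and `timestamp` and the function name `aggregate_hourly` are hypothetical, not the actual wmf.webrequest schema or the Analytics team's implementation, and the real work would run on the cluster rather than in plain Python.

```python
from collections import Counter
from datetime import datetime


def aggregate_hourly(requests):
    """Count views per (project, page, hour), discarding client IPs.

    `requests` is an iterable of dicts with the hypothetical keys
    "ip", "project", "page" and "timestamp" (ISO 8601, second precision).
    """
    counts = Counter()
    for req in requests:
        ts = datetime.strptime(req["timestamp"], "%Y-%m-%dT%H:%M:%S")
        hour = ts.strftime("%Y-%m-%dT%H:00")  # truncate the timestamp to the hour
        # Only project, page and hour are kept; the IP is never copied to the output.
        counts[(req["project"], req["page"], hour)] += 1
    return counts


# Invented sample records (IPs taken from the documentation-only 192.0.2.0/24 range).
sample = [
    {"ip": "192.0.2.1", "project": "en.wikipedia", "page": "Main_Page",
     "timestamp": "2015-02-20T14:05:00"},
    {"ip": "192.0.2.2", "project": "en.wikipedia", "page": "Main_Page",
     "timestamp": "2015-02-20T14:59:59"},
    {"ip": "192.0.2.1", "project": "en.wikipedia", "page": "Main_Page",
     "timestamp": "2015-02-20T15:00:00"},
]
hourly = aggregate_hourly(sample)
```

A table of such pre-aggregated rows is small and has a fixed shape, which is what would make it feasible to serve from a simple public store instead of exposing the Hive cluster itself.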

Milimetric added a comment. Via Web, Feb 20 2015, 2:22 PM

Work on all three remaining items has been started, by the way.

Arjunaraoc added a subscriber: Arjunaraoc. Via Web, Fri, Mar 6, 11:17 AM
