Make domas' pageviews data available in semi-publicly queryable database format
Open, Public

Assigned To
Milimetric
Priority
Normal
Author
Nemo_bis
Blocks
T56184: Pageviews for Wikiprojects and Task Forces in Languages other than English
Subscribers
Ijon, JAllemandou, Wittylama and 23 others
Projects
Tokens
"Like" token, awarded by Ijon."Like" token, awarded by Nemo_bis.
Security
None
Reference
bz42259
Description

This doesn't seem to be tracked yet.
It's been discussed countless times in the past few years: for all sorts of GLAM initiatives, and any other initiative to improve content on the projects, we currently rely on Henrik's stats.grok.se data in JSON format, e.g. https://toolserver.org/~emw/index.php?c=wikistats and http://toolserver.org/~magnus/glamorous.php.
The data in Domas' logs should be available for easy querying on the Toolserver databases and elsewhere, but previous attempts to create such a DB led nowhere as far as I know.

I suppose this is already one of the highest priorities in the Analytics team's plans for the new infrastructure, but I wasn't able to confirm that by reading the public documents, and it needs to be done sooner or later anyway.

(Not in "Usage statistics" aka "Statistics" component because that's only about raw pageviews data.)


Version: wmf-deployment
Severity: enhancement
Discussions (partial list):

bzimport set Reference to bz42259.
Nemo_bis created this task. Via Legacy · Nov 19 2012, 11:18 AM
drdee added a comment. Via Conduit · Nov 19 2012, 5:41 PM

This is totally on our roadmap, and the Analytics Team is working on this as part of Kraken.

bzimport added a comment. Via Conduit · Nov 20 2012, 2:23 AM

emw.wiki wrote:

Diederik, does the Analytics Team plan to make hourly data queryable? I think being able to see how hourly viewing patterns change over long time periods would be pretty valuable.

drdee added a comment. Via Conduit · Nov 20 2012, 2:27 AM

YES! we totally are planning on doing that.

bzimport added a comment. Via Conduit · Feb 11 2013, 3:05 PM

pf2k-wlkn wrote:

Henrik's Pageviews tool, linked from the History tab on English Wikipedia, seems buggy or broken, as mentioned [[User talk:Henrik#What article rank means exactly|here]]. I think that it would be trivial to fix it or replace it, as I mention [[User talk:West.andrew.g/Popular_pages/Archive 1#Possible WMF labs support for your good work|here]]. There already is [http://toolserver.org/~johang/wikitrends/english-most-visited-this-week.html this], but it's only for the top ten, and it's only linked from the English (and, for the Japanese version, Japanese) Wikipedias -- though it'd take a lot of looking to find it even there.

I'd guess that maybe 10% of the articles get 90% of the traffic. If so, it would be useful to have a list of the top 10% (in the past month or the past year) so as to determine which articles are most popular but badly need improvement: improving much-viewed pages has more effect on the perceived quality of Wikipedia than improving seldom-viewed pages. Such a list, generated only once a month -- or even only once a year -- would be extremely useful.

bzimport added a comment. Via Conduit · Feb 12 2013, 1:59 AM

pf2k-wlkn wrote:

Some WikiProjects are compiling popularity data and using it to improve popular articles; see [[Wikipedia:WikiProject Medicine/Popular_pages]]. But popularity data really needs to be readily available to other projects and to other-language Wikipedias. One person has already done a [http://toolserver.org/~johang/2012.html top 100 for 2012] (including other-language Wikipedias), but ideally this would be extended to the top 5000 or top 10% -- and also linked from the other-language Wikipedias, as few people seem to know about it.
Hourly data is surely only of commercial interest -- it would help people know which hours and days are best for paid advertising in search engines. [[User:LittleBenW]]

MZMcBride added a comment. Via Conduit · Feb 12 2013, 2:15 AM

(In reply to comment #3)

YES! we totally are planning on doing that.

Is there a status update (or page on mediawiki.org) tracking this feature request?

drdee added a comment. Via Conduit · Mar 21 2013, 3:10 PM

See https://mingle.corp.wikimedia.org/projects/analytics/cards/113 for progress. Would love your input regarding In Scope / Out of Scope and User stories. Just add them to this thread in Bugzilla and I will add them to the mingle card.

scfc added a comment. Via Conduit · May 28 2013, 8:03 PM

(In reply to comment #7)

See https://mingle.corp.wikimedia.org/projects/analytics/cards/113 for
progress. Would love your input regarding In Scope / Out of Scope and User
stories. Just add them to this thread in Bugzilla and I will add them to the
mingle card.

In http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/67248 you planned a sprint at the Amsterdam hackathon. Was it successful?

drdee added a comment. Via Conduit · Jun 3 2013, 10:55 AM

I've got a first draft of the Puppet manifest; it needs some more work.
@Nemo: I don't have access to the private conversations on the cultural Wikimedia mailing lists. Can we have these discussions on the wikimedia-analytics mailing list?

Nemo_bis added a comment. Via Conduit · Jun 3 2013, 11:11 AM

(In reply to comment #9)

@Nemo: I don't have access to the private conversations on the cultural
Wikimedia mailing lists. Can we have these discussions on the wikimedia-analytics
mailing list?

You could ask for access to the archives. Most of those discussions happened before the analytics list existed; anyway, I don't control where the topic is raised. This is a widespread need, so it pops up everywhere; I just collect links.

Doc_James added a comment. Via Conduit · Sep 19 2013, 2:19 AM

We at WikiProject Medicine are really interested in knowing how many medical articles there are and which are the most viewed in other languages. https://meta.wikimedia.org/wiki/WikiProject_Med/Tech#Metrics_requests

Looking forward to seeing a tool that can do this. Please let me know if there is anything I can do to help. The other thing that may be needed is a bot to automatically tag articles in other languages, either by WikiProject or by category.

Nemo_bis added a comment. Via Conduit · Oct 21 2013, 2:50 PM

For the latest updates, see the thread http://lists.wikimedia.org/pipermail/analytics/2013-October/thread.html#1062 "[Analytics] Back of the envelope data size", where requirements for this bug were discussed a bit.

Nemo_bis added a comment. Via Conduit · Nov 14 2013, 10:35 AM

According to corridor rumors :), Dario said at Wikimania that some sort of pageview data is, at some point, going to be integrated into [[mw:Wikimetrics]]. If true, where is this tracked/documented, and is it a parallel effort or something depending on this bug?

Nemo_bis added a comment. Via Conduit · Jan 9 2014, 4:30 PM

Given the silence since October, I checked the project pages a bit. I can't find any real mention of pageviews under [[wikitech:Analytics]], and under [[mw:Analytics]] the only actual mention of them related to Kraken is "examples of the sort of thing Kraken might store" at [[mw:Analytics/Kraken/Researcher analysis]].

As such, I believe that in the current state of things the move of this bug under "Analytics", and specifically "Kraken", was premature cookie-licking, and I'm moving it to the generic component for this sort of issue so that it can be picked up by whatever person or project wishes to. The "analytics" keyword stays.

bzimport added a comment. Via Conduit · Jan 13 2014, 8:48 PM

brassratgirl wrote:

Just a +1 that the stats you can get from http://stats.grok.se (pageviews for a particular article, presented in a pretty little graph) are VERY helpful for all of us who do outreach work, presentations, education, work with GLAMs, etc. -- not to mention for simple curiosity :) I'd love to see a tool that could do this for all languages.

bzimport added a comment. Via Conduit · Apr 5 2014, 3:43 PM

emw.wiki wrote:

Any updates on this?

My Wikipedia traffic visualization tool (https://toolserver.org/~emw/wikistats/) was among those listed as motivation for this ticket, but I recently decommissioned it. I see that a pageview API is on the Analytics team's Q2 2014 priorities list, but it's last: https://www.mediawiki.org/w/index.php?title=Analytics/Prioritization_Planning&oldid=850355.

My reasons for ceasing development, and eventually maintenance, of that tool are mostly unrelated to the lack of progress on this issue, but it was a notable factor. For now I'm pointing folks to http://tools.wmflabs.org/wikiviewstats/. However, I'm not aware of any other tool that lets users visualize, in a single graph, daily page views for up to 5 years, compare such data for multiple articles in one language or one article in multiple languages, or view that data in table format and download it as a CSV.

bzimport added a comment. Via Conduit · Apr 5 2014, 7:00 PM

mrjohncummings wrote:

I'd like to mention Magnus Manske's blog post about this; it includes a reply by Toby Negrin, Head of Analytics at the Wikimedia Foundation: http://magnusmanske.de/wordpress/?p=173

Nemo_bis added a comment. Via Conduit · Apr 12 2014, 12:31 PM

(In reply to mrjohncummings from comment #17)

I'd like to mention Magnus Manske's blog post about this, it includes a
reply by Toby Negrin, Head of Analytics at the Wikimedia Foundation
http://magnusmanske.de/wordpress/?p=173

I also asked a question at https://meta.wikimedia.org/wiki/Grants_talk:APG/Proposals/2013-2014_round2/Wikimedia_Foundation/Proposal_form#Multiplication_of_tools

Nemo_bis awarded a token. Via Web · Dec 12 2014, 8:20 AM
Ricordisamoa added a subscriber: Ricordisamoa. Via Web · Dec 18 2014, 4:41 PM
MZMcBride added a comment. Via Web · Feb 11 2015, 5:26 AM

I'm told there's now an internal Hive cluster populated with page view data that's being regularly used/queried by the Wikimedia Foundation and select researchers. The remaining piece for this task is then to expose this Hive cluster to the outside world.

Nemo_bis added a comment. Via Web · Feb 11 2015, 8:35 AM

"wmf.webrequest contains the 'refined' webrequest data. This table is currently considered experimental." https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive

Milimetric added a comment. Via Web · Feb 20 2015, 2:21 PM

For operational reasons, it's nearly impossible to expose a Hive cluster to the world. Someone could easily run a query like "WHERE year > 0" and make the cluster look through terabytes of data. I agree that the existence of wmf.webrequest is a fantastic step forward, and it took a lot of work to get it as stable and monitored as it is today. The remaining work is roughly:

  • sanitize any remaining operational data in the logs (IPs etc.)
  • aggregate to the page and hour level (see the sketch after this list)
  • find a simple but spacious database from which to serve the data. RESTBase seems like a good candidate, but it may be tough to support this probably-large new use of it.
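
To make the second bullet concrete, here is a minimal sketch in Python of a page-and-hour rollup. It assumes the hourly Domas-style pagecounts line format (project, percent-encoded page title, request count, bytes transferred, space-separated); the file name and output shape are illustrative assumptions, not the team's actual Hadoop job.

```python
from collections import Counter

def aggregate_hour(path, hour):
    """Roll one hourly pagecounts file up to (project, title, hour) totals.

    Assumes the Domas pagecounts line format: project, percent-encoded
    page title, request count, bytes transferred -- all space-separated.
    """
    totals = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4 or not parts[2].isdigit():
                continue  # skip malformed lines
            project, title, count, _bytes = parts
            totals[(project, title, hour)] += int(count)
    return totals

# Hypothetical file name, for illustration only:
# hourly = aggregate_hour("pagecounts-20150520-140000", "2015-05-20T14")
```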
Milimetric added a comment. Via Web · Feb 20 2015, 2:22 PM

Work on all three remaining items has started, by the way.

Arjunaraoc added a subscriber: Arjunaraoc. Via Web · Mar 6 2015, 11:17 AM
jeremyb-phone added a subscriber: jeremyb. Via Web · May 18 2015, 6:13 PM
jeremyb-phone set Security to None.
Nemo_bis edited the task description. Via Web · May 18 2015, 7:47 PM
PKM added a subscriber: PKM. Via Web · May 18 2015, 9:09 PM
Aubrey added a comment. Via Web · May 20 2015, 5:47 PM

Hello. Is there any kind of update on this? I think this is one of the most wanted features/tools in the whole Wikimedia world, and it would be nice to understand whether we are any closer to a real, usable, non-English-Wikipedia-centric tool :-) Thanks to whoever is working on this. We Wikimedians are nitpicky, but we love you.

Milimetric added a comment. Via Web · May 21 2015, 4:31 PM

I'd love to start a more open discussion about our progress on this. Here's the recent history and where we are:

  • February 2015: with data flowing into the Hadoop cluster, we defined which raw webrequests count as "page views" (a simplified sketch follows this list). The research is here and the code is here
  • March 2015: we used this page view definition to create a raw pageview table in Hadoop. This is queryable with Hive, but it's about 3 TB of data per day, so we don't have the resources to expose it publicly
  • April 2015: we queried this data internally, but it overloaded our cluster and queries were slow
  • May 2015: we're working on an intermediate aggregation that totals up page counts by hour over the dimensions we think most people care about. We estimate this will cut the size down by a factor of 50
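
As a rough illustration of the February item, here is what a "page view" filter over raw webrequests might look like. The real definition (linked above) is far more involved; the field names and the three checks below are simplified assumptions.

```python
def is_pageview(status, mime_type, uri_path):
    """Very simplified sketch of a pageview filter over raw webrequests.

    The real definition handles apps, bots, special pages, and more;
    these three checks are illustrative assumptions only.
    """
    return (
        status == 200
        and mime_type.startswith("text/html")
        and uri_path.startswith("/wiki/")
    )

# e.g. is_pageview(200, "text/html; charset=UTF-8", "/wiki/Main_Page") -> True
```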

Progress has been slow mostly because Event Logging is our main priority and it's been having serious scaling issues. We think we have a good handle on the Event Logging issues after our latest patch, and in a week or so we're going to mostly focus on the Pageview API.

Once this new intermediate aggregation is done, we'll hopefully free up some cluster resources and be in a better position to load up a public API. Right now, we are evaluating two possible data pipelines:

Pipeline 1:

  • Put daily aggregates into PostgreSQL. We think per-article hourly data would be too big for PostgreSQL.

Pipeline 2:

  • Query data from the Hive tables directly with Impala. Impala is only suited to small and medium data, but it is much faster than Hive. We might be able to query the hourly data with this method.

Common Pipeline after we make the choice above:

  • Mondrian builds OLAP cubes and handles caching, which is very useful with this much data
  • point RESTBase at Mondrian and expose the API publicly at restbase.wikimedia.org. This will be a reliable public API that people can build tools around (a hypothetical usage sketch follows this comment)
  • point Saiku at Mondrian and make a new public website for exploratory analytics. Saiku is an open-source OLAP cube visualization and analysis tool

Hope that helps. As we get closer to making this API real, we would love your input, participation, questions, etc.
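
For a sense of how a tool might eventually consume the RESTBase endpoint mentioned above, here is a hypothetical usage sketch. The host comes from the comment; the path, parameters, and response shape are all invented for illustration.

```python
import requests

# Hypothetical endpoint: neither this path nor these parameters were
# defined at the time of writing.
url = "https://restbase.wikimedia.org/metrics/pageviews/en.wikipedia/Main_Page"
params = {"start": "2015-05-01", "end": "2015-05-21", "granularity": "daily"}

resp = requests.get(url, params=params)
resp.raise_for_status()
for row in resp.json().get("items", []):
    print(row)  # e.g. {"date": ..., "views": ...} in this imagined shape
```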

ezachte added a subscriber: ezachte. Via Web · May 21 2015, 5:50 PM

Awesome to see this progress!

Will there be normalization of titles? In Domas' pagecount files, any title appears as typed (or as changed by the browser), and some of those variations are not significant, merely trivial encoding differences (making it harder to collect overall access counts, and also bloating file size). For media file request counts, Christian Aistleitner wrote a UDF that gets rid of these meaningless variations by first decoding the string, then re-encoding it in a standard format (which keeps the comma encoded as %2C to ease use of the output in CSV files). See https://gerrit.wikimedia.org/r/#/c/169346/1/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/IdentifyMediaFileUrl.java

Will redirects be resolved? I have no idea how complex, and how resource-intensive, that would be in a Hive environment, but it would better mimic stats.grok.se, where this also happens.

Wittylama added a subscriber: Wittylama. Via Web · May 21 2015, 7:29 PM
Milimetric added a subscriber: JAllemandou. Via Web · May 22 2015, 1:53 PM

Will there be normalization of titles? In Domas' pagecount files, any title appears as typed (or as changed by the browser), and some of those variations are not significant, merely trivial encoding differences (making it harder to collect overall access counts, and also bloating file size). For media file request counts, Christian Aistleitner wrote a UDF that gets rid of these meaningless variations by first decoding the string, then re-encoding it in a standard format (which keeps the comma encoded as %2C to ease use of the output in CSV files). See https://gerrit.wikimedia.org/r/#/c/169346/1/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/IdentifyMediaFileUrl.java

Thanks for this pointer; I didn't know Christian had done that (he's so awesome :)). One of the things @JAllemandou is currently working on is parsing the article title out of uri_path and uri_query. I'm sure he'll find Christian's work useful, and I agree that we should normalize the titles as part of the same parse.
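
A minimal sketch of the decode-then-re-encode normalization being discussed, in Python rather than Java (Christian's UDF linked above is the real reference). The canonical form chosen here, including which characters stay percent-encoded, is an assumption for illustration.

```python
from urllib.parse import quote, unquote

def normalize_title(raw_title):
    """Collapse trivial percent-encoding variants of a page title.

    Decode whatever encoding the client used, then re-encode in one
    canonical form. The 'safe' set below, and keeping the comma encoded
    as %2C for CSV-friendliness, are assumptions made for this sketch.
    """
    decoded = unquote(raw_title)
    decoded = decoded.replace(" ", "_")  # canonical: underscores, not spaces
    return quote(decoded, safe="_()")

# These trivially different requests collapse to one canonical title:
assert normalize_title("Foo%20Bar") == normalize_title("Foo_Bar") == "Foo_Bar"
```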

Will redirects be resolved? I have no idea how complex, and how resource-intensive, that would be in a Hive environment, but it would better mimic stats.grok.se, where this also happens.

We are exposing the HTTP status code in our intermediate aggregation, but we haven't yet decided how to handle the data further toward the API end of the pipeline. If article A is a redirect to article B, I would think it'd be useful to count hits to A, as well as to include hits to B regardless of how people got there. This would mean that adding up the hits for articles A and B would yield a sum greater than 100% of the actual hits, but maybe we should make it easy to exclude redirects from such aggregations.
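
A toy illustration of the double-counting caveat, with invented names and numbers:

```python
# "A" is a redirect to "B"; hits to A are also credited to B.
direct_hits = {"A": 100, "B": 900}

views_of_a = direct_hits["A"]                     # 100
views_of_b = direct_hits["B"] + direct_hits["A"]  # 1000, redirects included
total_requests = sum(direct_hits.values())        # 1000

# views_of_a + views_of_b == 1100 > total_requests == 1000, so summing
# per-article counts double-counts redirected hits -- hence the need for
# an easy way to exclude redirects from such aggregations.
```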

Ijon awarded a token. Via Web · May 23 2015, 2:28 AM
Ijon added a subscriber: Ijon.
Ijon added a comment. Via Web · May 23 2015, 2:34 AM

Thanks for the detailed update, @Milimetric! Very encouraging. I look forward to being able to build a tool around the RESTBase API endpoint.
