Make domas' pageviews data available in semi-publicly queryable database format
Open, Public

Assigned To
Milimetric
Priority
Normal
Author
Nemo_bis
Blocks
T56184: Pageviews for Wikiprojects and Task Forces in Languages other than English
Subscribers
Sadads, Ocaasi, Tgr and 29 others
Projects
Tokens
"Like" token, awarded by Ijon."Like" token, awarded by Nemo_bis.
Security
None
Reference
bz42259
Description

This doesn't seem to be tracked yet.
It's been discussed countless times in the past few years: for all sorts of GLAM initiatives and any other initiative to improve content on the projects, we currently rely on Henrik's stats.grok.se data in JSON format, e.g. https://toolserver.org/~emw/index.php?c=wikistats , http://toolserver.org/~magnus/glamorous.php etc.
The data in domas' logs should be available for easy querying on the Toolserver databases and elsewhere, but as far as I know previous attempts to create such a DB led nowhere.

I suppose this is already one of the highest priorities in the Analytics team's plans for the new infrastructure, but I wasn't able to confirm that by reading the public documents, and it will need to be done sooner or later anyway.

(Not in "Usage statistics" aka "Statistics" component because that's only about raw pageviews data.)


Version: wmf-deployment
Severity: enhancement
Discussions (partial list):

bzimport set Reference to bz42259.
Nemo_bis created this task. Via Legacy · Nov 19 2012, 11:18 AM
drdee added a comment. Via Conduit · Nov 19 2012, 5:41 PM

This is totally on our roadmap, and the Analytics Team is working on this as part of Kraken.

bzimport added a comment. Via Conduit · Nov 20 2012, 2:23 AM

emw.wiki wrote:

Diederik, does the Analytics Team plan to make hourly data queryable? I think being able to see how hourly viewing patterns change over long time periods would be pretty valuable.

drdee added a comment. Via Conduit · Nov 20 2012, 2:27 AM

YES! We totally are planning on doing that.

bzimport added a comment. Via Conduit · Feb 11 2013, 3:05 PM

pf2k-wlkn wrote:

Henrik's Pageviews tool, linked from the History tab on English Wikipedia, seems buggy or broken, as mentioned [[User talk:Henrik#What article rank means exactly|here]]. I think it would be trivial to fix or replace it, as I mention [[User talk:West.andrew.g/Popular_pages/Archive 1#Possible WMF labs support for your good work|here]]. There is already [http://toolserver.org/~johang/wikitrends/english-most-visited-this-week.html this], but it only covers the top ten and is only linked from the English Wikipedia (and the Japanese one, for the Japanese version), and even there it takes a lot of looking to find it.

I'd guess that maybe 10% of the articles get 90% of the traffic. If that is the case, it would be useful to have a list of the top 10% (over the past month or the past year) to determine which articles are most popular but badly need improvement; improving much-viewed pages has more effect on the perceived quality of Wikipedia than improving seldom-viewed pages. Such a list, produced only once a month -- or even only once a year -- would be extremely useful.

bzimport added a comment. Via Conduit · Feb 12 2013, 1:59 AM

pf2k-wlkn wrote:

Some WikiProjects are compiling popularity data and using it to improve popular articles; see [[Wikipedia:WikiProject Medicine/Popular_pages]]. But popularity data really needs to be readily available to other projects and other (foreign-)language Wikipedias. One person has already done a [http://toolserver.org/~johang/2012.html top 100 for 2012] (including other-language Wikipedias), but ideally this would be extended to the top 5000 or top 10% -- and also linked from the other foreign-language Wikipedias, as few people seem to know about it.
Hourly data is surely only of commercial interest -- it would help people know which hours and days are best for paid advertising in search engines. [[User:LittleBenW]]

MZMcBride added a comment. Via Conduit · Feb 12 2013, 2:15 AM

(In reply to comment #3)

YES! we totally are planning on doing that.

Is there a status update (or page on mediawiki.org) tracking this feature request?

drdee added a comment. Via Conduit · Mar 21 2013, 3:10 PM

See https://mingle.corp.wikimedia.org/projects/analytics/cards/113 for progress. Would love your input regarding In Scope / Out of Scope and User stories. Just add them to this thread in Bugzilla and I will add them to the mingle card.

scfc added a comment. Via Conduit · May 28 2013, 8:03 PM

(In reply to comment #7)

See https://mingle.corp.wikimedia.org/projects/analytics/cards/113 for
progress. Would love your input regarding In Scope / Out of Scope and User
stories. Just add them to this thread in Bugzilla and I will add them to the
mingle card.

In http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/67248 you planned a sprint at the Amsterdam hackathon. Was it successful?

drdee added a comment. Via Conduit · Jun 3 2013, 10:55 AM

I've got a first draft of the Puppet manifest; it needs some more work.
@Nemo: I don't have access to the private conversations on the cultural Wikimedia mailing lists. Can we have these discussions on the wikimedia-analytics mailing list?

Nemo_bis added a comment. Via Conduit · Jun 3 2013, 11:11 AM

(In reply to comment #9)

@Nemo: I don't have access to the private conversations on the cultural
wikimedia mailinglists. Can we have these discussions on wikimedia-analytics
mailinglist?

You could ask for access to the archives. Most of those discussions happened before the analytics list existed; in any case, I don't control where the topic is raised: this is a widespread need, so it pops up everywhere, and I just collect the links.

Doc_James added a comment. Via Conduit · Sep 19 2013, 2:19 AM

We at WikiProject Medicine are really interested in knowing how many medical articles there are and which are the most viewed in other languages. https://meta.wikimedia.org/wiki/WikiProject_Med/Tech#Metrics_requests

Looking forward to seeing a tool that can do this. Please let me know if there is anything I can do to help. The other thing that may be needed is a bot to automatically tag articles in other languages, either by "WikiProject" or as categories.

Nemo_bis added a comment. Via Conduit · Oct 21 2013, 2:50 PM

For the latest updates, see http://lists.wikimedia.org/pipermail/analytics/2013-October/thread.html#1062 "[Analytics] Back of the envelope data size", where the requirements for this bug were discussed a bit.

Nemo_bis added a comment. Via Conduit · Nov 14 2013, 10:35 AM

According to corridor rumors :), Dario said at Wikimania that some sort of pageview data is, at some point, going to be integrated into [[mw:Wikimetrics]]. If true, where is this tracked/documented, and is it a parallel effort or something that depends on this bug?

Nemo_bis added a comment. Via Conduit · Jan 9 2014, 4:30 PM

Given the silence since October, I checked the project pages a bit. I can't find any real mention of pageviews under [[wikitech:Analytics]], and under [[mw:Analytics]] the only mention of them related to Kraken is "examples of the sort of thing Kraken might store" at [[mw:Analytics/Kraken/Researcher analysis]].

As such, I believe that in the current state of things the move of this bug under "Analytics", and specifically "Kraken", was premature cookie-licking, and I'm moving it to the generic component for this sort of issue so that it can be picked up by whatever person or project wishes to do so. The "analytics" keyword stays.

bzimport added a comment. Via Conduit · Jan 13 2014, 8:48 PM

brassratgirl wrote:

Just a +1 that the stats you can get from http://stats.grok.se (pageviews for a particular article, presented in a pretty little graph) are VERY helpful for all of us who do outreach work, presentations, education, work with GLAMs, etc. -- not to mention for simple curiosity :) I'd love to see a tool that could do this for all languages.

bzimport added a comment. Via Conduit · Apr 5 2014, 3:43 PM

emw.wiki wrote:

Any updates on this?

My Wikipedia traffic visualization tool (https://toolserver.org/~emw/wikistats/) was among those listed as motivation for this ticket, but I recently decommissioned it. I see that a pageview API is on the Analytics team's Q2 2014 priorities list, but it's last: https://www.mediawiki.org/w/index.php?title=Analytics/Prioritization_Planning&oldid=850355.

My reasons for ceasing development, and eventually maintenance, of that tool are mostly unrelated to the lack of progress on this issue, but it was a notable factor. For now I'm pointing folks to http://tools.wmflabs.org/wikiviewstats/. However, I'm not aware of any tool other than mine that enables users to, in a single graph, visualize daily page views for up to 5 years, compare such data for multiple articles in a language or one article in multiple languages, or view that data in table format and download it as a CSV.

bzimport added a comment. Via Conduit · Apr 5 2014, 7:00 PM

mrjohncummings wrote:

I'd like to mention Magnus Manske's blog post about this; it includes a reply by Toby Negrin, Head of Analytics at the Wikimedia Foundation: http://magnusmanske.de/wordpress/?p=173

Nemo_bis added a comment. Via Conduit · Apr 12 2014, 12:31 PM

(In reply to mrjohncummings from comment #17)

I'd like to mention Magnus Manske's blog post about this, it includes a
reply by Toby Negrin, Head of Analytics at the Wikimedia Foundation
http://magnusmanske.de/wordpress/?p=173

I also asked a question at https://meta.wikimedia.org/wiki/Grants_talk:APG/Proposals/2013-2014_round2/Wikimedia_Foundation/Proposal_form#Multiplication_of_tools

Nemo_bis awarded a token. Via Web · Dec 12 2014, 8:20 AM
Ricordisamoa added a subscriber: Ricordisamoa. Via Web · Dec 18 2014, 4:41 PM
MZMcBride added a comment. Via Web · Feb 11 2015, 5:26 AM

I'm told there's now an internal Hive cluster populated with page view data that's being regularly used/queried by the Wikimedia Foundation and select researchers. The remaining piece for this task is then to expose this Hive cluster to the outside world.

Nemo_bis added a comment. Via Web · Feb 11 2015, 8:35 AM

"wmf.webrequest contains the 'refined' webrequest data. This table is currently considered experimental." https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive

Milimetric added a comment. Via Web · Feb 20 2015, 2:21 PM

For operational reasons, it's nearly impossible to expose a Hive cluster to the world. Someone could easily run a query like "where YEAR > 0" and make the cluster scan terabytes of data. I agree that the existence of wmf.webrequest is a fantastic step forward, and it took a lot of work to get it as stable and monitored as it is today. The remaining work is roughly:

  • sanitize any remaining operational data in the logs (IPs etc.)
  • aggregate to the page and hour level (see the sketch below)
  • find a simple but spacious database from which to serve the data. RESTBase seems like a good candidate, but it may be tough to support this probably large new usage of it.
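
To make the aggregation bullet concrete, here is a minimal Python sketch of the idea (my illustration, not the team's actual Hadoop/Hive job): roll raw request records up to (project, page title, hour) counts, dropping operational fields such as IPs along the way.

```python
from collections import Counter
from datetime import datetime

def aggregate_hourly(raw_requests):
    """Roll raw request records up to (project, page_title, hour) view counts.

    raw_requests: iterable of dicts with at least 'project', 'page_title'
    and an ISO 8601 'timestamp'; any other fields (IPs etc.) are ignored.
    """
    counts = Counter()
    for r in raw_requests:
        hour = datetime.fromisoformat(r["timestamp"]).strftime("%Y-%m-%dT%H:00")
        counts[(r["project"], r["page_title"], hour)] += 1
    return counts

sample = [
    {"project": "en.wikipedia", "page_title": "Rembrandt",
     "timestamp": "2015-05-24T13:05:00", "ip": "192.0.2.1"},
    {"project": "en.wikipedia", "page_title": "Rembrandt",
     "timestamp": "2015-05-24T13:47:00", "ip": "192.0.2.2"},
]
print(aggregate_hourly(sample))
# Counter({('en.wikipedia', 'Rembrandt', '2015-05-24T13:00'): 2})
```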
Milimetric added a comment. Via Web · Feb 20 2015, 2:22 PM

Work on all three remaining items has been started, by the way.

Arjunaraoc added a subscriber: Arjunaraoc. Via Web · Mar 6 2015, 11:17 AM
jeremyb-phone added a subscriber: jeremyb. Via Web · May 18 2015, 6:13 PM
jeremyb-phone set Security to None.
Nemo_bis edited the task description. Via Web · May 18 2015, 7:47 PM
PKM added a subscriber: PKM. Via Web · May 18 2015, 9:09 PM
Aubrey added a comment. Via Web · May 20 2015, 5:47 PM

Hello. Is there any kind of update on this? I think this is one of the most wanted features/tools in the whole Wikimedia world; it would be nice to understand whether we are any closer to a real, usable, non-English-Wikipedia-centric tool :-) Thanks to whoever is working on this. We Wikimedians are nitpicky, but we love you.

Milimetric added a comment. Via Web · May 21 2015, 4:31 PM

I'd love to start a more open discussion about our progress on this. Here's the recent history and where we are:

  • February 2015: with data flowing into the Hadoop cluster, we defined which raw webrequests count as "page views". The research is here and the code is here
  • March 2015: we used this page view definition to create a raw pageview table in Hadoop. This is queryable via Hive, but it's about 3 TB of data per day, so we don't have the resources to expose it publicly
  • April 2015: we queried this data internally, but it overloaded our cluster and queries were slow
  • May 2015: we're working on an intermediate aggregation that would total up page counts by hour over the dimensions we think most people care about. We estimate this will cut the size down by a factor of 50

Progress has been slow mostly because Event Logging is our main priority and it's been having serious scaling issues. We think we have a good handle on the Event Logging issues after our latest patch, and in a week or so we're going to mostly focus on the Pageview API.

Once this new intermediate aggregation is done, we'll hopefully free up some cluster resources and be in a better position to load up a public API. Right now, we are evaluating two possible data pipelines:

Pipeline 1:

  • Put daily aggregates into PostgreSQL. We think per-article hourly data would be too big for PostgreSQL.

Pipeline 2:

  • Query data from the Hive tables directly with Impala. Impala is suited to small and medium-sized data, but it is much faster than Hive. We might be able to query the hourly data if we use this method.

Common Pipeline after we make the choice above:

  • Mondrian builds OLAP cubes and handles caching, which is very useful with this much data
  • point RESTBase to Mondrian and expose the API publicly at restbase.wikimedia.org. This will be a reliable public API that people can build tools around
  • point Saiku to Mondrian and make a new public website for exploratory analytics. Saiku is an open-source OLAP cube visualization and analysis tool

Hope that helps. As we get closer to making this API real, we would love your input, participation, questions, etc.
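
As a purely hypothetical illustration of the RESTBase route (the host is taken from the comment above, but the path shape and parameters are my assumptions, not a committed design), a client might eventually fetch pageview counts like this:

```python
import requests

BASE = "https://restbase.wikimedia.org"  # host mentioned above; endpoint shape below is hypothetical

def hourly_views(project, title, start, end):
    # Hypothetical path: /{project}/v1/pageviews/{title}/{start}/{end}
    url = f"{BASE}/{project}/v1/pageviews/{title}/{start}/{end}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Would only work once such an endpoint actually exists:
# hourly_views("en.wikipedia.org", "Rembrandt", "2015060100", "2015060200")
```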

ezachte added a subscriber: ezachte. Via Web · May 21 2015, 5:50 PM

Awesome to see this progress!

Will there be normalization of titles? In Domas' pagecount files any title appears as typed (or as changed by the browser), and some of those variations are not significant but merely trivial encoding differences (thus making it harder to collect overall access counts, and also bloating file size). For media file request counts, Christian Aistleitner wrote a UDF to get rid of these meaningless variations by first decoding the string, then encoding it in a standard format (which includes %44 for comma to ease use of the output in CSV files). See https://gerrit.wikimedia.org/r/#/c/169346/1/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/IdentifyMediaFileUrl.java

Will redirects be resolved? I have no idea how complex, and resource-intensive, that would be in a Hive environment, but it would mimic stats.grok.se better, where this also happens.

Wittylama added a subscriber: Wittylama. Via Web · May 21 2015, 7:29 PM
Milimetric added a subscriber: JAllemandou. Via Web · May 22 2015, 1:53 PM

Will there be normalization of titles? In Domas' pagecount files any title appears as typed (or changed by browser), and some of those variations are not significant but merely trivial encoding differences (thus making collecting overall access counts harder, but also bloating file size). For media file request counts Christian Aistleitner wrote udf to get rid of these meaningless variations, by first decoding the string, then encoding in standard format (which includes %44 for comma to ease use of output in csv files). See https://gerrit.wikimedia.org/r/#/c/169346/1/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/IdentifyMediaFileUrl.java

Thanks for this pointer; I didn't know Christian had done that (he's so awesome :)). One of the things @JAllemandou is currently working on is parsing the article title out of uri_path and uri_query. I'm sure he'll find Christian's work useful, and I agree that we should normalize the titles as part of the same parse.

Will redirects be resolved? I have no idea how complex, and resource intensive, that would be in hive environment, but it would mimic stats.grok.se better, where this also happens.

We are exposing the HTTP status code in our intermediate aggregation, but we haven't yet decided how to handle the data further toward the API end of the pipeline. If article A is a redirect to article B, I would think it'd be useful to count hits to A, as well as include hits to B regardless of how people got there. This would mean that adding up hits for articles A and B would give a sum greater than 100% of the hits, but maybe we should make it easy to exclude redirects from such aggregations.

Ijon awarded a token. Via Web · May 23 2015, 2:28 AM
Ijon added a subscriber: Ijon.
Ijon added a comment. Via Web · May 23 2015, 2:34 AM

Thanks for the detailed update, @Milimetric! Very encouraging. I look forward to being able to build a tool around the RESTBase API endpoint.

MartinPoulter added a subscriber: MartinPoulter. Via Web · May 27 2015, 2:16 PM
ezachte added a comment. Via Web · May 27 2015, 4:20 PM

@Milimetric I am a bit confused about redirects, and since I asked about it:

I remember Christian told me the site does return a 3xx code after a redirect and expects the browser to reissue the request with the new URL, but that was about HTTP redirects, not #REDIRECT redirects.

I tried https://en.wikipedia.org/wiki/Rembrand and got https://en.wikipedia.org/wiki/Rembrandt with a 200 status.

So it seems that, in "If article A is a redirect to article B, I would think it'd be useful to count hits to A, as well as include hits to B regardless of how people got there", the latter part ("as well as ...") is automatically taken care of. That would leave two possible functionalities for the former part:

1. As you said, also presenting hits to A separately (this is also what stats.grok.se does).
2. Also resolving redirects at API query time, so that people who ask the API about "Barack" get the same response as for "Barack Obama" (plus maybe a signal that a redirect occurred); otherwise some inquisitive journalist who doesn't know or care about redirects might find a low count and misinterpret it.

My 2 cents

jeremyb added a comment. Via Web · May 27 2015, 4:27 PM

I don't understand what that means.

GET https://en.wikipedia.org/wiki/Rembrand [HTTP/1.1 200 OK 463ms]

Your browser never fetches https://en.wikipedia.org/wiki/Rembrandt at all; it's just https://en.wikipedia.org/wiki/Rembrand setting a new URL for the page without doing a new fetch. See T37045: Use history.replaceState to rewrite redirect urls.

ezachte added a comment. Via Web · May 27 2015, 4:30 PM

A separate comment on unexpected results from a test: I created 2 pages

https://meta.wikimedia.org/wiki/User:Erik_Zachte_(WMF)/test and
https://meta.wikimedia.org/wiki/User:Erik_Zachte_(WMF)/test_redirect, which redirects to the first URL.

I requested the second URL about 20 times in Chrome and found:
pagecounts-20150524-130000.gz:meta.m User:Erik_Zachte_(WMF)/test 2 12389
pagecounts-20150524-130000.gz:meta.m User:Erik_Zachte_(WMF)/test_redirect 1 5859

So I figured the browser cache got in the way, and tried again the next day with the local cache disabled (Inspect Element | Network | Disable cache), requesting the 2nd URL again about ten times; but none of those requests showed up in pagecounts-20150525-* at all.
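
For reference, the quoted lines are grep output over the hourly pagecounts dumps: the file name, then a record of the form "project page_title view_count bytes_transferred". A small parsing sketch (my own illustration):

```python
def parse_grep_line(line):
    # grep output: "<file name>:<record>"; titles use underscores, so the
    # record itself can be split on single spaces.
    filename, record = line.split(":", 1)
    project, title, views, size = record.split(" ")
    return {"file": filename, "project": project, "title": title,
            "views": int(views), "bytes": int(size)}

line = "pagecounts-20150524-130000.gz:meta.m User:Erik_Zachte_(WMF)/test 2 12389"
print(parse_grep_line(line))
# {'file': 'pagecounts-20150524-130000.gz', 'project': 'meta.m',
#  'title': 'User:Erik_Zachte_(WMF)/test', 'views': 2, 'bytes': 12389}
```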

jeremyb added a comment. Via Web · May 27 2015, 4:36 PM

Ten times doesn't sound like a lot?

Is there some kind of threshold for making sure it came from at least X IPs?

Anyway, I suggest testing with curl (CLI), not just a browser.

MartinPoulter added a comment. Via Web · May 27 2015, 4:36 PM

I'm pleased to see updates on this: I'm presently Wikipedian In Residence at the Bodleian Libraries in Oxford, UK and in contact with other organisations that have lots of text. I'm trying to persuade cultural organisations to take Wikisource seriously and to make it part of their workflow for sharing free text. This is an awesome opportunity for Wikisource, but of course these organisations want the same kind of reporting tools for shared text that they rely on for monitoring hits on their images.

As a volunteer editor, I don't care much what the hits are on a category of pages, but as someone building relationships in the cultural sector, it's a handicap if I can't yet tell partner organisations how many people have seen their stuff. Again, good to see progress and I will be watching developments keenly.

jeremyb added a comment (edited). Via Email · May 27 2015, 4:42 PM

Also, a tangent from redirects but useful in other cases too: put the page ID (the primary key from the DB) in the logs, and maybe even the revision ID. If it's not already there, it can be packed into the X-Analytics header by MediaWiki and pulled from that header by the aggregator.

It could be used, e.g., to track pages across page moves.
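
For context, the X-Analytics header is a semicolon-separated list of key=value pairs, so pulling a page_id out of it (assuming MediaWiki puts one there, which is exactly what this comment proposes) is straightforward. A small sketch, with made-up example values:

```python
def parse_x_analytics(header):
    """Parse an X-Analytics header ("k1=v1;k2=v2;...") into a dict."""
    pairs = (item.split("=", 1) for item in header.split(";") if "=" in item)
    return {k.strip(): v.strip() for k, v in pairs}

print(parse_x_analytics("ns=0;page_id=12345;https=1"))
# {'ns': '0', 'page_id': '12345', 'https': '1'}
```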

Milimetric added a comment. Via Web · May 27 2015, 6:44 PM

@jeremyb and @ezachte: the page id is included for some requests (except mobile apps I believe), in the X-Analytics header (should be documented here but is not: https://wikitech.wikimedia.org/wiki/X-Analytics).

Until the page ID is available everywhere, which we are trying to make happen, we have decided to try to parse the article name in as canonical a way as possible. I suppose the page ID would solve both canonical article names and part of the problem with redirects. Erik, thanks for thinking that through; I'll make sure to reference your comment when I code-review the aggregation.

Milimetric added a comment. Via Web · May 29 2015, 4:21 PM

I have a somewhat clearer update on the redirects issue. Basically, we will only show pageviews as defined in the pageview definition, and that means only status codes in the 200s.

We hope this will cover most use cases for the pageview API, but if people need more, we can build other datasets or allow access to the cluster where needed and possible. Help us think through whether this is a problem and whether we're missing a major use case.
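
As a greatly simplified illustration of that rule (the real pageview definition in the analytics refinery also checks content types, URL patterns, user agents, and more), the status-code part of the filter amounts to:

```python
def counts_as_pageview(status_code, mime_type):
    # Simplified: only successful (2xx) HTML responses are counted.
    return 200 <= status_code < 300 and mime_type.startswith("text/html")

print(counts_as_pageview(200, "text/html; charset=UTF-8"))  # True
print(counts_as_pageview(301, "text/html"))                 # False: redirect responses are excluded
print(counts_as_pageview(404, "text/html"))                 # False: missing pages are excluded
```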

ezachte added a comment. Via Web · May 31 2015, 10:10 AM

As for status codes other than 200: a few people have asked for a list of the most requested non-existing pages, obtained by counting 404s. I see some merit in this, but it's limited and low priority. I also think this should not be part of an API per se, but rather a periodically updated static list.

See for example the 2nd, 3rd, and 4th tables at
http://stats.wikimedia.org/wikimedia/pagecounts/reports/2012-12/most-requested-pages-2012-12-wikipedia-EN.html (a one-off).

Nemo_bis added a comment. Via Web · May 31 2015, 4:54 PM

@jeremyb and @ezachte: the page id is included for some requests (except mobile apps I believe), in the X-Analytics header (should be documented here but is not: https://wikitech.wikimedia.org/wiki/X-Analytics).

Until page id is available everywhere, which we are trying to make happen,

Cf T92875: Add page_id and namespace to X-Analytics header in App / api requests

Milimetric added a comment. Via Web · Fri, Jun 5, 1:24 PM

Status update / We'd like your opinion

We have finished analyzing the intermediate hourly aggregate with all the columns that we think are interesting. The data is too large to query and anonymize in real time. We'd rather get an API out faster than deal with that problem, so we decided to produce smaller "cubes" [1] of data for specific purposes. We have two cubes in mind and I'll explain those here. For each cube, we're aiming to have:

  • Direct access to a PostgreSQL database in Labs with the data
  • API access through RESTBase
  • Mondrian / Saiku access in Labs for dimensional analysis
  • Data will be pre-aggregated so that any single data point has k-anonymity (we have not determined a good k yet)
  • Higher-level aggregations will be pre-computed so they use all the data

And, the cubes are:

stats.grok.se Cube: basic pageview data

Hourly resolution. Will serve the same purpose as stats.grok.se has served for so many years. The dimensions available will be:

  • project - 'Project name from the request's host name'
  • dialect - 'Dialect from the request's path (not set if present in the project name)'
  • page_title - 'Page title from the request's path and query'
  • access_method - 'Method used to access the pages; can be desktop, mobile web, or mobile app'
  • is_zero - 'accessed through a Wikipedia Zero provider'
  • agent_type - 'Agent accessing the pages; can be spider or user'
  • referer_class - 'Can be internal, external or unknown'

Geo Cube: geo-coded pageview data

Daily resolution. Will allow researchers to track the flu, breaking news, etc. Dimensions will be:

  • project - 'Project name from the request's hostname'
  • page_title - 'Page title from the request's path and query'
  • country_code - 'ISO country code of the accessing agents (computed using the MaxMind GeoIP database)'
  • province - 'State / province of the accessing agents (computed using the MaxMind GeoIP database)'
  • city - 'Metro area of the accessing agents (computed using the MaxMind GeoIP database)'

So, if anyone wants another cube, now is the time to speak up. We'll probably add cubes later, but it may be a while.

[1] OLAP cubes: https://en.wikipedia.org/wiki/OLAP_cube
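
To give a feel for the shape of these cubes, here is one possible layout for a single pre-aggregated data point in each (my illustration only; the name of the count field and the exact value formats are assumptions, not a published schema):

```python
# One row of the basic (stats.grok.se-style) cube, hourly resolution.
basic_cube_row = {
    "project": "en.wikipedia.org",
    "dialect": None,                # language variant, when present
    "page_title": "Rembrandt",
    "access_method": "mobile web",  # desktop / mobile web / mobile app
    "is_zero": False,
    "agent_type": "user",           # user / spider
    "referer_class": "external",    # internal / external / unknown
    "hour": "2015-06-01T13:00",
    "view_count": 4210,             # assumed name for the measure
}

# One row of the geo cube, daily resolution.
geo_cube_row = {
    "project": "en.wikipedia.org",
    "page_title": "Influenza",
    "country_code": "GB",
    "province": "England",
    "city": "Oxford",
    "day": "2015-06-01",
    "view_count": 1530,
}
```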

Rubin16 added a subscriber: Rubin16. Via Web · Fri, Jun 5, 1:29 PM
ezachte added a comment. Via Web · Fri, Jun 5, 1:43 PM

I'd like to suggest monthly resolution as well. There is now a monthly aggregation job (a Perl script) whose output is used by several community scripts, and also by Wikistats itself for ranking page views by popularity within a set of categories.
(Making a list of qualifying articles can stay a separate function to be dealt with later; Wikistats traverses the category tree via the API and can prune blacklisted categories.)

See for example these reports:
http://stats.wikimedia.org/wikimedia/pageviews/categorized/wp-en/2014-03/ ,
http://stats.wikimedia.org/wikimedia/pageviews/categorized/

Daily and monthly aggregations are at
http://dumps.wikimedia.org/other/pagecounts-ez/merged/

Aubrey added a comment. Via Web · Fri, Jun 5, 1:46 PM

@Milimetric Wow, thanks. I think this deserves a proper Wikimedia-l thread: it is important to make many people aware of this opportunity. Please consider it. Thanks.

Steko added a subscriber: Steko. Via Web · Fri, Jun 5, 2:10 PM
jeremyb added a comment. Via Email · Fri, Jun 5, 2:52 PM
  • referer_class - 'Can be internal, external or unknown'

Maybe it would be useful to have an extra class between external and internal? E.g. for an interwiki link from one WMF wiki to another, or from wmflabs. Or maybe that would mess too much with anonymization.

Wittylama removed a subscriber: Wittylama. Via Web · Fri, Jun 5, 2:55 PM
Ragesoss added a subscriber: Ragesoss. Via Web · Fri, Jun 5, 5:57 PM
Magnus added a comment. Via Web · Fri, Jun 5, 6:20 PM

First, thank you for getting this going after so many years!

That said, I second ezachte's monthly aggregate. It should be tiny compared to the other cubes, and my GLAM tools basically use monthly aggregates anyway.

Milimetric added a comment. Via Web · Fri, Jun 5, 9:08 PM

@Magnus and @ezachte, as far as monthly aggregates go, those will be included. Anything above the specified resolution will be part of the cube. I'll explain in a bit more detail what I mean by that, as I'm curious about both of your opinions:

Let's say we want k = 1000 anonymity, and we have the following simple data:

project C, page A viewed 10 times hourly
project C, page B viewed 2000 times hourly

then, our cube would have, for one month, let's call it "month 1":

month 1, day [1-30], hour [0-23], page A: removed by k-anonymizer, only 10 hits
month 1, day [1-30], all hours, page A: removed by k-anonymizer, only 240 hits
month 1, all days, all hours, page A: 7200 hits

month 1, day [1-30], hour [0-23], page B: 2000 hits
month 1, day [1-30], all hours, page B: 48000 hits
month 1, all days, all hours, page B: 1440000 hits

and here it gets interesting

month 1, day [1-30], hour [0-23], all pages: 2010 hits
month 1, day [1-30], all hours, all pages: 48240 hits
month 1, all days, all hours, all pages: 1447200 hits

So note that aggregate data will include data that ends up being removed at the lower aggregation levels. Obviously if there were only two pages this would cause a problem, and we have to look very carefully at cases like this that could allow de-anonymization on smaller wikis. We already release the basic pageview data, but there are potential problems now that we're going to release the geo data as well. People have spoken up, concerned about maintaining anonymity with this extra dataset. If anyone is an expert in the area, I would love your help or input. This article is of interest here: http://dl.acm.org/citation.cfm?id=2660766
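
The suppression rule in the example above can be sketched in a few lines, assuming k = 1000 and the two-page toy data (an illustration of the idea only, not the planned implementation):

```python
K = 1000  # assumed k; a good value has not been chosen yet

# Hourly counts for one hour of "month 1" in the toy example above.
hourly = {("page A", "hour 0"): 10, ("page B", "hour 0"): 2000}

# Only cells with at least K views are published at this aggregation level.
published = {cell: n for cell, n in hourly.items() if n >= K}
print(published)             # {('page B', 'hour 0'): 2000} -- page A is suppressed

# Higher-level rollups are computed from the full data before suppression,
# so the "all pages" total still includes page A.
print(sum(hourly.values()))  # 2010
```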

Milimetric added a comment. Via Web · Fri, Jun 5, 9:10 PM
  • referer_class - 'Can be internal, external or unknown'

maybe would be useful to have an extra class between external and internal?
e.g. for an interwiki from one WMF wiki to another. or from wmflabs. or
maybe that would mess too much with anonymization.

Jeremy, that's certainly useful, but we're trying to limit the dimensions as much as possible. Some people were concerned there would already be too much data, so we may have to rethink some of the less popular fields. It would be awesome if people on this thread could describe different versions of cubes that would satisfy their use cases; then we could make informed decisions with the community in mind.

Tgr added a subscriber: Tgr. Via Web · Sat, Jun 6, 12:51 AM

Awesome, thanks to everyone who has worked on this!

stats.grok.se Cube: basic pageview data

Will that count the same way stats.grok.se does (i.e. no aggregation of redirects)?

Some wikis use a hack once introduced by wikistics, where search results served from Special:Search/<search string> are then counted as separate wiki pages so people can get statistics about search queries (the ones that did not match an article, anyway). Will that continue to work?

Milimetric added a comment. Via Web · Mon, Jun 8, 10:19 PM

stats.grok.se Cube: basic pageview data

Will that count the same way stats.grok.se does (ie. no aggregation of redirects)?

I believe so; redirects are not counted as pageviews in the new definition, and we're only exposing pageview data as per the new definition. I think that's how the old pageview data handled it as well, but I'm not 100% sure.

Some wikis use a hack once introduced by wikistics, where search results are served from Special:Search/<search string> is then counted as a separate wiki page so people can get statistics about search queries (the ones that did not match an article, anyway). Will that continue to work?

That ... should work, though I haven't checked it myself.

JAllemandou added a comment. Via Web · Tue, Jun 9, 9:23 AM

I confirm that the pattern Special:Search/<search string> is present, with what seem to be search values as <search string>.

ezachte added a comment. Via Web · Tue, Jun 9, 1:41 PM

@Milimetric, about k-anonymity (new to me, interesting): does that apply only to the "geo-coded cube"? I see little harm in showing all page counts for the "basic pageview data cube", even though the resolution is hourly in this proposal and daily at stats.grok.se. After all, we have already published hourly counts for every page since 2008 in gz files (and the extra dimensions proposed here are very generic by nature). I certainly can see how a geo-coded breakdown of sensitive (e.g. politically sensitive) pages down to city level would require some protection.

ezachte added a comment. Via Web · Tue, Jun 9, 1:59 PM

Also, can I ask about the level of detail in the dimensions and its performance implications? Breaking geo data down to city level and temporal data down to hourly level will make the data store huge. I can see some use cases for both, but maybe just a few.

https://rest.wikimedia.org/en.wikipedia.org/v1/?doc says "As a general rule, don't perform more than 200 requests/s to this API." Is that per user, or all users combined?

Might we be better off being prudent and, e.g., adding city level only later, once we have better insight into the performance and popularity of the API? I have no strong opinion here, just asking. And I'm also curious about the use cases for such a detailed breakdown. Here are two: I can see how people would probably want to break down page requests for politicians by the smallest allowed region (especially during elections), and how trending articles (e.g. world disasters) are best plotted by the hour. Any other common use cases?

ezachte added a comment (edited). Via Web · Tue, Jun 9, 2:12 PM

Here is a question from Christoph Braun (posted on the cultural-partners list).

"Being a German Wikipedia editor I wondered how Umlauts and other non-standard characters are affecting statistics. E.g. Does Österreich count as Oesterreich or are there any other character encodings that are counted as well?" (Note: the mail says Österreich twice, but I suppose the variations were lost in transmission.)

My personal take is that this would complicate things considerably, as these variations would be language-specific. Better to resolve it outside the API? And of course we already use redirects for this, although those need manual care.

Nemo_bis added a comment. Via Web · Tue, Jun 9, 2:42 PM

Thanks for getting this moving. I only have some minor comments.

  • Direct access to a postgresql database in labs with the data
  • API access through RESTBase
  • Mondrian / Saiku access in labs for dimensional analysis
  • Data will be pre-aggregated so that any single data point has k-anonymity (we have not determined a good k yet)
  • Higher level aggregations will be pre-computed so they use all data

(The comprehensiveness here is great.)
So RESTBase will take care of the caching for web API access? What will the access point look like? The Parsoid RESTBase only allows querying by page_title or rev_id; what arguments will be available?

stats.grok.se Cube: basic pageview data

Hourly resolution. Will serve the same purpose as stats.grok.se has served for so many years.

How can compatibility with existing tools be tested before the specs are set in stone?

The dimensions available will be:
  • project - 'Project name from requests host name'

So the domain, not the dbname nor the "classic" code? How expensive would it be to add resolution/a column for the old codes as well (like la, la.q, meta.m)?

  • dialect - 'Dialect from requests path (not set if present in project name)'

It took me some minutes to guess what you meant here. These are called "variants" in languages/LanguageConverter.php.

  • page_title - 'Page Title from requests path and query'

With what encoding?

  • access_method - 'Method used to access the pages, can be desktop, mobile web, or mobile app'

So there will be three "rows" for each page and the clients will take care of the aggregation?

  • is_zero - 'accessed through a zero provider'
  • agent_type - 'Agent accessing the pages, can be spider or user'
  • referer_class - 'Can be internal, external or unknown'

How do you know what's a spider? Is there a way to include the really interesting information, i.e. how many requests the same IP (or hash thereof) has performed in a certain timespan?
I don't understand the point of including is_zero and referer_class in this cube; they seem more relevant for the geo cube. Are the definitions consistent with the mediacounts' referrer columns?

Let's say we want k = 1000 anonymity, and we have the following simple data: [...]

I assume this is only for the geo cube? Even a threshold of 1/hour or 1/day is very high for most pages on our wikis and would make most stats vanish. Once the additional information currently not present in the stats is moved to secondary cubes, what's the reason to scrap data from the main cube? (I'll read the paper later, probably next week.)

Tgr added a comment. Via Web · Tue, Jun 9, 4:50 PM

(note: the mail says two times Österreich but I suppose the variations were lost in transmission)

I would guess he is talking about Unicode normalization (i.e. you can write Ö as a single Unicode character or as an O + a combining umlaut).

Tgr added a comment. Via Web · Tue, Jun 9, 4:52 PM

I see little harm in showing all page counts for "basic pageview data cube" even when the resolution is hourly in this proposal and daily at stats.grok.se.

For the above-mentioned search hack, it would be useful (if I understand correctly how it would work), as it would filter out weird stuff accidentally pasted into the search box (I've heard passwords sometimes end up there, for example).

Milimetric added a comment. Via Web · Tue, Jun 9, 5:35 PM

@Milimetric, about k-anonimity (new for me, interesting): does that apply only to "geo coded cube"? I see little harm in showing all page counts for "basic pageview data cube" even when the resolution is hourly in this proposal and daily at stats.grok.se. After all we already publish hourly counts for every page since 2008 in gz files (and the extra dimensions proposed here are very generic by their nature). I certainly can see how geo coded breakdown of sensitive (e.g. politically) pages up to city level would require some protection.

At first, that was my thinking too: that k-anonymity would only really be needed for the geo cube. Now, after reading up on de-anonymization attacks, I'm much more cautious. From the basic cube alone, I came up with several attacks that could take advantage of access_method and referer_class. So we're considering leaving those dimensions out of the basic cube, or altering them in some way to prevent these kinds of attacks. But ideally we'd have someone who's really good at this kind of puzzle working with us.

Also can I ask about the level of detail in dimensions and its performance implications? Breaking down geo data to city level and temporal data to hourly level will make the data store huge. I can see some use cases for both, but maybe just a few.

If geo is reported at city level, we would only allow daily resolution. Even if we didn't, k-anonymity would truncate the vast majority of data points here.

https://rest.wikimedia.org/en.wikipedia.org/v1/?doc says "As a general rule, don't perform more than 200 requests/s to this API." Is that per user, or all users combined?

I think that's per user, but this does not affect us, because we'd have a separate RESTBase backend. So we'd have to figure out what level of service the new backend could offer.

Might we be better be prudent by e.g. adding city level only later, once we have better insight in performance and popularity of the API? I have no strong opinion here, just asking. And I'm also curious about the use cases for such detailed breakdown. Here are two: I can see how people probably want to breakdown page requests for politicians by smallest allowed region (especially during elections). I can see how trending articles (e.g. world disasters) best are plotted by the hour. Any other common use cases?

The main geo use case that started the discussion around this cube is flu tracking. Those researchers, together with Dario speaking for possible future use cases, feel that daily resolution is OK as long as the geo data is as accurate as possible.

Milimetric added a comment. Via Web · Tue, Jun 9, 5:37 PM

Here is a question from Christoph Braun (posted on cultural-partners list).

"Being a German Wikipedia editor I wondered how Umlauts and other non-standard characters are affecting statistics. E.g. Does Österreich count as Oesterreich or are there any other character encodings that are counted as well? " (note: the mail says two times Österreich but I suppose the variations were lost in transmission)

My personal take is this would complicate think considerably, as these variations would be language specific. Better resolve outside the API? And of course we use redirects for this already, although those need manual care.

We have an article title normalization function that we *could* do this in. But, as you say, I think it might make sense to do it outside of that function. Maybe someone who likes doing this manual redirect care could use the API to fetch pageviews for different versions of popular topics. Something like this maybe:

  • fetch top 10 articles
  • fetch variations on top 10 articles
  • come up with a list of redirects that should be put in place.
Milimetric added a comment (edited). Via Web · Tue, Jun 9, 5:54 PM

Thanks for getting this moving. I only have some minor comments.

  • Direct access to a postgresql database in labs with the data
  • API access through RESTBase
  • Mondrian / Saiku access in labs for dimensional analysis
  • Data will be pre-aggregated so that any single data point has k-anonymity (we have not determined a good k yet)
  • Higher level aggregations will be pre-computed so they use all data

(The comprehensiveness here is great.)
So RESTBase will take care of the caching for web API access? How will the access point look like? The parsoid restbase only allows to query by page_title or rev_id, what arguments will be available?

We took a stab at this in the past and it's a bit too complex to post here. I think we would go for an incrementally better API by first making endpoints that present data similarly to stats.grok.se, then go on from there to add endpoints that people find useful. The Mondrian / Saiku access may be better than custom endpoints, though, so I'd like to wait and see how that works.

stats.grok.se Cube: basic pageview data

Hourly resolution. Will serve the same purpose as stats.grok.se has served for so many years.

How to test compatibility with existing tools before the specs are set in stone?

I don't think it's feasible to go for a drop-in replacement. We'll consider it when we think about the output format, but it might not make sense. In any case, no decisions have been taken there. My personal preference would be to aim for general usefulness instead of backwards compatibility, but feel free to open a discussion here and come up with a format. I'm happy to override my personal preference :)

The dimensions available will be:
  • project - 'Project name from requests host name'

So the domain, not the dbname nor the "classic" code? How expensive would it be to add resolution/a column for the old codes as well (like la, la.q, meta.m)?

It's less that it's expensive and more that it's annoying to keep baking that code into all our pipelines. I feel like we could maybe provide an API for this translation; maybe that's more generally useful. What do people think? Which codes specifically would make your lives easier? (A toy sketch of such a translation follows below.)
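
A toy version of that translation, using only the codes mentioned in this thread (a hypothetical helper, not a committed API):

```python
# Hypothetical mapping from request hostnames to the "classic" pagecounts-style
# codes; only the examples quoted in this thread are included.
CLASSIC_CODES = {
    "la.wikipedia.org": "la",
    "la.wikiquote.org": "la.q",
    "meta.wikimedia.org": "meta.m",
}

def classic_code(hostname):
    return CLASSIC_CODES.get(hostname, hostname)  # fall back to the raw domain

print(classic_code("la.wikiquote.org"))  # la.q
```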

  • dialect - 'Dialect from requests path (not set if present in project name)'

It took me some minutes to guess what you meant here. These are called "variants" in languages/LanguageConverter.php.

Thanks :) We'll have to change "dialect" to "variant" then. This is a useful field, though, right?

  • page_title - 'Page Title from requests path and query'

With what encoding?

UTF-8, I guess; any opinions?

  • access_method - 'Method used to access the pages, can be desktop, mobile web, or mobile app'

So there will be three "rows" for each page and the clients will take care of the aggregation?

All aggregation levels will be accessible through one of the paths above (API, Mondrian, etc.), so the clients shouldn't have to do that. That said, as I was saying in a separate reply to Erik Z, I think access_method might give away too much information in some cases. I'd love to talk to a privacy specialist before we do anything with that field.

  • is_zero - 'accessed through a zero provider'
  • agent_type - 'Agent accessing the pages, can be spider or user'
  • referer_class - 'Can be internal, external or unknown'

How do you know what's a spider? Is there a way to include the really interesting information, i.e. how many requests the same IP (or hash thereof) has performed in a certain timespan?

"Spider" is determined by a function we run on the User-Agent (a simplified illustration follows below). It should catch most well-behaved bots, crawlers, etc. It doesn't catch all automated access agents, and we're not shooting for that; we're just going for the best we can. For information specific to certain bots or IPs, a researcher would have to sign an NDA and gain access to the raw data.
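
A simplified illustration of such a User-Agent check (my own sketch; the refinery's actual classifier is more elaborate):

```python
import re

# Very rough heuristic: flag UAs that self-identify as automated agents.
SPIDER_PATTERN = re.compile(r"bot|crawler|spider|wget|curl", re.IGNORECASE)

def agent_type(user_agent):
    return "spider" if SPIDER_PATTERN.search(user_agent or "") else "user"

print(agent_type("Mozilla/5.0 (compatible; Googlebot/2.1)"))    # spider
print(agent_type("Mozilla/5.0 (Windows NT 6.1) Firefox/38.0"))  # user
```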

I don't understand the point of including is_zero and referer_class in this cube, they seem more relevant for the geo cube. Are the definitions consistent with the mediacounts' referrer columns?

Yeah, is_zero and referer_class have been deemed not that useful, and kind of scary given certain privacy attack scenarios we realized just yesterday, so those will probably not be included. I'll have to make a comprehensive update when we have our ducks in a row.

Let's say we want k = 1000 anonymity, and we have the following simple data: [...]

I assume this is only for the geo cube? Even a threshold of 1/hour or 1/day is very high for most pages in our wikis, would make most stats vanish. Once the additional information currently not present in stats is move to secondary cubes, what's the reason to scrap data from the main cube? (I'll read the paper later, probably next week.)

Yes, same as above: we want to get rid of the dimensions that are scary from a privacy standpoint, and then we won't have to apply k-anonymity here. I think that's a good trade-off, unless others differ.

Tgr added a comment (edited). Via Web · Tue, Jun 9, 8:39 PM
  • page_title - 'Page Title from requests path and query'

With what encoding?

UTF8, I guess, any opinions?

Not sure I understand the question. Isn't the request path / query part of a URL just a sequence of bytes without any associated encoding? Are you going to do something other than just return that byte sequence? (Which is in practice almost always UTF-8, but I don't think you have any control over it.)

Milimetric added a comment. Via Web · Tue, Jun 9, 11:33 PM
In T44259#1350583, @Tgr wrote:
  • page_title - 'Page Title from requests path and query'

With what encoding?

UTF8, I guess, any opinions?

Not sure I understand the question. Isn't the request path / query part of an URL just a sequence of bytes without any associated encoding? Are you going to do something other than just return that byte sequence? (Which is in practice almost always UTF-8 but I don't think you have any control over it.)

I guess we could set the encoding on the HTTP response, which is not something I want to do unless there's a really good reason for it. Normally that's just set to UTF-8, yeah.

Nemo_bis added a comment. Via Web · Wed, Jun 10, 6:30 AM

Thanks, Milimetric.

Not sure I understand the question. Isn't the request path / query part of an URL just a sequence of bytes without any associated encoding? Are you going to do something other than just return that byte sequence? (Which is in practice almost always UTF-8 but I don't think you have any control over it.)

I'll be more explicit: in the past people have mentioned issues aggregating stats for UTF-8 titles vs. percent-encoded titles, if I remember correctly. I don't know if there are further issues, but it would be nice for the field to be normalised.

JAllemandou added a comment. Via Web · Wed, Jun 10, 7:13 AM

It's a good point. For the moment, we assume page titles are UTF-8 encoded, and we percent-decode page titles coming from the URL path and query strings. An additional trick is applied for the query strings: we change spaces to underscores to match the titles from the URL path. (A rough sketch of the idea follows below.)
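
A rough Python sketch of that normalization, with the Unicode NFC step Tgr alludes to above added for good measure (my illustration; the actual refinery code may differ):

```python
import unicodedata
from urllib.parse import unquote

def normalize_title(raw):
    title = unquote(raw)                        # percent-decode, assuming UTF-8
    title = title.replace(" ", "_")             # query-string titles use spaces
    return unicodedata.normalize("NFC", title)  # fold O + combining umlaut into Ö

print(normalize_title("%C3%96sterreich"))  # Österreich
print(normalize_title("Barack Obama"))     # Barack_Obama
```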

Ocaasi added a subscriber: Ocaasi. Via Web · Tue, Jun 16, 9:07 PM
Sadads added a subscriber: Sadads. Via Web · Wed, Jun 17, 6:13 PM

Hi all: I just wanted to add a +1 of interest in this thread from the Wikipedia Library. In part, we would like to be able to pair pageviews with external links (our best proxy for the relative impact of citations from partner resources). In general, we are trying to figure out the best strategies for tracking a) links to partner resources, b) which editors created those links, and c) the relative visibility of the partner resources via their presence on pages. Having a cube that makes it easy to connect external links or some other variable (like DOIs) to page views would be ideal. I have created a bug in our larger link-metrics collection task at https://phabricator.wikimedia.org/T102855?workflow=102064

Milimetric added a comment. Via Web · Fri, Jun 19, 5:48 PM

Thanks @Sadads. I think I remember other people here doing analytics work on citations and references; @DarTar, am I imagining things? Is there another cube that would be useful for what @Sadads is talking about?
