Make domas' pageviews data available in semi-publicly queryable database format
Closed, Resolved · Public

Description

This doesn't seem to be tracked yet.
It's been discussed countless times in the past few years: for all sorts of GLAM initiatives and any other initiative to improve content on the projects, we currently rely on Henrik's stats.grok.se data in JSON format, e.g. https://toolserver.org/~emw/index.php?c=wikistats , http://toolserver.org/~magnus/glamorous.php etc.
The data in domas' logs should be available for easy querying on the Toolserver databases and elsewhere, but previous attempts to create such a DB led nowhere as far as I know.

I suppose this is already one of the highest priorities in the analytics team's plans for the new infrastructure, but I wasn't able to confirm that by reading the public documents, and it needs to be done sooner or later anyway.

(Not in "Usage statistics" aka "Statistics" component because that's only about raw pageviews data.)


Involved sub-tasks:


Version: wmf-deployment
Severity: enhancement
Discussions (partial list):

Milimetric added a comment.Via WebJun 9 2015, 5:37 PM

Here is a question from Christoph Braun (posted on cultural-partners list).

"Being a German Wikipedia editor I wondered how Umlauts and other non-standard characters are affecting statistics. E.g. Does Österreich count as Oesterreich or are there any other character encodings that are counted as well? " (note: the mail says two times Österreich but I suppose the variations were lost in transmission)

My personal take is that this would complicate things considerably, as these variations would be language-specific. Better to resolve this outside the API? And of course we use redirects for this already, although those need manual care.

We have an article title normalization function that we *could* do this in. But, as you say, I think it might make sense to do it outside of that function. Maybe someone who likes doing this manual redirect care could use the API to fetch pageviews for different versions of popular topics. Something like this, maybe (see the sketch after this list):

  • fetch top 10 articles
  • fetch variations on top 10 articles
  • come up with a list of redirects that should be put in place.
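
A minimal sketch of that workflow, written against the pageview API that eventually shipped (documented later in this task); `title_variants` is a hypothetical helper, since real transliteration rules are language-specific:

```python
import requests

API = "https://wikimedia.org/api/rest_v1/metrics/pageviews"
HEADERS = {"User-Agent": "redirect-audit-sketch/0.1 (example)"}

def top_articles(project, year, month, day, limit=10):
    """Fetch one day's top articles and return the first `limit` titles."""
    url = f"{API}/top/{project}/all-access/{year}/{month:02d}/{day:02d}"
    articles = requests.get(url, headers=HEADERS).json()["items"][0]["articles"]
    return [a["article"] for a in articles[:limit]]

def title_variants(title):
    """Hypothetical: generate spelling variants (here, a crude umlaut fold)."""
    folds = {"Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
    folded = "".join(folds.get(c, c) for c in title)
    return [folded] if folded != title else []

def total_views(project, title, start, end):
    """Sum daily views for one title over a YYYYMMDDHH timestamp range."""
    url = f"{API}/per-article/{project}/all-access/all-agents/{title}/daily/{start}/{end}"
    items = requests.get(url, headers=HEADERS).json().get("items", [])
    return sum(i["views"] for i in items)

for title in top_articles("de.wikipedia", 2015, 10, 1):
    for variant in title_variants(title):
        views = total_views("de.wikipedia", variant, "2015100100", "2015103100")
        if views:  # a variant drawing real traffic likely deserves a redirect
            print(f"{variant} -> {title}: {views} views")
```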
Milimetric added a comment.EditedVia WebJun 9 2015, 5:54 PM

Thanks for getting this moving. I only have some minor comments.

  • Direct access to a postgresql database in labs with the data
  • API access through RESTBase
  • Mondrian / Saiku access in labs for dimensional analysis
  • Data will be pre-aggregated so that any single data point has k-anonymity (we have not determined a good k yet)
  • Higher level aggregations will be pre-computed so they use all data

(The comprehensiveness here is great.)
So RESTBase will take care of the caching for web API access? What will the access point look like? The Parsoid RESTBase only allows querying by page_title or rev_id; what arguments will be available?

We took a stab at this in the past and it's a bit complex to post here. I think we would go for an incrementally better API by first making endpoints that present data similarly to stats.grok.se, then go on from there to add endpoints that people find useful. The Mondrian / Saiku access may be better than custom end points though, so I'd like to wait and see how that works.

stats.grok.se Cube: basic pageview data

Hourly resolution. Will serve the same purpose as stats.grok.se has served for so many years.

How to test compatibility with existing tools before the specs are set in stone?

I don't think it's feasible to go for a drop-in replacement. We'll consider it when we think of the output format, but it might not make sense. In any case, no decisions were taken there. My personal preference would be to aim for general usefulness instead of backwards compatibility, but feel free to open a discussion here and come up with a format. I'm happy to override my personal preference :)

The dimensions available will be:
  • project - 'Project name from requests host name'

So the domain, not the dbname nor the "classic" code? How expensive would it be to add resolution/a column for the old codes as well (like la, la.q, meta.m)?

It's less a matter of expense and more the annoyance of baking that code into all our pipelines. I feel like we could provide an API for this translation instead; maybe that's more generally useful. What do people think? Which codes specifically would make your lives easier?

  • dialect - 'Dialect from requests path (not set if present in project name)'

It took me some minutes to guess what you meant here. These are called "variants" in languages/LanguageConverter.php.

Thanks :) we'll have to change "dialect" to "variant" then. This is a useful field, though, right?

  • page_title - 'Page Title from requests path and query'

With what encoding?

UTF8, I guess, any opinions?

  • access_method - 'Method used to access the pages, can be desktop, mobile web, or mobile app'

So there will be three "rows" for each page and the clients will take care of the aggregation?

All aggregation levels will be accessible through one of the paths above (API, Mondrian, etc.), so the clients shouldn't have to do that. That said, as I was saying in a separate reply to Erik Z, I think the access_method might give away too much information in some cases. I'd love to talk to a privacy specialist before we do anything with that field.

  • is_zero - 'accessed through a zero provider'
  • agent_type - 'Agent accessing the pages, can be spider or user'
  • referer_class - 'Can be internal, external or unknown'

How do you know what's a spider? Is there a way to include the really interesting information, i.e. how many requests the same IP (or hash thereof) has performed in a certain timespan?

Spiders are identified by a function we run on the User-Agent. It should catch most well-behaved bots, crawlers, etc. It won't catch all automated access agents, and we're not shooting for that; we're just going for the best possible. And for information specific to certain bots or IPs, a researcher would have to sign an NDA and gain access to the raw data.
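
For illustration only, a sketch of what User-Agent-based classification generally looks like (the production function is more thorough than this single regex):

```python
import re

# Simplified stand-in for the real spider-detection patterns.
SPIDER_RE = re.compile(r"bot|crawler|spider|https?://", re.IGNORECASE)

def agent_type(user_agent):
    """Classify a request as 'spider' or 'user' from its User-Agent string."""
    return "spider" if SPIDER_RE.search(user_agent or "") else "user"

print(agent_type("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # spider
print(agent_type("Mozilla/5.0 (X11; Linux x86_64) Firefox/38.0"))     # user
```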

I don't understand the point of including is_zero and referer_class in this cube, they seem more relevant for the geo cube. Are the definitions consistent with the mediacounts' referrer columns?

Yeah, is_zero and referer_class have been deemed not that useful, and kind of scary given certain privacy attack scenarios we realized just yesterday. So those will probably not be included. I'll have to post a comprehensive update once we have our ducks in a row.

Let's say we want k = 1000 anonymity, and we have the following simple data: [...]

I assume this is only for the geo cube? Even a threshold of 1/hour or 1/day is very high for most pages on our wikis and would make most stats vanish. Once the additional information currently not present in stats is moved to secondary cubes, what's the reason to scrap data from the main cube? (I'll read the paper later, probably next week.)

Yes, same as above, we want to get rid of the dimensions that are scary from privacy and then we won't have to apply k-anonymity here. I think that's a good trade-off unless others differ here.
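
For concreteness, a toy sketch of the k-anonymity thresholding discussed above; the grouping keys and counts are made up, and the real pipeline would aggregate in the cluster, not in Python:

```python
from collections import Counter

def k_anonymize(rows, k):
    """Count rows per group and suppress any group with fewer than k entries."""
    counts = Counter(rows)
    return {group: n for group, n in counts.items() if n >= k}

# e.g. (country, page) pairs from a hypothetical geo-style cube
rows = [("AT", "Österreich")] * 1500 + [("LI", "Österreich")] * 3
print(k_anonymize(rows, k=1000))  # the 3-view ("LI", ...) group is suppressed
```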

Tgr added a comment.EditedVia WebJun 9 2015, 8:39 PM
  • page_title - 'Page Title from requests path and query'

With what encoding?

UTF8, I guess, any opinions?

Not sure I understand the question. Isn't the request path / query part of a URL just a sequence of bytes without any associated encoding? Are you going to do something other than just return that byte sequence? (Which is in practice almost always UTF-8, but I don't think you have any control over it.)

Milimetric added a comment.Via WebJun 9 2015, 11:33 PM
In T44259#1350583, @Tgr wrote:
  • page_title - 'Page Title from requests path and query'

With what encoding?

UTF8, I guess, any opinions?

Not sure I understand the question. Isn't the request path / query part of a URL just a sequence of bytes without any associated encoding? Are you going to do something other than just return that byte sequence? (Which is in practice almost always UTF-8, but I don't think you have any control over it.)

I guess we could set the encoding on the HTTP response, which is not something I want to do unless there's a really good reason for it. Normally that's just set to UTF8, yeah.

Nemo_bis added a comment.Via WebJun 10 2015, 6:30 AM

Thanks milimetric.

Not sure I understand the question. Isn't the request path / query part of a URL just a sequence of bytes without any associated encoding? Are you going to do something other than just return that byte sequence? (Which is in practice almost always UTF-8, but I don't think you have any control over it.)

I'll be more explicit: in the past people have mentioned issues aggregating stats for UTF-8 titles vs. percent-encoded titles, if I remember correctly. I don't know if there are further issues, but it would be nice for the field to be normalised.

JAllemandou added a comment.Via WebJun 10 2015, 7:13 AM

It's a good point. For the moment, we assume page titles are UTF-8 encoded, and we percent-decode page titles coming from the URL path and query strings. An additional trick is applied to the query strings: we change spaces to underscores to match titles from the URL path.
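
A minimal sketch of that normalization, assuming UTF-8 and nothing beyond Python's standard library (`normalize_title` is a hypothetical name, not the pipeline's actual function):

```python
from urllib.parse import unquote

def normalize_title(raw):
    """Percent-decode a raw title as UTF-8 and map spaces to underscores."""
    return unquote(raw, encoding="utf-8", errors="replace").replace(" ", "_")

assert normalize_title("%C3%96sterreich") == "Österreich"   # from the URL path
assert normalize_title("San Francisco") == "San_Francisco"  # from a query string
```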

Ocaasi added a subscriber: Ocaasi.Via WebJun 16 2015, 9:07 PM
Sadads added a subscriber: Sadads.Via WebJun 17 2015, 6:13 PM

Hi all: I just wanted to add a +1 interest in this thread from the Wikipedia Library. In part, we would like to be able to pair pageviews with external links (our best proxy for the relative impact of citations from partner resources). In general, we are trying to figure out the best strategies for tracking a) links to partner resources, b) which editors created those links and c) the relative visibility of the partner resources via their presence on pages. Having a cube that makes it easy to connect external links or some other variable (like DOIs) to page views would be ideal. I have created a bug in our larger link metrics collection task at: https://phabricator.wikimedia.org/T102855?workflow=102064

Milimetric added a comment.Via WebJun 19 2015, 5:48 PM

Thanks @Sadads, I think I remember other people here doing analytics work on citations and references, @DarTar, am I imagining things? Is there another cube that would be useful for what @Sadads is talking about?

Tomayac added a subscriber: Tomayac.Via WebJul 3 2015, 7:45 AM
dcausse added a subscriber: dcausse.Via WebJul 16 2015, 6:33 AM
Blahma added a subscriber: Blahma.Via WebJul 17 2015, 12:14 PM
Restricted Application added a subscriber: Aklapper.Via HeraldJul 17 2015, 12:14 PM
Nettrom added a subscriber: Nettrom.Via WebJul 17 2015, 3:44 PM
Milimetric added a comment.Via WebAug 14 2015, 7:28 PM

Sorry for the long pause between updates. Here's where we are with our quarterly goal to put up a pageview API.

  • We have programmed the RESTBase endpoints and are getting ready to submit a pull request today or Monday; that's ahead of schedule and we're happy about it
  • We believe the hardware we're decommissioning from Kafka work can be repurposed for the RESTBase / Cassandra cluster to host the Pageview API. This hasn't been approved by ops yet, but we're optimistic, which is good because these things can take the most time
  • We have started the puppet work to configure this new RESTBase cluster, and that's ahead of schedule too.
  • We ran into some bugs and incomplete work in normalizing the page titles. Some of these fixes are done and others are scheduled, and we think we can go back and fix all the data we'd be putting into the API.

In short, I think we're on or ahead of schedule overall with no known blockers.

Mrjohncummings added a subscriber: Mrjohncummings.Via WebAug 14 2015, 9:11 PM

My top request would be the inclusion of mobile views in something similar to stats.grok.se. Currently it doesn't include mobile views, which, according to metrics WikiProject Medicine has created, means missing over 50% of the actual page views.

https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_medical_pages

intracer added a subscriber: intracer.Via WebAug 16 2015, 8:02 AM
Milimetric added a comment.EditedVia WebAug 17 2015, 2:11 PM

You can drill down into mobile views a little bit on: https://vital-signs.wmflabs.org/ (just click the data breakdowns button on the left).

As for the ongoing pageview API work, the current endpoints don't have a mobile breakdown, because with such low traffic on mobile for some articles, you could identify that an editor is using a mobile device. We are still talking about whether to include mobile views at the project level. In most cases it's ok, but there are some projects that also have very few pageviews, and there we would have the same problem of identifying that active editors are probably using mobile devices to edit.

We're a bit torn on what to do with the vital signs breakdowns. They're also suffering from the same problem, and we should remove them, especially when you consider the zero site.

Releasing data is hard :)

Milimetric added a comment.Via WebAug 18 2015, 1:11 PM

...

As for the ongoing pageview API work, the current endpoints don't have a mobile breakdown, because with such low traffic on mobile for some articles, you could identify that an editor is using a mobile device. We are still talking about whether to include mobile views at the project level. In most cases it's ok, but there are some projects that also have very few pageviews, and there we would have the same problem of identifying that active editors are probably using mobile devices to edit.

We're a bit torn on what to do with the vital signs breakdowns. They're also suffering from the same problem, and we should remove them, especially when you consider the zero site.

Never mind this, edits that happen on the mobile site are tagged as "mobile" and that data is available publicly anyway. So I'll file a task to change our endpoints to expose mobile / not mobile.

Kopiersperre added a subscriber: Kopiersperre.Via WebAug 30 2015, 10:59 AM
Milimetric added a comment.Via WebSep 1 2015, 2:49 PM

Quick update: we're talking with ops and making all the necessary preparations. The code is mostly done from our point of view, but these next few weeks will be a reality check in terms of hardware, network setup, etc. I've added the puppetization of the pageview API deployment as a blocking task to this. A lot of work ahead, but exciting times! :)

Milimetric added a comment.Via WebSep 11 2015, 10:02 PM

Quick question that we'd love some opinions on. We have two choices for how to go forward with the "top" endpoint and we're not sure what would be most useful to folks consuming this. I asked this on analytics-l so feel free to ignore this if you're participating there. Here are the choices:

Choice 1. /top/{project}/{access}/{days-in-the-past}

Example: top articles via all en.wikipedia sites for the past 30 days: /top/en.wikipedia/all-access/30

Choice 2. /top/{project}/{access}/{start}/{end}

Example: top articles via all en.wikipedia sites from June 12th, 2014 to August 30th, 2015: /top/en.wikipedia/all-access/2014-06-12/2015-08-30

(in all of those,

  • {project} means en.wikipedia, commons.wikimedia, etc.
  • {access} means access method as in desktop, mobile web, mobile app

)

Which do you prefer? Would any other query style be useful?
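
To make the two shapes concrete, a sketch of how client code would build each URL (hypothetical helpers; neither path is final at this point in the discussion):

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def top_relative(project, access, days_in_the_past):
    """Choice 1: a moving window; the same URL returns different data over time."""
    return f"{BASE}/top/{project}/{access}/{days_in_the_past}"

def top_absolute(project, access, start, end):
    """Choice 2: fixed dates; the URL is stable and cache-friendly."""
    return f"{BASE}/top/{project}/{access}/{start}/{end}"

print(top_relative("en.wikipedia", "all-access", 30))
print(top_absolute("en.wikipedia", "all-access", "2014-06-12", "2015-08-30"))
```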

Tgr added a comment.Via WebSep 11 2015, 10:32 PM

Clearly the second. It gives more information, allows you to leverage Varnish, and does not change. (Linking to some statistics on a wiki page and that link showing something different every day would be rather confusing.) Convenience can be left to the frontend.

HYanWong added a subscriber: HYanWong.Via WebOct 9 2015, 10:12 PM

It's fantastic that this is almost done. Will this project also provide downloadable dump files, roughly equivalent to those at http://dumps.wikimedia.org/other/pagecounts-ez/merged/ or will that URL continue to be the main source of aggregated page view dumps?

Milimetric added a comment.Via WebOct 13 2015, 2:00 PM

Will this project also provide downloadable dump files, roughly equivalent to those at http://dumps.wikimedia.org/other/pagecounts-ez/merged/ or will that URL continue to be the main source of aggregated page view dumps?

This API won't provide aggregated dumps; we'll still publish those at dumps.wikimedia.org. But we're currently trying to simplify that site, as the differences between the different data sets are confusing.

zhuyifei1999 added a subscriber: zhuyifei1999.Via WebOct 17 2015, 9:24 AM
scfc added a subscriber: ErrantX.Via WebOct 20 2015, 7:10 PM
Milimetric added a comment.Via WebOct 22 2015, 4:13 PM

I know I owe everyone an update here. We've run into a bunch of little tiny annoying problems, and we're working through all of them, nothing too interesting to this list I think.

The API will be deployed today or tomorrow, depending on the services team's availability. But at that point we'll be still backfilling some data, especially the per-article view counts. So the endpoints will be available to hit, with some data in there, but we won't announce it publicly. I'll post here as soon as we have some good news. If anyone's interested in the problems and details, ping me privately or on IRC or something.

Milimetric added a subscriber: Henrik.Via WebOct 23 2015, 12:42 PM

Good News :)

The API has been launched. We're not announcing it widely yet, because we haven't finished loading in all the data. The per-article data will take some time; the others should be ready relatively soon. However, I wanted to give folks on this list a heads-up so they can start writing code; the spec is final.

Find the docs here: https://wikimedia.org/api/rest_v1/?doc#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end

And here are some example queries:

https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2015/10/01

https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/user/daily/2015100100/2015100200

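A minimal sketch of querying those two endpoints from Python, assuming the `requests` library (the User-Agent value is just a placeholder):

```python
import requests

HEADERS = {"User-Agent": "pageview-api-example/0.1 (example)"}

# Top articles for one day on en.wikipedia, all access methods.
top = requests.get(
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
    "en.wikipedia/all-access/2015/10/01",
    headers=HEADERS,
).json()
print(top["items"][0]["articles"][:3])  # the three most-viewed articles

# Daily project-wide totals for user (non-spider) traffic.
daily = requests.get(
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
    "en.wikipedia/all-access/user/daily/2015100100/2015100200",
    headers=HEADERS,
).json()
for item in daily["items"]:
    print(item["timestamp"], item["views"])  # one row per day in the range
```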

Huge huge thanks to everyone involved:

  • The community for kicking our butt until we did this (@Magnus and @MZMcBride especially)
  • @Henrik for keeping stats.grok.se up as long as he has, hopefully this will be a welcome addition and maybe spark interest in a new version of stats.grok.se
  • Ops and Services teams for holding our hand throughout
  • Everyone on the analytics team, past and present, we all worked on this at some point

We'll have an official public announcement on the analytics list when all the data is loaded. And most likely a blog post soon after. Until then let's keep this among people who need to know to update code and deserve to know in general :)

Glaisher added a subscriber: Glaisher.Via WebOct 23 2015, 12:46 PM
Magnus added a comment.Via EmailOct 23 2015, 12:52 PM

Fantastic news indeed!

Can't wait for the per-article data. At least, I now have a URL schema to code against :-)

Detail question: will per-article work with "out-of-bounds" dates? So, if my date range is 2015090100-2015093200 (or 3124, or 3123 for a 30-day month), will that work?

Tnegrin added a comment.Via EmailOct 23 2015, 2:09 PM

Congrats Dan and team -- nice to see this so close.

We should talk about moving the page view statistics from the wiki to this service when it's had a chance to bake some.

-Toby

Nettrom added a comment.Via WebOct 23 2015, 3:35 PM

First of all, I'll join the celebrations, this is absolutely fantastic; huge thanks to everyone involved!

I'm looking forward to testing it with SuggestBot, since the bot delivers view data to en-wiki users every day. Reading through the documentation, I had a question about the date format when requesting views per day for articles: does it simply strip the hour off the time? Meaning a start/end time spec of '2015102300' and '2015102301' would be equivalent? Not sure where to ask; maybe I should open a separate ticket for it? And as you can probably tell, I'm a bit eager to play around with this.

Milimetric added a comment.Via WebOct 23 2015, 5:04 PM

Detail question: will per-article work with "out-of-bounds" dates? So, if my date range is 2015090100-2015093200 (or 3124, or 3123 for a 30-day month), will that work?

The timestamps are validated to be valid dates, so 2015093200 or 2015093100 will be invalid and will return a proper message explaining what's wrong.

Hours are from 00 to 23, so 2015100100 will include the first hour of 2015-10-01. If you want all of September at an hourly level, this is the correct range: 2015090100-2015093023.

@Nettrom:

question about the date format when requesting views per day for articles: does it simply strip the hour off the time? Meaning a start/end time spec of '2015102300' and '2015102301' would be equivalent?

Actually, you have to pass 2015102300 if you want data for the 23rd. 2015102301 will give you a 404, since it specifies an hour for the daily level. This is ... confusing. I'm open to suggestions, but we may not want to mess with the URL structure too much once this is publicly launched.
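
A sketch of a small client-side helper that encodes these rules, so callers never build an invalid timestamp by hand (hypothetical function names):

```python
from datetime import date

def hourly_range(start, end):
    """Full-day hourly range: hours run 00-23, so end on HH=23 of the last day."""
    return start.strftime("%Y%m%d") + "00", end.strftime("%Y%m%d") + "23"

def daily_stamp(day):
    """Daily granularity: the hour field must be 00 or the API returns a 404."""
    return day.strftime("%Y%m%d") + "00"

# All of September 2015 at hourly resolution:
print(hourly_range(date(2015, 9, 1), date(2015, 9, 30)))  # ('2015090100', '2015093023')
# Data for October 23rd at daily resolution:
print(daily_stamp(date(2015, 10, 23)))  # '2015102300'
```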

Not sure where to ask; maybe I should open a separate ticket for it?

Anyone can ask details like this in #wikimedia-analytics on freenode, someone should be able to answer there. Or on the analytics-l list.

Tgr added a comment.Via WebOct 23 2015, 6:23 PM

Actually, you have to pass 2015102300 if you want data for the 23rd. 2015102301 will give you a 404, since it specifies an hour for the daily level. This is ... confusing. I'm open to suggestions, but we may not want to mess with the URL structure too much once this is publicly launched.

The intuitive format would IMO be https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/user/daily/20151001/20151002 but if you don't want to change the format at least it should be a 301 instead of a 404.

Nemo_bis edited the task description. (Show Details)Via WebOct 25 2015, 11:08 AM
Verena added a subscriber: Verena.Via WebOct 27 2015, 10:42 AM
Milimetric added a comment.Via WebOct 28 2015, 4:28 PM

Quick update: October has finished loading. We tried to optimize, but we couldn't get hourly-resolution per-article data to fit in Cassandra. Because of that, we're looking at Druid and Elasticsearch as replacements [1].

So at this point, people can query this data freely, and expect it to be reliable. Let us know if you have problems. We will continue to fill in all the rest of the data we have, back to May 2015, and we'll keep it up to date with new data.

[1] https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI/DataStore

Mrjohncummings added a comment.Via WebOct 28 2015, 6:01 PM

It is really wonderful these metrics are now available :)

Has anyone started work on a user interface or can anyone suggest an easy way to visualise the results?

Milimetric added a comment.Via WebOct 28 2015, 10:01 PM

We haven't made an interface, sort of on purpose, to see what the level of interest is, etc. We're working pretty hard on the back-end to add more types of data and possible queries.

But the data that comes back is JSON and should be very easy to visualize with anything like d3, dygraphs, etc. I'm happy to help as a volunteer to write that kind of code, and I humbly suggest dashiki as a platform to build it with. Anyone who wants to work on this should open another task and cc me.
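
As one concrete route from the JSON to a picture, a sketch using `requests` and matplotlib rather than d3 or dygraphs (the endpoint and article are taken from examples elsewhere in this task):

```python
import requests
import matplotlib.pyplot as plt

url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia/all-access/all-agents/Selfie/daily/2015100100/2015103100"
)
items = requests.get(url, headers={"User-Agent": "pv-plot-sketch/0.1"}).json()["items"]

days = [i["timestamp"][:8] for i in items]   # YYYYMMDDHH -> YYYYMMDD
views = [i["views"] for i in items]

plt.plot(days, views)
plt.xticks(rotation=90)
plt.title("Daily pageviews: Selfie (en.wikipedia, October 2015)")
plt.tight_layout()
plt.show()
```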

Milimetric added a comment.Via WebNov 6 2015, 2:27 PM

Update:

  • I want to talk about the Pageview API and future Analytics Data APIs at this year's Mediawiki Developer Summit. I will cc some of you on this proposal: https://phabricator.wikimedia.org/T112956. Let's discuss there where we want to go next
  • Marcel wrote a simple demo of what's possible to do with the API, we'll be showing that off soon
  • We are getting ready to make a blog post about the API
Ragesoss added a comment.Via WebNov 12 2015, 10:27 PM
In T44259#1748904, @Tgr wrote:

The intuitive format would IMO be https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/user/daily/20151001/20151002 but if you don't want to change the format at least it should be a 301 instead of a 404.

I agree. If I'm trying to get daily numbers, then it makes sense to have the dates in YYYYMMDD format. I tried both that and with HH = 01 before figuring out that the only way to get it to work was to use HH = 00.

Milimetric added a comment.Via WebNov 13 2015, 3:01 AM
In T44259#1748904, @Tgr wrote:

The intuitive format would IMO be https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/user/daily/20151001/20151002 but if you don't want to change the format at least it should be a 301 instead of a 404.

I agree. If I'm trying to get daily numbers, then it makes sense to have the dates in YYYYMMDD format. I tried both that and with HH = 01 before figuring out that the only way to get it to work was to use HH = 00.

I'm happy to do this, just wondering if you guys thought it wasn't too confusing to have different date formats based on different values for the other parameters. It seems easier for humans and harder for machines, and this API leans slightly towards machines.

Ragesoss added a comment.Via WebNov 13 2015, 3:02 AM

I'm happy to do this, just wondering if you guys thought it wasn't too confusing to have different date formats based on different values for the other parameters. It seems easier for humans and harder for machines, and this API leans slightly towards machines.

Why not support both? Just interpret YYYYMMDD as meaning YYYYMMDD00.

Milimetric added a comment.Via WebNov 13 2015, 3:06 AM

Why not support both? Just interpret YYYYMMDD as meaning YYYYMMDD00.

Makes sense, filed: https://phabricator.wikimedia.org/T118543

Magnus added a comment.Via WebNov 13 2015, 12:41 PM

Thanks, I'll manage on my own, once daily (or monthly) views are available on the new API. Or did I miss a mail and they already are?

Milimetric added a comment.Via WebNov 13 2015, 3:46 PM

@Magnus, the API is up and being used already, we just haven't announced it on a list yet. I have a draft email explaining some details that I'll send probably today or Monday to analytics-l, engineering, and wikitech.

Monthly pageviews aren't ready quite yet. But daily pageviews are stable, and filled back to October (with more data being added as we go): https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Selfie/daily/2015010100/2015120100

Milimetric closed this task as "Resolved".Via WebNov 16 2015, 10:34 PM

I'm super duper excited: the API has been announced publicly on the wikitech, analytics, and engineering lists. Therefore I'm resolving this. Feel free to stick around, share stories, etc. But if you want to talk about what's next for this API, head on over to T112956 where I've most likely already subscribed you :)

Thank you again to everyone on this thread. It means a lot to me to be able to move this project forward, and I'm excited to see where we want to go next.

He7d3r added a subscriber: He7d3r.Via WebNov 22 2015, 5:40 PM
Ricordisamoa added a comment.Via WebNov 25 2015, 1:14 PM

What about unique viewers?

Milimetric added a comment.Via WebNov 25 2015, 5:17 PM

What about unique viewers?

That requires gathering data in a different way. We don't really like the whole idea of fingerprinting at WMF, so we don't do that.

Ricordisamoa added a comment.Via WebNov 25 2015, 8:08 PM

What about unique viewers?

That requires gathering data in a different way. We don't really like the whole idea of fingerprinting at WMF, so we don't do that.

Sounds sensible, thanks.

Mrjohncummings added a comment.Via WebDec 11 2015, 1:12 PM

We haven't made an interface, sort of on purpose, to see what the level of interest is, etc. We're working pretty hard on the back-end to add more types of data and possible queries.

But the data that comes back is JSON and should be very easy to visualize with anything like d3, dygraphs, etc. I'm happy to help as a volunteer to write that kind of code, and I humbly suggest dashiki as a platform to build it with. Anyone who wants to work on this should open another task and cc me.

Can you recommend a guide that gives baby steps to reuse the data in one of the tools you suggest? I'd be very happy to work with you on a tool as a beta tester etc (not a programmer)

Milimetric added a comment.Via WebDec 11 2015, 3:19 PM

Can you recommend a guide that gives baby steps to reuse the data in one of the tools you suggest? I'd be very happy to work with you on a tool as a beta tester etc (not a programmer)

We're about to put out a blog post. At the bottom of that I'm trying to have such a guide. If that's not rich enough I'll keep trying :)

@Mrjohncummings: please open another Phabricator task and assign it to me so I can reference it from the work I do.

@Milimetric great, thanks. I've written something here; I think I've made a bit of a pig's ear of the wording, so please change it to make sense.

https://phabricator.wikimedia.org/T121314

Milimetric added a comment.Via WebDec 15 2015, 5:40 PM

To keep the archives happy, this is the blog post announcing the release of the API: http://blog.wikimedia.org/2015/12/14/pageview-data-easily-accessible/

Slaporte added a subscriber: Slaporte.Via WebJan 6 2016, 5:27 PM

@Milimetric, when is pageview data from the previous day published? I noticed that data from January 5 isn't available yet (unless I'm doing something wrong) 17 hours since UTC midnight.

Kelson added a subscriber: Kelson.Via WebJan 11 2016, 12:30 PM
M.Schwendener added a subscriber: M.Schwendener.Via WebJan 11 2016, 2:25 PM
Nemo_bis added a comment.Via WebJan 11 2016, 3:06 PM

@Milimetric, when is pageview data from the previous day published? I noticed that data from January 5 isn't available yet (unless I'm doing something wrong) 17 hours since UTC midnight.

Probably same issue as T116286.

JAllemandou added a comment.Via WebJan 12 2016, 11:07 AM

when is pageview data from the previous day published? I noticed that data from January 5 isn't available yet (unless I'm doing something wrong) 17 hours since UTC midnight.

@Slaporte: We experienced a cluster issue on Jan 4th, 2016, which slowed down our computation for the next two days. Everything is now back in order. Sorry for the inconvenience.

Slaporte added a comment.Via WebJan 12 2016, 6:46 PM

when is pageview data from the previous day published? I noticed that data from January 5 isn't available yet (unless I'm doing something wrong) 17 hours since UTC midnight.

@Slaporte: We experienced a cluster issue on Jan 4th, 2016, which slowed down our computation for the next two days. Everything is now back in order. Sorry for the inconvenience.

Glad that's resolved. Thanks for the update!

Milimetric added a comment.Via WebThu, Jan 14, 2:42 PM

@Milimetric, when is pageview data from the previous day published? I noticed that data from January 5 isn't available yet (unless I'm doing something wrong) 17 hours since UTC midnight.

@Slaporte, data shows up "as soon as possible". In theory, the earliest it could show up is a couple of hours after the respective time period is finished (so at 02:00 UTC on day X+1, we should have day X ready). But it can sometimes be much slower if the cluster is overloaded, data gets lost and we have to restart jobs, etc. In general I haven't seen it take more than 24 hours, so if you see wait times much beyond that, it might be worth reporting.

Milimetric added a comment.Via WebTue, Jan 19, 4:19 PM

Heh, funny, that's just a copy of my code from: https://github.com/mediawiki-utilities/python-mwviews/blob/master/mwviews/api/pageviews.py

I've seen a couple of better python implementations and there are also clients in R, JS, and more. This thing's heating up :)

Nuria added a subscriber: Nuria.Via WebSun, Jan 31, 4:13 PM

@Nemo_bis: Are there any actionables for analytics here? Seems that we can close this ticket, right?

Nemo_bis added a comment.Via WebSun, Jan 31, 5:45 PM

@Nemo_bis: Are there any actionables for analytics here? Seems that we can close this ticket, right?

This was already closed over 2 months ago.

Nemo_bis edited the task description. (Show Details)Via WebTue, Feb 2, 7:43 AM
Multichill removed a subscriber: Multichill.Via WebTue, Feb 2, 11:36 AM
