
Add monthly request stats per article title to pageview api
Closed, Resolved · Public · 5 Story Points

Description

UPDATE: tagged as easy because the new cluster should be able to do the monthly aggregation in AQS (people are already requesting this data and aggregating it themselves so it doesn't increase the load on Cassandra, just the CPU usage on AQS, for which there is plenty of overhead).
Pageview API: https://wikitech.wikimedia.org/wiki/Analytics/PageviewAPI

Over many years [1], quite a few people have asked for monthly page view stats that can be queried via an API.

Since mid-2015 we have had the awesome API, which serves highly granular pageview data. But for trend analysis or simple monthly reports, a higher aggregation level would be very useful, monthly totals in particular.

Daily and monthly aggregates are already available, as huge downloads from https://dumps.wikimedia.org/other/pagecounts-ez/merged/

[1] http://toolserver-l.wikimedia.narkive.com/1c9rHFQP/monthly-pageviews

Event Timeline

ezachte created this task. Jul 11 2016, 1:45 PM
Restricted Application added subscribers: Zppix, Aklapper. Jul 11 2016, 1:45 PM

An alternative might be a sqlite3 database, per project. Since those are "serverless", all that would be required is code/CPU for generating the files, and disk space. Vanilla sqlite3 is not compressed, so it would eat quite some space, but not require any support beyond nfs access.

While the solutions proposed here are simple enough, the API's scaling problems were due mostly to it running on HDDs instead of SSDs. The transition to SSDs is almost finished (see the latest updates) so I'm fairly confident that it will happen before we can get any other solution in place. By the way though, the python client can get you monthly views for lists of articles [1], just not huge lists. And we're looking at adding pre-defined lists like WikiProjects [2].

[1] https://github.com/mediawiki-utilities/python-mwviews/blob/master/mwviews/api/pageviews.py#L73
[2] https://phabricator.wikimedia.org/T139324

Akeron added a subscriber: Akeron. Jul 11 2016, 11:17 PM
Milimetric triaged this task as Normal priority. Aug 1 2016, 4:52 PM
Milimetric added a project: good first bug.
Milimetric updated the task description.
Restricted Application added a subscriber: TerraCodes. Aug 1 2016, 4:52 PM
Milimetric moved this task from Incoming to Backlog (Later) on the Analytics board. Aug 1 2016, 4:52 PM
Nuria added a comment. Aug 1 2016, 4:53 PM

Careful, the unique devices endpoint should not be affected by these changes.

That's a great idea!
Thanks @Milimetric!

Milimetric added a subscriber: Nuria.
Nuria updated the task description. Dec 8 2016, 9:59 PM

Just reporting back about the current way this is being implemented (mini request for comment):

The per-article API is currently in this shape [2]:

/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}

Currently, {granularity} can only be "daily". This change, being worked on in GCI, would allow "monthly" to be passed as well. It seems to me it would be ugly to require different {start} and {end} formats depending on the {granularity} parameter, so we'll keep accepting a range in the YYYYMMDD format, meaning callers still specify a day. So two questions:

  1. If someone passes in dates with partial months, should we strip the partial months and return data for only the complete months? Or should we just fetch the daily data from Cassandra, aggregate it monthly, and return whatever comes back? The latter is the easiest to implement and for consumers to understand, I think. If we want to do anything else, we should make a different endpoint.
  2. Currently we return data with "timestamp" in each item we return. The format for timestamp is YYYYMMDD00. For monthly data, should we return the timestamp as YYYYMM0100 or YYYYMM? I think that YYYYMM0100 makes the most sense because it would be compatible with current clients. That's going with the general approach to this API which is to be more machine-friendly than human-friendly.

[2] https://github.com/wikimedia/analytics-aqs/blob/master/sys/pageviews.yaml#L2
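To make the two questions concrete, here is a minimal Python sketch of the "aggregate whatever comes back" option with YYYYMM0100 timestamps. It is only an illustration, not the AQS implementation: the function name `aggregate_monthly` and the `views` field are assumptions (only `timestamp` is named in this thread), and partial months are summed as-is rather than stripped.

```python
def aggregate_monthly(daily_items):
    """Sum daily items into one item per month.

    Assumes each item is a dict with a "timestamp" string in YYYYMMDD00
    format and a numeric "views" count (the "views" name is hypothetical).
    Partial months are NOT stripped: whatever days are present get summed.
    """
    totals = {}
    for item in daily_items:
        # Keep YYYYMM and pin day/hour to "0100" so the result still
        # matches the YYYYMMDD00 shape that current clients parse.
        month_key = item["timestamp"][:6] + "0100"
        totals[month_key] = totals.get(month_key, 0) + item["views"]
    # Fixed-width numeric strings sort chronologically.
    return [{"timestamp": ts, "views": v} for ts, v in sorted(totals.items())]


# Example input covering part of January and one day of February.
daily = [
    {"timestamp": "2016011500", "views": 10},
    {"timestamp": "2016013100", "views": 5},
    {"timestamp": "2016020100", "views": 7},
]
monthly = aggregate_monthly(daily)
print(monthly)
```

Here the February item is built from a single day, which is the kind of partial-month total that question 1 asks about.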

Nuria added a comment. Dec 12 2016, 6:03 PM

On 1) I think aggregating partial months might be confusing. It is hard to relate to data that you do not see often, and when you see monthly data you expect it to be an aggregation over the whole month.
On 2) I concur with data format 'YYYYMMDD00'

Change 326545 had a related patch set uploaded (by Phantom42):
Monthly request stats per article title

https://gerrit.wikimedia.org/r/326545

Change 326545 merged by Milimetric:
Monthly request stats per article title

https://gerrit.wikimedia.org/r/326545

Milimetric edited projects, added Analytics-Kanban; removed Pageviews-API, Analytics.
Milimetric moved this task from Next Up to Done on the Analytics-Kanban board.
Milimetric moved this task from Done to Ready to Deploy on the Analytics-Kanban board.

The change got merged some time ago, so I think we can mark this as resolved, right?

Nuria added a comment. Jan 3 2017, 9:00 PM

No, we do not mark tickets as resolved until code is deployed.

@Phantom42, we've been on break and I wasn't able to merge the other supporting changes to get this deployed. We have a week-long staff meeting now (end/beginning of year is busy for us). I'll try to deploy this as soon as possible though, and when it's live we'll mark it resolved.

Nuria added a comment. Jan 10 2017, 4:02 PM

We need to load test this code to make sure results are being delivered within SLAs before making the data available for public consumption.

Nuria set the point value for this task to 3. Jan 19 2017, 4:51 PM
Nuria changed the point value for this task from 3 to 5.
Nuria claimed this task. Jan 20 2017, 4:04 PM

@Phantom42: thanks to your work, this is now live: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/de.wikipedia/all-access/all-agents/Barack_Obama/monthly/2016010100/2016013100

For future reference, there was another pull request needed to the restbase code: https://github.com/wikimedia/restbase/pull/746, which Petr was nice enough to help us with. Thanks to everyone!
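For anyone wanting to try the monthly endpoint, the following Python sketch assembles a request URL from the per-article path template discussed earlier in this task. The helper name `per_article_url` is hypothetical, and percent-encoding the article title with `urllib.parse.quote` is an assumption about how titles with special characters should be passed; the sketch only builds the URL and does not send a request.

```python
from urllib.parse import quote

# Base taken from the live example URL posted in this task.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, access, agent, article, granularity, start, end):
    """Build a per-article pageviews URL (hypothetical helper).

    Follows the template
    /per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}
    """
    # Assumption: the title is percent-encoded, including any "/" it contains.
    encoded_article = quote(article, safe="")
    parts = [project, access, agent, encoded_article, granularity, start, end]
    return BASE + "/" + "/".join(parts)


url = per_article_url(
    "de.wikipedia", "all-access", "all-agents",
    "Barack_Obama", "monthly", "2016010100", "2016013100",
)
print(url)
```

With the arguments shown, the helper reproduces the example URL from the comment above.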

Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Jan 24 2017, 3:57 PM
Nuria closed this task as Resolved. Jan 26 2017, 3:54 PM