
Expose pageview data in each project's REST API
Closed, Declined · Public

Description

This is a continuation of T114830. We decided in that task that we would configure en.wikipedia.org/api/rest_v1/ to serve the pageviews endpoints the same way that wikimedia.org/api/rest_v1/ currently does. See the discussion on that task for background and where we are with bike-shedding on the URL structure.

(Note on priority from Analytics point of view: this becomes useful when someone wants to deploy an extension or gadget that uses the pageview API)

Event Timeline

Milimetric raised the priority of this task from to Needs Triage.
Milimetric updated the task description.
Milimetric added a subscriber: Milimetric.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Nov 19 2015, 5:21 PM
Milimetric triaged this task as Normal priority. · Nov 19 2015, 5:59 PM
Milimetric updated the task description.
Milimetric set Security to None.
Milimetric moved this task from Incoming to Backlog on the Analytics-Backlog board.
mobrovac raised the priority of this task from Normal to High. · Apr 11 2016, 1:47 PM
mobrovac added a subscriber: mobrovac.

Raised the priority to High as we should settle on this ASAP. It's been around for a while now without any real action.

I think there are no real blockers to doing this other than the most likely painful bikeshed. So we can start bikeshedding about the route structure. Once that's done, the actual work should be really fast, especially after all the improvements in config that y'all have made.

In order to put a strawman forward and progress the discussion, I'm reposting my proposal from T114830:

| Public API endpoint | AQS endpoint |
| --- | --- |
| //{domain}/api/rest_v1/page/stats/{title}/{access}/{agent}/{granularity}/{start}/{end} | aqs://analytics.wm.org/v1/pageviews/per-article/{domain}/{access}/{agent}/{article}/{granularity}/{start}/{end} |
| //{domain}/api/rest_v1/data/stats/range/{access}/{agent}/{granularity}/{start}/{end} | aqs://analytics.wm.org/v1/pageviews/per-project/{domain}/{access}/{agent}/{granularity}/{start}/{end} |
| //{domain}/api/rest_v1/data/stats/top/{access}/{year}/{month}/{day} | aqs://analytics.wm.org/v1/pageviews/top/{domain}/{access}/{year}/{month}/{day} |

Let the bike-shedding begin!

JAllemandou added a subscriber: JAllemandou. · Edited · Apr 20 2016, 9:13 AM

Bike-shediiiiiiiing !
To mirror a bit more the current format, I'd suggest:

| Public API endpoint | AQS endpoint |
| --- | --- |
| //{domain}/api/rest_v1/metrics/pageviews/per-article/{title}/{access}/{agent}/{granularity}/{start}/{end} | aqs://analytics.wm.org/v1/pageviews/per-article/{domain}/{access}/{agent}/{article}/{granularity}/{start}/{end} |
| //{domain}/api/rest_v1/metrics/pageviews/total/{access}/{agent}/{granularity}/{start}/{end} | aqs://analytics.wm.org/v1/pageviews/per-project/{domain}/{access}/{agent}/{granularity}/{start}/{end} |
| //{domain}/api/rest_v1/metrics/pageviews/top/{access}/{year}/{month}/{day} | aqs://analytics.wm.org/v1/pageviews/top/{domain}/{access}/{year}/{month}/{day} |
| //{domain}/api/rest_v1/metrics/unique-devices/{access-site}/{granularity}/{start}/{end} | aqs://analytics.wm.org/v1/unique-devices/{domain}/{access-site}/{granularity}/{start}/{end} |
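
To make the mapping concrete, here is a minimal Python sketch of how a per-domain public route could be rewritten into the corresponding AQS URI. It is only an illustration of the proposal: the `ROUTES` table and the `map_route` helper are hypothetical names, the templates are copied from the pageview rows of the table (with the AQS `{article}` segment filled from `{title}`), and none of this reflects deployed RESTBase configuration.

```python
from urllib.parse import quote

# Backend prefix as written in the proposal table.
AQS_BASE = "aqs://analytics.wm.org/v1"

# Hypothetical route table: name -> (public path template, AQS path template).
# The AQS per-article template uses {title} where the table says {article}.
ROUTES = {
    "per-article": (
        "/metrics/pageviews/per-article/{title}/{access}/{agent}/{granularity}/{start}/{end}",
        "/pageviews/per-article/{domain}/{access}/{agent}/{title}/{granularity}/{start}/{end}",
    ),
    "total": (
        "/metrics/pageviews/total/{access}/{agent}/{granularity}/{start}/{end}",
        "/pageviews/per-project/{domain}/{access}/{agent}/{granularity}/{start}/{end}",
    ),
    "top": (
        "/metrics/pageviews/top/{access}/{year}/{month}/{day}",
        "/pageviews/top/{domain}/{access}/{year}/{month}/{day}",
    ),
}


def map_route(domain, route, **params):
    """Return (public URL, AQS URI) for one proposed route on one domain."""
    public_tpl, aqs_tpl = ROUTES[route]
    # Percent-encode every parameter so a title containing "/" stays one segment.
    values = {name: quote(str(value), safe="") for name, value in params.items()}
    values["domain"] = domain
    public = "//" + domain + "/api/rest_v1" + public_tpl.format(**values)
    backend = AQS_BASE + aqs_tpl.format(**values)
    return public, backend
```

The only real transformation is that the domain moves from the host part of the public URL into a path segment of the backend URI; everything else is passed through unchanged.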

I like the general direction, @JAllemandou. But the first one should really go under /{domain}/api/rest_v1/page/, as that is where most of the information pertaining to titles goes. I realise that, in some way, that fragments the pageview API from a conceptual point of view, but I think there's more value in properly integrating it into the current public outline than in trying to keep it together.

@mobrovac : Interesting. In my opinion the main concern of endpoint conceptual coherency applies here to both sides: /{domain}/api/rest_v1/page/ and/or /{domain}/api/rest_v1/metrics/.
The decision to keep one or the other should really be a matter of usage rather than developer view, but usage data will be difficult (not to say impossible) to get.
My view is the opposite of yours on which endpoint to keep coherent, the argument being that analytics-interested clients are already used to the format.
But as said before, it'll be very difficult to get data supporting my view or yours.

I think the misunderstanding here comes from the POV: you look at it from the "I'm a pageview API user" perspective, and I try to look at it from the "I'm a REST API user" one. Given that here we are trying to integrate the pageview API into the broader REST API, it makes sense to view it from the latter perspective. Taking that into account, I think answering "what can I get for a page" outweighs "what can I get from the pageview API", since the focus of the REST API is on content, not on its sub-APIs (by stating this I am not implying the pageview API is less worthy or anything like that).

Recall that here we are discussing the API for each domain. If one is really interested exclusively in the pageview API, they can always resort to using https://wikimedia.org/api/rest_v1/ where the pageview API is consolidated.

After weekend thinking: I don't think there is any misunderstanding in the POV I describe :)
Whether it is decided to go for "I can find metrics here and I can find content there" or "I can find page content and metrics here, and other project metrics (aggregated, unique devices, top) there" doesn't really matter to me personally.
I just really think it's worth describing the dimensions we are discussing a bit more than "I'm a global REST API". To me the latter encompasses everything and therefore doesn't give any credit to prioritizing one dimension over another.
This point made, as I said before, splitting metrics by content or not doesn't really matter as long as people can find everything easily.

just assigning this to myself so I can catch up with the bike-shedding and start work at some point hopefully soon.

Milimetric moved this task from Backlog (Later) to Dashiki on the Analytics board. · Jun 2 2016, 5:06 PM
Nuria lowered the priority of this task from High to Normal. · Jul 4 2016, 4:54 PM
Nuria moved this task from Dashiki to Backlog (Later) on the Analytics board.

@GWicke : do we still care about this issue or can this ticket be closed?

@Nuria, this is still something we'd like to do. The difficult part is settling on a layout.

Here is another proposal:

/page/views/{title}/{access}/{agent}/...
/feed/views/total/{access}/{agent}/...
/feed/views/top/{access}/...
/feed/views/unique-devices/{access}/...

This follows @mobrovac's proposal to group per-title statistics under the /page/ hierarchy. Instead of metrics, it uses the /feed/ hierarchy for project-global and time-varying data & content. I don't have strong objections to using metrics either, but I see the distinction between metrics and feeds as quite blurry. We chose feed to be quite generic, so that it can accommodate all kinds of time-changing data.

An orthogonal & somewhat more drastic change would be to replace the selection of granularity and time range with a single interval parameter:

/page/views/{title}/{access}/{agent}/{interval}
/feed/views/total/{access}/{agent}/{interval}
/feed/views/top/{access}/{interval}
/feed/views/unique-devices/{access}/{interval}

The interval parameter would combine granularity & range, offering only the combinations that make sense. Example values:

  • 2016: Returns totals for 2016, as well as totals for each month in 2016.
  • 2016-12: Returns totals for December 2016, as well as daily totals for each day of the month.

Later, we could add:

  • 2016-12-22: Returns the daily total for December 22, 2016, as well as hourly totals.
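
As a rough illustration of how a single interval value could stand in for granularity plus range, the following Python sketch parses the three example formats above. The function name `parse_interval` and the returned tuple shape are assumptions made for this example only; the proposal does not specify a parsing scheme.

```python
import calendar
from datetime import date


def parse_interval(interval):
    """Map an interval string to (granularity of the sub-totals, first day, last day).

    "2016"       -> monthly sub-totals for the whole year
    "2016-12"    -> daily sub-totals for the whole month
    "2016-12-22" -> hourly sub-totals for the single day
    """
    parts = [int(p) for p in interval.split("-")]
    if len(parts) == 1:
        (year,) = parts
        return "monthly", date(year, 1, 1), date(year, 12, 31)
    if len(parts) == 2:
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]  # number of days in the month
        return "daily", date(year, month, 1), date(year, month, last_day)
    if len(parts) == 3:
        day = date(*parts)
        return "hourly", day, day
    raise ValueError("unsupported interval: {!r}".format(interval))
```

Because each interval maps to exactly one fixed range, every distinct URL corresponds to one fixed response, which is the property the caching argument relies on.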

Advantages:

  • Makes caching effective by eliminating fragmentation.
  • Scales well to large time ranges, as each response can be fully pre-computed. No range requests or dynamic aggregations are necessary.

Disadvantages:

  • Fetching random intervals like "last 60 days" requires users to manually make multiple requests for sub-ranges. This can be addressed by providing a client-side library offering an interface for requesting arbitrary time ranges.
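
The client-side library mentioned in the disadvantage above could look roughly like the following Python sketch. It only computes which monthly interval values a client would need to request to cover an arbitrary date range; `month_intervals` and `last_n_days` are hypothetical helper names, and the actual fetching and trimming of daily totals is left out.

```python
from datetime import date, timedelta


def month_intervals(start, end):
    """List the "YYYY-MM" interval values whose months cover [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    intervals = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        intervals.append("{:04d}-{:02d}".format(year, month))
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return intervals


def last_n_days(n, today):
    """Interval values needed to cover the n days ending on (and including) today."""
    start = today - timedelta(days=n - 1)
    return month_intervals(start, today)
```

For example, `last_n_days(60, date(2016, 12, 22))` yields the three monthly intervals from October to December 2016; the client would then discard the daily totals that fall outside the requested 60 days.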
Nuria added a comment. · Dec 19 2016, 5:24 PM

@GWicke: What is the value proposition of this work? Consistency? It doesn't seem like it would add value from a performance or caching standpoint (maybe I am missing something here).

@GWicke: What is the value proposition of this work? Consistency? It doesn't seem like it would add value from a performance or caching standpoint (maybe I am missing something here).

Could you say more on why you think this would not improve performance & reduce cost?

Krinkle removed a subscriber: Krinkle. · Apr 17 2017, 10:14 PM
GWicke renamed this task from "configure RESTBase pageview proxy to Analytics' cluster on wiki-specific domains" to "Expose pageview data in each project's REST API". · Jul 12 2017, 7:46 PM
GWicke edited projects, added Services (later); removed Services.
fdans reassigned this task from Milimetric to Nuria. · Jul 27 2017, 3:53 PM
fdans edited projects, added Analytics-Kanban; removed Analytics.
mobrovac edited projects, added RESTBase-API; removed RESTBase. · Aug 1 2017, 3:25 AM
Nuria closed this task as Declined. · Aug 1 2017, 4:25 AM
Restricted Application removed a subscriber: Liuxinyu970226. · Aug 1 2017, 4:25 AM
Nuria added a comment. · Aug 1 2017, 4:46 AM

Declining because we do not feel this item delivers enough value to change our current URL scheme. Also, the proposed semantics break with aggregation, do they not?

Example:

Get data for all projects aggregated daily
GET http://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/all-agents/daily/2015100100/2015103100

Better than:
GET http://en.wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/all-agents/daily/2015100100/2015103100

Which sounds strange, as you are sending the request to a particular endpoint and asking for an aggregation across all projects.
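
The two requests in the example differ only in the host, which a small Python sketch makes explicit. The helper name `aggregate_url` is hypothetical; the path layout is taken from the URLs quoted above.

```python
def aggregate_url(host, project, access, agent, granularity, start, end):
    """Build a pageviews aggregate URL on the given host, using the path layout quoted above."""
    segments = [project, access, agent, granularity, start, end]
    return "http://{}/api/rest_v1/metrics/pageviews/aggregate/{}".format(
        host, "/".join(segments)
    )


# Same all-projects query, once on the global host and once on a per-project host.
params = ("all-projects", "all-access", "all-agents", "daily", "2015100100", "2015103100")
global_url = aggregate_url("wikimedia.org", *params)
project_url = aggregate_url("en.wikimedia.org", *params)
```

Seen this way, the objection is that the second URL names a specific host while its `all-projects` segment asks for data that is not specific to that host.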

mobrovac reopened this task as Open. · Aug 1 2017, 2:47 PM

Declining because we do not feel this item delivers enough value to change our current URL scheme. Also, the proposed semantics break with aggregation, do they not?

The idea here is not to change the URI scheme that you (or your tools?) use, but to complete the public API and to make pageview data more discoverable by exposing it on the project level. In other words, when the user navigates to https://en.wikipedia.org/api/rest_v1/ having pageview data appear there (a) completes the set; and (b) informs the user of the availability of the data for the project.

Example:
Get data for all projects aggregated daily
GET http://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/all-agents/daily/2015100100/2015103100
Better than:
GET http://en.wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/all-projects/all-access/all-agents/daily/2015100100/2015103100
Which sounds strange, as you are sending the request to a particular endpoint and asking for an aggregation across all projects.

The idea is to expose only project-related data there. The aggregation would still need to be going through the global wm.org domain.

Nuria added a comment. · Edited · Aug 1 2017, 3:00 PM

having pageview data appear there (a) completes the set; and (b) informs the user of the availability of the data for the project.

Sorry, we disagree: where does the unique devices data go with that URL scheme? Again, it seems like there is no place in the semantics for it. This ticket was opened when we had one API that was pageview-centric; we now have two pageview-based APIs (but with different breakdowns; pagecounts has old data) plus unique devices data per domain (not project), and we are going to have more API endpoints at the end of this quarter or next for which the project-based URL scheme does not seem so fitting. For example, project-family uniques, that is, unique devices in all *.wikipedia.org domains, deduplicated.

Nuria closed this task as Resolved. · Aug 9 2017, 12:27 AM
Nuria changed the task status from Resolved to Declined.
mobrovac reopened this task as Open. · Aug 9 2017, 3:26 AM

Sorry, we disagree: where does the unique devices data go with that URL scheme? Again, it seems like there is no place in the semantics for it. This ticket was opened when we had one API that was pageview-centric,

This ticket is about the pageview data, so the fact of having one more API now does not invalidate it.

we now have two pageview-based APIs (but with different breakdowns; pagecounts has old data) plus unique devices data per domain (not project)

Yes! Per domain, which is what this ticket is about.

and we are going to have more API endpoints at the end of this quarter or next for which the project-based URL scheme does not seem so fitting. For example, project-family uniques, that is, unique devices in all *.wikipedia.org domains, deduplicated.

Again, this does not (necessarily) fall within the scope of this particular ticket. For other APIs we can open new tickets and discuss there.

Nuria added a comment. · Aug 9 2017, 4:09 PM

This ticket is about the pageview data, so the fact of having one more API now does not invalidate it.

Again, I disagree. I think it is confusing that the API semantics for what we call AQS (Analytics Query Service) vary per data type requested. I really see little value for the user in making these changes, and even less so when the URL scheme will differ between APIs.

Nuria closed this task as Declined. · Aug 16 2017, 5:48 PM
mobrovac reopened this task as Open. · Sep 5 2017, 10:13 AM
mobrovac lowered the priority of this task from Normal to Low.

@Nuria there seems to be some confusion in this conversation. I am not proposing to migrate the existing endpoints to project-specific domains; I am proposing to add them there so that the API is more easily discoverable.

Nuria added a comment. · Edited · Sep 6 2017, 6:22 PM

Again, we do not feel it is worth making this change; to be honest, the benefit is small for the amount of work it would require.

Nuria closed this task as Declined. · Oct 26 2017, 6:10 PM