
configure RESTBase pageview proxy to Analytics' cluster {slug} [34 pts]
Closed, Resolved · Public

Description

The main RESTBase cluster needs to be configured to proxy requests to the Analytics Pageview service.

The discussion from T103811#1639220 onwards surfaced two main options:

Global vs. per-project stats

Some of the exposed page view metadata will be global (across projects), while other data will be per project or page. There are two main options for exposing this:

1) Per-project stats in each project, global stats at wikimedia.org

This would use a per-project layout like https://en.wikipedia.org/api/rest_v1/data/stats/views/enwiki/.... or https://en.wikipedia.org/api/rest_v1/stats/views/enwiki/.... , plus a similar global hierarchy at www.wikimedia.org.

2) Everything at wikimedia.org

The alternative is to expose everything in a single hierarchy. Per-project data would form a sub-hierarchy as in https://wikimedia.org/api/rest_v1/stats/{project}/views/...., in many ways mirroring the per-project hierarchy.

Discussion

Option 1) has a few advantages from the services perspective:
  • Security: Restricted access to per-project stats can be handled fairly easily in RESTBase, using the regular hierarchical ACL infrastructure.
  • Discoverability: Page stats show up alongside other project API documentation, which makes them easier to discover.
  • Performance: While accesses to page view data should be low volume initially, it's conceivable that gadgets or even production features would eventually request & use this information as part of regular page views. In this case, using the main project domain will lower access latency significantly by reusing an existing connection.

Advantages of the global option are:

  • Single entry point for all stats: Users can be pointed to a single repository of all stats, irrespective of the project they are interested in.

Details

Related Gerrit Patches:
operations/puppet : production · RESTBase: Set up the AQS public API
operations/puppet : production · Add a public endpoint for AQS

Event Timeline

Eevans created this task. · Oct 6 2015, 10:54 PM
Eevans raised the priority of this task from to Normal.
Eevans updated the task description.
Eevans added a project: RESTBase.
Eevans added subscribers: Eevans, Milimetric, GWicke, mobrovac.
Restricted Application added a subscriber: Aklapper. · Oct 6 2015, 10:54 PM
Eevans updated the task description. · Oct 6 2015, 10:57 PM
Eevans added a project: Analytics.
Eevans set Security to None.

These endpoints are the ones exposed by the AQS RESTBase instance. We have yet to bike-shed on the exact public layout. Yay bike-shedding :)

GWicke updated the task description. · Oct 7 2015, 12:07 AM
GWicke updated the task description.

I went ahead and hijacked the task description with a summary of my own. Please edit to reflect the discussion more accurately!

GWicke updated the task description. · Oct 7 2015, 12:09 AM
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.
Milimetric added a comment. · Edited · Oct 7 2015, 1:54 AM

Here's what I understood from our last discussion, I think it's fairly close to the task summary, but it's a bit simpler in my mind. We will have two endpoints that call our backend AQS RESTBase cluster:

  1. wikimedia.org/<<pass these parameters exactly how they are to the backend, the documentation in analytics/v1/pageviews.yaml can be used verbatim>>
  2. [language].[wiki project name].org/<<pass these parameters almost exactly to the backend, except don't expose the {project} parameter, just use the {domain} in place of it, documentation needs to be adjusted>>

Let me know if that makes sense. If not, let's set up a meeting soon to talk, I'll follow up tomorrow morning. If it makes sense, how can I help make it happen?

GWicke added a comment. · Edited · Oct 7 2015, 3:55 PM

@Milimetric, that's basically option 1). Sounds good to me.

Regarding the root, shall we use /api/rest_v1/stats/, /api/rest_v1/metrics/, /api/rest_v1/data/stats/?

My personal vote would be for /api/rest_v1/stats/ or /api/rest_v1/metrics/, reserving /api/rest_v1/data/ for actual content data, rather than metrics / stats about its usage. It's a blurry line though. Also, where should the metrics for the metrics usage go? ;P

Since it's proxying to the analytics query service, we chatted a bit and figured /api/rest_v1/analytics/ would make the most sense. I agree /data is something else.

If that's settled then, who's writing the patch? If I understand correctly it would be a change to the config.yaml.erb template in puppet, right?

GWicke added a comment. · Edited · Oct 7 2015, 4:11 PM

To me, 'metrics' seems to be a slightly more accurate description of what is available in that hierarchy. 'Analytics' has a higher-level ring to me. The wiki says "Analytics is the discovery and communication of meaningful patterns in data.", while metrics and stats are more closely associated with the raw statistical data, which is what we are exposing here.

My main concern is about making sure that we model this API so that it makes sense from a consumer perspective. Consumers don't know (and probably don't care) that this is backed by the analytics service or team; all they care about is finding the right content / data.

I suggested analytics not because it is connected to the 'Analytics team' or the 'Analytics cluster', but because we have named this endpoint the Analytics Query Service.

I like analytics a little better than metrics, but I am not strongly opinionated about it, and metrics is fine.

I agree with the wiki on what "analytics" means, and I think this particular part of the path isn't too related to the analytics query service. We may hit the analytics query service in other ways that don't go through this endpoint.

So let's settle on "metrics" since there are no strong objections.

Milimetric renamed this task from configure RESTBase pageview proxy to Analytics' cluster to configure RESTBase pageview proxy to Analytics' cluster {slug} [3 pts]. · Oct 7 2015, 5:56 PM
Milimetric claimed this task.
GWicke added a comment. · Edited · Oct 7 2015, 6:24 PM

@Milimetric, the new x-request-handler syntax should make this relatively straightforward to accomplish. You could use a simple proxy hierarchy with a /{+path} suffix to get something out quickly. ({+path} matches one or more optional path segments; also useful: {/optional} to match *one* optional path segment, both following http://tools.ietf.org/html/rfc6570.) The downside of this is relatively poor documentation. You could however start out with a simple prefix, and then expand with more docs later.
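To illustrate why the `+` in {+path} matters, here is a minimal Python sketch of the two RFC 6570 expansion types. It is not RESTBase code: RFC 6570 describes template expansion, while RESTBase applies the same syntax to route matching, so this only shows how "/" is treated differently. The `expand` helper and the sample suffix are assumptions made for the example, and pass-through of already percent-encoded triplets is not handled.

```python
from urllib.parse import quote

# Characters RFC 6570 treats as unreserved (besides letters and digits,
# which quote() never encodes) and as reserved.
UNRESERVED_EXTRA = "-._~"
RESERVED = ":/?#[]@!$&'()*+,;="

def expand(expression: str, value: str) -> str:
    """Expand a single expression such as "{path}" or "{+path}" (hypothetical helper)."""
    name = expression.strip("{}")
    if name.startswith("+"):
        # Reserved expansion: reserved characters such as "/" pass through,
        # so one variable can span several path segments.
        return quote(value, safe=UNRESERVED_EXTRA + RESERVED)
    # Simple expansion: anything outside the unreserved set is percent-encoded,
    # so the value stays within a single path segment.
    return quote(value, safe=UNRESERVED_EXTRA)

suffix = "per-article/en.wikipedia/all-access"  # illustrative value only
print(expand("{+path}", suffix))  # slashes kept
print(expand("{path}", suffix))   # slashes encoded as %2F
```

This is why a /{+path} suffix is enough for a catch-all proxy route: whatever follows the prefix, slashes included, is captured in one variable and can be forwarded unchanged.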

@GWicke, it seems like the right way would be to factor out the docs from specs/analytics/v1/pageviews.yaml and use them in both the front-end and back-end somehow, is that possible?

In any case, I'd rather do it right than rush it. If I understand correctly the {+path} method would just allow me to very quickly rewrite basically anything the user passes in and pass it to the back-end, but the user would have no documentation. That doesn't seem great.

The other thing I don't get at all is where the configs for the wikimedia.org "global" domain would go.

@Milimetric: yes, that ought to be possible. Only complication is that we'll have to template the URL of the actual backend service, and sub-specs don't currently support parametrization.

There are three main solutions to this:

  1. Template the sub-spec using puppet as well. This means code duplication and the need to keep the puppet & code versions in sync.
  2. Set up another indirection in /sys/, in config.yaml. The url templating / config then happens here, and the public spec just points to sys.
  3. Wrap the spec into a module, and parametrize the service url (based on the config vars passed to the module) in a small function that manipulates the parsed yaml before returning it as the spec object. This should already work, but isn't something we have done much yet.
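Option 3 can be sketched roughly as follows, in Python for brevity rather than in RESTBase's own JavaScript. Everything named here is an assumption for illustration: the `{{aqs_uri}}` placeholder, the `backend_uri` key, the `aqs_uri` config variable and the example URL are hypothetical, not part of any existing RESTBase spec format.

```python
PLACEHOLDER = "{{aqs_uri}}"  # hypothetical marker inside the parsed YAML

def build_spec(parsed_spec: dict, options: dict) -> dict:
    """Return a copy of the parsed spec with the backend URL filled in.

    `options` stands in for the config vars passed to the module.
    """
    def substitute(node):
        if isinstance(node, dict):
            return {key: substitute(value) for key, value in node.items()}
        if isinstance(node, list):
            return [substitute(item) for item in node]
        if isinstance(node, str):
            return node.replace(PLACEHOLDER, options["aqs_uri"])
        return node  # numbers, booleans, None are left untouched

    return substitute(parsed_spec)

# Stand-in for the result of parsing the spec YAML (structure is illustrative).
PARSED_SPEC = {
    "paths": {
        "/pageviews/{+path}": {
            "get": {"backend_uri": "{{aqs_uri}}/pageviews/{+path}"},
        },
    },
}

spec = build_spec(PARSED_SPEC, {"aqs_uri": "http://aqs.example/v1"})
print(spec["paths"]["/pageviews/{+path}"]["get"]["backend_uri"])
```

The point of this shape is that the documentation stays in one spec file, while the only deployment-specific value, the backend URL, is injected when the module is loaded.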

I think 3) is the most useful variant longer term, as it should be the solution that makes it easiest to keep the analytics and rest api use cases nicely packaged.

I'm all for doing 3. if it's the best way, but I'm not sure what you mean. Any more details are welcome, or I'll just ask tomorrow.

mobrovac moved this task from Backlog to Under discussion on the RESTBase board. · Oct 8 2015, 8:00 AM

Hm hm, we are starting to create a mess already :) The API needs to be structured. Really. As discussed in T103811, we ought to have at most 3 or 4 top-level hierarchies, and we seem to be adding a new one every time something comes around. </rant>

I suggest the following translation:

Public API endpoint | AQS endpoint
//{domain}/api/rest_v1/page/stats/{title}/{access}/{agent}/{granularity}/{start}/{end} | aqs://analytics.wm.org/v1/pageviews/per-article/{domain}/{access}/{agent}/{article}/{granularity}/{start}/{end}
//{domain}/api/rest_v1/data/stats/range/{access}/{agent}/{granularity}/{start}/{end} | aqs://analytics.wm.org/v1/pageviews/per-project/{domain}/{access}/{agent}/{granularity}/{start}/{end}
//{domain}/api/rest_v1/data/stats/top/{access}/{year}/{month}/{day} | aqs://analytics.wm.org/v1/pageviews/top/{domain}/{access}/{year}/{month}/{day}
//wm.org/api/rest_v1/data/stats/{project}/range/{access}/{agent}/{granularity}/{start}/{end} | aqs://analytics.wm.org/v1/pageviews/per-project/{project}/{access}/{agent}/{granularity}/{start}/{end}
//wm.org/api/rest_v1/data/stats/{project}/top/{access}/{year}/{month}/{day} | aqs://analytics.wm.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day}
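The per-domain rows of the table above amount to a small lookup from public route to AQS path template, with the request's {domain} filling the slot that AQS calls the project. A minimal Python sketch under that reading follows; the `to_aqs` helper, the route keys and the parameter values (such as "all-access") are illustrative assumptions, and the aqs:// base is copied from the table as written.

```python
AQS_BASE = "aqs://analytics.wm.org/v1/pageviews"

# Per-domain rows of the proposed table: public route prefix -> AQS path template.
# The public {title} parameter lands in the slot the table calls {article}.
PER_DOMAIN_ROUTES = {
    "page/stats": "/per-article/{domain}/{access}/{agent}/{title}/{granularity}/{start}/{end}",
    "data/stats/range": "/per-project/{domain}/{access}/{agent}/{granularity}/{start}/{end}",
    "data/stats/top": "/top/{domain}/{access}/{year}/{month}/{day}",
}

def to_aqs(route: str, **params: str) -> str:
    """Translate a public per-domain route into the backend AQS URI (hypothetical helper)."""
    return AQS_BASE + PER_DOMAIN_ROUTES[route].format(**params)

print(to_aqs(
    "data/stats/top",
    domain="en.wikipedia.org",
    access="all-access",
    year="2015", month="10", day="01",
))
```

The global-domain rows would use the same templates with {project} taken from the URL instead of from the domain.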

Why don't you guys like the /data/ hierarchy for this? IMO, it fits perfectly there.

May the bike-shedding force be with you :)

GWicke added a comment. · Edited · Oct 8 2015, 5:11 PM

@mobrovac, basically all entry points in the API return some kind of data. I think it makes sense to see the top-level hierarchy as "kinds" of data, rather than data vs. non-data. As such, "metrics" seems to be more descriptive and helpful to a user looking for the right API entry points.

We could still add a "data" or "miscellaneous" catch-all category if we encounter a long tail of random kinds, but we should be mindful that this won't be as helpful to a user as more specific kinds.

@mobrovac, basically all entry points in the API return some kind of data. I think it makes sense to see the top-level hierarchy as "kinds" of data, rather than data vs. non-data. As such, "metrics" seems to be more descriptive and helpful to a user looking for the right API entry points.

Makes sense. metrics works for me.

We could still add a "data" or "miscellaneous" catch-all category if we encounter a long tail of random kinds, but we should be mindful that this won't be as helpful to a user as more specific kinds.

Keeping data around might be useful (wikidata, citoid, etc).

Still, please take a look at the table in my previous comment. I really think it's worth separating per-article, per-project and top routes.

Change 245887 had a related patch set uploaded (by Milimetric):
[WIP] Add a public endpoint for AQS

https://gerrit.wikimedia.org/r/245887

Milimetric added a comment. · Edited · Oct 13 2015, 9:23 PM

I hesitate to disagree here, because it will ultimately cost me time. And a few people are blocked on this service, so I'm costing them time. Hopefully whatever decision we make we can make fairly quickly.

So it seems we agree with /metrics as a root to all analytics related data. That's a good start. There are three other points made that I'll respond to here:

  1. The AQS per-article endpoint should be available publicly at /page/metrics/pageviews
    • This makes sense because /page is a more natural place for per-page statistics. But then we (the analytics team) have to communicate to people that they have to go to /page to find this kind of data and /metrics for this other kind of data, and potentially to a couple of other places for other kinds of data. That's confusing for the consumer of analytics data. So I agree it's intuitive for the general API consumer who's browsing the structure. But I don't think that person's important to the analytics team just yet. If that person becomes real and starts complaining about /page not having analytics data, we can add it there. But until then, I'd like to cater to the analytics consumer, which I think would want everything in one place.
    • This also makes sense because we could bring problems like redirects, special characters in titles, renames, etc. all in one place. And if there's a good solution across the board, we could solve all those problems once instead of N times, for each endpoint that deals with article title as a parameter. But in reality our pipeline for getting data into Cassandra is very different and won't be easily merged with the other pipelines. So we can't solve problems in one place anyway. This is something I'd like to keep in mind for the future of the overall data pipeline at WMF.
  2. /per-project should be renamed /range. I think range is too broad, so I disagree. I don't love /per-project but it's descriptive.
  3. The {project} parameter should come before /top and /per-project. I disagree with this for the same reason as #1 above. All metrics should follow the same pattern so they're easily discovered and easily talked about in announcements / upgrade notices / fixes / etc.
  1. The AQS per-article endpoint should be available publicly at /page/metrics/pageviews
    • This makes sense because /page is a more natural place for per-page statistics. But then we (the analytics team) have to communicate to people that they have to go to /page to find this kind of data and /metrics for this other kind of data, and potentially to a couple of other places for other kinds of data. That's confusing for the consumer of analytics data. So I agree it's intuitive for the general API consumer who's browsing the structure. But I don't think that person's important to the analytics team just yet. If that person becomes real and starts complaining about /page not having analytics data, we can add it there. But until then, I'd like to cater to the analytics consumer, which I think would want everything in one place.

You care only about your stuff, and that's understandable. But the point of this ticket is to find a suitable way of integrating your needs with everybody else's in a coherent way and not solely to get your stuff out (even though we'd all like to see it see the light of day ASAP) ;) What I'd like to point out here is that once a public API layout (for metrics here, e.g.) is out in the wild, it's rather hard to change it. I don't like bike-shedding any more than the next person, but we are making potentially far-reaching decisions here.

To get back to the discussion at hand, I think we agree that, conceptually, article-related data and information should go under /page/. So, that's settled. Now, I'm not opposed to having your whole API structure exposed under the global domain as well (see below), which I think takes care of your point about confusing metrics users as to which route to use.

  • This also makes sense because we could bring problems like redirects, special characters in titles, renames, etc. all in one place. And if there's a good solution across the board, we could solve all those problems once instead of N times, for each endpoint that deals with article title as a parameter. But in reality our pipeline for getting data into Cassandra is very different and won't be easily merged with the other pipelines. So we can't solve problems in one place anyway. This is something I'd like to keep in mind for the future of the overall data pipeline at WMF.

Yup, good point re: renames and redirects. But for the title names themselves, we're constrained by the URI-composition laws.

  1. /per-project should be renamed /range. I think range is too broad, so I disagree. I don't love /per-project but it's descriptive.

Heh, true. The reason I don't like per-project is that when I look at the hierarchy - first I put the domain, then say I want the metrics, and then say I want it per-project - that seems really out of context. How about aggregate ?

  1. The {project} parameter should come before /top and /per-project. I disagree with this for the same reason as #1 above. All metrics should follow the same pattern so they're easily discovered and easily talked about in announcements / upgrade notices / fixes / etc.

I'd also be fine with having the following structure for the global domain:

https://wm.org/api/rest_v1/metrics/{project}
  -- per-project  # or aggregate
  -- per-article
  -- top

That is, mirror all of the single-domain endpoints as well in the global domain. That way, metrics-centric users can use only that, while others are still able to discover domain- or article-related metrics. That would entail having multiple routes pointing to the same data which hurts caching. But we can't have it both ways, I guess.

Also, if we do end up with exposing everything under the global domain, but also have per-project and per-article routes under each domain, we could separate that in two steps:

  1. expose the public API under the global domain; and
  2. flesh out and expose domain-specific and title-specific API endpoints

Note though, that in either case we need to wait for Ops (so next week is the fastest we can move).

https://wm.org/api/rest_v1/metrics/{project}

  • per-project # or aggregate
  • per-article
  • top

It seems to me that the global domain is much easier to get consensus on for now. So let's just configure that to start, and then continue the discussion about URI structure for the domain-specific endpoints without the time pressure.

So then for only the global domain, we have two issues left.

  1. per-project vs. aggregate. I see your point, if we use {project} before the {per-project|per-article|top} choices. But, again, I don't understand how that impacts documentation / discoverability. Do we have to re-write our pageviews spec? Are we re-using that or making a new one? Some hints about how this is done would be useful. I agree in principle, and I think "aggregate" is a good name, especially since some of the possible values for {project} might be things like "all-en" or "all-wikipedia".
  2. {project} being in front of the rest of the URI. We talked about this a few times, and I thought we settled on {project} being later in the URI structure. The reason is that it's not always a project, it could be "all-en" in the future, and we are not going to list all the possible projects for people; we're just going to give them the template for a project name and list all the special values.

It seems to me that the global domain is much easier to get consensus on for now. So let's just configure that to start, and then continue the discussion about URI structure for the domain-specific endpoints without the time pressure.

Yay for consensus :)

So then for only the global domain, we have two issues left.

  1. per-project vs. aggregate. I see your point, if we use {project} before the {per-project|per-article|top} choices. But, again, I don't understand how that impacts documentation / discoverability. Do we have to re-write our pageviews spec? Are we re-using that or making a new one? Some hints about how this is done would be useful. I agree in principle, and I think "aggregate" is a good name, especially since some of the possible values for {project} might be things like "all-en" or "all-wikipedia".

Cool, let's go with aggregate then in the global domain as well. As for the spec, frankly, the easiest thing to do right now is to simply copy the existing analytics.yaml spec and make the desired changes there directly. A more long-term solution (which we can also defer for later, IMHO) would be to create a module, feed it the analytics.yaml spec and let it dynamically modify the path segments.
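The longer-term idea of a module that modifies path segments could look roughly like this Python sketch. It assumes the parsed spec keeps its routes under a Swagger-style `paths` mapping; the `rename_segment` helper, the sample paths and the empty handler objects are hypothetical and only show the per-project to aggregate rename discussed here.

```python
def rename_segment(spec: dict, old: str, new: str) -> dict:
    """Return a copy of the spec with one literal path segment renamed.

    Only whole segments are compared, so a parameter such as {project}
    is not affected when renaming "per-project".
    """
    renamed = dict(spec)
    renamed["paths"] = {
        "/".join(new if segment == old else segment for segment in path.split("/")): handler
        for path, handler in spec["paths"].items()
    }
    return renamed

# Stand-in for the parsed analytics spec (contents are illustrative).
analytics_spec = {
    "paths": {
        "/pageviews/per-project/{project}/{access}/{agent}/{granularity}/{start}/{end}": {},
        "/pageviews/top/{project}/{access}/{year}/{month}/{day}": {},
    },
}

public_spec = rename_segment(analytics_spec, "per-project", "aggregate")
for path in public_spec["paths"]:
    print(path)
```

Copying the spec by hand gets the same result faster today; the module approach mainly pays off once the backend and public layouts need to stay in sync over time.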

  1. {project} being in front of the rest of the URI. We talked about this a few times, and I thought we settled on {project} being later in the URI structure. The reason is that it's not always a project, it could be "all-en" in the future, and we are not going to list all the possible projects for people; we're just going to give them the template for a project name and list all the special values.

In your earlier comment you said:

  1. The {project} parameter should come before /top and /per-project.

So I'm confused now. Honestly, for the global domain, I'm fine with either, even though I'm leaning towards /metrics/{project}/top/... because of the URI layout logic - I'm asking from the global domain metric data for a specific project and top articles. But, as I said, /metrics/top/{project}/... works for me as well.

  1. The {project} parameter should come before /top and /per-project.

So I'm confused now. Honestly, for the global domain, I'm fine with either, even though I'm leaning towards /metrics/{project}/top/... because of the URI layout logic - I'm asking from the global domain metric data for a specific project and top articles. But, as I said, /metrics/top/{project}/... works for me as well.

Sorry, my bad. I was saying that was one of the points you made that I wanted to discuss. I'd prefer {project} to go after the /top part.

So, to wrap this up: the only thing needed is renaming "per-project" to "aggregate". I'd appreciate some help with the spec itself, as I'm pretty sure I'm forever confused about that stuff. I'll ask in IRC if that's ok, so as to not clutter this discussion.

Change 247935 had a related patch set uploaded (by Mobrovac):
RESTBase: Set up MobileApps storage and AQS public API

https://gerrit.wikimedia.org/r/247935

Change 245887 abandoned by Milimetric:
Add a public endpoint for AQS

Reason:
abandoned in favor of I17ae36660ebb374e7062cd1e4ad4634ffddf66a7

https://gerrit.wikimedia.org/r/245887

Change 247935 merged by Alexandros Kosiaris:
RESTBase: Set up the AQS public API

https://gerrit.wikimedia.org/r/247935

mobrovac removed a subscriber: gerritbot.
Milimetric renamed this task from configure RESTBase pageview proxy to Analytics' cluster {slug} [3 pts] to configure RESTBase pageview proxy to Analytics' cluster {slug} [34 pts]. · Oct 23 2015, 4:53 PM
Milimetric moved this task from In Progress to In Code Review on the Analytics-Kanban board.
Milimetric moved this task from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Milimetric moved this task from Ready to Deploy to Done on the Analytics-Kanban board.
Nuria closed this task as Resolved. · Nov 16 2015, 7:44 PM
mobrovac reopened this task as Open. · Nov 16 2015, 7:46 PM

This hasn't been resolved yet as we haven't made any progress on domain-specific URI paths. Currently only the global domain exposes the AQS public endpoints.

I think we should open up a new task for that, for the sake of marking that some progress was made. But I'll leave that up to you. I'm ok leaving this open too because we did commit to do that work eventually, and it's good to get it done.

kevinator closed this task as Resolved. · Nov 19 2015, 5:02 PM
kevinator added a subscriber: kevinator.