Public API endpoints for new services
Closed, Resolved, Public

Description

We need a plan regarding the outlook and extension of our public API hierarchy.

Current Status

Thus far, RESTBase has been exposing only the API pertaining to pages retrieved from Parsoid, their transforms and information about revisions fetched from the MW API. With the latest addition of end-points proxying requests to Graphoid, a domain's v1 root tree looks like this:

/{domain}/v1
-- /page
   -- /revision
   -- /title
   -- /namespace
   -- /html
   -- /data-parsoid
   -- /graph
-- /transform

Most of the endpoints are grouped under /page because, in one way or another, they relate to a page or an aspect of it; all but /revision require the client to supply the page title. Another aspect to note is that currently page formats (/html) are mixed with types (such as /graph).

New Endpoints

The API needs to evolve because we want/need to:

  • expose more endpoints mapping to MW API calls to increase the comprehensiveness of the provided endpoints
  • proxy new services and ultimately remove the Parsoid caches

The next services we want to proxy include Mathoid, Citoid, the MobileApps service and Revscoring.

MobileApps

The aim of the MobileApps service is to provide page content massaged for the needs of (native) mobile readers. It is able to generate two types of output: the full-blown version targeted at newer, more powerful mobile devices, and a lite version targeting older and slower devices. Since it deals directly with page content, it seems logical to put its endpoints under /page:

/{domain}/v1
-- /page
   -- /mobile-text/{title}
   -- /mobile-html/{title}
   -- /mobile-html-sections/{title}
   -- /mobile-html-sections-lead/{title}
   -- /mobile-html-sections-remaining/{title}

Revscoring

Likewise, revscoring is about classifying revisions, and is, thus, tied to a particular revision. Hence, it is natural to place it under /page/revision:

/{domain}/v1
-- /page/revision/{revision}
   -- /score/
   -- /score/wp10
   -- /score/reverted

Notes:

  • /score/ would give a listing of available scoring methods, while the others would yield their respective results for a given revision.
  • The service itself supports revision batching - supplying multiple revisions for which to get the score. This is currently not covered by these API endpoints.
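
As a rough illustration of the layout above, the following sketch builds the path for either the listing or a single scoring method. The function name score_path and the example revision ID are hypothetical; only the route shape comes from the tree above.

```python
from urllib.parse import quote


def score_path(domain, revision, model=None):
    """Build the path of a score endpoint for a single revision.

    With model=None the path points at the listing of scoring methods;
    otherwise it points at the result of one method (e.g. wp10).
    """
    base = f"/{quote(domain, safe='')}/v1/page/revision/{int(revision)}/score/"
    if model is None:
        return base
    return base + quote(model, safe="")


# Hypothetical revision ID, used for illustration only.
print(score_path("en.wikipedia.org", 12345))
print(score_path("en.wikipedia.org", 12345, "wp10"))
```

Since each path carries exactly one revision, batching would need a different route shape, which matches the second note above.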

Mathoid

Math endpoints have no particular relation to the page where mathematical symbols and elements are included, so they can be moved into a separate sub-tree:

/{domain}/v1
-- /media
   -- /math
      -- /{format}/{query}

The rationale for media here is that Mathoid emits media representations (images) of mathematical formulae. query contains the textual representation of the formula to render. A formula can be long, making the URL long as well, but in practice browsers and servers support at least 2K-long URLs, which is sufficient in this case.
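
A minimal sketch of how a client might build such a URL path and guard against the length concern. The names math_path and MAX_URL_LENGTH, the limit of 2000 characters (a conservative reading of "2K") and the svg format value are assumptions made for illustration; the source does not list the supported formats.

```python
from urllib.parse import quote

# Assumed limit: a conservative reading of the "2K" figure mentioned above.
MAX_URL_LENGTH = 2000


def math_path(domain, fmt, formula):
    """Build /{domain}/v1/media/math/{format}/{query} for one formula."""
    # Percent-encode everything, including '/', so that the formula
    # always stays within a single path segment.
    query = quote(formula, safe="")
    path = f"/{domain}/v1/media/math/{fmt}/{query}"
    if len(path) > MAX_URL_LENGTH:
        raise ValueError(f"path is {len(path)} characters long, limit is {MAX_URL_LENGTH}")
    return path


# "svg" is a hypothetical format value.
print(math_path("en.wikipedia.org", "svg", "E = mc^2"))
```

Note that percent-encoding can roughly triple the length of a formula made up mostly of reserved characters, so the limit applies to the encoded form rather than to the raw text.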

Citoid

In the same vein as Mathoid, Citoid's output does not depend on the page on which the citation is displayed:

/{domain}/v1
-- /data
   -- /citation
      -- /{format}/{query}

Other Services

These examples establish a satisfactory root hierarchy:

/{domain}/v1
-- /page
-- /transform
-- /media
-- /data

Using such an outline, it is rather easy to place new services, such as Hierator or Wikidata:

/{domain}/v1
-- /media
   -- /hieroglyph
   -- /thumb
-- /data
   -- /wikidata

Event Timeline

mobrovac raised the priority of this task from to High.
mobrovac added a project: RESTBase-API.
mobrovac added subscribers: mobrovac, GWicke.

Some ideas from IRC:

/{domain}/v1
  /page
    ...
  /media
    /math
    /thumb
    /graph
  /data
    /citation

So far all entry points under /page/ follow the pattern of /page/{type}/. The proposal for the mobile entry points instead introduces a sub-hierarchy of /page/mobile/{type}/, which emphasizes a distinction between mobile and everything else.

I think we should encourage the development of generally useful entry points wherever possible. A way to do this is to follow the same /page/{type}/ pattern as with other entry points. For content types that are truly mobile-specific, we could perhaps include 'mobile' in the type.

Strawman:

/{domain}/v1
-- /page
   -- /mobile-html-lead/{title}{/revision}
   -- /mobile-html-full/{title}{/revision}  # or /standard/{title}
   -- /mobile-html-remainder/{title}{/revision}
   // Could be mobile- prefixed too, but maybe generally useful?
   -- /text-lead/{title}{/revision}
   -- /remainder-text/{title}{/revision}

In any case, we should mark these entry points as experimental until things settle down. Generally useful entry points could then graduate via unstable to stable later on.

@bearND, @Dbrant, what do you think?

How about the following?

/{domain}/v1
-- /page
   -- /mobile-html/{title}{/revision} # or /mobile-html-full/{title}{/revision} # but there is currently only one planned
   -- /mobile-json-lite/{title}{/revision}
   -- /mobile-json-full/{title}{/revision} # the following two (*-lead and *-remainder) are subsets of this output
   -- /mobile-json-lead/{title}{/revision}
   -- /mobile-json-remainder/{title}{/revision}

The {revision} option would come later once we either get rid of the action=mobileview dependency or add the ability to specify a revision in action=mobileview. See also T106143.

@bearND: That looks reasonable to me as well.

Do you think any of those end points could be useful outside of the mobile context? For example, the lead section plus some metadata sounds like it might potentially be useful for mobile web and hovercards-like functionality as well. The downside of making it more general purpose & having other users would be a need for more stability; you couldn't treat it as a private end point.

If some of these end points are more general-purpose, then it might make sense to reflect that in the naming, so that it's clear to consumers.

@GWicke I'd prefer to keep the mobile- prefix for now.

  1. The html route would conflict with Parsoid. Even if we renamed it slightly it would probably be confusing to RESTBase users.
  2. Having one of our routes with mobile- prefix and another without seems confusing.
  3. I think we could move from a specific to a more general URI later (through redirections/rewrites) but not the other way if someone else wants to claim it. -> Preserves flexibility.
  4. It makes the intent of the service to be a custom aggregation API for mobile (apps, maybe web later) clearer.

@bearND: Good points. Let's keep the prefix.

For the lite end point, do you possibly expect a lead / remainder split later?

@GWicke. Good point. I think a lead / remainder split could be a future option. @Dbrant, what do you think?

@bearND, @Dbrant: mobile-json-lite-{full,lead,remainder} ?

How about mobile-light as the base name (without json), and then mobile-light-lead? That could perhaps go well with mobile-html-lead and mobile-html-remainder.

Either one sounds fine. Would we also change mobile-json-full to just mobile-json?

(I was originally planning to have mobile/json and the sub-routes mobile/json/lead + mobile/json/remainder.)

mobile/json/... does sound more REST-y, but, on the other hand, breaks the RESTBase hierarchy. As for the lead section (and prefix), I agree we should keep the prefix, mostly because, while it delivers only the lead section, it is still optimised for mobile views.

@GWicke Sorry, I misread your earlier comment. I don't think we'd want to use light or lite as a base name.

The lite route is different from the json route. The lite route is for a pure native presentation on the apps. The other json routes will have some HTML included, at least initially, so that the full page can still be displayed in a WebView. You could see it as a hybrid, since the enhanced link preview would be native but once the user goes to the full page it would be presented inside a WebView.

I would also like to express that the lead and remainder routes are sub-resources of mobile-json-full. So, I could also see the lead and remainder routes as sub-routes of mobile-json-full.

Option A:

-- /mobile-json-lite/{title}
-- /mobile-json/{title}
-- /mobile-json/lead/{title}
-- /mobile-json/remainder/{title}

What do you think of that? If that's too confusing or if we should really stick to one level then I could also go for option B.

Option B:

-- /mobile-json-lite/{title}
-- /mobile-json-full/{title}
-- /mobile-json-lead/{title}
-- /mobile-json-remainder/{title}

@bearND: I think it would be great if those names made sense to a third-party user as well as to you. The fact that something is JSON-encoded is probably less interesting to those users than *what* kind of content is JSON-encoded. If the 'lite' version provides basically textual content, how about calling it 'mobile-text' ? The 'hybrid' version.. maybe mobile-html or mobile-simplehtml?

Regarding the layout, the nested hierarchy would lead to a conflict for articles named 'lead' or 'remainder'. IMHO, keeping that at one level is cleaner.
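
To make the conflict concrete, here is a small sketch with a toy route matcher (not RESTBase's actual router). It assumes the optional {/revision} suffix from the earlier proposals is in play, expanded here into two explicit patterns; the helper names match and candidates are hypothetical.

```python
# Nested layout (option A), with the optional revision expanded.
NESTED = [
    "/mobile-json/{title}",
    "/mobile-json/{title}/{revision}",
    "/mobile-json/lead/{title}",
]

# Flat layout (option B), same expansion.
FLAT = [
    "/mobile-json-full/{title}",
    "/mobile-json-full/{title}/{revision}",
    "/mobile-json-lead/{title}",
]


def match(pattern, path):
    """Return the parameter bindings if path fits pattern, else None."""
    pattern_segments = pattern.strip("/").split("/")
    path_segments = path.strip("/").split("/")
    if len(pattern_segments) != len(path_segments):
        return None
    params = {}
    for expected, actual in zip(pattern_segments, path_segments):
        if expected.startswith("{") and expected.endswith("}"):
            params[expected[1:-1]] = actual  # parameter: bind any value
        elif expected != actual:
            return None  # literal segment does not match
    return params


def candidates(patterns, path):
    """Return every (pattern, bindings) pair that matches path."""
    found = []
    for pattern in patterns:
        params = match(pattern, path)
        if params is not None:
            found.append((pattern, params))
    return found


# Is this the article "lead" at revision 12345, or the lead of article "12345"?
for pattern, params in candidates(NESTED, "/mobile-json/lead/12345"):
    print(pattern, params)

# The flat layout leaves a single interpretation.
for pattern, params in candidates(FLAT, "/mobile-json-lead/12345"):
    print(pattern, params)
```

A router can resolve the nested case by giving literal segments priority, but then the article titled "lead" becomes unreachable at a specific revision, which is the conflict described above.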

@bearND: I think it would be great if those names made sense to a third-party user as well as to you. The fact that something is JSON-encoded is probably less interesting to those users than *what* kind of content is JSON-encoded. If the 'lite' version provides basically textual content, how about calling it 'mobile-text' ? The 'hybrid' version.. maybe mobile-html or mobile-simplehtml?

I agree. Having everything prefixed with mobile-json seems redundant too.

Regarding the layout, the nested hierarchy would lead to a conflict for articles named 'lead' or 'remainder'.

+1. I've updated the task description to reflect @bearND's option B:

-- /mobile-json-lite/{title}
-- /mobile-json-full/{title}
-- /mobile-json-lead/{title}
-- /mobile-json-remainder/{title}

I'll keep the task description with the complete list of routes updated as the conversation evolves.

I added a section about the Revscore service as well

@Halfak I am curious if people will want to access revscoring data not by passing revision IDs but other types of identifiers (for example a page ID or a page title, which would resolve to the head revision of the corresponding page). Mapping an article title to a head rev ID probably belongs to a wrapper/client, not to the API, but I'd like to hear your thoughts on what makes most sense to the end user if we need to come up with a generic API layout for revscoring.

DarTar, you're right. This is a use-case we're already working on. There's a generalization that we can make relatively easily to the scoring service to allow it to accept different inputs (like page_id and title). It's just that we have limited our focus for the API at the moment. We actually support scoring in a much more general way internally. For example, I started working with @Harej to develop an article importance predictor that would not have a relevant rev_id to draw from, but rather a page_id. The scoring/feature extraction part of the system is designed to use these different types of inputs, so long as the required features can operate based on that input.

But for the meantime and in the context of RESTBase, it seems like the revision-based model makes the most sense.

@bearND: I think it would be great if those names made sense to a third-party user as well as to you. The fact that something is JSON-encoded is probably less interesting to those users than *what* kind of content is JSON-encoded. If the 'lite' version provides basically textual content, how about calling it 'mobile-text' ? The 'hybrid' version.. maybe mobile-html or mobile-simplehtml?

I agree. Having everything prefixed with mobile-json seems redundant too.

Ok, I've left out the html route, which we still have for now but are not publishing. So, not everything would actually start with json. And there could be a mobile-html route as well.

I agree that json in the name is not the most meaningful. I just haven't come up with a better name yet. The current plan for .../mobile-json-full/{title} is to be somewhat similar in scope to the current mobileview action PHP API in that it has JSON, an array of sections which in turn have embedded HTML for the section text. The JSON structure could be a bit different from mobileview, though. I basically only want to put in what the mobile apps need, and provide most if not all DOM transformations on the service side. The lead and remainder routes are basically subsets of the full one. I'm open to suggestions for better names.

This thread is becoming a bit unwieldy. We will need to split the general top-level discussion from the per-service discussions soon.

@bearND, let me respond on T102130.

Bike-shed: we need to find a good candidate for naming the shared domain to be used by sub-APIs which do not really depend on the supplied domain or aggregate some information for all domains. Such an example is Mathoid's API: it renders formulae in exactly the same way regardless of the wiki from which the request is being made.

Some possible candidates:

  • common.domain
  • shared.domain
  • general
  • no.domain

Personally, I could go for common.domain as I think it captures both usages we need - aggregated / shared resources, as well as domain-independent APIs.

Other candidates previously discussed:

  • global.wikimedia.org
  • central.wikimedia.org
  • window.wikimedia.org (haha!)

I wonder if it'd be wise to use a valid DNS record for this.

We'll probably want to expose some of the global data. An example use case would be global traffic stats across projects.

Right, but since we plan to kill rest.wm.org and use only domains' /api/rest_v1/ rewrites, this can effectively be a hidden, i.e. practically unreachable, domain. The global stats (or other aggregated data) can then be redirected to use our special, internal domain.

Basically, while rest.wm.org is still active, choosing something like common.domain relieves us (only for a short while, though!) from securing the needed DNS sub-domain record.

Alternatively, we may go with something like no.domain which we can view as a domain internal to RESTBase only, where only domain-independent sub-APIs reside. Concretely, /no.domain/sys/ could be used by all back-end services not needing to be tied to a specific domain.

That leaves us the room and possibility to decide on the concrete name of the domain for public, aggregated resources.

I meant that we'll probably want to *publicly* expose some of the global data, so will need a resolvable domain. We don't need to expose all information stored in this domain, but that's a matter of defining the public API and authentication settings.

By the way, a bit relevant to this, we decided we would serve all of our Pageview API endpoints from the common endpoint as opposed to the wiki-specific endpoints. The main reason for this is that "project" in our case does not map to "site domain" in a clear way. For example:

/en.wikipedia/mobile-web/ would map to en.m.wikipedia.org maybe?
but then /en.wikipedia/mobile-app/ would be really confusing to map to something
and in any case /en.wikipedia/all-access would be confusing too

Let us know if you want to talk this over.

@Milimetric, the choice is basically between something like

https://en.wikipedia.org/api/rest_v1/stats/access/mobile_web/...
https://en.wikipedia.org/api/rest_v1/stats/access/mobile_app/...
https://en.wikipedia.org/api/rest_v1/stats/access/desktop_web/...

and

https://global.wikimedia.org/api/rest_v1/stats/access/mobile_web/en.wikipedia.org/....
https://global.wikimedia.org/api/rest_v1/stats/access/mobile_apps/en.wikipedia.org/....
https://global.wikimedia.org/api/rest_v1/stats/access/desktop_web/en.wikipedia.org/....

or some variant thereof.

I'm curious which advantages made the difference for your decision to go with the latter option.

In my personal opinion, the former (using the regular domain) has the advantage of making per-project stats more discoverable by including them in the regular APIs. It also avoids introducing another ad-hoc project hierarchy.

If those stats are going to be fetched on view, then there is also a significant performance advantage in reusing an existing connection to the project domain, rather than performing a DNS lookup, TCP and finally TLS handshake for a separate domain.

We'll have an easier time locking down such paths as well, as general ACLs can be inherited from the domain. Stats for private wikis can be protected along with the data.

I meant that we'll probably want to *publicly* expose some of the global data, so will need a resolvable domain. We don't need to expose all information stored in this domain, but that's a matter of defining the public API and authentication settings.

Yup. I'm suggesting to have two - an internal domain which would hold domain-independent data for back-end services (think Mathoid) and a resolvable one which exposes certain resources to the public (think pageviews).

By the way, a bit relevant to this, we decided we would serve all of our Pageview API endpoints from the common endpoint as opposed to the wiki-specific endpoints. The main reason for this is that "project" in our case does not map to "site domain" in a clear way. For example:

/en.wikipedia/mobile-web/ would map to en.m.wikipedia.org maybe?
but then /en.wikipedia/mobile-app/ would be really confusing to map to something
and in any case /en.wikipedia/all-access would be confusing too

I don't think you need to map a specific access to a domain. After all, that's counter-intuitive since most of the domains can be accessed by desktop or mobile devices. Adding to that the fact that en.m.wikipedia.org is just a rewrite of en.wikipedia.org, it is clear they should be treated as equals. As suggested by @GWicke, putting the access type as a URI parameter would achieve what you want: https://en.wikipedia.org/api/rest_v1/stats/{access}/... (where {access} can be desktop, mobile-web or mobile-app).
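
A short sketch of the access-as-parameter idea, using the three access values named above. The function name stats_url is hypothetical, and the part of the route after {access} is left out because it is elided in the discussion.

```python
# Access types as listed in the comment above.
ACCESS_TYPES = ("desktop", "mobile-web", "mobile-app")


def stats_url(domain, access):
    """Return the per-project stats URL prefix for one access type."""
    if access not in ACCESS_TYPES:
        raise ValueError(f"unknown access type: {access!r}")
    # Only the prefix is built; the remainder of the route is not
    # specified in the discussion.
    return f"https://{domain}/api/rest_v1/stats/{access}/"


for access in ACCESS_TYPES:
    print(stats_url("en.wikipedia.org", access))
```

With this shape the domain identifies the project and the access type is just another path segment, so no mapping from access type to a separate mobile domain is needed.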

In my personal opinion, the former (using the regular domain) has the advantage of making per-project stats more discoverable by including them in the regular APIs. It also avoids introducing another ad-hoc project hierarchy.

Looking at the overall public API structure, it also may be confusing to have two different domains/project names in the URI: https://global.wikimedia.org/api/rest_v1/stats/en.wikipedia.org/...

If those stats are going to be fetched on view, then there is also a significant performance advantage in reusing an existing connection to the project domain, rather than performing a DNS lookup, TCP and finally TLS handshake for a separate domain.

I share this concern, but AFAIK this is not an important factor for Analytics, since I presume most clients fetch their data outside of the scope of a project (@Milimetric correct me if I'm wrong).

We'll have an easier time locking down such paths as well, as general ACLs can be inherited from the domain. Stats for private wikis can be protected along with the data.

But this is a serious concern. +1

I'm suggesting to have two - an internal domain which would hold domain-independent data for back-end services (think Mathoid) and a resolvable one which exposes certain resources to the public (think pageviews).

Which benefit do you see in having two domains?

While some endpoints could be shared between the resolvable and normal domains, it is going to have a different spec. The same goes for the domain-independent internal domain, for which I can't see sharing much with either of the other two. Even in cases where that is the case (e.g. https://global.wm.org/api/rest_v1/media/math/{hash}), this can be easily mapped / fixed to the internal domain.

There seem to be four themes: URI structure, discoverability, performance, and security.

  1. structure. In general, I think our notion of "project" doesn't map very well to the notion of "domain". In the "project" parameter we can specify different types of aggregations as well as specific projects. We may, in the future, choose to add other types of aggregations such as "all-en-projects" or "all-wiktionary-projects" which would further confuse matters. Our modules are also not domain-specific, and we'll add modules that will not even have any project parameter at all. So a smart proxy that knows how to serve all our endpoints is not possible. And a manual one that we maintain doesn't sound like fun.
  2. discoverability and 3. performance. These two seem tied together, the closer the request to the wiki itself, the faster humans and computers can find it :) Marko is right that a lot of our initial requests will probably not come from the wikis themselves. But we have heard of pageview API use cases from extension developers and teams at WMF building features, so we want to support this as a first class use case. However, we think a simple pass-through proxy, as ugly as that may be, is better here than a smart one that tries to transform the domain into the project. So if performance becomes a concern, could we just forward:

en.wikipedia.org/api/rest_v1/pageviews/top/all-access/fr.wikipedia/2015/all-months/all-days

to

global.wikipedia.org/api/rest_v1/pageviews/top/all-access/fr.wikipedia/2015/all-months/all-days

  4. security

We'll have an easier time locking down such paths as well, as general ACLs can be inherited from the domain. Stats for private wikis can be protected along with the data.

Right now we're not going to make private wiki stats available via this API. I'm not familiar with what's possible to lock down and what is not, so maybe we need to talk more here.

While some endpoints could be shared between the resolvable and normal domains, it is going to have a different spec

It can also be in the same spec. We can expose exactly the things we want publicly, and leave everything else private in /sys/.

In general, I think our notion of "project" doesn't map very well to the notion of "domain".

I think it's clear that global aggregations will be useful and needed. Those aggregations don't overlap with existing domains though, so there shouldn't be much of an issue of having many ways to do / name the same thing.

en.wikipedia.org/api/rest_v1/pageviews/top/all-access/fr.wikipedia/2015/all-months/all-days

This doesn't make much sense to me. Something like en.wikipedia.org/api/rest_v1/stats/views/top/all-access/2015/all-months/all-days rewriting to global.wikipedia.org/api/rest_v1/stats/views/en.wikipedia.org/top/all-access/2015/all-months/all-days could work, but I don't think that it makes sense to expose other projects' views at en.wikipedia.org. How many top-level hierarchies do you expect to have beyond views?
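
For illustration, the rewrite described here could look roughly like the sketch below. It is only a sketch: the function name rewrite_to_global is hypothetical, the global host name is copied from the example above and was not settled at this point in the discussion, and a real deployment would more likely do this in front-end configuration than in application code.

```python
# Host name as written in the example above; the final name was undecided.
GLOBAL_DOMAIN = "global.wikipedia.org"
PREFIX = "/api/rest_v1/stats/views/"


def rewrite_to_global(domain, path):
    """Map a per-project views path to its global-domain equivalent.

    The requesting domain is inserted as the first segment after the
    shared prefix, so the global API sees it as the project parameter.
    """
    if not path.startswith(PREFIX):
        raise ValueError(f"not a views path: {path}")
    rest = path[len(PREFIX):]
    return f"{GLOBAL_DOMAIN}{PREFIX}{domain}/{rest}"


print(rewrite_to_global(
    "en.wikipedia.org",
    "/api/rest_v1/stats/views/top/all-access/2015/all-months/all-days",
))
```

Because the project is always derived from the requesting domain, this scheme cannot expose another project's views at en.wikipedia.org, in line with the point made above.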

It can also be in the same spec. We can expose exactly the things we want publicly, and leave everything else private in /sys/.

Touché, but having two separate domains would allow us to continue working on expanding the API (e.g. Mathoid, Citoid) while bike-shedding on the DNS record ;)

Other candidates previously discussed:

  • global.wikimedia.org
  • central.wikimedia.org
  • window.wikimedia.org (haha!)

I actually like window.wm.org the most, as in this is a window into the WM world. global has some political connotations IMHO, while central sounds like a domain that ought to be used for the future auth(n|z) service.

central sounds like a domain that ought to be used for the future auth(n|z) service.

That one is already called login.wikimedia.org.

How about using just wikimedia.org ? That makes it clear that the data is for all of the projects listed at https://www.wikimedia.org/.

central sounds like a domain that ought to be used for the future auth(n|z) service.

That one is already called login.wikimedia.org.

I see. I'd still associate central.x with auth stuff, so I'd rather stay clear of it.

How about using just wikimedia.org ? That makes it clear that the data is for all of the projects listed at https://www.wikimedia.org/.

I'd avoid that one as well. Many simply neglect to type www. (I guess they assume that these are synonyms), which will lead to unnecessary conflicts, /me thinks.

TBH, I'd personally go with something along the lines of api.wm.org, but this has already been discussed before if I'm not mistaken. Even rest.wm.org might be a good candidate ;)

Many simply neglect to type www.

It's not needed, as https://wikimedia.org/ redirects to www. Also, this gives you two domains ;)

Yup. Good point. Let's go for it.

In general, I think our notion of "project" doesn't map very well to the notion of "domain".

I think it's clear that global aggregations will be useful and needed. Those aggregations don't overlap with existing domains though, so there shouldn't be much of an issue of having many ways to do / name the same thing.

en.wikipedia.org/api/rest_v1/pageviews/top/all-access/fr.wikipedia/2015/all-months/all-days

This doesn't make much sense to me. Something like en.wikipedia.org/api/rest_v1/stats/views/top/all-access/2015/all-months/all-days rewriting to global.wikipedia.org/api/rest_v1/stats/views/en.wikipedia.org/top/all-access/2015/all-months/all-days could work, but I don't think that it makes sense to expose other projects' views at en.wikipedia.org. How many top-level hierarchies do you expect to have beyond views?

The problem with the rewrite you're suggesting is we have to parse the domain into the project parameter. In our case, that doesn't always make sense. And we'd have to maintain some manual code that does this, and that seems unnecessarily complicated to me. The less code I have to think about, the better.

We don't know how many top level hierarchies we'll have. But definitely editing data, and different types of data synthesized from Event Logging. Maybe performance and search too; it could really be any number of things.

I'll ping you on IRC @GWicke, we can chat more so this thread doesn't become impossible to follow.

@GWicke and I talked, and we're on the same page I think. We submitted a pull request which implements the pageviews module with the necessary tests. And we decided we'd leave the front-end configuration for a different pull request. For now, we commented on how we agreed to do this in the commit message:

https://github.com/wikimedia/restbase/pull/343

The problem with the rewrite you're suggesting is we have to parse the domain into the project parameter. In our case, that doesn't always make sense. And we'd have to maintain some manual code that does this, and that seems unnecessarily complicated to me. The less code I have to think about, the better.

The alternatives I see to not transforming the domain into a project name are:

  • Use the project name in routes, such as https://en.wikipedia.org/api/rest_v1/data/stats/views/enwiki/.... It makes little sense to have to specify the project name when the domain already sets that.
  • Have the pageviews API live only inside the global domain, so that we have URIs like wikimedia.org/api/rest_v1/data/stats/{project}/views/.... Here I dislike the duality of data fetching: all of the content can be fetched using the same prefix (https://{project-domain}/api/rest_v1/...) except for *pageviews*.

As you can tell, I'm not happy with either solution. Here's an alternative. For cases where there is a 1:1 mapping between domains and projects (en.wp.org -> enwiki), the routes could live in that domain's public API (https://en.wikipedia.org/api/rest_v1/page/title/Foobar/stats/views/{access}/...). The mapping does not have to be maintained manually, as it can be retrieved from the sitematrix API. For other projects (which are, really, analytics-specific), such as top, all-wiktionary, en-all, etc., the routes could live in the global domain only, so that one would use a URI like https://wikimedia.org/api/rest_v1/data/stats/{project}/views/.... What do you guys think?
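
A sketch of how this split could be expressed, assuming a project-to-domain table is available. The two-entry table below is a hypothetical excerpt standing in for data that, per the suggestion above, would come from the sitematrix API; the function name views_url_prefix is also hypothetical, and the trailing parts of both routes are left out because they are elided in the proposal.

```python
from urllib.parse import quote

# Hypothetical excerpt of the 1:1 domain -> project mapping.
DOMAIN_TO_PROJECT = {
    "en.wikipedia.org": "enwiki",
    "fr.wikipedia.org": "frwiki",
}
PROJECT_TO_DOMAIN = {project: domain for domain, project in DOMAIN_TO_PROJECT.items()}

GLOBAL_DOMAIN = "wikimedia.org"


def views_url_prefix(project, title=None):
    """Pick the home of a pageviews route for the given project.

    Projects with a 1:1 domain mapping are served from that domain's
    public API, under the page; everything else (aggregates such as
    all-wiktionary) is served from the global domain only.
    """
    domain = PROJECT_TO_DOMAIN.get(project)
    if domain is not None:
        if title is None:
            raise ValueError("per-domain views routes need a page title")
        return f"https://{domain}/api/rest_v1/page/title/{quote(title, safe='')}/stats/views/"
    return f"https://{GLOBAL_DOMAIN}/api/rest_v1/data/stats/{quote(project, safe='')}/views/"


print(views_url_prefix("enwiki", "Foobar"))
print(views_url_prefix("all-wiktionary"))
```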

I have left some comments there, but I have a feeling that until we settle the above conversion, this PR will be controversial :)

I honestly have some difficulty using those new endpoints from the math extension. (See https://gerrit.wikimedia.org/r/#/c/245478/)

There are some new project-global content entry points (picture / article of the day, trending articles, "in the news", T132340) that need a home. They don't seem to fit too well in the /page/ hierarchy, so it might make more sense to create a new top-level entry point. Let's discuss options here.

Some candidates:

  • /project/
  • /latest/
ggellerman removed a subscriber: Halfak.
ggellerman subscribed.

removing Research and Data backlog. @DarTar is still subscribed

The concrete issue discussed on this task has been resolved, so there is nothing actionable left. Let's close this task to reflect that, but reference it for future API layout considerations.