Change Details

One of the Services team goals (T111819) is to cover at least 2 high-traffic API endpoints with RESTBase. This task was created to discuss which endpoints should be covered. It is too early to dive into implementation, but it's better to start the discussion early.

According to [[ https://grafana.wikimedia.org/dashboard/db/api-requests | metrics ]], the most used `query` modules in the MW API that are not yet covered by RESTBase are:

- [[ https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bimageinfo | image info ]] (1100 req/s)
- [[ https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageimages | pageimages ]] (850 req/s)
- [[ https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bextracts | extracts ]] (450 req/s)
- [[ https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageterms | pageterms ]] (270 req/s)

The responses of all of these endpoints are cacheable, so they could potentially be sped up by RESTBase.

##Image Info

This is the highest-traffic endpoint. However, according to @GWicke it is mostly used within Parsoid, so it is arguable whether it is worth covering. The content is cacheable, but it would need to be updated when a new image is uploaded or a file description is changed, which is not that easy because there is no special hook for that.

##Extracts

The content itself is cacheable, but it can change when the page is re-rendered. So we could add RB endpoints similar to `html`:

- `/page/extract/{title}` - returns the page extract for the latest revision of the title
- `/page/extract/{title}/` - lists all revision/tid pairs available in storage
- `/page/extract/{title}{/revision}{/tid}` - returns an exact historic revision of the extract

What makes this more complicated is that the API endpoint supports limiting the output by number of characters (could be done with a simple `substring` call) or by number of sentences, optionally returns only the content before the first section, and supports multiple output formats.
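
To make the character and sentence limits more concrete, here is a rough sketch of how both could be applied on top of a single stored plain-text extract, so that only one copy per render needs to be cached. This is not RESTBase or TextExtracts code: the helper names are hypothetical, and the sentence splitter is a deliberately naive assumption (it breaks on abbreviations such as "e.g.").

```python
import re

# Hypothetical helpers for illustration only.
# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def limit_chars(extract: str, max_chars: int) -> str:
    """Trim a stored plain-text extract to at most max_chars characters."""
    return extract[:max_chars]


def limit_sentences(extract: str, max_sentences: int) -> str:
    """Keep only the first max_sentences sentences of the extract."""
    sentences = _SENTENCE_END.split(extract.strip())
    return " ".join(sentences[:max_sentences])


stored = "RESTBase is a storage proxy. It caches content. It also serves a REST API."
print(limit_chars(stored, 8))
print(limit_sentences(stored, 2))
```

Whether trimming at request time is acceptable depends on how closely the result has to match what the MW API returns today, so this is only a starting point for the discussion.
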
All of these options could be covered, but first we need to consider how widely used and useful they are. A new render can be added when a new render of the html content appears, and that could be done asynchronously or in parallel with the html content update.

##Page Images

This API provides a single image that best describes the title. So it's also perfectly cacheable, and we could either add this info to the `/title/{title}` endpoint output, or support a new set of endpoints:

- `/page/image/{title}` - returns a page image for the latest revision of a title

Additionally, we could support historic data, i.e. a page image for each historic revision of a title:

- `/page/image/{title}/{revision}` - returns a page image for a specific revision of a page

In the MW API, the endpoint has a `size` parameter that sets the size of the generated thumbnail. We could either pre-fetch a set of predefined sizes (e.g. xxs, xs, s, m, l, xl, xxl), or drop support for this parameter. The cached content should be updated on a revision change (or possibly on page re-render).

One more problem is that the API is provided not by MW core but by [[ https://www.mediawiki.org/wiki/Extension:PageImages | an extension ]], so RESTBase should first ensure the extension is present on a MW installation before advertising the endpoints.

##Page Terms

This is the simplest one to cover. The API is very simple and doesn't have any non-cacheable options. The data could be fetched on every new revision, so we can provide a revisioned view of page terms. The API could look like this:

- `/page/terms/{title}` - returns page terms for the latest revision
- `/page/terms/{title}/` - returns a listing of revisions for which page terms are available. In general this is not equal to all revisions of the title, because we can't get this data for already created (historic) revisions.
- `/page/terms/{title}/{revision}` - returns page terms for the revision

@GWicke, @mobrovac, @EEvans, what are your thoughts?
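
As a footnote to the `size` question in the Page Images section: if we went with predefined sizes, an arbitrary requested width would have to be mapped onto one of them. A minimal sketch of that mapping is below. The pixel widths are made-up placeholders (this task only lists the size names), and `bucket_for` is a hypothetical helper, not an existing RESTBase or PageImages function.

```python
from bisect import bisect_left

# Hypothetical pixel widths for the named sizes, sorted ascending.
SIZE_BUCKETS = [
    ("xxs", 50),
    ("xs", 100),
    ("s", 200),
    ("m", 400),
    ("l", 800),
    ("xl", 1600),
    ("xxl", 3200),
]
_WIDTHS = [width for _, width in SIZE_BUCKETS]


def bucket_for(requested_width: int) -> str:
    """Return the smallest predefined size that is at least requested_width
    pixels wide, falling back to the largest size for oversized requests."""
    index = bisect_left(_WIDTHS, requested_width)
    index = min(index, len(SIZE_BUCKETS) - 1)
    return SIZE_BUCKETS[index][0]


print(bucket_for(120))
print(bucket_for(5000))
```

Rounding up keeps the served thumbnail at least as large as what the client asked for, at the cost of some extra bytes; rounding down would be the opposite trade-off.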