One of the Services team goals (T111819) is to cover at least two high-traffic API endpoints with RESTBase. This task was created to discuss which endpoints should be covered. It is too early to dive into implementation, but it is better to start the discussion early.
## Recent analysis
Logs have now moved to mwlog1001.eqiad.wmnet. Recent output:
gwicke@mwlog1001:/srv/mw-log$ tail -100000 api.log | grep action=query | sed -e 's/^.*action=//' -e 's/\(srsearch\|titles\|rvprop\|gpssearch\|callback\|gbltitle\|ggscoord\|codistancefrompoint\)=[^ ]\+//' | sort | uniq -c | sort -n -r | head -10
Type-ahead search?
3345 query format=json generator=prefixsearch redirects= prop=pageprops%7Cpageprops%7Cpageimages%7Cpageterms gpsnamespace=0 gpslimit=15 ppprop=displaytitle piprop=thumbnail pithumbsize=80 pilimit=15 wbptterms=description
Media viewer (https://www.mediawiki.org/wiki/Extension:MultimediaViewer); makes one request for each image on a page when one image is clicked.
2688 query format=json smaxage=300 maxage=300 uselang=content prop=imageinfo iiprop=timestamp%7Curl%7Csize%7Cmime%7Cmediatype%7Cextmetadata iiextmetadatalanguage=en iiextmetadatafilter=DateTime%7CDateTimeOriginal%7CObjectName%7CImageDescription%7CLicense%7CLicenseShortName%7CUsageTerms%7CLicenseUrl%7CCredit%7CArtist%7CAuthorCount%7CGPSLatitude%7CGPSLongitude%7CPermission%7CAttribution%7CAttributionRequired%7CNonFree%7CRestrictions
Source unclear
2034 query format=json prop=info%7Crevisions continue= rvprop=ids%7Ctimestamp%7Cuser%7Cuserid%7Csize%7Csha1%7Ccontentmodel%7Ccomment%7Ctags formatversion=1
generator=search; for each result, returns information very similar to the summary endpoint. https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bsearch
1927 query format=json smaxage=86400 maxage=86400 uselang=content generator=search prop=pageimages%7Cpageterms g gsrnamespace=0 gsrlimit=3 gsrqiprofile=classic_noboostlinks piprop=thumbnail pithumbsize=160 pilimit=3 wbptterms=description formatversion=2
Similar to summary end point; source unclear.
1716 query format=json prop=categories%7Cextracts%7Cpageimages exintro= explaintext= pithumbsize=140
https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bsearch
1384 query format=xml list=search srlimit=3 srprop=snippet%7Csize
Summary-ish
990 query format=json prop=info%7Cextracts%7Cpageimages%7Crevisions%7Cpageterms%7Ccoordinates inprop=displaytitle exsentences=5 exintro=true piprop=thumbnail%7Coriginal pithumbsize=320 pilicense=any rvprop=timestamp%7Cids wbptterms=description formatversion=1
Summary-ish
933 query format=json prop=pageimages piprop=thumbnail pithumbsize=320 pilicense=any formatversion=2
Search + summary
891 query format=json generator=search prop=pageterms%7Cpageimages%7Cpageprops g gsrnamespace=0 gsrlimit=6 gsrwhat=text gsrinfo= gsrprop=redirecttitle wbptterms=description piprop=thumbnail pithumbsize=320 pilicense=any ppprop=mainpage%7Cdisambiguation formatversion=2
723 query format=json smaxage=300 maxage=300 uselang=content prop=imageinfo iiprop=timestamp%7Curl%7Csize%7Cmime%7Cmediatype%7Cextmetadata iiextmetadatalanguage=de iiextmetadatafilter=DateTime%7CDateTimeOriginal%7CObjectName%7CImageDescription%7CLicense%7CLicenseShortName%7CUsageTerms%7CLicenseUrl%7CCredit%7CArtist%7CAuthorCount%7CGPSLatitude%7CGPSLongitude%7CPermission%7CAttribution%7CAttributionRequired%7CNonFree%7CRestrictions
Overall, it looks like the busiest query modules are currently related to search, the image viewer, and page summaries. Some of the summary requests could likely be satisfied by the REST API summary endpoint.
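For anyone who wants to repeat this analysis without the shell pipeline, here is a rough Python sketch of the same normalization. It assumes the log lines look like the output above (space-separated `key=value` parameters after `action=`), which is inferred from the output rather than from any documented log format. Like the `sed` expression without the `g` flag, it strips only the first variable parameter on each line; the function names are made up for illustration.

```python
import re
from collections import Counter

# Parameters whose values vary per request; same list as in the sed expression.
# The pattern is unanchored, so 'srsearch' also matches inside 'gsrsearch',
# which is why a stray 'g' shows up in some of the output above.
VARIABLE = re.compile(
    r"(srsearch|titles|rvprop|gpssearch|callback|gbltitle|ggscoord"
    r"|codistancefrompoint)=[^ ]+"
)


def normalize(line):
    """Return the request signature for a log line, or None if it is not action=query."""
    if "action=query" not in line:
        return None
    # Like sed 's/^.*action=//': drop everything up to the last 'action='.
    signature = line.rsplit("action=", 1)[1].rstrip("\n")
    # Like the second sed expression (no 'g' flag): strip only the first match.
    return VARIABLE.sub("", signature, count=1)


def top_signatures(lines, limit=10):
    """Count identical signatures and return the most frequent ones."""
    counts = Counter(sig for sig in map(normalize, lines) if sig is not None)
    return counts.most_common(limit)
```

Usage would be something like `top_signatures(open("api.log"))`, with the caveat that this reads the whole file rather than only the last 100000 lines.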
## Historic information
According to [[ https://grafana.wikimedia.org/dashboard/db/api-requests | metrics ]], the most used `query` modules in the MW API that are not yet covered by RESTBase are: [[ https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bimageinfo | image info ]] (1100 req/s), [[ https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageimages | pageimages ]] (850 req/s), [[ https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bextracts | extracts ]] (450 req/s) and [[ https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageterms | pageterms ]] (270 req/s). All of these endpoints are cacheable, so they could potentially be sped up by RESTBase.
## [Image Info](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bimageinfo) (1100 req/s)
The highest-traffic endpoint. However, according to @GWicke, it is mostly used by Parsoid, so it is debatable whether it is worth covering. The content is cacheable, but it would need to be updated when a new image is uploaded or a file description is changed (which is not that easy, because there is no dedicated hook for that).
## [Extracts](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bextracts) (450 req/s)
The content itself is cacheable, but it can change when the page is re-rendered. So, we could add RB endpoints similar to `html`:
- `/page/extract/{title}` - returns the page extract for the latest revision of the title
- `/page/extract/{title}/` - lists all revision/tid pairs available in storage
- `/page/extract/{title}{/revision}{/tid}` - get an exact historic revision of the extract.
What makes this more complicated is that the API endpoint supports limiting the output by number of characters (could be done with a simple `substring` call) or by number of sentences, optionally returns only the content before the first section, and supports multiple output formats. All of these options could be covered, but first we need to consider how used/useful they are. A new extract render can be added whenever a new render of the html content appears, and that could be done asynchronously or in parallel with the html content update.
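To make the limiting options more concrete, here is a minimal sketch of how the character and sentence limits could be applied on read to a single stored plain-text extract, so that only one copy per render needs to be stored. The function name, the parameter names and the naive sentence splitter are assumptions for illustration only; the existing API's sentence detection is likely more involved.

```python
import re

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# This is an assumption for illustration, not how the existing API splits text.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def limit_extract(text, chars=None, sentences=None):
    """Trim a stored plain-text extract by sentence count and/or character count."""
    if sentences is not None:
        # Keep only the first N sentences, re-joined with single spaces.
        text = " ".join(SENTENCE_END.split(text)[:sentences])
    if chars is not None:
        # The simple 'substring' case mentioned above.
        text = text[:chars]
    return text
```

With this approach, the intro-only and output-format options would still need separate stored variants, since they cannot be derived by trimming.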
## [Page Images](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageimages) (850 req/s)
This API provides a single image that best describes the title. It is also perfectly cacheable, and we could either add this info to the `/title/{title}` endpoint output, or support a new set of endpoints:
- `/page/image/{title}` - returns a page image for the latest revision of a title
Additionally, we could support historic data - page image for each historic revision of a title:
- `/page/image/{title}/{revision}` - returns a page image for a specific revision of a page.
In the API, the endpoint has a size parameter (`pithumbsize`) that sets the size of the generated thumbnail. We could either pre-fetch a set of predefined sizes (e.g. xxs, xs, s, m, l, xl, xxl), or drop support for this parameter. The cached content should be updated on a revision change (or, possibly, on page re-render).
One more problem is that this API is provided not by MW core, but by [[ https://www.mediawiki.org/wiki/Extension:PageImages | an extension ]], so RESTBase should first ensure the extension is present on a MW installation before advertising the endpoints.
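If we go with predefined sizes, requests for an arbitrary width would need to be mapped to one of the size classes. A possible sketch is below: it picks the smallest class that is at least as wide as requested, and falls back to the largest one. The class names come from the list above, but the pixel widths are invented for illustration and would need to be decided.

```python
import bisect

# Size class names from the proposal; the pixel widths are hypothetical.
SIZE_CLASSES = [
    ("xxs", 40), ("xs", 80), ("s", 160), ("m", 320),
    ("l", 640), ("xl", 1024), ("xxl", 2048),
]
WIDTHS = [width for _, width in SIZE_CLASSES]


def size_class_for(requested_width):
    """Return the (name, width) of the smallest class covering the requested width."""
    index = bisect.bisect_left(WIDTHS, requested_width)
    # Anything wider than the largest class gets the largest class.
    index = min(index, len(SIZE_CLASSES) - 1)
    return SIZE_CLASSES[index]
```

Rounding up rather than down means clients may scale an image down but never have to upscale it, at the cost of somewhat larger responses.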
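One way to check for the extension would be the siteinfo API (`action=query&meta=siteinfo&siprop=extensions`), which lists installed extensions by name. The sketch below only inspects an already parsed JSON response; the helper name is made up, and the sample response is hand-written and trimmed, not real output.

```python
def has_extension(siteinfo, name):
    """Check a parsed siteinfo response (siprop=extensions) for an extension name."""
    extensions = siteinfo.get("query", {}).get("extensions", [])
    return any(ext.get("name") == name for ext in extensions)


# Hand-written, trimmed example of the response shape (not real output).
sample_response = {
    "query": {
        "extensions": [
            {"type": "api", "name": "PageImages"},
            {"type": "other", "name": "TextExtracts"},
        ]
    }
}

advertise_page_image = has_extension(sample_response, "PageImages")
```

RESTBase could run such a check once per wiki at startup and leave the page image endpoints out of the spec when the check fails.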
## [Page Terms](https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bpageterms) (270 req/s)
The simplest one of all to cover. The API is very simple and doesn't have any non-cacheable options. The terms could be fetched on every new revision, so we can provide a revisioned view of page terms. The API could look like this:
- `/page/terms/{title}` - returns page terms for the latest revision
- `/page/terms/{title}/` - returns a listing of the revisions for which page terms are available. In general, this is not the full list of revisions of the title, because we can't get this data for historic revisions created before we started storing it.
- `/page/terms/{title}/{revision}` - returns page terms for a specific revision
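To illustrate the intended semantics of the three routes, here is a small in-memory sketch. It is not RESTBase code: the class and method names are hypothetical, and real storage would be a RESTBase table. The point is only that the listing contains the revisions we have actually stored, and that a lookup for any other revision comes back empty.

```python
class TermsStore:
    """In-memory stand-in for revisioned page terms storage (illustration only)."""

    def __init__(self):
        self._by_title = {}

    def put(self, title, revision, terms):
        """Store terms fetched when a new revision is created."""
        self._by_title.setdefault(title, {})[revision] = terms

    def revisions(self, title):
        """Listing route: only revisions that were stored, newest first."""
        return sorted(self._by_title.get(title, {}), reverse=True)

    def get(self, title, revision=None):
        """Latest-revision route when revision is None, else the exact revision."""
        stored = self._by_title.get(title, {})
        if not stored:
            return None
        if revision is None:
            revision = max(stored)
        return stored.get(revision)


store = TermsStore()
store.put("Foo", 100, {"description": ["old description"]})
store.put("Foo", 120, {"description": ["new description"]})
```

Here `store.get("Foo")` returns the terms stored for revision 120, while `store.get("Foo", 110)` returns `None` because that revision was never stored.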
@GWicke, @mobrovac, @EEvans, what are your thoughts?