
High-traffic API endpoints to cover in RESTBase
Closed, Declined (Public)

Description

One of the services team goals (T111819) is to cover at least two high-traffic API endpoints with RESTBase. This task was created to discuss which endpoints should be covered. It is too early to dive into implementation, but it is better to start the discussion sooner rather than later.

Recent analysis

Logs have now moved to mwlog1001.eqiad.wmnet. Recent output:

gwicke@mwlog1001:/srv/mw-log$ tail -100000 api.log | grep action=query | sed -e 's/^.*action=//' -e 's/\(srsearch\|titles\|rvprop\|gpssearch\|callback\|gbltitle\|ggscoord\|codistancefrompoint\)=[^ ]\+//' | sort | uniq -c | sort -n -r | head -10

Type-ahead search?

3345 query format=json generator=prefixsearch redirects= prop=pageprops%7Cpageprops%7Cpageimages%7Cpageterms  gpsnamespace=0 gpslimit=15 ppprop=displaytitle piprop=thumbnail pithumbsize=80 pilimit=15 wbptterms=description

Media viewer (https://www.mediawiki.org/wiki/Extension:MultimediaViewer); makes one request for each image in a page when one image is clicked.

2688 query format=json smaxage=300 maxage=300 uselang=content  prop=imageinfo iiprop=timestamp%7Curl%7Csize%7Cmime%7Cmediatype%7Cextmetadata iiextmetadatalanguage=en iiextmetadatafilter=DateTime%7CDateTimeOriginal%7CObjectName%7CImageDescription%7CLicense%7CLicenseShortName%7CUsageTerms%7CLicenseUrl%7CCredit%7CArtist%7CAuthorCount%7CGPSLatitude%7CGPSLongitude%7CPermission%7CAttribution%7CAttributionRequired%7CNonFree%7CRestrictions

Source unclear

2034 query format=json  prop=info%7Crevisions continue= rvprop=ids%7Ctimestamp%7Cuser%7Cuserid%7Csize%7Csha1%7Ccontentmodel%7Ccomment%7Ctags formatversion=1

generator=search; for each result, returns information very similar to the summary endpoint. https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bsearch

1927 query format=json smaxage=86400 maxage=86400 uselang=content generator=search prop=pageimages%7Cpageterms g gsrnamespace=0 gsrlimit=3 gsrqiprofile=classic_noboostlinks piprop=thumbnail pithumbsize=160 pilimit=3 wbptterms=description formatversion=2

Similar to the summary endpoint; source unclear.

1716 query format=json  prop=categories%7Cextracts%7Cpageimages exintro= explaintext= pithumbsize=140

https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bsearch

1384 query format=xml list=search  srlimit=3 srprop=snippet%7Csize

Summary-ish

990 query format=json  prop=info%7Cextracts%7Cpageimages%7Crevisions%7Cpageterms%7Ccoordinates inprop=displaytitle exsentences=5 exintro=true piprop=thumbnail%7Coriginal pithumbsize=320 pilicense=any rvprop=timestamp%7Cids wbptterms=description formatversion=1

Summary-ish

933 query format=json  prop=pageimages piprop=thumbnail pithumbsize=320 pilicense=any formatversion=2

Search + summary

891 query format=json generator=search prop=pageterms%7Cpageimages%7Cpageprops g gsrnamespace=0 gsrlimit=6 gsrwhat=text gsrinfo= gsrprop=redirecttitle wbptterms=description piprop=thumbnail pithumbsize=320 pilicense=any ppprop=mainpage%7Cdisambiguation formatversion=2  

Media viewer again, same request as above but with iiextmetadatalanguage=de

723 query format=json smaxage=300 maxage=300 uselang=content  prop=imageinfo iiprop=timestamp%7Curl%7Csize%7Cmime%7Cmediatype%7Cextmetadata iiextmetadatalanguage=de iiextmetadatafilter=DateTime%7CDateTimeOriginal%7CObjectName%7CImageDescription%7CLicense%7CLicenseShortName%7CUsageTerms%7CLicenseUrl%7CCredit%7CArtist%7CAuthorCount%7CGPSLatitude%7CGPSLongitude%7CPermission%7CAttribution%7CAttributionRequired%7CNonFree%7CRestrictions

Overall, it looks like the busiest query modules are currently related to search, the media viewer, and page summaries. Some of the summary requests could likely be satisfied by the REST API summary endpoint.
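
For readers who prefer to rerun this kind of analysis outside the shell, here is a rough Python sketch of what the pipeline above does: keep only action=query lines, cut everything up to action=, strip a per-request parameter, then count identical remainders. The function names and the sample lines are made up for illustration; the parameter list is the one the sed expression is meant to match. Like sed without the g flag, it removes only the first matching parameter per line, which is why rvprop= still shows up in some rows of the output.

```python
import re
from collections import Counter

# Parameters whose values differ per request (list taken from the sed expression).
VARIABLE_PARAMS = re.compile(
    r"(srsearch|titles|rvprop|gpssearch|callback|gbltitle|ggscoord|codistancefrompoint)=[^ ]+"
)


def normalize(line):
    """Reduce one api.log line to its request 'shape'."""
    # Greedy match, like sed: drops everything up to the last "action=".
    line = re.sub(r"^.*action=", "", line, count=1)
    # Only the first variable parameter is removed (sed without the g flag).
    return VARIABLE_PARAMS.sub("", line, count=1)


def top_queries(lines, n=10):
    """Count identical normalized action=query lines, most frequent first."""
    counts = Counter(
        normalize(line.rstrip("\n")) for line in lines if "action=query" in line
    )
    return counts.most_common(n)


# Hypothetical log lines, not real api.log data.
sample = [
    "host ts action=query format=json titles=Foo prop=pageimages",
    "host ts action=query format=json titles=Bar prop=pageimages",
    "host ts action=parse format=json page=Foo",
]

if __name__ == "__main__":
    for shape, count in top_queries(sample):
        print(count, shape)
```

Against the real log this would be fed the last 100000 lines of api.log instead of the sample list.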

Historic information

According to metrics, the most used query APIs in the MW API that are not yet covered by RESTBase are: image info (1100 req/s), pageimages (850 req/s), extracts (450 req/s) and pageterms (270 req/s). All of these endpoints are cacheable, so they could potentially be sped up by RESTBase.

Image Info (1100 req/s)

This is the highest-traffic endpoint. However, according to @GWicke, it's mostly used within Parsoid, so it's arguable whether it is worth covering. The content is cacheable, but it would have to be updated when a new image is uploaded or a file description is changed (which is not that easy, because there's no special hook for that).

Extracts (450 req/s)

The content itself is cacheable, but could change when the page is re-rendered. So we could add RB endpoints similar to the html ones:

  • /page/extract/{title} - returns the page extract for the latest revision of the title
  • /page/extract/{title}/ - lists all revision/tid pairs available in storage
  • /page/extract/{title}{/revision}{/tid} - get an exact historic revision of the extract.
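
To make the three proposed routes concrete, here is a small Python sketch of how a request path could be mapped onto them. This is not RESTBase code: the function name, the returned dictionary, and the assumption that slashes inside titles arrive percent-encoded are all illustrative.

```python
import re
from urllib.parse import unquote

PREFIX = "/page/extract/"


def parse_extract_path(path):
    """Map a path onto one of the proposed routes; return None if none fits."""
    if not path.startswith(PREFIX):
        return None
    rest = path[len(PREFIX):]
    # Trailing slash: /page/extract/{title}/ lists stored revision/tid pairs.
    if rest.endswith("/"):
        title = rest[:-1]
        if not title or "/" in title:
            return None
        return {"action": "list", "title": unquote(title)}
    # Otherwise: {title}, optionally followed by /{revision} and /{tid}.
    parts = rest.split("/")
    if len(parts) > 3 or not all(parts):
        return None
    title, revision, tid = (parts + [None, None])[:3]
    if revision is not None and not re.fullmatch(r"[0-9]+", revision):
        return None
    return {
        "action": "get",
        "title": unquote(title),
        "revision": int(revision) if revision is not None else None,
        "tid": tid,
    }
```

With this sketch, a path without a revision means "latest", and a trailing slash is the only thing that distinguishes the listing route from the latest-extract route.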

What makes this more complicated is that the API endpoint supports limiting output by number of characters (which could be done with a simple substring call) or by number of sentences, optionally returns only the content before the first section, and supports multiple output formats. All of these options could be covered, but first we need to consider how widely used and useful they are. A new render of the extract can be added whenever a new render of the html content appears, either asynchronously or in parallel with the html content update.
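
As a rough illustration of serving the length limits from one stored plain-text extract, the character and sentence limits might look like the Python sketch below. The function names are hypothetical, and the sentence splitter is deliberately naive (it breaks after ., ! or ? followed by whitespace), so it will not match the MW API's behaviour on abbreviations and similar cases.

```python
import re


def limit_chars(extract, max_chars):
    """Character limit: the 'simple substring call' mentioned above."""
    return extract[:max_chars]


def limit_sentences(extract, max_sentences):
    """Sentence limit using a naive splitter (an assumption, not the API's logic)."""
    sentences = re.split(r"(?<=[.!?])\s+", extract.strip())
    return " ".join(sentences[:max_sentences])


if __name__ == "__main__":
    text = "RESTBase is a storage proxy. It caches content! Does it scale? Probably."
    print(limit_chars(text, 28))
    print(limit_sentences(text, 2))
```

If limits like these can be applied on the fly, only one extract per render would need to be stored, at the price of some work per request.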

Page Images (850 req/s)

This API provides a single image that best describes the title. So it's also perfectly cacheable, and we could either add this info to the /title/{title} endpoint output, or support a new set of endpoints:

  • /page/image/{title} - returns a page image for the latest revision of a title

Additionally, we could support historic data - page image for each historic revision of a title:

  • /page/image/{title}/{revision} - returns a page image for a specific revision of a page.

In the API, the endpoint has a size parameter that sets the size of the generated thumbnail. We could either pre-fetch a set of predefined sizes (e.g. xxs, xs, s, m, l, xl, xxl), or drop support for this parameter. The cached content should be updated on a revision change (or, possibly, on page re-render).
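
If the predefined-sizes option were chosen, a requested thumbnail width would have to be mapped onto one of the buckets. A minimal Python sketch follows; the bucket names come from the list above, but the pixel widths and the "round up to the next bucket" rule are assumptions made purely for illustration.

```python
import bisect

# Bucket names from the proposal; the pixel widths are assumed, not decided.
BUCKETS = [
    ("xxs", 40), ("xs", 80), ("s", 160), ("m", 320),
    ("l", 640), ("xl", 1280), ("xxl", 2560),
]
WIDTHS = [width for _, width in BUCKETS]


def bucket_for(requested_width):
    """Return the smallest bucket at least as wide as the request (capped at xxl)."""
    index = bisect.bisect_left(WIDTHS, requested_width)
    return BUCKETS[min(index, len(BUCKETS) - 1)]


if __name__ == "__main__":
    # Thumbnail sizes that appear in the log output: 80, 140, 160, 320.
    for size in (80, 140, 160, 320):
        print(size, bucket_for(size))
```

Rounding up means clients never receive an image smaller than they asked for, but they may have to scale it down themselves.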

One more problem is that this API is provided not by MW core but by an extension, so RESTBase should first ensure the extension is present on a MW installation before advertising the endpoints.
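
One possible way to do that check is to look at the extension list the MW API reports via action=query&meta=siteinfo&siprop=extensions. The Python sketch below only inspects an already-fetched JSON response; the helper name and the sample response are made up, and it assumes the extension registers under the name "PageImages".

```python
def has_extension(siteinfo, name):
    """Return True if a siteinfo response (siprop=extensions) lists the extension."""
    extensions = siteinfo.get("query", {}).get("extensions", [])
    return any(ext.get("name") == name for ext in extensions)


# Hypothetical, heavily trimmed response body.
sample_response = {
    "query": {
        "extensions": [
            {"type": "other", "name": "PageImages"},
            {"type": "other", "name": "TextExtracts"},
        ]
    }
}

if __name__ == "__main__":
    if has_extension(sample_response, "PageImages"):
        print("advertise /page/image/{title}")
```

The result could be checked once at startup and cached, since the set of installed extensions changes rarely.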

Page Terms (270 req/s)

The simplest of all to cover. The API is very simple and doesn't have any non-cacheable options. The data could be fetched on every new revision, so we can provide a revisioned view of page terms. The API could look like this:

  • /page/terms/{title} - returns page terms for the latest revision
  • /page/terms/{title}/ - returns a listing of revisions for which page terms are available. In general this is not equal to all revisions of the title, because we can't get this data for revisions that already exist (historic revisions).
  • /page/terms/{title}/{revision} - returns page terms for the revision

@GWicke, @mobrovac, @Eevans, what are your thoughts?

Event Timeline

Pchelolo claimed this task.
Pchelolo raised the priority of this task from to Needs Triage.
Pchelolo updated the task description.

@Pchelolo: Nice analysis! We should discuss the requirements and actual uses with the main users of these endpoints. For extracts, this will likely be the Web-Team-Backlog and app teams.

Cache invalidation for page terms looks potentially trickier than the others, as this data is really coming from the associated Wikidata item. We'd have to handle the dependency between the Wikidata item and this API response.
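
A minimal sketch of that dependency, assuming we can record which Wikidata item each cached page-terms response was built from: keep a reverse index from item ID to titles, and when an item changes, look up the cached paths to purge. The class, its method names, and the example item/title pair are hypothetical.

```python
from collections import defaultdict


class TermsDependencyIndex:
    """Reverse index: Wikidata item ID -> titles whose cached terms depend on it."""

    def __init__(self):
        self._titles_by_item = defaultdict(set)

    def link(self, item_id, title):
        """Record that the cached terms for `title` were derived from `item_id`."""
        self._titles_by_item[item_id].add(title)

    def paths_to_invalidate(self, item_id):
        """Cached /page/terms/ paths to purge when `item_id` is edited."""
        titles = self._titles_by_item.get(item_id, ())
        return sorted(f"/page/terms/{title}" for title in titles)


if __name__ == "__main__":
    index = TermsDependencyIndex()
    index.link("Q42", "Douglas_Adams")
    print(index.paths_to_invalidate("Q42"))
```

In practice the index would also have to be per wiki, since one item is linked from pages on many projects.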

GWicke triaged this task as Medium priority.
GWicke edited projects, added Services (next); removed Services.

Here is some new information from api.log:

tail -100000 api.log | grep action=query | sed -e 's/^.*action=//' -e 's/\(srsearch\|titles\|rvprop\|gpssearch\|callback\|gbltitle\|ggscoord\|codistancefrompoint\)=[^ ]\+//' | sort | uniq -c | sort -n -r | head -10
   5975 query format=json generator=prefixsearch redirects= prop=pageprops%7Cpageprops%7Cpageimages%7Cpageterms  gpsnamespace=0 gpslimit=15 ppprop=displaytitle piprop=thumbnail pithumbsize=80 pilimit=15 wbptterms=description  
   2614 query format=json smaxage=300 maxage=300  prop=imageinfo iiprop=timestamp%7Curl%7Csize%7Cmime%7Cmediatype%7Cextmetadata iiextmetadatalanguage=en iiextmetadatafilter=DateTime%7CDateTimeOriginal%7CObjectName%7CImageDescription%7CLicense%7CLicenseShortName%7CUsageTerms%7CLicenseUrl%7CCredit%7CArtist%7CAuthorCount%7CGPSLatitude%7CGPSLongitude%7CPermission%7CAttribution%7CAttributionRequired%7CNonFree%7CRestrictions  
   2601 query format=json  prop=info%7Crevisions continue= rvprop=ids%7Ctimestamp%7Cuser%7Cuserid%7Csize%7Csha1%7Ccontentmodel%7Ccomment%7Ctags formatversion=1  
    950 query format=json smaxage=86400 maxage=86400 generator=search prop=pageimages%7Cpageterms g gsrnamespace=0 gsrlimit=3 gsrqiprofile=classic_noboostlinks piprop=thumbnail pithumbsize=160 pilimit=3 wbptterms=description formatversion=2  
    919 query format=json generator=search prop=pageterms%7Cpageimages%7Cpageprops continue= g gsrnamespace=0 gsrlimit=3 gsrwhat=text gsrinfo= gsrprop=redirecttitle wbptterms=description piprop=thumbnail pithumbsize=320 pilimit=3 ppprop=mainpage%7Cdisambiguation  
    870 query format=json  prop=pageimages continue= piprop=thumbnail pithumbsize=320 pilimit=1  
    862 query format=json  prop=categories%7Cextracts%7Cpageimages exintro= explaintext= pithumbsize=140  
    739 query format=xml list=search  srlimit=3 srprop=snippet%7Csize  
    727 query format=json maxlag=5  prop=info meta=userinfo indexpageids= continue= inprop=protection uiprop=blockinfo%7Chasmsg  
    719 query format=json  redirects= prop=revisions rawcontinue=1 rvprop=content rvlimit=1 rvsection=0
This comment was removed by GWicke.

Removed the previous comment after moving the information into the task description.

Given that the data in this ticket is quite outdated and that MediaWiki is getting its own REST API soon(TM), I think it's safe to decline this ticket.