
PageImages API is running slow
Open, Needs Triage · Public · BUG REPORT

Description

From T289738

The geosearch API request is surprisingly slow at 700ms (production). I wonder if something broke or otherwise regressed there. Perhaps its DB query is lacking an index, or maybe a cache stopped working and it's running expensive code in a tight loop?

Event Timeline

Restricted Application added a subscriber: Aklapper.

The Search Platform team is going to investigate; depending on the result of the investigation, we will see what needs to be done (if anything).

We should be able to get an idea from the percentiles we record in graphite at MediaWiki.CirrusSearch.eqiad.requestTimeMs.GeoData_spatial_search.*. We have 6 months of data for these percentiles; they record the amount of time MediaWiki was waiting for Elasticsearch to respond. p50 through p95 look fairly stable, with p95 staying below 100ms. p99, on the other hand, has seen some variation: it used to bounce around between 750 and 1000ms, but lately has been more in the 1000-1400ms range.
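As a side note on how these cut points behave, the sketch below uses a synthetic latency sample (values illustrative only, shaped like a fast bulk with a long tail) to show why p99 naturally jumps around more than p50/p95:

```python
import random
import statistics

# Synthetic latency sample (ms): exponential tail loosely mimicking the
# shape described above. Values are illustrative, not measured data.
random.seed(42)
latencies = [random.expovariate(1 / 30) for _ in range(10_000)]

# statistics.quantiles(n=100) returns the 99 cut points p1..p99.
pct = statistics.quantiles(latencies, n=100)
p50, p95, p99 = pct[49], pct[94], pct[98]
print(f"p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

p99 sits out in the sparse tail of the distribution, so it is estimated from far fewer samples than p50/p95 and moves around much more between measurement windows.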

Since the graphite data only includes the time spent waiting for Elasticsearch, and not the time of the overall requests, I also pulled some data from hadoop. This only goes back 90 days, but it includes how long the containing PHP requests took, which is closer to what users experience than the Elasticsearch request alone. Latency here also seems fairly stable. Oddly, though, these datasets don't completely agree: the graphite metric puts p99 of the Elasticsearch request at >1s, but the p99 for total request length comes in much lower, at 300-500ms.

T300590_geodata_request_time_percentiles.png (623×604 px, 47 KB)

I suppose the open question that remains: these requests represent 4M-8M GeoData searches per day, so perhaps some segment of queries is consistently ending up in the slow bucket? Do we have some specific example URLs that are always slow that can be looked into?

Slicing on a few different dimensions, what I've found is that most wikis are fast and commonswiki is exceptionally slow. For a given day (2022-02-02), these are the breakdowns of overall request counts and approximate latency percentiles (ms) over that day for the highest-volume geosearch wikis (plus commonswiki, which didn't have enough volume to be included):

database      n_requests    p95    p99   p999
itwiki            406173    111    148   1001
frwiki            140573    135    569   4805
svwiki            902460    159    220   1465
enwiki            477872    186    345   1784
dawiki           1014009    112    142    856
commonswiki        21365    614    943   2562
nowiki            840999    125    157    906
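For anyone wanting to reproduce this kind of breakdown, the aggregation is just a group-by with percentile cut points. A minimal stand-in with hypothetical rows (the real numbers above came from the hadoop webrequest data, not from this sketch):

```python
from collections import defaultdict
import statistics

# Hypothetical (wiki, latency_ms) request-log rows standing in for the
# hadoop data; the real query aggregated a full day of webrequest records.
rows = [
    ("itwiki", 90), ("itwiki", 110), ("itwiki", 150), ("itwiki", 95),
    ("commonswiki", 600), ("commonswiki", 950), ("commonswiki", 2500),
    ("commonswiki", 640),
]

by_wiki = defaultdict(list)
for wiki, ms in rows:
    by_wiki[wiki].append(ms)

for wiki, ms_list in sorted(by_wiki.items()):
    # method="inclusive" keeps cut points inside the observed data range,
    # which behaves sensibly on small samples like this one.
    p95 = statistics.quantiles(ms_list, n=100, method="inclusive")[94]
    print(wiki, len(ms_list), round(p95))
```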

I'm not entirely sure what we can really do about commonswiki. The problem is that commonswiki is huge, and we already ask 32 shards to solve these queries in parallel. With the low query volume and the large amount of data in commonswiki, this is plausibly running into IO latency, where it regularly has to go to disk instead of answering queries from data already in memory (unverified).
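As a back-of-envelope illustration of why the fan-out itself hurts tail latency (not a measurement from this task): the query has to wait for the slowest of the 32 shards, so per-shard tail events compound:

```python
# Fan-out tail amplification: the query waits for all 32 shards, so it is
# slow whenever *any* shard is slow. If each shard exceeds its own p99
# only 1% of the time, and shards are assumed independent:
shards = 32
p_shard_slow = 0.01
p_any_slow = 1 - (1 - p_shard_slow) ** shards
print(f"{p_any_slow:.1%}")  # prints 27.5%
```

In other words, with 32-way fan-out, roughly a quarter of queries can hit at least one shard's p99, which is consistent with tail latency being hard to squeeze out of commonswiki.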

The context for this task, and the task description quoted from my comment, is T289738#7647597. In my comment, I used "the geosearch API request" to refer to an API request from the Nearby app that involves geosearch (among other things), and to differentiate it from other requests within the context of Nearby.

It did not mean a request to "the" geosearch API, since it is not a request solely to the geosearch API. It is a request to the MediaWiki API that mainly involves time spent looking up metadata about the pages that were sourced from the geosearch "API generator". (See also T300593.)

An example request:
https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=coordinates|pageprops|pageprops|pageimages|description&colimit=max&generator=geosearch&ggsradius=10000&ggsnamespace=0&ggslimit=50&ggscoord=51.06|-0.10&ppprop=displaytitle&piprop=thumbnail&pithumbsize=150&pilimit=50&codistancefrompoint=51.06|-0.10
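Decomposed programmatically (same parameters as the URL above), the split between the geosearch generator's own ggs*-prefixed parameters and the follow-up per-page property lookups is easier to see:

```python
from urllib.parse import urlencode

# The example request above as a parameter dict. The ggs*-prefixed keys
# drive the geosearch generator; prop/pi*/pp*/co* drive the metadata
# lookups that run afterwards over the generated page set.
params = {
    "action": "query",
    "format": "json",
    "formatversion": "2",
    "prop": "coordinates|pageprops|pageimages|description",
    "colimit": "max",
    "generator": "geosearch",
    "ggsradius": "10000",
    "ggsnamespace": "0",
    "ggslimit": "50",
    "ggscoord": "51.06|-0.10",
    "ppprop": "displaytitle",
    "piprop": "thumbnail",
    "pithumbsize": "150",
    "pilimit": "50",
    "codistancefrompoint": "51.06|-0.10",
}
url = "https://en.m.wikipedia.org/w/api.php?" + urlencode(params)
generator_params = {k: v for k, v in params.items() if k.startswith("ggs")}
print(sorted(generator_params))
```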

According to a one-off XHGui profile (via WikimediaDebug), this took 515ms, of which 127ms was spent in QueryGeoSearch::executeGenerator (sort by "inclusive wall time").

Things that stand out:

  • Most time is spent in PageImages\ApiQueryPageImages::execute: 211ms during this particular query.
  • No CDN or client-side caching is allowed (cache hits aren't likely with coordinates in the query, but caching is still needlessly disallowed, and it's a potential load risk if e.g. a link becomes virally shared).
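On the caching point: if the response were made cacheable, the action API's standard maxage/smaxage main-module parameters set the Cache-Control header. The values below are illustrative only, and coordinates would likely need client-side rounding before cache hits could materialize:

```python
from urllib.parse import urlencode

# Illustrative only: the action API's maxage/smaxage parameters control
# the max-age / s-maxage values emitted in the Cache-Control response
# header. Appending these to the example request would permit 5 minutes
# of CDN caching and 1 minute of client-side caching:
cache_params = {"smaxage": 300, "maxage": 60}
suffix = "&" + urlencode(cache_params)
print(suffix)  # prints &smaxage=300&maxage=60
```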

@EBernhardson Can you confirm that the Hadoop query you used matched these kinds of generator queries?

> @EBernhardson Can you confirm that the Hadoop query you used matched these kinds of generator queries?

The initial graph of latencies was all queries tagged by cirrus query parsing with query_type=Geodata_spatial_search, as that would most closely match the graphite data. The table of numbers reported from hadoop has the same query-type filter plus an additional filter on generator=geosearch in the HTTP request parameters (which come from WebRequest::getValues). This filter on generator drops about 30% of Geodata_spatial_search requests.

Looking at ext.nearby.scripts/api.js, there are a few constant parameters we can key off of; adding a filter[1] for these constants finds a significantly narrower result set. This reduced dataset contains 3k-15k queries per day through February across all wikis. Aggregating over the month of February with the same wikis as before gives the following; indeed, queries with this shape are much slower than the average request involving geosearch (which I think mostly comes from mobile apps, but I'm not 100% sure).

database      n_requests    p95    p99   p999
enwiki             38053    247    356   1160
itwiki             10078    285    463    941
dawiki                96    331    964    964
frwiki             10444    299    503   1130
svwiki             11301    291    447   1038
nowiki               132    424    533    601
commonswiki         2238   1181   1624   4537

[1] The code also sets redirects=no, but that filter matches only 70 queries in February 2022, and it wasn't in the example query above either, so it's excluded here.

maybe_nearby = (
    (params['colimit'] == 'max')
    & (params['generator'] == 'geosearch')
    & (params['ggslimit'] == 50)
    & (params['ppprop'] == 'displaytitle')
    & (params['piprop'] == 'thumbnail')
    & (params['pithumbsize'] == 150)
    & (params['pilimit'] == 50)
)
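For clarity, here is the same predicate sketched as plain Python over one request's parsed parameters (values are assumed to arrive as strings here, which is how raw query-string parameters look; the column expression above compares some of them as numbers):

```python
# Constant parameters that ext.nearby.scripts/api.js always sends;
# values kept as strings, as raw query-string parameters would be.
NEARBY_CONSTANTS = {
    "colimit": "max",
    "generator": "geosearch",
    "ggslimit": "50",
    "ppprop": "displaytitle",
    "piprop": "thumbnail",
    "pithumbsize": "150",
    "pilimit": "50",
}

def maybe_nearby(params: dict) -> bool:
    """True when every Nearby constant is present with the expected value."""
    return all(params.get(k) == v for k, v in NEARBY_CONSTANTS.items())

# Extra parameters (coordinates, action, ...) don't affect the match:
example = dict(NEARBY_CONSTANTS, ggscoord="51.06|-0.10", action="query")
print(maybe_nearby(example))  # prints True
```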

It looks like the Search part of this has been investigated and the issue is on the generator side. Should the Web team look into this?

I'm not completely understanding the recommendation here. Is it to change the client-side parameters, or to add an optimization in the API code (if so, would that be in GeoData or MediaWiki core)?

And yes this would impact apps too.

I don't have much of a recommendation. From the GeoData side of things, everything seems to be working acceptably. We can see from the request logging that the PHP execution time of these specific requests is about 2x the runtime of an average API request that uses GeoData as a generator. I don't have any particular insight into what happens in the requests after the GeoData portion has completed. There may be ways to find that information from other sources, but I'm not sure how.

As indicated last week, half of the time is spent in the PageImages API. Improving that would go a long way.

Krinkle renamed this task from Geodata API is running slow to PageImages API is running slow. Mar 15 2022, 10:58 PM
Krinkle removed EBernhardson as the assignee of this task.
Krinkle removed a project: GeoData.
Jdlrobson removed the point value 5 for this task. Apr 28 2022, 9:59 PM