Page MenuHomePhabricator

RelatedArticles API query can take 2 seconds
Closed, ResolvedPublic

Description

The API request to https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3AJohn_Lennon&gsrnamespace=0&gsrlimit=3 is taking about 2s for me and never caching.

We should explore caching this and understanding where the slowdown is occurring - API? More like search? PageImages? Wikidata?.

Event Timeline

Jdlrobson updated the task description. (Show Details)
Jdlrobson raised the priority of this task from to Needs Triage.
Jdlrobson added a subscriber: Jdlrobson.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptDec 11 2015, 6:48 PM
100.00% 2864.283      1 - main()
 99.39% 2846.799      1 - ApiMain::execute
 99.39% 2846.796      1 - ApiMain::executeActionWithErrorHandling
 99.32% 2844.714      1 - ApiMain::executeAction
 87.66% 2510.968      1 - ApiQuery::execute
 86.59% 2480.180      1 - ApiPageSet::execute
 86.59% 2480.178      1 - ApiPageSet::executeInternal
 86.55% 2479.091      1 - ApiQuerySearch::executeGenerator
 86.55% 2479.089      1 - ApiQuerySearch::run
 86.07% 2465.146      1 - CirrusSearch::searchText
 86.06% 2465.126      1 - CirrusSearch::searchTextReal
 86.04% 2464.562      1 - CirrusSearch::moreLikeThis
 85.83% 2458.423      1 - CirrusSearch\Searcher::moreLikeTheseArticles
 85.82% 2458.219      1 - CirrusSearch\Searcher::search
 84.53% 2421.266      2 - CirrusSearch\Util::__invoke
 84.38% 2416.870      1 - CirrusSearch\Util::doPoolCounterWork
 84.37% 2416.701      1 - PoolCounterWork::execute
 84.37% 2416.571      1 - PoolCounterWorkViaCallback::doWork
 84.37% 2416.478      1 - PoolCounterWork::execute@1
 84.30% 2414.632      1 - PoolCounterWorkViaCallback::doWork@1
 84.30% 2414.629      1 - CirrusSearch\Searcher::__invoke
 84.27% 2413.646      1 - Elastica\Search::search
 84.26% 2413.402      1 - Elastica\Client::request
 84.26% 2413.314      1 - Elastica\Request::send
 84.25% 2413.231      1 - Elastica\Transport\Http::exec
 11.53% 330.207      1 - ApiMain::printResult
 11.50% 329.505      1 - ApiFormatBase::closePrinter
Jdlrobson triaged this task as High priority.Dec 14 2015, 6:13 PM

Have asked @BGerstle-WMF and @bearND to see if this impacts apps too. Thanks @Legoktm for showing us where most of the leg work is happening.

ori added a subscriber: ori.Dec 14 2015, 10:59 PM

According to @Legoktm's paste, 84.25% of the time is spent waiting on ElasticSearch to return results.

Jdlrobson set Security to None.
Restricted Application added a project: Discovery. · View Herald TranscriptDec 14 2015, 11:01 PM

According to @Legoktm's paste, 84.25% of the time is spent waiting on ElasticSearch to return results.

Yup which makes sense given I'd expect this to be the most computationally expensive. The question is whether we can cache or need to add a caching layer.

@EBernhardson @dcausse Thoughts on this? It seems the "morelike" query exposed by CirrusSearch can take quite a long time to give a response.

Morelike is a costly query :(
We might be able to gain some ms by using smaller fields, a very quick test shows that we can decrease response time from ~2 to 0.7s by working on opening_text.
Unfortunately I never looked at the quality of the results with this field.
You can experiment by adding the following params to the URL : &cirrusMltUseFields=yes&cirrusMltFields=opening_text

Opening text:

nomoa@dizzy:~$ ab -k -n 50 'https://en.m.wikipedia.org/w/api.php?action=query&format=json&prop=pageterms%7Cpageimages%7Cpageprops&ppprop=mainpage%7Cdisambiguation&wbptterms=description&generator=search&gsrsearch=morelike%3AJohn_Lennon&gsrnamespace=0&gsrwhat=text&gsrinfo=&gsrprop=redirecttitle&gsrlimit=5&piprop=thumbnail&pithumbsize=640&pilimit=5&continue=&cirrusMltUseFields=yes&cirrusMltFields=opening_text'

              min  mean[+/-sd] median   max
Connect:      100  101   0.4    101     102
Processing:   435  520  47.2    521     618
Waiting:      435  519  47.2    521     617
Total:        535  620  47.2    622     718

Percentage of the requests served within a certain time (ms)
  50%    622
  66%    642
  75%    652
  80%    663
  90%    691
  95%    708
  98%    718
  99%    718
 100%    718 (longest request)

Default:

nomoa@dizzy:~$ ab -k -n 50 'https://en.m.wikipedia.org/w/api.php?action=query&format=json&prop=pageterms%7Cpageimages%7Cpageprops&ppprop=mainpage%7Cdisambiguation&wbptterms=description&generator=search&gsrsearch=morelike%3AJohn_Lennon&gsrnamespace=0&gsrwhat=text&gsrinfo=&gsrprop=redirecttitle&gsrlimit=5&piprop=thumbnail&pithumbsize=640&pilimit=5&continue=&cirrusMltUseFields=yes'

              min  mean[+/-sd] median   max
Connect:      100  107  28.8    101     295
Processing:  1896 2355 354.1   2326    3215
Waiting:     1896 2355 354.1   2326    3215
Total:       1996 2462 353.4   2431    3315

Percentage of the requests served within a certain time (ms)
  50%   2431
  66%   2603
  75%   2672
  80%   2817
  90%   2998
  95%   3111
  98%   3315
  99%   3315
 100%   3315 (longest request)

https://www.mediawiki.org/wiki/API:Caching_data may be helpful. Caution: abuse of these things can cause as much pain as non-use.

Change 260603 had a related patch set uploaded (by DCausse):
Optimise morelike queries

https://gerrit.wikimedia.org/r/260603

Thanks @dcausse! I'm going to put this on our sprint board for review, since there's a patch here now.

Deskana moved this task from in progress to Needs review on the Discovery-Search (Current work) board.
Deskana assigned this task to dcausse.
Deskana moved this task from Needs triage to On Sprint Board on the Discovery board.
Legoktm removed a subscriber: Legoktm.

The patch I provided is an easy win on cirrus side, we can expect a 600-800ms gain. Unfortunately this query remains costly, I hope to see avg response time below 1.5s but I don't think we can reduce it to <1s. Caching this query remains a very good idea (I should have created another task maybe?).

I think that we should invest some time in tuning the morelike query (and not only for performance reasons), the morelike uses default settings that were never actually evaluated. Anecdotal testing shows that we can improve quality and performance.

@dcausse, were you thinking to have CirrusSearch itself emit the caching headers (or keep some sort of origins-side hot cache) so that browser/app side JS doesn't have to put smaxage=<seconds> in the URL?

Or were you thinking the browser/app side JS should change the URL so they are hinting to the infrastructure to create objects at the Varnish edge?

@dcausse, were you thinking to have CirrusSearch itself emit the caching headers (or keep some sort of origins-side hot cache) so that browser/app side JS doesn't have to put smaxage=<seconds> in the URL?

I'd say that, in general, this should be handled on the server side; clients shouldn't really have to worry about it. Whether the logic lives in the CirrusSearch extension, or somewhere else in the stack, is better left to people more knowledgeable than me. :-)

I think there are multiple ways to cache and/or optimize this query:

Caching

  1. Use the elasticsearch cache: this is something we rarely do because queries are hard to cache and we prefer to enable caching for queries/filters that have a very high cache hits.
  2. Use varnish, I don't know what is needed to enable varnish here but I think it would be a good idea to have a common URL for mobile and desktop if both need same features.
  3. Client cache (Cache-control headers): I think it won't hurt to enable client caching here but I don't think that cirrus can emit the headers itself.

Optimisations
Morelike is one of the slowest query we serve today (avg time per node is 125ms vs 20ms for a classic fulltext query). By default morelike uses the full content of the page to compute relevant terms. I don't think we really need the full content here, relevant terms are often in the first section.
I think we can dramatically reduce elasticsearch time by working on smaller content and more broadly I think we can improve quality. Unfortunately morelike is a feature hidden behind a complex syntax for our direct users (search users) so it's extremely difficult for us to measure/enhance morelike with our search logs. But it seems to be a core part of the discovery features offered by the mobile app. There are many settings available in cirrus that we could tune to improve both quality and performances.

Just a small example to show how results can vary by just tweaking one or two params :

  • Paris
Default (2s)Without boost links (2s)opening_text with boost links (600ms)opening_text without boost links (600ms)
FranceDemographics of ParisBarcelonaCulture of Paris
LyonEconomy of ParisDhakaArt in Paris
RéunionTimeline of ParisIndonesiaDhaka
Tramways in Île-de-FranceTramways in Île-de-FranceLisbonHistory of Paris
Île-de-FranceÎle-de-FranceLondonLisbon

Change 260603 merged by jenkins-bot:
Optimise morelike queries

https://gerrit.wikimedia.org/r/260603

Deskana closed this task as Resolved.Dec 28 2015, 9:42 PM

The above patch, which contained optimisations to help here, was merged, so I think we can consider this task resolved. Thanks @dcausse! If there remain outstanding issues with the speed of the query, then I propose we open another task for that.