
Cache morelike API query results
Closed, Resolved · Public

Description

morelike queries account for a significant share of the load on the cluster (~20%) relative to the number of queries we actually perform. Some initial analysis in hive suggests we could cut the number served by the backend to 1/5 of the current load by caching responses for 24 hours:

A relatively naive query in hive suggests that in the span of 24h we could cut morelike queries to the backend from 7.3M to 1.7M:

select sum(total), sum(deduplicated) from (
    select count(1) as total, count(distinct requests[0].query) as deduplicated
    from wmf_raw.cirrussearchrequestset
    where year=2016 and month=1 and day=10 and requests[0].querytype = 'more_like'
    group by wikiid
) x;

_c0 _c1
7331659 1726091

This tries to get a rough estimate of how that compares to the variance in the way URIs are sent. I'm not sure how good an approximation this is, but the totals are similar enough that it might be a reasonable guess:

select sum(total), sum(deduplicated) from (
    select count(1) as total, count(distinct uri_query) as deduplicated
    from wmf.webrequest
    where year=2016 and month=1 and day=10 and uri_query LIKE '%search=morelike%'
    group by uri_host
) x;

_c0 _c1
7383599 2214332

The typical API request looks like: https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3ACome_Share_My_Love&gsrnamespace=0&gsrlimit=3

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description.
EBernhardson added a project: CirrusSearch.
EBernhardson subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

@Anomie Before I go digging into working on this, I was wondering if you have any opinion? I'm initially thinking of adding a cacheMaxAge() function to the SearchResultSet object, which the API can then choose to honour or ignore. CirrusSearch will only set this for morelike queries.

The main part I'm uncertain about is how using the search as a generator to fetch other properties affects the cacheability. There are probably also other things I just don't know about yet.
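A very rough sketch of the idea (the names here are placeholders, not a finished design):

// Sketch only: SearchResultSet exposes an optional max cache age that the
// API layer may honour. CirrusSearch would set it for morelike queries only.
class SearchResultSet {
    /** @var int|null Seconds the result may be cached, or null for no caching */
    private $cacheMaxAge = null;

    public function setCacheMaxAge( $seconds ) {
        $this->cacheMaxAge = $seconds;
    }

    public function getCacheMaxAge() {
        return $this->cacheMaxAge;
    }
}

// CirrusSearch would call setCacheMaxAge( 86400 ) only for morelike queries, and
// the API side could then do something along the lines of:
//   $maxAge = $resultSet->getCacheMaxAge();
//   if ( $maxAge !== null ) {
//       $this->getMain()->setCacheMaxAge( $maxAge );
//   }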

The main part I'm uncertain about is how using the search as a generator to fetch other properties affects the cacheability. There are probably also other things I just don't know about yet.

That is a concern: setting a cache time would cause all those properties to be cached too, which may confuse users or cause them to hack around the problem by varying their URLs or using POSTs to bypass caching.

Another concern is that multiple query modules trying to set cache times would conflict with each other, unless we add code to somehow merge the different values.


Another possibility to deal with this issue, if the queries are all from a limited number of sources and seem likely to remain so, would be to convince those sources to set the maxage and smaxage parameters and to ensure a consistent ordering of their query parameters.
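For example, the request from the task description with the cache parameters added (86400 seconds = 24 hours); every caller would need to emit its query parameters in the same order for the caches to treat the requests as identical:

https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3ACome_Share_My_Love&gsrnamespace=0&gsrlimit=3&maxage=86400&smaxage=86400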

Or the caching could be done in Cirrus instead of the API: store morelike query results in a memcache of some sort, and then use that cached data to respond to the query when it recurs. That would allow the API to process the rest of its modules using fresh data as people probably expect.

Another possibility to deal with this issue, if the queries are all from a limited number of sources and seem likely to remain so, would be to convince those sources to set the maxage and smaxage parameters and to ensure a consistent ordering of their query parameters.

As @Jhernandez said, this seems like the quickest change that the Reading Web and Apps teams could make, while also being the most brittle – if one team wanted to tweak the query, it would have to communicate/coordinate with the other teams well in advance.

Or the caching could be done in Cirrus instead of the API: store morelike query results in a memcache of some sort, and then use that cached data to respond to the query when it recurs. That would allow the API to process the rest of its modules using fresh data as people probably expect.

… with the cache layer doing a little work to order the query parameters as it sees fit. IMO this is the solution we should be aiming for, perhaps using the above as a stop-gap.
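A minimal sketch of what that could look like on the Cirrus side, assuming the main object cache and a hypothetical runMoreLikeQuery() stand-in for the actual Elasticsearch call (not necessarily what an eventual patch would do):

// Serve repeated morelike queries from the object cache for 24 hours,
// keyed on the wiki and the search term.
$cache = ObjectCache::getLocalClusterInstance();
$key = $cache->makeKey( 'cirrussearch', 'morelike', $wikiId, md5( $term ) );

$resultSet = $cache->get( $key );
if ( $resultSet === false ) {
    $resultSet = $runMoreLikeQuery( $term ); // hypothetical backend call
    $cache->set( $key, $resultSet, 86400 ); // 24 hour TTL
}
return $resultSet;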

Change 266419 had a related patch set uploaded (by EBernhardson):
Cache more like queries into ObjectCache

https://gerrit.wikimedia.org/r/266419

Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board.
Deskana subscribed.

This entry point is also a good candidate for a cached REST API entry point, very similar to what we have done for textextracts / "summary". It's easy to set up on our end, makes this cached entry point easy to discover, and avoids the brittleness of having to deterministically construct long query strings.

The complication of having to select one standard thumb size illustrates why we need a thumbnail API with support for client-side size selection. This is discussed in T66214: Define an official thumb API.

Just the other day @Pchelolo and I were discussing this idea, as he commented that the related articles are really cool, but take a ton of time to load. I'm thinking having the entry point /page/related/{title} that returns summaries (more or less) of related pages would be really cool (and fast). Two problems I see right away:

  • Thumbnails: As noted, their size varies and depends on the display device. As a first step, we could do what is currently done for the /page/summary endpoint, where multiple thumbnail-size URLs are generated and served to the client so that it can decide which one to use.
  • Updates: related articles do not depend on edits, but rather on views. In the long run, after the results are generated, they could be put in the Event-Platform and picked up by RESTBase which would then refresh its contents. In the short term, though, we could simply regenerate that after each edit.
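For concreteness, the entry point could look something like this (hypothetical path and response shape, mirroring the existing /page/summary convention):

GET https://en.wikipedia.org/api/rest_v1/page/related/Come_Share_My_Love

returning a list of summary-style objects (title, extract, thumbnail URLs) for the related pages.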

The current morelike query isn't based on page views; it works (approximately, this is very generalized, see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html for a more specific description) by taking the most common words in the article and then performing a search for those words. As such I'm not sure EventBus would be a workable way to handle the updates.

At the higher level of API and caching, I do like the concept of making a RESTBase endpoint. The search API sees high tens to low hundreds of millions of queries a day, and this might be a good first step toward figuring out what it would look like to try to reduce some of the latency introduced by the MediaWiki layer.

Change 266419 merged by jenkins-bot:
Cache more like queries into ObjectCache

https://gerrit.wikimedia.org/r/266419

This entry point is also a good candidate for a cached REST API entry point, very similar to what we have done for textextracts / "summary".

I think the caching that we have implemented here is probably sufficient for now. I'm definitely interested in exploring this possibility in the future, though. :-)

Change 272483 had a related patch set uploaded (by EBernhardson):
Cache more like queries for 24 hours

https://gerrit.wikimedia.org/r/272483

Change 272483 merged by jenkins-bot:
Cache more like queries for 24 hours

https://gerrit.wikimedia.org/r/272483