
Cache morelike API query results
Closed, Resolved · Public

Description

morelike queries account for a significant share of the load on the cluster (~20%) relative to the number of queries we actually perform. Some initial analysis in hive suggests we could cut the number served by the backend to 1/5 of the current load by caching responses for 24 hours:

A relatively naive query in hive suggests that in the span of 24h we could cut morelike queries to the backend from 7.3M to 1.7M:

select sum(total), sum(deduplicated) from (
    select count(1) as total, count(distinct requests[0].query) as deduplicated
    from wmf_raw.cirrussearchrequestset
    where year=2016 and month=1 and day=10 and requests[0].querytype = 'more_like'
    group by wikiid
) x;

_c0 _c1
7331659 1726091

This tries to get a rough estimate of how that compares to the variance in the way URIs are sent. I'm not sure how good an approximation this is, but the totals are similar enough that it might be a reasonable guess:

select sum(total), sum(deduplicated) from (
    select count(1) as total, count(distinct uri_query) as deduplicated
    from wmf.webrequest
    where year=2016 and month=1 and day=10 and uri_query LIKE '%search=morelike%'
    group by uri_host
) x;

_c0 _c1
7383599 2214332

The typical API request looks like: https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3ACome_Share_My_Love&gsrnamespace=0&gsrlimit=3

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description.
EBernhardson added a project: CirrusSearch.
EBernhardson subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

@Anomie Before I go digging into working on this, I was wondering if you have any opinion? I'm initially thinking of adding a cacheMaxAge() function to the SearchResultSet object, which the API can then choose to honour or ignore. CirrusSearch will only set this for morelike queries.

The main part I'm uncertain about is how using the search as a generator to fetch other properties affects the cacheability. There are probably also other things I just don't know about yet.
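A very rough sketch of the idea (the names here are placeholders, not a finished design):

// Sketch only: SearchResultSet exposes an optional max cache age that the
// API layer may honour. CirrusSearch would set it for morelike queries only.
class SearchResultSet {
    /** @var int|null Seconds the result may be cached, or null for no caching */
    private $cacheMaxAge = null;

    public function setCacheMaxAge( $seconds ) {
        $this->cacheMaxAge = $seconds;
    }

    public function getCacheMaxAge() {
        return $this->cacheMaxAge;
    }
}

// CirrusSearch would call setCacheMaxAge( 86400 ) only for morelike queries, and
// the API side could then do something along the lines of:
//   $maxAge = $resultSet->getCacheMaxAge();
//   if ( $maxAge !== null ) {
//       $this->getMain()->setCacheMaxAge( $maxAge );
//   }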

The main part I'm uncertain about is how using the search as a generator to fetch other properties affects the cacheability. There are probably also other things I just don't know about yet.

That is a concern: setting a cache time would cause all those properties to be cached too, which may confuse users or cause them to hack around the problem by varying their URLs or using POSTs to bypass caching.

Another concern is that multiple query modules trying to set cache times would conflict with each other, unless we add code to somehow merge the different values.


Another possibility to deal with this issue, if the queries are all from a limited number of sources and seem likely to remain so, would be to convince those sources to set the maxage and smaxage parameters and to ensure a consistent ordering of their query parameters.
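For example, the request from the task description with the cache parameters added (86400 seconds = 24 hours); every caller would need to emit its query parameters in the same order for the caches to treat the requests as identical:

https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3ACome_Share_My_Love&gsrnamespace=0&gsrlimit=3&maxage=86400&smaxage=86400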

Or the caching could be done in Cirrus instead of the API: store morelike query results in a memcache of some sort, and then use that cached data to respond to the query when it recurs. That would allow the API to process the rest of its modules using fresh data as people probably expect.

Another possibility to deal with this issue, if the queries are all from a limited number of sources and seem likely to remain so, would be to convince those sources to set the maxage and smaxage parameters and to ensure a consistent ordering of their query parameters.

As @Jhernandez said, this seems like the quickest change that the Reading Web and Apps teams could make, while also being the most brittle – if one team wanted to tweak the query, it would have to communicate/coordinate with the other teams well in advance.

Or the caching could be done in Cirrus instead of the API: store morelike query results in a memcache of some sort, and then use that cached data to respond to the query when it recurs. That would allow the API to process the rest of its modules using fresh data as people probably expect.

… with the cache layer doing a little work to order the query parameters as it sees fit. IMO this is the solution we should be aiming for, perhaps using the above as a stop-gap.
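A minimal sketch of what that could look like on the Cirrus side, assuming the main object cache and a hypothetical runMoreLikeQuery() stand-in for the actual Elasticsearch call (not necessarily what an eventual patch would do):

// Serve repeated morelike queries from the object cache for 24 hours,
// keyed on the wiki and the search term.
$cache = ObjectCache::getLocalClusterInstance();
$key = $cache->makeKey( 'cirrussearch', 'morelike', $wikiId, md5( $term ) );

$resultSet = $cache->get( $key );
if ( $resultSet === false ) {
    $resultSet = $runMoreLikeQuery( $term ); // hypothetical backend call
    $cache->set( $key, $resultSet, 86400 ); // 24 hour TTL
}
return $resultSet;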

Change 266419 had a related patch set uploaded (by EBernhardson):
Cache more like queries into ObjectCache

https://gerrit.wikimedia.org/r/266419

Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board.
Deskana subscribed.

This entry point is also a good candidate for a cached REST API entry point, very similar to what we have done for textextracts / "summary". It's easy to set up on our end, makes this cached entry point easy to discover, and avoids the brittleness of having to deterministically construct long query strings.

The complication of having to select one standard thumb size illustrates why we need a thumbnail API with support for client-side size selection. This is discussed in T66214: Define an official thumb API.

Just the other day @Pchelolo and I were discussing this idea, as he commented that the related articles are really cool, but take a ton of time to load. I'm thinking having the entry point /page/related/{title} that returns summaries (more or less) of related pages would be really cool (and fast). Two problems I see right away:

  • Thumbnails: As noted, their size varies and depends on the display device. As a first step, we could do what is currently done for the /page/summary endpoint, where multiple thumbnail-size URLs are generated and served to the client so that it can decide which one to use.
  • Updates: related articles do not depend on edits, but rather on views. In the long run, after the results are generated, they could be put in the Event-Platform and picked up by RESTBase which would then refresh its contents. In the short term, though, we could simply regenerate that after each edit.
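For concreteness, the entry point could look something like this (hypothetical path and response shape, mirroring the existing /page/summary convention):

GET https://en.wikipedia.org/api/rest_v1/page/related/Come_Share_My_Love

returning a list of summary-style objects (title, extract, thumbnail URLs) for the related pages.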

The current morelike query isn't based on page views; it works (approximately, this is very generalized, see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html for a more specific description) by taking the most common words in the article and then performing a search for those words. As such I'm not sure EventBus would be a workable way to handle the updates.

At the higher level of API and caching, I do like the concept of making a RESTBase endpoint. The search API sees high tens to low hundreds of millions of queries a day, and this might be a good first step toward figuring out what it would look like to try to reduce some of the latency introduced by the MediaWiki layer.

Change 266419 merged by jenkins-bot:
Cache more like queries into ObjectCache

https://gerrit.wikimedia.org/r/266419

This entry point is also a good candidate for a cached REST API entry point, very similar to what we have done for textextracts / "summary".

I think the caching that we have implemented here is probably sufficient for now. I'm definitely interested in exploring this possibility in the future, though. :-)

Change 272483 had a related patch set uploaded (by EBernhardson):
Cache more like queries for 24 hours

https://gerrit.wikimedia.org/r/272483

Change 272483 merged by jenkins-bot:
Cache more like queries for 24 hours

https://gerrit.wikimedia.org/r/272483