Page MenuHomePhabricator

Reader gets cacheable search results
Closed, ResolvedPublic2 Estimated Story Points

Description

"As a Reader, I want search results to be cacheable on a very short time span, so if I make a typo and correct it my previous search results are retrieved much faster."

This is for typeahead search typos. So if I type "Washingt" I get search results for that word, and if I type "p" next, I'll get very few results for "Washingtp", and typing backspace will initiate a search for "Washingt" which will be cached. Apparently a big hassle for end users when it does too many searches. Cache window should be somewhere between search index window and the time to identify and correct a typo (<60s, maybe much less).

@eprodromou to determine the correct cache window.

Event Timeline

Feedback from @Anomie :

seems straightforward enough if SearchHandler doesn't already do it. For the Action API you should be able to get this by including &maxage=5&smaxage=5 in the query.

WDoranWMF set the point value for this task to 2.Feb 26 2020, 7:59 PM

@Pchelolo did some good research on this, from which I extrapolated with some heuristics. We have a 2 minute time frame for search indexing, with a wide variation. I'm going to say we take half that period, 1 minute, as the best-case cache window.

I'd prefer that this window was set with a configuration variable and defaulted to 0 (no caching) for developer systems or small installations.

We have a config setting that controls caching of the result of the OpenSearch action API module, which is used by the search widget: $wgSearchSuggestCacheExpiry. This defaults to 1200 seconds (twenty minutes). In production, it's 10800 seconds (two hours). ApiOpenSearch also sets this cache to public, so it's in Varnish, protecting the search backend from traffic. I expect this is especially important for short prefixes. Without this, the search backend would be hit with hundreds of searches per second for short prefixes like "e" and "s" and "n", etc. We had this happen with the completion search on wikidata, causing massive load and timeouts.

If the target use case is indeed type-ahead, we should follow the same pattern as ApiOpenSearch. We could try and be smart and say we only publically cache prefixes up to three letters, and make the cache for longer input private. That would prevent flooding of the Varnish layer while still protecting the search infrastructure. But this strategy should be adopted by all APIs serving type-ahead completions.

By the way - as far as I can tell, the search endpoint is not currently suitable for typeahead suggestions, since it doesn't do prefix searches. It's based on the searchTitle() and searchText() methods in SearchEngine, which type-ahead completion should use completionSearchWithVariants(), as ApiOpenSearch does.

Agreed that the ApiOpenSearch pattern sounds far superior (I haven't actually looked at the code). It also sounds like more effort than we'd discussed for the current task. Fortunately, traffic levels to the the current Core REST API search endpoint are modest, so this isn't melting down the servers. I'm not aware of any reason that traffic to this endpoing is imminently about to skyrocket. So this sounds to me like a high priority concern, but not an Unbreak Now one either.

Suggestion: complete the current task in this sprint as it is currently described, using a hard-coded cache time rather than a configured one. This seems to make the task straightforward, and (assuming we cache privately so no varnish flooding) makes the current endpoint slightly less bad. Then also file a separate task for reimplementing the endpoint using the techniques from ApiOpenSearch module, and pick that up in the next sprint.

@daniel brings up a great point about prefix search versus full-text search. I think what we need for this particular epic is prefix search. I've broken out a new ticket T246387 for making that work. I called it 'title search', but that might be imprecise.

In our 1:1 Daniel and I compared the caching for Special:Search and the type-ahead widget. The special page has 0 caching, and the type-ahead widget has 2-hour caching. So it probably makes sense to reflect those values for /search/page and /search/title respectively.

Change 575368 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] SearchHandler: emit cache control header

https://gerrit.wikimedia.org/r/575368

Following Evan's comment above, I have made this a parent task of T246387. Caching will not be implemented for the existing search/page endpoint, but for a future search/title endpoint. So we can't work on this until we have the search/title endpoint.

Change 583458 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] SearchHandler: emit Cache-Control header.

https://gerrit.wikimedia.org/r/583458

Change 575368 abandoned by Daniel Kinzler:
SearchHandler: emit cache control header

Reason:
Ia8366684203381b6c4dc55669a6877e53e9ffe40

https://gerrit.wikimedia.org/r/575368

Change 583458 merged by jenkins-bot:
[mediawiki/core@master] SearchHandler: emit Cache-Control header.

https://gerrit.wikimedia.org/r/583458