
Expose search_after in SearchEngine
Open, Low · Public · Feature

Description

Feature summary (what you would like to be able to do and where):
Expose search_after parameter in SearchEngine.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
T345713 (fixLinkRecommendationData script yields cirrussearch-offset-too-large) found that a GrowthExperiments maintenance script hits the maximum value of the offset parameter. The error is fair, but it should be possible to request further results of a search call. Elasticsearch supports this via the search_after parameter. Per T345713#9175219, the parameter seems to be exposed in Cirrus raw queries but not in SearchEngine queries.
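For illustration, a rough sketch of what a search_after request body looks like compared with offset paging. The helper name, the `page_id` sort field and the `match_all` query are assumptions made for this example, not what CirrusSearch actually sends.

```python
from typing import Any, Optional


def build_search_body(
    query: dict[str, Any],
    size: int,
    last_sort_values: Optional[list[Any]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: build a search body that pages with search_after."""
    body: dict[str, Any] = {
        "query": query,
        "size": size,
        # search_after needs a deterministic sort; "page_id" is an assumed
        # unique field used here as the only sort key.
        "sort": [{"page_id": "asc"}],
    }
    if last_sort_values is not None:
        # Sort values of the last hit of the previous page. No "from" offset
        # is sent, so the offset limit does not come into play.
        body["search_after"] = last_sort_values
    return body


first = build_search_body({"match_all": {}}, size=100)
next_page = build_search_body({"match_all": {}}, size=100, last_sort_values=[4711])
```

The first request carries no search_after; each later request passes the sort values of the last hit it has seen.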

Benefits (why should this be implemented?):
Navigating through result sets of 10K+ records should be possible

Event Timeline

Restricted Application added a subscriber: Aklapper.
Tgr subscribed.

SearchEngine::setLimitOffset() has an offset parameter, but translating an offset to a search_after value seems hard if at all possible. So I think the nicer thing to do would be to issue multiple queries internally within ISearchResultSet's iterator logic (so this is actually a CirrusSearch issue, not a core one).
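A minimal sketch of the "multiple queries inside the iterator" idea, assuming results are sorted on a unique page ID. `iterate_all`, `FetchPage` and `fake_fetch` are hypothetical names, and the backend is an in-memory list rather than CirrusSearch.

```python
from typing import Callable, Iterator, Optional

# Hypothetical backend call: returns up to `size` page IDs greater than
# `after`, in ascending order.
FetchPage = Callable[[Optional[int], int], list[int]]


def iterate_all(fetch_page: FetchPage, batch_size: int = 100) -> Iterator[int]:
    """Yield every result, fetching a new batch when the current one runs out."""
    after: Optional[int] = None
    while True:
        batch = fetch_page(after, batch_size)
        if not batch:
            return
        yield from batch
        after = batch[-1]  # the last sort value becomes the next search_after


# In-memory stand-in for the search backend, for illustration only.
_INDEX = list(range(1, 251))


def fake_fetch(after: Optional[int], size: int) -> list[int]:
    return [p for p in _INDEX if after is None or p > after][:size]


results = list(iterate_all(fake_fetch, batch_size=100))
```

The caller only sees one iterator; the batching stays an implementation detail of the result set.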

Exposing this feature might involve a dedicated API; I'm not sure that the existing SearchEngine and ISearchResultSet are well suited for this:

  • might be easier to just assume that we are going to sort on page_id, so that db engines might have a chance to be able to implement this.
  • the client must be able to resume this operation so that it can be processed via multiple jobs (to avoid long running maint scripts)

All of this is already possible using internal CirrusSearch APIs; has this solution been considered (see for instance https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/ImageSuggestions/+/refs/heads/master/includes/Notifier.php)?

A DB engine should not have a problem paging on any set of criteria it can efficiently sort on, I think? It's slightly more effort, as you have to manually write the query for each sort, while CirrusSearch just gives you the continuation data to pass into the next request, but it's still not hard. I'd probably add a SearchEngine::setContinue() and an IContinuableSearchResultSet::getContinue(); the caller could just check whether an IContinuableSearchResultSet was returned, and over time all implementations would migrate to returning that (which is unfortunately the least painful way of adding a new method to an interface).
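To make the setContinue() / getContinue() idea concrete, here is a toy sketch under stated assumptions: `ToySearchEngine` and `ContinuableResultSet` are hypothetical stand-ins, the only sort is page ID, and the token format (JSON holding the last page ID) is invented for the example. Neither method exists in SearchEngine today.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContinuableResultSet:
    """Hypothetical analogue of IContinuableSearchResultSet."""
    page_ids: list[int]
    continue_token: Optional[str]  # None when there are no further results

    def get_continue(self) -> Optional[str]:
        return self.continue_token


class ToySearchEngine:
    """In-memory stand-in for a SearchEngine; sorts on page ID only."""

    def __init__(self, page_ids: list[int]) -> None:
        self._page_ids = sorted(page_ids)
        self._limit = 10
        self._after: Optional[int] = None

    def set_limit(self, limit: int) -> None:
        self._limit = limit

    def set_continue(self, token: Optional[str]) -> None:
        # The token is opaque to callers; here it is JSON with the last page ID.
        self._after = json.loads(token)["after"] if token else None

    def search(self) -> ContinuableResultSet:
        rest = [p for p in self._page_ids
                if self._after is None or p > self._after]
        page = rest[: self._limit]
        more = len(rest) > len(page)
        token = json.dumps({"after": page[-1]}) if more else None
        return ContinuableResultSet(page, token)


# Each loop iteration could equally be a separate job that was handed the
# token as a parameter.
engine = ToySearchEngine(list(range(1, 26)))
engine.set_limit(10)
seen: list[int] = []
token: Optional[str] = None
while True:
    engine.set_continue(token)
    result = engine.search()
    seen.extend(result.page_ids)
    token = result.get_continue()
    if token is None:
        break
```

Because the token is a plain string, it can be stored in job parameters, which would cover the "resume across multiple jobs" requirement mentioned earlier in the thread.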

All of this is already possible using internal CirrusSearch APIs; has this solution been considered (see for instance https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/ImageSuggestions/+/refs/heads/master/includes/Notifier.php)?

I think that would work for T345713 specifically. But other aspects of structured tasks where this limitation can also be relevant rely on community-configurable search strings to filter the list of applicable tasks; so a CirrusSearch query string would have to be transformed to an Elastica query and then search_after added to that. That seems quite involved.

But mainly I think this would be a valuable addition in general that would prove useful in the future. E.g. we'd probably want to convert recent changes to CirrusSearch some day, because currently it's a horror show of query planner edge cases that require lots of manual debugging and tuning, often differently for different DB engines or wiki profiles. But recent changes have more than 10K entries, and some clients do want to go through the whole thing.

This will require a significant amount of design and coordination. As such, it should go through the Foundational Technology Requests prioritization.

If it's a lot of work, I don't think we have a sufficiently compelling use case. It seemed easily doable to me, but I don't know the codebase well.

Gehel triaged this task as Low priority. · Nov 3 2023, 10:41 AM
Gehel moved this task from needs triage to Feature Requests on the Discovery-Search board.

As I understand it, the implementation is probably not too difficult. Exposing it in a way that makes sense and does not tie us more strongly to Elasticsearch is the difficult part.

I'm changing this to low priority and keeping it around.