Page MenuHomePhabricator

Use user_random sort in LocalSearchTaskSuggester
Open, Needs TriagePublic

Description

We currently do not have the ability to easily paginate results in the local search task suggester.

What we do instead is:

  1. make a query
  2. keep track of the page IDs that we've received
  3. when we make a query for more results, we pass and exclude page IDs we've already received

We should be able to make use of the user_random sort flag in CirrusSearch added in rECIR92ad47f7c93a: Provider a stable random sort. That should provide the ability for us to have a stable randomize search per user through results.

Acceptance Criteria

  1. Use CirrusSearch pagination instead of relying on the excludepageids parameter.
  2. Make use of the offset parameter in the growth tasks API
Completion checklist

Functionality

  • The patches have been code reviewed and merged
  • The task passes its acceptance criteria

Engineering

  • There are existing and passing unit/integration tests
  • Tests for every involved patch should pass
  • Coverage for every involved project should have improved or stayed the same

Design & QA

  • If the task is UX/Design related: it must be reviewed and approved by the UX/Design team
  • Must be reviewed and approved by Quality Assurance.

Documentation

  • Related and updated documentation done where necessary

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Here's a use case from the SD team:

We need to get ALL pages-with-suggestions for a particular wiki, in order to send notifications to experienced users who have those pages on their watchlists. ATM we query elasticsearch directly to get these because it's tricky to get more than 10k results from elastic, so we have to use elastic's scroll API

ATM we do not exclude results with wiki-specific infbox templates. The simplest way to do this would be to just allow Growth's code to do it, but atm they can only return a limited number of results, and we can't page through them

Would this change make it possible to page through everything? Or will we still hit the 10k results cap? @dcausse @EBernhardson

ATM we do not exclude results with wiki-specific infbox templates. The simplest way to do this would be to just allow Growth's code to do it, but atm they can only return a limited number of results, and we can't page through them

The Growth code defines a search term for infoboxes (hastemplatecollection:infobox) so you can use that with the normal CirrusSearch API and whatever sort / paging options you prefer. Not sure that would help with the 10K cap though.

Wrt using user_random in the Growth search code: is that subject to the same constraint than random with a seed, that it relies on document ID hashes and will essentially re-randomize during index updates?

Here's a use case from the SD team:

We need to get ALL pages-with-suggestions for a particular wiki, in order to send notifications to experienced users who have those pages on their watchlists. ATM we query elasticsearch directly to get these because it's tricky to get more than 10k results from elastic, so we have to use elastic's scroll API

ATM we do not exclude results with wiki-specific infbox templates. The simplest way to do this would be to just allow Growth's code to do it, but atm they can only return a limited number of results, and we can't page through them

Would this change make it possible to page through everything? Or will we still hit the 10k results cap? @dcausse @EBernhardson

I think I'm not quite following the relationship between a different sort and the 10k limit. I suppose the hope is that by adding the infobox related filter it would get the result lists down below the threshold, but if it does work it would be a weak guarantee.

You still wouldn't be able to page through everything directly, the 10k result limit is a hard limit within elasticsearch on the search api. The workarounds are either scroll or _search_after functionality. We have maintenance code that uses both, but nothing on the user facing side. Scroll couldn't be exposed on the user-facing side, it requires state on the search instances which could lead to trouble. Isearch_after could be supported, but it's often hard to fit these kinds of pieces into the mediawiki side, as the core code is supposed to be generic to the underlying search engine. The actual implementation side of search_after should be easy, it amounts to taking a value from the elasticsearch response and sticking it in the next request, the difficulty would revolve around getting that information out to the edge so it can be passed back in on subsequent queries.

Wrt using user_random in the Growth search code: is that subject to the same constraint than random with a seed, that it relies on document ID hashes and will essentially re-randomize during index updates?

The randomization is based on _seq_id, which is essentially an internal counter of the # of writes within the index that is unique. Every time the document is updated it get's a new version number, so the sorting will slowly change over time. We've been intending to add the page_id as a field (since 2019...) that would be usable here, we could probably get around to it sooner than later which would provide a longer-term stable randomization.

_search_after could be supported, but it's often hard to fit these kinds of pieces into the mediawiki side, as the core code is supposed to be generic to the underlying search engine.

The API supports an arbitrary continuation string which the API module can pass to the client, and the client is expected to return verbatim to get the next batch of results. I think that approach could be used in SearchEngine too: add a new IPageableSearchResultSet which extends ISearchResultSet and exposes a continuation key, make the various result classes implement it, and add a setter to SearchEngine. The continuation string would be opaque to the code calling the search engine and could simply be null if the given engine does not support paging.

We've been intending to add the page_id as a field (since 2019...) that would be usable here, we could probably get around to it sooner than later which would provide a longer-term stable randomization.

That would be great, thanks!

I think I'm not quite following the relationship between a different sort and the 10k limit. I suppose the hope is that by adding the infobox related filter it would get the result lists down below the threshold, but if it does work it would be a weak guarantee.

You still wouldn't be able to page through everything directly, the 10k result limit is a hard limit within elasticsearch on the search api.

Ok fair enough - it's the paging-through-everything that we need, don't really care about the randomness. In that case we don't have a use case for this ticket after all :)

The Growth code defines a search term for infoboxes (hastemplatecollection:infobox) so you can use that with the normal CirrusSearch API and whatever sort / paging options you prefer. Not sure that would help with the 10K cap though.

Aha! I saw the hastemplatecollection but it hadn't occurred to me we could use it in our extensions. Cool! We still have the 10k limit problem though

We've been intending to add the page_id as a field (since 2019...) that would be usable here, we could probably get around to it sooner than later which would provide a longer-term stable randomization.

That was done in rECIR2858f3ab7369: Add page_id a doc_values only field (thanks!), availability in production is tracked in T147505: [tracking] CirrusSearch: what is updated during re-indexing. I think SearchRequestBuilder still needs to be updated - is it possible to just change the field used for the existing random and user_random, for some very minor B/C break, or should these be new sort types?

Wrt this task, using pseudorandom sort is a good idea (not sure if it should be user_random or just letting the caller choose the random seed), but pagination is going to be the tricky part, given that SearchTaskSuggester interleaves multiple CirrusSearch queries.

Moving to Triaged due to lack of activity; it's listed under T311850 so hopefully we don't lose sight of this task.

pagination is going to be the tricky part, given that SearchTaskSuggester interleaves multiple CirrusSearch queries.

I guess we can use a separate offset for every CirrusSearch query; the action API makes that kind of thing easy. We'd make sure that the order of referencing queries is deterministic, and use a continuation parameter like 3|3|4|5|4 that's a list of query offsets. TaskSet would have to be updated so the offset property is an array, not an int (or maybe leave that property be and add a new continuation property which is an arbitrary JSON-able value handled in an implementation-specific manner) and the logic in SearchTaskSuggester::doSuggest() would have to increment it as it picks tasks from the individual queries. The $offset parameter of SearchTaskSuggester::suggest() would have to be changed similarly.

So that seems conceptually simple but possibly still more work than value.