
[16 hrs] Investigate cursor based pagination for CirrusSearch results
Closed, Resolved, Public

Description

Event Timeline

TL;DR: Cursor pagination is not supported by CirrusSearch out of the box, but we can make it look like cursor pagination on the outside (user-facing), while doing limit and offset pagination internally.


Pagination in ElasticSearch and OpenSearch

There are three different pagination methods:

  • Limit/offset (called from + size in Elasticsearch and OpenSearch), as we know it
  • Sort + search_after
    • Requires specifying a value to sort results by
    • Works by passing the sort values of the last result already seen, e.g. “sort countries by name, search_after: { name: ‘Japan’ }” returns the countries that come after “Japan” in alphabetical order
    • Can be used in combination with a point in time to keep results stable for a period of time even after index updates
  • Scroll
    • Can be used for paginating through large result sets using a scroll_id (= cursor), and guarantees stable results for a period of time even after index updates
    • Discouraged for deep pagination according to a note in the ES docs; OpenSearch also suggests using search_after with a point in time instead
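To make the difference between the first two methods concrete, here is a minimal sketch of the request bodies involved. The helper names and the `name` sort field are made up for illustration; only the `from`, `size`, `sort` and `search_after` keys are part of the search API. Note that in the real API `search_after` takes a list of sort values (one per sort key) rather than the object notation used informally above.

```python
from typing import Any, Optional


def from_size_body(query: dict[str, Any], offset: int, limit: int) -> dict[str, Any]:
    """Limit/offset pagination: skip `offset` hits, return up to `limit`."""
    return {"query": query, "from": offset, "size": limit}


def search_after_body(
    query: dict[str, Any],
    sort: list[dict[str, str]],
    limit: int,
    last_sort_values: Optional[list[Any]] = None,
) -> dict[str, Any]:
    """Sort + search_after: continue after the sort values of the last hit seen.

    `last_sort_values` is None for the first page.
    """
    body: dict[str, Any] = {"query": query, "sort": sort, "size": limit}
    if last_sort_values is not None:
        body["search_after"] = last_sort_values
    return body


# Hypothetical example: countries sorted by name, continuing after "Japan".
query = {"match_all": {}}
page_3 = from_size_body(query, offset=20, limit=10)
after_japan = search_after_body(query, [{"name": "asc"}], 10, ["Japan"])
```

In practice the sort would usually need a unique tiebreaker field as well, so that hits with equal sort values are not skipped.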

Pagination in (Wikibase)CirrusSearch
CirrusSearch only supports limit and offset pagination for user-facing search. Other types (scroll and search_after) are used for deep pagination in some maintenance scripts.

Suggested approach: encode limit + offset in the cursor
We can use the limit and offset pagination method that is supported in WikibaseCirrusSearch while designing our schema for cursor pagination. We can achieve this by encoding the limit and offset values in the cursor. This is not an uncommon approach and is described e.g. in this article: https://medium.com/better-programming/understanding-the-offset-and-cursor-pagination-8ddc54d10d98#fcb6
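A minimal sketch of what such a cursor could look like, assuming a base64-encoded JSON payload. The function names and the payload layout are illustrative assumptions, not an existing WikibaseCirrusSearch API.

```python
import base64
import json
from typing import Optional


def encode_cursor(offset: int, limit: int) -> str:
    """Pack offset and limit into an opaque, URL-safe string."""
    payload = json.dumps({"offset": offset, "limit": limit}, separators=(",", ":"))
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_cursor(cursor: str) -> tuple[int, int]:
    """Unpack a cursor; raise ValueError if it is malformed or out of range."""
    try:
        data = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
        offset, limit = int(data["offset"]), int(data["limit"])
    except (ValueError, KeyError, TypeError) as exc:
        raise ValueError("invalid cursor") from exc
    if offset < 0 or limit <= 0:
        raise ValueError("invalid cursor")
    return offset, limit


def next_cursor(cursor: str, returned_hits: int, total_hits: int) -> Optional[str]:
    """Return the cursor for the following page, or None on the last page."""
    offset, limit = decode_cursor(cursor)
    new_offset = offset + returned_hits
    if returned_hits == 0 or new_offset >= total_hits:
        return None
    return encode_cursor(new_offset, limit)
```

The server would decode the incoming cursor, run the usual limit/offset search, and hand back `next_cursor(...)` with the response. Clients only ever see the opaque string.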

This approach has a few (acceptable!?) downsides:

  • Users may expect search results to be stable, but concurrent edits can cause changes in the results while a user is paginating through them
  • Users may be able to decode the cursor to skip ahead, which is not meant to be possible with cursor pagination
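The first downside can be illustrated with a toy simulation, where a plain list stands in for the ranked search results (this is not how the index works, just a sketch of the effect). A document that enters the result set between two requests shifts every offset by one, so the user sees one hit twice.

```python
def page(results: list[str], offset: int, limit: int) -> list[str]:
    """Limit/offset pagination over an in-memory result list."""
    return results[offset:offset + limit]


results = ["A", "B", "C", "D", "E", "F"]
first_page = page(results, offset=0, limit=3)

# A concurrent edit makes a new document rank first before the next request.
results_after_edit = ["NEW"] + results
second_page = page(results_after_edit, offset=3, limit=3)

# "C" shows up on both pages; a deletion would instead cause a hit to be skipped.
duplicates = set(first_page) & set(second_page)
```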

The advantages of this approach are:

  • We can implement it easily and without external help
  • Our schema is future-proof, so we can improve the underlying pagination mechanism in the future without changing the interface

Point in Time sounds like a useful feature to explore in the future. Combined with search_after, it allows for stable search results and for paginating beyond the default limit of 10,000 results.
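For future reference, a rough sketch of what a point-in-time search body might look like. It assumes a PIT id has already been obtained from the cluster; the helper name, the `name` sort field and the id value are placeholders, and the exact request shape should be checked against the docs for the ES/OpenSearch version in use.

```python
from typing import Any, Optional


def pit_search_body(
    pit_id: str,
    sort: list[dict[str, str]],
    limit: int,
    last_sort_values: Optional[list[Any]] = None,
    keep_alive: str = "1m",
) -> dict[str, Any]:
    """Build a search body that pages through a fixed snapshot of the index."""
    body: dict[str, Any] = {
        "size": limit,
        "sort": sort,
        # The PIT pins the search to one snapshot; keep_alive extends its lifetime.
        "pit": {"id": pit_id, "keep_alive": keep_alive},
    }
    if last_sort_values is not None:
        # Continue after the sort values of the last hit of the previous page.
        body["search_after"] = last_sort_values
    return body


# Placeholder values, for illustration only.
first = pit_search_body("example-pit-id", [{"name": "asc"}], 10)
following = pit_search_body("example-pit-id", [{"name": "asc"}], 10, ["Japan"])
```

With a cursor-shaped interface, the PIT id and the last sort values could later be encoded in the cursor in place of limit and offset.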

Thanks for the investigation and the very comprehensible summary!

+1