
Improve resiliency of the reindexing process
Open, Medium, Public


Reindexing large wikis is becoming very difficult (cf. T227136).
It seems that the current reindexing process, which is based on the internal mechanism provided by Elasticsearch, is not able to retry a failed query. The reason is that scroll queries are not retriable (ref
Fixing this is tracked upstream, where they suggest adding an API to create and maintain a reference to a Lucene IndexReader, making it possible to sort on _doc (Lucene internal ids) and use searchAfter. This has been marked as a high hanging fruit.
It's likely that if this feature is implemented the reindex process will rely on it.

We could alternatively re-implement our own reindex mechanism. We don't strictly need an immutable IndexReader, we just need a stable sort field that we could use with searchAfter. Currently the _id field does not have doc_values enabled, making it hard to use as a sort criterion, so we'd have to duplicate it into a new field.
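
A minimal sketch of what such a custom scan could look like, assuming the _id is copied into a hypothetical keyword field named `id_sort` (keyword fields have doc_values enabled by default). The `search` callable stands in for whatever client call sends a search request body to the source index; it, the field name, and the function names are assumptions for illustration, not existing code. The point is that each page is an independent query resuming after the last seen sort value, so a failed page can simply be retried.

```python
import time
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical mapping for the duplicated id field (assumption: the field
# is named "id_sort"). A keyword field keeps doc_values by default, which
# makes it usable as a sort key.
ID_SORT_MAPPING = {"properties": {"id_sort": {"type": "keyword"}}}


def build_page_request(sort_field: str, page_size: int,
                       after: Optional[Any]) -> Dict[str, Any]:
    """Build a search body that resumes after the last seen sort value."""
    body: Dict[str, Any] = {
        "size": page_size,
        "query": {"match_all": {}},
        "sort": [{sort_field: "asc"}],
    }
    if after is not None:
        body["search_after"] = [after]
    return body


def scan_with_search_after(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    sort_field: str = "id_sort",
    page_size: int = 1000,
    max_retries: int = 3,
    backoff: float = 1.0,
) -> Iterator[Dict[str, Any]]:
    """Yield every hit in sort order, retrying each page independently.

    `search` is an injected callable (assumption) that takes a request
    body and returns a response shaped like {"hits": {"hits": [...]}},
    where each hit carries its "sort" values.
    """
    after: Optional[Any] = None
    while True:
        body = build_page_request(sort_field, page_size, after)
        for attempt in range(max_retries + 1):
            try:
                response = search(body)
                break
            except Exception:
                # Unlike a scroll, this page carries no server-side state,
                # so re-sending the same body is safe.
                if attempt == max_retries:
                    raise
                time.sleep(backoff * (2 ** attempt))
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Resume strictly after the last document of this page.
        after = hits[-1]["sort"][0]
```

Without an immutable IndexReader this scan is not a point-in-time snapshot: documents written during the scan may or may not be seen, so it only makes sense if the sort field is unique and stable per document, as a copy of _id would be.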