
Improve resiliency of the reindexing process
Open, Medium, Public

Description

Reindexing large wikis is becoming very difficult (cf. T227136).
The current reindexing process, which relies on the internal reindex mechanism provided by Elasticsearch, does not appear to be able to retry a failed query, because scroll queries are not retriable (ref https://github.com/elastic/elasticsearch/issues/26153).
Fixing this is tracked upstream by https://github.com/elastic/elasticsearch/pull/25797, which suggests adding an API to create and maintain a reference to a Lucene IndexReader, making it possible to sort on _doc (Lucene internal ids) and use searchAfter. This has been marked as high hanging fruit.
If this feature is implemented, the reindex process will likely rely on it.

Alternatively, we could implement our own reindex mechanism. We don't strictly need an immutable IndexReader; we just need a stable sort field that we can use with searchAfter. Currently _id does not have doc_values enabled, which makes it hard to use as a sort criterion, so we'd have to duplicate it into a new field.
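
A rough sketch of what such a retriable read loop could look like, assuming _id has been copied into a keyword field with doc_values. The field name `id_sortable`, the `search` callable and the retry parameters are hypothetical placeholders, not existing CirrusSearch code. Each page is requested with `search_after` set to the sort values of the last hit already seen, so a failed request can simply be sent again without any server-side scroll state.

```python
import time
from typing import Any, Callable, Dict, Iterator, List, Optional

# Hypothetical keyword field holding a copy of _id with doc_values enabled.
SORT_FIELD = "id_sortable"


def build_page_query(page_size: int, after: Optional[List[Any]]) -> Dict[str, Any]:
    """Build a search body that resumes right after the last seen sort values."""
    body: Dict[str, Any] = {
        "size": page_size,
        "query": {"match_all": {}},
        "sort": [{SORT_FIELD: "asc"}],
    }
    if after is not None:
        body["search_after"] = after
    return body


def iter_documents(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    page_size: int = 1000,
    max_retries: int = 3,
    backoff_seconds: float = 1.0,
) -> Iterator[Dict[str, Any]]:
    """Yield every hit in sort order, retrying each page independently.

    `search` is any callable that sends the body to the source index and
    returns the decoded JSON response (assumed, e.g. a thin client wrapper).
    """
    after: Optional[List[Any]] = None
    while True:
        body = build_page_query(page_size, after)
        for attempt in range(max_retries + 1):
            try:
                response = search(body)
                break
            except Exception:
                # The request carries no server-side state, so it is safe to resend.
                if attempt == max_retries:
                    raise
                time.sleep(backoff_seconds * (2 ** attempt))
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Each hit carries its sort values; the last one is the next cursor.
        after = hits[-1]["sort"]
```

Since the sort field is a copy of _id, it is unique within the index, which is what keeps `search_after` from skipping or repeating documents at page boundaries. Documents written while the loop runs may or may not be seen, which matches the observation above that an immutable IndexReader is not strictly required.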

Event Timeline

dcausse created this task. Jul 18 2019, 3:00 PM
Restricted Application added a subscriber: Aklapper. Jul 18 2019, 3:00 PM
debt triaged this task as Medium priority. Jul 25 2019, 5:05 PM
debt moved this task from needs triage to elastic / cirrus on the Discovery-Search board.