CirrusSearch should be able to keep its index up to date in most cases
Closed, Resolved · Public


Discrepancies in the search index can arise for various reasons:

  • When the cluster is down
  • Leaks in the update process (jobqueue, bugs, network issues, unknown...)
  • New keyword data added

The purpose of this task is to list all possible discrepancies and find a reliable way to make sure that the index is up to date within a reasonable amount of time.

The sanitization process is already in place and can fix some discrepancies (out-of-date pages, e.g. when the last rev id does not match the one in the index). But this is not sufficient:

  • when a new keyword is added, the revision id may not change, so the sanitization process does not know that some pages need to be re-parsed/reindexed.
  • add more here.
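
A minimal sketch of the revision-id comparison described above, to show why it misses the new-keyword case. This is not the actual CirrusSearch sanitizer code: `find_stale_pages` and the two dict inputs are hypothetical names, and the data is made up for illustration.

```python
from typing import Dict, List


def find_stale_pages(db_latest_rev: Dict[int, int],
                     indexed_rev: Dict[int, int]) -> List[int]:
    """Return page ids whose indexed rev id is missing or differs from the database.

    Both arguments map page id -> revision id (hypothetical inputs).
    """
    stale = []
    for page_id, latest_rev in db_latest_rev.items():
        rev_in_index = indexed_rev.get(page_id)
        # A page is flagged only if it is absent from the index
        # or its indexed revision id does not match the database.
        if rev_in_index is None or rev_in_index != latest_rev:
            stale.append(page_id)
    return sorted(stale)


# Page 1 is in sync, page 2 is out of date, page 3 is missing from the index.
db = {1: 10, 2: 20, 3: 30}
index = {1: 10, 2: 19}
print(find_stale_pages(db, index))  # [2, 3]
```

Page 1 is never reported here, even if a newly added keyword would change its indexed document, because its revision id is unchanged. That is the gap this task is about.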

Event Timeline

dcausse created this task. May 2 2017, 6:27 PM
Restricted Application added projects: Discovery, Discovery-Search. May 2 2017, 6:27 PM
Restricted Application added a subscriber: Aklapper.

@dcausse How do you suggest we begin to tackle this?

Deskana triaged this task as Medium priority. May 4 2017, 5:10 PM

@Deskana we probably need to discuss possible solutions first.
I'd say there are two main strategies:

  • brute force: adapt the sanitize process to make sure that every page in the index has been regenerated and reindexed within <period to define>. This will likely produce many more index requests than necessary but seems doable if the period is relatively long (>2 months).
  • smart process: identify which pages need to be regenerated/reindexed. This is a bit trickier, as we'd need to track everything that could lead to a change in the document structure/data. @Smalyshev has been working on T163851, which is going in the right direction. I still don't know if it's possible...
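
One possible shape for the brute-force strategy, as a rough sketch only. It assumes the sanitize process runs repeatedly and that each run may force a reindex of one slice of pages, so that every page is covered once per cycle. `pages_due_for_reindex`, `run_number` and `runs_per_cycle` are hypothetical names, not existing CirrusSearch settings.

```python
from typing import Iterable, List


def pages_due_for_reindex(page_ids: Iterable[int],
                          run_number: int,
                          runs_per_cycle: int) -> List[int]:
    """Select the pages to force-reindex during a given sanitizer run.

    Pages are spread over `runs_per_cycle` slots by page id, so every page
    is selected exactly once in any `runs_per_cycle` consecutive runs.
    """
    if runs_per_cycle <= 0:
        raise ValueError("runs_per_cycle must be positive")
    slot = run_number % runs_per_cycle
    return [p for p in page_ids if p % runs_per_cycle == slot]


pages = range(1, 11)
print(pages_due_for_reindex(pages, run_number=0, runs_per_cycle=4))  # [4, 8]
print(pages_due_for_reindex(pages, run_number=1, runs_per_cycle=4))  # [1, 5, 9]
```

With a long cycle each run only adds a small fraction of the pages as extra index requests, which is the trade-off mentioned above: more writes than strictly necessary, but a bounded delay before any page gets regenerated.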

EBernhardson closed this task as Resolved. Sep 11 2018, 5:26 PM
EBernhardson claimed this task.
EBernhardson added a subscriber: EBernhardson.

Resolved via T203622