Discrepancies in the search index can arise for several reasons:
- When the cluster is down
- Leaks in the update process (jobqueue, bugs, network issues, unknown...)
- New keyword data added
The purpose of this task is to list all possible discrepancies and find a reliable way to make sure that the index is up to date in a reasonable amount of time.
The sanitization process is already in place and can fix some discrepancies, such as out-of-date pages whose last rev id does not match the one in the index. But this is not sufficient:
- when a new keyword is added, the revision id may not change, so the sanitization process does not know that some pages need to be re-parsed/reindexed.
- add more here.
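
To illustrate the gap, here is a minimal sketch of a revision-id comparison like the one described above. It is not the actual sanitization code: the function name `find_stale_pages` and the sample data are hypothetical, and it assumes both sides can be reduced to a mapping from page id to revision id.

```python
from typing import Dict, List


def find_stale_pages(db_revs: Dict[int, int], index_revs: Dict[int, int]) -> List[int]:
    """Return ids of pages whose indexed revision id is missing or differs
    from the latest revision id in the database (hypothetical helper)."""
    return sorted(
        page_id
        for page_id, latest_rev in db_revs.items()
        if index_revs.get(page_id) != latest_rev
    )


# Hypothetical data: page id -> latest revision id in the database.
db_revs = {1: 100, 2: 205, 3: 310}

# Hypothetical data: page id -> revision id stored in the search index.
# Page 2 is out of date. Page 3 has the right revision id but is assumed to
# have been indexed before a new keyword was added, so its indexed document
# lacks the keyword data.
index_revs = {1: 100, 2: 200, 3: 310}

stale = find_stale_pages(db_revs, index_revs)
print(stale)  # [2]
```

Page 2 is detected because its revision ids differ. Page 3 is not, even though it needs to be reindexed, because a check based only on revision ids has no signal that the indexed content is incomplete.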