The CirrusSearch inconsistencies Superset dashboard tracks some inconsistencies detected by comparing the cirrus dumps extracted from the codfw search cluster against the sqooped MySQL tables available in the data lake. (Note: some aggregated data is also available in Prometheus under the cirrussearch_content_inconsistency metric.)
In T410602 a contributor discovered that the search index had retained some stale data. This is particularly dangerous because the stale data can come from a vandalized edit, which may in turn pollute search with highly problematic responses/suggestions.
The current consistency checks failed to uncover such problems early enough. We should evaluate how to improve them so that we can detect these issues more proactively.
Problems with the current checks:
- they capture only simple problems:
  - redirect_in_cirrus: redirects indexed as plain pages in cirrus
  - in_mysql_but_not_in_cirrus: page present in the database but not in the cirrus index
  - in_cirrus_but_not_in_mysql: page in the search index but not in the database
  - revision_mismatch: page indexed, but with the wrong revision
- the check compares weekly cirrus dumps against monthly sqoop db snapshots, and includes some tedious logic to account for the time difference between these two datasets
- the checks compare only very high-level metadata (page_id, revision_id)
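
For reference, the four categories listed above can be sketched as a simple classification over the two datasets. This is only an illustration, not the actual job: the function name, the input shapes (plain dicts keyed by page_id) and the treatment of redirects are assumptions, and it ignores the time-difference logic entirely.

```python
def classify_inconsistencies(mysql_pages, cirrus_docs):
    """Classify pages into the four inconsistency categories.

    Hypothetical input shapes (assumptions, not the real schemas):
      mysql_pages: page_id -> (latest_rev_id, is_redirect)
      cirrus_docs: page_id -> indexed rev_id
    """
    report = {
        "redirect_in_cirrus": [],
        "in_mysql_but_not_in_cirrus": [],
        "in_cirrus_but_not_in_mysql": [],
        "revision_mismatch": [],
    }
    for page_id, (rev_id, is_redirect) in sorted(mysql_pages.items()):
        if is_redirect:
            # Assumption: a redirect should not have its own indexed document.
            if page_id in cirrus_docs:
                report["redirect_in_cirrus"].append(page_id)
        elif page_id not in cirrus_docs:
            report["in_mysql_but_not_in_cirrus"].append(page_id)
        elif cirrus_docs[page_id] != rev_id:
            report["revision_mismatch"].append(page_id)
    # Documents in the index with no matching row in the database.
    report["in_cirrus_but_not_in_mysql"] = sorted(set(cirrus_docs) - set(mysql_pages))
    return report
```

Note that none of these categories would flag a page whose page_id and revision_id match but whose other indexed fields are stale.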
Suggested improvements:
- increase the frequency of the checks from monthly to weekly (can we use data sources other than the sqooped tables?)
- also check the eqiad index by ingesting it into the datalake as well
- include more granular checks of important metadata:
  - redirects array in cirrus
  - defaultsort
  - others
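
A more granular check could look roughly like the sketch below, which compares the redirects array and defaultsort of an indexed document against the values expected from the database. The helper name and the document shape (a `redirect` list of `namespace`/`title` entries and a `defaultsort` string) are assumptions made for illustration; how the expected values would be derived from the database is left open.

```python
def field_mismatches(expected, indexed):
    """Return the names of the checked fields that differ.

    Both arguments are hypothetical dicts with optional keys:
      "redirect": list of {"namespace": int, "title": str}
      "defaultsort": str
    """
    def redirect_set(doc):
        # Compare redirects as a set so that ordering does not matter.
        return {(r["namespace"], r["title"]) for r in doc.get("redirect", [])}

    mismatches = []
    if redirect_set(expected) != redirect_set(indexed):
        mismatches.append("redirect")
    if expected.get("defaultsort") != indexed.get("defaultsort"):
        mismatches.append("defaultsort")
    return mismatches


# Example: the index still lists a redirect that no longer exists.
expected = {"redirect": [{"namespace": 0, "title": "Foo"}], "defaultsort": "Bar"}
indexed = {
    "redirect": [
        {"namespace": 0, "title": "Foo"},
        {"namespace": 0, "title": "Vandalized title"},
    ],
    "defaultsort": "Bar",
}
print(field_mismatches(expected, indexed))
```

A per-field result like this could feed additional categories alongside the existing ones, so that stale redirects are reported even when the revision matches.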