Search for "the well ordered withdrawal" returns 3 pages. Battle of Buna–Gona is listed twice. One is up to date, but the other is from a month ago. Can we lose the old ghost?
Description
Related Objects
- Mentioned In
- T132951: Cirrus search finds article twice
- Mentioned Here
- T137113: Implement a continuous sanitization process
Event Timeline
@Chris_the_speller: Not sure why you assigned this to me?
Also wondering if there are more examples or so far only this one?
@Aklapper – I cloned this from another CirrusSearch bug, and your name was on it. If you have moved on to other work, thanks for your help in the past. This is the only case of this bug at the moment. If I see others, I'll report them.
Would it be hard to make a test system to verify the contents of the index?
I'd envision a two-part background service that daily (1) takes a random X% (where X is roughly 0.1 to 1) of the documents and checks that they made it into the index and are still there, and (2) does the reverse by pulling random ES documents and verifying that they should be in the index.
One way to do this is to calculate a hash of each document at the start of your indexing pipeline and store that hash both in an ES field on the document and in an external store. You can then query ES for the hash to ensure the right version exists. The reverse direction can query ES with a random scoring function and compare the retrieved hashes against the external store.
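To make the idea concrete, here is a minimal sketch of both directions of the check. It is not CirrusSearch code: plain dicts stand in for the source of truth, the external hash store, and the ES index, and the names `content_hash`, `forward_check`, and `reverse_check` are made up for illustration. A real service would replace the dict lookups with ES queries (for example a random-scoring query for the reverse direction).

```python
import hashlib
import json
import random


def content_hash(doc: dict) -> str:
    """Stable SHA-256 of a document's content (keys sorted for determinism)."""
    payload = json.dumps(doc, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def forward_check(source, hash_store, index, fraction, rng):
    """Sample source documents and confirm the index holds the expected version."""
    if not source:
        return []
    k = min(len(source), max(1, round(len(source) * fraction)))
    problems = []
    for doc_id in rng.sample(sorted(source), k):
        indexed = index.get(doc_id)
        if indexed is None:
            problems.append((doc_id, "missing"))
        elif indexed["hash"] != hash_store[doc_id]:
            problems.append((doc_id, "stale"))
    return problems


def reverse_check(index, hash_store, sample_size, rng):
    """Sample indexed documents and flag any the external store does not expect."""
    ids = sorted(index)
    problems = []
    for doc_id in rng.sample(ids, min(sample_size, len(ids))):
        expected = hash_store.get(doc_id)
        if expected is None:
            problems.append((doc_id, "extraneous"))
        elif index[doc_id]["hash"] != expected:
            problems.append((doc_id, "stale"))
    return problems


# Hypothetical data: one current page, one stale page, one leftover "ghost" entry.
source = {
    "p1": {"title": "Battle of Buna–Gona", "rev": 2},
    "p2": {"title": "Other page", "rev": 5},
}
hash_store = {doc_id: content_hash(doc) for doc_id, doc in source.items()}
index = {
    "p1": {"hash": hash_store["p1"]},
    "p2": {"hash": content_hash({"title": "Other page", "rev": 4})},
    "ghost": {"hash": content_hash({"title": "Battle of Buna–Gona", "rev": 1})},
}

rng = random.Random(0)
print(sorted(forward_check(source, hash_store, index, 1.0, rng)))
print(sorted(reverse_check(index, hash_store, len(index), rng)))
```

Note that the forward check alone would not catch a ghost like the one in this task, because the current copy of the page is present and correct; it is the reverse check that surfaces the extra entry.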
The output of the test would be an alert and/or a dashboard that breaks down missing and extraneous documents by { indexing date, index (language), cluster, etc. } to facilitate finding the root cause.
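The breakdown itself could be as simple as counting findings per combination of dimensions. The sketch below assumes each finding is a dict with hypothetical keys `kind`, `indexing_date`, `index`, and `cluster`; none of these names come from an existing system.

```python
from collections import Counter


def breakdown(findings, dimensions):
    """Count findings per (kind, *dimensions) combination."""
    counts = Counter()
    for finding in findings:
        key = (finding["kind"],) + tuple(finding[d] for d in dimensions)
        counts[key] += 1
    return counts


# Hypothetical findings, as a checker run might report them.
findings = [
    {"kind": "extraneous", "indexing_date": "2016-06-01", "index": "enwiki", "cluster": "a"},
    {"kind": "extraneous", "indexing_date": "2016-06-01", "index": "enwiki", "cluster": "b"},
    {"kind": "missing", "indexing_date": "2016-06-02", "index": "dewiki", "cluster": "a"},
]

for key, count in sorted(breakdown(findings, ["index", "indexing_date"]).items()):
    print(key, count)
```

Grouping by different dimension lists (for example only `cluster`) gives the other views a dashboard would need without changing the checker.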
This is fixed now, likely as a result of T137113, which was fixing duplicate entries in all sorts of Elastic-backed systems.