
Search on en.wikipedia returns a page twice (one up to date, the other one out of date)
Closed, Resolved; Public

Description

Search for "the well ordered withdrawal" returns 3 pages. Battle of Buna–Gona is listed twice. One is up to date, but the other is from a month ago. Can we lose the old ghost?

Event Timeline

Chris_the_speller assigned this task to Aklapper.
Chris_the_speller raised the priority of this task from to Needs Triage.
Chris_the_speller updated the task description.
jeremyb set Security to None.
jeremyb subscribed.

(refraining from purging, etc. in case someone wants to dump the index first)

@Chris_the_speller: Not sure why you assigned this to me?

Also wondering if there are more examples or so far only this one?

@Aklapper – I cloned this from another CirrusSearch bug, and your name was on it. If you have moved on to other work, thanks for your help in the past. This is the only case of this bug at the moment. If I see others, I'll report them.

Deskana subscribed.

Would it be hard to make a test system to verify the contents of the index?

I'd envision a two-part background service that daily (1) takes a random X% (where X is roughly 0.1 to 1) of the documents and checks that they made it into the index and are still there, and (2) does the reverse by pulling random ES documents and verifying that they should be in the index.

One way to do this is to calculate a hash of each document at the start of your indexing pipeline and store the hash both in an ES field in the document and in an external store. You can then query ES for the hash to ensure the right version exists. The reverse direction can query ES with a random scoring function and compare the retrieved hashes with those in the external store.
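A minimal sketch of the two-direction hash check described above. It assumes plain in-memory structures stand in for the external store (`source`) and the ES index (`index`); the function names and record fields are hypothetical and not part of CirrusSearch or Elasticsearch.

```python
import hashlib
import random


def content_hash(text):
    """Hash computed once, at the start of the (hypothetical) indexing pipeline."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_forward(source, index, fraction, rng):
    """Sample source documents; report ones missing from or stale in the index."""
    if not source:
        return []
    hashes_by_id = {}
    for entry in index:
        hashes_by_id.setdefault(entry["doc_id"], set()).add(entry["hash"])
    sample_size = min(len(source), max(1, round(len(source) * fraction)))
    problems = []
    for doc_id in rng.sample(sorted(source), sample_size):
        indexed = hashes_by_id.get(doc_id)
        if indexed is None:
            problems.append((doc_id, "missing"))
        elif content_hash(source[doc_id]) not in indexed:
            problems.append((doc_id, "stale"))
    return problems


def check_reverse(source, index, sample_size, rng):
    """Sample index entries; report ones that no longer match the source."""
    problems = []
    for entry in rng.sample(index, min(sample_size, len(index))):
        doc_id = entry["doc_id"]
        if doc_id not in source:
            problems.append((doc_id, "extraneous"))
        elif entry["hash"] != content_hash(source[doc_id]):
            problems.append((doc_id, "outdated"))
    return problems


# Made-up data: "page-a" is indexed twice, once with an old revision.
source = {"page-a": "current text", "page-b": "other text"}
index = [
    {"doc_id": "page-a", "hash": content_hash("current text")},
    {"doc_id": "page-a", "hash": content_hash("old text")},
    {"doc_id": "page-b", "hash": content_hash("other text")},
]
rng = random.Random(0)
forward_problems = check_forward(source, index, 1.0, rng)
reverse_problems = check_reverse(source, index, len(index), rng)
```

In this toy data the forward check finds nothing, because an up-to-date copy of every page exists; only the reverse check flags the leftover old entry, which is why both directions are useful.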

The output of the test would be an alert and/or dashboard that breaks down missing and extraneous documents by { indexing date, index (language), cluster, etc. } to facilitate finding the root cause.
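The breakdown could be computed along these lines. This is only a sketch: the `findings` records, their field names, and the `breakdown` helper are assumptions made for illustration, and the data is invented.

```python
from collections import Counter


def breakdown(findings, dimensions):
    """Count findings per (kind, *dimensions) so clusters of failures stand out."""
    counts = Counter()
    for finding in findings:
        key = (finding["kind"],) + tuple(finding[d] for d in dimensions)
        counts[key] += 1
    return counts


# Made-up findings for illustration only.
findings = [
    {"kind": "extraneous", "index": "index-a", "cluster": "cluster-1"},
    {"kind": "extraneous", "index": "index-a", "cluster": "cluster-1"},
    {"kind": "missing", "index": "index-b", "cluster": "cluster-2"},
]
by_index_and_cluster = breakdown(findings, ("index", "cluster"))
```

A dashboard or alert threshold would then read from the resulting counts; a spike under a single key points at where to look for the root cause.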

The above two messages are on the wrong ticket ... fixing

Deskana claimed this task.

This is fixed now, likely as a result of T137113 which was fixing duplicate entries in all sorts of Elastic-backed systems.