
Search on en.wikipedia returns a page twice (one up to date, the other one out of date)
Closed, Resolved · Public

Description

Search for "the well ordered withdrawal" returns 3 pages. Battle of Buna–Gona is listed twice. One is up to date, but the other is from a month ago. Can we lose the old ghost?

Event Timeline

Chris_the_speller assigned this task to Aklapper.
Chris_the_speller raised the priority of this task from to Needs Triage.
Chris_the_speller updated the task description.
Restricted Application added a project: Discovery. · Jul 19 2015, 4:42 PM
jeremyb updated the task description. · Jul 19 2015, 4:53 PM
jeremyb set Security to None.
jeremyb added a subscriber: jeremyb.

(refraining from purging, etc. in case someone wants to dump the index first)

Aklapper removed Aklapper as the assignee of this task. · Jul 26 2015, 12:56 AM

@Chris_the_speller: Not sure why you assigned this to me?

Also wondering if there are more examples or so far only this one?

Ironholds moved this task from Needs triage to Search on the Discovery board. · Aug 4 2015, 8:18 AM

@Aklapper – I cloned this from another CirrusSearch bug, and your name was on it. If you have moved on to other work, thanks for your help in the past. This is the only case of this bug at the moment. If I see others, I'll report them.

Deskana triaged this task as Low priority. · Dec 23 2015, 11:43 PM
Deskana added a subscriber: Deskana.

Would it be hard to make a test system to verify the contents of the index?

I'd envision a two-part background service that daily (1) takes a random X% (where X ~ 0.1 to 1) of the source documents and checks that they made it into the index and are still there, and (2) does the reverse by pulling random ES documents and verifying that they should be in the index.

One way to do this is to calculate a hash of each document at the start of your indexing pipeline and store that hash both in an ES field on the document and in an external store. You can then query ES for the hash to ensure the right version exists. The reverse direction can query ES with a random scoring function and compare the retrieved hashes against the external store.
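To make the two directions concrete, here is a minimal sketch in Python. It is not how CirrusSearch works: plain dicts stand in for both the source of truth and the ES index, and the names `content_hash`, `check_forward` and `check_reverse` are hypothetical. A real version would replace the dict lookups with ES queries.

```python
import hashlib
import json
import random


def content_hash(doc: dict) -> str:
    """Stable SHA-256 of a document, independent of key order."""
    payload = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_forward(source: dict, index: dict, fraction: float,
                  rng: random.Random) -> dict:
    """Sample source documents and confirm the index holds the same hash.

    `source` maps doc id -> document (assumed layout).
    `index` maps doc id -> hash stored at indexing time (assumed layout).
    Returns doc id -> "missing" or "stale" for each sampled mismatch.
    """
    k = min(len(source), max(1, round(len(source) * fraction))) if source else 0
    problems = {}
    for doc_id in rng.sample(sorted(source), k):
        stored = index.get(doc_id)
        if stored is None:
            problems[doc_id] = "missing"
        elif stored != content_hash(source[doc_id]):
            problems[doc_id] = "stale"
    return problems


def check_reverse(source: dict, index: dict, sample_size: int,
                  rng: random.Random) -> dict:
    """Sample indexed entries and flag those the source does not back."""
    k = min(sample_size, len(index))
    problems = {}
    for doc_id in rng.sample(sorted(index), k):
        if doc_id not in source:
            problems[doc_id] = "extraneous"
        elif index[doc_id] != content_hash(source[doc_id]):
            problems[doc_id] = "stale"
    return problems
```

In this model, an old ghost like the one reported here would show up in the reverse check as either "extraneous" or "stale", depending on whether its id still matches a live page.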

The output of the test would be an alert and/or dashboard that breaks down missing and extraneous documents by { indexing date, index (language), cluster, etc } to facilitate finding the root cause.
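A rough sketch of that breakdown, again in Python and only as an illustration: the `Discrepancy` record, its field names, and the alert threshold are assumptions, and the dates, index names and cluster name in the usage note are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Discrepancy:
    kind: str        # "missing" or "extraneous"
    indexed_on: str  # indexing date of the affected document
    index_name: str  # which index (language/wiki) it belongs to
    cluster: str     # which ES cluster reported it


def breakdown(found: Iterable[Discrepancy], *dims: str) -> Counter:
    """Count discrepancies per combination of the requested dimensions."""
    return Counter(tuple(getattr(d, dim) for dim in dims) for d in found)


def should_alert(found_count: int, sampled: int, threshold: float) -> bool:
    """Alert when the discrepancy rate in the sample exceeds the threshold."""
    return sampled > 0 and found_count / sampled > threshold
```

For example, `breakdown(found, "kind", "indexed_on")` groups the problems by type and indexing date, which would show whether the ghosts cluster around a particular day's indexing run.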

Restricted Application added a project: Discovery-Search. · Apr 12 2016, 10:37 PM
EBernhardson added a subscriber: EBernhardson. · Edited · Apr 15 2016, 2:00 AM

(wrong ticket)

EBernhardson added a subscriber: dcausse. · Edited · Apr 15 2016, 2:20 AM

(wrong ticket)

The above two messages are on the wrong ticket ... fixing

Deskana closed this task as Resolved. · Aug 6 2016, 3:52 AM
Deskana claimed this task.

This is fixed now, likely as a result of T137113, which fixed duplicate entries in all sorts of Elastic-backed systems.