
Reindex codfw search cluster for the bm25 AB test
Closed, Resolved (Public)

Event Timeline

debt triaged this task as Medium priority.Aug 25 2016, 10:09 PM
debt moved this task from needs triage to Current work on the Discovery-Search board.
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Having some issues here, a number of documents end up erroring with messages like:

    [7] Caught an error retrying as singles.  Backing off for 255 and retrying.  Error type is 'Elastica\Exception\Bulk\ResponseException' and message is:  unknown: Error in one or more bulk request actions:

index: /enwiki_content_1472503971/page/38061643 caused failed to parse

Enough failed that the final check that document counts are within 5% of each other failed. It also turns out we don't log anything about failures in the reindex process, so the above is all the info I have. To get the test moving I've adjusted the script on terbium to allow the counts to be out of whack, and will run the saneitizer to try to backfill any missing pages.

Will do some more direct debugging from mwrepl to try to figure out what is really going wrong here...

@dcausse Best I can tell there is a problem with id_hash_mod, but I looked over the code and the plugin is so simple I can't figure out how it could be broken.
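For context, the filter is expected to implement a simple deterministic partitioning: a document matches bucket `match` iff hash(id) % mod == match. A minimal sketch of that contract in Python (zlib.crc32 is a stand-in for the plugin's hash function, which is Java; the key property is determinism across processes and servers):

```python
import zlib

def id_hash_mod_match(doc_id: str, mod: int, match: int) -> bool:
    """A document belongs to bucket `match` iff hash(id) % mod == match.

    zlib.crc32 stands in for the real plugin's hash; what matters is
    that it returns the same value on every server, every time.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % mod == match

# With a deterministic hash, the mod buckets partition the id space:
# every id lands in exactly one bucket, so buckets never overlap.
ids = [str(i) for i in range(1000)]
partitions = [[d for d in ids if id_hash_mod_match(d, 10, m)] for m in range(10)]
assert sum(len(p) for p in partitions) == len(ids)  # disjoint and complete
```

The test below checks exactly the property the buggy plugin violates: no id should show up in more than one bucket.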

Run the plugin with a mod of 10 for values 0..9:

mkdir ~/id_hash_mod_test
cd ~/id_hash_mod_test
for i in {0..9}; do
  curl -XGET search.svc.codfw.wmnet:9200/enwiki_content/_search -d '{
    "_source": ["id"], 
    "from": 0, "size": 5000, 
    "query":{
      "bool":{
        "filter":[{
          "id_hash_mod": {
            "mod": 10,
            "match": '$i'
          }
        }]
      }
    }
  }' | jq -r '.hits.hits | map(._id) | join("\n")' > $i
done

Use grep to look for ids returned by mod 0 in the other mods. -F means fixed (exact) string matching, -x selects only matches that span a full line, and -f reads patterns from a file, one pattern per line.

grep -Fx -f 0 {1..9} | cut -d : -f 1 | sort | uniq -c

This results in:

240 1
232 5
232 6
215 8
230 9

Which means ids returned as part of mod 0 were also returned as part of mods 1, 5, 6, 8 and 9.
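The same overlap check can be expressed with set intersections, which makes the expected result explicit. A small sketch (the bucket contents are synthetic; with a correct id_hash_mod every count would be 0):

```python
def overlap_counts(buckets: dict[int, set[str]]) -> dict[int, int]:
    """For each bucket i > 0, count the ids it shares with bucket 0.

    Correct id_hash_mod buckets are disjoint, so every count should be 0;
    any nonzero count reproduces the duplicates the grep pipeline found.
    """
    return {i: len(buckets[0] & ids) for i, ids in buckets.items() if i != 0}

# Synthetic example: bucket 1 wrongly shares two ids with bucket 0.
buckets = {0: {"a", "b", "c"}, 1: {"b", "c", "d"}, 2: {"e"}}
print(overlap_counts(buckets))  # → {1: 2, 2: 0}
```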

FWIW the script, for which the plugin is supposed to be an equivalent, appears to work. To test, I put the script into /etc/elasticsearch/script/id_hash_mod.groovy on elastic20{01..24}.codfw.wmnet and ran the above test, but with this query body:

{
    "_source": ["id"], 
    "from": 0, "size": 5000, 
    "query":{
      "bool":{
        "filter":[{
          "script": {
            "script": {
              "lang": "groovy",
              "file": "id_hash_mod",
              "params": {
                "mod": 10,
                "match": '$i'
              }
            }
          }
        }]
      }
    }
  }

This runs noticeably slower, but over the test case with 5k results and 10 queries there were no duplicates in any of the returned sets.

I've spent a couple of hours on id_hash_mod and failed to spot an issue.
I'll have another look, because so far I was only looking for cases where id_hash_mod would miss some ids.

Also, looking at the Reindex code, I think a single doc failure can cause the whole thread to fail.
Docs with invalid coordinates will always fail even if we retry them, and because we catch ExceptionInterface outside iterateOverScroll, the thread will stop, leaving all the remaining ids in its mod bucket unindexed.
Given that id_hash_mod returns duplicates, the chance that an invalid doc appears in multiple threads is increased.
Do you have the reindex logs? We could check whether the threads went silent after the failure.
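A sketch of the failure mode described above, and of the per-document handling that would avoid it. Names and structure here are hypothetical (the real code is PHP in CirrusSearch); only the control-flow point is taken from the discussion:

```python
def reindex_bucket_fragile(batches):
    """Mirrors the reported bug: the try/except wraps the whole scroll,
    so the first poison doc aborts the thread and every remaining id
    in this mod bucket is left unindexed."""
    indexed = []
    try:
        for batch in batches:
            for doc in batch:
                if doc.get("bad_coords"):
                    raise ValueError("failed to parse")
                indexed.append(doc["id"])
    except ValueError:
        pass  # thread silently stops here; no logging of the failure
    return indexed

def reindex_bucket_robust(batches):
    """Catching per document instead means a poison doc only loses
    itself, and the failed ids can be logged for later backfill."""
    indexed, failed = [], []
    for batch in batches:
        for doc in batch:
            try:
                if doc.get("bad_coords"):
                    raise ValueError("failed to parse")
                indexed.append(doc["id"])
            except ValueError:
                failed.append(doc["id"])
    return indexed, failed
```

With one bad doc in the middle of the stream, the fragile version drops everything after it, while the robust version loses only the bad doc.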

I think I've got it: the hash is now computed with murmurhash, and its seed value is set to the current time when the JVM starts. So using the hashCode over multiple servers can return a different hash value depending on the server and on when the hashCode is computed...
Will fix by using a fixed seed.
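Python has an exactly analogous trap that illustrates the bug: the built-in hash() for strings is salted per process (controlled by PYTHONHASHSEED), just as the murmur seed here was taken from JVM start time, so two processes can disagree on which bucket an id belongs to. A fixed-seed hash removes that dependence. A sketch (sha1 stands in for seeded murmurhash; the point is the fixed seed, not the specific hash):

```python
import hashlib

def unstable_bucket(doc_id: str, mod: int) -> int:
    """Like the buggy plugin: Python salts str hash() per process
    (analogous to a murmur seed derived from JVM start time), so two
    servers can put the same id in different buckets."""
    return hash(doc_id) % mod

def stable_bucket(doc_id: str, mod: int) -> int:
    """The fix: a hash with a fixed seed is identical on every server
    and across restarts, so the mod buckets truly partition the ids."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % mod

# stable_bucket gives the same answer in every process, on every host;
# unstable_bucket can change between runs unless PYTHONHASHSEED is fixed.
assert stable_bucket("38061643", 10) == stable_bucket("38061643", 10)
```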

Getting pretty close to finishing the saneitizer run to fix up codfw; I think we are good to start the test today in ~1hr, with the SF morning SWAT.

Current counts:

index   | eqiad    | codfw    | diff
content | 5227336  | 5224478  | 2858 (0.05%)
general | 25808633 | 25582204 | 226429 (0.87%)

The general index is still being fixed and should get even closer over the next hour.