Description

19:15 < roy649_> I don't know if this is significant, but I'm getting "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later." on some searches on https://checkuser.wikimedia.org/
19:18 < roy649_> OK, thanks. It's intermittent and looks like it's more common on searches with large numbers of namespaces.
19:24 < tn> I'm seeing `index_not_found_exception: no such index`, with the marked index as `checkuserwiki,chi:commonswiki_file` fwiw, seems ^

Related Objects

Event Timeline
In similar tickets like T240778 I see comments about how search was fixed for individual wikis by reindexing (@EBernhardson, could this also be the case here? Does checkuser need reindexing?). However, we also see the "index_not_found_exception" as if there is no index at all, which is why I named the ticket this way. The relevant log entry:
"host": "mw1325", "hitsOffset": 0, "normalized_message": "Search backend error during {queryType} search for '{query}' after {tookMs}: {error_message}", "syntax": [ "full_text", "full_text_simple_match", "simple_bag_of_words" ], "limit": 21, "type": "mediawiki", "index": "checkuserwiki,chi:commonswiki_file", "server": "checkuser.wikimedia.org", "error_message": "index_not_found_exception: no such index", "reqId": "56ca2991-f12a-4bef-b8a6-fea6df1c99e9",
This doesn't look to be related to reindexing; rather, something is out of sync inside the elasticsearch clusters with regard to cross-cluster functionality.
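One way to check the cross-cluster state a given node believes it has is Elasticsearch's _remote/info API, which reports per-alias connection status; a minimal probe against one of the hosts queried below (the choice of elastic1043 and the use of jq are illustrative):

# Ask a single node which remote clusters it thinks it is connected to,
# and whether skip_unavailable is set for each alias.
curl -sk 'https://elastic1043.eqiad.wmnet:9643/_remote/info' | jq .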
Querying some of the instances running on port 9643 works every time:
ebernhardson@mwdebug1002:~$ curl -k 'https://elastic1045.eqiad.wmnet:9643/checkuserwiki,chi:commonswiki_file/page/_search?q=example&size=0' && echo
{"took":19,"timed_out":false,"...}
And fails against several other instances every time:
ebernhardson@mwdebug1002:~$ curl -k 'https://elastic1043.eqiad.wmnet:9643/checkuserwiki,chi:commonswiki_file/page/_search?q=example&size=0' && echo
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"chi:commonswiki_file","index_uuid":"_na_","index":"chi:commonswiki_file"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"chi:commonswiki_file","index_uuid":"_na_","index":"chi:commonswiki_file"},"status":404}
These are configured to ignore errors in cross-cluster communication (if the remote cluster is inaccessible), which means elastic thinks it is successfully talking to the remote cluster, and the remote cluster is answering that it doesn't have the requested indices (while it clearly does).
"cluster" : { "remote" : { "chi" : { "skip_unavailable" : "true", "seeds" : [ "elastic1036.eqiad.wmnet:9300", "elastic1030.eqiad.wmnet:9300", "elastic1040.eqiad.wmnet:9300" ] }, "omega" : { "skip_unavailable" : "true", "seeds" : [ "elastic1034.eqiad.wmnet:9500", "elastic1040.eqiad.wmnet:9500", "elastic1038.eqiad.wmnet:9500" ] } } }
I don't have great answers, as this looks to be some sort of issue with state going out of sync. Running a restart across the cluster might get things to re-read the appropriate bits, but that's unclear.
With the error consistently reproducible when talking to individual machines, I set up a script to check all the different directions in which we do cluster-to-cluster communication. There were 5 instances (1047, 1046, 1044, 1042, 1035) showing the issue, all in the psi->chi direction. I restarted each of the instances; logging shows the errors have stopped.
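The script was essentially a loop over every (source node, remote alias) pair, flagging any node whose cross-cluster query fails; a minimal reconstruction in shell, with the psi->chi direction and port 9643 taken from above and the host list as an assumption rather than the full production inventory:

#!/bin/bash
# Probe each psi instance (port 9643) for a chi-owned index via the
# chi: cross-cluster alias; anything other than HTTP 200 is flagged.
for h in elastic1035 elastic1042 elastic1043 elastic1044 elastic1045 elastic1046 elastic1047; do
    status=$(curl -sk -o /dev/null -w '%{http_code}' \
        "https://${h}.eqiad.wmnet:9643/chi:commonswiki_file/page/_search?q=example&size=0")
    [ "$status" = "200" ] || echo "${h}: psi->chi broken (HTTP ${status})"
done

Restarting a flagged instance forces it to re-establish its remote cluster connections, which matches the fix described above.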
Looking into the logging, it seems this issue has been occurring at a low level for some time, at a rate of around 400 events per day. The events start on 9-14, but that isn't as telling as we would hope, since primary traffic was switched from codfw to eqiad on that day.
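A rough way to get that per-day rate out of logstash is a date_histogram aggregation along these lines; the endpoint, logstash-* index pattern, and field names here are assumptions based on the log excerpt above:

# Count matching search backend errors per day (sketch; the logstash
# host and index pattern are assumptions, not production values).
curl -s 'https://logstash-host:9200/logstash-*/_search' \
    -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "query": { "match_phrase": { "error_message": "index_not_found_exception: no such index" } },
  "aggs": { "per_day": { "date_histogram": { "field": "@timestamp", "interval": "day" } } }
}'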
Since we're migrating to a newer Elasticsearch version, this does not seem worth investigating any further. Let's close.