
missing search index on checkuserwiki
Closed, ResolvedPublic

Description

19:15 < roy649_> I don't know if this is significant, but I'm getting "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later." on some searches on https://checkuser.wikimedia.org/


19:18 < roy649_> OK, thanks.  It's intermittent and looks like it's more common on searches with large numbers of namespaces.


19:24 < tn> I'm seeing `index_not_found_exception: no such index`, with the marked index as `checkuserwiki,chi:commonswiki_file` fwiw, seems ^

Event Timeline

Dzahn renamed this task from missing search index on check user wiki to missing search index on checkuserwiki.Nov 5 2021, 7:37 PM
Dzahn edited projects, added Discovery-Search; removed Discovery-ARCHIVED.

In similar tickets like T240778 I see comments on how search was fixed for individual wikis by reindexing. @EBernhardson, could this also be the case here; does checkuserwiki need reindexing? Though we also see the "index_not_found_exception" as if there is no index at all, which is why I named the ticket this way.
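
For reference, a per-wiki reindex is normally driven by the CirrusSearch maintenance script. A minimal sketch, assuming the standard mwscript wrapper on a maintenance host (exact flags may differ):

# Hedged sketch: rebuild the index configuration for checkuserwiki and reindex in place.
# Assumes the usual mwscript wrapper; --reindexAndRemoveOk / --indexIdentifier per CirrusSearch docs.
mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php \
    --wiki=checkuserwiki \
    --reindexAndRemoveOk \
    --indexIdentifier=now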

Ref Oo2Z8XwBfkHq3kAMpyfs

"host": "mw1325",
"hitsOffset": 0,
"normalized_message": "Search backend error during {queryType} search for '{query}' after {tookMs}: {error_message}",
"syntax": [
  "full_text",
  "full_text_simple_match",
  "simple_bag_of_words"
],
"limit": 21,
"type": "mediawiki",
"index": "checkuserwiki,chi:commonswiki_file",
"server": "checkuser.wikimedia.org",
"error_message": "index_not_found_exception: no such index",
"reqId": "56ca2991-f12a-4bef-b8a6-fea6df1c99e9",

This doesn't look to be anything related to reindexing; rather, something is out of sync inside the Elasticsearch clusters with regard to cross-cluster functionality.

Querying some of the instances running on port 9643 works every time:

ebernhardson@mwdebug1002:~$ curl -k 'https://elastic1045.eqiad.wmnet:9643/checkuserwiki,chi:commonswiki_file/page/_search?q=example&size=0' && echo
{"took":19,"timed_out":false,"...}

And it fails against several other instances every time:

ebernhardson@mwdebug1002:~$ curl -k 'https://elastic1043.eqiad.wmnet:9643/checkuserwiki,chi:commonswiki_file/page/_search?q=example&size=0' && echo
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"chi:commonswiki_file","index_uuid":"_na_","index":"chi:commonswiki_file"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"chi:commonswiki_file","index_uuid":"_na_","index":"chi:commonswiki_file"},"status":404}

These are configured to ignore errors in cross-cluster communication if the remote cluster is inaccessible (skip_unavailable), which means Elasticsearch thinks it is talking to the remote cluster, and that cluster is reporting that it doesn't have the requested indices (while it clearly does).

"cluster" : {
  "remote" : {
    "chi" : {
      "skip_unavailable" : "true",
      "seeds" : [
        "elastic1036.eqiad.wmnet:9300",
        "elastic1030.eqiad.wmnet:9300",
        "elastic1040.eqiad.wmnet:9300"
      ]
    },
    "omega" : {
      "skip_unavailable" : "true",
      "seeds" : [
        "elastic1034.eqiad.wmnet:9500",
        "elastic1040.eqiad.wmnet:9500",
        "elastic1038.eqiad.wmnet:9500"
      ]
    }
  }
}
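
For reference, that remote-cluster block can be pulled from any instance via the cluster settings API; a hedged sketch (the filter_path, and the assumption that the seeds show up via include_defaults, are mine):

# Hedged sketch: dump the remote-cluster seed configuration as one of the psi instances sees it.
# include_defaults covers seeds set in elasticsearch.yml; filter_path just trims the output.
curl -k 'https://elastic1043.eqiad.wmnet:9643/_cluster/settings?include_defaults=true&filter_path=**.cluster.remote&pretty'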

I don't have great answers. This looks to be some sort of issue with state going out of sync; running a restart across the cluster might get things to re-read the appropriate bits, but that's unclear.
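
If it comes to that, restarting one instance at a time would look roughly like this; a hedged sketch (the systemd unit name is hypothetical, the multi-instance naming on these hosts may differ):

# Hedged sketch: restart a single psi instance so it re-reads its cross-cluster state.
# The unit name below is hypothetical; check systemctl list-units on the host for the real one.
sudo systemctl restart 'elasticsearch_6@production-search-psi-eqiad.service'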

With the error consistently reproducible when talking to individual machines, I set up a script to check all the different directions in which we do cluster-to-cluster communication. There were 5 instances (1047, 1046, 1044, 1042, 1035) showing the issue, all in the psi->chi direction. I restarted each of them, and logging shows the errors have stopped.
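
The check amounts to replaying the cross-cluster query against every instance and flagging non-200 responses; a hedged sketch (the host range is illustrative and only the psi->chi direction is shown):

#!/bin/bash
# Hedged sketch: probe each psi instance for cross-cluster resolution of chi:commonswiki_file.
# Host range is illustrative; the real check covered every cluster-to-cluster direction.
for host in elastic10{30..52}.eqiad.wmnet; do
  status=$(curl -sk -o /dev/null -w '%{http_code}' \
    "https://${host}:9643/checkuserwiki,chi:commonswiki_file/page/_search?q=example&size=0")
  [ "$status" != "200" ] && echo "${host}: HTTP ${status} (psi->chi lookup failing)"
done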

Looking into the logging, it seems this issue has been occurring at a low level for some time, at a rate of around 400 events per day. The events start on September 14, but that isn't as telling as we would hope, since primary traffic was switched from codfw to eqiad on that day.
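
The daily rate can be pulled out of logstash with a date histogram over the error message; a hedged sketch (the endpoint and index pattern are placeholders, and calendar_interval assumes Elasticsearch 7+ on the logging cluster):

# Hedged sketch: count the search-backend index_not_found errors per day.
# Endpoint and index pattern are placeholders; field names follow the log excerpt above.
curl -s 'https://logstash.example.org:9200/logstash-*/_search' \
  -H 'Content-Type: application/json' -d '{
    "size": 0,
    "query": { "match_phrase": { "error_message": "index_not_found_exception: no such index" } },
    "aggs": { "per_day": { "date_histogram": { "field": "@timestamp", "calendar_interval": "day" } } }
  }'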

Adding DC switchover so that it serves as a reminder of this issue during future switchbacks.

@EBernhardson thanks for mitigating it quickly on a Friday afternoon

Gehel claimed this task.
Gehel subscribed.

Since we're migrating to a newer Elasticsearch version, this does not seem worth investigating anymore. Let's close.