Description

19:15 < roy649_> I don't know if this is significant, but I'm getting "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later." on some searches on https://checkuser.wikimedia.org/
19:18 < roy649_> OK, thanks. It's intermittent and looks like it's more common on searches with large numbers of namespaces.
19:24 < tn> I'm seeing `index_not_found_exception: no such index`, with the marked index as `checkuserwiki,chi:commonswiki_file` fwiw, seems ^

Related Objects

Event Timeline
In similar tickets like T240778 I see comments about how search was fixed for individual wikis by reindexing (@EBernhardson, could this also be the case here? Does checkuser need reindexing?). However, we also see the "index_not_found_exception" as if there is no index at all, which is why I named the ticket this way. The relevant log entry:
"host": "mw1325", "hitsOffset": 0, "normalized_message": "Search backend error during {queryType} search for '{query}' after {tookMs}: {error_message}", "syntax": [ "full_text", "full_text_simple_match", "simple_bag_of_words" ], "limit": 21, "type": "mediawiki", "index": "checkuserwiki,chi:commonswiki_file", "server": "checkuser.wikimedia.org", "error_message": "index_not_found_exception: no such index", "reqId": "56ca2991-f12a-4bef-b8a6-fea6df1c99e9",
This doesn't look to be related to reindexing; rather, something is out of sync inside the elasticsearch clusters with regard to cross-cluster functionality.
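One way to check the cross-cluster state a given node believes it has is Elasticsearch's _remote/info API, which reports per-alias connection status; a minimal probe against one of the hosts queried below (the choice of elastic1043 and the use of jq are illustrative):

# Ask a single node which remote clusters it thinks it is connected to,
# and whether skip_unavailable is set for each alias.
curl -sk 'https://elastic1043.eqiad.wmnet:9643/_remote/info' | jq .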
Querying some of the instances running on port 9643 works every time:
ebernhardson@mwdebug1002:~$ curl -k 'https://elastic1045.eqiad.wmnet:9643/checkuserwiki,chi:commonswiki_file/page/_search?q=example&size=0' && echo
{"took":19,"timed_out":false,"...}
And fails against several other instances every time:
ebernhardson@mwdebug1002:~$ curl -k 'https://elastic1043.eqiad.wmnet:9643/checkuserwiki,chi:commonswiki_file/page/_search?q=example&size=0' && echo
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"chi:commonswiki_file","index_uuid":"_na_","index":"chi:commonswiki_file"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"chi:commonswiki_file","index_uuid":"_na_","index":"chi:commonswiki_file"},"status":404}
These are configured to ignore errors in cross-cluster communication (if the remote cluster is inaccessible), which means elastic thinks it is successfully talking to the remote cluster, and the remote cluster is answering that it doesn't have the requested indices (while it clearly does).
"cluster" : { "remote" : { "chi" : { "skip_unavailable" : "true", "seeds" : [ "elastic1036.eqiad.wmnet:9300", "elastic1030.eqiad.wmnet:9300", "elastic1040.eqiad.wmnet:9300" ] }, "omega" : { "skip_unavailable" : "true", "seeds" : [ "elastic1034.eqiad.wmnet:9500", "elastic1040.eqiad.wmnet:9500", "elastic1038.eqiad.wmnet:9500" ] } } }
I don't have great answers, as this looks to be some sort of issue with state going out of sync. Running a restart across the cluster might get things to re-read the appropriate bits, but that's unclear.
With the error consistently reproducible when talking to individual machines, I set up a script to check all the different directions in which we do cluster-to-cluster communication. There were 5 instances (1047, 1046, 1044, 1042, 1035) showing the issue, all in the psi->chi direction. I restarted each of the instances; logging shows the errors have stopped.
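The script was essentially a loop over every (source node, remote alias) pair, flagging any node whose cross-cluster query fails; a minimal reconstruction in shell, with the psi->chi direction and port 9643 taken from above and the host list as an assumption rather than the full production inventory:

#!/bin/bash
# Probe each psi instance (port 9643) for a chi-owned index via the
# chi: cross-cluster alias; anything other than HTTP 200 is flagged.
for h in elastic1035 elastic1042 elastic1043 elastic1044 elastic1045 elastic1046 elastic1047; do
    status=$(curl -sk -o /dev/null -w '%{http_code}' \
        "https://${h}.eqiad.wmnet:9643/chi:commonswiki_file/page/_search?q=example&size=0")
    [ "$status" = "200" ] || echo "${h}: psi->chi broken (HTTP ${status})"
done

Restarting a flagged instance forces it to re-establish its remote cluster connections, which matches the fix described above.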
Looking into the logging, it seems this issue has been occurring at a low level for some time, at a rate of around 400 events per day. The events start on 9-14, but that isn't as telling as we would hope, since primary traffic was switched from codfw to eqiad on that day.
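A rough way to get that per-day rate out of logstash is a date_histogram aggregation along these lines; the endpoint, logstash-* index pattern, and field names here are assumptions based on the log excerpt above:

# Count matching search backend errors per day (sketch; the logstash
# host and index pattern are assumptions, not production values).
curl -s 'https://logstash-host:9200/logstash-*/_search' \
    -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "query": { "match_phrase": { "error_message": "index_not_found_exception: no such index" } },
  "aggs": { "per_day": { "date_histogram": { "field": "@timestamp", "interval": "day" } } }
}'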
Since we're migrating to a newer Elasticsearch version, this does not seem worth investigating any further. Let's close.