
Investigate CirrusSearch eqiad failures
Closed, Resolved, Public

Description

It looks like starting on 2022-06-03 15:00 UTC we saw a sustained level of reported failures in the metrics, at >1000 ops/min, which held steady until improving on 2022-06-17. Investigate what these failures were and why they seemingly went back down without intervention.

Event Timeline

I'm not sure if this is related, but we got an alert for "Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool" at ~17:36 UTC today.

I logged into the host and found that while the production-search-psi-eqiad service was healthy, elasticsearch_6@production-search-eqiad was failing to start. Unfortunately, I didn't find any helpful error messages, so I rebooted via cookbook.

When the server came back up, it still could not start elasticsearch_6@production-search-eqiad, but the service did recover after I manually ran puppet agent.

Poking around in logstash, it seems like there is a significant volume of cross-cluster search errors. These errors are invisible to users; they simply don't get sister search results.

I wrote a quick Python script (P30045) to check all the connections; it reported that eqiad chi (9200) -> omega (9400) was failing.
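
The paste itself isn't reproduced here, but a minimal sketch of that kind of check (an assumption, not the actual P30045) could query the _remote/info API on each local cluster and report whether every configured remote is connected. The localhost endpoints and ports below are assumptions based on the ports mentioned above.

#!/usr/bin/env python3
"""Rough sketch of a cross-cluster connectivity check (not the actual P30045 paste).

Queries the _remote/info API on each local cluster and reports whether every
configured remote is connected. The localhost endpoints and HTTP ports are
assumptions based on the ports mentioned above (chi on 9200, omega on 9400).
"""
import requests

# Assumed local HTTP endpoints: (cluster name, port) on the host being checked.
LOCAL_CLUSTERS = [
    ("chi", 9200),
    ("omega", 9400),
]


def check_remotes(name, port):
    """Print the connection state of every remote configured on this cluster."""
    resp = requests.get(f"http://localhost:{port}/_remote/info", timeout=10)
    resp.raise_for_status()
    for remote, info in resp.json().items():
        state = "OK" if info.get("connected") else "FAILING"
        print(f"{name} ({port}) -> {remote}: {state} "
              f"({info.get('num_nodes_connected', 0)} nodes connected)")


if __name__ == "__main__":
    for cluster, port in LOCAL_CLUSTERS:
        check_remotes(cluster, port)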

It's not entirely clear what fixed things; I did the following a few times in slightly different ways and suddenly it started working.

Deleted the existing cross-cluster configuration. I ran the same procedure using both the old and the new setting names, but based on my review of Elastic's codebase it seems like they rewrite the old name into the new name fairly early when handling incoming settings updates. This only removes the new configuration; the cluster settings API still reports the old values in the old location. Per https://www.elastic.co/guide/en/elasticsearch/reference/6.8/modules-remote-clusters.html

A remote cluster can be deleted from the cluster settings by setting its seeds and optional settings to null:

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_two": { 
          "seeds": null,
          "skip_unavailable": null,
          "transport": {
            "compress": null
          }
        }
      }
    }
  }
}

Then reapply the expected cross-cluster settings:

{
  "persistent": {
    "search": {
      "remote" : {
        "chi" : {
          "seeds" : [
            "elastic1054.eqiad.wmnet:9300",
            "elastic1074.eqiad.wmnet:9300",
            "elastic1081.eqiad.wmnet:9300"
          ]
        },
        "omega" : {
          "seeds" : [
            "elastic1068.eqiad.wmnet:9500",
            "elastic1076.eqiad.wmnet:9500",
            "elastic1057.eqiad.wmnet:9500"
          ]
        },
        "psi" : {
          "seeds" : [
            "elastic1073.eqiad.wmnet:9700",
            "elastic1075.eqiad.wmnet:9700",
            "elastic1083.eqiad.wmnet:9700"
          ]
        }
      }
    }
  }
}
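
For reference, a minimal sketch of how such a settings body might be applied, assuming the chi cluster is reachable on localhost:9200 and showing only the chi -> omega remote; the null-ing body quoted from the Elastic docs above can be applied with the same call.

#!/usr/bin/env python3
"""Sketch of reapplying cross-cluster settings via PUT _cluster/settings.

Assumes the chi cluster is reachable on localhost:9200 (an assumption) and
shows only the chi -> omega remote for brevity.
"""
import json

import requests

CLUSTER_URL = "http://localhost:9200"  # assumed chi HTTP endpoint

settings = {
    "persistent": {
        "search": {
            "remote": {
                "omega": {
                    "seeds": [
                        "elastic1068.eqiad.wmnet:9500",
                        "elastic1076.eqiad.wmnet:9500",
                        "elastic1057.eqiad.wmnet:9500",
                    ]
                }
            }
        }
    }
}

# Apply the settings and print the cluster's acknowledgement response.
resp = requests.put(f"{CLUSTER_URL}/_cluster/settings", json=settings, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))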

Mentioned in SAL (#wikimedia-operations) [2022-06-30T00:49:53Z] <ebernhardson> T310924 Cleared eqiad chi->omega cross cluster settings and reapplied