
Investigate CirrusSearch eqiad failures
Closed, Resolved, Public

Description

It looks like starting on 2022-06-03 15:00 UTC we saw a sustained level of reported failures in the metrics, at >1000 ops/min, which held steady until improving on 2022-06-17. Investigate what these failures were and why they seemingly went back down without intervention.

Event Timeline

I'm not sure if this is related, but we got an alert for "Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool" at ~17:36 UTC today.

I logged into the host and found that while the production-search-psi-eqiad service was healthy, elasticsearch_6@production-search-eqiad was failing to start. Unfortunately, I didn't find any helpful error messages, so I rebooted via cookbook.

When the server came back up, it still could not start elasticsearch_6@production-search-eqiad, but the service did recover after I manually ran puppet agent.

Poking around in logstash, it seems like there is a significant volume of cross-cluster search errors. These errors are invisible to users; they simply don't get sister search results.

I wrote a quick Python script (P30045) to check all the connections; it reported that eqiad chi (9200) -> omega (9400) was failing.
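
The paste itself isn't reproduced here, but a minimal sketch of that kind of check (an assumption, not the actual P30045) could query the _remote/info API on each local cluster and report whether every configured remote is connected. The localhost endpoints and ports below are assumptions based on the ports mentioned above.

#!/usr/bin/env python3
"""Rough sketch of a cross-cluster connectivity check (not the actual P30045 paste).

Queries the _remote/info API on each local cluster and reports whether every
configured remote is connected. The localhost endpoints and HTTP ports are
assumptions based on the ports mentioned above (chi on 9200, omega on 9400).
"""
import requests

# Assumed local HTTP endpoints: (cluster name, port) on the host being checked.
LOCAL_CLUSTERS = [
    ("chi", 9200),
    ("omega", 9400),
]


def check_remotes(name, port):
    """Print the connection state of every remote configured on this cluster."""
    resp = requests.get(f"http://localhost:{port}/_remote/info", timeout=10)
    resp.raise_for_status()
    for remote, info in resp.json().items():
        state = "OK" if info.get("connected") else "FAILING"
        print(f"{name} ({port}) -> {remote}: {state} "
              f"({info.get('num_nodes_connected', 0)} nodes connected)")


if __name__ == "__main__":
    for cluster, port in LOCAL_CLUSTERS:
        check_remotes(cluster, port)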

It's not entirely clear what fixed things; I did the following a few times in slightly different ways and suddenly it started working.

Deleted the existing cross-cluster configuration. I ran the same procedure using both the old and the new setting names, but based on my review of Elastic's codebase it seems like they rewrite the old name into the new name fairly early when handling incoming settings updates. This only removes the new configuration; the cluster settings API still reports the old values in the old location. Per https://www.elastic.co/guide/en/elasticsearch/reference/6.8/modules-remote-clusters.html

A remote cluster can be deleted from the cluster settings by setting its seeds and optional settings to null:

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_two": { 
          "seeds": null,
          "skip_unavailable": null,
          "transport": {
            "compress": null
          }
        }
      }
    }
  }
}

Then reapply the expected cross-cluster settings:

{
  "persistent": {
    "search": {
      "remote" : {
        "chi" : {
          "seeds" : [
            "elastic1054.eqiad.wmnet:9300",
            "elastic1074.eqiad.wmnet:9300",
            "elastic1081.eqiad.wmnet:9300"
          ]
        },
        "omega" : {
          "seeds" : [
            "elastic1068.eqiad.wmnet:9500",
            "elastic1076.eqiad.wmnet:9500",
            "elastic1057.eqiad.wmnet:9500"
          ]
        },
        "psi" : {
          "seeds" : [
            "elastic1073.eqiad.wmnet:9700",
            "elastic1075.eqiad.wmnet:9700",
            "elastic1083.eqiad.wmnet:9700"
          ]
        }
      }
    }
  }
}
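
For reference, a minimal sketch of how such a settings body might be applied, assuming the chi cluster is reachable on localhost:9200 and showing only the chi -> omega remote; the null-ing body quoted from the Elastic docs above can be applied with the same call.

#!/usr/bin/env python3
"""Sketch of reapplying cross-cluster settings via PUT _cluster/settings.

Assumes the chi cluster is reachable on localhost:9200 (an assumption) and
shows only the chi -> omega remote for brevity.
"""
import json

import requests

CLUSTER_URL = "http://localhost:9200"  # assumed chi HTTP endpoint

settings = {
    "persistent": {
        "search": {
            "remote": {
                "omega": {
                    "seeds": [
                        "elastic1068.eqiad.wmnet:9500",
                        "elastic1076.eqiad.wmnet:9500",
                        "elastic1057.eqiad.wmnet:9500",
                    ]
                }
            }
        }
    }
}

# Apply the settings and print the cluster's acknowledgement response.
resp = requests.put(f"{CLUSTER_URL}/_cluster/settings", json=settings, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))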

Mentioned in SAL (#wikimedia-operations) [2022-06-30T00:49:53Z] <ebernhardson> T310924 Cleared eqiad chi->omega cross cluster settings and reapplied