
Enable adaptive replica selection on CirrusSearch Elasticsearch clusters
Closed, Resolved · Public · 3 Estimated Story Points

Description

As a user I would like my search queries to not time out when peak hours overload the infrastructure, so that I can <insert wide variety of workflows supported by search>.

Our alert on 95th percentile elastic response time has been going off more often recently (twice last week, once over the weekend). Per Icinga the alert was critical for 6h 15m in the last 31 days, 4h 21m of that in the last 7. This is almost always a load issue of some sort.

In the most recent incident the elasticsearch percentiles dashboard in Grafana shows, in the search queue graph, that the search queue went up to 1k and then we started rejecting requests. Since the single-node queue depth is 1k, this suggests a single struggling node. The cluster overview dashboard in the same time range shows elastic1046 hit ~91% CPU utilization and stayed flat for the next 2 hours. This is not the first time we've seen such an issue; elasticsearch is not resilient to a single node becoming overloaded, even though it has multiple copies of the data.

To help address this, elasticsearch added adaptive replica selection in 6.1; it became the cluster-wide default in 7.0. We should evaluate whether enabling this on our clusters would help avoid the hotspotting issues we've seen recently.

Event Timeline

CBogen set the point value for this task to 3. Aug 17 2020, 5:25 PM

Reviewed elasticsearch: nothing relevant mentioned in the changelogs, and nothing substantial turned up in the git logs of the related stats and replica-ranking code between v6.5.4 and v7.9.1.

Applied to search.svc.eqiad.wmnet:9[246]43/_cluster/settings, which is currently the inactive cluster and will only serve indexing and mjolnir msearch requests.

{"transient":{"cluster.routing.use_adaptive_replica_selection": true}}'

Mentioned in SAL (#wikimedia-operations) [2020-09-22T20:46:51Z] <ebernhardson> T259539 enabled adaptive replica selection on elasticsearch at search.svc.eqiad.wmnet:9[246]43

Change 640274 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[operations/puppet@production] elastic: Turn on adaptive replica selection in elastic 6

https://gerrit.wikimedia.org/r/640274

We've now been running with this in the datacenter that was first idle and then active, and nothing out of the ordinary seems to have occurred from setting this. Determining whether it was helpful will be difficult, though. We have since found the root cause of the earlier latency spikes, so it isn't easy to review the efficacy of mitigations for how the cluster behaves when some nodes are struggling.

I've gone ahead and applied the same settings to codfw and submitted a puppet patch to make this the default (but it's mostly a no-op, since it was already applied transiently).

Change 640274 merged by Ryan Kemper:
[operations/puppet@production] elastic: Turn on adaptive replica selection in elastic 6

https://gerrit.wikimedia.org/r/640274

Since the changes have already been applied transiently, the only post-deploy step is to make sure it's applied persistently following the run of the puppet agent.

ryankemper@elastic2040:~$ curl -X GET -s -k https://localhost:9243/_cluster/settings | jq .
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "node_concurrent_recoveries": "4",
          "disk": {
            "watermark": {
              "low": "75%",
              "high": "80%"
            }
          },
          "enable": "all"
        }
      }
    },
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "80mb"
      }
    },
    "search": {
      "remote": {
        "omega": {
          "seeds": [
            "elastic2042.codfw.wmnet:9500",
            "elastic2047.codfw.wmnet:9500",
            "elastic2038.codfw.wmnet:9500"
          ]
        },
        "psi": {
          "seeds": [
            "elastic2027.codfw.wmnet:9700",
            "elastic2029.codfw.wmnet:9700",
            "elastic2048.codfw.wmnet:9700"
          ]
        }
      }
    }
  },
  "transient": {
    "action": {
      "auto_create_index": "+apifeatureusage-*,+glent_*,-*"
    },
    "cluster": {
      "routing": {
        "use_adaptive_replica_selection": "true",
        "allocation": {
          "include": {
            "_ip": ""
          },
          "disk": {
            "watermark": {
              "low": "75%",
              "high": "80%"
            }
          },
          "exclude": {
            "_name": "",
            "_ip": ""
          },
          "enable": "all"
        }
      }
    },
    "logger": {
      "org": {
        "elasticsearch": {
          "common": {
            "logging": {
              "DeprecationLogger": "ERROR"
            }
          },
          "deprecation": {
            "common": {
              "ParseField": "ERROR"
            },
            "index": {
              "similarity": {
                "SimilarityService": "ERROR"
              },
              "query": {
                "functionscore": {
                  "ScoreFunctionBuilder": "ERROR"
                }
              }
            },
            "search": {
              "sort": {
                "GeoDistanceSortBuilder": "ERROR"
              }
            }
          },
          "index": {
            "engine": {
              "Engine": "INFO",
              "ElasticsearchConcurrentMergeScheduler": "TRACE"
            }
          }
        }
      }
    }
  }
}

Hmm, I'll need to check if the settings in puppet are just applied once when an elasticsearch cluster is initially provisioned, since I still don't see adaptive replica selection in the persistent settings.

See cluster update settings for a little more detail, but the short of it is that the cluster settings API doesn't return values from elasticsearch.yml. The difference between transient and persistent here is that transient takes precedence over persistent, and transient values won't survive a full-cluster shutdown.
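
To make that concrete, promoting the setting from transient to persistent is just a matter of re-issuing it under the persistent key (an illustrative sketch, not a command that was actually run):

curl -s -k -X PUT 'https://localhost:9243/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent":{"cluster.routing.use_adaptive_replica_selection": true}}'

Since transient takes precedence, the existing transient value could also be cleared by setting it to null in the same API, though leaving it in place is harmless here because both values agree.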

There really isn't a rhyme or reason to when we use which. The only times we have done full-cluster shutdowns are for major version changes, but those can be managed without full shutdowns these days, meaning we don't expect transient values to go away. We could ponder and document when each is appropriate (in some other forum than this ticket, I imagine :)

For the exact settings, elasticsearch.yml is only read during server startup. Mostly I apply settings to the cluster state, and follow up with elasticsearch.yml updates to make them more concrete and documented.
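
For completeness, the elasticsearch.yml side of this would be a single line along these lines (illustrative only; the actual file is puppet-managed):

# only read at node startup, so it takes effect on the next restart
cluster.routing.use_adaptive_replica_selection: true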