
Enable adaptive replica selection on CirrusSearch Elasticsearch clusters
Open, Needs Triage · Public · 3 Estimated Story Points


As a user, I would like my search queries not to time out when peak hours overload the infrastructure, so that I can <insert wide variety of workflows supported by search>.

Our alert on 95th percentile elastic response time has been going off more often recently (twice last week, once over the weekend). Per Icinga, the alert was critical for 6h 15min in the last 31 days; 4h 21min of that fell in the last 7 days. This is almost always a load issue of some sort.

In the most recent incident, the search queue graph on the Elasticsearch percentiles dashboard in Grafana shows that the search queue climbed to 1k, after which we started rejecting requests. Since the single-node queue depth is 1k, this suggests a single struggling node. The cluster overview dashboard for the same time range shows elastic1046 hit ~91% CPU utilization and stayed flat for the next 2 hours. This is not the first time we've seen such an issue: Elasticsearch is not resilient to a single node becoming overloaded, even though the cluster holds additional copies of the data.
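The single-node reasoning above can be checked against per-node thread pool statistics rather than the aggregated graph. The following is a minimal sketch, assuming rows shaped like the JSON output of `GET /_cat/thread_pool/search?format=json&h=node_name,queue,rejected`. The node names and numbers are hypothetical sample data, not values from the incident, and the 90% threshold is an arbitrary illustrative choice.

```python
import json

# Per-node search queue capacity, taken from the "1k" figure in the task description.
SEARCH_QUEUE_CAPACITY = 1000


def saturated_nodes(rows, capacity=SEARCH_QUEUE_CAPACITY, threshold=0.9):
    """Return the names of nodes whose search queue is at or above threshold * capacity."""
    hot = []
    for row in rows:
        # The _cat APIs typically return numeric columns as strings; int() handles both.
        if int(row["queue"]) >= threshold * capacity:
            hot.append(row["node_name"])
    return hot


# Hypothetical sample: one node with a full queue, the others nearly idle.
sample = json.loads("""
[
  {"node_name": "node-a", "queue": "3",    "rejected": "0"},
  {"node_name": "node-b", "queue": "1000", "rejected": "512"},
  {"node_name": "node-c", "queue": "0",    "rejected": "0"}
]
""")

hot = saturated_nodes(sample)
print(hot)
```

If only one node shows up in the result while the cluster-wide queue total is close to the single-node capacity, that matches the "single struggling node" pattern described above.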

To help address this, Elasticsearch added adaptive replica selection in 6.1, and it became the cluster-wide default in 7.0. We should evaluate whether enabling it on our clusters would help avoid the hotspotting issues we've seen recently.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 3 2020, 5:49 PM
EBernhardson updated the task description. · Aug 3 2020, 5:51 PM
EBernhardson updated the task description. · Aug 4 2020, 7:55 PM
CBogen set the point value for this task to 3. · Aug 17 2020, 5:25 PM

Reviewed Elasticsearch: nothing relevant is mentioned in the changelogs, and nothing substantial turned up in the git logs of some of the related stats and replica ranking code between v6.5.4 and v7.9.1.

Applied the following to search.svc.eqiad.wmnet:9[246]43/_cluster/settings, which is currently the inactive cluster and will only serve indexing and mjolnir msearch requests:

{"transient":{"cluster.routing.use_adaptive_replica_selection": true}}
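For reference, the settings change above could be scripted roughly as follows. This is a sketch and not the command that was actually run: the `https` scheme, the helper name `ars_settings_request`, and the expansion of `9[246]43` into three ports are assumptions made for illustration. The code only builds the requests; actually sending them is left as a commented-out step.

```python
import json
import urllib.request

SETTING = "cluster.routing.use_adaptive_replica_selection"


def ars_settings_request(base_url, enabled=True):
    """Build (but do not send) a PUT to _cluster/settings toggling the transient ARS setting."""
    body = json.dumps({"transient": {SETTING: enabled}}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed expansion of the 9[246]43 pattern: one port per cluster behind the same hostname.
PORTS = (9243, 9443, 9643)
reqs = [ars_settings_request(f"https://search.svc.eqiad.wmnet:{port}") for port in PORTS]

for req in reqs:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To apply for real (requires network access to the cluster):
    # urllib.request.urlopen(req)
```

Because the setting is applied under `transient`, it does not survive a full cluster restart; calling the helper with `enabled=False` builds the request that reverts it.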

Mentioned in SAL (#wikimedia-operations) [2020-09-22T20:46:51Z] <ebernhardson> T259539 enabled adaptive replica selection on elasticsearch at search.svc.eqiad.wmnet:9[246]43