As a user I would like my search queries not to time out when peak-hour load overloads the infrastructure, so that I can <insert wide variety of workflows supported by search>.
Our alert on 95th percentile Elasticsearch response time has been firing more often recently (twice last week, once over the weekend). Per Icinga, the alert was critical for 6h 15m in the last 31 days, and 4h 21m of that falls in the last 7 days. This is almost always a load issue of some sort.
In the most recent incident, the search queue graph on the Elasticsearch percentiles dashboard in Grafana shows the search queue climbing to 1k, after which we started rejecting requests. Since the queue depth of a single node is 1k, this suggests a single struggling node. The cluster overview dashboard for the same time range shows elastic1046 hitting ~91% CPU utilization and staying flat for the next 2 hours. This is not the first time we've seen this kind of issue: Elasticsearch is not resilient to a single node becoming overloaded, even though other nodes hold copies of the same data.
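To confirm the single-node theory outside of Grafana, something like the following sketch could be run against the output of `GET _cat/thread_pool/search?h=node_name,queue,rejected&format=json`. The function name, the 80% threshold, and the sample rows are all made up for illustration; only the 1k queue limit comes from the incident above.

```python
def nodes_near_queue_limit(rows, queue_limit=1000, threshold=0.8):
    """Return (node_name, queue, rejected) for nodes whose search queue is
    at or above threshold * queue_limit, busiest first.

    rows is the parsed JSON from the _cat/thread_pool/search call above.
    The _cat API returns numbers as strings, hence the int() conversions.
    Note that rejected is a cumulative counter since node start.
    """
    cutoff = queue_limit * threshold
    hot = []
    for row in rows:
        queue = int(row["queue"])
        if queue >= cutoff:
            hot.append((row["node_name"], queue, int(row["rejected"])))
    return sorted(hot, key=lambda entry: entry[1], reverse=True)


# Hypothetical sample rows, not real data from the incident.
sample_rows = [
    {"node_name": "node-a", "queue": "3", "rejected": "0"},
    {"node_name": "node-b", "queue": "1000", "rejected": "412"},
    {"node_name": "node-c", "queue": "12", "rejected": "0"},
]
print(nodes_near_queue_limit(sample_rows))
```

If only one node shows up while the rest sit near zero, that matches the hotspotting pattern described above.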
To help address this, Elasticsearch added adaptive replica selection in 6.1, and it became the cluster-wide default in 7.0. We should evaluate whether enabling it on our clusters would help avoid the hotspotting issues we've seen recently.
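For the evaluation, adaptive replica selection is controlled by the dynamic cluster setting `cluster.routing.use_adaptive_replica_selection`, so it can be toggled without a restart. Below is a minimal sketch of building that request; the helper name and the base URL are placeholders, and choosing a transient setting (so it is easy to back out during a trial) is an assumption rather than a decided rollout plan.

```python
import json
import urllib.request

ARS_SETTING = "cluster.routing.use_adaptive_replica_selection"


def build_ars_request(base_url, enabled=True, persistent=False):
    """Build (but do not send) a PUT /_cluster/settings request that turns
    adaptive replica selection on or off.

    base_url is a placeholder for the cluster endpoint.
    """
    scope = "persistent" if persistent else "transient"
    body = {scope: {ARS_SETTING: enabled}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder endpoint; pass the request to urllib.request.urlopen() to apply it.
req = build_ars_request("http://localhost:9200")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Sending the same request with `enabled=False` would revert the change if the trial makes things worse.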