We've noticed that Elastic likes to pack a lot of the larger index shards (such as commonswiki) onto a single host. This seems to create "hot spots"-individual nodes that have higher load average and respond more slowly than the rest of the cluster.
Creating this ticket to:
- Tie this issue to SLOs (could a slow response from a single shard drag down response time enough that we should care? Do we have example queries that could prove this theory?)
- Identify big indices/shards likely to cause this behavior
- Measure the current distribution of problem shards and see if it's possible to predict performance issues (Once node has given X number of problem shards and Y number requests, the node will fall into the bottom 10% of performers).
- Correlate performance with other factors (hardware, read/write balance, etc?). If hardware is determined to be the problem, consider requesting more powerful hardware.
- Review [[ https://www.elastic.co/guide/en/elasticsearch/reference/7.10/modules-cluster.html#enabling-awareness | shard allocation awareness options ]] and determine if it's possible to change our current configuration without making the "perma-yellow" situation* worse.
*Meaning that the current row/rack awareness strictly limits shard allocation, we want to avoid adding more rules that make it impossible to schedule more shards ("perma-yellow").