We suspect that the poor performance (in terms of CPU load and actual query latency) of outlier nodes could be explained by their holding a disproportionate share of the busier indices' shards (such as commonswiki_file). This might explain the phenomenon of "hot spots": individual nodes that have a higher load average and respond more slowly than the rest of the cluster.
Creating this ticket to:
- Tie this issue to SLOs (could a slow response from a single shard drag down response time enough that we should care? Do we have example queries that could prove this theory?)
- Identify busy indices/shards likely to cause this behavior
- Measure the current distribution of problem shards and see whether it's possible to predict performance issues (e.g., once a node has been given X problem shards and Y requests, it will fall into the bottom 10% of performers).
- Correlate performance with other factors (hardware, read/write balance, etc.). If hardware is determined to be the problem, consider requesting more powerful hardware.
- Review shard allocation awareness options and determine if it's possible to change our current configuration without making the "perma-yellow" situation* worse.
*Meaning that the current row/rack awareness already strictly limits shard allocation; we want to avoid adding more rules that would make it impossible to schedule more shards ("perma-yellow").
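
As a starting point for the "measure the current distribution" item, here is a rough sketch of how problem shards per node could be counted. It assumes the input is the list of rows returned by `GET _cat/shards?format=json` (fields `index`, `shard`, `prirep`, `state`, `node`). The `BUSY_INDICES` set, the 1.5x threshold, the helper names, and the sample node names are all hypothetical placeholders, not values we have validated.

```python
from collections import Counter

# Hypothetical set of "busy" indices. commonswiki_file is the only one
# named in this ticket; the real list still has to be identified.
BUSY_INDICES = {"commonswiki_file"}


def busy_shards_per_node(rows, busy_indices=BUSY_INDICES):
    """Count STARTED shards of busy indices on each node.

    `rows` is assumed to look like the output of _cat/shards?format=json.
    Nodes that hold only non-busy shards are kept with a count of 0 so
    that the mean reflects every node seen in the input.
    """
    counts = Counter()
    for row in rows:
        if row["state"] != "STARTED":
            continue  # unassigned/relocating shards have no stable node
        node = row["node"]
        counts[node] += 0  # register the node even if it has no busy shards
        if row["index"] in busy_indices:
            counts[node] += 1
    return counts


def overloaded_nodes(counts, factor=1.5):
    """Return nodes holding more than `factor` times the mean busy-shard count.

    The factor is an arbitrary assumption, to be tuned against real latency data.
    """
    if not counts:
        return []
    mean = sum(counts.values()) / len(counts)
    return sorted(node for node, c in counts.items() if c > factor * mean)


# Made-up sample data for illustration only (node and index names are invented,
# apart from commonswiki_file).
SAMPLE_ROWS = [
    {"index": "commonswiki_file", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-a"},
    {"index": "commonswiki_file", "shard": "1", "prirep": "p", "state": "STARTED", "node": "node-a"},
    {"index": "commonswiki_file", "shard": "2", "prirep": "p", "state": "STARTED", "node": "node-a"},
    {"index": "commonswiki_file", "shard": "0", "prirep": "r", "state": "STARTED", "node": "node-b"},
    {"index": "other_index", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-b"},
    {"index": "other_index", "shard": "0", "prirep": "r", "state": "STARTED", "node": "node-c"},
    {"index": "commonswiki_file", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "node": None},
]
```

Calling `overloaded_nodes(busy_shards_per_node(SAMPLE_ROWS))` on the sample flags the one node that holds three of the four started busy shards. The resulting per-node counts could then be joined against per-node load average and latency percentiles to test whether the X shards / Y requests prediction holds.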