Change Details

We've noticed that Elastic likes to pack a lot of the larger index shards (such as commonswiki) onto a single host suspect that the poor performance (in terms of cpu load and actual query latency) of outlier nodes could be explained by them having a disproportionate share of the busier indices' shards (such as `commonswiki_file`). This seems to createmight explain the phenomenom of "hot spots"- - individual nodes that have higher load average and respond more slowly than the rest of the cluster. Creating this ticket to: - Tie this issue to SLOs (could a slow response from a single shard drag down response time enough that we should care? Do we have example queries that could prove this theory?) - Identify bigusy indices/shards likely to cause this behavior - Measure the current distribution of problem shards and see if it's possible to predict performance issues (Once node has given X number of problem shards and Y number requests, the node will fall into the bottom 10% of performers). - Correlate performance with other factors (hardware, read/write balance, etc?). If hardware is determined to be the problem, consider requesting more powerful hardware. - Review [[ https://www.elastic.co/guide/en/elasticsearch/reference/7.10/modules-cluster.html#enabling-awareness | shard allocation awareness options ]] and determine if it's possible to change our current configuration without making the "perma-yellow" situation* worse. *Meaning that the current row/rack awareness strictly limits shard allocation, we want to avoid adding more rules that make it impossible to schedule more shards ("perma-yellow").