
Avoid overloading individual Elastic nodes with popular shards
Closed, Resolved · Public · 3 Estimated Story Points

Description

We suspect that the poor performance (in terms of CPU load and actual query latency) of outlier nodes could be explained by their holding a disproportionate share of the busier indices' shards (such as commonswiki_file). This might explain the phenomenon of "hot spots": individual nodes that have a higher load average and respond more slowly than the rest of the cluster.

Creating this ticket to:

  • Tie this issue to SLOs (could a slow response from a single shard drag down response time enough that we should care? Do we have example queries that could prove this theory?)
  • Identify busy indices/shards likely to cause this behavior
  • Measure the current distribution of problem shards and see if it's possible to predict performance issues (e.g. once a node has been given X problem shards and Y requests, it will fall into the bottom 10% of performers).
  • Correlate performance with other factors (hardware, read/write balance, etc.). If hardware is determined to be the problem, consider requesting more powerful hardware.
  • Review shard allocation awareness options (see the inspection sketch after the footnote below) and determine if it's possible to change our current configuration without making the "perma-yellow" situation* worse.

*Meaning that the current row/rack awareness already strictly limits shard allocation; we want to avoid adding more rules that would make it impossible to schedule some shards at all ("perma-yellow").
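
For reference, one way to review the current awareness configuration and its effect is via the standard cluster APIs; a read-only sketch (the filter paths are illustrative, and the actual values come from our puppet-managed elasticsearch.yml):

# Show the allocation-awareness (row/rack) settings the cluster is running with.
# include_defaults=true is needed because these are set in elasticsearch.yml,
# not via the cluster settings API.
curl -s 'https://search.svc.eqiad.wmnet:9243/_cluster/settings?include_defaults=true&pretty&filter_path=*.cluster.routing.allocation.awareness*'

# Cluster health shows whether allocation rules have left replicas unassigned,
# i.e. the "perma-yellow" risk described above.
curl -s 'https://search.svc.eqiad.wmnet:9243/_cluster/health?pretty'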

Event Timeline

I'm curious about what we've seen that indicates that:

"Elastic likes to pack a lot of the larger index shards (such as commonswiki) onto a single host"

Naively, per https://github.com/wikimedia/puppet/blob/95102ff9e2aa58c6c30ade3e1c351c7af429af53/modules/elasticsearch/templates/elasticsearch_7.yml.erb#L174-L181, the intent of our settings is that the big indices should be distributed very evenly across hosts.

Here are the docs that explain how those settings work: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/modules-cluster.html#shards-rebalancing-heuristics. Basically, our settings should make ES care more about equalizing the number of shards of each index across all nodes, rather than equalizing the total number of shards per node.
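
To make that concrete, the heuristic weights can be read back from the cluster settings API; a read-only sketch (our actual values come from the puppet template linked above):

# The balance.* weights drive the rebalancing heuristic: a higher
# cluster.routing.allocation.balance.index relative to ...balance.shard makes ES
# prioritize spreading each individual index's shards evenly across nodes over
# equalizing total shard counts per node.
curl -s 'https://search.svc.eqiad.wmnet:9243/_cluster/settings?include_defaults=true&pretty&filter_path=*.cluster.routing.allocation.balance*'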

Per avg(rate(elasticsearch_indices_search_query_total[5m])) by (index), it seems like enwiki_content is pretty much the main source of query load. Also, looking at 4 hosts (2 of the worst offenders in terms of latency, and 2 with average latency), we noticed that the 2 worst offenders had 2 enwiki_content shards each while the 2 average-latency hosts had only one.
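
For the record, one way to tally enwiki_content copies per host is the same _cat/shards pattern used for the commonswiki_file check later in this task (column 8 is the node name):

# Count enwiki_content shard copies (primary + replica) per node.
curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/shards/enwiki_content' \
  | awk '{print $8}' | sort | uniq -c | sort -n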

Taking a look, our enwiki_content index has more replicas than necessary (4 instead of 3). So we'll open a patch to shrink those shards down, and then rebalance the primary shards such that we have 45 shards total, which means that for a 50-node cluster (our current size) every host except 5 will have exactly 1 copy. That also means we can lose up to 5 hosts before the lack of enwiki_content replica shards causes the cluster to dip into yellow status.
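
The actual change goes through mediawiki-config/CirrusSearch (patch below), but the underlying Elasticsearch operation amounts to lowering the replica count via the index settings API; a hedged sketch, for illustration only:

# Illustration only: drop enwiki_content from 4 replicas to 3 via the index
# settings API. In practice this is driven by the CirrusSearch index
# configuration in mediawiki-config rather than issued by hand.
curl -s -XPUT 'https://search.svc.eqiad.wmnet:9243/enwiki_content/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 3}}'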

As part of that we'll want to change wgCirrusSearchMaxShardsPerNode => enwiki => [eqiad, codfw] => content from 2 to 1 (since no host will have more than 1 enwiki_content shard). However, since we will soon be temporarily dipping from 50 to 35 hosts in the eqiad cluster (T317816), we don't want to flip that particular setting until after the hosts have been re-racked.
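
If I understand the plumbing correctly, that CirrusSearch setting ends up as the per-index index.routing.allocation.total_shards_per_node limit in Elasticsearch (treat that mapping as an assumption); the current value can be checked read-only:

# Check the per-index shards-per-node allocation limit currently applied to
# enwiki_content (assumed to be what wgCirrusSearchMaxShardsPerNode controls).
curl -s 'https://search.svc.eqiad.wmnet:9243/enwiki_content/_settings?pretty' | grep total_shards_per_node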

Change 833860 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/mediawiki-config@master] elastic: rebalance enwiki_content shard counts

https://gerrit.wikimedia.org/r/833860

Change 833861 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/mediawiki-config@master] elastic: allow only 1 enwiki_content per host

https://gerrit.wikimedia.org/r/833861

bking updated the task description.

Change 833860 merged by jenkins-bot:

[operations/mediawiki-config@master] elastic: rebalance enwiki_content shard counts

https://gerrit.wikimedia.org/r/833860

Mentioned in SAL (#wikimedia-operations) [2022-09-27T20:41:17Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]]

Mentioned in SAL (#wikimedia-operations) [2022-09-27T20:41:40Z] <samtar@deploy1002> samtar and ryankemper: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-09-27T20:46:31Z] <samtar@deploy1002> Finished scap: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]] (duration: 05m 14s)

We should look into whether commonswiki is having similar issues; when commonswiki gets heavily loaded, we often have a few instances with significantly more load than the others.

EBernhardson set the point value for this task to 3. Oct 24 2022, 3:45 PM

Checked the stats; commonswiki_file is pretty reasonably distributed across the cluster. There is probably some reason a few nodes were overloaded when commonswiki queries were loading up the cluster, but it's not due to shard balance.

:) (ebernhardson@stat1006)-~$ curl https://search.svc.eqiad.wmnet:9243/_cat/shards/commonswiki_file | awk '{print $8}' | sort | uniq -c | sort
     1 elastic1071-production-search-eqiad
     1 elastic1076-production-search-eqiad
     1 elastic1085-production-search-eqiad
     1 elastic1090-production-search-eqiad
     2 elastic1053-production-search-eqiad
     2 elastic1054-production-search-eqiad
     2 elastic1055-production-search-eqiad
     2 elastic1056-production-search-eqiad
     2 elastic1057-production-search-eqiad
     2 elastic1058-production-search-eqiad
     2 elastic1059-production-search-eqiad
     2 elastic1060-production-search-eqiad
     2 elastic1061-production-search-eqiad
     2 elastic1062-production-search-eqiad
     2 elastic1063-production-search-eqiad
     2 elastic1064-production-search-eqiad
     2 elastic1065-production-search-eqiad
     2 elastic1066-production-search-eqiad
     2 elastic1067-production-search-eqiad
     2 elastic1068-production-search-eqiad
     2 elastic1069-production-search-eqiad
     2 elastic1070-production-search-eqiad
     2 elastic1072-production-search-eqiad
     2 elastic1073-production-search-eqiad
     2 elastic1074-production-search-eqiad
     2 elastic1075-production-search-eqiad
     2 elastic1077-production-search-eqiad
     2 elastic1078-production-search-eqiad
     2 elastic1079-production-search-eqiad
     2 elastic1080-production-search-eqiad
     2 elastic1081-production-search-eqiad
     2 elastic1082-production-search-eqiad
     2 elastic1083-production-search-eqiad
     2 elastic1084-production-search-eqiad
     2 elastic1086-production-search-eqiad
     2 elastic1087-production-search-eqiad
     2 elastic1088-production-search-eqiad
     2 elastic1089-production-search-eqiad
     2 elastic1091-production-search-eqiad
     2 elastic1092-production-search-eqiad
     2 elastic1093-production-search-eqiad
     2 elastic1094-production-search-eqiad
     2 elastic1095-production-search-eqiad
     2 elastic1096-production-search-eqiad
     2 elastic1097-production-search-eqiad
     2 elastic1098-production-search-eqiad
     2 elastic1099-production-search-eqiad
     2 elastic1100-production-search-eqiad
     2 elastic1101-production-search-eqiad
     2 elastic1102-production-search-eqiad