Avoid overloading individual Elastic nodes with popular shards
Closed, Resolved · Public · 3 Estimated Story Points

Assigned To

Authored By

	bking
	Sep 21 2022, 7:47 PM

Description

We suspect that the poor performance (in terms of CPU load and actual query latency) of outlier nodes could be explained by them having a disproportionate share of the busier indices' shards (such as commonswiki_file). This might explain the phenomenon of "hot spots": individual nodes that have a higher load average and respond more slowly than the rest of the cluster.

Creating this ticket to:

Tie this issue to SLOs (could a slow response from a single shard drag down response time enough that we should care? Do we have example queries that could prove this theory?)
Identify busy indices/shards likely to cause this behavior
Measure the current distribution of problem shards and see if it's possible to predict performance issues (e.g. once a node has been given X problem shards and Y requests, it will fall into the bottom 10% of performers).
Correlate performance with other factors (hardware, read/write balance, etc?). If hardware is determined to be the problem, consider requesting more powerful hardware.
Review shard allocation awareness options and determine if it's possible to change our current configuration without making the "perma-yellow" situation* worse.

*Meaning that the current row/rack awareness already strictly limits shard allocation, so we want to avoid adding more rules that would make it impossible to schedule more shards ("perma-yellow").

Details

	Subject	Repo	Branch	Lines +/-
	elastic: rebalance enwiki_content shard counts	operations/mediawiki-config	master	+11 -9


Related Objects

Mentioned Here: T317816: Enable 10G networking in cirrus elastic clusters

Event Timeline

bking created this task. Sep 21 2022, 7:47 PM

Restricted Application added a subscriber: Aklapper. Sep 21 2022, 7:47 PM

bking updated the task description. Sep 21 2022, 7:48 PM

I'm curious about what we've seen that indicates that

Elastic likes to pack a lot of the larger index shards (such as commonswiki) onto a single host

Naively, per https://github.com/wikimedia/puppet/blob/95102ff9e2aa58c6c30ade3e1c351c7af429af53/modules/elasticsearch/templates/elasticsearch_7.yml.erb#L174-L181 the intent of our settings is that it will distribute the big indices super evenly between hosts.

Here are the docs that explain how those settings work: https://www.elastic.co/guide/en/elasticsearch/reference/7.10/modules-cluster.html#shards-rebalancing-heuristics. Basically, our settings should make ES care more about equalizing the number of shards per index across all nodes than about equalizing the total number of shards per node.
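To make the trade-off between those two balance settings concrete, here is a minimal sketch of how a weighted balance score of this kind behaves. It is a simplified model based on my reading of the rebalancing heuristics docs linked above, not the allocator's actual code. The function name node_weight and all the shard counts are made up for illustration; the 0.45 / 0.55 values are the documented defaults for cluster.routing.allocation.balance.shard and cluster.routing.allocation.balance.index, not necessarily what our puppet template sets.

```python
def node_weight(node_total_shards, avg_total_shards,
                node_index_shards, avg_index_shards,
                balance_shard=0.45, balance_index=0.55):
    """Simplified per-node, per-index imbalance score (hypothetical helper).

    A higher score means the node is more overloaded relative to the
    cluster average, either in total shard count or in shards of one index.
    """
    total = balance_shard + balance_index
    theta_shard = balance_shard / total  # weight of total-shard imbalance
    theta_index = balance_index / total  # weight of per-index imbalance
    return (theta_shard * (node_total_shards - avg_total_shards)
            + theta_index * (node_index_shards - avg_index_shards))


# Illustrative numbers only: a node with an average total shard count that
# holds 2 shards of one index while the per-node average for it is 0.9.
default_score = node_weight(10, 10, 2, 0.9)
index_heavy_score = node_weight(10, 10, 2, 0.9,
                                balance_shard=0.1, balance_index=0.9)
print(default_score, index_heavy_score)
```

The more the index factor dominates, the more strongly a node holding a second shard of the same index is penalised, which is the behaviour described above.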

RKemper updated the task description. Sep 21 2022, 8:38 PM

Per avg(rate(elasticsearch_indices_search_query_total[5m])) by (index), it seems like enwiki_content is pretty much the main source of query load. Also, looking at 4 hosts (2 of the worst offenders in terms of latency, and 2 with average latency), we noticed that the 2 worst offenders had 2 enwiki_content shards each, while the 2 average-latency hosts had only one.

Taking a look, our enwiki_content index has more replicas than necessary (4 instead of 3). So we'll open a patch to reduce the replica count, and then rebalance the primary shards such that we have 45 shards total. For a 50-node cluster (our current size), that means every host except 5 will have exactly 1 copy. It also means we can lose up to 5 hosts before the lack of room for enwiki_content replica shards causes the cluster to dip into yellow status.

As part of that we'll want to change wgCirrusSearchMaxShardsPerNode => enwiki => [eqiad, codfw] => content from 2 to 1 (since no host will have more than 1 shard). However, since we will soon be temporarily dipping from 50 to 35 hosts in the eqiad cluster (T317816), we don't want to flip that setting specifically until after the hosts have been re-racked.
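The arithmetic behind the two paragraphs above can be sketched as follows. This is only a back-of-the-envelope check: the helper name placement_headroom is made up, and it ignores row/rack awareness and every other allocation rule, looking only at the per-node cap.

```python
def placement_headroom(shard_copies, nodes, max_per_node):
    """Free enwiki_content slots left after placing every shard copy.

    A negative result means that many copies cannot be placed under the
    per-node cap, i.e. the cluster would sit in yellow.
    """
    return nodes * max_per_node - shard_copies


# 45 total copies on the current 50 hosts with a cap of 1 per host
print(placement_headroom(45, 50, 1))  # 5 spare hosts
# Same cap during the temporary dip to 35 hosts
print(placement_headroom(45, 35, 1))  # -10, so 10 copies unassigned
# Keeping the cap at 2 during the dip
print(placement_headroom(45, 35, 2))  # 25 spare slots
```

With the cap at 1, the 35-host cluster cannot hold all 45 copies, which is why the setting should stay at 2 until the hosts are back.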

Change 833860 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/mediawiki-config@master] elastic: rebalance enwiki_content shard counts

https://gerrit.wikimedia.org/r/833860

Change 833861 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/mediawiki-config@master] elastic: allow only 1 enwiki_content per host

https://gerrit.wikimedia.org/r/833861

bking awarded a token. Sep 26 2022, 4:32 PM

bking edited projects, added Discovery-Search (Current work); removed Discovery-Search. Sep 26 2022, 4:36 PM

bking updated the task description.

Change 833860 merged by jenkins-bot:

[operations/mediawiki-config@master] elastic: rebalance enwiki_content shard counts

https://gerrit.wikimedia.org/r/833860

Mentioned in SAL (#wikimedia-operations) [2022-09-27T20:41:17Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]]

Mentioned in SAL (#wikimedia-operations) [2022-09-27T20:41:40Z] <samtar@deploy1002> samtar and ryankemper: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-09-27T20:46:31Z] <samtar@deploy1002> Finished scap: Backport for [[gerrit:833860|elastic: rebalance enwiki_content shard counts (T318270)]] (duration: 05m 14s)

Gehel moved this task from Incoming to Waiting on the Discovery-Search (Current work) board. Sep 29 2022, 6:38 PM

Gehel assigned this task to RKemper. Oct 10 2022, 3:08 PM

We should look into whether commonswiki is having similar issues. When commonswiki gets heavily loaded, we often have a few instances with significantly more load than the others.

EBernhardson moved this task from Waiting to In Progress on the Discovery-Search (Current work) board. Oct 24 2022, 3:23 PM

EBernhardson set the point value for this task to 3. Oct 24 2022, 3:45 PM

Checked the stats: commonswiki_file is pretty reasonably distributed across the cluster. There is probably some reason that a few nodes were overloaded when commonswiki queries were loading up the cluster, but it's not due to shard balance.

:) (ebernhardson@stat1006)-~$ curl https://search.svc.eqiad.wmnet:9243/_cat/shards/commonswiki_file | awk '{print $8}' | sort | uniq -c | sort
     1 elastic1071-production-search-eqiad
     1 elastic1076-production-search-eqiad
     1 elastic1085-production-search-eqiad
     1 elastic1090-production-search-eqiad
     2 elastic1053-production-search-eqiad
     2 elastic1054-production-search-eqiad
     2 elastic1055-production-search-eqiad
     2 elastic1056-production-search-eqiad
     2 elastic1057-production-search-eqiad
     2 elastic1058-production-search-eqiad
     2 elastic1059-production-search-eqiad
     2 elastic1060-production-search-eqiad
     2 elastic1061-production-search-eqiad
     2 elastic1062-production-search-eqiad
     2 elastic1063-production-search-eqiad
     2 elastic1064-production-search-eqiad
     2 elastic1065-production-search-eqiad
     2 elastic1066-production-search-eqiad
     2 elastic1067-production-search-eqiad
     2 elastic1068-production-search-eqiad
     2 elastic1069-production-search-eqiad
     2 elastic1070-production-search-eqiad
     2 elastic1072-production-search-eqiad
     2 elastic1073-production-search-eqiad
     2 elastic1074-production-search-eqiad
     2 elastic1075-production-search-eqiad
     2 elastic1077-production-search-eqiad
     2 elastic1078-production-search-eqiad
     2 elastic1079-production-search-eqiad
     2 elastic1080-production-search-eqiad
     2 elastic1081-production-search-eqiad
     2 elastic1082-production-search-eqiad
     2 elastic1083-production-search-eqiad
     2 elastic1084-production-search-eqiad
     2 elastic1086-production-search-eqiad
     2 elastic1087-production-search-eqiad
     2 elastic1088-production-search-eqiad
     2 elastic1089-production-search-eqiad
     2 elastic1091-production-search-eqiad
     2 elastic1092-production-search-eqiad
     2 elastic1093-production-search-eqiad
     2 elastic1094-production-search-eqiad
     2 elastic1095-production-search-eqiad
     2 elastic1096-production-search-eqiad
     2 elastic1097-production-search-eqiad
     2 elastic1098-production-search-eqiad
     2 elastic1099-production-search-eqiad
     2 elastic1100-production-search-eqiad
     2 elastic1101-production-search-eqiad
     2 elastic1102-production-search-eqiad
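For anyone who wants to repeat this check for other indices, the awk pipeline above can be sketched in Python roughly as below. It assumes the default _cat/shards column order, in which the node name is the 8th whitespace-separated field (the same assumption the awk command makes). The function name shards_per_node and the sample rows (doc counts, sizes, IPs) are invented for illustration.

```python
from collections import Counter


def shards_per_node(cat_shards_text):
    """Count shard copies per node from plain-text _cat/shards output."""
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            # Unassigned shards have no ip/node columns; skip them.
            continue
        counts[fields[7]] += 1
    return counts


# Invented sample rows in the default column order:
# index shard prirep state docs store ip node
sample = """\
commonswiki_file 0 p STARTED 100 1gb 10.0.0.1 elastic1053-production-search-eqiad
commonswiki_file 0 r STARTED 100 1gb 10.0.0.2 elastic1071-production-search-eqiad
commonswiki_file 1 p STARTED 100 1gb 10.0.0.1 elastic1053-production-search-eqiad
commonswiki_file 1 r UNASSIGNED
"""

# Sort by count, then node name, similar to the `sort | uniq -c | sort` output
for node, count in sorted(shards_per_node(sample).items(),
                          key=lambda item: (item[1], item[0])):
    print(count, node)
```

Skipping short rows means unassigned copies are left out of the per-node counts, so a cluster in yellow would need a separate look at the rows that were dropped.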

EBernhardson moved this task from In Progress to Needs Reporting on the Discovery-Search (Current work) board. Nov 22 2022, 7:46 PM

Gehel moved this task from Needs Reporting to Blocked/Waiting on the Discovery-Search (Current work) board. Nov 22 2022, 7:46 PM

Gehel moved this task from Blocked/Waiting to Needs Reporting on the Discovery-Search (Current work) board. Nov 28 2022, 4:17 PM

Gehel closed this task as Resolved. Dec 2 2022, 2:11 PM
