The Search team has been steadily working over the last few years on getting all of our elastic* hosts equipped with 10G NICs and placed into 10G racks, which will allow us to raise our cluster-wide network throughput limit, indices.recovery.max_bytes_per_sec, accordingly. This will enable much faster network recovery of shards, which will let us recover from unexpected host failures more quickly and speed up routine maintenance operations like rolling upgrades / rolling reimages.
There are two main sets of hosts still not in 10G racks that we need to address:
- (eqiad) elastic10[53-67], see https://phabricator.wikimedia.org/T230746#5544656 for historical context. This is tracked in https://phabricator.wikimedia.org/T322082
- (codfw) elastic20[25-36] will be decommissioned very soon per https://phabricator.wikimedia.org/T300943, so this is blocked only on our own Search team's efforts and not on another team.
Afterwards we need to:
- Review our existing eqiad/codfw cirrus elasticsearch cluster throughput limits
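
As a rough illustration of what raising the limit could look like once the review is done, here is a minimal sketch that builds a request for the Elasticsearch cluster update settings API (`PUT /_cluster/settings`). The base URL `http://localhost:9200` and the value `250mb` are placeholders assumed for the example, not our actual endpoint or an agreed target, and the helper names are hypothetical. The request is only constructed and printed, not sent.

```python
import json
import urllib.request

# Setting named in this task; the value we end up choosing is still to be decided.
SETTING = "indices.recovery.max_bytes_per_sec"


def build_recovery_limit_body(limit: str, persistent: bool = True) -> dict:
    """Build the JSON body for a cluster settings update (hypothetical helper)."""
    scope = "persistent" if persistent else "transient"
    return {scope: {SETTING: limit}}


def build_request(base_url: str, limit: str) -> urllib.request.Request:
    """Build, but do not send, the PUT request against /_cluster/settings."""
    body = json.dumps(build_recovery_limit_body(limit)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder endpoint and limit, assumed purely for illustration.
    req = build_request("http://localhost:9200", "250mb")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the change, one would call urllib.request.urlopen(req).
```

Using the `persistent` scope means the setting would survive a full cluster restart; passing `persistent=False` would produce a `transient` update instead, which may be preferable while experimenting with values.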