
Enable 10G networking in cirrus elastic clusters
Closed, Resolved · Public · 3 Estimated Story Points

Description

The Search team has been steadily working over the last few years on getting all of our elastic* hosts outfitted with 10G NICs and placed into 10G racks, which will allow us to raise our cluster-wide network throughput limit (indices.recovery.max_bytes_per_sec) accordingly. Faster network recovery of shards lets us recover from unexpected host failures more quickly, and also speeds up routine maintenance operations like rolling upgrades and rolling reimages.
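For reference, raising that limit cluster-wide goes through the Elasticsearch cluster settings API. A minimal sketch is below; the endpoint and the 300mb value are illustrative placeholders, not the values from the actual change:

```shell
# Hedged sketch: apply a persistent per-node recovery throughput cap.
# SETTING is the real Elasticsearch setting name; VALUE and the endpoint
# in the commented curl are hypothetical.
SETTING='indices.recovery.max_bytes_per_sec'
VALUE='300mb'
PAYLOAD="{\"persistent\": {\"${SETTING}\": \"${VALUE}\"}}"
# Against a live cluster this would be:
#   curl -X PUT "https://search.svc.eqiad.wmnet:9243/_cluster/settings" \
#        -H 'Content-Type: application/json' -d "$PAYLOAD"
echo "$PAYLOAD"
```

A persistent setting survives full-cluster restarts, whereas a transient one does not, which is why both namespaces show up in the settings dumps further down.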

There are two main sets of hosts that are still not in 10G racks that we need to address:

  • (codfw) elastic20[25-36] will be decommissioned very soon per https://phabricator.wikimedia.org/T300943, so this is blocked only on our own Search team efforts, not on another team. These will be taken care of very soon.

And afterwards we need to:

  • Review our existing eqiad/codfw cirrus elasticsearch cluster throughput limits

Event Timeline

RKemper changed the task status from Open to In Progress. Nov 7 2022, 4:09 PM

@Jclark-ctr provided the following context in IRC:

  • Reimaging might be required, especially if moving rows (we're OK with reimaging if necessary).
  • There is adequate 10G capacity in all rows (which means we should probably just keep all hosts in the existing rows to maintain balance).
  • We want to start with one or two hosts.

I scheduled some time on Friday for the two of us to do a trial run. If it works and @Jclark-ctr has time, we could potentially do all the hosts then. @Jclark-ctr, if you need to correct me or add more context, feel free to do so here.

Thanks for your help with this!

~B

Change 895874 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: Incr per-node shard recovery thru-put cap

https://gerrit.wikimedia.org/r/895874

Change 895874 merged by Bking:

[operations/puppet@production] elastic: Incr per-node shard recovery thru-put cap

https://gerrit.wikimedia.org/r/895874

Mentioned in SAL (#wikimedia-operations) [2023-03-09T19:51:23Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to enable incr shard recovery throughput - ryankemper@cumin1001 - T317816

We've zeroed out the cluster (transient|persistent).indices.recovery.max_bytes_per_sec settings for eqiad & codfw:

(eqiad previous)

ryankemper@cumin1001:~$ curl -s https://search.svc.eqiad.wmnet:9243/_cluster/settings | jq .[].indices && \
curl -s https://search.svc.eqiad.wmnet:9443/_cluster/settings | jq .[].indices && \
curl -s https://search.svc.eqiad.wmnet:9643/_cluster/settings | jq .[].indices
{
  "recovery": {
    "max_bytes_per_sec": "40mb"
  }
}
{
  "recovery": {
    "max_bytes_per_sec": "80mb"
  }
}
null
null
null
null

(eqiad current)

ryankemper@cumin1001:~$ curl -s https://search.svc.eqiad.wmnet:9243/_cluster/settings | jq .[].indices && \
curl -s https://search.svc.eqiad.wmnet:9443/_cluster/settings | jq .[].indices && \
curl -s https://search.svc.eqiad.wmnet:9643/_cluster/settings | jq .[].indices
null
null
null
null
null
null

(codfw previous)

curl -s https://search.svc.codfw.wmnet:9643/_cluster/settings | jq .[].indices
{
  "recovery": {
    "max_bytes_per_sec": "100mb"
  }
}
null
null
null
null
null

(codfw current)

ryankemper@cumin1001:~$ curl -s https://search.svc.codfw.wmnet:9243/_cluster/settings | jq .[].indices && \
curl -s https://search.svc.codfw.wmnet:9443/_cluster/settings | jq .[].indices && \
curl -s https://search.svc.codfw.wmnet:9643/_cluster/settings | jq .[].indices
null
null
null
null
null
null
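The zeroing step above can be sketched as follows. In Elasticsearch, setting a cluster setting to null removes the transient/persistent override, so the node-level value from elasticsearch.yml (puppet-managed here) takes effect instead. The endpoint in the commented curl is illustrative:

```shell
# Hedged sketch: clear both the transient and persistent overrides for
# indices.recovery.max_bytes_per_sec by setting them to null.
PAYLOAD='{
  "transient":  { "indices.recovery.max_bytes_per_sec": null },
  "persistent": { "indices.recovery.max_bytes_per_sec": null }
}'
# Against a live cluster this would be:
#   curl -X PUT "https://search.svc.eqiad.wmnet:9243/_cluster/settings" \
#        -H 'Content-Type: application/json' -d "$PAYLOAD"
echo "$PAYLOAD"
```

After this, the settings dumps above return null for .indices, confirming no override remains in either namespace.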

Mentioned in SAL (#wikimedia-operations) [2023-03-09T21:19:40Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to enable incr shard recovery throughput - ryankemper@cumin1001 - T317816

Rerouted a shard like so:

curl -X POST "https://search.svc.eqiad.wmnet:9243/_cluster/reroute?metric=none&pretty" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "move": {
        "index":     "enwiki_general_1677006528", "shard": 7,
        "from_node": "elastic1101-production-search-eqiad",
        "to_node":   "elastic1069-production-search-eqiad"
      }
    }
  ]
}
'

And watched the recovery via:

curl -s 'https://search.svc.eqiad.wmnet:9243/_cat/recovery?active_only&v&h=index,shard,source_node,target_node,time,stage,bytes_percent,translog_ops_recovered,translog_ops_percent'

The initial transfer took about 5 minutes, which works out to around 80 MB/s. That's reasonably close to our theoretical max of 120 MB/s, and in real scenarios we will often be recovering more than one shard simultaneously, so we think these settings are in a good place.
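As a quick sanity check on those figures (the shard size below is an assumption back-derived from the quoted numbers, not something stated in this task):

```shell
# Assumed inputs: a ~24 GB shard transferred in ~5 minutes. Both values are
# hypothetical, inferred from the ~80 MB/s figure quoted above.
SHARD_MB=24000
SECONDS_TAKEN=300
echo "$((SHARD_MB / SECONDS_TAKEN)) MB/s"   # prints: 80 MB/s
```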