swift backend decomms / rebalances are noisy
Closed, Resolved · Public

Description

It's very common, when Swift backend machines are doing lots of rebalancing after their weights have been changed, to get a ton of spammy alerts:

17:58:08	<+icinga-wm>	PROBLEM - swift-container-server on ms-be2024 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.60: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
17:58:34	<+icinga-wm>	PROBLEM - MD RAID on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer
17:58:34	<+icinga-wm>	PROBLEM - puppet last run on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer

While the operation of Swift itself doesn't seem to be obviously adversely affected by all the I/O of rebalancing, the Icinga NRPE checks certainly are sensitive to it.

This task is to verify the first part of that statement, and to minimize the impact of rebalancing on monitoring.
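For context, the alerts above are plain NRPE probes failing; the symptom can be reproduced by hand from the alerting host while a rebalance is running (plugin path and target host below are illustrative, not from this task; with no -c argument, check_nrpe only asks the remote daemon for its version, which is enough to see whether it responds at all):

  /usr/lib/nagios/plugins/check_nrpe -H ms-be2024.codfw.wmnet
  # healthy host: replies with the NRPE version string
  # host busy with rebalance I/O: timeouts / "Connection reset by peer", as in the alerts above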

Event Timeline

herron triaged this task as Medium priority. Apr 26 2019, 10:44 PM
herron added a project: observability.

Mentioned in SAL (#wikimedia-operations) [2019-05-07T13:02:45Z] <cdanis> T221904 cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 -m async -b5 'ms-be2*' 'run-puppet-agent -q' 'systemctl restart swift-object-replicator' 'systemctl restart swift-object-auditor'

Trying out a few things here:

  • ionice'ing swift-object-replicator lower than everything else, except
  • ionice'ing swift-object-auditor even lower than that (see the sketch after this list)
  • picking a handful of hosts in codfw: all hosts ending in a 4 or 7 -- 8 hosts, or about 20% of the cluster (ms-be2*[4,7].codfw.wmnet)
  • changing the I/O scheduler on their sd? devices to cfq (which respects ionice)
  • starting a new round of replication traffic from the to-be-decommed hosts (cf. T221068)
  • afterwards, combing through logs to see how swift-object-server performance and monitoring noise were affected between control and experiment
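As a minimal sketch, this is how the re-prioritization could be applied by hand on a single host; the actual rollout went through puppet, and the exact classes/priorities here are illustrative rather than a record of what was deployed:

  # best-effort class, lowest priority, for the replicator processes...
  ionice -c2 -n7 -p $(pgrep -f swift-object-replicator)
  # ...and idle class for the auditor, so it only gets I/O when nothing else wants it
  ionice -c3 -p $(pgrep -f swift-object-auditor)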

Mentioned in SAL (#wikimedia-operations) [2019-05-07T13:17:17Z] <cdanis> T221904 cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 -m async -b5 'ms-be1*' 'run-puppet-agent -q' 'systemctl restart swift-object-replicator' 'systemctl restart swift-object-auditor'

Mentioned in SAL (#wikimedia-operations) [2019-05-08T19:01:04Z] <cdanis> T221904 cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'ms-be2*[4,7].codfw.wmnet' 'for DISK in /sys/block/sd*/queue/scheduler ; do echo cfq > $DISK ; done'
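Not from the original log, but as a sanity check the active scheduler can be read back from the same sysfs files; the scheduler shown in brackets is the one in effect:

  cat /sys/block/sd*/queue/scheduler
  # e.g. "noop deadline [cfq]" on the hosts where the change was applied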

FWIW this should be slightly less noisy since individual swift daemons won't produce alerts anymore as per https://gerrit.wikimedia.org/r/c/operations/puppet/+/530080

AFAICS we haven't observed any alerts through the latest rebalances, possibly also due to using multiple servers per port (T222366).

Optimistically resolving this ticket; I think both https://gerrit.wikimedia.org/r/c/operations/puppet/+/530080 and T222366 have fixed it.

Unfortunately reopening: we've been seeing failures (e.g. systemd, ssh) during the latest codfw rebalances.

lmata added a subscriber: lmata.

I'm going to untag Observability for now, as this is more Swift-related and less o11y-related. :-) If this changes, please retag.

Change 660854 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: limit rsync service memory

https://gerrit.wikimedia.org/r/660854

Change 660855 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: limit rsync to 10% memory in codfw

https://gerrit.wikimedia.org/r/660855

Change 660854 merged by Filippo Giunchedi:
[operations/puppet@production] swift: limit rsync service memory

https://gerrit.wikimedia.org/r/660854

Change 660855 merged by Filippo Giunchedi:
[operations/puppet@production] swift: limit rsync to 10% memory in codfw

https://gerrit.wikimedia.org/r/660855
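The patches themselves aren't quoted in the task; as a rough sketch of the idea (assuming the limit is enforced via systemd's cgroup memory accounting, e.g. MemoryMax -- the unit name, file path, and exact directive are assumptions, not a copy of the puppet change), a 10% cap on rsync could look like:

  # hypothetical systemd drop-in capping rsync at 10% of the host's RAM
  sudo mkdir -p /etc/systemd/system/rsync.service.d
  printf '[Service]\nMemoryMax=10%%\n' | sudo tee /etc/systemd/system/rsync.service.d/memory-limit.conf
  sudo systemctl daemon-reload
  sudo systemctl restart rsync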

Change 661343 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: default swift rsync memory limit

https://gerrit.wikimedia.org/r/661343

Change 661343 merged by Filippo Giunchedi:
[operations/puppet@production] role: default swift rsync memory limit

https://gerrit.wikimedia.org/r/661343

Mentioned in SAL (#wikimedia-operations) [2021-02-03T14:19:55Z] <godog> test memory limits on swift-object-replicator on ms-be2050 - T221904
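The exact commands used for that test aren't recorded here; one plausible way to try a limit on a single host without a puppet change is a runtime-only systemd property (this is an assumption about method; the 5% figure matches the later patches):

  # temporary cap, not persisted across reboot
  sudo systemctl set-property --runtime swift-object-replicator.service MemoryMax=5%
  # confirm it took effect, and watch current usage
  systemctl show -p MemoryMax -p MemoryCurrent swift-object-replicator.service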

Change 661408 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: limit rsync and swift-object-replicator memory to 5% in codfw

https://gerrit.wikimedia.org/r/661408

Change 661408 merged by Filippo Giunchedi:
[operations/puppet@production] swift: limit rsync and swift-object-replicator memory to 5% in codfw

https://gerrit.wikimedia.org/r/661408

Limiting the memory of rsync (receive side) and swift-object-replicator (sender side) has helped quite a bit in bounding the read/write latency experienced by clients. See the screenshot below: the Feb 1st rebalance caused big spikes in latency, whereas subsequent rebalances still show latency spikes, but much smaller ones.
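For completeness (not part of the original comment), whether the limits are in effect can be checked fleet-wide with the same cumin tooling used above; unit names are assumed:

  sudo cumin 'ms-be2*' 'systemctl show -p MemoryMax rsync.service swift-object-replicator.service'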

Change 662703 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add defaults for profile::swift::storage::replication_limit_memory_percent

https://gerrit.wikimedia.org/r/662703

Change 662703 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add defaults for profile::swift::storage::replication_limit_memory_percent

https://gerrit.wikimedia.org/r/662703

Change 662907 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: limit rsync and swift-object-replicator memory to 5% in eqiad

https://gerrit.wikimedia.org/r/662907

Change 662907 merged by Filippo Giunchedi:
[operations/puppet@production] swift: limit rsync and swift-object-replicator memory to 5% in eqiad

https://gerrit.wikimedia.org/r/662907

I'm boldly resolving this again, since limiting memory usage for the object replication processes has helped a great deal in making rebalances quiet.