
swift backend decomms / rebalances are noisy
Open, Medium, Public

Description

It's very common, when Swift backend machines are doing lots of rebalancing after their weights have been changed, to get a ton of spammy alerts:

17:58:08	<+icinga-wm>	PROBLEM - swift-container-server on ms-be2024 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.60: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift
17:58:34	<+icinga-wm>	PROBLEM - MD RAID on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer
17:58:34	<+icinga-wm>	PROBLEM - puppet last run on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer

While the operation of Swift itself doesn't seem to be obviously adversely affected by all the rebalancing I/O, the Icinga NRPE checks certainly are sensitive to it.

This task is to verify the first part of that statement, and to minimize the impact of rebalancing on monitoring.
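
For reference, a check like the above can be reproduced by hand with check_nrpe from the Icinga side; the NRPE command name below is a placeholder (the real names come from the hosts' NRPE configuration), and the address is the one from the alert above:

/usr/lib/nagios/plugins/check_nrpe -H 10.192.48.60 -t 10 -c check_swift_container_server
# under heavy rebalance I/O this is what fails, producing the
# "Connection reset by peer" alerts shown above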

Event Timeline

CDanis created this task. Apr 25 2019, 10:05 PM
Restricted Application added a subscriber: Aklapper. Apr 25 2019, 10:05 PM
herron triaged this task as Medium priority. Apr 26 2019, 10:44 PM
herron added a project: observability.
CDanis claimed this task. May 7 2019, 12:59 PM

Mentioned in SAL (#wikimedia-operations) [2019-05-07T13:02:45Z] <cdanis> T221904 cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 -m async -b5 'ms-be2*' 'run-puppet-agent -q' 'systemctl restart swift-object-replicator' 'systemctl restart swift-object-auditor'

CDanis added a comment (edited). May 7 2019, 1:05 PM

Trying out a few things here:

  • ionice swift-object-replicator lower than everything else, except
  • ionice swift-object-auditor even lower than that
  • pick a handful of hosts in codfw: all hosts ending in a 4 or 7 -- 8 hosts, or about 20% of the cluster (ms-be2*[4,7].codfw.wmnet)
  • and change their I/O scheduler on sd? to cfq (which respects ionice); see the sketch after this list
  • start a new round of replication traffic from the to-be-decommed hosts (cf. T221068)
  • afterwards, comb through the logs to see how swift-object-server performance and monitoring noise compared between control and experiment
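
In practice these changes were rolled out via puppet and cumin (see the SAL entries below), but here is a minimal hand-rolled sketch of the same idea, assuming the Swift daemons run as ordinary systemd services; the drop-in paths and file names are illustrative:

# lower the replicator's I/O priority to the bottom of the best-effort class
sudo mkdir -p /etc/systemd/system/swift-object-replicator.service.d
sudo tee /etc/systemd/system/swift-object-replicator.service.d/ionice.conf <<'EOF'
[Service]
IOSchedulingClass=best-effort
IOSchedulingPriority=7
EOF

# ... and the auditor lower still, into the idle class
sudo mkdir -p /etc/systemd/system/swift-object-auditor.service.d
sudo tee /etc/systemd/system/swift-object-auditor.service.d/ionice.conf <<'EOF'
[Service]
IOSchedulingClass=idle
EOF

sudo systemctl daemon-reload
sudo systemctl restart swift-object-replicator swift-object-auditor

# on the experiment hosts only: switch the sd? disks to cfq, which honours
# ionice classes (deadline/noop ignore them) -- run as root, since a plain
# sudo won't carry the redirect
for DISK in /sys/block/sd*/queue/scheduler ; do echo cfq > "$DISK" ; done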

Mentioned in SAL (#wikimedia-operations) [2019-05-07T13:17:17Z] <cdanis> T221904 cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 -m async -b5 'ms-be1*' 'run-puppet-agent -q' 'systemctl restart swift-object-replicator' 'systemctl restart swift-object-auditor'

Mentioned in SAL (#wikimedia-operations) [2019-05-08T19:01:04Z] <cdanis> T221904 cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'ms-be2*[4,7].codfw.wmnet' 'for DISK in /sys/block/sd*/queue/scheduler ; do echo cfq > $DISK ; done'
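
To spot-check that the changes took effect on a host (standard tools; the pgrep patterns assume the daemons' usual process names):

cat /sys/block/sd*/queue/scheduler                           # active scheduler is shown in [brackets]
ionice -p "$(pgrep -f swift-object-replicator | head -1)"    # expect: best-effort: prio 7
ionice -p "$(pgrep -f swift-object-auditor | head -1)"       # expect: idle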

FWIW, this should be slightly less noisy now, since individual Swift daemons no longer produce alerts, as per https://gerrit.wikimedia.org/r/c/operations/puppet/+/530080

AFAICS we haven't observed any alerts through the latest rebalances, possibly also thanks to using multiple servers per port (T222366).