Recently (gerrit:715597) we switched the swift-drive-audit cron to systemd timers/jobs.
This also means we now get Icinga alerts about systemd state if one of the audit runs fails (a true positive when it _does_ find a failed drive).
This has now happened for the first time:
00:05 <+icinga-wm> PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
I ACKed the alert and went to ms-be2035:
[ms-be2035:~] $ sudo systemctl status swift-drive-audit
...
Process: 18014 ExecStart=/usr/bin/swift-drive-audit /etc/swift/swift-drive-audit.conf (code=exited, status=4)
...
Sep 28 00:01:10 ms-be2035 drive-audit[18014]: Errors found but device unavailable: sdi:6
The interesting part is that "sdi:6" is gone.
There are three previous tickets about this host and its RAID status: T241534, T241714 and T241535.