swift - ms-be2035 - device sdi:6 unavailable
Closed, Resolved (Public)

Description

Recently (gerrit:715597) we switched swift-drive-audit from a cron job to systemd timers/jobs.

This also means we now get Icinga alerts about systemd state whenever one of the audit runs fails (a true positive when it _does_ find a failed drive).

This has now happened for the first time:

00:05 <+icinga-wm> PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service  https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
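
For anyone who wants to pick the failed unit names out of an alert line like the one above, here is a minimal sketch. The function name `failed_units` is hypothetical, and the message layout (unit names after "The following units failed:", optionally followed by a documentation URL) is inferred only from this single alert, so treat it as an assumption rather than the check's documented output format.

```python
import re

# Assumed layout, taken from the one alert line quoted in this task:
#   "... The following units failed: <units>  <documentation URL>"
FAILED_UNITS_RE = re.compile(
    r"The following units failed:\s*(.*?)\s*(?:https?://\S+)?\s*$"
)


def failed_units(alert_line: str) -> list:
    """Return the unit names listed in a check_systemd_state alert line."""
    match = FAILED_UNITS_RE.search(alert_line)
    if match is None:
        return []
    # Splitting on commas and whitespace is an assumption: the alert in
    # this task lists only one unit, so the real separator is unknown.
    return [unit for unit in re.split(r"[,\s]+", match.group(1)) if unit]


alert = (
    "PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - "
    "degraded: The following units failed: swift-drive-audit.service  "
    "https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state"
)
print(failed_units(alert))
```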

I ACKed the alert and went to ms-be2035:

[ms-be2035:~] $ sudo systemctl status swift-drive-audit
...
  Process: 18014 ExecStart=/usr/bin/swift-drive-audit /etc/swift/swift-drive-audit.conf (code=exited, status=4)
...
  Sep 28 00:01:10 ms-be2035 drive-audit[18014]: Errors found but device unavailable: sdi:6

So the interesting part is that "sdi:6" is gone.
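
The two relevant facts in the status output above (the exit status and the unavailable device) can be extracted with a short script. This is only a sketch: the helper names `exit_status` and `unavailable_device` are hypothetical, and the patterns are based solely on the two lines quoted in this task, not on any documented swift-drive-audit or systemd output format.

```python
import re
from typing import Optional

# Patterns inferred from the systemctl status lines quoted in this task.
EXIT_RE = re.compile(r"\(code=exited, status=(\d+)\)")
DEVICE_RE = re.compile(r"Errors found but device unavailable:\s*(\S+)")


def exit_status(process_line: str) -> Optional[int]:
    """Return the numeric exit status from a 'Process:' line, if present."""
    match = EXIT_RE.search(process_line)
    return int(match.group(1)) if match else None


def unavailable_device(log_line: str) -> Optional[str]:
    """Return the device token reported as unavailable, if present."""
    match = DEVICE_RE.search(log_line)
    return match.group(1) if match else None


process_line = (
    "  Process: 18014 ExecStart=/usr/bin/swift-drive-audit "
    "/etc/swift/swift-drive-audit.conf (code=exited, status=4)"
)
log_line = (
    "  Sep 28 00:01:10 ms-be2035 drive-audit[18014]: "
    "Errors found but device unavailable: sdi:6"
)
print(exit_status(process_line), unavailable_device(log_line))
```

The device token is returned exactly as logged ("sdi:6"); the sketch does not try to interpret what the part after the colon means.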

There are three previous tickets about this host and its RAID status: T241534, T241714 and T241535.

Event Timeline

fgiunchedi added subscribers: Papaul, fgiunchedi.

Thank you @Dzahn, I've failed the physical disk manually. @Papaul please replace this failed 4TB disk (host is OOW though)

Joe triaged this task as High priority. Sep 29 2021, 6:22 AM
Papaul claimed this task.

@fgiunchedi disk replaced