
swift - ms-be1059 - device sdi:3 unavailable
Closed, Resolved (Public)

Description

On ms-be1059, sdi:3 is gone (compare T291988 and T291896 in codfw):

[ms-be1059:~] $ sudo systemctl status swift-drive-audit
● swift-drive-audit.service - Regular jobs to unmount failed disks
   Loaded: loaded (/lib/systemd/system/swift-drive-audit.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2021-10-04 23:01:12 UTC; 6min ago
  Process: 35328 ExecStart=/usr/bin/swift-drive-audit /etc/swift/swift-drive-audit.conf (code=exited, status=4)
 Main PID: 35328 (code=exited, status=4)

Oct 04 23:01:11 ms-be1059 systemd[1]: Started Regular jobs to unmount failed disks.
Oct 04 23:01:12 ms-be1059 drive-audit[35328]: Errors found but device unavailable: sdi:3

Event Timeline

These are just out of warranty by a few months. I do have spare disks on-site; my guess is a disk went bad on you. Is there any way you can tell me which disk slot corresponds to /dev/sdi?
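One possible starting point for the slot question, as a minimal sketch only: on Linux hosts with udev, the symlinks under /dev/disk/by-path encode the controller and port a block device hangs off. The helper name `by_path_links` is hypothetical, and whether the port number in the link name matches the physical chassis slot is an assumption that depends on the controller and backplane wiring, so the result would still need checking against the hardware.

```python
import os


def by_path_links(device, by_path_dir="/dev/disk/by-path"):
    """Return the by-path symlink names that resolve to the given device node.

    `by_path_links` is a hypothetical helper for illustration. The link names
    describe the controller/port path, which may or may not map 1:1 to a slot.
    """
    target = os.path.realpath(device)
    matches = []
    # The directory is absent on hosts without udev (or in a minimal chroot).
    if not os.path.isdir(by_path_dir):
        return matches
    for name in sorted(os.listdir(by_path_dir)):
        link = os.path.join(by_path_dir, name)
        # Compare fully resolved paths so relative symlinks are handled too.
        if os.path.realpath(link) == target:
            matches.append(name)
    return matches


if __name__ == "__main__":
    # Example: list the by-path names for the device from this task.
    for name in by_path_links("/dev/sdi"):
        print(name)
```

If the device node has already disappeared from the system, as it had here, this returns an empty list; in that case the slot would have to be inferred from the devices that are still present.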

@Cmjohnson The failed disk should have an amber light on it when you're onsite.

I replaced what I think was /dev/sdi. The server did not show any amber LED to let me know which disk had failed.

It will need to be added back to the array. It was slot 8 that was replaced.

Thanks @Cmjohnson !

@Papaul @fgiunchedi Looking at the linked tickets with similar cases, it seemed that just replacing the disk fixed things automatically. Or was there a step where one of you added the disk back to the array? Also note that in the codfw case we had the amber light, but not here. I'm not sure what comes next; I was just reporting.

Thanks @Dzahn and @Cmjohnson ! I've done the procedure and documented it at https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings. We're back, so I'm resolving the task.