swift-drive-audit unmounting a drive doesn't produce any alerts or notifications
Open, Medium, Public

Description

May  1 18:07:28 ms-be2043 kernel: [1227786.633762] megaraid_sas 0000:02:00.0: 1469 (610049246s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 01(e0x20/s1) at 108362cea
May  1 18:07:30 ms-be2043 kernel: [1227788.881547] megaraid_sas 0000:02:00.0: 1470 (610049247s/0x0001/FATAL) - Uncorrectable medium error logged for VD 03/3 at 108362cea (on PD 01(e0x20/s1) at 108362cea)
May  1 18:07:30 ms-be2043 kernel: [1227788.892031] sd 0:2:3:0: [sdd] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May  1 18:07:30 ms-be2043 kernel: [1227788.892037] sd 0:2:3:0: [sdd] tag#11 Sense Key : Medium Error [current] 
May  1 18:07:30 ms-be2043 kernel: [1227788.892040] sd 0:2:3:0: [sdd] tag#11 Add. Sense: No additional sense information
May  1 18:07:30 ms-be2043 kernel: [1227788.892044] sd 0:2:3:0: [sdd] tag#11 CDB: Read(16) 88 00 00 00 00 01 08 36 2c e0 00 00 00 20 00 00
May  1 18:07:30 ms-be2043 kernel: [1227788.892047] blk_update_request: I/O error, dev sdd, sector 4432735456
May  1 18:07:30 ms-be2043 kernel: [1227788.899487] XFS (sdd1): metadata I/O error: block 0x1083624e0 ("xfs_trans_read_buf_map") error 5 numblks 32
May  1 18:07:30 ms-be2043 kernel: [1227788.910599] XFS (sdd1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.

Event Timeline

The only out-of-the-ordinary value I see in the output of sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli --all is:

Media Error Count: 2

The other counters are all zero, though:

Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0

@fgiunchedi @CDanis Should we alarm and automatically open a task in those cases too?
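If we did want to alarm on these counters, the parsing side could be sketched roughly as follows. This is only an illustration: it assumes the report consists of "Name: value" lines like the ones quoted above, the function name nonzero_counters is hypothetical, and treating any non-zero value as worth flagging is an assumed policy, not something decided in this task.

```python
import re
from typing import List, Tuple

# Counter names are taken from the output quoted above.
# Flagging every non-zero value is an assumption, not an agreed policy.
COUNTER_RE = re.compile(
    r"^\s*(Media Error Count|Other Error Count|Predictive Failure Count)\s*:\s*(\d+)\s*$"
)


def nonzero_counters(report: str) -> List[Tuple[str, int]]:
    """Return (counter name, value) for every tracked counter above zero.

    A list is used rather than a dict because a report covering several
    physical disks would repeat the same counter names.
    """
    found = []
    for line in report.splitlines():
        match = COUNTER_RE.match(line)
        if match and int(match.group(2)) > 0:
            found.append((match.group(1), int(match.group(2))))
    return found


sample_report = """Media Error Count: 2
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
"""

print(nonzero_counters(sample_report))
```

The sketch deliberately ignores which physical disk each counter belongs to; a real check would need to carry that context along.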

I think the 'real' thing we need to notify on here is when Swift decides it wants to stop using a disk, which it did here:

May  1 19:01:01 ms-be2043 drive-audit: Errors found but device unavailable: sdd:1
May  1 19:01:01 ms-be2043 drive-audit: Unmounting /srv/swift-storage/sdd1 with 2 errors
May  1 19:01:02 ms-be2043 kernel: [1231000.234444] XFS (sdd1): Unmounting Filesystem
May  1 19:01:03 ms-be2043 drive-audit: Commenting out /srv/swift-storage/sdd1 from /etc/fstab

I'd rather catch that than catch all the underlying conditions that can cause Swift to do this.
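As a rough sketch of what catching that could look like, the snippet below scans syslog lines for the "Unmounting ... with N errors" message shown above and maps the result onto Nagios-style exit codes. The message pattern is inferred only from the log excerpt in this task, the function names find_unmounts and nagios_status are hypothetical, and how the lines would actually be collected (syslog file, journal, or something else) is left open.

```python
import re

# Pattern inferred from the drive-audit lines quoted in this task;
# the optional "[pid]" part covers journal-style prefixes.
UNMOUNT_RE = re.compile(
    r"drive-audit(?:\[\d+\])?: Unmounting (?P<mount>\S+) with (?P<errors>\d+) errors"
)


def find_unmounts(log_lines):
    """Return (mount point, error count) for each unmount drive-audit logged."""
    events = []
    for line in log_lines:
        match = UNMOUNT_RE.search(line)
        if match:
            events.append((match.group("mount"), int(match.group("errors"))))
    return events


def nagios_status(events):
    """Return (exit_code, message) using Nagios conventions: 0 OK, 2 CRITICAL."""
    if events:
        detail = ", ".join(f"{mount} ({errors} errors)" for mount, errors in events)
        return 2, f"CRITICAL: swift-drive-audit unmounted {detail}"
    return 0, "OK: no drives unmounted by swift-drive-audit"


sample_log = [
    "May  1 19:01:01 ms-be2043 drive-audit: Errors found but device unavailable: sdd:1",
    "May  1 19:01:01 ms-be2043 drive-audit: Unmounting /srv/swift-storage/sdd1 with 2 errors",
    "May  1 19:01:02 ms-be2043 kernel: [1231000.234444] XFS (sdd1): Unmounting Filesystem",
    "May  1 19:01:03 ms-be2043 drive-audit: Commenting out /srv/swift-storage/sdd1 from /etc/fstab",
]

code, message = nagios_status(find_unmounts(sample_log))
print(code, message)
```

Keying on the unmount message keeps the check independent of whichever hardware or filesystem error led Swift to drop the disk.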

CDanis renamed this task from "ms-be2043 /dev/sdd drive failure" to "swift-drive-audit unmounting a drive doesn't produce any alerts or notifications". May 2 2019, 1:32 PM
CDanis removed a project: ops-codfw.
Dzahn triaged this task as Medium priority. May 3 2019, 4:00 PM
Dzahn added a project: observability.

Just wanted to say that when swift-drive-audit fails, it now triggers the generic systemd Icinga alert, because we converted it from a cron job to a systemd service/timer.

This has produced tickets such as T291988, T291896, and T292486.

I also noticed that in these cases, where swift-drive-audit failed because a disk went away, it explicitly said it did NOT unmount anything, like here:

Oct 04 23:01:12 ms-be1059 drive-audit[35328]: Errors found but device unavailable: sdi:3
Oct 04 23:01:12 ms-be1059 drive-audit[35328]: No drives were unmounted
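To tell this case apart from a real unmount, a run's output could be summarized along these lines. This is a sketch under assumptions: the two message strings are copied from the log lines above, the name summarize_run and the needs_attention flag are hypothetical, and the number after the device name (the "3" in "sdi:3") is captured but deliberately not interpreted.

```python
import re

# Both message texts are copied from the journal lines quoted above.
UNAVAILABLE_RE = re.compile(
    r"Errors found but device unavailable: (?P<device>\w+):(?P<suffix>\d+)"
)
NONE_UNMOUNTED = "No drives were unmounted"


def summarize_run(lines):
    """Summarize one drive-audit run.

    Returns the devices reported as unavailable and a flag that is True
    when a device went away but drive-audit unmounted nothing.
    """
    unavailable = []
    none_unmounted = False
    for line in lines:
        match = UNAVAILABLE_RE.search(line)
        if match:
            unavailable.append(match.group("device"))
        if NONE_UNMOUNTED in line:
            none_unmounted = True
    return {
        "unavailable": unavailable,
        "needs_attention": bool(unavailable) and none_unmounted,
    }


sample_run = [
    "Oct 04 23:01:12 ms-be1059 drive-audit[35328]: Errors found but device unavailable: sdi:3",
    "Oct 04 23:01:12 ms-be1059 drive-audit[35328]: No drives were unmounted",
]

print(summarize_run(sample_run))
```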