
Degraded RAID on aqs1013
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host aqs1013. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid10 sde2[4](F) sdh2[3] sdg2[2] sdf2[1]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      bitmap: 6/28 pages [24KB], 65536KB chunk

md1 : active raid10 sda2[0] sdd2[3] sdb2[1] sdc2[2]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 2/28 pages [8KB], 65536KB chunk

md0 : active raid10 sda1[0] sdd1[3] sdb1[1] sdc1[2]
      48791552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      
unused devices: <none>
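
The snapshot above can also be read programmatically. The following is a minimal sketch, not the actual `get-raid-status-md` plugin (whose implementation is not shown in this task): it assumes the `/proc/mdstat` layout seen in the snapshot, and the function name `find_degraded` is hypothetical. An array is treated as degraded when the `[expected/up]` counter shows fewer devices up than expected, and members tagged `(F)` are reported as failed.

```python
import re

# Snapshot copied from the task description above.
MDSTAT_SAMPLE = """\
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md2 : active raid10 sde2[4](F) sdh2[3] sdg2[2] sdf2[1]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      bitmap: 6/28 pages [24KB], 65536KB chunk

md1 : active raid10 sda2[0] sdd2[3] sdb2[1] sdc2[2]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 2/28 pages [8KB], 65536KB chunk

md0 : active raid10 sda1[0] sdd1[3] sdb1[1] sdc1[2]
      48791552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>
"""


def find_degraded(mdstat_text):
    """Return {array: {"expected", "up", "failed"}} for degraded arrays.

    Hypothetical helper: only handles 'active' arrays in the layout
    shown in the sample above.
    """
    degraded = {}
    current = None
    members = []
    for line in mdstat_text.splitlines():
        # Header line, e.g. "md2 : active raid10 sde2[4](F) sdh2[3] ..."
        header = re.match(r"^(md\d+) : active \S+ (.*)$", line)
        if header:
            current = header.group(1)
            members = header.group(2).split()
            continue
        # Status line, e.g. "... [4/3] [_UUU]" (expected/up devices).
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current:
            expected, up = int(status.group(1)), int(status.group(2))
            if up < expected:
                failed = [m.split("[")[0] for m in members if m.endswith("(F)")]
                degraded[current] = {
                    "expected": expected,
                    "up": up,
                    "failed": failed,
                }
            current = None
    return degraded


if __name__ == "__main__":
    print(find_degraded(MDSTAT_SAMPLE))
```

On a live host, the same function could be fed the contents of `/proc/mdstat` instead of the embedded sample.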

Event Timeline

Failing disk:

root@aqs1013:/home/hnowlan# udevadm info --query=all --name=/dev/sde| grep SERIAL
E: ID_SERIAL=MZ7KH1T9HAJR0D3_S4KVNA0MB04213
E: ID_SERIAL_SHORT=S4KVNA0MB04213
root@aqs1013:/home/hnowlan# dmesg | grep sde | tail
[6107332.316343] sd 6:0:0:0: [sde] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[6107332.316359] sd 6:0:0:0: [sde] tag#13 CDB: Read(10) 28 00 02 e9 0f 80 00 00 08 00
[6107332.316365] blk_update_request: I/O error, dev sde, sector 48828288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[6107332.327008] Buffer I/O error on dev sde1, logical block 6103280, async page read
[6107332.335094] sd 6:0:0:0: [sde] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[6107332.335101] sd 6:0:0:0: [sde] tag#1 CDB: Read(10) 28 00 df 8f df 80 00 00 08 00
[6107332.335107] blk_update_request: I/O error, dev sde, sector 3750748032 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[6107332.345913] Buffer I/O error on dev sde2, logical block 462739952, async page read
[6107370.360633] sd 6:0:0:0: [sde] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[6107370.360637] sd 6:0:0:0: [sde] tag#16 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
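
As a cross-check on the log above, the `Read(10)` CDB bytes can be decoded by hand: in the SCSI READ(10) layout, byte 0 is the opcode (`0x28`), bytes 2 to 5 hold the big-endian logical block address, and bytes 7 to 8 hold the transfer length in blocks. The sketch below (the helper name `decode_read10` is hypothetical) shows that the two CDBs correspond to the sectors reported in the `blk_update_request` lines.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (lba, transfer_length_in_blocks).
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    length = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: number of blocks
    return lba, length


if __name__ == "__main__":
    # CDBs copied from the dmesg output above.
    for cdb in ("28 00 02 e9 0f 80 00 00 08 00",
                "28 00 df 8f df 80 00 00 08 00"):
        print(cdb, "->", decode_read10(cdb))
```

Both reads are 8 blocks long and start at sectors 48828288 and 3750748032, matching the I/O errors logged for sde.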

@Jclark-ctr Since the server is OOW, do you have a spare disk from a decommed server?

Actually, we've gone down that route (twice?) already. My thought was that perhaps there was something amiss with that drive bay? Either way though, this array has been degraded for a long time.

@Eevans Would you mind if I swapped it again? Possibly the third time's the charm. All bays in the server are filled. If it fails again, the next step would possibly be a backplane swap.

Ok, go ahead. 🤞

Performed the drive swap and blew out the slot with compressed air. If it fails again, we would need to look at a possible backplane swap.

The device has been added to md2 (it is currently rebuilding).

The RAID has been rebuilt, let's hope 3rd time is the charm!