
Degraded RAID on ms-be2034
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be2034. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0](F) sdb1[1]
      58559488 blocks super 1.2 [2/1] [_U]
      
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

Event Timeline

Host was rebooted by @elukey this morning; after the reboot, both arrays assembled correctly:

root@ms-be2034:~# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0] sdb1[1]
      58559488 blocks super 1.2 [2/2] [UU]
      
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
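
For reference, the difference between the two snapshots is in the `[expected/active] [flags]` pair on each array's detail line: `[2/1] [_U]` means one of two members is missing, `[2/2] [UU]` means both are present. A minimal Python sketch of that check follows. It is not the Icinga plugin's actual logic; the function name and regexes are my own, and the sample text is copied from the degraded snapshot in the task description.

```python
import re

# "[expected/active] [U_...]" status found on an md array's detail line.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# Array header line, e.g. "md0 : active raid1 sda1[0](F) sdb1[1]".
ARRAY_RE = re.compile(r"^(md\d+)\s*:")

# Sample copied from the degraded snapshot in the task description.
SAMPLE = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0](F) sdb1[1]
      58559488 blocks super 1.2 [2/1] [_U]

md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active, flags)} for arrays missing a member.

    Hypothetical helper for illustration only.
    """
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if current and status:
            expected = int(status.group(1))
            active = int(status.group(2))
            flags = status.group(3)
            if active < expected or "_" in flags:
                degraded[current] = (expected, active, flags)
            current = None
    return degraded


print(degraded_arrays(SAMPLE))
```

On the host itself, the same function could be fed the contents of `/proc/mdstat` instead of `SAMPLE`.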

I was able to get the kernel messages from the syslog servers; they are in P7024 (omitting the spammy "rejecting i/o to offline device" message).
It looks like hpsa detected a controller lockup and things snowballed from there:

archive/syslog.log-20180422.gz:Apr 21 19:57:19 ms-be2034 kernel: [4017510.932658] hpsa 0000:08:00.0: Controller lockup detected: 0x00130001 after 30
archive/syslog.log-20180422.gz:Apr 21 19:57:19 ms-be2034 kernel: [4017510.932687] hpsa 0000:08:00.0: controller lockup detected: LUN:0000004000000000 CDB:01040000000000000000000000000000
archive/syslog.log-20180422.gz:Apr 21 19:57:19 ms-be2034 kernel: [4017510.932691] hpsa 0000:08:00.0: Controller lockup detected during reset wait

This looks similar enough to T184390. I'll upgrade the RAID controller firmware, since that upgrade is pending anyway.
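
The filtering described above (keep the hpsa kernel messages, drop the "rejecting i/o to offline device" noise) can be sketched roughly as follows. This is not the command that produced P7024; the regex, the function name, and the single noise line in the sample are my own illustrative assumptions, while the three hpsa lines are copied from the excerpt above.

```python
import re

# grep-style line: "<archive>.gz:<Mon DD HH:MM:SS> <host> kernel: [<uptime>] <msg>"
LINE_RE = re.compile(
    r"^(?:(?P<archive>\S+?\.gz):)?"
    r"(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s"
    r"(?P<host>\S+)\skernel:\s"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s"
    r"(?P<msg>.*)$"
)

NOISE = "rejecting i/o to offline device"

SAMPLE_LINES = [
    "archive/syslog.log-20180422.gz:Apr 21 19:57:19 ms-be2034 kernel: [4017510.932658] hpsa 0000:08:00.0: Controller lockup detected: 0x00130001 after 30",
    "archive/syslog.log-20180422.gz:Apr 21 19:57:19 ms-be2034 kernel: [4017510.932687] hpsa 0000:08:00.0: controller lockup detected: LUN:0000004000000000 CDB:01040000000000000000000000000000",
    "archive/syslog.log-20180422.gz:Apr 21 19:57:19 ms-be2034 kernel: [4017510.932691] hpsa 0000:08:00.0: Controller lockup detected during reset wait",
    # Made-up noise line for illustration only; it is not taken from the real log.
    "archive/syslog.log-20180422.gz:Apr 21 19:57:20 ms-be2034 kernel: [4017511.000000] sd 0:0:0:0: rejecting I/O to offline device",
]


def hpsa_events(lines):
    """Return (timestamp, uptime_seconds, message) for hpsa kernel messages.

    Hypothetical helper: skips unparseable lines and the offline-device noise.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        msg = match.group("msg")
        if NOISE in msg.lower():
            continue
        if msg.startswith("hpsa "):
            events.append((match.group("stamp"), float(match.group("uptime")), msg))
    return events


for stamp, uptime, msg in hpsa_events(SAMPLE_LINES):
    print(stamp, uptime, msg)
```

Sorting the returned tuples by the uptime value would give the order in which the kernel logged the lockup messages, which helps when several lines share the same wall-clock second as they do here.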

Mentioned in SAL (#wikimedia-operations) [2018-04-23T09:13:35Z] <godog> Flashing Smart Array P840 in Slot 3 [ 4.52 -> 6.30 ] on ms-be2034 - T192721 T141756

fgiunchedi claimed this task.

Firmware upgraded; I'll tentatively resolve this and reopen if we see a recurrence.