
Degraded RAID on ms-be2016
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be2016. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      	resync=PENDING
      
md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
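
For reference, the snapshot above can be checked for degraded arrays programmatically. The following is a minimal sketch, not the actual Icinga plugin: the function name `degraded_arrays` and the returned dictionary layout are assumptions made for illustration, and the parser only handles the mdstat fields that appear in this snapshot.

```python
import re

# Sample copied from the RAID status snapshot in this task.
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      \tresync=PENDING

md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (.*)$")               # "md0 : active raid1 ..."
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")   # "[2/1] [U_]"
FAILED_RE = re.compile(r"(\w+)\[\d+\]\(F\)")             # "sdb1[1](F)"


def degraded_arrays(mdstat_text):
    """Return a dict of arrays that have fewer members up than expected.

    Hypothetical helper: maps the array name to the expected member count,
    the number of members currently up, and the members flagged (F).
    """
    result = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            failed = FAILED_RE.findall(header.group(2))
            continue
        status = STATUS_RE.search(line)
        if current and status:
            expected, up = int(status.group(1)), int(status.group(2))
            if up < expected:
                result[current] = {"expected": expected, "up": up, "failed": failed}
            current = None
    return result


print(degraded_arrays(SAMPLE))
```

On this snapshot the sketch reports only md0, with sdb1 as the failed member; md1 has both members up and is merely waiting for a resync.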

Event Timeline

I can't log into the host, neither via SSH nor via the mgmt interface. Papaul, can you have a look the next time you're in the DC?

Logins are failing with "Connection closed by UNKNOWN port 65535"

This happens after the key is offered for authentication.

The mgmt on ms-be2016 actually works. It's one of those old servers with a very outdated SSH version that doesn't negotiate current SSH kexes, but a login works when explicitly selecting "diffie-hellman-group14-sha1". After logging in, no console is available; it just prints "rejecting I/O to offline device" every second.
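
For anyone hitting the same problem, here is a minimal sketch of how the legacy key exchange can be selected explicitly. It relies on the OpenSSH client option `KexAlgorithms`, where a leading `+` appends the algorithm to the client's default set. The helper name `legacy_mgmt_ssh_command`, the `root` user and the mgmt hostname are assumptions for illustration only, not values taken from this task.

```python
import shlex

# Key exchange algorithm named in the comment above.
LEGACY_KEX = "diffie-hellman-group14-sha1"


def legacy_mgmt_ssh_command(host, user="root"):
    """Build an ssh argv that additionally offers the legacy kex.

    Hypothetical helper; the default user is an assumption.
    """
    return ["ssh", "-o", f"KexAlgorithms=+{LEGACY_KEX}", f"{user}@{host}"]


# Placeholder hostname, not the real mgmt address.
cmd = legacy_mgmt_ssh_command("ms-be2016.mgmt.example.org")
print(shlex.join(cmd))  # shell-ready command line to copy and paste
```

Appending with `+` rather than replacing the list keeps the modern algorithms available, so the same options still work against hosts with a current SSH version.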

After a power cycle the host is up just fine again. There's nothing in kern/syslog from around the time of the crash. Filippo is checking whether there's a firmware update available for the RAID controller.

Mentioned in SAL (#wikimedia-operations) [2019-12-16T09:14:19Z] <godog> upgrade hw raid firmware on ms-be2016 and reboot - T240798

hw raid firmware upgraded, resolving