
Degraded RAID on ms-be2016
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be2016. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      	resync=PENDING
      
md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
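
As an illustration of how a snapshot like the one above can be read, here is a minimal Python sketch that parses `/proc/mdstat`-style text and reports degraded arrays and failed members. It is not the actual `get-raid-status-md` plugin; the names `parse_mdstat` and `SAMPLE` are hypothetical, and the regular expressions only cover the line shapes seen in this task.

```python
import re

# Sample copied from the snapshot in this task (hypothetical variable name).
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      \tresync=PENDING

md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (\S+) (.*)$")         # "md0 : active ..."
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")   # "[2/1] [U_]"
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\([A-Z]\))?")      # "sdb1[1](F)"


def parse_mdstat(text):
    """Return {array: {"failed": [...], "expected": n, "up": n, "degraded": bool}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            members = MEMBER_RE.findall(header.group(3))
            arrays[current] = {
                # Members flagged "(F)" have been marked faulty by md.
                "failed": [dev for dev, flag in members if flag == "(F)"],
                "expected": None,
                "up": None,
                "degraded": False,
            }
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, up = int(status.group(1)), int(status.group(2))
            arrays[current].update(
                expected=expected, up=up, degraded=up < expected
            )
    return arrays


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE).items():
        print(name, info)
```

Run against the snapshot, this flags md0 as degraded with sdb1 as the failed member, while md1 still has both members up.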

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 15 2019, 8:29 PM

I can't log into the host, either via SSH or via the mgmt interface. Papaul, can you have a look the next time you're in the DC?

Logins are failing with "Connection closed by UNKNOWN port 65535"

Joe added a subscriber: Joe. · Dec 16 2019, 8:45 AM

This happens after the key is offered for authentication.

The mgmt on ms-be2016 actually works. It's one of those old servers with a very outdated SSH version that doesn't negotiate current SSH kexes, but a login works when explicitly selecting "diffie-hellman-group14-sha1". After logging in, no console is available and it just prints "rejecting I/O to offline device" every second.
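
For reference, a small Python sketch of how such a login could be assembled with OpenSSH's `KexAlgorithms` client option. The helper name `legacy_ssh_command`, the `root` user and the management hostname are assumptions for illustration; the exact command used on this task is not recorded here.

```python
import shlex

# Key exchange algorithm named in the comment above.
LEGACY_KEX = "diffie-hellman-group14-sha1"


def legacy_ssh_command(host, user="root", kex=LEGACY_KEX):
    """Build an ssh argv that additionally offers a legacy key exchange.

    The leading "+" appends the algorithm to the client's default list
    instead of replacing it. Host and user are hypothetical placeholders.
    """
    return ["ssh", "-o", f"KexAlgorithms=+{kex}", f"{user}@{host}"]


if __name__ == "__main__":
    # Hypothetical management hostname, used only to show the resulting command.
    print(shlex.join(legacy_ssh_command("ms-be2016.mgmt.example")))
```

Very old management controllers may need further legacy options (host key or cipher settings, for example); this sketch only covers the key exchange mentioned in the comment.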

Mentioned in SAL (#wikimedia-operations) [2019-12-16T08:53:49Z] <moritzm> powercycling ms-be2016 T240798

After a power cycle the host is up and running fine again; there's nothing in kern/syslog from around the time of the crash. Filippo is checking whether there's a firmware update available for the RAID controller.

Mentioned in SAL (#wikimedia-operations) [2019-12-16T09:14:19Z] <godog> upgrade hw raid firmware on ms-be2016 and reboot - T240798

fgiunchedi closed this task as Resolved. · Dec 16 2019, 9:25 AM

hw raid firmware upgraded, resolving