Degraded RAID on ms-be2016
Closed, Resolved · Public

Assigned To

Authored By

	ops-monitoring-bot
	Dec 15 2019, 8:29 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be2016. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      	resync=PENDING
      
md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
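
For readers who want to check a snapshot like the one above by script, here is a minimal sketch of a `/proc/mdstat`-style parser. It is not the code of `get-raid-status-md` (whose implementation is not shown in this task); the function names `parse_mdstat` and `degraded` are hypothetical, and the sample text is copied from the snapshot above.

```python
import re

# Sample input copied from the snapshot in this task.
MDSTAT = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
        resync=PENDING

md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: {"failed": [...], "expected": int, "active": int}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+) : ", line)
        if header:
            current = header.group(1)
            # Member devices look like "sdb1[1]" with an optional "(F)" flag.
            members = re.findall(r"(\w+)\[\d+\](\(F\))?", line)
            arrays[current] = {
                "failed": [dev for dev, flag in members if flag],
                "expected": None,
                "active": None,
            }
            continue
        # Status lines carry "[expected/active] [U_]" for the current array.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[[U_]+\]", line)
        if status and current:
            arrays[current]["expected"] = int(status.group(1))
            arrays[current]["active"] = int(status.group(2))
    return arrays


def degraded(arrays):
    """Names of arrays with a failed member or fewer active than expected."""
    return sorted(
        name
        for name, info in arrays.items()
        if info["failed"]
        or (info["active"] is not None and info["active"] < info["expected"])
    )


if __name__ == "__main__":
    parsed = parse_mdstat(MDSTAT)
    for name in degraded(parsed):
        print(name, "degraded, failed members:", parsed[name]["failed"])
```

On this snapshot the sketch flags only md0, with sdb1 as the failed member; md1 has both members up and is merely waiting for a resync.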

Event Timeline

ops-monitoring-bot added projects: ops-codfw, SRE. Dec 15 2019, 8:29 PM

ops-monitoring-bot subscribed.

Restricted Application added a subscriber: Aklapper. Dec 15 2019, 8:29 PM

I can't log into the host, neither via SSH nor via the mgmt. Papaul, can you have a look the next time you're in the DC?

Logins are failing with "Connection closed by UNKNOWN port 65535"

This happens after the key is offered for authentication.

The mgmt on ms-be2016 actually works. It's one of those old servers with a very outdated SSH version that doesn't negotiate current SSH kexes, but a login works when explicitly selecting "diffie-hellman-group14-sha1". After logging in, no console is available; it just prints "rejecting I/O to offline device" every second.
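
As a sketch of what "explicitly selecting" the key exchange can look like with an OpenSSH client, the snippet below builds an `ssh` command line using the `KexAlgorithms` option. The host name `ms-be2016.mgmt.example`, the user `root`, and the helper name `legacy_ssh_command` are placeholders for illustration, not values taken from this task.

```python
# Key exchange algorithm named in the comment above.
LEGACY_KEX = "diffie-hellman-group14-sha1"


def legacy_ssh_command(host, user="root", kex=LEGACY_KEX):
    """Build an ssh argv that offers only the given key exchange algorithm.

    host and user are hypothetical placeholders; adjust to the real
    management interface address and account.
    """
    return ["ssh", "-o", f"KexAlgorithms={kex}", f"{user}@{host}"]


if __name__ == "__main__":
    cmd = legacy_ssh_command("ms-be2016.mgmt.example")
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(cmd))
```

Passing the option on the command line keeps the weaker algorithm limited to this one connection instead of enabling it for every host.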

Mentioned in SAL (#wikimedia-operations) [2019-12-16T08:53:49Z] <moritzm> powercycling ms-be2016 T240798

After a power cycle the host is up just fine again; there's nothing in kern/syslog from the time of the crash. Filippo is checking whether there's a firmware update available for the RAID controller.

Mentioned in SAL (#wikimedia-operations) [2019-12-16T09:14:19Z] <godog> upgrade hw raid firmware on ms-be2016 and reboot - T240798

hw raid firmware upgraded, resolving
