Degraded RAID on ms-be1020
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be1020. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdb1[1](F) sda1[0]
      58559488 blocks super 1.2 [2/1] [U_]
      
md1 : active (auto-read-only) raid1 sdb2[1] sda2[0]
      976320 blocks super 1.2 [2/2] [UU]
      	resync=PENDING
      
unused devices: <none>
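
For reference, the snapshot above can be checked mechanically. The following is a minimal sketch in Python, not the actual `get_raid_status_md` plugin (whose implementation is not shown in this task). It assumes the `/proc/mdstat`-style layout seen above; the names `parse_mdstat`, `ARRAY_RE`, `STATUS_RE` and `SAMPLE` are hypothetical and exist only for illustration.

```python
import re

# Array header line, e.g. "md0 : active raid1 sdb1[1](F) sda1[0]".
# The optional group skips a qualifier such as "(auto-read-only)".
ARRAY_RE = re.compile(r"^(md\d+) : (\S+)(?: \([^)]*\))? (\S+) (.*)$")
# Member status field, e.g. "[2/1] [U_]" (expected/active devices, per-slot flags).
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")

# Copy of the snapshot attached to this task (assumed to be representative).
SAMPLE = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[1](F) sda1[0]
      58559488 blocks super 1.2 [2/1] [U_]

md1 : active (auto-read-only) raid1 sdb2[1] sda2[0]
      976320 blocks super 1.2 [2/2] [UU]
      \tresync=PENDING

unused devices: <none>
"""


def parse_mdstat(text):
    """Return a dict keyed by array name with state, level, failed members and a degraded flag."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, state, level, members = header.groups()
            current = {
                "state": state,
                "level": level,
                # Members marked "(F)" have been flagged as faulty by md.
                "failed": re.findall(r"(\w+)\[\d+\]\(F\)", members),
                "degraded": False,
            }
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, active, flags = status.groups()
            # Degraded when fewer devices are active than expected,
            # or when any slot is shown as "_" instead of "U".
            current["degraded"] = int(active) < int(expected) or "_" in flags
    return arrays


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE).items():
        print(name, info)
```

On the snapshot above this flags md0 as degraded with sdb1 failed, and md1 as healthy. The degraded test relies only on the `[n/m]` and `[U_]` fields, since those are what the output in this task exposes.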

Event Timeline

elukey added subscribers: elukey, Cmjohnson, fgiunchedi.

Reporting in here as well:

SSH:

#~ ssh ms-be1020.eqiad.wmnet
-bash: /usr/share/bash-completion/bash_completion: Input/output error
elukey@ms-be1020:~$ df -h
-bash: /bin/df: Input/output error
elukey@ms-be1020:~$ dmesg
Bus error
elukey@ms-be1020:~$ ls
elukey@ms-be1020:~$ ls -l
total 0
elukey@ms-be1020:~$ sudo less /var/log/syslog
sudo: unable to execute /usr/bin/less: Input/output error

Console COM2:

[8383659.879842] sd 0:1:0:0: rejecting I/O to offline device
[8383659.909204] sd 0:1:0:0: rejecting I/O to offline device
[8383659.936361] sd 0:1:0:0: rejecting I/O to offline device
[8383659.967836] sd 0:1:0:0: rejecting I/O to offline device
[8383659.996775] sd 0:1:0:0: rejecting I/O to offline device
[8383660.159268] sd 0:1:0:0: rejecting I/O to offline device
[8383660.189448] sd 0:1:0:0: rejecting I/O to offline device

Given that this is a Swift backend and it should get depooled by the frontends as soon as it stops answering correctly (which seems confirmed by the flatline of network activity), I'd be inclined to wait for @fgiunchedi's opinion/debugging before shutting it down (or doing anything else, though since Chris is away we cannot do much).

Mentioned in SAL (#wikimedia-operations) [2019-01-27T16:22:24Z] <godog> powercycle ms-be1020 - T214778

fgiunchedi claimed this task.

Host is back after a powercycle, looks like the raid controller freaked out. Leaving this open to upgrade the controller firmware.

This is an HP server. While the f/w can probably be updated remotely, it would be best if I did the update on-site with the service pack, so I can update everything else at the same time.

Sounds good to take the host offline and upgrade via the service pack. Let me know when you are available and I'll take the host offline.

@fgiunchedi Let's do this on Monday if you are available, now that ms-be1033 is working again.

CDanis added a subscriber: CDanis. Feb 22 2019, 5:29 PM

@Cmjohnson sounds good, let me know when you are ready to go and I'll power off the host.

Cmjohnson closed this task as Resolved. Mar 25 2019, 4:24 PM

This server's RAID appears to be in optimal condition. I verified the h/w, and Icinga is not reporting a degraded RAID. Resolving for now.

The firmware upgrade was pending on this host; is that happening?