
Degraded RAID on ms-be2032
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be2032. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]
      
md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
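
As a reading aid for the snapshot above, here is a minimal Python sketch that parses mdstat-style text and reports which arrays are degraded. It is not the `get-raid-status-md` plugin or the Icinga check; the names `parse_mdstat` and `summarize_arrays` are hypothetical, and the parser only assumes the line layout seen in this snapshot (active arrays with a RAID level, followed by a `[n/m] [U_]` status line).

```python
import re

# Snapshot text copied from the task description above.
MDSTAT = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0] sdb1[1](F)
      58559488 blocks super 1.2 [2/1] [U_]

md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

# "mdN : <state> [(flag)] <level> <members...>"; inactive arrays (which have
# no level field) are not handled by this sketch.
ARRAY_RE = re.compile(r"^(md\d+) : (\S+)(?: \([^)]*\))? (\S+) (.*)$")
# "[expected/up] [U_]" on the line that follows the array header.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")
# "sdb1[1](F)": member name, slot number, optional failed marker.
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\(F\))?")


def parse_mdstat(text):
    """Return {array_name: info} for every array header found in text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, state, level, members = header.groups()
            found = MEMBER_RE.findall(members)
            current = {
                "state": state,
                "level": level,
                "members": [dev for dev, _ in found],
                "failed": [dev for dev, flag in found if flag],
            }
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["up"] = int(status.group(2))
            current["degraded"] = "_" in status.group(3)
    return arrays


def summarize_arrays(arrays):
    """Aggregate per-array data into a small summary dictionary."""
    return {
        "degraded": sorted(n for n, a in arrays.items() if a.get("degraded")),
        "active": sum(a.get("up", 0) for a in arrays.values()),
        "failed": sum(len(a["failed"]) for a in arrays.values()),
    }


if __name__ == "__main__":
    print(summarize_arrays(parse_mdstat(MDSTAT)))
```

For this snapshot the sketch flags only md0 as degraded, with sdb1 as the failed member; md1 still lists both of its members as up.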

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 26 2019, 5:45 AM

Mentioned in SAL (#wikimedia-operations) [2019-06-26T07:30:37Z] <godog> powercycle ms-be2032 - T226600

Inaccessible via SSH:

$ ssh ms-be2032.codfw.wmnet
Linux ms-be2032 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u2 (2019-05-13) x86_64
Debian GNU/Linux 9.9 (stretch)
ms-be2032 is a statsite server (statsite)
ms-be2032 is a swift storage brick (swift::storage)
The last Puppet run was at Wed Jun 26 05:26:19 UTC 2019 (123 minutes ago). 
-bash: /etc/bash.bashrc: Input/output error
-bash: /usr/share/bash-completion/bash_completion: Input/output error
-bash: /usr/bin/lesspipe: Input/output error
-bash: /usr/bin/tput: Input/output error
-bash: /usr/bin/tput: Input/output error
-bash: /usr/bin/tput: Input/output error
-bash: /usr/bin/tput: Input/output error
Connection to ms-be2032.codfw.wmnet closed.

console:

[2468466.458531] sd 0:1:0:0: rejecting I/O to offline device
[2468479.129482] sd 0:1:0:0: rejecting I/O to offline device
[2468479.155547] sd 0:1:0:0: rejecting I/O to offline device
[2468479.181390] sd 0:1:0:0: rejecting I/O to offline device
[2468479.207176] sd 0:1:0:0: rejecting I/O to offline device
[2468479.233050] sd 0:1:0:0: rejecting I/O to offline device
[2468479.258928] sd 0:1:0:0: rejecting I/O to offline device
[2468479.284776] sd 0:1:0:0: rejecting I/O to offline device
[2468479.310782] sd 0:1:0:0: rejecting I/O to offline device
[2468479.336650] sd 0:1:0:0: rejecting I/O to offline device
[2468479.363039] sd 0:1:0:0: rejecting I/O to offline device
[2468479.389075] sd 0:1:0:0: rejecting I/O to offline device
[2468479.414919] sd 0:1:0:0: rejecting I/O to offline device
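
The console excerpt above repeats a single message for a single SCSI address (in Linux naming, host:channel:target:lun). A minimal Python sketch for tallying such lines by device and message follows; the helper name `count_messages` is hypothetical, and the sample is limited to the first three lines of the excerpt.

```python
import re
from collections import Counter

# First three lines of the console excerpt above.
CONSOLE = """\
[2468466.458531] sd 0:1:0:0: rejecting I/O to offline device
[2468479.129482] sd 0:1:0:0: rejecting I/O to offline device
[2468479.155547] sd 0:1:0:0: rejecting I/O to offline device
"""

# "[seconds-since-boot] sd H:C:T:L: message"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\] sd (\d+:\d+:\d+:\d+): (.+)$")


def count_messages(text):
    """Count kernel 'sd' messages per (SCSI address, message) pair."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            _, address, message = match.groups()
            counts[(address, message)] += 1
    return counts


if __name__ == "__main__":
    for (address, message), n in count_messages(CONSOLE).items():
        print(f"{address}: {n} x {message}")
```
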
fgiunchedi added a comment. · Edited · Jun 26 2019, 7:42 AM

The host came back clean after a reboot. I've updated the RAID controller firmware to 6.88 (cf. T141756) and rebooted again.

fgiunchedi closed this task as Resolved. · Jun 26 2019, 7:47 AM
fgiunchedi claimed this task.