Degraded RAID on wtp2017
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host wtp2017. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [raid1] 
md1 : active (auto-read-only) raid1 sda2[0]
      439426048 blocks super 1.2 [2/1] [U_]
      bitmap: 4/4 pages [16KB], 65536KB chunk

md0 : active raid1 sda1[0]
      48794624 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
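
In the snapshot above, `[2/1]` means each array expects two members but only one is active, and the `_` in `[U_]` marks the missing slot. A minimal sketch of how that check could be scripted is below. `degraded_arrays` is a hypothetical helper name, not part of the Nagios plugin, and it assumes the mdstat layout shown above.

```shell
# Hypothetical helper (not part of get_raid_status_md): print the name of
# every md array whose member status string, e.g. [U_], contains "_".
# Reads the file given as $1, or /proc/mdstat when no argument is passed.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array
        /\[[U_]+\]/ {                        # line carrying the [UU]/[U_] map
            status = $NF
            if (name != "" && index(status, "_") > 0) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Run against the snapshot above, this would list both md1 and md0, since each shows `[U_]`.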

Event Timeline

Restricted Application added a subscriber: Aklapper. Nov 10 2017, 10:55 AM
MoritzMuehlenhoff triaged this task as Medium priority.

@Joe or @akosiaris: since this system uses software RAID, the hardware log does not show which disk is bad, and the system diagnostics came up with no errors. Can you please pull a log at the OS level that can help me determine which disk is bad, so I can submit it to Dell? Or is this another bug (a false alert)?

Thanks

No, this is not a false alarm. One of the 2 disks has indeed failed, and it seems so badly that the system cannot even probe it anymore. What I could do is find out the serial number of the non-broken disk. That would be WD-WMAYP0E3RCZJ. The broken disk should be the other one.
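
The comment above does not say how the serial number was read. One common way, shown here only as a sketch under that caveat, is to filter the output of `smartctl -i` (from smartmontools) for its `Serial Number:` line. `serial_from_smartctl` is a hypothetical helper name, and it assumes the ATA-style label that smartctl prints for SATA disks.

```shell
# Hypothetical helper: read `smartctl -i` output on stdin and print the
# value of the "Serial Number:" line (ATA/SATA label format assumed).
serial_from_smartctl() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Example usage on the surviving disk (requires root and smartmontools):
#   sudo smartctl -i /dev/sda | serial_from_smartctl
```

The serial printed for the working disk can then be compared with the labels on the physical drives to identify the failed one by elimination.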

Papaul reassigned this task from Papaul to akosiaris. Nov 21 2017, 4:47 PM
Papaul added a subscriber: Papaul.

Called Dell; they said that the server's parts warranty has expired, so I used one of the spare 500G SATA disks I have on site for the system.

The system is back up.

akosiaris closed this task as Resolved. Nov 22 2017, 9:04 AM

I've added the disk to the RAID array. For those interested, the commands were:

dd if=/dev/sda of=/dev/sdb bs=512 count=1 # (copy the MBR, including the partition table, from the first disk)
fdisk /dev/sdb # (w, quit. I just used fdisk as a quick alternative to partprobe/kpartx)
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2

md0 has already resynced; md1 is still resyncing.
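
While an array rebuilds, `/proc/mdstat` shows a progress line containing `recovery = N%` (or `resync = N%`) under that array. A minimal sketch for pulling out just the percentage is below. `resync_progress` is a hypothetical helper name, and the sketch assumes the usual mdstat progress-line layout rather than anything recorded in this task.

```shell
# Hypothetical helper: print "<array> <percent>" for every md array that
# currently shows a recovery or resync progress line.
# Reads the file given as $1, or /proc/mdstat when no argument is passed.
resync_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array
        /(recovery|resync) *=/ {             # progress line for that array
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print name, $i
        }
    ' "${1:-/proc/mdstat}"
}

# Example usage: poll every 30 seconds until the rebuild finishes.
#   watch -n 30 cat /proc/mdstat
```

An array that has finished rebuilding produces no output, so an empty result combined with `[UU]` in the status line indicates the mirror is whole again.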

I am resolving this. @Papaul Thanks for handling it!

Mentioned in SAL (#wikimedia-operations) [2017-11-22T09:06:22Z] <akosiaris@tin> Started deploy [parsoid/deploy@b150764]: T180211

Mentioned in SAL (#wikimedia-operations) [2017-11-22T09:11:26Z] <akosiaris@tin> Finished deploy [parsoid/deploy@b150764]: T180211 (duration: 05m 05s)