Degraded RAID on wtp2017
Closed, Resolved, Public

Assigned To

Authored By

	ops-monitoring-bot
	Nov 10 2017, 10:55 AM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host wtp2017. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [raid1] 
md1 : active (auto-read-only) raid1 sda2[0]
      439426048 blocks super 1.2 [2/1] [U_]
      bitmap: 4/4 pages [16KB], 65536KB chunk

md0 : active raid1 sda1[0]
      48794624 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
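
The snapshot above shows both arrays as `[2/1] [U_]`: two members expected, one active, and the second slot missing. As a rough sketch (this is not the `get_raid_status_md` plugin, whose source is not shown in this task), the same condition could be spotted by scanning `/proc/mdstat`-style text for a member map that contains `_`. The function name `degraded_md_arrays` is made up for illustration.

```shell
# Print the name of every md array whose member map (e.g. [U_]) shows a
# missing member. Reads /proc/mdstat-style text from the file given as $1,
# defaulting to the live /proc/mdstat.
degraded_md_arrays() {
    awk '
        # Header line such as "md1 : active raid1 sda2[0]": remember the name.
        /^md[0-9]+ :/ { name = $1 }
        # Status line such as "... [2/1] [U_]": the last field is the member map.
        name != "" && $NF ~ /^\[[U_]+\]$/ {
            if ($NF ~ /_/) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Run against the snapshot in this task, it would list md1 and md0; with no argument it reads the live `/proc/mdstat`.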

Related Objects

Duplicates Merged Here: T181069: Degraded RAID on wtp2017

Event Timeline

ops-monitoring-bot added projects: ops-codfw, SRE. Nov 10 2017, 10:55 AM

ops-monitoring-bot subscribed.

Restricted Application added a subscriber: Aklapper. Nov 10 2017, 10:55 AM

MoritzMuehlenhoff assigned this task to Papaul. Nov 10 2017, 10:59 AM

MoritzMuehlenhoff triaged this task as Medium priority.

@Joe or @akosiaris: since this system uses software RAID, the hardware log shows nothing about which disk is bad, and the system diagnostics came up with no errors. Can you please pull a log at the OS level that can help me determine which disk is bad, so I can submit it to Dell? Or is this another bug (false alert)?

Thanks

No, this is not a false alarm. One of the 2 disks has indeed failed, and it seems so badly that the system cannot even probe it anymore. What I could do is find out the serial number of the non-broken disk: WD-WMAYP0E3RCZJ. The broken disk should be the other one.
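
The comment above does not say how the serial number was read. One common approach, assuming smartmontools is installed and the surviving disk is /dev/sda (as the mdstat snapshot suggests), is to pull the "Serial Number" line out of `smartctl -i`. The helper name `serial_from_smartctl` is hypothetical.

```shell
# Extract the serial number from `smartctl -i` output read on stdin.
# Assumes the "Serial Number:" line that smartctl prints for ATA disks.
serial_from_smartctl() {
    sed -n 's/^Serial Number:[[:space:]]*//p'
}

# Example (needs root and smartmontools; /dev/sda is an assumption):
#   sudo smartctl -i /dev/sda | serial_from_smartctl
```

The printed serial can then be compared with the labels on the physical drives to tell the healthy disk from the failed one.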

@akosiaris thank you.

I called Dell; they said that the server's parts warranty has expired. I had some spare 500G SATA disks on site, so I used one for the system.

The system is back up.

faidon merged a task: T181069: Degraded RAID on wtp2017. Nov 21 2017, 5:09 PM

I've added the disk to the RAID array. For those interested, the commands were:

dd if=/dev/sda of=/dev/sdb bs=512 count=1 # (copy the MBR, including the partition table, from the first disk)
fdisk /dev/sdb # (w, quit. I just used fdisk as a quick alternative to partprobe/kpartx)
mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sdb2

md0 has already resynced, md1 is resyncing
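
To follow the rebuild of md1, the recovery percentage can be read from `/proc/mdstat`. A minimal sketch, assuming the usual kernel progress line of the form `recovery =  8.5% (...)`; the function name `md_resync_progress` is hypothetical and not part of mdadm.

```shell
# Print "<array> <percent>" for every md array that is currently
# rebuilding. Reads /proc/mdstat-style text from the file given as $1,
# defaulting to the live /proc/mdstat.
md_resync_progress() {
    awk '
        # Header line such as "md1 : active raid1 sdb2[2] sda2[0]".
        /^md[0-9]+ :/ { name = $1 }
        # Progress line such as "[=>...]  recovery =  8.5% (...) finish=...".
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) {
                    pct = $i
                    sub(/^=/, "", pct)   # the kernel may print "=100.0%" as one field
                    print name, pct
                    break
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Arrays that have finished syncing print nothing, so empty output means no rebuild is in progress. `mdadm --detail /dev/md1` reports the same state in more detail.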

I am resolving this. @Papaul Thanks for handling it!

Mentioned in SAL (#wikimedia-operations) [2017-11-22T09:06:22Z] <akosiaris@tin> Started deploy [parsoid/deploy@b150764]: T180211

Mentioned in SAL (#wikimedia-operations) [2017-11-22T09:11:26Z] <akosiaris@tin> Finished deploy [parsoid/deploy@b150764]: T180211 (duration: 05m 05s)

@akosiaris you're welcome.
