
Degraded RAID on ganeti2013
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ganeti2013. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10] 
md0 : active raid5 sdc1[2] sda1[0] sdd1[3] sdb1[1]
      1456128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      
md1 : active raid5 sda2[0] sdd2[3] sdb2[1]
      117086208 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]
      
md2 : active raid5 sdd3[3] sda3[0] sdb3[1]
      2225184768 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]
      bitmap: 1/6 pages [4KB], 65536KB chunk

unused devices: <none>
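
For readers unfamiliar with the `/proc/mdstat` format shown above, here is a minimal Python sketch of how the degraded arrays can be picked out of such a snapshot. It is not the `get-raid-status-md` plugin, whose implementation is not shown in this task; the names `parse_mdstat`, `degraded_arrays` and `SAMPLE` are hypothetical. It assumes only active arrays with a `[total/active] [U_...]` status line, as in this snapshot.

```python
import re

# Array header, e.g. "md1 : active raid5 sda2[0] sdd2[3] sdb2[1]"
ARRAY_RE = re.compile(r"^(md\d+)\s*:\s*(\w+)\s+(\w+)\s+(.*)$")
# Status at the end of the detail line, e.g. "[4/3] [UU_U]"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

# Trimmed copy of the snapshot attached to this task.
SAMPLE = """\
md0 : active raid5 sdc1[2] sda1[0] sdd1[3] sdb1[1]
      1456128 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
md1 : active raid5 sda2[0] sdd2[3] sdb2[1]
      117086208 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]
md2 : active raid5 sdd3[3] sda3[0] sdb3[1]
      2225184768 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]
      bitmap: 1/6 pages [4KB], 65536KB chunk
"""


def parse_mdstat(text):
    """Return {array: {"level", "members", "total", "active", "map"}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            arrays[current] = {
                "level": header.group(3),
                "members": header.group(4).split(),
            }
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            arrays[current]["total"] = int(status.group(1))
            arrays[current]["active"] = int(status.group(2))
            arrays[current]["map"] = status.group(3)
    return arrays


def degraded_arrays(text):
    """Return {array: [slot indices shown as "_"]} for degraded arrays only."""
    result = {}
    for name, info in parse_mdstat(text).items():
        missing = [i for i, c in enumerate(info.get("map", "")) if c == "_"]
        if missing:
            result[name] = missing
    return result


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

Run against the snapshot, this flags md1 and md2, each with slot 2 missing. In md0, slot 2 is held by sdc1, and no sdc partition appears in md1 or md2, which is consistent with a failing /dev/sdc.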

Event Timeline

VM kubestagetcd2002.codfw.wmnet switching disk type to drbd

VM kubestagetcd2002.codfw.wmnet switching disk type to plain

The server can be taken down for troubleshooting at any time; I have removed it from active service. I saw kernel messages on the console pointing to a broken /dev/sdc.

I realise the server has been out of warranty for some months now, but could we either use a disk from a decommed server (if we have one) or buy a replacement?

Papaul triaged this task as Medium priority. Nov 28 2022, 3:51 PM

@MoritzMuehlenhoff unfortunately this server is out of warranty.

Mentioned in SAL (#wikimedia-operations) [2022-12-01T09:07:41Z] <moritzm> rebuilding raid on ganeti2013 T323222

The RAID rebuild has completed and the server has been re-added to the cluster.