
Degraded RAID on cloudelastic1002
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host cloudelastic1002. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md1 : active raid10 sdd3[3] sda3[0] sdc3[2] sde3[4] sdf3[5]
      5479237632 blocks super 1.2 512K chunks 2 near-copies [6/5] [U_UUUU]
      bitmap: 8/41 pages [32KB], 65536KB chunk

md0 : active raid10 sdc2[2] sda2[0] sdd2[3] sde2[4] sdf2[5]
      146386944 blocks super 1.2 512K chunks 2 near-copies [6/5] [U_UUUU]
      
unused devices: <none>
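In the snapshot above, `[6/5]` means the array expects 6 members but only 5 are active, and the `_` in `[U_UUUU]` marks the missing slot (sdb dropped out of both md0 and md1). As a minimal sketch (not part of the monitoring plugin itself), a degraded array can be detected by looking for a `_` inside the member-status brackets; the sample string below mimics the snapshot, while on a live host you would read `/proc/mdstat` instead:

```shell
# Hypothetical check: flag an md array as degraded when its status
# bracket contains a "_" (a missing member). Sample text mirrors the
# md1 lines above; replace it with $(cat /proc/mdstat) on a real host.
mdstat='md1 : active raid10 sdd3[3] sda3[0] sdc3[2] sde3[4] sdf3[5]
      5479237632 blocks super 1.2 512K chunks 2 near-copies [6/5] [U_UUUU]'
if printf '%s\n' "$mdstat" | grep -Eq '\[[U_]*_[U_]*\]'; then
  state=degraded
else
  state=clean
fi
echo "$state"
```

The bracket pattern requires at least one `_` between `[` and `]`, so healthy arrays (`[UUUUUU]`) and the unrelated `[6/5]` counter do not match.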

Event Timeline

wiki_willy added a subscriber: wiki_willy.

Looks like this one is a duplicate of T230088

Volans removed Jclark-ctr as the assignee of this task.
Volans added a project: Discovery-ARCHIVED.

Re-opening: this has not yet been solved at the md software RAID layer. Icinga is still critical and /proc/mdstat still reports the degraded status above.

Mentioned in SAL (#wikimedia-operations) [2019-12-10T10:06:39Z] <onimisionipe> add new disk to RAID array on cloudelastic1002 - T239957

jcrespo triaged this task as Medium priority.
jcrespo added a subscriber: jcrespo.

Assigning to Mathew based on above update as part of clinic duty. Feel free to revert if this is wrong.

This is resolved now.
I worked with Filippo to fix this via the following commands:

sfdisk -d /dev/sda | sfdisk /dev/sdb   # copy the partition layout from a healthy member to the replacement disk
mdadm /dev/md0 --add /dev/sdb2 && mdadm /dev/md1 --add /dev/sdb3   # add the new partitions back into both arrays
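Once the partitions are added, md rebuilds the arrays in the background and `/proc/mdstat` shows a recovery progress line. As an illustrative sketch (the values below are made up, not from this incident), the completion percentage can be pulled out of that line like so; on a live host you would grep `/proc/mdstat` or use `mdadm --detail /dev/md0`:

```shell
# Hedged example: extract the rebuild percentage from a /proc/mdstat
# recovery line. The sample line is illustrative only.
line='      [==>..................]  recovery = 12.6% (690000000/5479237632) finish=120.0min speed=100000K/sec'
pct=$(printf '%s\n' "$line" | sed -n 's/.*recovery = \([0-9.]*\)%.*/\1/p')
echo "$pct"
```

When recovery finishes, the status brackets return to `[6/6] [UUUUUU]` and the Icinga check clears.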

This should be documented in wikitech.