
Degraded RAID on aqs1006
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host aqs1006. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid10 sdg2[2] sdh2[3] sde2[0](F) sdf2[1]
      3076536320 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      bitmap: 4/23 pages [16KB], 65536KB chunk

md1 : active raid10 sda2[0] sdd2[3] sdb2[1] sdc2[2]
      3076536320 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 3/23 pages [12KB], 65536KB chunk

md0 : active raid10 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      48793600 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      
unused devices: <none>
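
For readers who want to check a snapshot like the one above by hand, here is a minimal Python sketch that parses mdstat-style text and lists the degraded arrays. It is not the Icinga plugin's actual logic; the function names `parse_mdstat` and `degraded` are hypothetical, and the parser only handles the active-array layout shown in this task.

```python
import re

# Snapshot copied from this task; any other mdstat text can be passed instead.
MDSTAT_SAMPLE = """\
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md2 : active raid10 sdg2[2] sdh2[3] sde2[0](F) sdf2[1]
      3076536320 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      bitmap: 4/23 pages [16KB], 65536KB chunk

md1 : active raid10 sda2[0] sdd2[3] sdb2[1] sdc2[2]
      3076536320 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 3/23 pages [12KB], 65536KB chunk

md0 : active raid10 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      48793600 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>
"""


def parse_mdstat(text):
    """Hypothetical helper: map each md array to its state and member counts."""
    arrays = {}
    current = None
    for line in text.splitlines():
        # Array header, e.g. "md2 : active raid10 sdg2[2] ... sde2[0](F)"
        header = re.match(r"^(md\d+) : (\S+) (\S+) (.*)$", line)
        if header:
            name, state, level, members = header.groups()
            current = {
                "state": state,
                "level": level,
                # Members flagged (F) are the failed ones.
                "failed": re.findall(r"(\w+)\[\d+\]\(F\)", members),
                "expected": None,
                "active": None,
            }
            arrays[name] = current
            continue
        # Status line, e.g. "[4/3] [_UUU]": expected members / active members.
        status = re.search(r"\[(\d+)/(\d+)\] \[[U_]+\]", line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
    return arrays


def degraded(arrays):
    """Names of arrays with fewer active members than expected."""
    return sorted(
        name
        for name, info in arrays.items()
        if info["active"] is not None and info["active"] < info["expected"]
    )


arrays = parse_mdstat(MDSTAT_SAMPLE)
print(degraded(arrays), arrays["md2"]["failed"])
```

On this snapshot the active members sum to 3 + 4 + 4 = 11 with one failed member (sde2 in md2), which agrees with the "Active: 11, Failed: 1" figures in the CRITICAL line.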

Event Timeline

A case has been opened up with HPE Support

Your case was successfully submitted. Please note your Case ID: 5333327393 for future reference.

Response from HPE:

Thank you so much for updating this case.

This is regarding case number: 5333327393.

The AHS logs are not showing any hard drives.

Can you please confirm whether the hard drive is located on an external storage device? If yes, please provide the details of the external storage device, including the serial number.

Also please provide the ADU report.

Nikhita Venugopal
Technical Solutions Consultant-Industry Standard Servers
Customer Solution Center, Hewlett Packard Enterprise

This will require the server to go down for about 20 minutes.

jijiki assigned this task to Cmjohnson. Oct 23 2018, 2:45 PM
jijiki triaged this task as Medium priority.
jijiki added a subscriber: Ottomata.

@Cmjohnson sure lemme know when it works best for you, it should take me ~5/10 mins to shut it down properly (worst case scenario)

Mentioned in SAL (#wikimedia-operations) [2018-10-25T15:36:11Z] <elukey> shutdown aqs1006 to replace one broken disk - T206915

I sent HP a diagnostic log showing disk 5 as failed

The disk is being sent and should arrive today or tomorrow

Milimetric moved this task from Incoming to Radar on the Analytics board. Oct 29 2018, 4:08 PM

@elukey the new disk arrived, I am happy to swap it whenever you're ready. It's the first disk on the server, and you will need to manually replace it in the RAID since it's SW RAID. Ping me on IRC when you're ready.

elukey closed this task as Resolved. Oct 31 2018, 5:59 PM

Re-added to the md2 array and rebuilt it, all good!
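
The task does not record the exact commands that were run. As a rough sketch only, the re-add step for a software (md) RAID member typically looks like the dry-run plan below. The helper name `readd_plan` is hypothetical, the device names are taken from the snapshot in the description and may differ after the swap, and partitioning the new disk beforehand is assumed to have been done already.

```python
def readd_plan(array, member):
    """Hypothetical helper: build mdadm invocations as argv lists.

    Nothing is executed here; the plan is only printed for review.
    """
    return [
        # Add the replacement partition back into the array; the rebuild
        # starts automatically once the member is accepted.
        ["mdadm", "--manage", array, "--add", member],
        # Inspect the array state and rebuild progress afterwards.
        ["mdadm", "--detail", array],
    ]


# Assumed names from the snapshot: array md2, failed member sde2.
for argv in readd_plan("/dev/md2", "/dev/sde2"):
    print(" ".join(argv))
```

Rebuild progress also shows up in /proc/mdstat, the same source the status snapshot in the description was taken from.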

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM