Maniphest T206915

Degraded RAID on aqs1006
Closed, Resolved (Public)
Assigned To

Authored By

	ops-monitoring-bot
	Oct 13 2018, 4:10 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host aqs1006. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid10 sdg2[2] sdh2[3] sde2[0](F) sdf2[1]
      3076536320 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      bitmap: 4/23 pages [16KB], 65536KB chunk

md1 : active raid10 sda2[0] sdd2[3] sdb2[1] sdc2[2]
      3076536320 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 3/23 pages [12KB], 65536KB chunk

md0 : active raid10 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      48793600 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      
unused devices: <none>
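
The snapshot above can be checked mechanically. The following is a minimal Python sketch, not the actual `get_raid_status_md` plugin (whose source is not shown in this task). It assumes the `/proc/mdstat` layout seen in the snapshot; `parse_mdstat` and `MDSTAT_SAMPLE` are hypothetical names introduced for illustration, and spare or rebuilding markers are not handled.

```python
import re

# Trimmed copy of the snapshot attached to this task.
MDSTAT_SAMPLE = """\
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md2 : active raid10 sdg2[2] sdh2[3] sde2[0](F) sdf2[1]
      3076536320 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      bitmap: 4/23 pages [16KB], 65536KB chunk

md1 : active raid10 sda2[0] sdd2[3] sdb2[1] sdc2[2]
      3076536320 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 3/23 pages [12KB], 65536KB chunk

md0 : active raid10 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      48793600 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array_name: info} for each md array found in mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        # Array header, e.g. "md2 : active raid10 sdg2[2] ... sde2[0](F)".
        header = re.match(r"^(md\d+) : (\S+) (\S+) (.*)$", line)
        if header:
            name, state, level, members = header.groups()
            devs = re.findall(r"(\w+)\[\d+\](\(F\))?", members)
            current = {
                "state": state,
                "level": level,
                "members": [dev for dev, _ in devs],
                "failed": [dev for dev, flag in devs if flag],
            }
            arrays[name] = current
            continue
        # Status line, e.g. "... [4/3] [_UUU]": expected vs. active members.
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
            current["degraded"] = current["active"] < current["expected"]
    return arrays


arrays = parse_mdstat(MDSTAT_SAMPLE)
degraded = [name for name, info in arrays.items() if info.get("degraded")]
total_active = sum(info["active"] for info in arrays.values())
total_failed = sum(len(info["failed"]) for info in arrays.values())
print(degraded, total_active, total_failed)  # ['md2'] 11 1
```

The totals agree with the `Active: 11` and `Failed: 1` figures in the CRITICAL line, and the only degraded array is md2, whose failed member is `sde2`.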

Related Objects

Duplicates Merged Here: T207958: Degraded RAID on aqs1006
T207964: Degraded RAID on aqs1006

Event Timeline

ops-monitoring-bot added projects: SRE, ops-eqiad (Oct 13 2018, 4:10 PM)

ops-monitoring-bot subscribed.

MoritzMuehlenhoff added a subscriber: elukey (Oct 15 2018, 6:58 AM)

Cmjohnson moved this task from Backlog to High Priority Task on the ops-eqiad board (Oct 15 2018, 4:49 PM)

A case has been opened up with HPE Support

Your case was successfully submitted. Please note your Case ID: 5333327393 for future reference.

Response from HPE:

Thank you so much for updating this case.

This is regarding case number: 5333327393.

The AHS logs are not showing any hard drives.

Can you please confirm whether the hard drive is located on an external storage device? If so, please provide the details of the external storage device, including its serial number.

Also please provide the ADU report.

Nikhita Venugopal
Technical Solutions Consultant-Industry Standard Servers
Customer Solution Center,Hewlett Packard Enterprise

This will require the server to go down for about 20 minutes.

jijiki assigned this task to Cmjohnson (Oct 23 2018, 2:45 PM)

jijiki triaged this task as Medium priority.

jijiki added a subscriber: Ottomata.

@Cmjohnson sure lemme know when it works best for you, it should take me ~5/10 mins to shut it down properly (worst case scenario)

jijiki added a project: Analytics (Oct 23 2018, 2:46 PM)

Mentioned in SAL (#wikimedia-operations) [2018-10-25T15:36:11Z] <elukey> shutdown aqs1006 to replace one broken disk - T206915

elukey merged a task: T207964: Degraded RAID on aqs1006 (Oct 25 2018, 4:20 PM)

I sent HP a diagnostic log showing disk 5 as failed

SystemDiags.log (9 KB)

SystemDiagsCeeHistory.log (2 KB)

jijiki merged a task: T207958: Degraded RAID on aqs1006 (Oct 26 2018, 8:36 AM)

https://www.thegeekdiary.com/replacing-a-failed-mirror-disk-in-a-software-raid-array-mdadm/ is a good reference about how to swap the disk

The disk is being sent and should arrive today or tomorrow

Milimetric moved this task from Incoming to Radar on the Analytics board (Oct 29 2018, 4:08 PM)

@elukey the new disk arrived, and I am happy to swap it whenever you're ready. It's the first disk on the server, and you will need to manually replace it in the RAID since it's software RAID. Ping me on IRC when you're ready.

Re-added to the md2 array and rebuilt it, all good!
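
For reference, the usual mdadm sequence for this kind of swap can be sketched as a dry-run plan. This is an illustration under assumptions, not a record of what was actually run on aqs1006: `replacement_plan` is a hypothetical helper, `/dev/sdf` is assumed to be a healthy disk with the same partition layout, the replacement disk is assumed to come back as `/dev/sde`, and cloning the layout with `sfdisk` is assumed to be appropriate for this host. The script only prints commands and executes nothing.

```python
def replacement_plan(array, failed_part, healthy_disk, new_disk):
    """Return, in order, the shell commands for swapping one md member."""
    return [
        # Mark the member as failed. In the snapshot sde2 is already
        # flagged (F), so this step may be redundant.
        f"mdadm --manage {array} --fail {failed_part}",
        # Detach the member so the physical disk can be pulled.
        f"mdadm --manage {array} --remove {failed_part}",
        # After the hardware swap: copy the partition layout from a
        # healthy peer onto the new disk (assumption, see lead-in).
        f"sfdisk -d {healthy_disk} | sfdisk {new_disk}",
        # Add the new partition back; the kernel then starts the resync.
        f"mdadm --manage {array} --add {failed_part}",
        # Check rebuild progress.
        "cat /proc/mdstat",
    ]


for cmd in replacement_plan("/dev/md2", "/dev/sde2", "/dev/sdf", "/dev/sde"):
    print(cmd)
```

Keeping the plan as plain strings makes it easy to review with the service owner before anything touches the array.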

Aklapper edited projects, added Analytics-Radar; removed Analytics (Jun 10 2020, 6:44 AM)

	F26794607: SystemDiags.log
	Oct 25 2018, 4:22 PM
