Degraded RAID on db2010
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db2010. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 0
			Drive's position: DiskGroup: 0, Span: 0, Arm: 0
			Media Error Count: 13426
			Other Error Count: 122
			Predictive Failure Count: =====> 423 <=====
			Last Predictive Failure Event Seq Number: 113300

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 42C (107.60 F)

		Span: 1 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 1, Arm: 0
			Media Error Count: 14786
			Other Error Count: 0
			Predictive Failure Count: =====> 301 <=====
			Last Predictive Failure Event Seq Number: 113301

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 42C (107.60 F)

		Span: 2 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 5
			Drive's position: DiskGroup: 0, Span: 2, Arm: 1
			Media Error Count: 3056
			Other Error Count: 34
			Predictive Failure Count: =====> 271 <=====
			Last Predictive Failure Event Seq Number: 113240

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 43C (109.40 F)

		Span: 4 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 4, Arm: 0
			Media Error Count: 9
			Other Error Count: 0
			Predictive Failure Count: =====> 301 <=====
			Last Predictive Failure Event Seq Number: 113302

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 42C (107.60 F)

=== RaidStatus completed
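For reference, a snapshot like the one above can be scanned programmatically when deciding which drive to pull first. This is a minimal, hypothetical Python sketch (not part of the Nagios/Icinga tooling); the severity heuristic (failed firmware state first, then highest media error count) is an assumption, not necessarily the policy followed in this task:

```python
import re

def parse_pds(raid_status):
    """Return one dict of "field: value" pairs per physical drive."""
    drives = []
    current = None
    for raw in raid_status.splitlines():
        # Drop the "=====> ... <=====" highlight markers the plugin adds.
        line = re.sub(r"=+>\s*|\s*<=+", "", raw).strip()
        if re.match(r"PD: \d+ Information$", line):
            current = {}
            drives.append(current)
        elif current is not None and ": " in line:
            key, _, value = line.partition(": ")
            current[key] = value
    return drives

def worst_first(drives):
    """One possible ordering: failed drives first, then most media errors."""
    return sorted(
        drives,
        key=lambda d: (
            d.get("Firmware state") != "Failed",   # False sorts before True
            -int(d.get("Media Error Count", "0")),
        ),
    )
```

Applied to the snapshot above, this would single out the drive in slot 5 (the only one in a `Failed` firmware state) before the drives that only show error counters.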

Event Timeline

Marostegui added subscribers: Papaul, Marostegui.

@Papaul please change the disk whenever you can.
Thanks!

@Marostegui, are we sure we want this done, rather than getting rid of the host directly? It is a very old host and we have its replacements set up. I would ask how many spare disks we have left, if any, and only change it if we have plenty; otherwise, start cloning it.

If we have spare disks, I would just replace it for now and do the cloning without any rush. But it is really up to you, if you don't mind getting this on your priority list, as I will be on holidays soon! :-)

@Papaul Do you have plenty of old 300GB disks that would not be used otherwise, or should we speed up the decommissioning (it will happen eventually, but right now we have other priorities)?

@jcrespo technically we do not have any 300GB spare disks. I am trying to load my Google spreadsheet for server decommissioning to see if we do have a server with 300GB disks, but can't at the moment; once I am able to, I will update the task.

Thanks.

@Papaul no problem- do not work too hard, we may replace the full server soon.

@jcrespo db200[1-9] all have 12x300GB disks; we can pull one out and use it for db2010 for now.

Yes, those are unused, you can use one of those with no problem. Please do if it doesn't take much of your time, thank you.

@jcrespo on db2010 I have 5 bad disks; is there any particular order you want them replaced in?

5? wow. I would say 1 at a time, and we check they rebuild correctly. Do not necessarily wait; we can do a couple per day when you are around (normally it takes a few hours to rebuild each disk). What do you think? You can tell me which ones you change and I will tell you when they finish rebuilding.

Let's start by changing only Slot Number: 5 (DiskGroup: 0, Span: 2, Arm: 1), which is the only one that has really failed (the others should have only correctable errors). I can also help identify it by blinking its LED if you have problems.
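For reference (these commands are not from this task's log), the stock MegaCli invocations for locating and tracking a drive swap look roughly like this; the binary name and path vary by install (`MegaCli`, `MegaCli64`, or under `/opt/MegaRAID/MegaCli/`), and `[32:5]` is enclosure 32, slot 5 as reported in the snapshot above:

```shell
# Blink the locate LED on enclosure 32, slot 5 (the failed drive)
sudo MegaCli -PdLocate -start -physdrv[32:5] -a0

# ...swap the drive, then stop blinking
sudo MegaCli -PdLocate -stop -physdrv[32:5] -a0

# Watch rebuild progress on the replacement
sudo MegaCli -PDRbld -ShowProg -physdrv[32:5] -a0
```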

Papaul triaged this task as High priority. Edited Sep 11 2017, 2:56 PM

Ok, I have the disk in slot 5 replaced.

jcrespo lowered the priority of this task from High to Low. Sep 12 2017, 11:55 AM

Slot 5 got rebuilt correctly; let's go with Slot Number: 0 now (much lower priority). It has ~13K media errors.

Disk replacement in slot 0 complete.

Still on Firmware state: Rebuild, so we will wait a bit for the next one. (I am being a bit more cautious than I have to be with the RAID 10, because the disks are not new, so there is a chance for those to fail, too.)
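The gating rule here (wait until nothing is still rebuilding before pulling the next disk) can be checked against the plugin output. A hypothetical helper, assuming the same output format as the snapshot in the description:

```python
def firmware_states(raid_status):
    """Map each slot number to its firmware state from the plugin output."""
    states = {}
    slot = None
    for raw in raid_status.splitlines():
        # Strip the "=====> ... <=====" highlight markers the plugin adds.
        line = raw.replace("=====>", "").replace("<=====", "").strip()
        if line.startswith("Slot Number:"):
            slot = int(line.split(":", 1)[1])
        elif line.startswith("Firmware state:") and slot is not None:
            states[slot] = line.split(":", 1)[1].strip()
    return states

def safe_to_pull_next(raid_status):
    """Proceed with the next swap only when no drive is still rebuilding."""
    return all(s != "Rebuild" for s in firmware_states(raid_status).values())
```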

Slot 0 is Online, Spun Up. The next one should be Span: 1.

Disk replacement in slot 2 complete.

Let's consider this fixed and let's focus on T175685.