Degraded RAID on db2010
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db2010. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 0
			Drive's position: DiskGroup: 0, Span: 0, Arm: 0
			Media Error Count: 13426
			Other Error Count: 122
			Predictive Failure Count: =====> 423 <=====
			Last Predictive Failure Event Seq Number: 113300

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 42C (107.60 F)

		Span: 1 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 1, Arm: 0
			Media Error Count: 14786
			Other Error Count: 0
			Predictive Failure Count: =====> 301 <=====
			Last Predictive Failure Event Seq Number: 113301

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 42C (107.60 F)

		Span: 2 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 32
			Slot Number: 5
			Drive's position: DiskGroup: 0, Span: 2, Arm: 1
			Media Error Count: 3056
			Other Error Count: 34
			Predictive Failure Count: =====> 271 <=====
			Last Predictive Failure Event Seq Number: 113240

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 43C (109.40 F)

		Span: 4 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 8
			Drive's position: DiskGroup: 0, Span: 4, Arm: 0
			Media Error Count: 9
			Other Error Count: 0
			Predictive Failure Count: =====> 301 <=====
			Last Predictive Failure Event Seq Number: 113302

				Raw Size: 279.396 GB [0x22ecb25c Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 42C (107.60 F)

=== RaidStatus completed
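For reference, a snapshot like the one above can be scanned programmatically when deciding which drive to pull first. This is a minimal, hypothetical Python sketch (not part of the Nagios/Icinga tooling); the severity heuristic (failed firmware state first, then highest media error count) is an assumption, not necessarily the policy followed in this task:

```python
import re

def parse_pds(raid_status):
    """Return one dict of "field: value" pairs per physical drive."""
    drives = []
    current = None
    for raw in raid_status.splitlines():
        # Drop the "=====> ... <=====" highlight markers the plugin adds.
        line = re.sub(r"=+>\s*|\s*<=+", "", raw).strip()
        if re.match(r"PD: \d+ Information$", line):
            current = {}
            drives.append(current)
        elif current is not None and ": " in line:
            key, _, value = line.partition(": ")
            current[key] = value
    return drives

def worst_first(drives):
    """One possible ordering: failed drives first, then most media errors."""
    return sorted(
        drives,
        key=lambda d: (
            d.get("Firmware state") != "Failed",   # False sorts before True
            -int(d.get("Media Error Count", "0")),
        ),
    )
```

Applied to the snapshot above, this would single out the drive in slot 5 (the only one in a `Failed` firmware state) before the drives that only show error counters.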

Event Timeline

Marostegui added subscribers: Papaul, Marostegui.

@Papaul please change the disk whenever you can.
Thanks!

@Marostegui, are we sure we want this done, rather than getting rid of the host directly? It is a very old host and we have its replacements set up. I would ask how many spare disks we have left, if any, and only change it if we have plenty; otherwise, start cloning it.

If we have spare disks, I would just replace it for now and do the cloning without any rush. But it is really up to you, if you don't mind getting this on your priority list, as I will be on holidays soon! :-)

@Papaul Do you have plenty of old 300GB disks that would not be used otherwise, or should we speed up the decommissioning (it will happen eventually, but right now we have other priorities)?

@jcrespo technically we do not have any 300GB spare disks. I am trying to load my Google spreadsheet for server decommissioning to see if we do have a server with 300GB disks, but can't at the moment; once I am able to, I will update the task.

Thanks.

@Papaul no problem- do not work too hard, we may replace the full server soon.

@jcrespo db200[1-9] all have 12x300GB disks; we can pull one out and use it for db2010 for now.

Yes, those are unused, you can use one of those with no problem. Please do if it doesn't take much of your time, thank you.

@jcrespo on db2010 I have 5 bad disks; is there any particular order you want them replaced in?

5? wow. I would say 1 at a time, and we check they rebuild correctly. Do not necessarily wait; we can do a couple per day when you are around (normally it takes a few hours to rebuild each disk). What do you think? You can tell me which ones you change and I will tell you when they finish rebuilding.

Let's start by changing only Slot Number: 5 (DiskGroup: 0, Span: 2, Arm: 1), which is the only one that has really failed (the others should have only correctable errors). I can also help identify it by blinking its LED if you have problems.
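For reference (these commands are not from this task's log), the stock MegaCli invocations for locating and tracking a drive swap look roughly like this; the binary name and path vary by install (`MegaCli`, `MegaCli64`, or under `/opt/MegaRAID/MegaCli/`), and `[32:5]` is enclosure 32, slot 5 as reported in the snapshot above:

```shell
# Blink the locate LED on enclosure 32, slot 5 (the failed drive)
sudo MegaCli -PdLocate -start -physdrv[32:5] -a0

# ...swap the drive, then stop blinking
sudo MegaCli -PdLocate -stop -physdrv[32:5] -a0

# Watch rebuild progress on the replacement
sudo MegaCli -PDRbld -ShowProg -physdrv[32:5] -a0
```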

Papaul triaged this task as High priority. Edited Sep 11 2017, 2:56 PM

Ok, I have the disk in slot 5 replaced.

jcrespo lowered the priority of this task from High to Low. Sep 12 2017, 11:55 AM

Slot 5 got rebuilt correctly; let's go with Slot Number: 0 now (much lower priority). It has ~13K media errors.

Disk replacement in slot 0 complete.

Still on Firmware state: Rebuild, so we will wait a bit for the next one. (I am being a bit more cautious than I have to be with the RAID 10, because the disks are not new, so there is a chance for those to fail, too.)
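The gating rule here (wait until nothing is still rebuilding before pulling the next disk) can be checked against the plugin output. A hypothetical helper, assuming the same output format as the snapshot in the description:

```python
def firmware_states(raid_status):
    """Map each slot number to its firmware state from the plugin output."""
    states = {}
    slot = None
    for raw in raid_status.splitlines():
        # Strip the "=====> ... <=====" highlight markers the plugin adds.
        line = raw.replace("=====>", "").replace("<=====", "").strip()
        if line.startswith("Slot Number:"):
            slot = int(line.split(":", 1)[1])
        elif line.startswith("Firmware state:") and slot is not None:
            states[slot] = line.split(":", 1)[1].strip()
    return states

def safe_to_pull_next(raid_status):
    """Proceed with the next swap only when no drive is still rebuilding."""
    return all(s != "Rebuild" for s in firmware_states(raid_status).values())
```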

Slot 0 is Online, Spun Up. The next one should be Span: 1.

Disk replacement in slot 2 complete.

Let's consider this fixed and let's focus on T175685.