
Degraded RAID on db1066
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1066. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 1 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 1, Arm: 0
			Media Error Count: 13
			Other Error Count: 1
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 42C (107.60 F)

		Span: 3 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 6
			Drive's position: DiskGroup: 0, Span: 3, Arm: 0
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: =====> 158 <=====
			Last Predictive Failure Event Seq Number: 2892

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 40C (104.00 F)

=== RaidStatus completed
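
For anyone triaging a snapshot like the one above by script rather than by eye, here is a minimal Python sketch. It assumes the report keeps the plain "key: value" layout shown here and that abnormal values are wrapped in "=====> ... <=====" markers. The function names `parse_drives` and `needs_attention` are hypothetical and are not part of the Icinga handler or of megacli.

```python
import re

# The report wraps abnormal values in "=====> ... <=====" markers; strip them.
_MARKER = re.compile(r"=====>\s*(.*?)\s*<=====")


def parse_drives(report: str) -> list[dict]:
    """Return one dict of "key: value" fields per "PD:" block in the report."""
    drives = []
    current = None
    for raw in report.splitlines():
        line = _MARKER.sub(r"\1", raw.strip())
        if line.startswith("PD:"):
            # A new physical drive block starts here.
            current = {}
            drives.append(current)
            continue
        if line.startswith("Span:") or line.startswith("==="):
            # Span headers and report delimiters do not belong to any drive.
            current = None
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return drives


def needs_attention(drive: dict) -> bool:
    """Flag drives that have failed or report predictive failures (assumed criteria)."""
    failed = "Failed" in drive.get("Firmware state", "")
    predictive = int(drive.get("Predictive Failure Count", "0")) > 0
    return failed or predictive
```

Applied to this report, the sketch would flag slot 2 (firmware state Failed) and slot 6 (predictive failure count 158).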

Event Timeline

Marostegui added a project: DBA.

This is an s1 slave. @Cmjohnson, please change the disk when you are back from holidays.
If you need to get some used disks, there are some hosts scheduled for decommission: T166486 T164702

Thanks!

@Marostegui That doesn't work: the older hosts have 300GB disks, and only the not-quite-so-old ones have 600GB disks.

Good catch @jcrespo - thank you.
@Cmjohnson please advise if you have run out of 600GB spare disks.

Thanks guys

Cmjohnson added a subscriber: faidon.

New disks need to be ordered. A task has been created and escalated to @faidon: T170446

Disk replaced and rebuilding

Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Rebuild
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
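
A listing like the one above can be summarised with a few lines of Python to see whether a rebuild is still running. This is only a sketch: it assumes the input is a sequence of "Firmware state: ..." lines as pasted here, and the helper names `firmware_state_summary` and `rebuild_in_progress` are hypothetical.

```python
from collections import Counter


def firmware_state_summary(lines) -> Counter:
    """Count how many drives report each firmware state."""
    prefix = "Firmware state:"
    return Counter(
        line.strip()[len(prefix):].strip()
        for line in lines
        if line.strip().startswith(prefix)
    )


def rebuild_in_progress(summary: Counter) -> bool:
    """True while at least one drive still reports the Rebuild state."""
    return summary.get("Rebuild", 0) > 0
```

For the twelve lines above this gives 11 drives in "Online, Spun Up" and 1 in "Rebuild", so the rebuild is still in progress.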

The rebuild is taking a long time :-/ It is still running.

Change 367859 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1066 for maintenance

https://gerrit.wikimedia.org/r/367859

Change 367859 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1066 for maintenance

https://gerrit.wikimedia.org/r/367859

I depooled it and now it has finished :-(

jcrespo assigned this task to Cmjohnson.