Degraded RAID on db1066
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1066. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 1 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 1, Arm: 0
			Media Error Count: 13
			Other Error Count: 1
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 42C (107.60 F)

		Span: 3 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 6
			Drive's position: DiskGroup: 0, Span: 3, Arm: 0
			Media Error Count: 0
			Other Error Count: 0
			Predictive Failure Count: =====> 158 <=====
			Last Predictive Failure Event Seq Number: 2892

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 40C (104.00 F)

=== RaidStatus completed
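
For anyone triaging a snapshot like the one above by script, here is a minimal sketch in Python that pulls out the per-drive fields and flags the drives that look unhealthy. It is not the Icinga event handler's own code. The function names and the "needs attention" policy are assumptions made for illustration, and the regexes only cover the fields visible in this snapshot.

```python
import re

# Field names as they appear in the snapshot; the optional "=+>" / "<=+" parts
# tolerate the "=====> ... <=====" highlighting around abnormal values.
_SLOT = re.compile(r"Slot Number:\s*(\d+)")
_MEDIA = re.compile(r"Media Error Count:\s*(?:=+>\s*)?(\d+)")
_PRED = re.compile(r"Predictive Failure Count:\s*(?:=+>\s*)?(\d+)")
_STATE = re.compile(r"Firmware state:\s*(?:=+>\s*)?(.*?)(?:\s*<=+)?\s*$")


def parse_drives(snapshot):
    """Return one dict per physical drive found in a RaidStatus snapshot."""
    drives = []
    current = None
    for raw in snapshot.splitlines():
        line = raw.strip()
        m = _SLOT.match(line)
        if m:
            # A "Slot Number" line starts the fields of a new drive.
            current = {"slot": int(m.group(1)), "state": None,
                       "media_errors": 0, "predictive_failures": 0}
            drives.append(current)
            continue
        if current is None:
            continue
        m = _MEDIA.match(line)
        if m:
            current["media_errors"] = int(m.group(1))
            continue
        m = _PRED.match(line)
        if m:
            current["predictive_failures"] = int(m.group(1))
            continue
        m = _STATE.match(line)
        if m:
            current["state"] = m.group(1)
    return drives


def needs_attention(drive):
    """Hypothetical policy: failed firmware state or any predictive failure."""
    return drive["state"] == "Failed" or drive["predictive_failures"] > 0
```

Run against this snapshot, such a policy would flag both slot 2 (failed) and slot 6 (still online, but with a predictive failure count of 158).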

Details

Related Gerrit Patches:
[operations/mediawiki-config@master] mariadb: Depool db1066 for maintenance

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 1 2017, 9:52 PM
Marostegui triaged this task as Medium priority. · Jul 2 2017, 5:44 AM
Marostegui added a project: DBA.
Marostegui added subscribers: Cmjohnson, Marostegui.

This is an s1 slave - @Cmjohnson please change the disk when you are back from holidays.
If you need to get some used disks, there are some hosts scheduled for decommission: T166486 T164702

Thanks!

jcrespo added a subscriber: jcrespo. · Jul 3 2017, 8:44 AM

@Marostegui That doesn't work: the older hosts have 300GB disks, and only the somewhat less old ones have 600GB disks.

Good catch @jcrespo - thank you.
@Cmjohnson please advise if you have run out of 600GB spare disks.

Thanks guys
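
As a quick sanity check on the 600GB figure, the raw size in the snapshot (0x45dd2fb0 sectors) can be converted by hand. The sketch below assumes 512-byte sectors, which the task does not state, and `raw_size` is a hypothetical helper name.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def raw_size(sectors, sector_bytes=SECTOR_BYTES):
    """Return (decimal GB, binary GiB) for a sector count."""
    size_bytes = sectors * sector_bytes
    return size_bytes / 10**9, size_bytes / 2**30


gb, gib = raw_size(0x45dd2fb0)
print(f"{gb:.1f} GB decimal, {gib:.2f} GiB binary")
```

Under that assumption the drive comes out at roughly 600 GB in decimal units and about 558.91 in binary units, which is consistent with the "558.911 GB" raw size reported by the controller and with a 600GB replacement being needed.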

Cmjohnson moved this task from Backlog to Blocked on the ops-eqiad board. · Jul 12 2017, 8:08 PM
Cmjohnson added a subscriber: faidon.

New disks need to be ordered. A task has been created and escalated to @faidon: T170446

Disk replaced and rebuilding

Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Rebuild
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
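
A listing like the one above can be summarized with a small script while waiting for a rebuild. This is a minimal sketch that only assumes the "Firmware state: ..." line format shown here. The function names are hypothetical and not part of any megacli tooling.

```python
from collections import Counter

PREFIX = "Firmware state:"


def summarize_states(listing):
    """Count drives per firmware state in a 'Firmware state: ...' listing."""
    states = Counter()
    for raw in listing.splitlines():
        line = raw.strip()
        if line.startswith(PREFIX):
            states[line[len(PREFIX):].strip()] += 1
    return states


def rebuild_in_progress(states):
    """True while at least one drive reports the Rebuild state."""
    return states.get("Rebuild", 0) > 0
```

For the twelve lines above this gives 11 drives in "Online, Spun Up" and 1 in "Rebuild", so `rebuild_in_progress` stays true until the replaced disk returns to the online state.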

The rebuild is taking a long time :-/ It is still in progress.

Change 367859 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1066 for maintenance

https://gerrit.wikimedia.org/r/367859

Change 367859 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1066 for maintenance

https://gerrit.wikimedia.org/r/367859

jcrespo added a comment. · Edited · Jul 26 2017, 8:29 AM

I depooled it and now it has finished :-(

jcrespo closed this task as Resolved. · Jul 26 2017, 2:16 PM
jcrespo assigned this task to Cmjohnson.