Degraded RAID on db1064
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1064. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 1 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 2
			Drive's position: DiskGroup: 0, Span: 1, Arm: 0
			Media Error Count: 529
			Other Error Count: 0
			Predictive Failure Count: =====> 133 <=====
			Last Predictive Failure Event Seq Number: 3542

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: Online, Spun Up
				Media Type: Hard Disk Device
				Drive Temperature: 41C (105.80 F)

		Span: 3 - Number of PDs: 2

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 6
			Drive's position: DiskGroup: 0, Span: 3, Arm: 0
			Media Error Count: 11
			Other Error Count: 1
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 40C (104.00 F)

=== RaidStatus completed
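
For readers who want to triage a snapshot like the one above without reading it line by line, here is a minimal sketch in Python. It is not part of the Icinga handler or of `get_raid_status_megacli`; the function name `flag_suspect_drives` and the rule "anything not Online, or any predictive failure count above zero, is suspect" are assumptions made for illustration. The sample text is abbreviated from the snapshot in this task.

```python
import re

# The plugin wraps noteworthy values in "=====> ... <=====" (seen in the snapshot above).
ARROWS = re.compile(r"=====>\s*(.*?)\s*<=====")

def flag_suspect_drives(report: str) -> list[tuple[int, list[str]]]:
    """Return (slot, reasons) for each drive that looks unhealthy (hypothetical helper)."""
    suspects = []
    slot, reasons = None, []
    for raw in report.splitlines():
        line = ARROWS.sub(r"\1", raw).strip()
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            # A new drive block starts: store the previous one if it was flagged.
            if slot is not None and reasons:
                suspects.append((slot, reasons))
            slot, reasons = int(value), []
        elif key == "Predictive Failure Count" and int(value) > 0:
            reasons.append(f"predictive failures: {value}")
        elif key == "Firmware state" and not value.startswith("Online"):
            reasons.append(f"firmware state: {value}")
    if slot is not None and reasons:
        suspects.append((slot, reasons))
    return suspects

SAMPLE = """\
    Slot Number: 2
    Media Error Count: 529
    Predictive Failure Count: =====> 133 <=====
    Firmware state: Online, Spun Up
    Slot Number: 6
    Media Error Count: 11
    Predictive Failure Count: 0
    Firmware state: =====> Failed <=====
"""

for slot, reasons in flag_suspect_drives(SAMPLE):
    print(slot, reasons)
```

On the abbreviated sample this flags slot 2 for its predictive failures and slot 6 for its Failed firmware state, which matches the two drives discussed later in this task.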

Details

Related Gerrit Patches:
[operations/puppet@production] db1064: Disable notifications
[operations/mediawiki-config@master] db-eqiad.php: Depool db1064

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 2 2018, 12:56 AM
Marostegui triaged this task as Medium priority. Mar 2 2018, 6:36 AM
Marostegui added a project: DBA.
Marostegui added subscribers: Cmjohnson, Marostegui.

This is a slave in s4.
There is only one spare disk left, and we will use it for db1068 (s4 master - T188187#4016615), so we need to order more as per @Cmjohnson's comment (T188187#4016491).

Marostegui moved this task from Triage to In progress on the DBA board. Mar 2 2018, 7:09 AM
Marostegui mentioned this in Unknown Object (Task). Mar 6 2018, 9:34 AM
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Mar 6 2018, 3:40 PM

I have set 32:2 to offline due to errors.
This host now has 2 failed disks.

@Cmjohnson do you have any used disks somewhere, at least to replace one of them? We now have 2 spans degraded...

mark added a subscriber: mark. Mar 7 2018, 3:30 PM

Let me see what I have for used spare disks

Change 416964 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1064

https://gerrit.wikimedia.org/r/416964

Change 416964 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1064

https://gerrit.wikimedia.org/r/416964

Mentioned in SAL (#wikimedia-operations) [2018-03-07T15:47:20Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1064, it is not performing well with 2 failed disks - T188685 (duration: 01m 16s)

With two of the server's disks failed, it is struggling to catch up even though it is depooled. It is slowly getting there...

Change 416977 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1064: Disable notifications

https://gerrit.wikimedia.org/r/416977

Change 416977 merged by Marostegui:
[operations/puppet@production] db1064: Disable notifications

https://gerrit.wikimedia.org/r/416977

@Marostegui I swapped both disks with used disks we had from decommissioned servers. The disks are currently rebuilding. Please resolve this task once it's completed.

Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Rebuild
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Rebuild
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
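
A list like the one above can be summarized mechanically to tell whether a rebuild is still in progress. The following Python sketch is an illustration only: the helper names `firmware_state_counts` and `rebuild_finished` are hypothetical, and it assumes the input is text in which each drive contributes one `Firmware state:` line, as in the paste above.

```python
from collections import Counter

def firmware_state_counts(output: str) -> Counter:
    """Count how many drives report each firmware state (hypothetical helper)."""
    counts = Counter()
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Firmware state":
            counts[value.strip()] += 1
    return counts

def rebuild_finished(output: str) -> bool:
    """True only if at least one drive is listed and every drive is Online."""
    counts = firmware_state_counts(output)
    return bool(counts) and set(counts) == {"Online, Spun Up"}

# The twelve states pasted above: ten Online drives and two still rebuilding.
STATES = "\n".join(
    ["Firmware state: Online, Spun Up"] * 2
    + ["Firmware state: Rebuild"]
    + ["Firmware state: Online, Spun Up"] * 3
    + ["Firmware state: Rebuild"]
    + ["Firmware state: Online, Spun Up"] * 5
)

counts = firmware_state_counts(STATES)
print(counts["Online, Spun Up"], counts["Rebuild"], rebuild_finished(STATES))
```

The check deliberately returns False for empty input, so a failed or empty command output is not mistaken for a healthy array.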

Marostegui closed this task as Resolved. Mar 8 2018, 7:00 AM
Marostegui assigned this task to Cmjohnson.

Thanks Chris, it looks good now:

root@db1064:~# megacli -LDPDInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.271 TB
Sector Size         : 512
Mirror Data         : 3.271 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
root@db1064:~# megacli -LDPDInfo -aAll | egrep -i "slot|error"
Slot Number: 0
Media Error Count: 0
Other Error Count: 0
Slot Number: 1
Media Error Count: 0
Other Error Count: 0
Slot Number: 2
Media Error Count: 0
Other Error Count: 0
Slot Number: 3
Media Error Count: 0
Other Error Count: 0
Slot Number: 4
Media Error Count: 0
Other Error Count: 0
Slot Number: 5
Media Error Count: 0
Other Error Count: 0
Slot Number: 6
Media Error Count: 2
Other Error Count: 0
Slot Number: 7
Media Error Count: 0
Other Error Count: 0
Slot Number: 8
Media Error Count: 0
Other Error Count: 0
Slot Number: 9
Media Error Count: 0
Other Error Count: 0
Slot Number: 10
Media Error Count: 0
Other Error Count: 0
Slot Number: 11
Media Error Count: 0
Other Error Count: 0
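
The `egrep` output above is easy to scan by eye on one host, but the same filter can be turned into a per-slot table. This is a rough Python sketch under the assumption that the input has exactly the shape shown above (a `Slot Number:` line followed by that slot's `... Error Count:` lines); `error_counts_by_slot` and `slots_with_errors` are hypothetical names, not existing tooling.

```python
def error_counts_by_slot(filtered: str) -> dict[int, dict[str, int]]:
    """Group the '... Error Count' lines under the preceding 'Slot Number' line."""
    slots: dict[int, dict[str, int]] = {}
    current = None
    for line in filtered.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            current = int(value)
            slots[current] = {}
        elif key.endswith("Error Count") and current is not None:
            slots[current][key] = int(value)
    return slots

def slots_with_errors(filtered: str) -> dict[int, dict[str, int]]:
    """Keep only the slots where at least one counter is non-zero."""
    return {
        slot: counts
        for slot, counts in error_counts_by_slot(filtered).items()
        if any(counts.values())
    }

# Three slots excerpted from the output above.
EXCERPT = """\
Slot Number: 5
Media Error Count: 0
Other Error Count: 0
Slot Number: 6
Media Error Count: 2
Other Error Count: 0
Slot Number: 7
Media Error Count: 0
Other Error Count: 0
"""

print(slots_with_errors(EXCERPT))
```

On this excerpt only slot 6 is reported, with a media error count of 2, which is the one non-zero counter in the full output.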

Mentioned in SAL (#wikimedia-operations) [2018-03-08T07:15:15Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Revert: Depool db1064, it is not performing well with 2 failed disks - T188685 (duration: 01m 31s)