Degraded RAID on labsdb1001
Closed, Resolved, Public

Assigned To

Authored By

	ops-monitoring-bot
	Jul 24 2017, 10:18 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host labsdb1001. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives per span: 2
	Number of Spans: 6
	Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU

		Span: 4 - Number of PDs: 2

			PD: 1 Information
			Enclosure Device ID: 16
			Slot Number: 9
			Drive's position: DiskGroup: 0, Span: 4, Arm: 1
			Media Error Count: 8
			Other Error Count: 26
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 42C (107.60 F)

=== RaidStatus completed
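For anyone triaging a snapshot like the one above, here is a minimal sketch of pulling the failed drive's address out of the text. It assumes the field names shown in this snapshot ("Enclosure Device ID", "Slot Number", "Firmware state"); the `failed_drives` helper and the `SNAPSHOT` sample are hypothetical and not part of the event handler.

```python
# Sample lines copied from the snapshot in this task (indentation dropped).
SNAPSHOT = """\
PD: 1 Information
Enclosure Device ID: 16
Slot Number: 9
Drive's position: DiskGroup: 0, Span: 4, Arm: 1
Media Error Count: 8
Other Error Count: 26
Firmware state: =====> Failed <=====
"""


def failed_drives(text):
    """Return ([enclosure:slot], state) pairs for drives not reported as Online."""
    drives = []
    current = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            # A new enclosure line starts a new physical-drive record.
            current = {"enclosure": value}
        elif key == "Slot Number":
            current["slot"] = value
        elif key == "Firmware state":
            # The snapshot decorates bad states as "=====> Failed <=====".
            state = value.strip("=<> ")
            if not state.startswith("Online"):
                address = "[%s:%s]" % (current.get("enclosure", "?"),
                                       current.get("slot", "?"))
                drives.append((address, state))
    return drives


print(failed_drives(SNAPSHOT))
```

The resulting address is in the enclosure:slot form that megacli's -physdrv option takes.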

Details

	Subject	Repo	Branch	Lines +/-
	DON'T MERGE: labsdb: in case labsdb1001 falls over	operations/puppet	production	+4 -4

Event Timeline

ops-monitoring-bot added projects: SRE, ops-eqiad. Jul 24 2017, 10:18 PM

ops-monitoring-bot subscribed.

Restricted Application added a subscriber: Aklapper. Jul 24 2017, 10:18 PM

bd808 added a project: Data-Services. Jul 24 2017, 10:29 PM

I have no idea why things didn't explode here.

I would have the labsdb1001 depool patch ready just in case.

I think this must be one of the two drives in a RAID1 configuration for the OS itself rather than a drive in the RAID0 data array. We should really get this changed out tomorrow if at all possible then. We are on borrowed time :)

@Cmjohnson, will you be at the DC tomorrow, by any chance?

Restricted Application added subscribers: Liuxinyu970226, Jay8g, TerraCodes. Jul 24 2017, 11:13 PM

chasemp added a project: cloud-services-team (Kanban). Jul 24 2017, 11:15 PM

hmm

# cat /proc/mdstat
Personalities :
unused devices: <none>

Change 367625 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] DON'T MERGE: labsdb: in case labsdb1001 falls over

https://gerrit.wikimedia.org/r/367625

gerritbot added a project: Patch-For-Review. Jul 24 2017, 11:37 PM

In T171538#3468451, @chasemp wrote:
hmm
# cat /proc/mdstat
Personalities :
unused devices: <none>

From the alert above, it looks like this is a MegaRAID HW RAID, not software md RAID, i.e. you need to use megacli to troubleshoot this. I didn't log in to the box, but from the task description it looks like this is a disk that's part of a RAID10 of 2x6=12 disks, not of a 2-member RAID1.
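The 2x6=12 reading can be checked mechanically from the fields in the task description. This is only a sketch: the `raid_layout` helper is hypothetical, and it assumes the common convention that a MegaRAID virtual drive with primary level 1 and more than one span is what is usually called RAID10.

```python
import re

# Fields copied from the RaidStatus snapshot in the task description.
STATUS = """\
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Number Of Drives per span: 2
Number of Spans: 6
"""


def raid_layout(text):
    """Return (layout name, total member disks) from megacli-style fields."""
    primary = int(re.search(r"Primary-(\d+)", text).group(1))
    per_span = int(re.search(r"Number Of Drives per span:\s*(\d+)", text).group(1))
    spans = int(re.search(r"Number of Spans:\s*(\d+)", text).group(1))
    # Assumption: mirrored spans (primary level 1) striped across
    # several spans are reported here as RAID10.
    if primary == 1 and spans > 1:
        name = "RAID10"
    else:
        name = "RAID%d" % primary
    return name, per_span * spans


print(raid_layout(STATUS))
```

With the values above this gives RAID10 over 12 disks, so losing the single disk in span 4 leaves that mirror pair without redundancy but does not by itself take the array down.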

The failed disk is part of a HW RAID10:

# megacli -LDPDInfo -aAll

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.271 TB
Sector Size         : 512
Mirror Data         : 3.271 TB
State               : Degraded
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only
Number of Spans: 6
Span: 0 - Number of PDs: 2

These are 600G disks, and I don't think we have spares at the moment. They were ordered recently though, so I assume they will probably arrive this week: T170446#3436627
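As a sanity check on the 600G figure, the raw sector count from the task description can be converted by hand. This is a rough sketch that assumes 512-byte sectors on the physical disk (the megacli output only states 512 for the virtual drive); the `disk_size` helper is hypothetical.

```python
# Raw sector count copied from "Raw Size: 558.911 GB [0x45dd2fb0 Sectors]".
SECTORS = 0x45dd2fb0


def disk_size(sectors, sector_bytes=512):
    """Return (bytes, decimal GB, binary GiB) for a raw sector count."""
    total = sectors * sector_bytes  # assumption: 512-byte sectors
    return total, total / 10**9, total / 2**30


total, gb, gib = disk_size(SECTORS)
print("%d bytes = %.3f GB (decimal) = %.3f GiB (binary)" % (total, gb, gib))
```

The decimal figure comes out at roughly 600 GB, while the binary figure is within rounding of the 558.911 GB that megacli reports, so both numbers describe the same disk.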

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. Jul 25 2017, 3:10 PM

Disk replaced and rebuilding

Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Rebuild
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
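A list like the one above is easier to read as a tally. The sketch below assumes the lines come from filtering megacli physical-drive output down to the "Firmware state" field; the `summarize_states` helper is hypothetical.

```python
from collections import Counter

# The twelve lines pasted above: nine Online, one Rebuild, two Online.
STATES = "\n".join(
    ["Firmware state: Online, Spun Up"] * 9
    + ["Firmware state: Rebuild"]
    + ["Firmware state: Online, Spun Up"] * 2
)


def summarize_states(text):
    """Count physical drives per firmware state."""
    return Counter(
        line.split(":", 1)[1].strip()
        for line in text.splitlines()
        if line.startswith("Firmware state:")
    )


counts = summarize_states(STATES)
for state, n in counts.items():
    print("%2d  %s" % (n, state))

# The array is only fully redundant again once no drive is left rebuilding.
still_rebuilding = counts.get("Rebuild", 0) > 0
```

Here that gives 11 drives Online and 1 in Rebuild, so the virtual drive should still be expected to report Degraded until the rebuild finishes.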

Cmjohnson moved this task from Up next to High Priority Task on the ops-eqiad board. Jul 25 2017, 7:05 PM

Thank you, @Cmjohnson.

[root@labsdb1001 05:04 /root]
# megacli -pdrbld -showprog -physdrv\[16:9\] -aALL

Device(Encl-16 Slot-9) is not in rebuild process

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 3.271 TB
Sector Size         : 512
Mirror Data         : 3.271 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2

Liuxinyu970226 unsubscribed. Jul 26 2017, 7:05 AM

Change 367625 abandoned by Rush:
DON'T MERGE: labsdb: in case labsdb1001 falls over

Reason:
no longer needed

https://gerrit.wikimedia.org/r/367625

bd808 moved this task from Inbox to Done on the cloud-services-team (Kanban) board. Jul 28 2017, 11:23 PM
