
Degraded RAID on db1131
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host db1131. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
	State: =====> Degraded <=====
	Number Of Drives: 6
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 6

			PD: 0 Information
			ERROR: =====> MISSING DRIVE INFO <=====

=== RaidStatus completed
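
The wrapper above only summarizes; the degraded VD and the drive behind the "MISSING DRIVE INFO" line can be inspected directly with MegaCli (a rough sketch; the binary may be installed as megacli, MegaCli or MegaCli64 depending on the host):

$ sudo megacli -LDInfo -Lall -aALL     # virtual drive state, should show Degraded
$ sudo megacli -PDList -aALL | grep -E 'Enclosure Device ID|Slot Number|Firmware state'   # per-disk state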

Event Timeline

@wiki_willy this host is under warranty, can we get a new disk for it?

[35898752.940170] megaraid_sas 0000:18:00.0: 726 (647382021s/0x0001/CRIT) - VD 00/0 is now DEGRADED
[35898999.592143] megaraid_sas 0000:18:00.0: 728 (647382270s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 0
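
The same kernel messages can be re-checked on the host itself, for example:

$ sudo dmesg -T | grep -i megaraid_sas
$ sudo journalctl -k | grep -i megaraid_sas
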
Marostegui moved this task from Triage to In progress on the DBA board.

This is the s6 primary database master.

Controller's log, in case it is needed for the RMA:

seqNum: 0x000002d1
Time: Mon Jul  6 20:20:21 2020

Code: 0x0000010c
Class: 1
Locale: 0x02
Event Description: PD 00(e0x20/s0) Path 500056b33ce418c0  reset (Type 03)
Event Data:
===========
Device ID: 0
Enclosure Index: 32
Slot Number: 0
Error: 3


seqNum: 0x000002d2
Time: Mon Jul  6 20:20:21 2020

Code: 0x00000070
Class: 1
Locale: 0x02
Event Description: Removed: PD 00(e0x20/s0)
Event Data:
===========
Device ID: 0
Enclosure Index: 32
Slot Number: 0


seqNum: 0x000002d3
Time: Mon Jul  6 20:20:21 2020

Code: 0x000000f8
Class: 0
Locale: 0x02
Event Description: Removed: PD 00(e0x20/s0) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b33ce418c0,0000000000000000
Event Data:
===========
Device ID: 0
Enclosure Device ID: 32
Enclosure Index: 1
Slot Number: 0
SAS Address 1: 500056b33ce418c0
SAS Address 2: 0


seqNum: 0x000002d4
Time: Mon Jul  6 20:20:21 2020

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 00(e0x20/s0) from ONLINE(18) to FAILED(11)
Event Data:
===========
Device ID: 0
Enclosure Index: 32
Slot Number: 0
Previous state: 24
New state: 17


seqNum: 0x000002d5
Time: Mon Jul  6 20:20:21 2020

Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from OPTIMAL(3) to DEGRADED(2)
Event Data:
===========
Target Id: 0
Previous state: 3
New state: 2


seqNum: 0x000002d6
Time: Mon Jul  6 20:20:21 2020

Code: 0x000000fb
Class: 2
Locale: 0x01
Event Description: VD 00/0 is now DEGRADED
Event Data:
===========
Target Id: 0


seqNum: 0x000002d7
Time: Mon Jul  6 20:20:22 2020

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 00(e0x20/s0) from FAILED(11) to UNCONFIGURED_BAD(1)
Event Data:
===========
Device ID: 0
Enclosure Index: 32
Slot Number: 0
Previous state: 17
New state: 1


seqNum: 0x000002d8
Time: Mon Jul  6 20:24:30 2020

Code: 0x000000b9
Class: 2
Locale: 0x04
Event Description: Enclosure PD 20(c None/p1) phy bad for slot 0
Event Data:
===========
Device ID: 32
Enclosure Index: 1
Slot Number: 255
Index: 0
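
For the record, this adapter event log (and the firmware terminal log Dell sometimes asks for alongside an RMA) can typically be exported with MegaCli; a sketch, the output file names here are arbitrary:

$ sudo megacli -AdpEventLog -GetEvents -f db1131-adpeventlog.txt -aALL
$ sudo megacli -FwTermLog -Dsply -aALL > db1131-fwtermlog.txt
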
wiki_willy added a subscriber: Jclark-ctr.

@Jclark-ctr - can you send in the RMA for this one, when you get in later today? Thanks, Willy

Confirmed: Service Request 1029100504 was successfully submitted.

@Jclark-ctr The TSR report is attached and will be emailed as well.

@Marostegui The TSR report showed a few more errors that Dell would like to address. What day works best to schedule downtime?

Good morning John,

Per our phone conversations this morning, we have determined that there is a need to replace the system board of the server, as well as replacement of DIMM A10 and the SSD in slot 0. The reason for the system board is because the CMOS battery is failing and frequently these clips break when trying to replace the battery. My leadership has requested we replace the board

Thanks @Jclark-ctr. We need to schedule a maintenance window as this is an active master. I will get that done next week and let you know when you can power off the host and replace the board.

Thank you

I will fail over db1131 to db1093 on Tuesday 14th at 05:00 AM UTC.

Change 611964 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1093 to s6 master

https://gerrit.wikimedia.org/r/611964

Change 611965 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/611965

Change 611964 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1093 to s6 master

https://gerrit.wikimedia.org/r/611964

Mentioned in SAL (#wikimedia-operations) [2020-07-14T05:00:10Z] <marostegui> Starting s6 failover from db1131 to db1093 - T257253

Mentioned in SAL (#wikimedia-operations) [2020-07-14T05:00:39Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set s6 as read-only for maintenance T257253', diff saved to https://phabricator.wikimedia.org/P11887 and previous config saved to /var/cache/conftool/dbconfig/20200714-050039-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-14T05:01:58Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1093 to s6 master and remove read-only from s6 T257253', diff saved to https://phabricator.wikimedia.org/P11888 and previous config saved to /var/cache/conftool/dbconfig/20200714-050157-marostegui.json

Change 611965 merged by Marostegui:
[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/611965

The switchover was done; db1131 is no longer the primary master.
Times:
RO started: 05:00:39
RO finished: 05:01:58
Total RO: 1 minute and 19 seconds
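
For context, that read-only window boils down to the following generic MariaDB steps (a simplified sketch only; the actual switchover was driven through dbctl and the DBA tooling, and the topology changes for the other s6 replicas plus the DNS/dbctl updates are omitted here):

$ mysql -h db1131 -e "SET GLOBAL read_only = ON"      # RO starts on the old master
$ mysql -h db1093 -e "SHOW SLAVE STATUS\G"            # wait until db1093 has applied everything
$ mysql -h db1093 -e "STOP SLAVE; RESET SLAVE ALL"    # detach the new master from db1131
$ mysql -h db1093 -e "SET GLOBAL read_only = OFF"     # RO ends, db1093 takes writes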

Change 612471 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1131: Disable notifications

https://gerrit.wikimedia.org/r/612471

Change 612471 merged by Marostegui:
[operations/puppet@production] db1131: Disable notifications

https://gerrit.wikimedia.org/r/612471

@Jclark-ctr you can now proceed: db1131 is off.
Please power it back on once the on-site maintenance is done.

Thank you!
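
For completeness, the power cycle around the board swap is just the usual OS shutdown plus a power-on from the management interface (a sketch; the mgmt hostname and credential below are made-up examples):

$ sudo poweroff                                      # on db1131, before the swap
$ ipmitool -I lanplus -H db1131.mgmt.example -U root -P "$IPMI_PASS" chassis power on   # after the swap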

@Jclark-ctr is everything done from your side? I see the host is back up.
What was done in the end?

@Marostegui Yes, all items are finished, sorry for not commenting. Dell did not come until very late yesterday.

Thanks @Jclark-ctr - just for the record, in case this host has future issues: were the mainboard and DIMM modules replaced as well as the hard disk?

The RAID looks good:

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 4.364 TB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 4.364 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 6
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No
Number of Spans: 1
Span: 0 - Number of PDs: 6
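
If needed, rebuild progress on the replaced disk (enclosure 32, slot 0 per the controller log above) and the overall state can be double-checked with, for example:

$ sudo megacli -PDRbld -ShowProg -PhysDrv '[32:0]' -a0         # rebuild progress on the new disk
$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli   # should report nothing non-optimal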

Change 613023 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1131: Enable notifications

https://gerrit.wikimedia.org/r/613023

Change 613023 merged by Marostegui:
[operations/puppet@production] db1131: Enable notifications

https://gerrit.wikimedia.org/r/613023

This is all done, the host is fully back in production.
Pending: what exactly was replaced on this host on-site, so we can track it in case this host has issues again //cc @Jclark-ctr

@Marostegui - here are the details on what Dell replaced: DIMM A10, the SSD in slot 0, and the system board (though the board wasn't bad...it was just the CMOS battery that needed replacement). @Jclark-ctr - does that sound right? Thanks, Willy

Good morning John,

Per our phone conversations this morning, we have determined that there is a need to replace the system board of the server, as well as replacement of DIMM A10 and the SSD in slot 0. The reason for the system board is because the CMOS battery is failing and frequently these clips break when trying to replace the battery. My leadership has requested we replace the board

Thanks Willy

Yes.
DIMM A10
SSD slot 0
Main board