Degraded RAID on ms-be2067
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host ms-be2067. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 2 (Target Id: 2)
	RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0
	State: =====> Offline <=====
	Number Of Drives: 1
	Number of Spans: 1
	Current Cache Policy: WriteThrough, ReadAheadNone, Cached, No Write Cache if Bad BBU

		Span: 0 - Number of PDs: 1

			PD: 0 Information
			Enclosure Device ID: 32
			Slot Number: 0
			Drive's position: DiskGroup: 0, Span: 0, Arm: 0
			Media Error Count: 0
			Other Error Count: 261
			Predictive Failure Count: 0
			Last Predictive Failure Event Seq Number: 0

				Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
				Firmware state: =====> Failed <=====
				Media Type: Hard Disk Device
				Drive Temperature: 31C (87.80 F)

=== RaidStatus completed
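
For anyone triaging a snapshot like the one above by script rather than by eye, here is a minimal sketch that pulls the slot number and error counters of failed physical drives out of the plugin text. The function name `failed_drives` and the shortened `SAMPLE` string are illustrative assumptions, not part of the Icinga plugin; the parsing relies only on the `key: value` layout and the `=====> ... <=====` markers visible in the output above.

```python
import re


def failed_drives(snapshot):
    """Return one dict per physical drive whose firmware state is 'Failed'."""
    drives = []
    current = None
    for raw in snapshot.splitlines():
        line = raw.strip()
        if re.match(r"PD: \d+ Information", line):
            # A new physical-drive section starts here.
            current = {}
            drives.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            # Notable values are wrapped in "=====> ... <=====" markers;
            # strip those so the bare state string remains.
            current[key.strip()] = value.strip().strip("=<> ")
    return [d for d in drives if d.get("Firmware state") == "Failed"]


# Shortened, hypothetical sample in the same layout as the snapshot above.
SAMPLE = """\
PD: 0 Information
Enclosure Device ID: 32
Slot Number: 0
Other Error Count: 261
Firmware state: =====> Failed <=====
"""

for drive in failed_drives(SAMPLE):
    print(drive["Enclosure Device ID"], drive["Slot Number"], drive["Other Error Count"])
```

The sketch keeps every value as a string, so it makes no assumption about which fields are numeric.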

Event Timeline

Hi,
This drive is now unmounted, so can be swapped at your earliest convenience, please :)
Thanks!

Papaul triaged this task as Medium priority.Aug 1 2022, 3:01 PM
Create Dispatch: Success
You have successfully submitted request SR147890192.

Hi @Papaul I may be missing something obvious, but I don't think the storage is quite right here - as far as I can see there isn't a new disk visible, and if I visit the idrac, it tells me there's one drive "Physical Disk 0:2:0" in state "removed". Could you have another look, please?

Thanks, and sorry for the bother.

Perhaps relatedly, but perhaps not, kern.log is unhappy about /dev/sdz since sdc was removed:

Aug  3 15:18:02 ms-be2067 kernel: [2595942.387928] sd 0:2:2:0: SCSI device is removed
Aug  3 15:18:05 ms-be2067 kernel: [2595945.605821] sd 0:2:25:0: [sdz] tag#250 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Aug  3 15:18:05 ms-be2067 kernel: [2595945.605827] sd 0:2:25:0: [sdz] tag#250 CDB: Write(16) 8a 00 00 00 00 00 80 97 29 60 00 00 01 b8 00 00
Aug  3 15:18:05 ms-be2067 kernel: [2595945.605832] blk_update_request: I/O error, dev sdz, sector 2157390176 op 0x1:(WRITE) flags 0x800 phys_seg 55 prio class 0
Aug  3 15:18:05 ms-be2067 kernel: [2595945.617003] iomap_finish_ioend: 39 callbacks suppressed
Aug  3 15:18:15 ms-be2067 kernel: [2595945.617006] sdz1: writeback error on inode 2149037748, offset 0, sector 2157390616

...and it's still logging errors now. (I think that's the drive with Slot Number: 23, if that helps.)

Please check again and resolve this task when done.

@Papaul sorry, I don't understand your comment, but I've rechecked, and there are still kernel log errors re sdz and the idrac still thinks there's one removed drive....

The disk was bad and it has been replaced. Now you need to put the replacement disk back in the RAID.

Well, I tried our usual procedure https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings and the first two commands work OK, but attempting to make a new single-disk RAID out of the new drive fails because the adaptor thinks there's no new drive:

mvernon@ms-be2067:~$ sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
                                     

There is no configurable physical drive available on adapter 0.

Exit Code: 0x0c

Maybe @fgiunchedi can spot what I'm doing wrong with the replaced drive [and have a better explanation for what's up with sdz...]

re: the original failed disk I can confirm that slot 0 (where the disk was) isn't currently listed:

root@ms-be2067:~# megacli -pdlist -aALL | grep 'Slot Number'
Slot Number: 1
Slot Number: 2
Slot Number: 3
Slot Number: 4
Slot Number: 5
Slot Number: 6
Slot Number: 7
Slot Number: 8
Slot Number: 9
Slot Number: 10
Slot Number: 11
Slot Number: 12
Slot Number: 13
Slot Number: 14
Slot Number: 15
Slot Number: 16
Slot Number: 17
Slot Number: 18
Slot Number: 19
Slot Number: 20
Slot Number: 21
Slot Number: 22
Slot Number: 23
Slot Number: 24
Slot Number: 25
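
The same check can be done mechanically. This is a minimal sketch that compares the `Slot Number` lines from `megacli -pdlist` output against the slots expected to be populated. The helper names and the assumption that bays are numbered 0 to 25 are mine, inferred from the listing above and the original failure in slot 0; adjust the expected range to the real chassis.

```python
import re


def present_slots(pdlist_output):
    """Collect the slot numbers that appear in 'Slot Number: N' lines."""
    matches = re.findall(r"^Slot Number:\s*(\d+)", pdlist_output, flags=re.MULTILINE)
    return {int(n) for n in matches}


def missing_slots(pdlist_output, expected_slots):
    """Return the expected slots that the controller does not list, sorted."""
    return sorted(set(expected_slots) - present_slots(pdlist_output))


# Rebuild the grep output shown above: slots 1 through 25 are listed.
sample = "\n".join(f"Slot Number: {n}" for n in range(1, 26))

# Assumption: the chassis has 26 bays numbered 0..25.
print(missing_slots(sample, range(26)))  # prints [0]
```

An empty list would mean the controller sees a drive in every expected bay.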

Maybe we can try to confirm what's in slot 0 and reseat the drive?

re: sdz I'm not sure though, perhaps another bad drive (i.e. unrelated to this failure?)

@Papaul could you take another look at this, please, and see if we can get the replacement disk to be visible to the RAID controller?

Also, I don't know if it's possible that /dev/sdz got knocked while you were working on this system, but could you check it's seated properly, please? That drive will need replacing too if we can't get it happy again...

/dev/sdz is scsi@0:2.25.0 and the physical drive inside logical drive 25 (which is Target Id: 25) is Slot Number: 23

Thanks!
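
As a cross-check on the first half of that mapping, here is a small sketch that extracts the SCSI address (host, channel, target, LUN) for each block device named in kernel `sd` log lines. The regular expression and function name are illustrative assumptions based only on the log format quoted earlier in this task. The target number it returns (25 for sdz) corresponds to the Target Id; the step from there to Slot Number: 23 still needs the controller's own output and is not derived here.

```python
import re

# Matches kernel lines of the form "sd 0:2:25:0: [sdz] ..."
SD_LINE = re.compile(r"sd (\d+):(\d+):(\d+):(\d+): \[(sd[a-z]+)\]")


def scsi_addresses(log_text):
    """Map block device name -> (host, channel, target, lun)."""
    found = {}
    for match in SD_LINE.finditer(log_text):
        host, channel, target, lun, dev = match.groups()
        found[dev] = (int(host), int(channel), int(target), int(lun))
    return found


# One line copied from the kern.log excerpt earlier in this task.
LOG = (
    "Aug  3 15:18:05 ms-be2067 kernel: [2595945.605821] sd 0:2:25:0: [sdz] "
    "tag#250 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s"
)

print(scsi_addresses(LOG))  # prints {'sdz': (0, 2, 25, 0)}
```

Lines that carry no bracketed device name, such as the "SCSI device is removed" message, are skipped.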

I need this server depooled so I can shut it down to work on this disk issue.

Icinga downtime and Alertmanager silence (ID=0b0cee87-305c-4cd2-acf0-ac3d3f5b8587) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: RAID battery failure

ms-be2032.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=51e5c728-88e2-4e83-acf6-7e651f6e7d29) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: disk fault investigation

ms-be2067.codfw.wmnet

@Papaul I've shut ms-be2067 down for you to work on it.
[ignore the downtime on ms-be2032 here, that was a typo]

I have requested that another disk be sent to me. The server is back up.

Create Dispatch: Success
You have successfully submitted request SR148961821.

Give it a minute, I am upgrading the BIOS on it.

I should be receiving a new disk sometime today. If the new disk doesn't work, I will open a ticket with Dell.

I received the new disk and I will need the server offline so I can work on it. Thanks.

Icinga downtime and Alertmanager silence (ID=eb9685af-e0f7-4513-a789-7a96488ffc40) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: disk fault investigation

ms-be2067.codfw.wmnet

@Papaul I've shut this server down for you to work on it.

When the drive is removed from the server, the iDRAC detects it. When the drive is put back, the iDRAC detects that as well, but the RAID controller doesn't:

Drive 0 is installed in disk drive bay 1.	Thu 18 Aug 2022 17:42:47
Drive 0 is removed from disk drive bay 1.	Thu 18 Aug 2022 17:42:02

I have a ticket open with Dell to send me a backplane. The server is back online for now. Thanks.

Change 824686 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: ms-be2067/sdc1 has failed

https://gerrit.wikimedia.org/r/824686

Change 824686 merged by MVernon:

[operations/puppet@production] swift: ms-be2067/sdc1 has failed

https://gerrit.wikimedia.org/r/824686

Good afternoon Papaul,

I have submitted DPS 432866984 for the replacement backplane to ship out. Service is scheduled for Thursday 08/25/22. The tech will call upon assignment to provide his contact details for site access.

The Dell technician will be on site today between 10am and 2pm CT. Is it possible to get this server offline for the backplane replacement?

Thanks

Icinga downtime and Alertmanager silence (ID=c4a39dbe-2fb0-4745-99c3-76e40de3820e) set by eevans@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: backplane replacement

ms-be2067.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-08-25T16:40:03Z] <urandom> shutting down ms-be2067.codfw.wmnet for backplane replacement -- T314049

@Papaul the host is shut down; Please let me know as soon as it's back up

@Eevans thanks, the host is back online. The backplane replacement fixed the issue.

Mentioned in SAL (#wikimedia-operations) [2022-08-25T19:36:57Z] <urandom> rebooting ms-be2067 to "fix" disk enumeration(?) -- T314049

Mentioned in SAL (#wikimedia-operations) [2022-08-25T20:14:30Z] <urandom> re-rebooting ms-be2067 to "fix" disk enumeration(?) -- T314049

Change 829763 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: ms-be2037/sdg1 failed; ms-be2067/sdc1 fixed

https://gerrit.wikimedia.org/r/829763

Change 829763 merged by MVernon:

[operations/puppet@production] swift: ms-be2037/sdg1 failed; ms-be2067/sdc1 fixed

https://gerrit.wikimedia.org/r/829763