Degraded RAID on sodium
Open, High, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host sodium. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 1 failed LD(s) (Degraded)

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_megacli
=== RaidStatus (does not include components in optimal state)
name: Adapter #0

	Virtual Drive: 0 (Target Id: 0)
	RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
	State: =====> Degraded <=====
	Number Of Drives: 4
	Number of Spans: 1
	Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

=== RaidStatus completed

Created a self-dispatch ticket with Dell.

You have successfully submitted request WO10392987.

Cmjohnson moved this task from Backlog to Being worked on on the ops-eqiad board. Aug 28 2018, 4:29 PM
ArielGlenn triaged this task as High priority. Aug 29 2018, 10:25 AM

Swapped the disk in sodium.

Return shipping info:
USPS 9202 3946 5301 2439 6565 62
FEDX 9611918 2393026 76406583

Cmjohnson added a subscriber: ArielGlenn.

@ArielGlenn Can you help get this disk back into rotation? It shows as Unconfigured Good.

Whoever has clinic duty should probably take this and hand it to the right person. I think @Dzahn and/or @MoritzMuehlenhoff may oversee the ubuntu mirror boxes (if not, please excuse the ping).

Cmjohnson reassigned this task from ArielGlenn to Dzahn. Oct 18 2018, 4:11 PM

@Dzahn, can you help put the disk back into the RAID config, please?

I wasn't aware I was overseeing the mirror boxes, and I have never done this before, but I tried to follow T205364#4641757 and the cheatsheet linked there.

Eventually I was able to identify the new drive and the needed parameters, but I got stuck at:

root@sodium:~# megacli -PdReplaceMissing -PhysDrv [32:1] -Array0 -row1 -a0
                                     
Cannot perform replace operation with drives having different Sector size . 

Exit Code: 0x01
Dzahn added a comment. Edited Oct 26 2018, 11:52 PM
 for disk in 0 1 2 3; do megacli -PDInfo -PhysDrv [32:${disk}] -aALL | grep "^Sector Size"; done
Sector Size:  512
Sector Size:  4096
Sector Size:  512
Sector Size:  512

@Cmjohnson I think it won't work with this disk ^ because it uses a 4K sector size. Do you have another replacement that uses the traditional 512-byte sector size, like the existing disks?

https://en.wikipedia.org/wiki/Advanced_Format
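
For anyone repeating this check, the loop above can be piped into a small helper that flags mixed sector sizes before attempting the replace. This is only a sketch: the function name `sector_sizes_uniform` is made up for illustration, and it assumes the `Sector Size:` line format shown in the output above.

```shell
# Hypothetical helper (not part of megacli): reads MegaCli -PDInfo output on
# stdin, prints how many drives report each "Sector Size", and returns 0 only
# if exactly one distinct size was seen (1 if sizes are mixed or none found).
sector_sizes_uniform() {
    awk -F': *' '
        /^Sector Size:/ { count[$2]++ }
        END {
            n = 0
            for (size in count) {
                n++
                printf "%s bytes: %d drive(s)\n", size, count[size]
            }
            exit (n == 1 ? 0 : 1)
        }'
}
```

Usage would be along the lines of `for disk in 0 1 2 3; do megacli -PDInfo -PhysDrv [32:${disk}] -aALL; done | sector_sizes_uniform || echo "mixed sector sizes"`.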

Dzahn reassigned this task from Dzahn to Cmjohnson. Oct 26 2018, 11:53 PM
Dzahn added a subscriber: akosiaris.

The disk has been swapped.

Dzahn claimed this task. Oct 31 2018, 5:04 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-05T22:41:24Z] <mutante> sodium - reboot after disk replacement (T202705)

Dzahn added a comment. Edited Nov 5 2018, 10:55 PM

Hmm, I still see that one of the disks has a different sector size, as before:

root@sodium:~# for disk in 0 1 2 3; do megacli -PDInfo -PhysDrv [32:${disk}] -aALL | grep "Physical Sector"; done
Physical Sector Size:  512
Physical Sector Size:  4096
Physical Sector Size:  512
Physical Sector Size:  512

They also have different raw sizes in TB:

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 5.458 TB [0x57541e96 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]

I can't rebuild the RAID with it; the replace fails with "Cannot perform replace operation with drives having different Sector size".
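
As a sanity check on the raw sizes listed above, the hex sector counts can be multiplied out by the sector sizes. This is a rough sketch with two assumptions: the sector count is taken to be in units of the size each drive reports (512 or 4096), and "TB" in the MegaCli output is taken to mean 2^40 bytes, truncated to three decimals. The function names are made up for illustration.

```shell
# Hypothetical helper: sector count (hex or decimal) times sector size in
# bytes gives the raw capacity in bytes. Shell arithmetic accepts 0x constants.
raw_bytes() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: express a byte count in units of 2^40 bytes,
# truncated (not rounded) to three decimals. The truncation is an assumption
# chosen because it reproduces the figures in the output above.
bytes_to_tib() {
    awk -v b="$1" 'BEGIN { printf "%.3f\n", int(b / (1024 ^ 4) * 1000) / 1000 }'
}

raw_bytes 0x1d1c0beb0 512    # disks 0, 2, 3: prints 4000787030016
raw_bytes 0x57541e96 4096    # disk 1:        prints 6001175126016
```

Under those assumptions the three existing disks are 4 TB drives with 512-byte sectors and the replacement in slot 1 is a 6 TB drive with 4096-byte sectors, so both the size and the sector format differ.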

Dzahn reassigned this task from Dzahn to Cmjohnson. Nov 5 2018, 10:56 PM

@Cmjohnson The problem appears unchanged to me. Can you replace disk 1 with a disk that also uses a physical sector size of 512?

Cmjohnson moved this task from Being worked on to Blocked on the ops-eqiad board. Tue, Dec 11, 6:38 PM