
Storage problems with new host db1133
Closed, ResolvedPublic

Description

As per our IRC chat, here is the task.
It looks like there is a bad disk on db1133; the system event log has multiple reports of it:

	properties
		CreationTimestamp = 20190507090057.000000-300
		ElementName = System Event Log Entry
		RecordData = Fault detected on drive 2 in disk drive bay 1.
		RecordFormat = string Description
		RecordID = 42

	properties
		CreationTimestamp = 20190507084454.000000-300
		ElementName = System Event Log Entry
		RecordData = Fault detected on drive 2 in disk drive bay 1.
		RecordFormat = string Description
		RecordID = 40

That leaves the RAID degraded from the start, and the host misbehaves during installation (it never formats the /srv partition).
It is also practically impossible to access the RAID controller menu at boot (at least remotely); it just hangs when entering the menu at:

Copyright(c) 2016 Avago Technologies
Press <Ctrl><R> to Run Configuration Utility
HA -0 (Bus 24 Dev 0) PERC H730P Adapter
FW package: 25.5.5.0005

Can we get Dell to send a new one and re-create the RAID?

Update from T222731#5187356
Looks like replacing the disk wasn't enough, and the server is now reporting multiple disk failures.
It may be a broken RAID controller or a broken main board. This needs follow-up with the vendor.

Event Timeline

@Cmjohnson So I saw this on the idrac:

/admin1/system1/logs1/log1-> show record44

	properties
		CreationTimestamp = 20190507090745.000000-300
		ElementName = System Event Log Entry
		RecordData = Drive 2 in disk drive bay 1 is operating normally.
		RecordFormat = string Description
		RecordID = 43

Even though the RAID controller reported the disk as failed (see T222731#5169505), I forced it online and started an installation. But it crashed again:

/admin1/system1/logs1/log1-> show record45

	properties
		CreationTimestamp = 20190507092932.000000-300
		ElementName = System Event Log Entry
		RecordData = Fault detected on drive 2 in disk drive bay 1.
		RecordFormat = string Description
		RecordID = 44
	associations

This disk needs replacement
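
Since the log flips between "Fault detected" and "operating normally" for the same drive, it can help to tally the fault entries per drive. A minimal sketch, assuming the SEL records have been saved as text in the format pasted above; the function name `sel_fault_counts` is made up for illustration and is not an iDRAC tool:

```shell
# sel_fault_counts (hypothetical helper): reads SEL record text on stdin and
# prints "<count> drive <N> bay <M>" for every drive with at least one
# "Fault detected" entry. "operating normally" entries are ignored.
sel_fault_counts() {
    sed -n 's/.*RecordData = Fault detected on drive \([0-9][0-9]*\) in disk drive bay \([0-9][0-9]*\)\..*/drive \1 bay \2/p' |
        sort | uniq -c |
        awk '{ print $1, $2, $3, $4, $5 }'   # collapse the padding added by uniq -c
}
```

Usage would be something like `sel_fault_counts < sel.txt`, where `sel.txt` is whatever file the records were copied into.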

New disk ordered: "You have successfully submitted request SR990443425."

Update from Chris after replacing the disk: there is a bigger issue now. All but one disk is reporting bad, including the newly replaced disk. This will require more work.

Marostegui renamed this task from Bad disk on new host db1133 to Storage problems with new host db1133.May 20 2019, 8:03 AM
Marostegui updated the task description. (Show Details)

This host can be taken down for debugging at any time without a heads-up to the DBAs - it doesn't even have an OS

@Marostegui - Chris is out on vacation this week, so I'll follow up with him when he's back on Tuesday. ~Willy

Thank you!
We still have 3 more hosts to keep us busy, but as this probably involves getting parts replaced, it might take some time to get them delivered.

Update on this server. I have updated all of the firmware, including the RAID card. I am able to isolate the problem to slot 0 right now: I moved the disks around and the disks themselves do not report any errors, only the slot does. I have blown out the RAID several times and re-configured it, but the error keeps coming back. I have reseated the RAID card as well.

Next step is Dell

Thanks for the heads up!
Let's see what Dell says

You have successfully submitted request SR991779294.

They declined my ticket; they say I didn't isolate the problem well enough.

Is there anything I can do from my side to help on that?

Dell is sending me a new RAID card, cables and backplane. Sorry it took so long; I had to call them after they denied my second request.

I have re-imaged the host after Chris did it yesterday and everything looks good: RAID, memory, CPUs...

root@db1133:~# megacli -LdPdInfo -a0

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 4.364 TB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 4.364 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 6
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No
Number of Spans: 1
Span: 0 - Number of PDs: 6


root@db1133:~# megacli -LdPdInfo -a0 | grep state
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
root@db1133:~# megacli -LdPdInfo -a0 | grep state | wc -l
6
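
The same check can be turned around to count drives that are not healthy, which is easier to alert on than eyeballing six lines. A rough sketch, assuming the `Firmware state: ...` line format shown above; `count_bad_drives` is a hypothetical helper name, not part of megacli:

```shell
# count_bad_drives (hypothetical helper): reads `megacli -LdPdInfo -a0` output
# on stdin and prints how many "Firmware state" lines are anything other
# than "Online, Spun Up". Prints 0 when every drive is healthy.
count_bad_drives() {
    awk -F': ' '
        /^Firmware state/ && $2 != "Online, Spun Up" { bad++ }
        END { print bad + 0 }
    '
}
```

On the host this would be run as `megacli -LdPdInfo -a0 | count_bad_drives`. Note that a result of 0 only means no listed drive is in a bad state; the drive count still has to be checked separately, as done with `wc -l` above.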

root@db1133:~# megacli -LdPdInfo -a0 | grep -i Raw
Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
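
As a sanity check on the sizes: the hex value in brackets is the sector count, and with the 512-byte sector size reported above, the "TB" figures work out as binary units (2^40 bytes). A small sketch of that arithmetic; `sectors_to_tib` is a made-up helper, and truncating to three decimals is an assumption that happens to match the figures megacli printed here:

```shell
# sectors_to_tib (hypothetical helper): converts a sector count (decimal or
# 0x-prefixed hex) to TiB, assuming 512-byte sectors, truncated to 3 decimals.
sectors_to_tib() {
    sectors=$(( $1 ))
    # bytes * 1000 / 2^40, kept in integer arithmetic (needs 64-bit shell math)
    milli=$(( sectors * 512 * 1000 / 1099511627776 ))
    printf '%d.%03d\n' $(( milli / 1000 )) $(( milli % 1000 ))
}

sectors_to_tib 0xba4d4ab0   # prints 1.455
sectors_to_tib 0xdf8fe2b0   # prints 1.746
```

So the first drive (0xdf8fe2b0 sectors, about 1.92 decimal TB) is simply a larger model than the other five (about 1.6 decimal TB each), not a misreported size.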

Thank you @Cmjohnson for getting this painful host fixed!