
Storage problems with new host db1133
Open, Needs Triage, Public

Description

As per our IRC chat, here is the task.
It looks like there is a bad disk on db1133; there are multiple reports of it:

	properties
		CreationTimestamp = 20190507090057.000000-300
		ElementName = System Event Log Entry
		RecordData = Fault detected on drive 2 in disk drive bay 1.
		RecordFormat = string Description
		RecordID = 42

	properties
		CreationTimestamp = 20190507084454.000000-300
		ElementName = System Event Log Entry
		RecordData = Fault detected on drive 2 in disk drive bay 1.
		RecordFormat = string Description
		RecordID = 40
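
For anyone comparing these entries against other logs: the CreationTimestamp values are in the CIM datetime layout (yyyymmddHHMMSS.mmmmmm followed by a signed UTC offset in minutes, so -300 is UTC-5). Below is a minimal sketch of parsing the pasted records. The function names and the assumption that every record starts with a "properties" line are illustrative only, not part of any iDRAC tooling.

```python
import re
from datetime import datetime, timedelta, timezone

# CIM datetime: 14 digits, ".", 6 digits of microseconds, sign, 3-digit offset in minutes.
CIM_DT = re.compile(r"^(\d{14})\.(\d{6})([+-])(\d{3})$")


def parse_cim_datetime(value: str) -> datetime:
    """Turn e.g. '20190507090057.000000-300' into a timezone-aware datetime."""
    m = CIM_DT.match(value.strip())
    if m is None:
        raise ValueError(f"not a CIM datetime: {value!r}")
    stamp, micros, sign, offset_min = m.groups()
    offset = timedelta(minutes=int(offset_min))
    if sign == "-":
        offset = -offset
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(microsecond=int(micros))
    return naive.replace(tzinfo=timezone(offset))


def parse_records(text: str) -> list:
    """Split pasted 'show recordN' output into one dict per 'properties' block."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "properties":
            current = {}
            records.append(current)
        elif " = " in line and current is not None:
            key, value = line.split(" = ", 1)
            current[key.strip()] = value.strip()
    return records


# The two records quoted in the description above.
SAMPLE = """
properties
    CreationTimestamp = 20190507090057.000000-300
    ElementName = System Event Log Entry
    RecordData = Fault detected on drive 2 in disk drive bay 1.
    RecordFormat = string Description
    RecordID = 42

properties
    CreationTimestamp = 20190507084454.000000-300
    ElementName = System Event Log Entry
    RecordData = Fault detected on drive 2 in disk drive bay 1.
    RecordFormat = string Description
    RecordID = 40
"""

if __name__ == "__main__":
    for rec in parse_records(SAMPLE):
        when = parse_cim_datetime(rec["CreationTimestamp"])
        print(rec["RecordID"], when.astimezone(timezone.utc).isoformat(), rec["RecordData"])
```

Converting to UTC, as the last loop does, makes it easier to line these events up with installer or console logs that are not in the controller's local time.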

This leaves the RAID degraded from the start, and the installation misbehaves on this host (it never formats the /srv partition).
It is also practically impossible to access the RAID controller menu at boot (at least remotely); it just hangs when entering the menu at:

Copyright(c) 2016 Avago Technologies
Press <Ctrl><R> to Run Configuration Utility
HA -0 (Bus 24 Dev 0) PERC H730P Adapter
FW package: 25.5.5.0005

Can we get Dell to send a new disk and then re-create the RAID?

Update from T222731#5187356
Looks like replacing a disk wasn't enough, and now the server is reporting multiple disk failures.
It may be a broken RAID controller or a broken main board. This needs follow-up with the vendor.

Event Timeline

Restricted Application removed a project: Patch-For-Review. May 7 2019, 3:08 PM
Marostegui updated the task description. May 7 2019, 3:13 PM

@Cmjohnson So I saw this on the idrac:

/admin1/system1/logs1/log1-> show record44

	properties
		CreationTimestamp = 20190507090745.000000-300
		ElementName = System Event Log Entry
		RecordData = Drive 2 in disk drive bay 1 is operating normally.
		RecordFormat = string Description
		RecordID = 43

Even though the RAID showed the disk as failed (see T222731#5169505), I forced the disk online and started an installation, but it crashed again:

/admin1/system1/logs1/log1-> show record45

	properties
		CreationTimestamp = 20190507092932.000000-300
		ElementName = System Event Log Entry
		RecordData = Fault detected on drive 2 in disk drive bay 1.
		RecordFormat = string Description
		RecordID = 44
	associations

This disk needs replacement

New disk ordered: "You have successfully submitted request SR990443425."

@Cmjohnson any ETA for this disk?
Thanks!

Update from Chris after replacing the disk: there is a bigger issue now. All but one disk are reporting bad, including the newly replaced disk. This will require more work.

Marostegui renamed this task from Bad disk on new host db1133 to Storage problems with new host db1133. Mon, May 20, 8:03 AM
Marostegui updated the task description.

This host can be taken down for debugging anytime without heads up to the DBAs - it doesn't even have an OS

@Marostegui - Chris is out on vacation this week, so I'll follow up with him when he's back on Tuesday. ~Willy

Thank you!
We still have 3 more hosts to keep us busy, but as this probably involves getting parts replaced, it might take some time to get them delivered.

Update on this server: I have updated all of the firmware, including the RAID card's. I am able to isolate the problem to slot 0 right now. I moved the disks around and the disks themselves do not report any errors; only the slot does. I have wiped the RAID configuration several times and re-configured it, but the error keeps coming back. I have reseated the RAID card as well.
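
The isolation reasoning in this update (move disks between slots and check whether the error follows the disk or stays with the slot) can be written down as a small check, which may help when presenting the evidence to the vendor. This is only a sketch: the function name, the disk labels and the observations below are hypothetical and were not recorded from db1133.

```python
from collections import defaultdict


def suspects(observations):
    """Return (bad_slots, bad_disks) from (disk, slot, had_error) observations.

    A slot is suspect when at least two different disks were tried in it and
    all of them errored. A disk is suspect when it was tried in at least two
    different slots and errored in all of them.
    """
    disks_in_slot = defaultdict(dict)  # slot -> {disk: had_error}
    slots_of_disk = defaultdict(dict)  # disk -> {slot: had_error}
    for disk, slot, had_error in observations:
        disks_in_slot[slot][disk] = had_error
        slots_of_disk[disk][slot] = had_error
    bad_slots = sorted(
        slot for slot, seen in disks_in_slot.items()
        if len(seen) >= 2 and all(seen.values())
    )
    bad_disks = sorted(
        disk for disk, seen in slots_of_disk.items()
        if len(seen) >= 2 and all(seen.values())
    )
    return bad_slots, bad_disks


# Hypothetical swap test: disks "A" and "B" are exchanged between slots 0 and 1.
swap_test = [
    ("A", 0, True),   # A errors in slot 0
    ("B", 1, False),  # B is fine in slot 1
    ("B", 0, True),   # after the swap, B errors in slot 0
    ("A", 1, False),  # and A is fine in slot 1
]

if __name__ == "__main__":
    print(suspects(swap_test))
```

With observations like these, the error stays with slot 0 regardless of which disk is installed, which points at the slot, backplane or controller rather than at a disk.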

Next step is Dell

Thanks for the heads up!
Let's see what Dell says

You have successfully submitted request SR991779294.

They declined my ticket; they say I didn't isolate the problem well enough.

Is there anything I can do from my side to help on that?