Storage problems with new host db1133
Closed, Resolved, Public

Assigned To

Authored By

	Marostegui
	May 7 2019, 3:08 PM

Description

As per our IRC chat here is the task.
Looks like there is a bad disk on db1133, there are multiple reports of it:

	properties
		CreationTimestamp = 20190507090057.000000-300
		ElementName = System Event Log Entry
		RecordData = Fault detected on drive 2 in disk drive bay 1.
		RecordFormat = string Description
		RecordID = 42

	properties
		CreationTimestamp = 20190507084454.000000-300
		ElementName = System Event Log Entry
		RecordData = Fault detected on drive 2 in disk drive bay 1.
		RecordFormat = string Description
		RecordID = 40

This has left the RAID degraded from the start, and the host misbehaves during installation (it never formats the /srv partition).
It is also nearly impossible to access the RAID controller menu at boot (at least remotely); it hangs when entering the menu at:

Copyright(c) 2016 Avago Technologies
Press <Ctrl><R> to Run Configuration Utility
HA -0 (Bus 24 Dev 0) PERC H730P Adapter
FW package: 25.5.5.0005

Can we get Dell to send a new disk and then re-create the RAID?

Update from T222731#5187356
Looks like replacing a disk wasn't enough: the server is now reporting multiple disk failures.
This may be a broken RAID controller or a broken main board. Needs follow-up with the vendor.

Related Objects

Status	Assigned	Task
Resolved	aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved	Marostegui	T220170 Address Database hardware infrastructure blockers on datacenter switchover & multi-dc deployment
		Unknown Object (Task)
Resolved	Marostegui	T217396 Decommission db1061-db1073
Resolved	Marostegui	T211613 rack/setup/install db11[26-38].eqiad.wmnet
Resolved	Cmjohnson	T222731 Storage problems with new host db1133

Event Timeline

Marostegui created this task. May 7 2019, 3:08 PM

Restricted Application removed a project: Patch-For-Review. May 7 2019, 3:08 PM

Marostegui updated the task description. May 7 2019, 3:13 PM

Marostegui mentioned this in T222682: Productionize db11[26-38]. May 8 2019, 9:35 AM

Captura de pantalla 2019-05-09 a las 11.16.10.png (283×546 px, 49 KB)

@Cmjohnson So I saw this on the iDRAC:

/admin1/system1/logs1/log1-> show record44

	properties
		CreationTimestamp = 20190507090745.000000-300
		ElementName = System Event Log Entry
		RecordData = Drive 2 in disk drive bay 1 is operating normally.
		RecordFormat = string Description
		RecordID = 43

Even though the RAID reported the disk as failed (see T222731#5169505), I forced the disk online and started an installation, but it crashed again:

/admin1/system1/logs1/log1-> show record45

	properties
		CreationTimestamp = 20190507092932.000000-300
		ElementName = System Event Log Entry
		RecordData = Fault detected on drive 2 in disk drive bay 1.
		RecordFormat = string Description
		RecordID = 44
	associations
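
The two records above show drive 2 going from normal back to faulted within about 22 minutes. A minimal Python sketch of how such `show record` output could be turned into a per-drive timeline follows. The function names are hypothetical, the sample text is copied from the records above, and the UTC offset at the end of CreationTimestamp is ignored for simplicity.

```python
import re
from datetime import datetime

# Sample text copied from the two iDRAC records quoted above.
SAMPLE = """
properties
    CreationTimestamp = 20190507090745.000000-300
    ElementName = System Event Log Entry
    RecordData = Drive 2 in disk drive bay 1 is operating normally.
    RecordFormat = string Description
    RecordID = 43

properties
    CreationTimestamp = 20190507092932.000000-300
    ElementName = System Event Log Entry
    RecordData = Fault detected on drive 2 in disk drive bay 1.
    RecordFormat = string Description
    RecordID = 44
associations
"""

FAULT = re.compile(r"Fault detected on drive (\d+) in disk drive bay (\d+)")
NORMAL = re.compile(r"Drive (\d+) in disk drive bay (\d+) is operating normally")


def parse_records(text):
    """Split the output into one dict of 'key = value' pairs per record."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "properties":
            current = {}
            records.append(current)
        elif line == "associations":
            current = None
        elif current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            current[key] = value
    return records


def drive_timeline(records):
    """Return (time, bay, drive, status) tuples sorted by time."""
    events = []
    for rec in records:
        data = rec.get("RecordData", "")
        for status, pattern in (("fault", FAULT), ("normal", NORMAL)):
            match = pattern.search(data)
            if match:
                # Only the first 14 digits (YYYYmmddHHMMSS) are used.
                when = datetime.strptime(rec["CreationTimestamp"][:14], "%Y%m%d%H%M%S")
                drive, bay = int(match.group(1)), int(match.group(2))
                events.append((when, bay, drive, status))
    return sorted(events)


timeline = drive_timeline(parse_records(SAMPLE))
for when, bay, drive, status in timeline:
    print(f"{when:%H:%M:%S} bay {bay} drive {drive}: {status}")
```

A drive whose most recent event is "fault" shortly after a "normal" event, as here, is flapping rather than recovered.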

This disk needs replacement

New disk ordered: "You have successfully submitted request SR990443425."

Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. May 9 2019, 3:07 PM

Thank you!

@Cmjohnson any ETA for this disk?
Thanks!

Update from Chris after replacing the disk: there is a bigger issue now. All but one disk is reporting bad, including the newly replaced disk. This will require more work.

Marostegui renamed this task from Bad disk on new host db1133 to Storage problems with new host db1133. May 20 2019, 8:03 AM

Marostegui updated the task description.

This host can be taken down for debugging at any time without a heads-up to the DBAs; it doesn't even have an OS.

@Marostegui - Chris is out on vacation this week, so I'll follow up with him when he's back on Tuesday. ~Willy

Thank you!
We still have 3 more hosts to keep us busy, but as this probably involves getting parts replaced, it might take some time to get them delivered.

Update on this server: I have updated all of the firmware, including the RAID card's. I am able to isolate the problem to slot 0 right now. I moved the disks around and the disks themselves do not report any errors; only the slot does. I have wiped the RAID configuration several times and re-configured it, but the error keeps coming back. I have reseated the RAID card as well.

Next step is Dell
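
The isolation reasoning in the update above (move disks between slots and see whether the error follows the disk or stays with the slot) can be sketched in Python as below. This is only an illustration under stated assumptions: the function name, the disk labels, and the observations are all made up, not data from this host.

```python
from collections import defaultdict


def classify_fault(observations):
    """observations: iterable of (slot, disk_id, had_error) tuples.

    A slot is suspect when at least two different disks errored in it
    and each of those disks ran clean in some other slot.
    A disk is suspect when it errored in at least two slots and never
    ran clean anywhere.
    """
    disks_failing_in_slot = defaultdict(set)
    slots_failing_for_disk = defaultdict(set)
    clean_slots_for_disk = defaultdict(set)

    for slot, disk, had_error in observations:
        if had_error:
            disks_failing_in_slot[slot].add(disk)
            slots_failing_for_disk[disk].add(slot)
        else:
            clean_slots_for_disk[disk].add(slot)

    suspect_slots = sorted(
        slot
        for slot, disks in disks_failing_in_slot.items()
        if len(disks) >= 2 and all(clean_slots_for_disk[d] for d in disks)
    )
    suspect_disks = sorted(
        disk
        for disk, slots in slots_failing_for_disk.items()
        if len(slots) >= 2 and not clean_slots_for_disk[disk]
    )
    return {"slots": suspect_slots, "disks": suspect_disks}


# Hypothetical observations: disks A and B both error in slot 0
# but run clean once moved to other slots.
slot_case = [
    (0, "A", True),
    (1, "A", False),
    (0, "B", True),
    (2, "B", False),
    (1, "C", False),
]
print(classify_fault(slot_case))
```

When the result names a slot and no disks, the evidence points at the backplane, cabling or controller rather than at the drives.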

Thanks for the heads up!
Let's see what Dell says

You have successfully submitted request SR991779294.

They declined my ticket; they say I didn't isolate the problem well enough.

Is there anything I can do from my side to help on that?

Marostegui triaged this task as High priority. Jun 18 2019, 4:41 AM

Dell is sending me a new RAID card, cables and backplane. Sorry it took so long; I had to call them after they denied my second request.

Great news! Thanks a lot!!

I have re-imaged the host after Chris did it yesterday and everything looks good: RAID, memory, CPUs...

root@db1133:~# megacli -LdPdInfo -a0

Adapter #0

Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 4.364 TB
Sector Size         : 512
Is VD emulated      : Yes
Mirror Data         : 4.364 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 6
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No
Number of Spans: 1
Span: 0 - Number of PDs: 6


root@db1133:~# megacli -LdPdInfo -a0 | grep state
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
root@db1133:~# megacli -LdPdInfo -a0 | grep state | wc -l
6

root@db1133:~# megacli -LdPdInfo -a0 | grep -i Raw
Raw Size: 1.746 TB [0xdf8fe2b0 Sectors]
Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
Raw Size: 1.455 TB [0xba4d4ab0 Sectors]
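
The manual checks above (virtual disk state, number of drives, and one "Online" firmware state per drive) could be combined into one scripted check. The Python sketch below is a rough illustration: the function names are hypothetical, HEALTHY reuses lines from the output above, and DEGRADED is an invented counter-example.

```python
def summarize(megacli_output):
    """Extract the virtual disk state, expected drive count and
    per-drive firmware states from `megacli -LdPdInfo -a0` text."""
    vd_state, expected, firmware_states = None, None, []
    for line in megacli_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "State":
            vd_state = value
        elif key == "Number Of Drives":
            expected = int(value)
        elif key == "Firmware state":
            firmware_states.append(value)
    return vd_state, expected, firmware_states


def is_healthy(megacli_output):
    """True when the array is Optimal and every expected drive is Online."""
    vd_state, expected, firmware_states = summarize(megacli_output)
    return (
        vd_state == "Optimal"
        and expected is not None
        and len(firmware_states) == expected
        and all(state.startswith("Online") for state in firmware_states)
    )


# Lines taken from the output quoted above.
HEALTHY = "\n".join(
    ["State               : Optimal", "Number Of Drives    : 6"]
    + ["Firmware state: Online, Spun Up"] * 6
)

# Invented example of a degraded array, for contrast only.
DEGRADED = "\n".join(
    ["State               : Degraded", "Number Of Drives    : 6"]
    + ["Firmware state: Online, Spun Up"] * 5
    + ["Firmware state: Failed"]
)

print(is_healthy(HEALTHY), is_healthy(DEGRADED))
```

On a real host the input text would come from running the same `megacli -LdPdInfo -a0` command shown above and capturing its output.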

Thank you @Cmjohnson for getting this painful host fixed!
