
Degraded RAID on elastic2088
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic2088. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda2[0] sdb2[1](F)
      78058496 blocks super 1.2 [2/1] [U_]
      
md1 : active (auto-read-only) raid1 sda3[0] sdb3[1]
      999424 blocks super 1.2 [2/2] [UU]
      	resync=PENDING
      
md2 : active raid0 sda4[0] sdb4[1]
      3591647232 blocks super 1.2 512k chunks
      
unused devices: <none>
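
For readers less used to mdstat output, here is a minimal Python sketch of how the snapshot above can be read programmatically. It is not the `get-raid-status-md` plugin itself, whose implementation is not shown in this task. The function names `parse_mdstat` and `degraded` are hypothetical, and the sketch assumes the usual layout where a member marked `(F)` is faulty and `[2/1] [U_]` means two members expected, one up.

```python
import re

# Sample text copied from the snapshot in this task.
MDSTAT = """\
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda2[0] sdb2[1](F)
      78058496 blocks super 1.2 [2/1] [U_]

md1 : active (auto-read-only) raid1 sda3[0] sdb3[1]
      999424 blocks super 1.2 [2/2] [UU]
        resync=PENDING

md2 : active raid0 sda4[0] sdb4[1]
      3591647232 blocks super 1.2 512k chunks

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+) : (.*)$")             # "md0 : active raid1 ..."
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\([A-Z]\))?")    # "sdb2[1](F)"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")  # "[2/1] [U_]"


def parse_mdstat(text):
    """Return {array: {"failed": [...], "expected": n, "up": n}} (hypothetical helper)."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = {"failed": [], "expected": None, "up": None}
            for dev, flag in MEMBER_RE.findall(header.group(2)):
                if flag == "(F)":
                    current["failed"].append(dev)
            arrays[header.group(1)] = current
            continue
        if current is not None:
            status = STATUS_RE.search(line)
            if status:
                current["expected"] = int(status.group(1))
                current["up"] = int(status.group(2))
    return arrays


def degraded(arrays):
    """Names of arrays with a faulty member or fewer members up than expected."""
    return sorted(
        name
        for name, info in arrays.items()
        if info["failed"]
        or (info["expected"] is not None and info["up"] < info["expected"])
    )


if __name__ == "__main__":
    parsed = parse_mdstat(MDSTAT)
    print(degraded(parsed), parsed["md0"]["failed"])
```

On this snapshot the sketch flags only md0, with sdb2 as the faulty member. md2 is raid0 and prints no `[n/m]` status line, so the sketch leaves its counts unset rather than guessing.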

Event Timeline

Hello DC Ops, this host was acting flaky before (see T361286). I'm not sure what the next steps should be, but I wanted to provide that context.

Jhancock.wm subscribed.

This error reoccurred:

A fatal error was detected on a component at bus 101 device 0 function 0.

I'm gonna open a troubleshooting ticket with Dell because I'm not 100% sure which device is having the errors. It's likely to be the HBA card, but I want them to confirm.

SR188057676

Change #1016427 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: move failing host elastic2088 to insetup

https://gerrit.wikimedia.org/r/1016427

Change #1016427 merged by Ryan Kemper:

[operations/puppet@production] elastic: move failing host elastic2088 to insetup

https://gerrit.wikimedia.org/r/1016427

Hello DC Ops,

This host is unreachable via SSH. We went ahead and shut it off from the DRAC; it's all yours if you need to send it back/replace hardware/etc.

Follow-up: still going back and forth with Dell.

The host is alerting in Icinga, should it be downtimed?

Mentioned in SAL (#wikimedia-operations) [2024-04-10T13:30:53Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on elastic2088.codfw.wmnet with reason: T361525

Mentioned in SAL (#wikimedia-operations) [2024-04-10T13:30:58Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on elastic2088.codfw.wmnet with reason: T361525

Sorry for the noise, I've just downtimed this host.

Update: Dell finally agreed to replace the HBA card. I sent the shipping address confirmation just now. Hopefully it'll be here tomorrow, Monday morning at the latest.

@bking I got the HBA card replaced and it booted without any issues that I can find in the iDRAC. Can you check from the CLI to see if the RAID is still degraded?

@Jhancock.wm looks good, thanks for your help! I'm taking off the DC Ops tags and putting this back in our queue to finish off.

bking updated Other Assignee, added: RKemper.
bking removed a project: ops-codfw.
bking removed a subscriber: Jhancock.wm.

Change #1020375 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] site.pp: move elastic2088 back into production

https://gerrit.wikimedia.org/r/1020375

Change #1020375 merged by Ryan Kemper:

[operations/puppet@production] site.pp: move elastic2088 back into production

https://gerrit.wikimedia.org/r/1020375

Mentioned in SAL (#wikimedia-operations) [2024-04-17T02:48:31Z] <ryankemper> T361525 Trying to powercycle elastic2088 thru mgmt port (host not responding to ssh)

Was able to get a puppet run on elastic2088, but since that run a couple of hours ago the host has been unreachable over SSH (connections hang indefinitely). Seeing some concerning stuff in the DRAC via getsel on elastic2088.mgmt.codfw.wmnet:

-------------------------------------------------------------------------------
Record:      1014
Date/Time:   04/17/2024 02:40:42
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 101 device 0 function 0.
-------------------------------------------------------------------------------
Record:      1015
Date/Time:   04/17/2024 02:40:42
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 100 device 4 function 0.
-------------------------------------------------------------------------------

Sounds like some sort of communication issue with the PCI bus? Perhaps it needs a re-seat? @Jhancock.wm
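
For anyone triaging similar getsel output, here is a minimal Python sketch that splits the records above and counts critical events per bus/device/function, which makes it easier to see whether the same component keeps faulting. The helper names `parse_sel` and `fatal_locations` are hypothetical. The task does not say how these bus numbers map to `lspci` addresses, so the sketch keeps them as the strings printed.

```python
import re
from collections import Counter

# Sample text copied from the getsel output in this task.
SEL = """\
-------------------------------------------------------------------------------
Record:      1014
Date/Time:   04/17/2024 02:40:42
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 101 device 0 function 0.
-------------------------------------------------------------------------------
Record:      1015
Date/Time:   04/17/2024 02:40:42
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 100 device 4 function 0.
-------------------------------------------------------------------------------
"""

BDF_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def parse_sel(text):
    """Split on the dashed separator lines and return one dict per record."""
    records = []
    for chunk in re.split(r"^-{5,}$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.strip().splitlines():
            # Partition on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def fatal_locations(records):
    """Count Critical records per (bus, device, function) as printed."""
    counts = Counter()
    for rec in records:
        match = BDF_RE.search(rec.get("Description", ""))
        if rec.get("Severity") == "Critical" and match:
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__":
    for location, count in fatal_locations(parse_sel(SEL)).items():
        print(location, count)
```

Run against a longer log, a location that shows up both before and after a part swap would suggest the replaced part was not the only culprit.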

Change #1020238 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "site.pp: move elastic2088 back into production"

https://gerrit.wikimedia.org/r/1020238

Change #1020238 merged by Ryan Kemper:

[operations/puppet@production] Revert "site.pp: move elastic2088 back into production"

https://gerrit.wikimedia.org/r/1020238

bking removed bking as the assignee of this task. Wed, Apr 17, 1:12 PM
bking updated Other Assignee, removed: RKemper.
bking updated the task description.

@RKemper I am going to check it out and get back in touch with Dell. These are the same errors we were getting before the card was replaced.

Tried to run a diagnostic from the Lifecycle Controller. It halted because of a DIMM error on B4. The DIMM has been replaced. Re-running the diagnostic to check for any more issues.

All tests passed on the diagnostic, including the PCI bus. It's pinging on the iDRAC and the network IPs.
@RKemper give it another go. @ me if you run into an issue again.

RKemper claimed this task.

Looks good on our end, thanks!