Degraded RAID on elastic2088
Closed, ResolvedPublic

Assigned To

Authored By

	ops-monitoring-bot
	Mon, Apr 1, 9:48 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic2088. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda2[0] sdb2[1](F)
      78058496 blocks super 1.2 [2/1] [U_]
      
md1 : active (auto-read-only) raid1 sda3[0] sdb3[1]
      999424 blocks super 1.2 [2/2] [UU]
      	resync=PENDING
      
md2 : active raid0 sda4[0] sdb4[1]
      3591647232 blocks super 1.2 512k chunks
      
unused devices: <none>
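
For readers who want to triage a snapshot like the one above programmatically, here is a minimal Python sketch. It is not the `get-raid-status-md` plugin and makes no claim about how that plugin works. The names `parse_mdstat`, `MDSTAT_SAMPLE` and the regexes are hypothetical, and the parser assumes only the mdstat layout shown in this task (an `mdN : state level members` line followed by an optional `[expected/active] [U_]` status line).

```python
import re

# Snapshot copied from this task's description.
MDSTAT_SAMPLE = """\
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda2[0] sdb2[1](F)
      78058496 blocks super 1.2 [2/1] [U_]

md1 : active (auto-read-only) raid1 sda3[0] sdb3[1]
      999424 blocks super 1.2 [2/2] [UU]
      \tresync=PENDING

md2 : active raid0 sda4[0] sdb4[1]
      3591647232 blocks super 1.2 512k chunks

unused devices: <none>
"""

# "md0 : active raid1 sda2[0] sdb2[1](F)", with an optional "(auto-read-only)" note.
ARRAY_RE = re.compile(r"^(md\d+) : (\S+)(?: \([^)]*\))? (\S+) (.*)$")
# One member such as "sdb2[1](F)"; the second group is non-empty for failed members.
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\(F\))?")
# "[2/1] [U_]": expected members, active members, per-member up/down flags.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return {array_name: info} for the mdstat layout shown in this task (assumption)."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, state, level, members = header.groups()
            failed = [dev for dev, flag in MEMBER_RE.findall(members) if flag]
            current = {"state": state, "level": level,
                       "failed": failed, "degraded": bool(failed)}
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, active, flags = status.groups()
            # Fewer active members than expected, or a "_" slot, means degraded.
            if int(active) < int(expected) or "_" in flags:
                current["degraded"] = True
    return arrays


if __name__ == "__main__":
    for name, info in parse_mdstat(MDSTAT_SAMPLE).items():
        print(name, info)
```

On the snapshot above this flags only md0, with sdb2 as the failed member. Layouts not shown in this task (for example inactive arrays with no RAID level on the header line) are not handled.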

Details

Subject	Repo	Branch	Lines +/-
Revert "site.pp: move elastic2088 back into production"	operations/puppet	production	+5 -0
site.pp: move elastic2088 back into production	operations/puppet	production	+0 -5
elastic: move failing host elastic2088 to insetup	operations/puppet	production	+5 -0

Related Objects

Mentioned In: T353878: Service implementation for elastic2087-2109
Mentioned Here: T361286: Fatal error detected on elastic2088

Event Timeline

ops-monitoring-bot created this task. Mon, Apr 1, 9:48 PM

Restricted Application added subscribers: Marostegui, Aklapper. Mon, Apr 1, 9:48 PM

Hello DC Ops, this host was acting flaky before (see T361286). I'm not sure what the next steps should be, but just wanted to provide that context.

Marostegui unsubscribed. Tue, Apr 2, 4:50 AM

Jhancock.wm moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board. Tue, Apr 2, 2:14 PM

This error recurred.

A fatal error was detected on a component at bus 101 device 0 function 0.

I'm gonna open a troubleshooting ticket with Dell because I'm not 100% sure which device is having the errors. It's likely to be the HBA card, but I want them to confirm.

SR188057676

bking edited projects, added Data-Platform-SRE; removed SRE. Tue, Apr 2, 6:43 PM

RKemper subscribed. Tue, Apr 2, 7:01 PM

Change #1016427 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: move failing host elastic2088 to insetup

https://gerrit.wikimedia.org/r/1016427

Change #1016427 merged by Ryan Kemper:

[operations/puppet@production] elastic: move failing host elastic2088 to insetup

https://gerrit.wikimedia.org/r/1016427

Hello DC Ops,

This host is unreachable via SSH. We went ahead and shut it off from the DRAC; it's all yours if you need to send it back/replace hardware/etc.

bking mentioned this in T353878: Service implementation for elastic2087-2109. Thu, Apr 4, 3:59 PM

Gehel triaged this task as High priority. Thu, Apr 4, 6:56 PM

Gehel moved this task from Incoming to 2024.03.25 - 2024.04.14 on the Data-Platform-SRE board.

Gehel edited projects, added Data-Platform-SRE (2024.03.25 - 2024.04.14); removed Data-Platform-SRE.

RKemper moved this task from Backlog to Blocked / Waiting on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board. Thu, Apr 4, 6:59 PM

Follow-up: still going back and forth with Dell.

The host is alerting in Icinga; should it be downtimed?

Mentioned in SAL (#wikimedia-operations) [2024-04-10T13:30:53Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on elastic2088.codfw.wmnet with reason: T361525

Mentioned in SAL (#wikimedia-operations) [2024-04-10T13:30:58Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on elastic2088.codfw.wmnet with reason: T361525

Sorry for the noise, I've just downtimed this host.

Update: Dell finally agreed to replace the HBA card. I sent the shipping address confirmation just now. Hopefully it'll be here tomorrow, Monday morning at the latest.

@bking I got the HBA card replaced and it booted without any issues that I can find in the iDRAC. Can you check from the CLI to see if the RAID is still degraded?

@Jhancock.wm looks good, thanks for your help! I'm taking off the DC Ops tags and putting this back in our queue to finish off.

bking claimed this task. Fri, Apr 12, 7:40 PM

bking updated Other Assignee, added: RKemper.

bking removed a project: ops-codfw.

bking moved this task from Blocked / Waiting to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.

bking removed a subscriber: Jhancock.wm.

Gehel edited projects, added Data-Platform-SRE (2024.04.15 - 2024.05.05); removed Data-Platform-SRE (2024.03.25 - 2024.04.14). Mon, Apr 15, 12:39 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.

Change #1020375 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] site.pp: move elastic2088 back into production

https://gerrit.wikimedia.org/r/1020375

Change #1020375 merged by Ryan Kemper:

[operations/puppet@production] site.pp: move elastic2088 back into production

https://gerrit.wikimedia.org/r/1020375

Mentioned in SAL (#wikimedia-operations) [2024-04-17T02:48:31Z] <ryankemper> T361525 Trying to powercycle elastic2088 thru mgmt port (host not responding to ssh)

Was able to get a Puppet run on elastic2088, but since that run a couple of hours ago the host has been unreachable over SSH (the connection hangs indefinitely). Seeing some concerning entries in the DRAC via getsel on elastic2088.mgmt.codfw.wmnet:

-------------------------------------------------------------------------------
Record:      1014
Date/Time:   04/17/2024 02:40:42
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 101 device 0 function 0.
-------------------------------------------------------------------------------
Record:      1015
Date/Time:   04/17/2024 02:40:42
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 100 device 4 function 0.
-------------------------------------------------------------------------------

Sounds like some sort of communication issue with the PCI bus? Perhaps it needs a re-seat? @Jhancock.wm
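
To compare entries like these across incidents (the same bus 101 device 0 function 0 error also appears earlier in this task), a small Python sketch can pull the bus/device/function triple out of each critical record. This is only an illustration: `parse_sel`, `fatal_locations` and `SEL_SAMPLE` are hypothetical names, the input format is assumed to be exactly the dashed-separator `Key: value` layout pasted above, and the numbers are parsed as base-10 integers as printed, which is also an assumption.

```python
import re

# The two records pasted in this task.
SEL_SAMPLE = """\
-------------------------------------------------------------------------------
Record:      1014
Date/Time:   04/17/2024 02:40:42
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 101 device 0 function 0.
-------------------------------------------------------------------------------
Record:      1015
Date/Time:   04/17/2024 02:40:42
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 100 device 4 function 0.
-------------------------------------------------------------------------------
"""

BDF_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def parse_sel(text):
    """Split the log on dashed separator lines and return one dict per record."""
    records = []
    for chunk in re.split(r"^-{5,}$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so "02:40:42" stays in the value.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def fatal_locations(records):
    """Return (bus, device, function) tuples for records with Severity Critical."""
    locations = []
    for rec in records:
        if rec.get("Severity") != "Critical":
            continue
        match = BDF_RE.search(rec.get("Description", ""))
        if match:
            locations.append(tuple(int(g) for g in match.groups()))
    return locations


if __name__ == "__main__":
    print(fatal_locations(parse_sel(SEL_SAMPLE)))
```

Collecting the triples this way makes it easy to see whether a later log repeats a location that was reported before a hardware swap.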

Change #1020238 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "site.pp: move elastic2088 back into production"

https://gerrit.wikimedia.org/r/1020238

Change #1020238 merged by Ryan Kemper:

[operations/puppet@production] Revert "site.pp: move elastic2088 back into production"

https://gerrit.wikimedia.org/r/1020238

Jhancock.wm added a project: ops-codfw. Wed, Apr 17, 1:12 PM

bking removed bking as the assignee of this task. Wed, Apr 17, 1:12 PM

bking updated Other Assignee, removed: RKemper.

bking updated the task description.

@RKemper I am going to check it out and get back in touch with Dell. These are the same errors we were getting before the card was replaced.

Tried to run a diagnostic from the Lifecycle Controller. It halted because of a DIMM error on B4. The DIMM has been replaced. Re-running the diagnostic to check for any more issues.

All tests passed on the diagnostic, including the PCI bus. It's pinging on the iDRAC and the network IPs.
@RKemper give it another go. @ me if you run into an issue again.

Looks good on our end, thanks!

Maintenance_bot removed a project: Patch-For-Review. Fri, Apr 26, 6:33 PM
