db1224 crashed - hardware error
Closed, Resolved. Public.

Description

db1224 got rebooted.
The following errors showed up on the iDRAC:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/08/2024 12:35:12
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 4 device 0 function 0.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/08/2024 12:35:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   01/08/2024 12:35:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   01/08/2024 12:35:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/08/2024 12:35:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/08/2024 12:35:13
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   01/08/2024 12:35:13
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   01/08/2024 12:35:13
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   01/08/2024 12:35:13
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   01/08/2024 12:35:13
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   01/08/2024 12:35:13
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   01/08/2024 12:35:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   01/08/2024 12:35:14
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
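
With this many records, it can help to filter the log down to the critical entries. The following is a minimal sketch, not a tool used in this task: it assumes the log text has the same layout as the paste above, and the helper names (`parse_sel`, `pci_address`) are hypothetical. It also assumes the bus, device and function numbers in the description are decimal, which the log itself does not confirm.

```python
import re

# Patterns match the layout of the log pasted above (an assumption).
SEPARATOR = re.compile(r"^-{5,}\s*$")
FIELD = re.compile(r"^(Record|Date/Time|Source|Severity|Description):\s*(.*)$")
BDF = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def parse_sel(text):
    """Split event log text into one dict per record, keyed by field name."""
    records, current = [], {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            # A dashed line closes the record collected so far.
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD.match(line.strip())
        if match:
            current[match.group(1)] = match.group(2).strip()
    if current:
        records.append(current)
    return records


def pci_address(description):
    """Return 'bb:dd.f' (hex) if the description names a PCI location.

    Assumes the numbers in the message are decimal.
    """
    match = BDF.search(description)
    if match is None:
        return None
    bus, device, function = (int(group) for group in match.groups())
    return f"{bus:02x}:{device:02x}.{function:x}"


SAMPLE = """\
-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/08/2024 12:35:12
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 4 device 0 function 0.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/08/2024 12:35:12
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
"""

for record in parse_sel(SAMPLE):
    if record.get("Severity") == "Critical":
        print(record["Record"], record["Date/Time"],
              pci_address(record.get("Description", "")))
```

On a host that is up, the resulting address could be passed to `lspci -s` to see which device sits at that location, provided the decimal assumption holds.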

This host is quite new.
@ops-eqiad please advise

Event Timeline

Marostegui renamed this task from "db1224 hardware error" to "db1224 crashed - hardware error". Jan 9 2024, 6:27 AM
Marostegui triaged this task as Medium priority.
Marostegui created this task.
Marostegui moved this task from Triage to In progress on the DBA board.

If a reboot/power off is needed, please let us know, as we'd need to depool+stop mariadb.

Confirmed: Service Request 183160693 was successfully submitted.

@Marostegui Dell has requested firmware updates and a reseat of the NetXtreme BCM5720 Gigabit Ethernet PCIe device on bus 4. When is a good time to take the server down for the reseat and firmware updates?

@Jclark-ctr I can switch it off any day starting tomorrow. When would work for you?

Yes, tomorrow works for me. Just let me know. Thanks.

Great, I will comment on this task once it is off. Thank you!

Change 993833 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1224: Disable notifications

https://gerrit.wikimedia.org/r/993833

Mentioned in SAL (#wikimedia-operations) [2024-01-30T06:29:30Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1224 T354591', diff saved to https://phabricator.wikimedia.org/P55858 and previous config saved to /var/cache/conftool/dbconfig/20240130-062930-root.json

Change 993833 merged by Marostegui:

[operations/puppet@production] db1224: Disable notifications

https://gerrit.wikimedia.org/r/993833

@Jclark-ctr this host is now off, you can proceed whenever you want.

@Marostegui firmware updates have been completed and the host is ready to be put back in service.

Thanks - I can reach the host. I will take it from here. Thank you!

I have started to repool this host. Thanks for your help John!