db2189 crashed
Closed, Resolved · Public

Description

db2189 went down and HW logs only show multiple entries of:

-------------------------------------------------------------------------------
Record:      98
Date/Time:   05/26/2026 11:51:03
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      99
Date/Time:   05/26/2026 11:51:03
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------

Is there anything that can be done on-site to try to troubleshoot this further and check if there's something else failing?
Thanks!

Related Objects

Status    Subtype  Assigned     Task
Open               None
Resolved           Jhancock.wm

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a parent task: Restricted Task.

working on it. might reboot a few times.

Mentioned in SAL (#wikimedia-operations) [2026-05-27T14:38:57Z] <fceratto@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 99 days, 0:00:00 on db2189.codfw.wmnet with reason: crashed T427376

@FCeratto-WMF Okay, the error code we got was inconclusive. It could mean a lot of things, including just out-of-date firmware. I've updated the BIOS and the iDRAC. I do see a CPU machine check error, but it should also have been resolved by the firmware update. Should be good to add this one back now.

Thanks Jenn - I will repool the host.