hw troubleshooting: Unidentified for db1246.eqiad.wmnet
Closed, Resolved; Public; Request

Assigned To

Authored By

	ABran-WMF
	Mar 12 2024, 2:22 PM

Description

- Provide FQDN of system.

db1246.eqiad.wmnet

- If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.

machine is depooled and can be taken down anytime

- Put system into a failed state in Netbox.

done

- Provide urgency of request, along with justification (redundancy, dependencies, etc)

medium

- Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)

IPMI logged: The system board BP1 PG voltage is outside of range. Tue Mar 12 2024 14:52:10
System is stuck on initializing after a hard reboot: https://usercontent.irccloud-cdn.com/file/iqBkpxwM/image.png

- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Details

	Subject	Repo	Branch	Lines +/-
	db1246: Disable notifications	operations/puppet	production	+1 -0
	installserver: Format /srv in db1246	operations/puppet	production	+1 -1


Related Objects

Mentioned In: T363119: db1246 crashed
T361968: db1246 crashed

Event Timeline

ABran-WMF changed the task status from Open to In Progress. Mar 12 2024, 2:22 PM

ABran-WMF created this task.

Change 1010530 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1246: Disable notifications

https://gerrit.wikimedia.org/r/1010530

Change 1010530 merged by Arnaudb:

[operations/puppet@production] db1246: Disable notifications

https://gerrit.wikimedia.org/r/1010530

Maintenance_bot added a project: SRE. Mar 12 2024, 2:29 PM

Maintenance_bot removed a project: Patch-For-Review.

For what it's worth, this error is also present in the HW logs. Even though it is a month old, it might be an indication of something else:

-------------------------------------------------------------------------------
Record:      26
Date/Time:   02/20/2024 06:53:35
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 4 device 0 function 0.
-------------------------------------------------------------------------------
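To compare a record like this against other hardware log entries, a minimal Python sketch can split it into fields and normalise the PCI location. The function names are hypothetical, and the sketch assumes the bus/device/function numbers in the description are decimal, which the log does not state.

```python
import re

# The record above, copied verbatim (separator lines omitted).
RECORD = """\
Record:      26
Date/Time:   02/20/2024 06:53:35
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 4 device 0 function 0.
"""


def parse_record(text):
    """Split 'Key: value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def pci_address(description):
    """Return 'bb:dd.f' (hex, lspci-style) or None if no location is mentioned.

    Assumption: the numbers printed in the log are decimal.
    """
    match = re.search(r"bus (\d+) device (\d+) function (\d+)", description)
    if match is None:
        return None
    bus, device, function = (int(group) for group in match.groups())
    return f"{bus:02x}:{device:02x}.{function:x}"


record = parse_record(RECORD)
print(record["Severity"], pci_address(record["Description"]))
```

The resulting address could then be looked up with `lspci -s` on the host to see which device sits at that location.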

I've managed to get it to boot past GRUB, and it looks storage-related:

Starting default.target
[85598.425324] XFS (dm-0): Metadata corruption detected at xfs_agi_verify+0x11a/0x170 [xfs], xfs_agi block 0x1e3a5e02
[85598.435932] XFS (dm-0): Unmount and run xfs_repair
[85598.440731] XFS (dm-0): First 128 bytes of corrupted metadata buffer:
[85598.447172] 00000000: 58 41 47 49 00 00 00 01 00 00 00 01 03 c7 4b c0  XAGI..........K.
[85598.455178] 00000010: 00 00 00 40 00 00 00 03 00 00 00 01 00 00 00 3b  ...@...........;
[85598.463176] 00000020: 00 00 00 c0 ff ff ff ff ff ff ff ff 00 00 00 c1  ................
[85598.471176] 00000030: 00 00 00 c2 00 00 00 c3 00 00 00 c4 ff ff ff ff  ................
[85598.479175] 00000040: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
[85598.487177] 00000050: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
[85598.495183] 00000060: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
[85598.503182] 00000070: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
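For reading the dump, here is a minimal Python sketch that decodes the first 64 bytes as an XFS AGI header. The field order (magic, version, seqno, length, count, root, level, freecount, newino, dirino, then the unlinked buckets) is an assumption based on the on-disk `xfs_agi` layout, and the function names are hypothetical. It only labels the bytes; it does not show which check in `xfs_agi_verify` rejected the buffer.

```python
import struct

# First four rows (64 bytes) of the corrupted buffer, copied from the log above.
DUMP = bytes.fromhex(
    "58 41 47 49 00 00 00 01 00 00 00 01 03 c7 4b c0 "
    "00 00 00 40 00 00 00 03 00 00 00 01 00 00 00 3b "
    "00 00 00 c0 ff ff ff ff ff ff ff ff 00 00 00 c1 "
    "00 00 00 c2 00 00 00 c3 00 00 00 c4 ff ff ff ff"
)

# Assumed order of the leading big-endian 32-bit AGI fields.
FIELDS = ("magicnum", "versionnum", "seqno", "length", "count",
          "root", "level", "freecount", "newino", "dirino")
NULLAGINO = 0xFFFFFFFF  # "no inode" marker
HEADER_SIZE = 40        # 10 fields * 4 bytes


def decode_agi(buf):
    """Decode the fixed leading fields of the AGI header into a dict."""
    values = struct.unpack(">4s9I", buf[:HEADER_SIZE])
    return dict(zip(FIELDS, values))


def non_null_buckets(buf):
    """Return {bucket_index: inode} for unlinked buckets present in buf."""
    tail = buf[HEADER_SIZE:]
    count = len(tail) // 4
    buckets = struct.unpack(f">{count}I", tail[:count * 4])
    return {i: ino for i, ino in enumerate(buckets) if ino != NULLAGINO}


for name, value in decode_agi(DUMP).items():
    print(name, value if isinstance(value, bytes) else hex(value))
print("unlinked:", {i: hex(v) for i, v in non_null_buckets(DUMP).items()})
```

Under that assumed layout, the magic number and version decode cleanly (`XAGI`, version 1), and buckets 1 to 4 of the unlinked list hold entries.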

So I am wondering whether the whole backplane is broken. cc @wiki_willy

++ @VRiley-WMF & @Jclark-ctr for troubleshooting the hardware. (host was installed a few quarters ago)

Followed the Dell troubleshooting steps. Updated the BIOS firmware; iDRAC was already on the most recent version. Multiple BIOS firmware versions flagged as urgent have come out.

Voltage on both PSUs looks good.

Dell troubleshooting steps

Check the electrical environment, exclude environment voltage instability as the cause of server failure.
Update the iDRAC and BIOS to the latest version

Note: How to update a PowerEdge server is explained in this article.

Try to swap any component mentioned in the error message and update the firmware

- PSx PG Fail: PS means Power Supply. Try to swap PSU1 with PSU2 and check if the error follows the slot. Try to update the PSU firmware.
- NDC PG voltage: The NDC is the Network Card on the mainboard. Try to reseat the card and update the driver and firmware.
- System Board BP 1.5v PG voltage: Update the BIOS to the latest version.
- System Board fail-safe voltage: Update the BIOS to the latest version.
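The per-message advice in the list above can be sketched as a small lookup in Python. The patterns and the `suggest` function are hypothetical, transcribed from the four list entries only; this is not an official Dell mapping.

```python
import re

# (pattern, suggested action) pairs transcribed from the steps above.
ACTIONS = [
    (re.compile(r"PS\d+ PG Fail", re.IGNORECASE),
     "Swap PSU1 with PSU2, check whether the error follows the slot, "
     "and update the PSU firmware."),
    (re.compile(r"NDC PG voltage", re.IGNORECASE),
     "Reseat the network card and update its driver and firmware."),
    (re.compile(r"BP 1\.5v PG voltage", re.IGNORECASE),
     "Update the BIOS to the latest version."),
    (re.compile(r"fail-safe voltage", re.IGNORECASE),
     "Update the BIOS to the latest version."),
]


def suggest(message):
    """Return the first matching action, or None when nothing matches."""
    for pattern, action in ACTIONS:
        if pattern.search(message):
            return action
    return None


print(suggest("PS1 PG Fail"))
print(suggest("The system board BP1 PG voltage is outside of range."))
```

The message logged for this host says "BP1 PG voltage", which does not literally match the "BP 1.5v PG voltage" entry, so the sketch returns `None` for it and leaves that judgement to whoever is troubleshooting.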

Rebooting the server resulted in the same XFS errors, and the OS doesn't boot past mounting the filesystem. I am going to reimage it in case everything was corrupted by the crash. Hopefully this will fix it; if not, it means the storage is physically broken. Will report back.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db1246.eqiad.wmnet with OS bookworm

Change 1010896 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] installserver: Format /srv in db1246

https://gerrit.wikimedia.org/r/1010896

gerritbot added a project: Patch-For-Review. Mar 13 2024, 2:40 PM

Change 1010896 merged by Marostegui:

[operations/puppet@production] installserver: Format /srv in db1246

https://gerrit.wikimedia.org/r/1010896

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db1246.eqiad.wmnet with OS bookworm executed with errors:

db1246 (FAIL)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" db1246.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1002 for host db1246.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1002 for host db1246.eqiad.wmnet with OS bookworm completed:

db1246 (WARN)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403131504_marostegui_837493_db1246.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Updated Netbox status failed -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Maintenance_bot removed a project: Patch-For-Review. Mar 13 2024, 3:31 PM

So the filesystem was totally corrupted. A full reimage (deleting all the partitions) seems to have fixed it.
I am recloning the host right now and I will leave it replicating for a few days to make sure the storage is stable.

ABran-WMF moved this task from Triage to In progress on the DBA board. Mar 14 2024, 8:14 AM

Started to repool this host.

Maintenance_bot moved this task from In progress to Done on the DBA board. Mar 19 2024, 6:29 AM

Marostegui mentioned this in T361968: db1246 crashed. Apr 5 2024, 7:30 PM

Marostegui mentioned this in T363119: db1246 crashed. Apr 23 2024, 2:49 PM
