clouddb1019 down
Closed, Resolved, Public

Description

[14:06:06]  <+icinga-wm> PROBLEM - Host clouddb1019 is DOWN: PING CRITICAL - Packet loss = 100%

Event Timeline

Restricted Application added a subscriber: Aklapper.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   04/09/2026 12:05:09
Source:      system
Severity:    Critical
Description: CPU 2 MEM345 VDDQ PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   04/09/2026 12:06:18
Source:      system
Severity:    Critical
Description: The system board Pfault fail-safe voltage is outside of range.
-------------------------------------------------------------------------------

ops-eqiad can you check on site? The above errors seem HW related.
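For triage, the SEL excerpt above can be filtered mechanically. The following is only a minimal sketch: it assumes the dump keeps the `Key: value` layout and dashed separators shown in this task, and the function names `parse_sel` and `critical_records` are hypothetical, not part of any existing tooling.

```python
import re
from typing import Dict, List

# Assumption: records are separated by lines made only of dashes,
# and each field is a "Key: value" line, as in the excerpt on this task.
SEPARATOR = re.compile(r"^-{5,}\s*$", re.MULTILINE)
FIELD = re.compile(r"^(Record|Date/Time|Source|Severity|Description):\s*(.*)$")


def parse_sel(text: str) -> List[Dict[str, str]]:
    """Split a SEL dump on separator lines and collect the fields of each record."""
    records = []
    for chunk in SEPARATOR.split(text):
        fields = {}
        for line in chunk.splitlines():
            match = FIELD.match(line.strip())
            if match:
                fields[match.group(1)] = match.group(2).strip()
        if fields:  # skip empty chunks before/after separators
            records.append(fields)
    return records


def critical_records(text: str) -> List[Dict[str, str]]:
    """Return only the records whose Severity field is exactly 'Critical'."""
    return [r for r in parse_sel(text) if r.get("Severity") == "Critical"]


# The two records quoted in this task, copied verbatim.
SAMPLE = """\
-------------------------------------------------------------------------------
Record:      8
Date/Time:   04/09/2026 12:05:09
Source:      system
Severity:    Critical
Description: CPU 2 MEM345 VDDQ PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   04/09/2026 12:06:18
Source:      system
Severity:    Critical
Description: The system board Pfault fail-safe voltage is outside of range.
-------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    for rec in critical_records(SAMPLE):
        print(rec["Record"], rec["Date/Time"], rec["Description"])
```

Timestamps are kept as plain strings, since the excerpt does not say whether the date is month-first or day-first.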

This server is out of warranty. I performed a flea power drain and it did come up. I am updating the firmware right now, so you might see it reboot a few times.

Thank you @Jclark-ctr - let us know when we can take over.
Thankfully its replacement will arrive soon (famous last words) (T405296)

@Marostegui the hardware error has cleared for now, but the system is reporting filesystem corruption and will need to be reimaged. Let’s keep the ticket open until it’s fully back up in case the error returns. This type of issue is usually indicative of a mainboard problem.

[  515.841068] XFS (dm-0): Internal error ltbno + ltlen > bno at line 1955 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_free_ag_extent+0x3f4/0x860 [xfs]
[  515.855757] XFS (dm-0): Corruption detected. Unmount and run xfs_repair
[  515.862376] 00000000: 1b 31 1e 11 00 00 00 00 68 84 00 00 00 00 00 00  .1......h.......
[  515.870384] XFS (dm-0): Internal error xfs_efi_item_recover at line 662 of file fs/xfs/xfs_extfree_item.c.  Caller xlog_recover_process_intents+0xa3/0x300 [xfs]
[  515.885782] XFS (dm-0): Corruption detected. Unmount and run xfs_repair
[  515.892395] XFS (dm-0): Internal error xfs_trans_cancel at line 1096 of file fs/xfs/xfs_trans.c.  Caller xfs_efi_item_recover+0x269/0x290 [xfs]
[  515.906089] XFS (dm-0): Corruption of in-memory data (0x8) detected at xfs_trans_cancel+0x146/0x150 [xfs] (fs/xfs/xfs_trans.c:1097).  Shutting down filesystem.
[  515.920387] XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
[  515.927787] XFS (dm-0): Failed to recover intents

Thanks John - let me reimage it now

Change #1269467 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] installserver: Wipe clouddb1019 entirely

https://gerrit.wikimedia.org/r/1269467

Change #1269474 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] clouddb1019: Disable notifications

https://gerrit.wikimedia.org/r/1269474

Change #1269474 merged by Marostegui:

[operations/puppet@production] clouddb1019: Disable notifications

https://gerrit.wikimedia.org/r/1269474

Change #1269467 merged by Marostegui:

[operations/puppet@production] installserver: Wipe clouddb1019 entirely

https://gerrit.wikimedia.org/r/1269467

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie

Marostegui triaged this task as Medium priority. Apr 10 2026, 5:10 AM
Marostegui moved this task from Triage to In progress on the DBA board.

@Jclark-ctr I am not able to reimage the host because it is not rebooting. Can you check onsite what's on the screen? I've tried several times to reboot it manually, but there's no output at all.

Marostegui mentioned this in Unknown Object (Task). Apr 10 2026, 5:43 AM

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors:

  • clouddb1019 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1019.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie

I have not had any luck with getting it to power on. I will start Monday with pulling parts from decom servers to try to get it back up.

Thank you - let me know if I can help

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors:

  • clouddb1019 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1019.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie

Found a fried circuit on the board. Replaced the board and moved the CPUs over since the new ones did not match. The fault persisted on the new board with the old CPUs.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie

Thanks, John, for trying to swap so many parts. Unfortunately it didn't work, so I am going to close this task and open a new one to decommission this host.
This host is dead.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors:

  • clouddb1019 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1019.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Change #1272216 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] clouddb1019.yaml: Remove file

https://gerrit.wikimedia.org/r/1272216

Change #1272216 merged by Marostegui:

[operations/puppet@production] clouddb1019.yaml: Remove file

https://gerrit.wikimedia.org/r/1272216

Change #1273769 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Productionize clouddb1032

https://gerrit.wikimedia.org/r/1273769

Change #1273769 abandoned by Marostegui:

[operations/puppet@production] mariadb: Productionize clouddb1032

Reason:

Wrong patch

https://gerrit.wikimedia.org/r/1273769