[14:06:06] <+icinga-wm> PROBLEM - Host clouddb1019 is DOWN: PING CRITICAL - Packet loss = 100%
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| In Progress | | fnegri | T382607 Decision request - Who runs wikireplicas cookbooks |
| Open | | None | T393388 [wikireplicas] Alert when views are out of sync |
| In Progress | | fnegri | T351637 [wikireplicas] add proper dry-run/diff mode to maintain-views |
| Resolved | | fnegri | T422806 [wikireplicas] Update grants for "maintainviews" user |
| Resolved | Request | Jhancock.wm | T408407 decommission es2028 |
| Open | | Marostegui | T422365 Migration to Debian Trixie of production database-related hosts |
| Resolved | | Marostegui | T406981 Compile and test MariaDB 10.11 on Debian Trixie |
| Resolved | | Marostegui | T407472 Install a testing db with Debian Trixie |
| Resolved | | Marostegui | T410369 Install Debian Trixie on one s1 host |
| Resolved | | Jclark-ctr | T410388 PXE failing on db1169 |
| Unknown Object (Task) | | | |
| Open | | None | T409162 Q2:rack/setup/install clouddb1026-1033 |
| Open | | fnegri | T415165 Install a clouddb hosts with Debian Trixie |
| Resolved | | Jclark-ctr | T422813 clouddb1019 down |
| Resolved | Request | Jclark-ctr | T423151 decommission clouddb1019.eqiad.wmnet |
Event Timeline
-------------------------------------------------------------------------------
Record: 8
Date/Time: 04/09/2026 12:05:09
Source: system
Severity: Critical
Description: CPU 2 MEM345 VDDQ PG voltage is outside of range.
-------------------------------------------------------------------------------
Record: 9
Date/Time: 04/09/2026 12:06:18
Source: system
Severity: Critical
Description: The system board Pfault fail-safe voltage is outside of range.
-------------------------------------------------------------------------------
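For anyone triaging a similar dump, the records above can be split into fields with a small script. This is only a sketch: the field layout is inferred from the two records pasted here, and the names `parse_sel` and `RECORD_RE` are made up for illustration, not part of any existing tooling.

```python
import re

# Sample copied from the event log pasted in this task.
SEL_DUMP = (
    "------------------------------------------------------------------------------- "
    "Record: 8 Date/Time: 04/09/2026 12:05:09 Source: system Severity: Critical "
    "Description: CPU 2 MEM345 VDDQ PG voltage is outside of range. "
    "------------------------------------------------------------------------------- "
    "Record: 9 Date/Time: 04/09/2026 12:06:18 Source: system Severity: Critical "
    "Description: The system board Pfault fail-safe voltage is outside of range. "
    "-------------------------------------------------------------------------------"
)

# Assumption: every record carries these five fields in this order and is
# terminated by a run of dashes (or the end of the text).
RECORD_RE = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<timestamp>\S+ \S+)\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>.*?)\s*(?=-{5,}|$)",
    re.DOTALL,
)


def parse_sel(text):
    """Return one dict per event-log record found in text."""
    return [match.groupdict() for match in RECORD_RE.finditer(text)]


if __name__ == "__main__":
    for rec in parse_sel(SEL_DUMP):
        if rec["severity"] == "Critical":
            print(rec["record"], rec["timestamp"], rec["description"])
```

Because the pattern separates fields on any whitespace, it should work whether the dump is pasted on one line or with one field per line.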
This server is out of warranty. I performed a flea power drain and it did come up. I am updating the firmware right now, so you might see it reboot a few times.
Thank you @Jclark-ctr - let us know when we can take over.
Thankfully its replacement will arrive soon (famous last words) (T405296)
@Marostegui the hardware error has cleared for now, but the system is reporting filesystem corruption and will need to be reimaged. Let’s keep the ticket open until it’s fully back up in case the error returns. This type of issue is usually indicative of a mainboard problem.
[ 515.841068] XFS (dm-0): Internal error ltbno + ltlen > bno at line 1955 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_ag_extent+0x3f4/0x860 [xfs]
[ 515.855757] XFS (dm-0): Corruption detected. Unmount and run xfs_repair
[ 515.862376] 00000000: 1b 31 1e 11 00 00 00 00 68 84 00 00 00 00 00 00 .1......h.......
[ 515.870384] XFS (dm-0): Internal error xfs_efi_item_recover at line 662 of file fs/xfs/xfs_extfree_item.c. Caller xlog_recover_process_intents+0xa3/0x300 [xfs]
[ 515.885782] XFS (dm-0): Corruption detected. Unmount and run xfs_repair
[ 515.892395] XFS (dm-0): Internal error xfs_trans_cancel at line 1096 of file fs/xfs/xfs_trans.c. Caller xfs_efi_item_recover+0x269/0x290 [xfs]
[ 515.906089] XFS (dm-0): Corruption of in-memory data (0x8) detected at xfs_trans_cancel+0x146/0x150 [xfs] (fs/xfs/xfs_trans.c:1097). Shutting down filesystem.
[ 515.920387] XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
[ 515.927787] XFS (dm-0): Failed to recover intents
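As a rough illustration of how the kernel output above could be checked automatically, the sketch below splits timestamped messages apart and reports which devices XFS flagged as corrupted. The helper names (`split_kernel_messages`, `xfs_corruption_devices`) are hypothetical, and the sample is a three-message excerpt of the log in this task.

```python
import re

# Excerpt of the kernel output pasted in this task (three of the messages).
KERNEL_LOG = (
    "[ 515.855757] XFS (dm-0): Corruption detected. Unmount and run xfs_repair "
    "[ 515.862376] 00000000: 1b 31 1e 11 00 00 00 00 68 84 00 00 00 00 00 00 .1......h....... "
    "[ 515.920387] XFS (dm-0): Please unmount the filesystem and rectify the problem(s)"
)

# A message starts at a "[ seconds.micros]" stamp and runs until the next stamp.
# Bracketed module tags such as "[xfs]" do not match the numeric stamp pattern.
KMSG_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s*(.*?)(?=\s*\[\s*\d+\.\d+\]|$)",
    re.DOTALL,
)


def split_kernel_messages(text):
    """Return (timestamp, message) tuples from a pasted kernel log."""
    return [(float(ts), msg.strip()) for ts, msg in KMSG_RE.findall(text)]


def xfs_corruption_devices(messages):
    """Return the set of devices with an 'XFS (dev): Corruption ...' message."""
    devices = set()
    for _, msg in messages:
        match = re.match(r"XFS \((?P<dev>[^)]+)\): Corruption", msg)
        if match:
            devices.add(match.group("dev"))
    return devices


if __name__ == "__main__":
    messages = split_kernel_messages(KERNEL_LOG)
    print(sorted(xfs_corruption_devices(messages)))
```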
Change #1269467 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] installserver: Wipe clouddb1019 entirely
Change #1269474 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] clouddb1019: Disable notifications
Change #1269474 merged by Marostegui:
[operations/puppet@production] clouddb1019: Disable notifications
Change #1269467 merged by Marostegui:
[operations/puppet@production] installserver: Wipe clouddb1019 entirely
Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
@Jclark-ctr I am not able to reimage the host because it is not rebooting. Can you check onsite what's on the screen? I've tried several times to reboot it manually but there's no output at all.
Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors:
- clouddb1019 (FAIL)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1019.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
I have not had any luck with getting it to power on. I will start Monday with pulling parts from decom servers to try to get it back up.
Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors:
- clouddb1019 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1019.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
Found a fried circuit on the board. Replaced the board and moved the old CPUs over, since the new ones did not match. The fault still continued on the new board with the old CPUs.
Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie
Thanks, John, for trying to swap so many parts. Unfortunately it didn't work, so I am going to close this task and open a new one to decommission this host.
This host is dead.
Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host clouddb1019.eqiad.wmnet with OS trixie executed with errors:
- clouddb1019 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console clouddb1019.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
Change #1272216 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] clouddb1019.yaml: Remove file
Change #1272216 merged by Marostegui:
[operations/puppet@production] clouddb1019.yaml: Remove file
Change #1273769 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] mariadb: Productionize clouddb1032
Change #1273769 abandoned by Marostegui:
[operations/puppet@production] mariadb: Productionize clouddb1032
Reason:
Wrong patch