
hw troubleshooting: CPU1 thermal fault for wdqs1015.eqiad.wmnet
Closed, Resolved | Public | Request

Description

Template

FQDN: wdqs1015.eqiad.wmnet
Depooled: Yes
Netbox Status: Failed (https://netbox.wikimedia.org/dcim/devices/4253/)
Urgency: High-Medium (wdqs-main is less robust to host failures given it routinely falls under excessive load)

Issue

wdqs1015 thermal-tripped; SEL/LC log shows an escalating CPU 1 (CPU.Socket.1) thermal fault:

2026-05-28 02:31 — CPU 1 over-temp + throttle, followed by machine-check error (self-recovered)
2026-05-31 09:52 — CPU 1 machine-check error (reboot)
2026-05-31 10:22 — CPU 1 thermal trip (over-temperature) -> host powered off, down since

System Board Inlet Temp is normal (25C), so this is likely localized to CPU 1's cooling, not rack airflow.
Please inspect CPU 1 heatsink/fan seating, thermal paste, and fan health; reseat/replace as needed.
(Host is already downtimed and depooled)
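
For reference, the reasoning above (repeated CPU 1 thermal events with a normal inlet reading points at local cooling rather than rack airflow) can be sketched as a small check. This is only an illustrative sketch: the event labels, the `likely_scope` helper, and the 35C inlet warning threshold are assumptions made for the example, not values from the SEL, the vendor, or any existing tooling.

```python
from datetime import datetime

# Events transcribed from the timeline in this task; the short "kind"
# labels are our own shorthand, not SEL field values.
EVENTS = [
    ("2026-05-28 02:31", "over_temp"),
    ("2026-05-28 02:31", "machine_check"),
    ("2026-05-31 09:52", "machine_check"),
    ("2026-05-31 10:22", "thermal_trip"),
]

# Hypothetical inlet warning threshold in Celsius (an assumption, not a vendor value).
INLET_WARN_C = 35.0

THERMAL_KINDS = {"over_temp", "thermal_trip"}


def likely_scope(inlet_temp_c, events):
    """Guess whether a thermal fault is local to the CPU or ambient."""
    thermal = [kind for _, kind in events if kind in THERMAL_KINDS]
    if not thermal:
        return "no thermal fault"
    if inlet_temp_c >= INLET_WARN_C:
        return "rack airflow / ambient"
    return "localized to CPU cooling"


def escalation_span(events):
    """Time between the first logged event and the last one."""
    stamps = [datetime.strptime(ts, "%Y-%m-%d %H:%M") for ts, _ in events]
    return max(stamps) - min(stamps)


if __name__ == "__main__":
    print(likely_scope(25.0, EVENTS))
    print(escalation_span(EVENTS))
```

With the 25C inlet reading reported here, the sketch lands on the same conclusion as the description: the fault looks local to CPU 1's cooling.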

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2026-06-01T20:37:40Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wdqs1015.eqiad.wmnet with reason: T427852 hw failure

@RKemper, this server is out of warranty, but I am looking at it right now.

@RKemper @wiki_willy I have gone through all decommissioned servers and do not have a matching Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz available to replace CPU1.

I also reviewed all active EOL servers and found only three servers with the same CPU. Those servers were scheduled for decommissioning in December 2024.

wdqs1011 - Active - refreshed in T376670 in Dec '24
wdqs1012 - Active - refreshed in T376670 in Dec '24
wdqs1013 - Active - refreshed in T376670 in Dec '24

I attempted the firmware updates, but after the reboot the server became unresponsive and will no longer boot.

At this point, I would need a compatible CPU to test whether the issue is with the motherboard or the CPU itself.

Discussed with @RKemper via IRC. He mentioned that we should decommission this one if the replacement (T423314) is already here and is racked and cabled, pending a Puppet fix to image the server.

Closing this ticket. Opened decom ticket T428582 for Data Platform.

Change #1299606 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: decom wdqs1015

https://gerrit.wikimedia.org/r/1299606