Template
FQDN: wdqs1015.eqiad.wmnet
Depooled: Yes
Netbox Status: Failed (https://netbox.wikimedia.org/dcim/devices/4253/)
Urgency: High-Medium (wdqs-main is less robust to host failures because it routinely comes under excessive load)
Issue
wdqs1015 thermal-tripped; SEL/LC log shows an escalating CPU 1 (CPU.Socket.1) thermal fault:
2026-05-28 02:31 - CPU 1 over-temp + throttle, followed by machine-check error (self-recovered)
2026-05-31 09:52 - CPU 1 machine-check error (reboot)
2026-05-31 10:22 - CPU 1 thermal trip (over-temperature) -> host powered off, down since
System Board Inlet Temp is normal (25C), so this is likely localized to CPU 1's cooling, not rack airflow.
Please inspect CPU 1 heatsink/fan seating, thermal paste, and fan health; reseat/replace as needed.
(Host is already downtimed and depooled)