clouddb1020 crash
Closed, Duplicate | Public

Description

Notification Type: PROBLEM
Host: clouddb1020
State: DOWN
Address: 10.64.48.11
Info: PING CRITICAL - Packet loss = 100%

Date/Time: Tue Sept 28 15:37:06 UTC 2021

Acknowledged by :

Web console shows thermal event on CPU 1

LC logs from the web console show:

2021-09-28 16:35:13 CPU0001 CPU 1 has a thermal trip (over-temperature) event.

Log Sequence Number:
225
Detailed Description:
The processor temperature increased beyond the operational range. Thermal protection shut down the processor. Factors external to the processor may have induced this exception.
Recommended Action:
Review logs for fan failures, replace failed fans. If no fan failures are detected, check inlet temperature (if available) and reinstall processor heatsink.

Event Timeline

Change 724457 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] wikireplicas: depool clouddb1020

https://gerrit.wikimedia.org/r/724457

Change 724457 merged by Bstorm:

[operations/puppet@production] wikireplicas: depool clouddb1020

https://gerrit.wikimedia.org/r/724457

Mentioned in SAL (#wikimedia-cloud) [2021-09-28T15:58:44Z] <bstorm> depooled clouddb1020 for repair T291961

dbproxy1019 was alerting on haproxy failover; I have ack'ed the alert.

[09:54:18]  <+icinga-wm> ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2 Marostegui https://phabricator.wikimedia.org/T291961 https://wikitech.wikimedia.org/wiki/HAProxy