hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet
Closed, Resolved | Public | Request

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.)

The system is fully down at the moment. The redundant partner server is now working. I will try to bring the server up via iDRAC. This is a highly active wikireplicas database server, and the service is in a degraded state. Medium urgency, I guess.

The only logs that really work are the ones on the web console.
The Lifecycle Controller (LC) logs from the web console show:

2021-09-28 16:35:13 CPU0001 CPU 1 has a thermal trip (over-temperature) event.

Log Sequence Number:
225
Detailed Description:
The processor temperature increased beyond the operational range. Thermal protection shut down the processor. Factors external to the processor may have induced this exception.
Recommended Action:
Review logs for fan failures, replace failed fans. If no fan failures are detected, check inlet temperature (if available) and reinstall processor heatsink.
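
For anyone who wants to scan an exported copy of these logs for similar events, here is a minimal sketch in Python. It assumes each entry starts with a line laid out like the one quoted above (timestamp, message ID, free text). The names `LcEvent`, `parse_lc_line`, and `is_thermal_trip` are hypothetical and are not part of any Dell or Wikimedia tooling.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LcEvent(NamedTuple):
    """One log entry header (hypothetical structure for illustration)."""
    timestamp: datetime
    message_id: str
    message: str


# Assumed layout, based only on the line quoted in this task:
# "YYYY-MM-DD HH:MM:SS <MessageID> <free text>"
_LC_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+([A-Z]+\d+)\s+(.*)$"
)


def parse_lc_line(line: str) -> Optional[LcEvent]:
    """Return an LcEvent for a matching line, or None if it does not match."""
    match = _LC_LINE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return LcEvent(timestamp, match.group(2), match.group(3))


def is_thermal_trip(event: LcEvent) -> bool:
    """Flag entries whose text mentions a thermal trip."""
    return "thermal trip" in event.message.lower()


event = parse_lc_line(
    "2021-09-28 16:35:13 CPU0001 CPU 1 has a thermal trip (over-temperature) event."
)
```

Lines that do not follow the assumed layout (for example the "Detailed Description:" text) simply yield `None`, so the same function can be run over a whole export and the non-matching lines skipped.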

  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Event Timeline

This does not seem related to T289159 as it is a different rack, but you never know.

Mentioned in SAL (#wikimedia-cloud) [2021-09-28T16:21:45Z] <bstorm> powering on clouddb1020 via remote console T291963

Mentioned in SAL (#wikimedia-cloud) [2021-09-28T16:23:20Z] <bstorm> downtime for clouddb1020 to reduce re-pages in case this goes badly T291963

That's a big nope from the server on restarting via console. It has a processor reporting bad voltage and other fun. System Event Log is attached.

Change 724478 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding dhcpd file and site.pp for new puppetmaster servers

https://gerrit.wikimedia.org/r/724478

Change 724478 merged by Cmjohnson:

[operations/puppet@production] Adding dhcpd file and site.pp for new puppetmaster servers

https://gerrit.wikimedia.org/r/724478

Change 724586 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] clouddb1020: Disable notifications

https://gerrit.wikimedia.org/r/724586

Change 724586 merged by Marostegui:

[operations/puppet@production] clouddb1020: Disable notifications

https://gerrit.wikimedia.org/r/724586

dbproxy1019 was alerting on haproxy failover; I have ack'ed the alert.

[09:54:18]  <+icinga-wm> ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2 Marostegui https://phabricator.wikimedia.org/T291961 https://wikitech.wikimedia.org/wiki/HAProxy
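
As a rough illustration of how the up/down counts in an alert line like the one above could be pulled out for a quick check, here is a small Python sketch. It assumes the text always contains the phrase "servers up N down M" as it does here; `failover_counts` is a hypothetical helper, not part of the actual check_failover plugin.

```python
import re
from typing import Optional, Tuple

# Assumed wording, taken only from the alert quoted in this task.
_COUNTS = re.compile(r"servers up (\d+) down (\d+)")


def failover_counts(alert: str) -> Optional[Tuple[int, int]]:
    """Return (up, down) server counts, or None if the phrase is absent."""
    match = _COUNTS.search(alert)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


counts = failover_counts(
    "haproxy failover on dbproxy1019 is CRITICAL: "
    "CRITICAL check_failover servers up 14 down 2"
)
```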

Cmjohnson claimed this task.

CPU1 replaced; the BIOS was updated during the reboot. No errors; log cleared.

@Marostegui I think this host is ready to get moving again. Would you like to check it and try getting replication up again? I'm hanging back in case you'd rather I don't mess with the state for those purposes.

@Bstorm Should this task be reopened or is there another task for follow up?

A marvelous question. This seems scoped to the hardware in some ways. I'll make a subtask.

Enabled notifications for this host.