Maniphest T291963

hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet
Closed, Resolved · Public · Request

Assigned To

Authored By

	• Bstorm
	Sep 28 2021, 4:07 PM

Description

- Provide FQDN of system.
- If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- Put system into a failed state in Netbox.
- Provide urgency of request, along with justification (redundancy, dependencies, etc)

System is fully down at the moment. Redundant partner server is now working. Will try to bring up server via iDRAC. This is a highly active wikireplicas database server and the service is in a degraded state. Medium urgency I guess.

- Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)

The only logs that really work are on the web console.
The Lifecycle Controller (LC) logs from the web console show:

2021-09-28 16:35:13 CPU0001 CPU 1 has a thermal trip (over-temperature) event.

Log Sequence Number:
225
Detailed Description:
The processor temperature increased beyond the operational range. Thermal protection shut down the processor. Factors external to the processor may have induced this exception.
Recommended Action:
Review logs for fan failures, replace failed fans. If no fan failures are detected, check inlet temperature (if available) and reinstall processor heatsink.

- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
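The Lifecycle Controller entry quoted above follows a "timestamp, message ID, message" layout. As a minimal sketch, assuming other lines in the web console log use the same layout (which this task does not confirm), such a line could be split into fields as shown here. The names `LogEntry`, `parse_lc_line` and `is_thermal_trip` are hypothetical helpers for illustration, not part of any Dell or Wikimedia tooling.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    """One parsed log line (hypothetical structure for illustration)."""
    timestamp: datetime
    message_id: str
    message: str


# Assumed layout: "YYYY-MM-DD HH:MM:SS <MESSAGE_ID> <free text>"
_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+([A-Z]+\d+)\s+(.*)$"
)


def parse_lc_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None if the line does not match the layout."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return LogEntry(timestamp, match.group(2), match.group(3))


def is_thermal_trip(entry: LogEntry) -> bool:
    """Flag entries whose message text mentions a thermal trip."""
    return "thermal trip" in entry.message.lower()


entry = parse_lc_line(
    "2021-09-28 16:35:13 CPU0001 CPU 1 has a thermal trip "
    "(over-temperature) event."
)
if entry is not None and is_thermal_trip(entry):
    print(entry.timestamp.isoformat(), entry.message_id)
```

Matching on the message text rather than on the ID alone is a deliberate choice in this sketch: the task only shows one message ID, so the code does not assume what other IDs mean.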

Details

	Subject	Repo	Branch	Lines +/-
	clouddb1020: Disable notifications	operations/puppet	production	+1 -0
	Adding dhcpd file and site.pp for new puppetmaster servers	operations/puppet	production	+16 -0

Related Objects

Status	Subtype	Assigned	Task
Duplicate		None	T291961 clouddb1020 crash
Resolved	Request	• Cmjohnson	T291963 hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet
Resolved		None	T292850 Re-enable clouddb1020 wikireplica (analytics s5 and s8)

Event Timeline

• Bstorm created this task. Sep 28 2021, 4:07 PM

Restricted Application added a subscriber: Aklapper. Sep 28 2021, 4:07 PM

• Bstorm added a parent task: T291961: clouddb1020 crash. Sep 28 2021, 4:08 PM

• Bstorm updated the task description. Sep 28 2021, 4:12 PM

• Bstorm added projects: cloud-services-team (Hardware), Data-Services, ops-eqiad.

• Bstorm merged a task: T291961: clouddb1020 crash.

• Bstorm moved this task from Backlog to Wiki replicas on the Data-Services board.

• Bstorm added a subscriber: • Marostegui.

This does not seem related to T289159 as it is a different rack, but you never know.

Mentioned in SAL (#wikimedia-cloud) [2021-09-28T16:21:45Z] <bstorm> powering on clouddb1020 via remote console T291963

Mentioned in SAL (#wikimedia-cloud) [2021-09-28T16:23:20Z] <bstorm> downtime for clouddb1020 to reduce re-pages in case this goes badly T291963

That's a big nope from the server on restarting via console. It has a processor reporting bad voltage and other fun. System Event Log is attached.

sel.csv (380 B)

• Bstorm updated the task description. Sep 28 2021, 4:26 PM

• Bstorm updated the task description.

RhinosF1 subscribed. Sep 28 2021, 4:33 PM

Maintenance_bot added a project: SRE. Sep 28 2021, 4:45 PM

• Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. Sep 28 2021, 5:17 PM

Change 724478 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] Adding dhcpd file and site.pp for new puppetmaster servers

https://gerrit.wikimedia.org/r/724478

gerritbot added a project: Patch-For-Review. Sep 28 2021, 5:43 PM

Change 724478 merged by Cmjohnson:

[operations/puppet@production] Adding dhcpd file and site.pp for new puppetmaster servers

https://gerrit.wikimedia.org/r/724478

Maintenance_bot removed a project: Patch-For-Review. Sep 28 2021, 6:10 PM

Change 724586 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] clouddb1020: Disable notifications

https://gerrit.wikimedia.org/r/724586

Change 724586 merged by Marostegui:

[operations/puppet@production] clouddb1020: Disable notifications

https://gerrit.wikimedia.org/r/724586

Maintenance_bot removed a project: Patch-For-Review. Sep 29 2021, 5:10 AM

LSobanski subscribed. Sep 29 2021, 11:32 AM

dbproxy1019 was alerting on haproxy failover; I have ack'ed the alert.

[09:54:18]  <+icinga-wm> ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2 Marostegui https://phabricator.wikimedia.org/T291961 https://wikitech.wikimedia.org/wiki/HAProxy
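The acknowledgement above reports the proxy's view of its backends as "servers up 14 down 2". As a small sketch, assuming the check always words its counts this way (only this one message is shown in the task), the numbers could be pulled out of the alert text like this. `failover_counts` is a hypothetical helper name, not an existing Icinga or HAProxy function.

```python
import re
from typing import Optional, Tuple

# Assumed wording, taken from the single alert quoted in this task.
_COUNTS = re.compile(r"servers up (\d+) down (\d+)")


def failover_counts(alert: str) -> Optional[Tuple[int, int]]:
    """Return (servers_up, servers_down), or None if the wording differs."""
    match = _COUNTS.search(alert)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


alert = (
    "ACKNOWLEDGEMENT - haproxy failover on dbproxy1019 is CRITICAL: "
    "CRITICAL check_failover servers up 14 down 2 Marostegui"
)
print(failover_counts(alert))
```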

Ticket opened with Dell, SR1071934085

CPU1 replaced; the BIOS updated during the reboot. No errors; cleared the log.

@Marostegui I think this host is ready to get moving again. Would you like to check it and try getting replication up again? I'm hanging back in case you'd rather I don't mess with the state for those purposes.

Manuel is out, adding @Kormat.

@Bstorm Should this task be reopened or is there another task for follow up?

In T291963#7412052, @LSobanski wrote:

@Bstorm Should this task be reopened or is there another task for follow up?

A marvelous question. This seems scoped to the hardware in some ways. I'll make a subtask.

Enabled notifications for this host.

rook closed subtask T292850: Re-enable clouddb1020 wikireplica (analytics s5 and s8) as Resolved. Jan 19 2023, 2:01 PM

	F34660654: sel.csv
	Sep 28 2021, 4:25 PM
