Cloudvirt1063.eqiad.wmnet overheating
Closed, Resolved (Public)

Assigned To

Authored By

	Andrew
	Dec 14 2023, 12:40 AM

Description

We've lost cloudvirt1063 twice now to overheating. It probably needs some thermal paste.

Related Objects

Status	Assigned	Task
		Unknown Object (Task)
Resolved	Papaul	T342537 Q1:rack/setup/install cloudvirt10[62-67]
Resolved	fnegri	T353406 NodeDown cloudvirt1063
Resolved	• taavi	T352595 NodeDown
Resolved	Jclark-ctr	T353408 Cloudvirt1063.eqiad.wmnet overheating
Duplicate	None	T354491 NodeDown

Event Timeline

See T352595 for previous episode.

Maintenance_bot added a project: SRE. Dec 14 2023, 12:45 AM

Server is in warranty

Confirmed: Service Request 181697839 was successfully submitted.

• taavi edited projects, added cloud-services-team (Hardware), Cloud-VPS; removed cloud-services-team. Dec 14 2023, 8:52 PM

• taavi added a parent task: T352595: NodeDown.

• taavi moved this task from Backlog to Hardware faults on the cloud-services-team (Hardware) board.

Icinga downtime and Alertmanager silence (ID=0cee941c-9871-4463-b392-d45794163f4d) set by taavi@cumin1001 for 30 days, 0:00:00 on 1 host(s) and their services with reason: host is down, downtiming in icinga too

cloudvirt1063.eqiad.wmnet

fnegri mentioned this in T353406: NodeDown cloudvirt1063. Dec 21 2023, 4:53 PM

@Andrew Dell would like to replace the CPU and reapply thermal paste. They would like to perform the service today. Is the server still down?

In T353408#9423493, @Jclark-ctr wrote:

@Andrew Dell would like to replace the CPU and reapply thermal paste. They would like to perform the service today. Is the server still down?

Yes, the host is still down, feel free to perform any required maintenance.

@Jclark-ctr the host was restarted on Dec 22 at 18:29 UTC. Has the CPU been replaced?

@fnegri Yes, the CPU was replaced.

The CPU was replaced by Dell on Dec 22. The CPU self-test was run multiple times with no errors. The tech also swapped the CPU1 and CPU2 locations.

Jclark-ctr closed this task as Resolved. Jan 2 2024, 3:14 PM

Mentioned in SAL (#wikimedia-cloud-feed) [2024-01-02T16:18:40Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T353408)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-01-02T16:18:45Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=99) (T353408)

Andrew reopened this task as Open. Jan 7 2024, 7:27 PM

Andrew added a subtask: T354491: NodeDown.

This host just died again. I've evacuated all non-canary VMs, waiting for it to cool down and restart so I can look at logs.

Mentioned in SAL (#wikimedia-cloud) [2024-01-07T19:34:19Z] <andrewbogott> evacuating all VMs from cloudvirt1063. T353408

Mentioned in SAL (#wikimedia-cloud) [2024-01-07T19:34:29Z] <andrewbogott> removed cloudvirt1063 from 'ceph' aggregate, added to 'maintenance' aggregate T353408

Reopened the ticket with Dell.

Updated firmware per Dell's request, cleared logs, and sent a new TSR report. Waiting for a response.

@Andrew Before I change from PerformancePerWatt to PerformanceOptimized, do you have any hesitations about that change?

Thank you for the logs provided, and my sincere apologies for the delay in replying. I consulted with the L2 department. We observed DIMM errors for DIMMs B5/B6 (especially DIMM B6 initialization errors). At the same time, the system profile is set to PerformancePerWatt(OS), which may cause the CPU to throttle under high load. Please perform the steps below.

Boot into F2 -> System BIOS -> System Profile Settings -> System Profile and set it to PerformanceOptimized. Save changes and exit. This would prevent the CPU from throttling and prevent thermal trips.

Power drain the server and then clear NVRAM.

Power drain:
Shut down the system. Disconnect all external devices. Unplug the power cables. Ensure there is no power supply to the server. Press and hold the power button for 30 seconds, then release it. Wait 2 minutes for iDRAC to initialize. This is essential for complete reinitialization of the CPU sockets.
Clear NVRAM:
https://www.youtube.com/watch?v=QgAenCZu-o0

If the issue persists, please provide fresh TSR logs. This would help us send additional parts (2 DIMMs + R. Ctrl Panel assy) as needed with the FE, in addition to the motherboard that is already available onsite.
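
For reference, the System Profile change above may also be possible from the iDRAC command line rather than the F2 menu. The following is only a sketch, not something that was run on this host: it assumes the racadm attribute `BIOS.SysProfileSettings.SysProfile` and the values `PerfPerWattOptimizedOs` / `PerfOptimized` apply to this server generation, and the helper name `profile_change_commands` is made up for illustration. It builds the commands without executing them, so the current value can be checked first.

```python
import shlex

# Assumption: attribute name and values as used by racadm on recent PowerEdge
# servers; confirm with `racadm get BIOS.SysProfileSettings` before relying on it.
PROFILE_ATTR = "BIOS.SysProfileSettings.SysProfile"

# Maps the names used in this task to the assumed racadm values.
PROFILE_VALUES = {
    "PerformancePerWatt(OS)": "PerfPerWattOptimizedOs",
    "PerformanceOptimized": "PerfOptimized",
}


def profile_change_commands(target: str) -> list[list[str]]:
    """Return the racadm commands (as argv lists) to stage a profile change.

    Nothing is executed here; the caller decides whether to run them.
    """
    value = PROFILE_VALUES[target]  # KeyError for an unknown profile name
    return [
        ["racadm", "get", PROFILE_ATTR],             # show the current value
        ["racadm", "set", PROFILE_ATTR, value],      # stage the new value
        ["racadm", "jobqueue", "create", "BIOS.Setup.1-1"],  # apply on next reboot
    ]


if __name__ == "__main__":
    for cmd in profile_change_commands("PerformanceOptimized"):
        print(shlex.join(cmd))
```

The staged BIOS change only takes effect after the host reboots, which fits with the power drain Dell asks for in the next step.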

In T353408#9467046, @Jclark-ctr wrote:

@Andrew Before I change from PerformancePerWatt to PerformanceOptimized, do you have any hesitations about that change?

@Jclark-ctr, as far as I can tell, PerformanceOptimized is a better choice for a hypervisor, which this is. So changing that setting sounds right to me. It does entail changing more than one thing at a time, which may confuse your testing but I guess that's up to the Dell folks.

Andrew merged tasks: T354496: NodeDownForLong Node cloudvirt1063 has been down for long., T354491: NodeDown, T354497: NeutronAgentDownForLong A Neutron agent has been down for more than 2h, VMs will have connectivity issues. Jan 22 2024, 7:44 PM

Updated the system settings; the server is back up now.

Thanks! Let's let this sit without workload for a week or so and see if it stays up, then we can try giving it some work to do.

Jclark-ctr moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. Jan 29 2024, 7:14 PM

@Andrew Following up to see whether this has been put back into service.

It's back in service but only as of today.

Closing the ticket: 7 days with no faults.
