
Cloudvirt1063.eqiad.wmnet overheating
Closed, Resolved (Public)

Description

We've lost cloudvirt1063 twice now to overheating. Probably needs some thermal paste.

Event Timeline

Server is in warranty

Confirmed: Service Request 181697839 was successfully submitted.

Icinga downtime and Alertmanager silence (ID=0cee941c-9871-4463-b392-d45794163f4d) set by taavi@cumin1001 for 30 days, 0:00:00 on 1 host(s) and their services with reason: host is down, downtiming in icinga too

cloudvirt1063.eqiad.wmnet

@Andrew Dell would like to replace the CPU and reapply thermal paste. They would like to perform the service today. Is the server still down?

Yes, the host is still down, feel free to perform any required maintenance.

@Jclark-ctr the host was restarted on Dec 22 at 18:29 UTC. Has the CPU been replaced?

The CPU was replaced by Dell on Dec 22. Performed the CPU self-test multiple times with no errors. The tech also swapped the CPU1 and CPU2 locations.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-01-02T16:18:40Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T353408)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-01-02T16:18:45Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=99) (T353408)

This host just died again. I've evacuated all non-canary VMs, waiting for it to cool down and restart so I can look at logs.

Mentioned in SAL (#wikimedia-cloud) [2024-01-07T19:34:19Z] <andrewbogott> evacuating all VMs from cloudvirt1063. T353408

Mentioned in SAL (#wikimedia-cloud) [2024-01-07T19:34:29Z] <andrewbogott> removed cloudvirt1063 from 'ceph' aggregate, added to 'maintenance' aggregate T353408

Updated firmware per Dell's request, cleared logs, and sent a new TSR report. Waiting for response.

Thank you for the logs provided. My sincere apologies for the delay in replying to you. I had a consultation with the L2 dept. We observed that there were DIMM errors for DIMMs B5/B6 (esp. DIMM B6 initialization errors). At the same time, the system profile is set to PerformancePerWatt(OS), which may cause the CPU to throttle under high load conditions. Please perform the steps below.

  1. Boot into F2 -> System BIOS -> System Profile Settings -> System Profile -> set it to PerformanceOptimized. Save changes and exit. This would prevent the CPU from throttling and prevent thermal trips.
  2. Power drain the server and then clear NVRAM.

Power drain:
Shut down the system. Disconnect all external devices. Unplug the power cables. Ensure there is no power supply to the server. Press and hold down the power button for 30 secs and release it. Wait 2 mins for iDRAC to initialize. This is essential for complete reinitialization of the CPU sockets.
Clear NVRAM:
https://www.youtube.com/watch?v=QgAenCZu-o0

  3. If the issue persists, please provide fresh TSR logs. This would help us send additional parts (2 DIMMs + R. Ctrl Panel assy) as needed with the FE, in addition to the motherboard that is already available onsite.
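For reference, the System Profile change in the first step can presumably also be made from the iDRAC command line instead of the F2 menu. The sketch below only builds the `racadm` invocations and prints them; it does not run anything. The attribute name `BIOS.SysProfileSettings.SysProfile`, the value `PerfOptimized`, and the job target `BIOS.Setup.1-1` are assumptions that are not taken from this ticket, and `build_profile_commands` is a hypothetical helper. Check all of them against the host's iDRAC version before use.

```python
import shlex

# Assumed racadm names, not taken from this ticket: verify them on the
# target iDRAC (for example with "racadm get BIOS.SysProfileSettings").
PROFILE_ATTRIBUTE = "BIOS.SysProfileSettings.SysProfile"
PROFILE_VALUE = "PerfOptimized"
BIOS_JOB_TARGET = "BIOS.Setup.1-1"


def build_profile_commands(attribute=PROFILE_ATTRIBUTE, value=PROFILE_VALUE):
    """Return the racadm invocations as argument lists, without running them."""
    return [
        ["racadm", "get", attribute],                      # show the current profile
        ["racadm", "set", attribute, value],               # stage the new profile
        ["racadm", "jobqueue", "create", BIOS_JOB_TARGET], # queue the pending BIOS change
    ]


if __name__ == "__main__":
    # Print the commands for review rather than executing them.
    for argv in build_profile_commands():
        print(shlex.join(argv))
```

A staged BIOS change of this kind normally takes effect only after the queued job runs on the next reboot, so it would fit in just before the power drain in the second step.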

@Andrew before I change from PerformancePerWatt to PerformanceOptimized, do you have any hesitations with that change?

@Jclark-ctr, as far as I can tell, PerformanceOptimized is a better choice for a hypervisor, which this is. So changing that setting sounds right to me. It does entail changing more than one thing at a time, which may confuse your testing but I guess that's up to the Dell folks.

Updated system settings; server is back up now.

Thanks! Let's let this sit w/out workload for a week or so and see if it stays up, then we can try giving it some work to do.

@Andrew following up to see if this has been put back into service?

It's back in service but only as of today.

Closing ticket: 7 days with no faults.