We've lost cloudvirt1063 twice now to overheating. It probably needs some thermal paste.
Description
Status | Assigned | Task
---|---|---
| | Unknown Object (Task)
Resolved | Papaul | T342537 Q1:rack/setup/install cloudvirt10[62-67]
Resolved | fnegri | T353406 NodeDown cloudvirt1063
Resolved | taavi | T352595 NodeDown
Resolved | Jclark-ctr | T353408 Cloudvirt1063.eqiad.wmnet overheating
Duplicate | None | T354491 NodeDown
Event Timeline
Server is in warranty
Confirmed: Service Request 181697839 was successfully submitted.
Icinga downtime and Alertmanager silence (ID=0cee941c-9871-4463-b392-d45794163f4d) set by taavi@cumin1001 for 30 days, 0:00:00 on 1 host(s) and their services with reason: host is down, downtiming in icinga too
cloudvirt1063.eqiad.wmnet
@Andrew, Dell would like to replace the CPU and reapply thermal paste. They would like to perform the service today; is the server still down?
@Jclark-ctr the host was restarted on Dec 22 at 18:29 UTC. Has the CPU been replaced?
The CPU was replaced by Dell on Dec 22. I ran the CPU self-test multiple times with no errors. The tech also swapped the CPU1 and CPU2 locations.
Mentioned in SAL (#wikimedia-cloud-feed) [2024-01-02T16:18:40Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T353408)
Mentioned in SAL (#wikimedia-cloud-feed) [2024-01-02T16:18:45Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=99) (T353408)
This host just died again. I've evacuated all non-canary VMs, waiting for it to cool down and restart so I can look at logs.
Mentioned in SAL (#wikimedia-cloud) [2024-01-07T19:34:19Z] <andrewbogott> evacuating all VMs from cloudvirt1063. T353408
Mentioned in SAL (#wikimedia-cloud) [2024-01-07T19:34:29Z] <andrewbogott> removed cloudvirt1063 from 'ceph' aggregate, added to 'maintenance' aggregate T353408
Updated firmware per Dell's request, cleared the logs, and sent a new TSR report. Waiting for a response.
@Andrew, before I change the system profile from PerformancePerWatt to PerformanceOptimized, do you have any hesitations about that change?
Thank you for logs provided. My sincere apologies for delay in replying to you. I had a consultation with L2 dept. We observed that there were DIMM errors for DIMMs B5/B6 (esp. DIMM B6 initialization errors). At same time, system profile is set to PerformancePerWatt(OS) that may cause CPU to throttle at high load conditions. Please perform below steps.
- Boot into F2 -> System BIOS -> System Profile Settings -> System Profile -> Set it to PerformanceOptimized. Save changes and exit. This would prevent the CPU from throttling and prevent thermal trips.
- Power drain server and then clear NVRAM.
Power drain:
Shut down system. Disconnect all external devices. Unplug power cables. Ensure there is no power supply to server. Press and hold down power button for 30 secs and release it. Wait for 2 mins for iDRAC to initialize. This is essential for complete reinitialization of CPU Sockets.
Clear NVRAM:
https://www.youtube.com/watch?v=QgAenCZu-o0
- If issue persists, please provide fresh TSR logs. This would help us send additional parts (2 DIMMs+R. Ctrl Panel assy) as needed with FE in addition to motherboard that is already available onsite.
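To confirm the system profile before and after the BIOS change without a trip through F2, the setting could in principle be read over the management controller's Redfish interface. The following is only a sketch under assumptions: the resource path `/redfish/v1/Systems/System.Embedded.1/Bios`, the `SysProfile` attribute key, and the `PerfOptimized` value follow common iDRAC naming but are not confirmed anywhere in this task, and `idrac_host`, `user`, and `password` are hypothetical parameters.

```python
import base64
import json
import ssl
import urllib.request

# Assumed names (not confirmed by this task): Redfish path, attribute key,
# and the value that corresponds to "PerformanceOptimized".
BIOS_PATH = "/redfish/v1/Systems/System.Embedded.1/Bios"
PROFILE_KEY = "SysProfile"
TARGET_PROFILE = "PerfOptimized"


def profile_needs_change(attributes, target=TARGET_PROFILE):
    """Return True when the reported system profile differs from the target."""
    return attributes.get(PROFILE_KEY) != target


def fetch_bios_attributes(idrac_host, user, password):
    """Fetch the BIOS attribute dictionary from the management controller."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{idrac_host}{BIOS_PATH}",
        headers={"Authorization": f"Basic {token}"},
    )
    # Management controllers often present self-signed certificates;
    # verification is disabled here only to keep the sketch short.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=30) as resp:
        return json.load(resp)["Attributes"]
```

Calling `profile_needs_change(fetch_bios_attributes(...))` before and after the change would show whether the new profile actually took effect. The power drain and NVRAM clear remain manual steps.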
@Jclark-ctr, as far as I can tell, PerformanceOptimized is a better choice for a hypervisor, which this is. So changing that setting sounds right to me. It does entail changing more than one thing at a time, which may confuse your testing but I guess that's up to the Dell folks.
Thanks! Let's let this sit without workload for a week or so and see if it stays up; then we can try giving it some work to do.
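During a soak period like this, temperatures could be sampled from the host itself. A minimal sketch, assuming the kernel exposes thermal zones under `/sys/class/thermal` (values there are in millidegrees Celsius); whether this particular host exposes any zones is not confirmed here, and the 85 °C limit is a hypothetical threshold, not a Dell figure.

```python
from pathlib import Path


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    temps = {}
    for temp_file in sorted(Path(base).glob("thermal_zone*/temp")):
        try:
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            # Skip zones that cannot be read or return non-numeric data.
            continue
        temps[temp_file.parent.name] = millidegrees / 1000.0
    return temps


def zones_over(temps, limit_c=85.0):
    """Return the sorted names of zones at or above limit_c (hypothetical limit)."""
    return sorted(name for name, value in temps.items() if value >= limit_c)
```

Running `zones_over(read_zone_temps())` from a periodic job and logging the result would leave a temperature history to compare against the next TSR report if the host trips again.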