
CPU temperature issues in cp hosts
Closed, Resolved · Public

Assigned To
Authored By
Vgutierrez
Sep 4 2024, 11:28 AM
Referenced Files
F62284645: Screenshot at 2025-06-10 10-15-00.png
Jun 10 2025, 5:15 PM
F58391798: Screenshot at 2025-02-12 08-06-02.png
Feb 12 2025, 4:13 PM
F58261843: Screenshot at 2025-01-23 15-41-03.png
Jan 23 2025, 11:48 PM
F57555764: 3.png
Sep 26 2024, 1:18 AM
F57555760: 2.png
Sep 26 2024, 1:18 AM
F57555757: 1.png
Sep 26 2024, 1:18 AM

Description

We've got cp servers in esams && magru with temperature issues:

vgutierrez@cumin1002:~$ sudo -i cumin 'A:cp' 'zgrep "Core temperature is above threshold, cpu clock is throttled" /var/log/kern.lo* > /dev/null && echo "CPU is getting throttled"  || echo "CPU is OK"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====                                                                                                                                                           
(6) cp[3071-3072].esams.wmnet,cp[7009,7011,7015-7016].magru.wmnet                                                                                                                
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----                                                                                                                      
CPU is getting throttled                                                                                                                                                         
===== NODE GROUP =====                                                                                                                                                           
(106) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3070,3073-3081].esams.wmnet,cp[7001-7008,7010,7012-7014].magru.wmnet,cp[4037-4052].ulsfo.wmnet                                                                                                                                              
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----                                                                                                                      
CPU is OK                                                                                                                                                                        
================

Impacted hosts:

  • cp3071
  • cp3072
  • cp7009
  • cp7011
  • cp7015
  • cp7016

Note that we already had an SSD crash on cp7015 (T371554).

Details

Event Timeline

@RobH / @wiki_willy could we get this task prioritized on your side?

I'm now looking into these. So just these specific servers report heat issues, even though they are weighted the same as the other cp hosts within the same fleet?

In the past, heat issues that are sporadic within a fleet typically tend to be caused by improper installation of thermal paste or its degradation over time. I'm going to split the esams items to their own sub-task and open a support case for them. If the issue is fixed by new thermal paste in ESAMS, we'll do the same for MAGRU.

RobH changed the task status from Open to Stalled. Sep 17 2024, 5:20 PM

Stalling parent task while working on fixing the esams hosts (esams is easier to get parts in and out of than magru, so it makes a better testbed for the repair).

@Vgutierrez: willy mentioned to me in our 1:1 that Traffic thought this may be something other than a thermal paste issue and that I should expect an update on this task with details.

As we're planning to move ahead with the thermal paste swap on the two esams hosts next week, should we do something else instead? Please advise.

Apologies for the long text that follows, but the TL;DR is that we think the issues in magru are not confined to just the CPUs on the affected hosts but extend to the servers themselves, and thus possibly the entire rack, given the number of affected hosts.

NVMe temperatures

This task and T374986 document the CPU throttling due to the increased temperatures, but while digging into this, we observed that the temperature reported by the NVMe drives is also higher in magru than in a comparable site, ulsfo, even though magru on average gets almost half the traffic ulsfo does:

1.png (822×1 px, 125 KB)

https://grafana.wikimedia.org/goto/BdVLzEgNR?orgId=1

2.png (695×1 px, 87 KB)

https://grafana.wikimedia.org/goto/wajjfUgHg?orgId=1 (comparison between magru, ulsfo, esams)

It seems like in magru, even though we have not hit the warning or critical temperatures for the NVMes (confirmed via a cumin query), we are quite close in some cases, looking at the temperature peaks above:

$ sudo nvme id-ctrl /dev/nvme0n1 # random host in magru, to show the crit/warn temperatures
wctemp    : 343 (69.85 °C)
cctemp    : 350 (76.85 °C)
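
For reference, the drive's current composite temperature can be compared against those thresholds via the SMART log; a one-host sketch (the fleet-wide confirmation mentioned above was done with a cumin query):

$ sudo nvme smart-log /dev/nvme0n1 | grep -i "^temperature"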

Given that magru is so far serving traffic for just three countries in South America and is not even close to its ideal peak capacity, this trend is worrying. More importantly, it contradicts the assumption that the temperature issues are confined to the CPUs, and suggests a problem with the rack(s).

Also affects LVS and DNS hosts in magru

To check that this is not an issue with just the cp hosts, here is a comparison for the DNS boxes (two each in magru and ulsfo):

3.png (852×1 px, 120 KB)

https://grafana.wikimedia.org/goto/iLZVGEgNR?orgId=1

While dns700[12] average ~100 req/sec more than dns400[34], this still does not explain the ~25 °C difference between the hosts at the two sites, so load is unlikely to be a factor here. If we include esams in the comparison, we see that even though esams averages ~1k req/sec on each box, the temperatures in magru are still higher.

Similarly, it is also affecting the LVS hosts lvs700[12]. Note that lvs7003 is not affected, but it is also the backup host and not serving any traffic. Even then, it runs ~30 °C hotter than the comparable lvs4010 (the backup in ulsfo).

$ sudo cumin 'A:lvs-magru' 'dmesg -T | grep -i "core temperature is above"'
3 hosts will be targeted:
lvs[7001-7003].magru.wmnet
OK to proceed on 3 hosts? Enter the number of affected hosts to confirm or "q" to quit: 3
===== NODE GROUP =====
(1) lvs7001.magru.wmnet
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----
[Tue Jun 11 20:48:21 2024] mce: CPU72: Core temperature is above threshold, cpu clock is throttled (total events = 52)
[Wed Jun 26 16:18:21 2024] mce: CPU14: Core temperature is above threshold, cpu clock is throttled (total events = 738)
[Mon Jul 29 08:47:48 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 1526)
[Mon Jul 29 08:47:48 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 1522)
[Fri Aug 23 07:17:56 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 6321)
[Fri Aug 23 07:17:56 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 6317)
===== NODE GROUP =====
(1) lvs7002.magru.wmnet
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----
[Wed Jun  5 05:01:11 2024] mce: CPU26: Core temperature is above threshold, cpu clock is throttled (total events = 9)

Timeline (and the non-relation to load)

The timeline of setting up magru is as follows:

  1. We finished provisioning and bringing the servers "live" (not serving production traffic) by May 2 2024. SAL
  2. On May 2 2024, we turned on the measure-magru.wikimedia.org domain that points to upload-lb.wikimedia.org. SAL

https://grafana.wikimedia.org/goto/79wLSygHR?orgId=1

We were averaging ~30 rps to the cp servers at this point.

  3. Even then, we were already hitting temperatures in excess of 90 °C a week later, without the site serving any meaningful production traffic.

https://grafana.wikimedia.org/goto/gGW5IsgNR?orgId=1

This again confirms that the issue is not related to load, since we were reaching Tjunction temperatures (and exceeding Tcase) without any real usage on the CPUs.

Ruling out BIOS issues

We started by ruling out BIOS misconfiguration: we confirmed via Redfish that SysProfile is correctly set to PerfPerWattOptimizedOs on all hosts in magru, checking the attribute with:

("get", "/redfish/v1/Systems/System.Embedded.1/Bios").json()['Attributes']['SysProfile']

Since the provisioning cookbook was used, it is unlikely that any other settings were missed, but to account for changes in firmware/iDRAC/Redfish we also verified that EnergyPerformanceBias is set to BalancedPerformance and ProcPwrPerf to OsDbpm on all cp hosts.
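
For reference, all three attributes can be read in one shot with the same SSH tunnel + curl pattern used later in this task (the host is illustrative):

$ ssh -f -N -L 8443:cp7001.mgmt.magru.wmnet:443 cumin2002.codfw.wmnet
$ curl -s -k -u "$REDU:$REDP" https://localhost:8443/redfish/v1/Systems/System.Embedded.1/Bios | jq '.Attributes | {SysProfile, EnergyPerformanceBias, ProcPwrPerf}'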

Summary

Based on the above, here is a summary of the current observations:

  1. The temperature issues affect not only the CPUs but also the NVMes, so thermal paste on the CPUs is unlikely to be the issue.
  2. The issue is not limited to the cp hosts in magru; it also extends to the DNS hosts, the Ganeti cluster, and the LVSes, where there is a ~30 °C difference.
  3. Load is unlikely to be a factor, given that magru serves the least traffic compared to the other sites, for all clusters: cp, DNS, LVS, Ganeti.
  4. The BIOS settings/provisioning also seem to be an unlikely culprit.
  5. The issue may become more pronounced if we shift more countries to magru.

Thanks for providing all the details on this, @ssingh. @RobH - as we chatted about earlier today, we could ask Ascenty to double-check that there are enough perf tiles in the cold aisle, confirm that the blanking panels are in place (and if not, add them), and possibly get a temperature and humidity reading in that area. Thanks, Willy

Draft body of support request for magru temp investigation: https://docs.google.com/document/d/1T-XwSS_Rwfb9nfC1aHQW4AjptLjxiviyZfGFdFcowZY/edit?usp=sharing

Copied below for ease of reference, but any suggested edits should take place on the google doc:

Support,

We're seeing higher temperature levels than expected from our servers in our racks at your facility, compared to our other sites. When we check the servers' intake temperatures, we're seeing a large divergence between hosts within the same rack, anywhere from 19 °C to 25 °C.

We would like to ask for a temperature investigation on our two racks to check for the following items:

  • Ensure blanking panels are installed on the following U spaces. If no panels are installed, does Ascenty provide them? If so, please install them on:
    • B3: U: 1,15-33, 35-36, 38-42, 44-46. Please ensure no blanking panels on U34, 37, 43.
    • B4: U: 1, 14-33, 35-36, 38-39, 41-42, 44-46. Please ensure no blanking panels on U34, 37, 40, 43.
  • Please take temperature measurements after the blanking panel installation and adjust perforated floor tiles as needed to ensure all points in the rack (bottom, middle, top) are receiving the same level of cooling.

Once the above is complete, please report whether panels were installed (if possible, snap some photos) and whether floor tiles had to be adjusted or the temps were already consistent across the rack.

I'm going to keep the draft document open and simplify its English over the next couple of hours before I submit it into the Ascenty portal.

Opened ticket CS1011077 for the above updated google doc draft.

Hi @RobH: Any follow-up from Ascenty on when they plan on installing the blanking panels? Thanks!

> Hi @RobH: Any follow-up from Ascenty on when they plan on installing the blanking panels? Thanks!

The panels were installed successfully at the end of last week; we should see better temps out of magru now.

Additionally, the two esams hosts had their CPU thermal paste reapplied about 7 hours ago, so they should stop throttling due to temp issues.

Unfortunately, it appears that we're still having throttling issues in magru:

brett@cumin2002:~$ sudo -i cumin 'A:cp' 'zgrep "Core temperature is above threshold, cpu clock is throttled" /var/log/kern.lo* > /dev/null && echo "CPU is getting throttled"  || echo "CPU is OK"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====
(7) cp1109.eqiad.wmnet,cp[7002,7005,7009,7011,7013,7016].magru.wmnet
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----
CPU is getting throttled
===== NODE GROUP =====
(105) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1108,1110-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001,7003-7004,7006-7008,7010,7012,7014-7015].magru.wmnet,cp[4037-4052].ulsfo.wmnet
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----
CPU is OK

max by(instance) (ipmi_temperature_celsius{instance=~"^cp7.*"}) over the last 24 hours yields:

Instance  Temperature (Celsius)
cp7001    85
cp7002    90
cp7003    88
cp7004    91
cp7005    91
cp7006    87
cp7007    91
cp7008    88
cp7009    90
cp7010    86
cp7011    89
cp7012    91
cp7013    90
cp7014    89
cp7015    81
cp7016    91
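
The same sensors feeding that metric can be spot-checked on a single host with ipmitool (a sketch; sensor names vary by platform):

$ sudo ipmitool sdr type Temperature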

Some observations:

Has the BIOS version disparity been tested?

That stinks! I'll have to open a task and ask them to ensure blanking panels have been installed and report our temp issues so they can investigate on their end now that our reshuffle is done.

RobH added a subtask: Restricted Task. Dec 11 2024, 6:56 PM
BCornwall changed the task status from Stalled to In Progress. Dec 12 2024, 7:11 PM
BCornwall triaged this task as High priority.
RobH changed the status of subtask Restricted Task from Open to In Progress. Jan 22 2025, 5:53 PM

@Vgutierrez,

In T382026#10489496, @RobH wrote:

> Good afternoon Dear
> The infrastructure team installed a Blanking Panel to improve the equipment's cooling effectiveness.
> follow photos

{F58259455}

So it is 70 °F at the front of the cabinets, and a panel (or panels; unclear due to Google Translate) was installed. Time to check temp readings now.

Hi, @RobH, thanks for doing this!

At first glance, nothing's improved. The inlet temps are acceptable at ~20 °C, yet the CPUs are still hitting ~90 °C. Outlet temps are ~60 °C.

Fans are averaging ~20% utilization and CPU utilization is at 5%, yet the CPUs are melting themselves. Did the thermal paste get verified?

Check out these fan speeds/CPU temps:

Screenshot at 2025-01-23 15-41-03.png (955×2 px, 379 KB)

(Grafana link)

The first dip on all the hosts was unrelated to anything I did - not sure what happened there, but the temperatures became more tolerable with just a nominal increase in fan speed. The second dip was me setting the Thermal Profile Optimization to "Maximum Performance (Performance Optimized)" and then, shortly afterwards, a fan speed offset of "Medium". They're now running at 60% and the CPU is down to 48 °C. Outlet temps are down to 33 °C.

You know better than I do, but I'd hazard that wearing out the fans would be better than wearing out the silicon. I'd also guess that reducing the outlet temps would help the cabinet as a whole. What do you think of forcing the fans to run harder magru-wide (and maybe esams, since it's got a similar problem)?

[Edit] To be clear, this isn't an ideal solution - why the CPUs are basically idling at 90 °C under minimal workload, and were at 66 °C back when they weren't serving anything, is still very concerning - but it's a step toward fixing the issue in the immediate term.

> <cut for brevity>
> The first dip on all the hosts was unrelated to anything I did - not sure what happened there, but the temperatures became more tolerable with just a nominal increase in fan speed. The second dip was me setting the Thermal Profile Optimization to "Maximum Performance (Performance Optimized)" and then, shortly afterwards, a fan speed offset of "Medium". They're now running at 60% and the CPU is down to 48 °C. Outlet temps are down to 33 °C.
>
> You know better than I do, but I'd hazard that wearing out the fans would be better than wearing out the silicon. I'd also guess that reducing the outlet temps would help the cabinet as a whole. What do you think of forcing the fans to run harder magru-wide (and maybe esams, since it's got a similar problem)?
>
> [Edit] To be clear, this isn't an ideal solution - why the CPUs are basically idling at 90 °C under minimal workload, and were at 66 °C back when they weren't serving anything, is still very concerning - but it's a step toward fixing the issue in the immediate term.

I think that making those temp/fan changes in magru first would be ideal; let's see if it fixes things fleet-wide. We did have thermal paste reapplied on two hosts out in esams for temp issues and saw no real change on them. I don't recall which specific hosts they were; I just recall it didn't help things.

So +1 to magru fleet-wide BIOS performance/fan adjustments to bring temps back in line where they should be. We know it's not the infeed temps (since we saw from the photo it's feeding into the rack at 19.6 °C / 67 °F), so this seems like a logical next step to me.

We're going to want to manually test this in magru, and if it fixes the issue, we'll need to create a task to update the BIOS provisioning script: keep the normal defaults (what it does now) and add a flag to set these BIOS settings accordingly.

So the changes were Thermal Profile Optimization to "Maximum Performance (Performance Optimized)" and the fan speed offset from low to medium? (Where exactly is the fan speed setting, and did you change it in the web UI or on the command line?) Then, if you like, either you can change the magru hosts or I can; we should just track the time of the change across the fleet and track the results.

Mentioned in SAL (#wikimedia-operations) [2025-01-24T21:47:07Z] <brett> Testing thermal settings on cp7004 (T373993)

Mentioned in SAL (#wikimedia-operations) [2025-01-24T21:51:30Z] <brett@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7004.magru.wmnet with reason: Thermal settings testing (T373993)

I did some more testing:

(Rounded/eyeballed averages)

Profile              Offset  Fan RPM  CPU Temp (Celsius)
Default              None    4k       80
Maximum Performance  None    4k       80
Default              Low     7k       50
Maximum Performance  Low     7k       50
Default              Medium  10k      40
Maximum Performance  Medium  10k      40

It looks like setting "Maximum Performance" alone, without any offset, doesn't change anything for low-usage temperatures; its value is in ramping up the fans a little more aggressively under heavy load. When running stress tests (stress --cpu $(nproc)) I noticed a few degrees of difference versus running under the default mode.
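
For the record, a sketch of that kind of test loop, pairing a fixed-length stress run with IPMI polling of the temperature and fan sensors (the exact monitoring commands used here aren't recorded, so this is illustrative):

$ sudo stress --cpu "$(nproc)" --timeout 600 &
$ watch -n 10 'sudo ipmitool sdr type Temperature; sudo ipmitool sdr type Fan'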

The iDRAC docs state:

Default Thermal Profile Settings (Minimum Power) — Implies that the thermal algorithm uses the same system profile settings that is defined under System BIOS > System BIOS Settings > System Profile Settings page.

Maximum Performance (Performance Optimized):

  • Reduced probability of memory or CPU throttling.
  • Increased probability of turbo mode activation.
  • Generally, higher fan speeds at idle and stress loads.

The setting is "Performance Per Watt (OS)", which leads me to believe that perhaps it's already set to "Performance" and that the few degrees of difference I noticed were happenstance.

Running the stress tests with offset "Medium" resulted in a peak of 65 °C; "Low" resulted in 75 °C. Based on this, these are my two choices when factoring in idle/max workloads:

Offset  Fan RPM range  CPU Temp range (°C)
Low     7k-9k          50-75
Medium  10k-12k        40-65

As the Xeon Gold 5318Y's Tcase is 87 °C, setting the fan offset to Low and the profile to "Maximum Performance" (just for good measure) would be my vote. For now, I've set all magru hosts to this profile, starting at 2025-01-24 23:45.

(These settings are found under Configuration → System Settings → Hardware Settings in iDRAC, and with racadm set system.thermalsettings.FanSpeedOffset X, where X is 0 (low), 1 (medium), 2 (high), 3 (max), or 255 (none).)
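
Applying the same offset across all 16 magru cp hosts could be scripted with remote racadm, assuming remote racadm is available and iDRAC credentials are at hand (a sketch of the idea, not necessarily how the change was actually rolled out):

for n in $(seq -w 1 16); do
  # 0 = low (see the mapping above); the credential variables are placeholders
  racadm -r "cp70${n}.mgmt.magru.wmnet" -u "$RACADM_USER" -p "$RACADM_PASS" \
    set system.thermalsettings.FanSpeedOffset 0
done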

RobH closed subtask Restricted Task as Resolved. Jan 30 2025, 3:12 PM

@RobH Looks like the offset change has made a good difference

Screenshot at 2025-02-12 08-06-02.png (628×1 px, 221 KB)

How would you feel about applying this to esams as well and then codifying this?

> @RobH Looks like the offset change has made a good difference
>
> Screenshot at 2025-02-12 08-06-02.png (628×1 px, 221 KB)
>
> How would you feel about applying this to esams as well and then codifying this?

Can we tell if there was any performance change in the host itself from these changes? I wouldn't think this would do anything except increase host performance and perhaps (but not likely) decrease fan lifespan, so I'm all for the change. I just don't want us to push it and suddenly have negative user impact from the cp hosts slowing down. Thank you for making this change and checking its result!

We will also need the provision cookbook updated with a new thermal profile setting flag, so these are set automatically via that cookbook.

> The setting is "Performance Per Watt (OS)", which leads me to believe that perhaps it's already set to "Performance" and that the few degrees of difference I noticed were happenstance.
> (These settings are found under Configuration → System Settings → Hardware Settings in iDRAC, and with racadm set system.thermalsettings.FanSpeedOffset X, where X is 0 (low), 1 (medium), 2 (high), 3 (max), or 255 (none).)

Change #1121086 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/cookbooks@master] provision: Adjust thermal profile for F4

https://gerrit.wikimedia.org/r/1121086

I don't see any change in performance - the throttling notifications only come sparingly, so I doubt we'd see much of a difference until resources become more strained.

> We will also need the provision cookbook updated with a new thermal profile setting flag, so these are set automatically via that cookbook.

I'm submitting a patch but would love your scrutiny to ensure I'm using the Redfish API properly. I discovered ThermalSettings.1.ThermalProfile and ThermalSettings.1.FanSpeedOffset via:

$ ssh -f -N -L 8443:cp7004.mgmt.magru.wmnet:443 cumin2002.codfw.wmnet
$ curl -s -k -u "$REDU:$REDP" https://localhost:8443/redfish/v1/Managers/System.Embedded.1/Attributes | jq '.Attributes | {"ThermalSettings.1.ThermalProfile","ThermalSettings.1.FanSpeedOffset"}'
{
  "ThermalSettings.1.ThermalProfile": "Maximum Performance",
  "ThermalSettings.1.FanSpeedOffset": "Low"
}
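
The write side of the same attributes should be reachable with a raw Redfish PATCH against the same endpoint over the tunnel (a sketch only, not what the cookbook patch does verbatim):

$ curl -s -k -u "$REDU:$REDP" -X PATCH -H 'Content-Type: application/json' \
    -d '{"Attributes": {"ThermalSettings.1.ThermalProfile": "Maximum Performance", "ThermalSettings.1.FanSpeedOffset": "Low"}}' \
    https://localhost:8443/redfish/v1/Managers/System.Embedded.1/Attributes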

Hopefully the accompanying patch appropriately handles setting these values. Is there an easy way to test this?

@BCornwall the easiest way is probably to use test-cookbook on a cumin host, using a depooled magru cp node as target. Once we are sure that the settings do what we want, we can definitely think about making them permanent (possibly for a certain class of hosts only). Lemme know your thoughts :)

Mentioned in SAL (#wikimedia-operations) [2025-06-04T18:37:45Z] <brett> depooling cp7001 for CPU stress testing and temperature effects (T373993)

Fun little tidbit: our power consumption dropped after increasing the fan speeds in magru.

Screenshot at 2025-06-10 10-15-00.png (862×1 px, 166 KB)

Re-assigning to @RobH: Rob, can you check the hot aisle in magru for us?

> Re-assigning to @RobH: Rob, can you check the hot aisle in magru for us?

I can, but can you advise why it would be useful? We've had them read the input temps on the cold aisle for us, which is what the intake on the hosts pulls from. The hot aisle isn't going to affect the input temps on our hosts, and we'll be paying a few hundred dollars for the tech to go use a thermometer directly behind our racks.

> Re-assigning to @RobH: Rob, can you check the hot aisle in magru for us?
>
> I can, but can you advise why it would be useful? We've had them read the input temps on the cold aisle for us, which is what the intake on the hosts pulls from. The hot aisle isn't going to affect the input temps on our hosts, and we'll be paying a few hundred dollars for the tech to go use a thermometer directly behind our racks.

Hi Rob: thanks for checking! We discussed with @wiki_willy and he suggested that we get the hot aisle temperature as well, to try to figure out the reason for the high temperatures in magru, where the cold aisle is running cooler than esams but the CPUs are still running hot, with the same hardware and significantly less usage than esams.

Ok, if Willy asked then I can put it in the ticket, no worries. I was only asking so I could include it in the reasoning for the ticket!

CS1147925

Support,

We've been investigating an ongoing temperature issue in our servers, and would like to have the following done at our two racks in SP3:

  • Check the cold aisle temperature readings for racks B03 and B04 and report them to this ticket.
  • Check hot aisle temperature readings for racks B03 and B04 and report them to this ticket.
  • Photograph back of each rack.
  • Ensure proper airflow out of the back of the racks, and confirm it is not impacting server temperatures.

Thanks in advance,

Please note I've saved the photos of the audit to the DC Ops google drive folder, under "2025 magru temp audit".

Cold intake temps range from 18.7 °C to 20.1 °C, with the hot side ranging from 22.3 °C to 27.7 °C (directly behind the servers at the lowest point in the rack).

Since the air at the middle and top of the hot side is actually at a lower temp than directly behind the hosts at the bottom of the rack, this demonstrates we have airflow in the hot aisle (as it's clearing out the hot air directly behind the hosts).

With the temps coming back as acceptable for intake and hot aisle output, and the decision to run the fans faster on the hosts, is there anything else pending on this task?

After discussion within both Traffic and DC Ops we're going to resolve this with the fans just running faster.

Change #1121086 abandoned by BCornwall:

[operations/cookbooks@master] provision: Adjust thermal profile for F4

Reason:

We've decided against codifying this.

https://gerrit.wikimedia.org/r/1121086