
restbase1007.eqiad.wmnet CPU temperature?
Closed, Resolved, Public

Description

I'm seeing lots of the following in dmesg on restbase1007.eqiad.wmnet.

...

[2066012.756766] CPU18: Package temperature above threshold, cpu clock throttled (total events = 270381699)
[2066012.767351] CPU18: Package temperature/speed normal
[2066012.778795] CPU16: Package temperature above threshold, cpu clock throttled (total events = 270381574)
[2066012.797803] CPU20: Package temperature above threshold, cpu clock throttled (total events = 270381871)
[2066012.808390] CPU20: Package temperature/speed normal
[2066012.912956] CPU28: Package temperature above threshold, cpu clock throttled (total events = 270381644)
[2066012.923541] CPU28: Package temperature/speed normal
[2066012.945971] CPU14: Package temperature/speed normal
[2066012.985958] CPU26: Package temperature/speed normal
[2066013.128114] CPU30: Package temperature/speed normal
[2066013.178156] CPU24: Package temperature above threshold, cpu clock throttled (total events = 270381657)
[2066013.254169] CPU6: Package temperature/speed normal
[2066013.434292] CPU22: Package temperature above threshold, cpu clock throttled (total events = 270381849)
[2066013.446308] CPU8: Package temperature/speed normal
[2066013.446316] CPU0: Package temperature/speed normal
[2066013.478369] CPU12: Package temperature/speed normal
[2066013.519421] CPU2: Package temperature above threshold, cpu clock throttled (total events = 270379760)
[2066013.586441] CPU4: Package temperature above threshold, cpu clock throttled (total events = 270380443)
[2066013.600407] CPU10: Package temperature/speed normal
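To see which CPUs are affected and how high the counters have climbed, the messages can be tallied per CPU. The following is a minimal sketch, assuming the exact message format shown in the excerpt above; the function name `summarize_throttling` and the `SAMPLE` data (three lines copied from the excerpt) are illustrative only.

```python
import re
from collections import Counter

# Matches the "above threshold" kernel message format seen in the excerpt.
THROTTLE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+CPU(?P<cpu>\d+): "
    r"Package temperature above threshold, cpu clock throttled "
    r"\(total events = (?P<total>\d+)\)"
)


def summarize_throttling(lines):
    """Return (throttle messages per CPU, highest 'total events' seen per CPU)."""
    messages = Counter()
    totals = {}
    for line in lines:
        m = THROTTLE_RE.search(line)
        if m is None:
            continue  # "temperature/speed normal" and unrelated lines are skipped
        cpu = int(m.group("cpu"))
        messages[cpu] += 1
        totals[cpu] = max(totals.get(cpu, 0), int(m.group("total")))
    return messages, totals


# Three lines taken from the dmesg excerpt above.
SAMPLE = """\
[2066012.756766] CPU18: Package temperature above threshold, cpu clock throttled (total events = 270381699)
[2066012.767351] CPU18: Package temperature/speed normal
[2066013.519421] CPU2: Package temperature above threshold, cpu clock throttled (total events = 270379760)
"""

if __name__ == "__main__":
    messages, totals = summarize_throttling(SAMPLE.splitlines())
    for cpu in sorted(totals):
        print(f"CPU{cpu}: {messages[cpu]} message(s), {totals[cpu]} total events")
```

On a live host the same function could be fed the output of `dmesg` instead of `SAMPLE`.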

Event Timeline

This is one of the three boxes (restbase1007-1009) where a second CPU was installed later.

I will need to power this server off and re-apply thermal paste. Let me know a good time to do this; approximate downtime is 10 minutes.

Any time is as good as another, @Cmjohnson. As far as I'm concerned, you can take it down at your earliest convenience.

Re-applied thermal paste. Let's wait the weekend before closing the task.

See also T130930 (restbase1007 not assembling raid after reboot) for the host's failure to come back up after the reboot.

FWIW, I don't see any more temperature alerts in dmesg; we can probably safely close this task.
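Besides watching dmesg, the kernel's throttle counters can be compared before and after a waiting period. This is a rough sketch, assuming a Linux host that exposes `thermal_throttle/package_throttle_count` under `/sys/devices/system/cpu/cpuN/` (typical for Intel CPUs, but not confirmed for this machine); both function names are hypothetical.

```python
from pathlib import Path


def read_package_throttle_counts(root="/sys/devices/system/cpu"):
    """Map CPU number -> package throttle count, for CPUs exposing the counter.

    Assumption: counters live at <root>/cpuN/thermal_throttle/package_throttle_count.
    Returns an empty dict if no such files exist.
    """
    counts = {}
    pattern = "cpu[0-9]*/thermal_throttle/package_throttle_count"
    for path in Path(root).glob(pattern):
        cpu = int(path.parts[-3][3:])  # "cpu18" -> 18
        counts[cpu] = int(path.read_text().strip())
    return counts


def cpus_with_new_events(before, after):
    """Return the sorted CPU numbers whose counter grew between two snapshots."""
    return sorted(cpu for cpu, n in after.items() if n > before.get(cpu, 0))


if __name__ == "__main__":
    snapshot = read_package_throttle_counts()
    print(f"{len(snapshot)} CPU(s) expose a package throttle counter")
```

Taking one snapshot right after the repair and another after the weekend, an empty result from `cpus_with_new_events` would suggest that no further throttling occurred.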

Cmjohnson claimed this task.

Outstanding! Resolving this task; please re-open if it happens again.