scb1003, scb1004 exhibit temperature problems
Closed, Resolved (Public)

Description

We've noticed the following on scb1003 and scb1004:

[Wed Nov 16 17:57:06 2016] CPU31: Package temperature/speed normal
[Wed Nov 16 17:57:07 2016] CPU13: Package temperature above threshold, cpu clock throttled (total events = 472512251)
[Wed Nov 16 17:57:07 2016] CPU1: Package temperature/speed normal
[Wed Nov 16 17:57:07 2016] CPU15: Package temperature above threshold, cpu clock throttled (total events = 472511448)
[Wed Nov 16 17:57:07 2016] CPU15: Package temperature/speed normal
[Wed Nov 16 17:57:22 2016] CPU0: Core temperature above threshold, cpu clock throttled (total events = 53988734)
[Wed Nov 16 17:57:22 2016] CPU16: Core temperature above threshold, cpu clock throttled (total events = 53989212)

Either the CPUs are indeed overheating for some reason, or a firmware bug is causing these messages. I suppose we could try applying thermal paste on the CPUs.
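To get a quick per-CPU picture from log lines like the ones above, something along these lines might help. It is only a rough sketch: the function name `summarize_throttling` is made up for illustration, and the regex assumes the kernel messages keep exactly the wording shown in the excerpt.

```python
import re

# Matches the "above threshold" messages from the excerpt above.
# Assumption: the message wording is stable across the affected hosts.
THROTTLE_RE = re.compile(
    r"CPU(?P<cpu>\d+): (?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)


def summarize_throttling(lines):
    """Return {(cpu, scope): highest 'total events' value seen}."""
    summary = {}
    for line in lines:
        m = THROTTLE_RE.search(line)
        if m is None:
            continue  # e.g. "temperature/speed normal" lines
        key = (int(m["cpu"]), m["scope"].lower())
        total = int(m["total"])
        summary[key] = max(total, summary.get(key, 0))
    return summary


SAMPLE = [
    "[Wed Nov 16 17:57:06 2016] CPU31: Package temperature/speed normal",
    "[Wed Nov 16 17:57:07 2016] CPU13: Package temperature above threshold, cpu clock throttled (total events = 472512251)",
    "[Wed Nov 16 17:57:07 2016] CPU15: Package temperature above threshold, cpu clock throttled (total events = 472511448)",
    "[Wed Nov 16 17:57:22 2016] CPU0: Core temperature above threshold, cpu clock throttled (total events = 53988734)",
    "[Wed Nov 16 17:57:22 2016] CPU16: Core temperature above threshold, cpu clock throttled (total events = 53989212)",
]

if __name__ == "__main__":
    for (cpu, scope), total in sorted(summarize_throttling(SAMPLE).items()):
        print(f"CPU{cpu} {scope}: {total} throttle events")
```

Keeping only the highest counter per CPU and scope makes it easy to compare hosts, since the kernel reports a running total rather than a per-message count.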

Event Timeline

I find it odd that so many servers are seeing these overheating issues. Thermal paste has worked in the past. I will need to purchase more thermal paste. Let's plan on doing this next week (Nov 21-23)

I find it odd that so many servers are seeing these overheating issues.

I agree. FWIW, the servers' CPU usage does not explain the overheating. On average [1] it's ~20%, and the spikes (which are what interest us here) reach ~50%, which would not in any case justify overheating.

Thermal paste has worked in the past. I will need to purchase more thermal paste. Let's plan on doing this next week (Nov 21-23)

Great, let me know so I can depool and shut them down.

[1] https://grafana.wikimedia.org/dashboard/db/server-board?var-server=scb1003&var-network=eth0

@akosiaris I am sorry this got buried. Should we schedule a time?

Scheduled for Friday Dec 9th, late US morning. I'll be depooling and shutting down the hosts a bit before.

Mentioned in SAL (#wikimedia-operations) [2016-12-09T13:36:29Z] <akosiaris> depool fully scb1003, scb1004 T150882

Depooled and shut down scb1003, scb1004. Scheduled downtime in Icinga as well.

@Cmjohnson The servers are ready for their thermal paste treatment.

Both servers have had their thermal paste removed and replaced.

I've fully repooled the servers. Let's wait a couple of days and see.

Mentioned in SAL (#wikimedia-operations) [2016-12-09T20:57:43Z] <akosiaris> fully repool scb1003, scb1004, T150882

Neither server has spewed any warnings during the last 2 days, so I am happily gonna resolve this. Thanks @Cmjohnson !
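For anyone wanting to repeat the "no warnings in the last 2 days" check, here is a minimal sketch. It assumes the log lines carry human-readable timestamps in the same format as the excerpt in the description; `throttle_lines_since` is a hypothetical helper name, not an existing tool.

```python
import re
from datetime import datetime, timedelta

# Assumption: lines look like
# "[Wed Nov 16 17:57:07 2016] CPU13: Package temperature above threshold, ..."
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] CPU\d+: (?:Package|Core) temperature above threshold"
)


def throttle_lines_since(lines, now, window=timedelta(days=2)):
    """Return the throttle warnings logged within `window` before `now`."""
    cutoff = now - window
    recent = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a throttle warning
        stamp = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
        if stamp >= cutoff:
            recent.append(line)
    return recent


SAMPLE_LOG = [
    "[Wed Nov 16 17:57:07 2016] CPU15: Package temperature/speed normal",
    "[Wed Nov 16 17:57:22 2016] CPU0: Core temperature above threshold, cpu clock throttled (total events = 53988734)",
]

if __name__ == "__main__":
    # An empty result means no throttle warnings in the window.
    print(throttle_lines_since(SAMPLE_LOG, now=datetime(2016, 12, 12, 9, 0, 0)))
```

An empty list for the chosen window is what "no warnings" would look like here; the two-day default simply mirrors the waiting period used on this task.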

Mentioned in SAL (#wikimedia-operations) [2016-12-12T08:56:32Z] <akosiaris> increase by 50% the weight of scb1003, scb1004 for most services on them now that they no longer exhibit temperature problems. These boxes are more powerful than scb1001, scb1002 and should be able to serve more requests. T150882