
ulsfo temperature-related exceptions
Closed, Resolved (Public)

Description

There are tons of these in the logs; the first was logged on November 19th and the last on November 22nd:

Nov 19 20:17:08 lvs4002 kernel: [5595225.203814] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
Nov 19 20:17:08 lvs4002 kernel: [5595225.203816] CPU14: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 19 20:17:08 lvs4002 kernel: [5595225.203818] CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 19 20:17:08 lvs4002 kernel: [5595225.203820] CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 19 20:17:08 lvs4002 kernel: [5595225.203821] CPU10: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 19 20:17:08 lvs4002 kernel: [5595225.203823] CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 19 20:17:08 lvs4002 kernel: [5595225.203825] CPU8: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 19 20:17:08 lvs4002 kernel: [5595225.203826] CPU12: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 19 20:17:08 lvs4002 kernel: [5595225.204815] CPU14: Package temperature/speed normal
Nov 19 20:17:08 lvs4002 kernel: [5595225.204816] CPU0: Package temperature/speed normal

Also:

root@lvs4001:~# mcelog 
Hardware event. This is not a software error.
MCE 0
CPU 7 THERMAL EVENT TSC 3a8765725d4d07 
TIME 1448464128 Wed Nov 25 15:08:48 2015
Processor 7 below trip temperature. Throttling disabled
STATUS 88010a82 MCGSTATUS 0
MCGCAP 1000c14 APICID 26 SOCKETID 1 
CPUID Vendor Intel Family 6 Model 45
Hardware event. This is not a software error.
MCE 1
CPU 0 THERMAL EVENT TSC 3a8780dfa8ed47 
TIME 1448464171 Wed Nov 25 15:09:31 2015
Processor 0 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 88000bc3 MCGSTATUS 0
MCGCAP 1000c14 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 45
mcelog: CPU 7 on socket 1 received unknown error
mcelog: CPU 0 on socket 0 received unknown error
mcelog: Location: CPU 7 on socket 1
mcelog: Location: CPU 0 on socket 0

What went wrong in ulsfo? Colocation failure? Why didn't we catch it while it was happening?
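
On the "why didn't we catch it" question: the kernel exposes per-CPU throttle counters under sysfs, so a periodic check could alert whenever they increase. Below is a minimal sketch, assuming the x86 `thermal_throttle` sysfs files are present on these hosts. The function names and the `warn`/`crit` thresholds are hypothetical, not an existing Icinga check.

```python
import glob
import os


def read_throttle_counts(sysfs_root="/sys/devices/system/cpu"):
    """Sum core and package throttle counters across all CPUs.

    Package counters are reported by every CPU in the package, so the
    package total overcounts; that is fine for a "did it increase" check.
    """
    totals = {"core": 0, "package": 0}
    for kind in totals:
        pattern = os.path.join(
            sysfs_root, "cpu[0-9]*", "thermal_throttle", f"{kind}_throttle_count"
        )
        for path in glob.glob(pattern):
            try:
                with open(path) as fh:
                    totals[kind] += int(fh.read().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this CPU: skip it.
                continue
    return totals


def evaluate(previous, current, warn=1, crit=1000):
    """Compare two samples and return (nagios_exit_code, message).

    The thresholds are illustrative assumptions, not tuned values.
    """
    delta = sum(max(current[k] - previous.get(k, 0), 0) for k in current)
    if delta >= crit:
        return 2, f"CRITICAL: {delta} new thermal throttle events"
    if delta >= warn:
        return 1, f"WARNING: {delta} new thermal throttle events"
    return 0, "OK: no new thermal throttle events"
```

A real check would need to persist the previous sample between runs (for example in a small state file) and pass it to `evaluate` together with a fresh `read_throttle_counts()` result.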

Event Timeline

faidon created this task. Nov 25 2015, 3:13 PM
faidon raised the priority of this task to High.
faidon updated the task description.
faidon added a project: ops-ulsfo.
faidon added a subscriber: faidon.
Restricted Application added a project: Operations. Nov 25 2015, 3:13 PM
Restricted Application added a subscriber: Aklapper.
RobH added a subscriber: RobH. Nov 25 2015, 6:51 PM

UnitedLayer did not alert us to anything; however, those are indeed spikes.

I'll open a trouble ticket with them. I'm onsite today, and it is quite warm on their DC floor.

This happened just now, after a reboot:

Nov 30 15:04:10 lvs4001 kernel: [  169.562309] CPU7: Core temperature above threshold, cpu clock throttled (total events = 1)
Nov 30 15:04:10 lvs4001 kernel: [  169.562311] CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 30 15:04:10 lvs4001 kernel: [  169.562313] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 30 15:04:10 lvs4001 kernel: [  169.562314] CPU13: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 30 15:04:10 lvs4001 kernel: [  169.562316] CPU11: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 30 15:04:10 lvs4001 kernel: [  169.562318] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 30 15:04:10 lvs4001 kernel: [  169.562320] CPU15: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 30 15:04:10 lvs4001 kernel: [  169.562323] CPU9: Package temperature above threshold, cpu clock throttled (total events = 1)
Nov 30 15:04:10 lvs4001 kernel: [  169.563297] CPU1: Package temperature/speed normal
Nov 30 15:04:10 lvs4001 kernel: [  169.563298] CPU11: Package temperature/speed normal
Nov 30 15:04:10 lvs4001 kernel: [  169.563299] CPU13: Package temperature/speed normal
Nov 30 15:04:10 lvs4001 kernel: [  169.563299] CPU3: Package temperature/speed normal
Nov 30 15:04:10 lvs4001 kernel: [  169.563300] CPU15: Package temperature/speed normal
Nov 30 15:04:10 lvs4001 kernel: [  169.563301] CPU5: Package temperature/speed normal
Nov 30 15:04:10 lvs4001 kernel: [  169.563301] CPU9: Package temperature/speed normal
Nov 30 15:04:10 lvs4001 kernel: [  169.638376] CPU7: Core temperature/speed normal
Nov 30 15:04:10 lvs4001 kernel: [  169.638382] mce: [Hardware Error]: Machine check events logged
Nov 30 15:06:21 lvs4001 kernel: [  300.433447] CPU0: Package temperature above threshold, cpu clock throttled (total events = 36229)
Nov 30 15:06:21 lvs4001 kernel: [  300.658259] CPU2: Package temperature above threshold, cpu clock throttled (total events = 36343)
Nov 30 15:06:21 lvs4001 kernel: [  300.698305] CPU4: Package temperature above threshold, cpu clock throttled (total events = 36348)
Nov 30 15:06:21 lvs4001 kernel: [  300.735292] CPU6: Package temperature/speed normal
Nov 30 15:06:22 lvs4001 kernel: [  300.839145] CPU10: Package temperature above threshold, cpu clock throttled (total events = 36302)
Nov 30 15:06:22 lvs4001 kernel: [  300.839151] CPU8: Package temperature above threshold, cpu clock throttled (total events = 36320)
Nov 30 15:06:22 lvs4001 kernel: [  300.840136] CPU8: Package temperature/speed normal
Nov 30 15:06:22 lvs4001 kernel: [  300.842172] CPU12: Package temperature/speed normal
Nov 30 15:06:22 lvs4001 kernel: [  300.888145] CPU14: Package temperature/speed normal

…so it's not something that spiked in the past but is OK now.

fgiunchedi added a subscriber: fgiunchedi. (Edited) Dec 1 2015, 5:12 PM

Do we have monitorable PDUs in ulsfo? That might help too; see also T109903: Add PDU redundancy server/router/switch checks in Icinga

This is still happening:

root@lvs4001:~# tail /var/log/kern.log
Feb  9 12:19:34 lvs4001 kernel: [6120361.221303] CPU10: Package temperature/speed normal
Feb  9 12:19:34 lvs4001 kernel: [6120361.511071] CPU8: Package temperature/speed normal
Feb  9 12:19:34 lvs4001 kernel: [6120361.647923] CPU14: Package temperature above threshold, cpu clock throttled (total events = 1447696385)
Feb  9 12:21:22 lvs4001 kernel: [6120469.151739] CPU7: Package temperature above threshold, cpu clock throttled (total events = 1373490357)
Feb  9 12:21:43 lvs4001 kernel: [6120490.583139] CPU3: Package temperature above threshold, cpu clock throttled (total events = 1373505063)
Feb  9 12:21:51 lvs4001 kernel: [6120498.257800] CPU13: Package temperature/speed normal
Feb  9 12:21:51 lvs4001 kernel: [6120498.445703] CPU15: Package temperature/speed normal
Feb  9 12:21:51 lvs4001 kernel: [6120498.713576] CPU5: Package temperature/speed normal
Feb  9 12:21:51 lvs4001 kernel: [6120498.796512] CPU1: Package temperature above threshold, cpu clock throttled (total events = 1373511133)
Feb  9 12:21:51 lvs4001 kernel: [6120498.807091] CPU1: Package temperature/speed normal
root@lvs4001:~#

It occurs only on lvs4001/4002, not 4003/4004, so only rack 1.22 and not 1.23.
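
To compare hosts like this without eyeballing `kern.log`, the throttle lines can be tallied per host. This is a rough sketch that assumes the syslog line format shown in the excerpts above; the regex and the function name `count_throttle_events` are illustrative, not an existing tool.

```python
import re
from collections import Counter

# Matches lines such as:
# "Feb  9 12:19:34 lvs4001 kernel: [6120361.647923] CPU14: Package
#  temperature above threshold, cpu clock throttled (total events = ...)"
THROTTLE_RE = re.compile(
    r"^\w{3}\s+\d+ \d\d:\d\d:\d\d (?P<host>\S+) kernel: \[\s*[\d.]+\] "
    r"CPU\d+: (?P<scope>Core|Package) temperature above threshold"
)


def count_throttle_events(lines):
    """Count 'above threshold' log lines, keyed by (host, scope)."""
    counts = Counter()
    for line in lines:
        match = THROTTLE_RE.match(line)
        if match:
            counts[(match.group("host"), match.group("scope"))] += 1
    return counts
```

Feeding it the concatenated `kern.log` lines from all four lvs hosts gives one counter entry per host and scope; hosts with no throttle lines simply do not appear.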

BBlack added a subscriber: BBlack. Feb 9 2016, 1:41 PM

See also/merge: T125205

RobH added a comment. Feb 9 2016, 9:10 PM

I didn't end up opening that ticket when I said I would, but it has been opened as of today.

I've requested that they investigate the temperature differential between rack 1.23 (which is not giving any alarms) and rack 1.22 (which has high temp alarms).

RobH mentioned this in Unknown Object (Task). Feb 9 2016, 11:50 PM
RobH added a comment. (Edited) Mar 1 2016, 11:16 PM

UL has stated there are no temperature issues in our rack. We now have a thermal camera to take readings, and I'll be visiting onsite shortly (once T128424 also arrives).

RobH added a comment. Apr 13 2016, 7:19 PM

In addition to replacing the thermal paste on all lvs4001-4004, I've knocked out the following hosts from T125205.

  • cp4008
  • cp4010
  • cp4011
  • cp4012

That took care of all lvs machines, and all the cp systems with 1k+ temp alerts in their logs on T125205. In addition to those, I knocked out cp4003 & cp4004, which also had degraded thermal paste.

Since I'm now out of paste, I'll need to order more. Each large tube is enough for about 5 systems.

At 25 hosts in ulsfo, I need to order at least 3 more tubes. I'll make it 5 and we'll keep some onsite.
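
The tube count can be sanity-checked with a quick calculation. This sketch assumes that all 25 hosts need new paste and that the 10 hosts named above (lvs4001-4004, cp4008, cp4010, cp4011, cp4012, cp4003, cp4004) are already done; `tubes_needed` is a hypothetical helper name.

```python
import math


def tubes_needed(total_hosts, done_hosts, hosts_per_tube):
    """Tubes required for the remaining hosts, rounded up to whole tubes."""
    remaining = max(total_hosts - done_hosts, 0)
    return math.ceil(remaining / hosts_per_tube)


# 25 hosts in ulsfo, 10 already re-pasted, roughly 5 systems per tube.
print(tubes_needed(25, 10, 5))  # 3
```

That leaves 15 hosts at 5 per tube, which matches the "at least 3 more tubes" figure; ordering 5 leaves spare paste onsite.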

RobH closed this task as Resolved. Jun 24 2016, 6:20 PM
RobH claimed this task.

This was completed months ago, and I neglected to close out this task.

Restricted Application added a subscriber: Southparkfan. Jun 24 2016, 6:20 PM