cp307[12] thermal issues
Closed, Resolved (Public)

Description

cp307[12] are experiencing thermal issues, documented on parent task T373993.

Error log

vgutierrez@cumin1002:~$ sudo -i cumin 'A:cp' 'zgrep "Core temperature is above threshold, cpu clock is throttled" /var/log/kern.lo* > /dev/null && echo "CPU is getting throttled"  || echo "CPU is OK"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====                                                                                                                                                           
(6) cp[3071-3072].esams.wmnet,cp[7009,7011,7015-7016].magru.wmnet                                                                                                                
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----                                                                                                                      
CPU is getting throttled                                                                                                                                                         
===== NODE GROUP =====                                                                                                                                                           
(106) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3070,3073-3081].esams.wmnet,cp[7001-7008,7010,7012-7014].magru.wmnet,cp[4037-4052].ulsfo.wmnet                                                                                                                                              
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----                                                                                                                      
CPU is OK

We'll need to open support cases for each of these hosts and have Dell dispatch a repair engineer to ESAMS for full downtime and new thermal paste application on these two hosts.

Checklist

  • Dell support case opened for cp307[12] : cp3071:7QGW8X3:197773213, cp3072:4QGW8X3:197773352
  • Dell engineer scheduled for on-site repair of both hosts
  • Interxion Remote Visit ticket opened for engineer visit
  • Maint window scheduled with Traffic
  • On-site work occurrence
  • Post work check for resolution of thermal issues on cp3071
  • Post work check for resolution of thermal issues on cp3072

Event Timeline

Answering @RobH's question here:

Hey, I made some assumptions on the cp hosts troubleshooting but should check with you: those hosts are under the same weight conditions as all their related hosts, correct? If so, then indeed something is off, and it is likely the thermal paste.

That's right, all hosts have the same weight. Under normal conditions they should handle the same load on average.

So the SEL/iDRAC logs show no thermal events, and Dell support is attempting to deny these support requests.

On checking cp3071, I don't see any thermal events in the logs:

root@cp3071:/var/log# zgrep "CPU" kern.lo*
root@cp3071:/var/log# zgrep "throttled" kern.lo*
root@cp3071:/var/log# zgrep "6182072.466337" kern.lo*
kern.log:Sep 18 08:29:58 cp3071 kernel: [6182072.466337] TCP: tcp_parse_options: Illegal window scaling value 93 > 14 received

@Vgutierrez: Can you advise what I'm doing incorrectly here that I'm unable to reproduce the error logs you pulled from the cumin host?

@RobH:

sukhe@cumin1002:~$ sudo cumin 'A:cp' 'dmesg -T | grep -q -i "core temperature is above" && echo "CPU throttled due to high temperature" || echo "CPU is OK"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====                                                                                                                                                                                                                         
(9) cp[3071-3072].esams.wmnet,cp[7005,7009,7011-7013,7015-7016].magru.wmnet                                                                                                                                                                    
----- OUTPUT of 'dmesg -T | grep ...echo "CPU is OK"' -----                                                                                                                                                                                    
CPU throttled due to high temperature

(Let us know if you want a dump of the dmesg output verbatim and we can provide that.)

You are right that we are not seeing this in SEL. Out of curiosity, I ran getsensorinfo and I found:

Sensor Type : TEMPERATURE
<Sensor Name>            <Status>    <Reading> <lc> <uc>  <lnc>[R/W]  <unc>[R/W]
[Key = iDRAC.Embedded.1#CPU1Temp]
CPU1 Temp                     Ok      84C      3C   101C     NA [N]      NA [N]

[Key = iDRAC.Embedded.1#CPU2Temp]
CPU2 Temp                     Ok      75C      3C   101C     NA [N]      NA [N]

You are the expert, but only the lc and uc values are set, not the warning ones (lnc or unc). Since we never reach the 101C threshold here anyway, we are probably not logging temperatures that exceed 90C but stay below 101C. Could this be the reason SEL says nothing about this? The kernel is reporting throttling, and it cannot be a software issue; otherwise we would be seeing it on more hosts, or basically on the entire cluster. The 5318Y Tcase temperature seems to be 87C, which is what we seem to be exceeding here, so the issue is real in that sense.

Can we provide Dell with this information? Or, failing that, can we ask them what they want to see, given that we are seeing this in the logs?
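
The threshold reasoning in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes that a sensor status changes (and so a SEL entry could be logged) only when a configured threshold is crossed, which is the hypothesis raised above rather than confirmed iDRAC behaviour. The function name `classify_reading` and the 95C reading are hypothetical; the 3C, 101C, 84C, 75C and 87C values are taken from the getsensorinfo output and the comment.

```python
from typing import Optional

def classify_reading(reading_c: float,
                     lower_critical_c: Optional[float],
                     upper_critical_c: Optional[float],
                     lower_warning_c: Optional[float] = None,
                     upper_warning_c: Optional[float] = None) -> str:
    """Return the threshold band a temperature reading falls into.

    A threshold of None models an 'NA' entry in the getsensorinfo output.
    Treating a reading equal to a threshold as a crossing is an assumption.
    """
    if upper_critical_c is not None and reading_c >= upper_critical_c:
        return "upper critical"
    if lower_critical_c is not None and reading_c <= lower_critical_c:
        return "lower critical"
    if upper_warning_c is not None and reading_c >= upper_warning_c:
        return "upper warning"
    if lower_warning_c is not None and reading_c <= lower_warning_c:
        return "lower warning"
    return "ok"

TCASE_C = 87  # Tcase figure quoted in the comment for the 5318Y

# lc=3C and uc=101C as configured; lnc/unc are NA, so they stay None.
# The 95C reading is hypothetical, chosen to sit between Tcase and uc.
for name, reading in (("CPU1", 84), ("CPU2", 75), ("hypothetical", 95)):
    band = classify_reading(reading, 3, 101)
    tcase = "above Tcase" if reading > TCASE_C else "within Tcase"
    print(f"{name}: {reading}C sensor={band} ({tcase})")
```

With only lc and uc configured, the hypothetical 95C reading is still reported as "ok" even though it is above the quoted Tcase, which matches the concern that nothing would reach SEL below 101C. Passing `upper_warning_c=90` makes the same reading classify as "upper warning".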

I think they'll want a dump of the dmesg directly for the CPU temperature incidents so we can point at where it had to throttle down at exact dates/time, since now they are saying there is no error.

My previous comment shows my command line, which is the non-cumin call for the same info. Well, I thought it was, but I obviously have something incorrect.

Can you point me to the log file with the exact issues and/or pull that data for cp307[12] directly to here so I can provide it to Dell NL support? Please also provide the commands run so I can do the same for the magru hosts, thanks!

Hi @RobH: Sharing the cumin command and its output so that you have some timestamps (UTC) ready to go (the esams ones are at the end, but I am just dumping everything for later use):

sukhe@cumin1002:~$ sudo cumin 'A:cp' 'dmesg -T | grep -i "core temperature is above"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====                                                                                                                
(1) cp7016.magru.wmnet                                                                                                               
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----                                                                           
[Fri Jun 14 22:15:36 2024] mce: CPU70: Core temperature is above threshold, cpu clock is throttled (total events = 228)               
[Fri Jun 14 22:15:36 2024] mce: CPU22: Core temperature is above threshold, cpu clock is throttled (total events = 227)               
[Mon Jun 17 01:45:45 2024] mce: CPU70: Core temperature is above threshold, cpu clock is throttled (total events = 356)
[Mon Jun 17 01:45:45 2024] mce: CPU22: Core temperature is above threshold, cpu clock is throttled (total events = 355)
[Wed Jun 19 02:45:22 2024] mce: CPU76: Core temperature is above threshold, cpu clock is throttled (total events = 389)
[Wed Jun 19 02:45:22 2024] mce: CPU28: Core temperature is above threshold, cpu clock is throttled (total events = 389)
[Mon Jun 24 11:45:40 2024] mce: CPU22: Core temperature is above threshold, cpu clock is throttled (total events = 975)
[Wed Jun 26 22:15:25 2024] mce: CPU0: Core temperature is above threshold, cpu clock is throttled (total events = 1130)
[Wed Jul  3 19:15:40 2024] mce: CPU22: Core temperature is above threshold, cpu clock is throttled (total events = 1682)
[Fri Jul  5 18:15:12 2024] mce: CPU18: Core temperature is above threshold, cpu clock is throttled (total events = 1771)
[Fri Jul  5 18:15:12 2024] mce: CPU50: Core temperature is above threshold, cpu clock is throttled (total events = 1973)
[Fri Jul  5 18:15:12 2024] mce: CPU66: Core temperature is above threshold, cpu clock is throttled (total events = 1772)
[Fri Jul  5 18:15:12 2024] mce: CPU2: Core temperature is above threshold, cpu clock is throttled (total events = 1973)
[Wed Sep  4 08:14:01 2024] mce: CPU70: Core temperature is above threshold, cpu clock is throttled (total events = 2371)
[Wed Sep  4 08:14:01 2024] mce: CPU22: Core temperature is above threshold, cpu clock is throttled (total events = 2373)
===== NODE GROUP =====                                                                                                                
(1) cp7013.magru.wmnet                                                                                                                
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----                                                                           
[Thu Jun 20 08:23:03 2024] mce: CPU56: Core temperature is above threshold, cpu clock is throttled (total events = 177)               
[Thu Jun 20 08:23:03 2024] mce: CPU8: Core temperature is above threshold, cpu clock is throttled (total events = 177)                
[Thu Jun 20 08:23:25 2024] mce: CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 141)
[Thu Jun 20 08:23:25 2024] mce: CPU52: Core temperature is above threshold, cpu clock is throttled (total events = 141)
[Sun Jun 30 13:53:27 2024] mce: CPU54: Core temperature is above threshold, cpu clock is throttled (total events = 170)
[Sun Jun 30 13:53:27 2024] mce: CPU6: Core temperature is above threshold, cpu clock is throttled (total events = 170)
[Tue Jul  9 03:17:08 2024] mce: CPU66: Core temperature is above threshold, cpu clock is throttled (total events = 303)
[Tue Jul  9 03:17:08 2024] mce: CPU18: Core temperature is above threshold, cpu clock is throttled (total events = 303)
[Sat Jul 13 08:17:30 2024] mce: CPU18: Core temperature is above threshold, cpu clock is throttled (total events = 338)
[Sun Sep 15 07:47:12 2024] mce: CPU54: Core temperature is above threshold, cpu clock is throttled (total events = 354)
===== NODE GROUP =====                                                                                                                
(1) cp7015.magru.wmnet                                                                                                                
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----                                                                           
[Fri Aug  2 03:13:53 2024] mce: CPU28: Core temperature is above threshold, cpu clock is throttled (total events = 2)                 
[Fri Aug  2 03:13:53 2024] mce: CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 2)                  
[Fri Aug  2 03:13:58 2024] mce: CPU0: Core temperature is above threshold, cpu clock is throttled (total events = 2)
[Fri Aug  2 03:13:58 2024] mce: CPU24: Core temperature is above threshold, cpu clock is throttled (total events = 2)
[Sun Aug  4 15:53:33 2024] mce: CPU0: Core temperature is above threshold, cpu clock is throttled (total events = 22)
[Sun Aug  4 15:53:33 2024] mce: CPU26: Core temperature is above threshold, cpu clock is throttled (total events = 20)
[Sun Aug  4 15:53:33 2024] mce: CPU2: Core temperature is above threshold, cpu clock is throttled (total events = 20)
[Wed Aug  7 23:01:04 2024] mce: CPU26: Core temperature is above threshold, cpu clock is throttled (total events = 47)
[Fri Aug  9 20:10:35 2024] mce: CPU2: Core temperature is above threshold, cpu clock is throttled (total events = 79)
[Fri Aug  9 20:10:35 2024] mce: CPU26: Core temperature is above threshold, cpu clock is throttled (total events = 79)
[Sun Aug 18 02:23:53 2024] mce: CPU12: Core temperature is above threshold, cpu clock is throttled (total events = 193)
[Sun Aug 18 10:53:38 2024] mce: CPU26: Core temperature is above threshold, cpu clock is throttled (total events = 209)
[Sun Aug 18 10:53:38 2024] mce: CPU2: Core temperature is above threshold, cpu clock is throttled (total events = 209)
[Tue Aug 20 17:53:25 2024] mce: CPU44: Core temperature is above threshold, cpu clock is throttled (total events = 261)
[Tue Aug 20 17:53:25 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 261)
[Wed Aug 21 02:23:40 2024] mce: CPU34: Core temperature is above threshold, cpu clock is throttled (total events = 292)
[Wed Aug 21 02:23:40 2024] mce: CPU10: Core temperature is above threshold, cpu clock is throttled (total events = 292)
[Sat Aug 24 11:07:25 2024] mce: CPU34: Core temperature is above threshold, cpu clock is throttled (total events = 330)
[Sat Aug 24 11:07:25 2024] mce: CPU10: Core temperature is above threshold, cpu clock is throttled (total events = 330)
[Sun Aug 25 18:53:46 2024] mce: CPU28: Core temperature is above threshold, cpu clock is throttled (total events = 569)
[Sun Aug 25 18:53:46 2024] mce: CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 569)
[Thu Aug 29 18:53:30 2024] mce: CPU30: Core temperature is above threshold, cpu clock is throttled (total events = 359)
[Thu Aug 29 18:53:30 2024] mce: CPU6: Core temperature is above threshold, cpu clock is throttled (total events = 359)
[Thu Aug 29 21:43:47 2024] mce: CPU2: Core temperature is above threshold, cpu clock is throttled (total events = 363)
[Thu Aug 29 21:43:53 2024] mce: CPU26: Core temperature is above threshold, cpu clock is throttled (total events = 363)
[Thu Aug 29 22:23:20 2024] mce: CPU44: Core temperature is above threshold, cpu clock is throttled (total events = 362)
[Thu Aug 29 22:23:20 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 362)
[Fri Aug 30 08:14:27 2024] mce: CPU24: Core temperature is above threshold, cpu clock is throttled (total events = 426)
[Fri Aug 30 08:14:27 2024] mce: CPU0: Core temperature is above threshold, cpu clock is throttled (total events = 426)
[Sat Aug 31 20:24:17 2024] mce: CPU26: Core temperature is above threshold, cpu clock is throttled (total events = 389)
[Sat Aug 31 20:24:17 2024] mce: CPU2: Core temperature is above threshold, cpu clock is throttled (total events = 389)
[Sat Aug 31 20:24:23 2024] mce: CPU28: Core temperature is above threshold, cpu clock is throttled (total events = 709)
[Sat Aug 31 20:24:23 2024] mce: CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 709)
[Sat Aug 31 20:24:28 2024] mce: CPU8: Core temperature is above threshold, cpu clock is throttled (total events = 404)
[Sat Aug 31 20:24:28 2024] mce: CPU32: Core temperature is above threshold, cpu clock is throttled (total events = 404)
[Sat Aug 31 20:24:31 2024] mce: CPU24: Core temperature is above threshold, cpu clock is throttled (total events = 457)
[Sat Aug 31 20:24:31 2024] mce: CPU0: Core temperature is above threshold, cpu clock is throttled (total events = 457)
[Sun Sep  1 11:07:27 2024] mce: CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 721)
[Sun Sep  1 15:54:06 2024] mce: CPU24: Core temperature is above threshold, cpu clock is throttled (total events = 481)
[Sun Sep  1 15:54:06 2024] mce: CPU0: Core temperature is above threshold, cpu clock is throttled (total events = 481)
[Mon Sep  2 06:23:46 2024] mce: CPU14: Core temperature is above threshold, cpu clock is throttled (total events = 434)
[Mon Sep  2 08:53:47 2024] mce: CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 779)
[Tue Sep  3 08:23:43 2024] mce: CPU28: Core temperature is above threshold, cpu clock is throttled (total events = 787)
[Tue Sep  3 08:23:43 2024] mce: CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 787)
[Tue Sep  3 08:23:44 2024] mce: CPU24: Core temperature is above threshold, cpu clock is throttled (total events = 516)
[Tue Sep  3 08:23:44 2024] mce: CPU0: Core temperature is above threshold, cpu clock is throttled (total events = 516)
[Tue Sep  3 09:57:43 2024] mce: CPU34: Core temperature is above threshold, cpu clock is throttled (total events = 489)
[Tue Sep  3 09:57:43 2024] mce: CPU10: Core temperature is above threshold, cpu clock is throttled (total events = 489)
===== NODE GROUP =====                                                                                                                
(1) cp7012.magru.wmnet                                                                                                                
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----                                                                           
[Fri Jul 19 05:44:46 2024] mce: CPU55: Core temperature is above threshold, cpu clock is throttled (total events = 393)               
[Tue Jul 23 03:14:48 2024] mce: CPU3: Core temperature is above threshold, cpu clock is throttled (total events = 379)                
[Tue Jul 23 03:14:48 2024] mce: CPU51: Core temperature is above threshold, cpu clock is throttled (total events = 379)
===== NODE GROUP =====                                                                                                                
(1) cp7011.magru.wmnet                                                                                                                
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----                                                                           
[Sat Jun  8 19:15:26 2024] mce: CPU52: Core temperature is above threshold, cpu clock is throttled (total events = 768)               
[Thu Jun 13 00:45:44 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 1328)              
[Thu Jun 13 00:45:44 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 1325)
[Fri Jun 14 11:15:37 2024] mce: CPU84: Core temperature is above threshold, cpu clock is throttled (total events = 1374)
[Fri Jun 14 11:15:37 2024] mce: CPU36: Core temperature is above threshold, cpu clock is throttled (total events = 1373)
[Fri Jun 28 08:46:14 2024] mce: CPU54: Core temperature is above threshold, cpu clock is throttled (total events = 3474)
[Mon Jul  1 23:16:31 2024] mce: CPU84: Core temperature is above threshold, cpu clock is throttled (total events = 4125)
[Mon Jul  1 23:16:31 2024] mce: CPU36: Core temperature is above threshold, cpu clock is throttled (total events = 4118)
[Thu Jul 11 01:15:46 2024] mce: CPU36: Core temperature is above threshold, cpu clock is throttled (total events = 5563)
[Tue Jul 16 05:16:07 2024] mce: CPU66: Core temperature is above threshold, cpu clock is throttled (total events = 5616)
[Tue Jul 23 09:15:04 2024] mce: CPU84: Core temperature is above threshold, cpu clock is throttled (total events = 5783)
[Sun Aug 25 12:16:06 2024] mce: CPU84: Core temperature is above threshold, cpu clock is throttled (total events = 6045)
[Sun Aug 25 12:16:06 2024] mce: CPU36: Core temperature is above threshold, cpu clock is throttled (total events = 6034)
[Sat Aug 31 07:16:05 2024] mce: CPU20: Core temperature is above threshold, cpu clock is throttled (total events = 6936)
[Sat Aug 31 07:16:05 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 6944)
[Sat Sep  7 04:16:14 2024] mce: CPU54: Core temperature is above threshold, cpu clock is throttled (total events = 6071)
[Sat Sep  7 04:16:14 2024] mce: CPU6: Core temperature is above threshold, cpu clock is throttled (total events = 6050)
[Sat Sep  7 04:16:14 2024] mce: CPU48: Core temperature is above threshold, cpu clock is throttled (total events = 6908)
[Sat Sep  7 04:16:14 2024] mce: CPU0: Core temperature is above threshold, cpu clock is throttled (total events = 6897)
[Thu Sep 12 07:15:17 2024] mce: CPU84: Core temperature is above threshold, cpu clock is throttled (total events = 6256)
[Thu Sep 12 07:15:17 2024] mce: CPU36: Core temperature is above threshold, cpu clock is throttled (total events = 6244)
[Wed Sep 18 08:45:13 2024] mce: CPU54: Core temperature is above threshold, cpu clock is throttled (total events = 6211)
[Wed Sep 18 08:45:13 2024] mce: CPU6: Core temperature is above threshold, cpu clock is throttled (total events = 6190)
===== NODE GROUP =====                                                                                                                
(1) cp7009.magru.wmnet                                                                                                                
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----                                                                           
[Tue Jun 18 02:15:36 2024] mce: CPU66: Core temperature is above threshold, cpu clock is throttled (total events = 45)                
[Sun Jun 23 23:46:02 2024] mce: CPU68: Core temperature is above threshold, cpu clock is throttled (total events = 90)                
[Mon Jun 24 14:15:50 2024] mce: CPU54: Core temperature is above threshold, cpu clock is throttled (total events = 112)
[Tue Jun 25 05:15:57 2024] mce: CPU6: Core temperature is above threshold, cpu clock is throttled (total events = 117)
[Wed Jun 26 18:50:11 2024] mce: CPU10: Core temperature is above threshold, cpu clock is throttled (total events = 149)
[Wed Jun 26 18:50:11 2024] mce: CPU58: Core temperature is above threshold, cpu clock is throttled (total events = 149)
[Mon Jul  8 00:15:23 2024] mce: CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 526)
[Mon Jul  8 00:15:23 2024] mce: CPU52: Core temperature is above threshold, cpu clock is throttled (total events = 526)
[Mon Jul  8 09:15:35 2024] mce: CPU70: Core temperature is above threshold, cpu clock is throttled (total events = 510)
[Fri Jul 12 07:45:37 2024] mce: CPU54: Core temperature is above threshold, cpu clock is throttled (total events = 695)
[Fri Aug 23 05:45:50 2024] mce: CPU6: Core temperature is above threshold, cpu clock is throttled (total events = 751)
[Fri Aug 23 05:45:50 2024] mce: CPU54: Core temperature is above threshold, cpu clock is throttled (total events = 750)
[Fri Aug 23 05:45:50 2024] mce: CPU56: Core temperature is above threshold, cpu clock is throttled (total events = 602)
===== NODE GROUP =====                                                                                                                
(1) cp7005.magru.wmnet                                                                                                                
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----                                                                           
[Sun Jun  9 02:48:46 2024] mce: CPU55: Core temperature is above threshold, cpu clock is throttled (total events = 15)                
[Wed Jun 19 10:19:59 2024] mce: CPU5: Core temperature is above threshold, cpu clock is throttled (total events = 69)                 
[Wed Jun 19 10:19:59 2024] mce: CPU53: Core temperature is above threshold, cpu clock is throttled (total events = 69)
[Tue Jul 23 04:20:56 2024] mce: CPU51: Core temperature is above threshold, cpu clock is throttled (total events = 110)
[Tue Sep 10 11:48:26 2024] mce: CPU15: Core temperature is above threshold, cpu clock is throttled (total events = 170)
===== NODE GROUP =====                                                                                                                
(1) cp3071.esams.wmnet                                                                                                                
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----                                                                           
[Mon Jul 15 21:29:26 2024] mce: CPU88: Core temperature is above threshold, cpu clock is throttled (total events = 2)                 
[Mon Jul 15 21:29:26 2024] mce: CPU52: Core temperature is above threshold, cpu clock is throttled (total events = 2)                 
[Mon Jul 15 21:29:26 2024] mce: CPU40: Core temperature is above threshold, cpu clock is throttled (total events = 2)
[Mon Jul 15 21:29:26 2024] mce: CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 2)
[Fri Aug  9 16:00:26 2024] mce: CPU48: Core temperature is above threshold, cpu clock is throttled (total events = 20)
===== NODE GROUP =====                                                                                                                
(1) cp3072.esams.wmnet                                                                                                                
----- OUTPUT of 'dmesg -T | grep ...rature is above"' -----                                                                           
[Fri Aug  9 16:00:35 2024] mce: CPU0: Core temperature is above threshold, cpu clock is throttled (total events = 19)                 
[Fri Aug  9 16:00:35 2024] mce: CPU86: Core temperature is above threshold, cpu clock is throttled (total events = 19)                
[Fri Aug  9 16:00:35 2024] mce: CPU92: Core temperature is above threshold, cpu clock is throttled (total events = 19)
[Fri Aug  9 16:00:35 2024] mce: CPU48: Core temperature is above threshold, cpu clock is throttled (total events = 19)

The log files themselves are rotated, which is why you are not seeing this anymore, but dmesg is a ring buffer and still has these entries. Regardless of that, we have temperature data for all hosts in Prometheus, so we can query that as well; if the temperature there exceeds the Tcase temperature of 87C, I would say that's a pretty solid confirmation? That information, by the way, is also reflected on our current dashboards. Let me know if you think that will help; I am also happy to provide that.
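
For handing exact dates and times to a vendor, the dmesg lines above can be reduced to a per-host summary. The following is a minimal Python sketch, assuming input lines in the `dmesg -T` format shown in the output above. `parse_throttle_line` and `summarize` are hypothetical helper names, not part of cumin or any existing tooling. The sample lines are copied from the cp3071 output, and the parsed timestamps are naive datetimes that are taken to be UTC as stated in the comment.

```python
import re
from datetime import datetime

# Matches the mce throttle lines as printed by `dmesg -T` in this task.
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] mce: CPU(?P<cpu>\d+): Core temperature is above "
    r"threshold, cpu clock is throttled \(total events = (?P<events>\d+)\)"
)

def parse_throttle_line(line):
    """Return (timestamp, cpu, total_events), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    # A space in the strptime format matches a run of whitespace in the
    # input, so the padded day in "Aug  9" parses with "%b %d".
    ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
    return ts, int(m.group("cpu")), int(m.group("events"))

def summarize(lines):
    """Summarize throttle events: first/last timestamp, CPUs seen, highest counter."""
    events = [e for e in map(parse_throttle_line, lines) if e is not None]
    if not events:
        return None
    return {
        "first": min(ts for ts, _, _ in events),
        "last": max(ts for ts, _, _ in events),
        "cpus": sorted({cpu for _, cpu, _ in events}),
        "max_total_events": max(n for _, _, n in events),
    }

SAMPLE = """\
[Mon Jul 15 21:29:26 2024] mce: CPU88: Core temperature is above threshold, cpu clock is throttled (total events = 2)
[Mon Jul 15 21:29:26 2024] mce: CPU52: Core temperature is above threshold, cpu clock is throttled (total events = 2)
[Mon Jul 15 21:29:26 2024] mce: CPU40: Core temperature is above threshold, cpu clock is throttled (total events = 2)
[Mon Jul 15 21:29:26 2024] mce: CPU4: Core temperature is above threshold, cpu clock is throttled (total events = 2)
[Fri Aug  9 16:00:26 2024] mce: CPU48: Core temperature is above threshold, cpu clock is throttled (total events = 20)
"""

summary = summarize(SAMPLE.splitlines())
print(summary)
```

Lines that do not match the pattern (cumin headers, node group separators) are skipped, so the raw cumin output for one host can be passed in unchanged.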

(I would also say that maybe we should set warning thresholds in iDRAC so that we have SEL entries for this in case it happens again, but I am not sure if that will make a difference. That's perhaps a different task.)

I've sent over the log output for the two esams hosts to their respective support email threads, let's see what they say! Thank you!

Conversations with support are ongoing via email; they've moved on to scheduling an onsite visit. Sent all location details over along with a proposed maint window of October 2nd. (Everything with them takes a day or two due to time zones, so this was over a week out and thus likely problem free for scheduling.)

Hi @RobH: Is this confirmed for tomorrow Oct 2?

Yes, they'll be showing up onsite around 09:00 CET / 00:00 Pacific. We'll want to fully depool and power down these two hosts in advance of their arrival. I figured I would just do a depool on the host directly and a graceful shutdown an hour or two before they show up onsite. Does that sound feasible?

I only got the engineer onsite info and tracking data for the shipment of parts this AM, so it's all very last minute for Dell and not their ideal workflow.
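
The arrival time quoted above can be cross-checked with a short Python sketch. It assumes the 09:00 is local time in the Netherlands, which on 2 October 2024 is CEST (UTC+2) rather than CET, and that "Pacific" means PDT (UTC-7) on that date. Fixed offsets are used so the example does not depend on a time zone database, and the two-hour lead for the shutdown is just the upper end of the "hour or two" mentioned in the comment.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets valid for 2 October 2024 (assumption: daylight saving time
# is still in effect in both regions on that date).
CEST = timezone(timedelta(hours=2), "CEST")
PDT = timezone(timedelta(hours=-7), "PDT")

arrival_local = datetime(2024, 10, 2, 9, 0, tzinfo=CEST)
arrival_utc = arrival_local.astimezone(timezone.utc)
arrival_pacific = arrival_local.astimezone(PDT)

# Depool and graceful shutdown "an hour or two" before arrival; two hours here.
shutdown_by_utc = arrival_utc - timedelta(hours=2)

print("arrival (UTC):    ", arrival_utc.isoformat())
print("arrival (Pacific):", arrival_pacific.isoformat())
print("shutdown by (UTC):", shutdown_by_utc.isoformat())
```

Under these assumptions 09:00 local time is 07:00 UTC and 00:00 Pacific, which agrees with the Pacific time given in the comment.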

Thanks @RobH, that works for us. @Vgutierrez will depool the two hosts in advance of the event and downtime.

Your appointment has been scheduled between Wed, Oct 2, 2024 8:00 AM and Wed, Oct 2, 2024 12:00 PM. Please check back here for updates.
Your technician is scheduled to arrive onsite during a 4-8 hour window ending Wed, Oct 2, 2024 12:00 PM.

Hopefully they arrive at the start of the window, but I'll be standing by during the entirety until they finish work and return the servers to mgmt accessible state.

Since it is just removing and reapplying the thermal paste on both hosts (4 CPUs total), there shouldn't be any resulting config or accessibility changes, other than the hosts being offline during the actual work.

Icinga downtime and Alertmanager silence (ID=94132346-5cb8-4ed8-b2f6-868a8962928b) set by vgutierrez@cumin1002 for 4:00:00 on 2 host(s) and their services with reason: HW maintenance

cp[3071-3072].esams.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-10-02T09:13:08Z] <vgutierrez> repooling cp3071 and cp3072 after HW maintenance - T374986

Thanks @Vgutierrez for the assist, I was ready to go to bed and they took over supporting the remote tech doing the cpu thermal paste swaps.

This is now complete. I'm going to resolve this, since the parent task will track whether these hosts are now fixed and no longer experiencing CPU throttling events.