cp307[12] are experiencing thermal issues, documented on parent task T373993.
Error log
vgutierrez@cumin1002:~$ sudo -i cumin 'A:cp' 'zgrep "Core temperature is above threshold, cpu clock is throttled" /var/log/kern.lo* > /dev/null && echo "CPU is getting throttled" || echo "CPU is OK"'
112 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 112 hosts? Enter the number of affected hosts to confirm or "q" to quit: 112
===== NODE GROUP =====
(6) cp[3071-3072].esams.wmnet,cp[7009,7011,7015-7016].magru.wmnet
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----
CPU is getting throttled
===== NODE GROUP =====
(106) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3066-3070,3073-3081].esams.wmnet,cp[7001-7008,7010,7012-7014].magru.wmnet,cp[4037-4052].ulsfo.wmnet
----- OUTPUT of 'zgrep "Core temp...echo "CPU is OK"' -----
CPU is OK
We'll need to open a Dell support case for each of cp3071 and cp3072 and have Dell dispatch a repair engineer to ESAMS. Both hosts will need full downtime while new thermal paste is applied.
Checklist
- Dell support case opened for cp307[12] : cp3071:7QGW8X3:197773213, cp3072:4QGW8X3:197773352
- Dell engineer scheduled for on-site repair of both hosts
- Interxion Remote Visit ticket opened for engineer visit
- Maint window scheduled with Traffic
- On-site work occurrence
- Post work check for resolution of thermal issues on cp3071
- Post work check for resolution of thermal issues on cp3072
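For the two post-work checks, a minimal sketch of a per-host helper is shown here. It looks for the same kernel message as the cumin command in the error log. The function name `check_throttle` is hypothetical and not an existing tool. It assumes the log files are passed as arguments and that `zcat -f` is available to read both plain and gzip-compressed files.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): report whether any of the given
# kernel logs, plain or gzip-compressed, contain the thermal throttling message.
check_throttle() {
    pattern='Core temperature is above threshold, cpu clock is throttled'
    # zcat -f passes uncompressed files through unchanged, so kern.log and
    # rotated kern.log.N.gz files can be scanned in a single pipeline.
    if zcat -f -- "$@" 2>/dev/null | grep -qF -- "$pattern"; then
        echo "CPU is getting throttled"
    else
        echo "CPU is OK"
    fi
}
```

On a host this could be called as `check_throttle /var/log/kern.lo*`. Rotated logs written before the repair will still contain the old throttling messages, so the result is only meaningful for logs written after the on-site work.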