cloudvirt1025-1030 overheating issues
Open, Medium, Public

Description

While debugging something else, I found a bunch of entries like this in dmesg:

[Mon Aug  9 02:43:37 2021] CPU63: Package temperature above threshold, cpu clock throttled (total events = 453033)
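The "total events" figure in that message is a running counter kept by the kernel. As a rough sketch, and assuming these hosts expose the usual x86 thermal_throttle sysfs files (not verified on the cloudvirts), the same counters could be read per CPU without going through dmesg. The function name throttle_counts and its optional root argument are made up for illustration:

```shell
# Print "<cpu> <package_throttle_count>" for every CPU that exposes the counter.
# The sysfs root is a parameter (hypothetical convenience) so the function can
# be pointed at a test tree; by default it reads the live system.
throttle_counts() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/thermal_throttle/package_throttle_count; do
        [ -r "$f" ] || continue        # glob matched nothing, or file unreadable
        cpu=${f#"$root"/}              # strip the root prefix
        cpu=${cpu%%/*}                 # keep only "cpuN"
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}
```

Running throttle_counts with no argument on a host prints one line per CPU; a counter that keeps growing between two runs would mean throttling is still going on.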

Event Timeline

dcaro triaged this task as Medium priority. Aug 18 2021, 2:23 PM
dcaro created this task.

The throttling events stopped on the 9th of August (last entry at 02:43:37; the list below is sorted alphabetically, so the Monday entries come first) and have not happened again. Each time, all CPUs complained, and an OK message followed within the same second:

root@cloudvirt1028:~# dmesg -T | grep 'Package temperature above' | cut -d] -f1 | sort | uniq | sort
[Mon Aug  9 02:37:37 2021
[Mon Aug  9 02:43:37 2021
[Sat Aug  7 18:52:00 2021
[Sat Aug  7 18:57:58 2021
[Sat Aug  7 19:15:10 2021
[Sat Aug  7 19:21:33 2021
[Sat Aug  7 20:01:09 2021
[Sat Aug  7 20:26:29 2021
[Sat Aug  7 20:55:56 2021
[Sat Aug  7 21:20:30 2021
[Sat Aug  7 21:32:59 2021
[Sat Aug  7 21:58:12 2021
[Sat Aug  7 22:03:12 2021
[Sat Aug  7 22:12:26 2021
[Sat Aug  7 22:23:02 2021
[Sat Aug  7 22:31:46 2021
[Sat Aug  7 22:58:59 2021
[Sat Aug  7 23:04:11 2021
[Sat Aug  7 23:14:59 2021
[Sat Aug  7 23:30:30 2021
[Sat Aug  7 23:45:17 2021
[Sat Aug  7 23:50:29 2021
[Sun Aug  8 03:21:42 2021
[Sun Aug  8 05:42:01 2021
[Sun Aug  8 06:40:13 2021
[Sun Aug  8 07:04:24 2021
[Sun Aug  8 07:36:09 2021
[Sun Aug  8 07:42:47 2021
[Sun Aug  8 07:51:42 2021
[Sun Aug  8 07:57:30 2021
[Sun Aug  8 08:03:39 2021
[Sun Aug  8 08:12:17 2021
[Sun Aug  8 08:30:35 2021
[Sun Aug  8 08:38:50 2021
[Sun Aug  8 08:59:13 2021
[Sun Aug  8 09:04:35 2021
[Sun Aug  8 09:12:13 2021
[Sun Aug  8 09:29:10 2021
[Sun Aug  8 09:34:53 2021
[Sun Aug  8 09:40:42 2021
[Sun Aug  8 09:50:15 2021
[Sun Aug  8 09:55:16 2021
[Sun Aug  8 10:00:16 2021
[Sun Aug  8 10:12:21 2021
[Sun Aug  8 10:21:29 2021
[Sun Aug  8 10:44:51 2021
[Sun Aug  8 11:25:05 2021
[Sun Aug  8 12:04:14 2021
[Sun Aug  8 12:56:53 2021
[Sun Aug  8 13:15:09 2021
[Sun Aug  8 13:20:31 2021
[Sun Aug  8 13:32:23 2021
[Sun Aug  8 13:37:30 2021
[Sun Aug  8 13:56:31 2021
[Sun Aug  8 14:02:34 2021
[Sun Aug  8 14:23:33 2021
[Sun Aug  8 14:32:23 2021
[Sun Aug  8 14:39:36 2021
[Sun Aug  8 14:44:46 2021
[Sun Aug  8 14:54:17 2021
[Sun Aug  8 15:55:52 2021
[Sun Aug  8 16:01:01 2021
[Sun Aug  8 16:25:46 2021
[Sun Aug  8 16:31:09 2021
[Sun Aug  8 16:37:15 2021
[Sun Aug  8 16:43:19 2021
[Sun Aug  8 17:00:11 2021
[Sun Aug  8 17:06:26 2021
[Sun Aug  8 17:57:00 2021
[Sun Aug  8 18:02:20 2021
[Sun Aug  8 18:07:20 2021
[Sun Aug  8 18:19:12 2021
[Sun Aug  8 18:25:52 2021
[Sun Aug  8 18:37:15 2021
[Sun Aug  8 18:53:55 2021
[Sun Aug  8 19:00:07 2021
[Sun Aug  8 19:13:09 2021
[Sun Aug  8 19:19:20 2021
[Sun Aug  8 19:25:24 2021
[Sun Aug  8 19:31:59 2021
[Sun Aug  8 19:49:41 2021
[Sun Aug  8 19:54:41 2021
[Sun Aug  8 20:07:34 2021
[Sun Aug  8 20:32:30 2021
[Sun Aug  8 20:45:16 2021
[Sun Aug  8 21:01:03 2021
[Sun Aug  8 22:01:13 2021
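Plain sort compares the weekday name first, so the list above is not in time order. A minimal sketch of a chronological sort for this timestamp layout, assuming a sort that supports month-name keys (-M, as GNU sort does); the function name chrono_sort is made up for illustration:

```shell
# Order "dmesg -T"-style timestamps chronologically instead of alphabetically.
# Expects lines such as "[Sat Aug  7 18:52:00 2021" on stdin, i.e. the output
# of `cut -d] -f1` in the pipeline above.
chrono_sort() {
    # Keys: year (field 5), month name (field 2), day (field 3), time (field 4).
    # -u drops duplicate timestamps, -b ignores the padding before single-digit days.
    LC_ALL=C sort -u -b -k5,5n -k2,2M -k3,3n -k4,4
}
```

It would replace the trailing `sort | uniq | sort`, for example: dmesg -T | grep 'Package temperature above' | cut -d] -f1 | chrono_sort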

Looking at other cloudvirts, this seems to be a common issue; I will gather some info.

dcaro updated the task description.

This is happening on:

dcaro@cumin1001:~$ sudo cumin cloudvirt1* 'dmesg -T | grep "Package temperature above" | tail | cut -d] -f1 | sort | uniq | sort'
34 hosts will be targeted:
cloudvirt[1012-1014,1016-1046].eqiad.wmnet
Ok to proceed on 34 hosts? Enter the number of affected hosts to confirm or "q" to quit 34
===== NODE GROUP =====
(1) cloudvirt1030.eqiad.wmnet
----- OUTPUT of 'dmesg -T | grep ...rt | uniq | sort' -----
[Tue Sep 28 11:40:19 2021
===== NODE GROUP =====
(1) cloudvirt1026.eqiad.wmnet
----- OUTPUT of 'dmesg -T | grep ...rt | uniq | sort' -----
[Sun Sep 26 12:18:11 2021
===== NODE GROUP =====
(1) cloudvirt1029.eqiad.wmnet
----- OUTPUT of 'dmesg -T | grep ...rt | uniq | sort' -----
[Mon Sep 27 20:54:27 2021
===== NODE GROUP =====
(1) cloudvirt1028.eqiad.wmnet
----- OUTPUT of 'dmesg -T | grep ...rt | uniq | sort' -----
[Thu Sep  9 05:11:48 2021
===== NODE GROUP =====
(1) cloudvirt1027.eqiad.wmnet
----- OUTPUT of 'dmesg -T | grep ...rt | uniq | sort' -----
[Tue Sep 28 12:29:19 2021
===== NODE GROUP =====
(1) cloudvirt1025.eqiad.wmnet
----- OUTPUT of 'dmesg -T | grep ...rt | uniq | sort' -----
[Tue Sep 28 15:01:25 2021
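The per-host blocks above can be flattened into one line per host for easier comparison. This is only a sketch that assumes the cumin output layout shown here (a "(N) host" line followed by the timestamp lines); the function name cumin_flatten is made up for illustration:

```shell
# Turn cumin's grouped output into "<host><TAB><timestamp>" lines.
# Reads the cumin output on stdin.
cumin_flatten() {
    awk '
        # Node line, e.g. "(1) cloudvirt1030.eqiad.wmnet": remember the host.
        # If a group holds several hosts, this is the whole group expression.
        /^\([0-9]+\) / { host = $2; next }
        # Timestamp line, e.g. "[Tue Sep 28 11:40:19 2021": drop the bracket.
        /^\[/ { sub(/^\[/, ""); print host "\t" $0 }
    '
}
```

Piping the saved cumin output through cumin_flatten gives one row per host and timestamp, which can then be sorted or grouped by rack.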

That is row C8 (cloudvirt1025/1026/1027) and D5 (cloudvirt1029/1030)

dcaro renamed this task from cloudvirt1028 overheating issues to cloudvirt1025-1030 overheating issues. Sep 28 2021, 4:27 PM
dcaro removed dcaro as the assignee of this task. Nov 22 2021, 1:38 PM