
labvirt1003 overheating
Closed, Resolved, Public

Description

Labvirt1003 is misbehaving tonight -- ganglia can't reach it and I can't ssh in. Notably, labs VMs running there seem basically happy; I stopped one that was gobbling CPU to see if that would let me start ssh, to no avail.

I don't see any evidence that OOM killer has run. But, dmesg is full of this:

[1835838.047410] CPU17: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047413] CPU13: Core temperature above threshold, cpu clock throttled (total events = 10510)
[1835838.047414] CPU37: Core temperature above threshold, cpu clock throttled (total events = 10360)
[1835838.047417] CPU18: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047419] CPU19: Package temperature above threshold, cpu clock throttled (total events = 10815)
[1835838.047421] CPU22: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047722] CPU38: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047725] CPU47: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047728] CPU39: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047729] CPU15: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047732] CPU40: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047734] CPU43: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047736] CPU44: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047738] CPU16: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047739] CPU14: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.047741] CPU42: Package temperature above threshold, cpu clock throttled (total events = 10815)
[1835838.048039] CPU21: Package temperature above threshold, cpu clock throttled (total events = 10814)
[1835838.048040] CPU20: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.048041] CPU46: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.048043] CPU45: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.048045] CPU41: Package temperature above threshold, cpu clock throttled (total events = 10815)
[1835838.048048] CPU36: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.048049] CPU12: Package temperature above threshold, cpu clock throttled (total events = 10814)
[1835838.048051] CPU23: Package temperature above threshold, cpu clock throttled (total events = 10816)
[1835838.048052] CPU13: Package temperature above threshold, cpu clock throttled (total events = 10715)
[1835838.048053] CPU37: Package temperature above threshold, cpu clock throttled (total events = 10599)
[1835838.048444] CPU37: Core temperature/speed normal
[1835838.048445] CPU13: Core temperature/speed normal
[1835838.048446] CPU20: Package temperature/speed normal
[1835838.048448] CPU42: Package temperature/speed normal
[1835838.048449] CPU12: Package temperature/speed normal
[1835838.048449] CPU16: Package temperature/speed normal
[1835838.048451] CPU46: Package temperature/speed normal
[1835838.048452] CPU14: Package temperature/speed normal
[1835838.048453] CPU21: Package temperature/speed normal
[1835838.048453] CPU45: Package temperature/speed normal
[1835838.048454] CPU36: Package temperature/speed normal
[1835838.048455] CPU40: Package temperature/speed normal
[1835838.048457] CPU23: Package temperature/speed normal
[1835838.048458] CPU19: Package temperature/speed normal
[1835838.048459] CPU44: Package temperature/speed normal
[1835838.048460] CPU18: Package temperature/speed normal
[1835838.048461] CPU41: Package temperature/speed normal
[1835838.048462] CPU38: Package temperature/speed normal
[1835838.048463] CPU22: Package temperature/speed normal
[1835838.048464] CPU47: Package temperature/speed normal
[1835838.048465] CPU43: Package temperature/speed normal
[1835838.048467] CPU37: Package temperature/speed normal
[1835838.048467] CPU13: Package temperature/speed normal
[1835838.048469] CPU39: Package temperature/speed normal
[1835838.048471] CPU15: Package temperature/speed normal
[1835839.153874] CPU17: Package temperature/speed normal
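A minimal sketch for condensing this kind of output, assuming the dmesg text has been captured to a string (for example over the mgmt console). The function name `summarize_throttling` is hypothetical, and the regular expression is written only against the message format pasted above:

```python
import re
from collections import defaultdict

# Matches lines such as:
# [1835838.047410] CPU17: Package temperature above threshold, cpu clock throttled (total events = 10816)
THROTTLE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+CPU(?P<cpu>\d+): (?P<scope>Core|Package) temperature "
    r"above threshold, cpu clock throttled \(total events = (?P<events>\d+)\)"
)


def summarize_throttling(dmesg_text):
    """Return {scope: {cpu: highest 'total events' counter seen}}.

    scope is "Core" or "Package", exactly as printed by the kernel.
    Lines that are not throttle messages (e.g. "speed normal") are skipped.
    """
    summary = defaultdict(dict)
    for line in dmesg_text.splitlines():
        match = THROTTLE_RE.search(line)
        if match is None:
            continue
        scope = match.group("scope")
        cpu = int(match.group("cpu"))
        events = int(match.group("events"))
        # Keep the largest counter per CPU; the kernel counter only grows.
        summary[scope][cpu] = max(events, summary[scope].get(cpu, 0))
    return {scope: dict(cpus) for scope, cpus in summary.items()}
```

Run against the excerpt above, the keys of the "Package" entry would be CPUs 12-23 and 36-47 only. That may point at a single socket, but the CPU-to-socket mapping on this host is an assumption that has not been checked here.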

Event Timeline

Andrew assigned this task to Cmjohnson.
Andrew raised the priority of this task to Needs Triage.
Andrew updated the task description.
Andrew added projects: Cloud-Services, ops-eqiad.
Andrew added a subscriber: Andrew.
Restricted Application added a subscriber: Aklapper.

There's a fair amount of other ugliness in dmesg, e.g.

[1843134.114144] INFO: task gmond:61831 blocked for more than 120 seconds.
[1843134.145729] Not tainted 3.13.0-49-generic #83-Ubuntu
[1843134.171780] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1843134.209079] gmond D ffff882f7fa134c0 0 61831 1 0x00000000
[1843134.209087] ffff880cf5435d70 0000000000000082 ffff88110c2a6000 ffff880cf5435fd8
[1843134.209097] 00000000000134c0 00000000000134c0 ffff88110c2a6000 ffffffff81cdc420
[1843134.209105] ffffffff81cdc424 ffff88110c2a6000 00000000ffffffff ffffffff81cdc428
[1843134.209113] Call Trace:
[1843134.209131] [<ffffffff81726469>] schedule_preempt_disabled+0x29/0x70
[1843134.209135] [<ffffffff817282d5>] mutex_lock_slowpath+0x135/0x1b0
[1843134.209139] [<ffffffff8172836f>] mutex_lock+0x1f/0x2f
[1843134.209148] [<ffffffff81633e15>] rtnl_lock+0x15/0x20
[1843134.209165] [<ffffffffa04a4bcf>] vlan_ioctl_handler+0x4f/0x4a0 [8021q]
[1843134.209176] [<ffffffff8131430b>] ? apparmor_sk_alloc_security+0x2b/0x60
[1843134.209186] [<ffffffff8160a845>] sock_ioctl+0x1b5/0x2c0
[1843134.209192] [<ffffffff811d1200>] do_vfs_ioctl+0x2e0/0x4c0
[1843134.209200] [<ffffffff811bfebe>] ? alloc_file+0x1e/0xf0
[1843134.209207] [<ffffffff811dbc07>] ? fd_install+0x47/0x60
[1843134.209211] [<ffffffff811d1461>] SyS_ioctl+0x81/0xa0
[1843134.209216] [<ffffffff8173263d>] system_call_fastpath+0x1a/0x1f
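Tasks reported by the hung-task detector, like gmond above, sit in uninterruptible sleep (state D). A rough sketch for listing every such process from /proc, assuming a shell is still reachable (for example via the mgmt console). The helper names `parse_stat` and `uninterruptible_tasks` are hypothetical:

```python
import os


def parse_stat(stat_text):
    """Parse the leading fields of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx].strip())
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes whose state is 'D'.

    Only the main thread of each process is inspected; per-thread state
    under /proc/<pid>/task/ is not walked in this sketch.
    """
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling `uninterruptible_tasks()` on the affected host would show at a glance whether sshd and the monitoring daemons are all stuck in the same state.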

I'm guessing it's all related -- maybe a busted fan?

I'm going to leave the system up for now, since we might as well minimize the labs outage. I can't imagine this isn't going to require a dc visit though :(

Andrew triaged this task as Unbreak Now! priority. May 16 2015, 4:20 AM
Andrew set Security to None.

Oh, btw, sshd and ganglia-monitor are comatose on that system for reasons that are unclear to me. The mgmt console is working fine.

@Andrew, it is not clear to me why leaving this up would have "minimized the labs outage". You've basically left a completely broken system (and a UBN! ticket) open to be dealt with over the weekend, and I don't agree with this choice.

This server has entered a very bad state for some reason, and it seems that sockets get opened but sending data to them gets stuck forever.

Since all the nrpe checks were hanging (all processes that require network access are in an uninterruptible sleep state), I tried an "strace ip link show", and that hung when trying to send data to a socket:

socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, 0) = 3
setsockopt(3, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1048576], 4) = 0
bind(3, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(3, {sa_family=AF_NETLINK, pid=2648, groups=00000000}, [12]) = 0
sendto(3, " \0\0\0\20\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32, 0, NULL, 0

This confirms there is some problem with the network stack, but it's not that helpful for further debugging.
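For what it's worth, the first 16 bytes of the sendto payload in the strace output can be decoded as a netlink message header. The sketch below assumes a little-endian host and the `struct nlmsghdr` layout from `<linux/netlink.h>`; the `PAYLOAD` constant is transcribed by hand from the strace line (strace prints octal escapes, so `\20` is 0x10 and the leading space is 0x20), and `decode_nlmsghdr` is a hypothetical helper name:

```python
import struct

# struct nlmsghdr: u32 nlmsg_len, u16 nlmsg_type, u16 nlmsg_flags,
#                  u32 nlmsg_seq, u32 nlmsg_pid
# "<" assumes the capture came from a little-endian machine.
NLMSGHDR = struct.Struct("<IHHII")

# Flag bits and rtnetlink message types from the Linux UAPI headers.
NLM_F_FLAGS = {0x01: "NLM_F_REQUEST", 0x02: "NLM_F_MULTI",
               0x04: "NLM_F_ACK", 0x08: "NLM_F_ECHO"}
RTM_TYPES = {16: "RTM_NEWLINK", 17: "RTM_DELLINK", 18: "RTM_GETLINK"}

# The 32 bytes shown in the strace output, transcribed by hand.
PAYLOAD = bytes([0x20, 0, 0, 0, 0x10, 0, 0x05, 0]) + bytes(24)


def decode_nlmsghdr(payload):
    """Decode the leading netlink header of a raw message."""
    length, msg_type, flags, seq, pid = NLMSGHDR.unpack_from(payload, 0)
    return {
        "length": length,
        "type": RTM_TYPES.get(msg_type, str(msg_type)),
        "flags": [name for bit, name in sorted(NLM_F_FLAGS.items()) if flags & bit],
        "seq": seq,
        "pid": pid,
    }
```

Here `decode_nlmsghdr(PAYLOAD)` reports a 32-byte RTM_NEWLINK message with NLM_F_REQUEST and NLM_F_ACK set. An rtnetlink request stalling in sendto would be consistent with the gmond trace blocked in rtnl_lock earlier in this task, though that connection is an inference and not something the strace output proves.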

The only reason I'm not rebooting this machine is that Andrew implied it would mean downtime for labs, but I don't really see an alternative to a hard power cycle for now.

According to the ILO, both the fans and the temperature sensors report OK/good health, so I doubt it's actually a matter of overheating.

The labs instances on the box seem to be working fine fwiw.