Kraz (irc.wikimedia.org) has been flapping on IRC most of the day
Closed, Resolved, Public

Description

https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&c=Miscellaneous+codfw&h=kraz.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS

[09:02:16] <icinga-wm> PROBLEM - configured eth on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:17] <icinga-wm> PROBLEM - DPKG on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:37] <icinga-wm> PROBLEM - dhclient process on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:56] <icinga-wm> PROBLEM - RAID on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:03:18] <icinga-wm> PROBLEM - puppet last run on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:03:36] <icinga-wm> PROBLEM - salt-minion processes on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:03:38] <icinga-wm> PROBLEM - Check size of conntrack table on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:03:48] <icinga-wm> PROBLEM - Disk space on kraz is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:06:06] <Krenair> ^ weird, but ircd is still running
[09:06:40] <Krenair> and the rc messages going through..
[10:24:57] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[10:26:57] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[10:40:56] <icinga-wm> PROBLEM - NTP on kraz is CRITICAL: NTP CRITICAL: No response from NTP server
[10:49:16] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[10:53:16] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:15:37] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[11:19:28] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:35:56] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[11:43:48] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[11:49:48] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[11:53:47] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[12:06:12] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[12:08:07] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[12:36:16] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[12:48:07] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[13:08:17] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[13:12:17] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[13:24:07] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[13:26:07] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[13:36:17] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[13:40:17] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:02:17] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[14:08:17] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:14:09] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[14:16:16] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:22:08] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[14:36:18] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:42:18] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
[14:46:18] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)
[14:52:17] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
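To put a number on the flapping, the PROBLEM/RECOVERY pairs in a log like the one above can be matched up per check. This is only a rough sketch: the regex is guessed from the line format pasted here, `outage_windows` and `SAMPLE` are hypothetical names, and it assumes all timestamps fall on the same day.

```python
import re
from datetime import datetime, timedelta

# Line format assumed from the paste above, e.g.
# [10:24:57] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] <icinga-wm> (PROBLEM|RECOVERY) - (.+?) on (\S+) is "
)


def outage_windows(lines, check="SSH", host="kraz"):
    """Pair each PROBLEM with the next RECOVERY for one check on one host.

    Returns (windows, still_open) where windows is a list of
    (start_time, end_time, duration) and still_open is the start time of
    a PROBLEM that never recovered within the log, or None.
    """
    windows = []
    started = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # human chatter or an unrelated bot line
        stamp, state, name, where = m.groups()
        if name != check or where != host:
            continue
        t = datetime.strptime(stamp, "%H:%M:%S")
        if state == "PROBLEM" and started is None:
            started = t
        elif state == "RECOVERY" and started is not None:
            windows.append((started.time(), t.time(), t - started))
            started = None
    return windows, (started.time() if started else None)


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "[10:24:57] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer",
    "[10:26:57] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)",
    "[10:40:56] <icinga-wm> PROBLEM - NTP on kraz is CRITICAL: NTP CRITICAL: No response from NTP server",
    "[10:49:16] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer",
    "[10:53:16] <icinga-wm> RECOVERY - SSH on kraz is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0)",
    "[14:52:17] <icinga-wm> PROBLEM - SSH on kraz is CRITICAL: Server answer",
]

if __name__ == "__main__":
    windows, still_open = outage_windows(SAMPLE)
    for start, end, duration in windows:
        print(f"SSH down {start} -> {end} ({duration})")
    total = sum((w[2] for w in windows), timedelta())
    print(f"total: {total}, unrecovered since: {still_open}")
```

Feeding it the full paste instead of `SAMPLE` would list every SSH outage window; the last PROBLEM has no RECOVERY in the log, so it is reported separately rather than counted.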

kraz.PNG (1×3 px, 565 KB)

Event Timeline

These were effects of T134242.

It's not happening anymore since the VM got restarted.

I'd consider it merged into the above and more or less a duplicate.

Peachey88 assigned this task to jcrespo.

<jynus> !log trying to restart kraz and planet2001 (both service and console unresponsive)

Marking resolved.