Times in UTC:
[04:39:39] <+icinga-wm> PROBLEM - SSH on analytics1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:43:15] <+icinga-wm> PROBLEM - Host analytics1060 is DOWN: PING CRITICAL - Packet loss = 100%
The HW logs do not show anything. The serial console shows this:
[27380431.549535] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [kworker/u50:1:6317]
[27380431.569533] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]
Password:
Login incorrect
analytics1060 login: [27380459.390641] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [DataXceiver for:9936]
[27380459.546624] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [kworker/u50:1:6317]
[27380459.566621] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]
[27380469.953542] INFO: rcu_sched self-detected stall on CPU
[27380469.959588] 9-...: (7294726 ticks this GP) idle=fc5/140000000000001/0 softirq=1527227977/1527227977 fqs=3647127
[27380469.971329] (t=7298889 jiffies g=1415222571 c=1415222570 q=3323916)
[27380487.387730] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [DataXceiver for:9936]
[27380487.563709] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]
[27380495.542880] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/u50:1:6317]
[27380515.384818] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [DataXceiver for:9936]
[27380515.560797] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]
Logi
Debian GNU/Linux 9 analytics1060 ttyS1
analytics1060 login: [27380523.539968] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/u50:1:6317]
[27380532.958988] INFO: rcu_sched self-detected stall on CPU
[27380532.965034] 9-...: (7310470 ticks this GP) idle=fc5/140000000000001/0 softirq=1527227977/1527227977 fqs=3654999
[27380532.976775] (t=7314642 jiffies g=1415222571 c=1415222570 q=3326535)
[27380543.381906] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [DataXceiver for:9936]
[27380543.557885] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]
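To get a quick tally of which CPUs and tasks keep reporting soft lockups, something like the following could be run over a saved copy of the console output. This is only a rough sketch: the function name `summarize_soft_lockups` is made up for illustration, and it assumes the console text has been captured into a string. It is not part of any existing tooling.

```python
import re
from collections import Counter

# Matches kernel soft-lockup reports of the form seen on the console, e.g.
#   NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [kworker/u50:1:6317]
# The bracketed part is "<task name>:<pid>"; the task name may itself contain colons.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(console_text):
    """Count soft-lockup reports per (cpu, task, pid) in raw console text.

    Hypothetical helper: it scans the whole text rather than line by line,
    so it also works when login prompts are interleaved with kernel messages.
    """
    counts = Counter()
    for match in SOFT_LOCKUP.finditer(console_text):
        key = (int(match["cpu"]), match["task"], int(match["pid"]))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < console.log
    for (cpu, task, pid), n in sorted(summarize_soft_lockups(sys.stdin.read()).items()):
        print(f"CPU#{cpu} {task} (pid {pid}): {n} report(s)")
```

On the excerpt above this would group the reports under three entries: CPU#0 (DataXceiver for, pid 9936), CPU#9 (kworker/u50:1, pid 6317) and CPU#12 (DiskHealthMonit, pid 12066), which makes it easier to see that the same three tasks are stuck the whole time.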
I logged in to the system as root, but the host was almost unusable.
I ran w, but it never returned.
I guess it needs a hard reset?