Analytics1060 unresponsive
Closed, Resolved (Public)

Description

Times in UTC:

[04:39:39]  <+icinga-wm>	PROBLEM - SSH on analytics1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:43:15]  <+icinga-wm>	PROBLEM - Host analytics1060 is DOWN: PING CRITICAL - Packet loss = 100%

The hardware logs do not show anything. The serial console shows the following:

[27380431.549535] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [kworker/u50:1:6317]
[27380431.569533] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]

Password:


Login incorrect
analytics1060 login: [27380459.390641] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [DataXceiver for:9936]
[27380459.546624] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [kworker/u50:1:6317]
[27380459.566621] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]
[27380469.953542] INFO: rcu_sched self-detected stall on CPU
[27380469.959588] 	9-...: (7294726 ticks this GP) idle=fc5/140000000000001/0 softirq=1527227977/1527227977 fqs=3647127
[27380469.971329] 	 (t=7298889 jiffies g=1415222571 c=1415222570 q=3323916)
[27380487.387730] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [DataXceiver for:9936]
[27380487.563709] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]
[27380495.542880] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/u50:1:6317]
[27380515.384818] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [DataXceiver for:9936]
[27380515.560797] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]

Logi
Debian GNU/Linux 9 analytics1060 ttyS1

analytics1060 login: [27380523.539968] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/u50:1:6317]
[27380532.958988] INFO: rcu_sched self-detected stall on CPU
[27380532.965034] 	9-...: (7310470 ticks this GP) idle=fc5/140000000000001/0 softirq=1527227977/1527227977 fqs=3654999
[27380532.976775] 	 (t=7314642 jiffies g=1415222571 c=1415222570 q=3326535)
[27380543.381906] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [DataXceiver for:9936]
[27380543.557885] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]
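The console output above can be summarized mechanically. The following is a minimal sketch, not a tool used on this task: the function names (summarize_lockups, estimate_hz) and the SAMPLE excerpt are hypothetical helpers for illustration. It tallies soft lockup reports per CPU and task, and estimates the kernel tick rate by comparing the "t=... jiffies" values of two rcu_sched stall reports against their timestamps, assuming that "t=" counts jiffies since the current grace period started.

```python
import re
from collections import Counter

# Matches lines such as:
# [27380431.549535] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [kworker/u50:1:6317]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)

# Matches the second continuation line of an rcu_sched stall report:
# [27380469.971329]   (t=7298889 jiffies g=... c=... q=...)
RCU_T = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\]\s+\(t=(?P<jiffies>\d+) jiffies")

# Short excerpt of the console log from this task (hypothetical test input).
SAMPLE = """\
[27380459.390641] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [DataXceiver for:9936]
[27380459.546624] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [kworker/u50:1:6317]
[27380459.566621] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DiskHealthMonit:12066]
[27380469.971329] \t (t=7298889 jiffies g=1415222571 c=1415222570 q=3323916)
[27380487.387730] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [DataXceiver for:9936]
[27380532.976775] \t (t=7314642 jiffies g=1415222571 c=1415222570 q=3326535)
"""


def summarize_lockups(text):
    """Count soft lockup reports per (cpu, task name, pid)."""
    counts = Counter()
    for m in SOFT_LOCKUP.finditer(text):
        counts[(int(m["cpu"]), m["task"], int(m["pid"]))] += 1
    return counts


def estimate_hz(text):
    """Estimate jiffies per second from the first and last stall reports.

    Returns None when fewer than two stall reports are present.
    """
    samples = [(float(m["ts"]), int(m["jiffies"])) for m in RCU_T.finditer(text)]
    if len(samples) < 2:
        return None
    (t0, j0), (t1, j1) = samples[0], samples[-1]
    return (j1 - j0) / (t1 - t0)


if __name__ == "__main__":
    for (cpu, task, pid), n in sorted(summarize_lockups(SAMPLE).items()):
        print(f"CPU#{cpu} {task} (pid {pid}): {n} report(s)")
    hz = estimate_hz(SAMPLE)
    if hz is not None:
        print(f"estimated tick rate: {hz:.1f} jiffies/s")
```

On the two stall reports in the log, the jiffies counter advances by 15753 over about 63 seconds, which suggests a tick rate close to 250. Under the assumption stated above, t=7298889 jiffies would mean that CPU 9 had already been stalled for roughly 8 hours when the first report shown here was printed.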

I logged in to the system as root, but the host was almost unusable.
I ran w, but it never returned.

I guess it needs a hard reset?

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-05-06T06:00:01Z] <elukey> powercycle analytics1060 - host stuck - T251973

colewhite triaged this task as Medium priority. May 6 2020, 3:26 PM
elukey claimed this task.