cp4024 kernel errors
Open, NormalPublic

Description

cp4024's varnish backend alarmed at around 17:28 UTC and several kernel errors were present in the dmesg, with this one popping up in the tty:

Message from syslogd@cp4024 at Sep  3 17:56:06 ...
 kernel:[30568.739024] NMI watchdog: BUG: soft lockup - CPU#41 stuck for 22s! [varnishd:2065]

Before doing anything I depooled it and then tried to restart Varnish with much success. Didn't try to attempt any reboot or similar to allow the Traffic team to better triage it.

elukey created this task.Sun, Sep 3, 6:00 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSun, Sep 3, 6:00 PM
elukey updated the task description. (Show Details)Sun, Sep 3, 6:08 PM
elukey added a comment.Sun, Sep 3, 6:15 PM

elukey@cp4024:~$ top
top - 18:15:02 up  8:48,  2 users,  load average: 771.32, 750.69, 538.32
Tasks: 542 total,   4 running, 538 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.8 us,  1.2 sy,  0.0 ni, 95.5 id,  0.7 wa,  0.0 hi,  0.8 si,  0.0 st
KiB Mem:  39618131+total, 26834315+used, 12783816+free,   173648 buffers
KiB Swap:        0 total,        0 used,        0 free. 13084851+cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1898 vcache    20   0  1.429t 0.123t 0.120t D 100.9 33.2 205:58.32 varnishd
 1328 vcache    20   0  0.132t 0.115t  86388 S   6.3 31.3 210:04.84 varnishd
    1 root      20   0  176540   5452   3032 S   0.0  0.0   0:15.81 systemd
ema triaged this task as Normal priority.Mon, Sep 4, 7:43 AM
ema moved this task from Triage to Caching on the Traffic board.
ema added a subscriber: ema.Mon, Sep 4, 7:46 AM

Thanks @elukey! Yeah cp4024 might be having hardware issues. The system was down yesterday at 9ish AM UTC. I've power-cycled it and it came back online fine, but then after some hours it started with the lockups mentioned in this task description.

ema added a comment.EditedTue, Sep 5, 9:49 AM

On September 1st:

Sep  1 16:21:57 cp4024 kernel: [ 5179.145078] BUG: unable to handle kernel paging request at ffffffffba8fb2e3
Sep  1 16:21:57 cp4024 kernel: [ 5179.152873] IP: [<ffffffffba8fb2e3>] 0xffffffffba8fb2e3
Sep  1 16:21:57 cp4024 kernel: [ 5179.158713] PGD 78c60a067 
Sep  1 16:21:57 cp4024 kernel: [ 5179.161536] PUD 78c60b063 
Sep  1 16:21:57 cp4024 kernel: [ 5179.164558] PMD 0 
Sep  1 16:21:57 cp4024 kernel: [ 5179.165151] 
Sep  1 16:21:57 cp4024 kernel: [ 5179.166817] Oops: 0010 [#1] SMP
Sep  1 16:21:57 cp4024 kernel: [ 5179.170320] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport cpufreq_userspace cpufreq_conservative cpufreq_powersave drbg ansi_cprng seqiv xfrm4_mode_transport 8021q garp mrp stp llc xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 ttm drm_kms_helper kvm drm iTCO_wdt mxm_wmi evdev dcdbas i2c_algo_bit irqbypass iTCO_vendor_support mei_me crct10dif_pclmul crc32_pclmul lpc_ich ghash_clmulni_intel pcspkr mei shpchp mfd_core ipmi_si wmi button tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto mbcache raid1 md_mod sg sd_mod ahci libahci bnx2x aesni_intel aes_x86_64 ehci_pci ptp glue_helper lrw gf128mul ehci_hcd ablk_helper cryptd libata pps_core mdio usbcore
Sep  1 16:21:57 cp4024 kernel: [ 5179.249705]  libcrc32c crc32c_generic scsi_mod usb_common crc32c_intel
Sep  1 16:21:57 cp4024 kernel: [ 5179.255581] CPU: 23 PID: 2406 Comm: varnishd Not tainted 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.30-2+deb9u2~bpo8+1
Sep  1 16:21:57 cp4024 kernel: [ 5179.266645] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
Sep  1 16:21:57 cp4024 kernel: [ 5179.274995] task: ffff9e03bf8d8000 task.stack: ffffbaadf6b10000
Sep  1 16:21:57 cp4024 kernel: [ 5179.281599] RIP: 0010:[<ffffffffba8fb2e3>]  [<ffffffffba8fb2e3>] 0xffffffffba8fb2e3
Sep  1 16:21:57 cp4024 kernel: [ 5179.290155] RSP: 0018:ffffbaadf6b13ee8  EFLAGS: 00010292
Sep  1 16:21:57 cp4024 kernel: [ 5179.296079] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Sep  1 16:21:57 cp4024 kernel: [ 5179.304040] RDX: 0000000080000000 RSI: ffffbaadf6b13d90 RDI: ffff9e03d3886400
Sep  1 16:21:57 cp4024 kernel: [ 5179.312002] RBP: 0000000000000080 R08: 0000000000000000 R09: 0000000000100000
Sep  1 16:21:57 cp4024 kernel: [ 5179.319963] R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000000
Sep  1 16:21:57 cp4024 kernel: [ 5179.327922] R13: 00007e6540e65034 R14: 00000000000001b1 R15: 00000000000000d8
Sep  1 16:21:57 cp4024 kernel: [ 5179.335884] FS:  00007e655716a700(0000) GS:ffff9e03ff2c0000(0000) knlGS:0000000000000000
Sep  1 16:21:57 cp4024 kernel: [ 5179.344912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  1 16:21:57 cp4024 kernel: [ 5179.351323] CR2: ffffffffba8fb2e3 CR3: 0000005fe2ccd000 CR4: 00000000003406e0
Sep  1 16:21:57 cp4024 kernel: [ 5179.359283] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep  1 16:21:57 cp4024 kernel: [ 5179.367245] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep  1 16:21:57 cp4024 kernel: [ 5179.375206] Stack:
Sep  1 16:21:57 cp4024 kernel: [ 5179.377448]  00000000ffffffff 0000000000000000 00007e6526263020 ffff9e03e35d5700
Sep  1 16:21:57 cp4024 kernel: [ 5179.385738]  ffffffffbac03c22 0000000000000586 4e70677c98b8eacf 00007e65259ea2e0
Sep  1 16:21:57 cp4024 kernel: [ 5179.394025]  00007e655716a700 00007e6540e65030 00007e655716a658 00007e6540e65030
Sep  1 16:21:57 cp4024 kernel: [ 5179.402315] Call Trace:
Sep  1 16:21:57 cp4024 kernel: [ 5179.405048]  [<ffffffffbac03c22>] ? do_writev+0x82/0xf0
Sep  1 16:21:57 cp4024 kernel: [ 5179.410881]  [<ffffffffbb0077bb>] ? system_call_fast_compare_end+0xc/0x9b
Sep  1 16:21:57 cp4024 kernel: [ 5179.418446] Code:  Bad RIP value.
Sep  1 16:21:57 cp4024 kernel: [ 5179.422177] RIP  [<ffffffffba8fb2e3>] 0xffffffffba8fb2e3
Sep  1 16:21:57 cp4024 kernel: [ 5179.428112]  RSP <ffffbaadf6b13ee8>
Sep  1 16:21:57 cp4024 kernel: [ 5179.432001] CR2: ffffffffba8fb2e3
Sep  1 16:21:57 cp4024 kernel: [ 5179.439026] ---[ end trace 27ae728dcab70bcf ]---

On September 3rd, the system was rebooted at 09:26 and crashed a few hours later:

Sep  3 17:24:23 cp4024 kernel: [28665.794503] invalid opcode: 0000 [#1] SMP
Sep  3 17:24:23 cp4024 kernel: [28665.798981] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport 8021q garp mrp stp llc drbg ansi_cprng seqiv xfrm4_mode_transport cpufreq_userspace cpufreq_conservative cpufreq_powersave xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp mgag200 coretemp ttm iTCO_wdt iTCO_vendor_support mxm_wmi evdev drm_kms_helper kvm irqbypass drm crct10dif_pclmul crc32_pclmul dcdbas mei_me ghash_clmulni_intel pcspkr i2c_algo_bit lpc_ich mei mfd_core shpchp ipmi_si wmi button tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto mbcache raid1 md_mod sg sd_mod ahci libahci aesni_intel aes_x86_64 ehci_pci bnx2x glue_helper lrw gf128mul libata ehci_hcd ablk_helper ptp cryptd pps_core mdio libcrc32c
Sep  3 17:24:23 cp4024 kernel: [28665.878587]  usbcore crc32c_generic scsi_mod usb_common crc32c_intel
Sep  3 17:24:23 cp4024 kernel: [28665.884274] CPU: 47 PID: 2237 Comm: varnishd Not tainted 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.30-2+deb9u2~bpo8+1
Sep  3 17:24:23 cp4024 kernel: [28665.895340] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
Sep  3 17:24:23 cp4024 kernel: [28665.903690] task: ffff9684390a0140 task.stack: ffffb2e8f644c000
Sep  3 17:24:23 cp4024 kernel: [28665.910293] RIP: 0010:[<ffffffff97c19fe7>]  [<ffffffff97c19fe7>] __xen_pgd_pin+0x117/0x2f0
Sep  3 17:24:23 cp4024 kernel: [28665.919540] RSP: 0018:ffffb2e8f644ff28  EFLAGS: 00010246
Sep  3 17:24:23 cp4024 kernel: [28665.925467] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00007fd157fe22e8
Sep  3 17:24:23 cp4024 kernel: [28665.933429] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
Sep  3 17:24:23 cp4024 kernel: [28665.941390] RBP: 00007fd157fe22e0 R08: 0000000000000000 R09: ffffb2e8f644fb7c
Sep  3 17:24:23 cp4024 kernel: [28665.949352] R10: ffff968462508038 R11: ffff968460f313c0 R12: 0000000000005fff
Sep  3 17:24:23 cp4024 kernel: [28665.957313] R13: 00007e63d240a000 R14: 00007e63d2410000 R15: 00007e63d2408128
Sep  3 17:24:23 cp4024 kernel: [28665.965274] FS:  00007fd157fe3700(0000) GS:ffff96847f5c0000(0000) knlGS:0000000000000000
Sep  3 17:24:23 cp4024 kernel: [28665.974302] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  3 17:24:23 cp4024 kernel: [28665.980714] CR2: 00007fc2e5f73000 CR3: 0000005fe1cd3000 CR4: 00000000003406e0
Sep  3 17:24:23 cp4024 kernel: [28665.988675] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep  3 17:24:23 cp4024 kernel: [28665.996636] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep  3 17:24:23 cp4024 kernel: [28666.004597] Stack:
Sep  3 17:24:23 cp4024 kernel: [28666.006838]  0000000000006ff9 00000000049fad52 736057fd296f8e12 0000000000000074
Sep  3 17:24:23 cp4024 kernel: [28666.015128]  00007e63d240a000 ffffffff982077bb 00007e63d2408020 00007fd157fe2c60
Sep  3 17:24:23 cp4024 kernel: [28666.023417]  00007e6427d46ac0 0000000000000002 00007fd157fe21c0 00007fd157fe2c60
Sep  3 17:24:23 cp4024 kernel: [28666.031708] Call Trace:
Sep  3 17:24:23 cp4024 kernel: [28666.034437]  [<ffffffff982077bb>] ? system_call_fast_compare_end+0xc/0x9b
Sep  3 17:24:23 cp4024 kernel: [28666.042010] Code: 1f 44 00 00 e8 9b fa 03 00 e8 16 bd ff ff 65 48 8b 3d ee 19 3f 68 57 9d 0f 1f 44 00 00 5b 5d 41 5c 41 5d 41 5e c3 65 8b 05 27 02 <3f> 68 89 c0 48 0f a3 05 cd a8 d0 00 0f 92 c0 84 c0 0f 84 08 ff 
Sep  3 17:24:23 cp4024 kernel: [28666.063684] RIP  [<ffffffff97c19fe7>] __xen_pgd_pin+0x117/0x2f0
Sep  3 17:24:23 cp4024 kernel: [28666.070302]  RSP <ffffb2e8f644ff28>
Sep  3 17:24:23 cp4024 kernel: [28666.077440] ---[ end trace 792d69cdb7466d2e ]---
Sep  3 17:24:44 cp4024 kernel: [28687.127593] INFO: rcu_sched self-detected stall on CPU
Sep  3 17:24:44 cp4024 kernel: [28687.131595] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep  3 17:24:44 cp4024 kernel: [28687.131599]   41-...: (5250 ticks this GP) idle=cf5/140000000000001/0 softirq=2070809/2070809 fqs=2625 
Sep  3 17:24:44 cp4024 kernel: [28687.131600]   
Sep  3 17:24:44 cp4024 kernel: [28687.131603] (detected by 31, t=5252 jiffies, g=2170205, c=2170204, q=188630)
Sep  3 17:24:44 cp4024 kernel: [28687.131604] Task dump for CPU 41:
Sep  3 17:24:44 cp4024 kernel: [28687.131605] varnishd        R
Sep  3 17:24:44 cp4024 kernel: [28687.131605]   running task        0  2065   1888 0x00000008
Sep  3 17:24:44 cp4024 kernel: ffffffff981515bd
Sep  3 17:24:44 cp4024 kernel: [28687.131608]  01ff9684614f90c0 fc38c2bf6c515e59 0000000000000000 ffff968444355a80
Sep  3 17:24:44 cp4024 kernel: [28687.131610]  ffffffff980e42e1 ffffb2e8f5f07df8 ffffb2e8f5f07e18 0000000000000001
Sep  3 17:24:44 cp4024 kernel: [28687.131611]  ffffb2e8f5f07ea8 ffffffff97e4b514 00007fd15a9b5edcCall Trace:
Sep  3 17:24:44 cp4024 kernel: [28687.131618]  [<ffffffff981515bd>] ? tcp_poll+0x23d/0x250
Sep  3 17:24:44 cp4024 kernel: [28687.131623]  [<ffffffff980e42e1>] ? sock_poll+0x41/0x90
Sep  3 17:24:44 cp4024 kernel: [28687.131626]  [<ffffffff97e4b514>] ? ep_send_events_proc+0x134/0x190
Sep  3 17:24:44 cp4024 kernel: [28687.131628]  [<ffffffff97e21ecc>] ? __fget+0x3c/0x90
Sep  3 17:24:44 cp4024 kernel: [28687.131629]  [<ffffffff97e22191>] ? __fget_light+0x21/0x50
Sep  3 17:24:44 cp4024 kernel: [28687.131630]  [<ffffffff97e4c63e>] ? SyS_epoll_ctl+0x7e/0xdb0
Sep  3 17:24:44 cp4024 kernel: [28687.131634]  [<ffffffff97ca2be0>] ? wake_up_q+0x60/0x60
Sep  3 17:24:44 cp4024 kernel: [28687.131637]  [<ffffffff982077bb>] ? system_call_fast_compare_end+0xc/0x9b
Sep  3 17:24:44 cp4024 kernel: [28687.248710]   41-...: (5250 ticks this GP) idle=cf5/140000000000001/0 softirq=2070809/2070809 fqs=2641 
Sep  3 17:24:44 cp4024 kernel: [28687.259096]    (t=5284 jiffies g=2170205 c=2170204 q=189573)
Sep  3 17:24:44 cp4024 kernel: [28687.265324] Task dump for CPU 41:
Sep  3 17:24:44 cp4024 kernel: [28687.269019] varnishd        R  running task        0  2065   1888 0x00000008
Sep  3 17:24:44 cp4024 kernel: [28687.276904]  0000000000000087 ffffffff97ca4fa3 0000000000000029 0000000000000087
Sep  3 17:24:44 cp4024 kernel: [28687.285193]  ffffffff97d7c521 ffff96847f519540 ffffffff9884e3c0 0000000000000000
Sep  3 17:24:44 cp4024 kernel: [28687.293482]  ffff9684614f90c0 ffffffff97ce147e ffff96847f5187c0 00000000000187c0
Sep  3 17:24:44 cp4024 kernel: [28687.301771] Call Trace:
Sep  3 17:24:44 cp4024 kernel: [28687.304498]  <IRQ> 
Sep  3 17:24:44 cp4024 kernel: [28687.306644]  [<ffffffff97ca4fa3>] ? sched_show_task+0xd3/0x140
Sep  3 17:24:44 cp4024 kernel: [28687.313160]  [<ffffffff97d7c521>] ? rcu_dump_cpu_stacks+0x7b/0xb4
Sep  3 17:24:44 cp4024 kernel: [28687.319960]  [<ffffffff97ce147e>] ? rcu_check_callbacks+0x6be/0x800
Sep  3 17:24:44 cp4024 kernel: [28687.326954]  [<ffffffff97cf7360>] ? tick_sched_do_timer+0x30/0x30
Sep  3 17:24:44 cp4024 kernel: [28687.333755]  [<ffffffff97ce7e98>] ? update_process_times+0x28/0x50
Sep  3 17:24:44 cp4024 kernel: [28687.340650]  [<ffffffff97cf6d70>] ? tick_sched_handle.isra.13+0x20/0x50
Sep  3 17:24:44 cp4024 kernel: [28687.348030]  [<ffffffff97cf7398>] ? tick_sched_timer+0x38/0x70
Sep  3 17:24:44 cp4024 kernel: [28687.354538]  [<ffffffff97ce8b4c>] ? __hrtimer_run_queues+0xdc/0x250
Sep  3 17:24:44 cp4024 kernel: [28687.361530]  [<ffffffff97ce8f99>] ? hrtimer_interrupt+0x99/0x190
Sep  3 17:24:44 cp4024 kernel: [28687.368233]  [<ffffffff9820a0e9>] ? smp_apic_timer_interrupt+0x39/0x50
Sep  3 17:24:44 cp4024 kernel: [28687.375517]  [<ffffffff98209402>] ? apic_timer_interrupt+0x82/0x90
Sep  3 17:24:44 cp4024 kernel: [28687.382412]  <EOI> 
Sep  3 17:24:44 cp4024 kernel: [28687.384557]  [<ffffffff97e21ecc>] ? __fget+0x3c/0x90
Sep  3 17:24:44 cp4024 kernel: [28687.390102]  [<ffffffff97e22191>] ? __fget_light+0x21/0x50
Sep  3 17:24:44 cp4024 kernel: [28687.396223]  [<ffffffff97e4c63e>] ? SyS_epoll_ctl+0x7e/0xdb0
Sep  3 17:24:44 cp4024 kernel: [28687.402537]  [<ffffffff97ca2be0>] ? wake_up_q+0x60/0x60
Sep  3 17:24:44 cp4024 kernel: [28687.408366]  [<ffffffff982077bb>] ? system_call_fast_compare_end+0xc/0x9b

Mentioned in SAL (#wikimedia-operations) [2017-09-05T10:11:53Z] <ema> reboot cp4024 T174891

BBlack added a subscriber: BBlack.Tue, Sep 5, 4:43 PM

I suspect this is a hardware failure, but we should do some more software testing on the other nodes first to confirm this isn't buggy / incorrect behavior being triggered by the (somewhat crazy) NUMA-isolation config on these nodes.

So far, other nodes are testing ok on this front. This is likely a node-specific early hardware failure.