cp4024 kernel errors
Closed, Resolved (Public)

Description

cp4024's varnish backend alarmed at around 17:28 UTC and several kernel errors were present in the dmesg, with this one popping up in the tty:

Message from syslogd@cp4024 at Sep  3 17:56:06 ...
 kernel:[30568.739024] NMI watchdog: BUG: soft lockup - CPU#41 stuck for 22s! [varnishd:2065]

Before doing anything else I depooled it and then tried to restart Varnish, without much success. I didn't attempt a reboot or similar, to let the Traffic team triage it properly.
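For quick triage on a box in this state, a hedged one-liner can count how many lockup/stall events the kernel has logged; the regexes below are my guess based on the messages quoted in this task, not anything WMF-specific.

```shell
# Count soft-lockup and RCU-stall messages in the kernel ring buffer.
# The patterns are an assumption based on the messages quoted in this task.
lockups() { grep -Ec 'soft lockup|rcu_sched.*detected stall'; }

dmesg 2>/dev/null | lockups || true
```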

Related Objects

elukey created this task.Sep 3 2017, 6:00 PM
Restricted Application added a subscriber: Aklapper. Sep 3 2017, 6:00 PM
elukey updated the task description. Sep 3 2017, 6:08 PM
elukey added a comment.Sep 3 2017, 6:15 PM

elukey@cp4024:~$ top
top - 18:15:02 up  8:48,  2 users,  load average: 771.32, 750.69, 538.32
Tasks: 542 total,   4 running, 538 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.8 us,  1.2 sy,  0.0 ni, 95.5 id,  0.7 wa,  0.0 hi,  0.8 si,  0.0 st
KiB Mem:  39618131+total, 26834315+used, 12783816+free,   173648 buffers
KiB Swap:        0 total,        0 used,        0 free. 13084851+cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1898 vcache    20   0  1.429t 0.123t 0.120t D 100.9 33.2 205:58.32 varnishd
 1328 vcache    20   0  0.132t 0.115t  86388 S   6.3 31.3 210:04.84 varnishd
    1 root      20   0  176540   5452   3032 S   0.0  0.0   0:15.81 systemd
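The top output above shows varnishd pinned in state D (uninterruptible sleep) at 100% CPU, which is what drives the load average up. A small sketch for finding such tasks (the `d_state` helper name is mine; PID 1898 comes from the paste above):

```shell
# Filter ps output for tasks in uninterruptible sleep (state D) and print
# their PID and command name.
d_state() { awk '$2 == "D" {print $1, $4}'; }

ps -eo pid,state,wchan:32,comm | d_state
# With root, /proc/<pid>/stack shows where in the kernel a stuck task sits:
#   cat /proc/1898/stack
```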
ema triaged this task as Normal priority.Sep 4 2017, 7:43 AM
ema moved this task from Triage to Caching on the Traffic board.
ema added a subscriber: ema.Sep 4 2017, 7:46 AM

Thanks @elukey! Yeah, cp4024 might be having hardware issues. The system was down yesterday at 9ish AM UTC. I power-cycled it and it came back online fine, but after some hours it started hitting the lockups described in this task.

ema added a comment.EditedSep 5 2017, 9:49 AM

On September 1st:

Sep  1 16:21:57 cp4024 kernel: [ 5179.145078] BUG: unable to handle kernel paging request at ffffffffba8fb2e3
Sep  1 16:21:57 cp4024 kernel: [ 5179.152873] IP: [<ffffffffba8fb2e3>] 0xffffffffba8fb2e3
Sep  1 16:21:57 cp4024 kernel: [ 5179.158713] PGD 78c60a067 
Sep  1 16:21:57 cp4024 kernel: [ 5179.161536] PUD 78c60b063 
Sep  1 16:21:57 cp4024 kernel: [ 5179.164558] PMD 0 
Sep  1 16:21:57 cp4024 kernel: [ 5179.165151] 
Sep  1 16:21:57 cp4024 kernel: [ 5179.166817] Oops: 0010 [#1] SMP
Sep  1 16:21:57 cp4024 kernel: [ 5179.170320] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport cpufreq_userspace cpufreq_conservative cpufreq_powersave drbg ansi_cprng seqiv xfrm4_mode_transport 8021q garp mrp stp llc xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 ttm drm_kms_helper kvm drm iTCO_wdt mxm_wmi evdev dcdbas i2c_algo_bit irqbypass iTCO_vendor_support mei_me crct10dif_pclmul crc32_pclmul lpc_ich ghash_clmulni_intel pcspkr mei shpchp mfd_core ipmi_si wmi button tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto mbcache raid1 md_mod sg sd_mod ahci libahci bnx2x aesni_intel aes_x86_64 ehci_pci ptp glue_helper lrw gf128mul ehci_hcd ablk_helper cryptd libata pps_core mdio usbcore
Sep  1 16:21:57 cp4024 kernel: [ 5179.249705]  libcrc32c crc32c_generic scsi_mod usb_common crc32c_intel
Sep  1 16:21:57 cp4024 kernel: [ 5179.255581] CPU: 23 PID: 2406 Comm: varnishd Not tainted 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.30-2+deb9u2~bpo8+1
Sep  1 16:21:57 cp4024 kernel: [ 5179.266645] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
Sep  1 16:21:57 cp4024 kernel: [ 5179.274995] task: ffff9e03bf8d8000 task.stack: ffffbaadf6b10000
Sep  1 16:21:57 cp4024 kernel: [ 5179.281599] RIP: 0010:[<ffffffffba8fb2e3>]  [<ffffffffba8fb2e3>] 0xffffffffba8fb2e3
Sep  1 16:21:57 cp4024 kernel: [ 5179.290155] RSP: 0018:ffffbaadf6b13ee8  EFLAGS: 00010292
Sep  1 16:21:57 cp4024 kernel: [ 5179.296079] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Sep  1 16:21:57 cp4024 kernel: [ 5179.304040] RDX: 0000000080000000 RSI: ffffbaadf6b13d90 RDI: ffff9e03d3886400
Sep  1 16:21:57 cp4024 kernel: [ 5179.312002] RBP: 0000000000000080 R08: 0000000000000000 R09: 0000000000100000
Sep  1 16:21:57 cp4024 kernel: [ 5179.319963] R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000000
Sep  1 16:21:57 cp4024 kernel: [ 5179.327922] R13: 00007e6540e65034 R14: 00000000000001b1 R15: 00000000000000d8
Sep  1 16:21:57 cp4024 kernel: [ 5179.335884] FS:  00007e655716a700(0000) GS:ffff9e03ff2c0000(0000) knlGS:0000000000000000
Sep  1 16:21:57 cp4024 kernel: [ 5179.344912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  1 16:21:57 cp4024 kernel: [ 5179.351323] CR2: ffffffffba8fb2e3 CR3: 0000005fe2ccd000 CR4: 00000000003406e0
Sep  1 16:21:57 cp4024 kernel: [ 5179.359283] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep  1 16:21:57 cp4024 kernel: [ 5179.367245] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep  1 16:21:57 cp4024 kernel: [ 5179.375206] Stack:
Sep  1 16:21:57 cp4024 kernel: [ 5179.377448]  00000000ffffffff 0000000000000000 00007e6526263020 ffff9e03e35d5700
Sep  1 16:21:57 cp4024 kernel: [ 5179.385738]  ffffffffbac03c22 0000000000000586 4e70677c98b8eacf 00007e65259ea2e0
Sep  1 16:21:57 cp4024 kernel: [ 5179.394025]  00007e655716a700 00007e6540e65030 00007e655716a658 00007e6540e65030
Sep  1 16:21:57 cp4024 kernel: [ 5179.402315] Call Trace:
Sep  1 16:21:57 cp4024 kernel: [ 5179.405048]  [<ffffffffbac03c22>] ? do_writev+0x82/0xf0
Sep  1 16:21:57 cp4024 kernel: [ 5179.410881]  [<ffffffffbb0077bb>] ? system_call_fast_compare_end+0xc/0x9b
Sep  1 16:21:57 cp4024 kernel: [ 5179.418446] Code:  Bad RIP value.
Sep  1 16:21:57 cp4024 kernel: [ 5179.422177] RIP  [<ffffffffba8fb2e3>] 0xffffffffba8fb2e3
Sep  1 16:21:57 cp4024 kernel: [ 5179.428112]  RSP <ffffbaadf6b13ee8>
Sep  1 16:21:57 cp4024 kernel: [ 5179.432001] CR2: ffffffffba8fb2e3
Sep  1 16:21:57 cp4024 kernel: [ 5179.439026] ---[ end trace 27ae728dcab70bcf ]---

On September 3rd, the system was rebooted at 09:26 and crashed a few hours later:

Sep  3 17:24:23 cp4024 kernel: [28665.794503] invalid opcode: 0000 [#1] SMP
Sep  3 17:24:23 cp4024 kernel: [28665.798981] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport 8021q garp mrp stp llc drbg ansi_cprng seqiv xfrm4_mode_transport cpufreq_userspace cpufreq_conservative cpufreq_powersave xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp mgag200 coretemp ttm iTCO_wdt iTCO_vendor_support mxm_wmi evdev drm_kms_helper kvm irqbypass drm crct10dif_pclmul crc32_pclmul dcdbas mei_me ghash_clmulni_intel pcspkr i2c_algo_bit lpc_ich mei mfd_core shpchp ipmi_si wmi button tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto mbcache raid1 md_mod sg sd_mod ahci libahci aesni_intel aes_x86_64 ehci_pci bnx2x glue_helper lrw gf128mul libata ehci_hcd ablk_helper ptp cryptd pps_core mdio libcrc32c
Sep  3 17:24:23 cp4024 kernel: [28665.878587]  usbcore crc32c_generic scsi_mod usb_common crc32c_intel
Sep  3 17:24:23 cp4024 kernel: [28665.884274] CPU: 47 PID: 2237 Comm: varnishd Not tainted 4.9.0-0.bpo.3-amd64 #1 Debian 4.9.30-2+deb9u2~bpo8+1
Sep  3 17:24:23 cp4024 kernel: [28665.895340] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
Sep  3 17:24:23 cp4024 kernel: [28665.903690] task: ffff9684390a0140 task.stack: ffffb2e8f644c000
Sep  3 17:24:23 cp4024 kernel: [28665.910293] RIP: 0010:[<ffffffff97c19fe7>]  [<ffffffff97c19fe7>] __xen_pgd_pin+0x117/0x2f0
Sep  3 17:24:23 cp4024 kernel: [28665.919540] RSP: 0018:ffffb2e8f644ff28  EFLAGS: 00010246
Sep  3 17:24:23 cp4024 kernel: [28665.925467] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00007fd157fe22e8
Sep  3 17:24:23 cp4024 kernel: [28665.933429] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
Sep  3 17:24:23 cp4024 kernel: [28665.941390] RBP: 00007fd157fe22e0 R08: 0000000000000000 R09: ffffb2e8f644fb7c
Sep  3 17:24:23 cp4024 kernel: [28665.949352] R10: ffff968462508038 R11: ffff968460f313c0 R12: 0000000000005fff
Sep  3 17:24:23 cp4024 kernel: [28665.957313] R13: 00007e63d240a000 R14: 00007e63d2410000 R15: 00007e63d2408128
Sep  3 17:24:23 cp4024 kernel: [28665.965274] FS:  00007fd157fe3700(0000) GS:ffff96847f5c0000(0000) knlGS:0000000000000000
Sep  3 17:24:23 cp4024 kernel: [28665.974302] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  3 17:24:23 cp4024 kernel: [28665.980714] CR2: 00007fc2e5f73000 CR3: 0000005fe1cd3000 CR4: 00000000003406e0
Sep  3 17:24:23 cp4024 kernel: [28665.988675] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep  3 17:24:23 cp4024 kernel: [28665.996636] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep  3 17:24:23 cp4024 kernel: [28666.004597] Stack:
Sep  3 17:24:23 cp4024 kernel: [28666.006838]  0000000000006ff9 00000000049fad52 736057fd296f8e12 0000000000000074
Sep  3 17:24:23 cp4024 kernel: [28666.015128]  00007e63d240a000 ffffffff982077bb 00007e63d2408020 00007fd157fe2c60
Sep  3 17:24:23 cp4024 kernel: [28666.023417]  00007e6427d46ac0 0000000000000002 00007fd157fe21c0 00007fd157fe2c60
Sep  3 17:24:23 cp4024 kernel: [28666.031708] Call Trace:
Sep  3 17:24:23 cp4024 kernel: [28666.034437]  [<ffffffff982077bb>] ? system_call_fast_compare_end+0xc/0x9b
Sep  3 17:24:23 cp4024 kernel: [28666.042010] Code: 1f 44 00 00 e8 9b fa 03 00 e8 16 bd ff ff 65 48 8b 3d ee 19 3f 68 57 9d 0f 1f 44 00 00 5b 5d 41 5c 41 5d 41 5e c3 65 8b 05 27 02 <3f> 68 89 c0 48 0f a3 05 cd a8 d0 00 0f 92 c0 84 c0 0f 84 08 ff 
Sep  3 17:24:23 cp4024 kernel: [28666.063684] RIP  [<ffffffff97c19fe7>] __xen_pgd_pin+0x117/0x2f0
Sep  3 17:24:23 cp4024 kernel: [28666.070302]  RSP <ffffb2e8f644ff28>
Sep  3 17:24:23 cp4024 kernel: [28666.077440] ---[ end trace 792d69cdb7466d2e ]---
Sep  3 17:24:44 cp4024 kernel: [28687.127593] INFO: rcu_sched self-detected stall on CPU
Sep  3 17:24:44 cp4024 kernel: [28687.131595] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep  3 17:24:44 cp4024 kernel: [28687.131599]   41-...: (5250 ticks this GP) idle=cf5/140000000000001/0 softirq=2070809/2070809 fqs=2625 
Sep  3 17:24:44 cp4024 kernel: [28687.131600]   
Sep  3 17:24:44 cp4024 kernel: [28687.131603] (detected by 31, t=5252 jiffies, g=2170205, c=2170204, q=188630)
Sep  3 17:24:44 cp4024 kernel: [28687.131604] Task dump for CPU 41:
Sep  3 17:24:44 cp4024 kernel: [28687.131605] varnishd        R
Sep  3 17:24:44 cp4024 kernel: [28687.131605]   running task        0  2065   1888 0x00000008
Sep  3 17:24:44 cp4024 kernel: ffffffff981515bd
Sep  3 17:24:44 cp4024 kernel: [28687.131608]  01ff9684614f90c0 fc38c2bf6c515e59 0000000000000000 ffff968444355a80
Sep  3 17:24:44 cp4024 kernel: [28687.131610]  ffffffff980e42e1 ffffb2e8f5f07df8 ffffb2e8f5f07e18 0000000000000001
Sep  3 17:24:44 cp4024 kernel: [28687.131611]  ffffb2e8f5f07ea8 ffffffff97e4b514 00007fd15a9b5edcCall Trace:
Sep  3 17:24:44 cp4024 kernel: [28687.131618]  [<ffffffff981515bd>] ? tcp_poll+0x23d/0x250
Sep  3 17:24:44 cp4024 kernel: [28687.131623]  [<ffffffff980e42e1>] ? sock_poll+0x41/0x90
Sep  3 17:24:44 cp4024 kernel: [28687.131626]  [<ffffffff97e4b514>] ? ep_send_events_proc+0x134/0x190
Sep  3 17:24:44 cp4024 kernel: [28687.131628]  [<ffffffff97e21ecc>] ? __fget+0x3c/0x90
Sep  3 17:24:44 cp4024 kernel: [28687.131629]  [<ffffffff97e22191>] ? __fget_light+0x21/0x50
Sep  3 17:24:44 cp4024 kernel: [28687.131630]  [<ffffffff97e4c63e>] ? SyS_epoll_ctl+0x7e/0xdb0
Sep  3 17:24:44 cp4024 kernel: [28687.131634]  [<ffffffff97ca2be0>] ? wake_up_q+0x60/0x60
Sep  3 17:24:44 cp4024 kernel: [28687.131637]  [<ffffffff982077bb>] ? system_call_fast_compare_end+0xc/0x9b
Sep  3 17:24:44 cp4024 kernel: [28687.248710]   41-...: (5250 ticks this GP) idle=cf5/140000000000001/0 softirq=2070809/2070809 fqs=2641 
Sep  3 17:24:44 cp4024 kernel: [28687.259096]    (t=5284 jiffies g=2170205 c=2170204 q=189573)
Sep  3 17:24:44 cp4024 kernel: [28687.265324] Task dump for CPU 41:
Sep  3 17:24:44 cp4024 kernel: [28687.269019] varnishd        R  running task        0  2065   1888 0x00000008
Sep  3 17:24:44 cp4024 kernel: [28687.276904]  0000000000000087 ffffffff97ca4fa3 0000000000000029 0000000000000087
Sep  3 17:24:44 cp4024 kernel: [28687.285193]  ffffffff97d7c521 ffff96847f519540 ffffffff9884e3c0 0000000000000000
Sep  3 17:24:44 cp4024 kernel: [28687.293482]  ffff9684614f90c0 ffffffff97ce147e ffff96847f5187c0 00000000000187c0
Sep  3 17:24:44 cp4024 kernel: [28687.301771] Call Trace:
Sep  3 17:24:44 cp4024 kernel: [28687.304498]  <IRQ> 
Sep  3 17:24:44 cp4024 kernel: [28687.306644]  [<ffffffff97ca4fa3>] ? sched_show_task+0xd3/0x140
Sep  3 17:24:44 cp4024 kernel: [28687.313160]  [<ffffffff97d7c521>] ? rcu_dump_cpu_stacks+0x7b/0xb4
Sep  3 17:24:44 cp4024 kernel: [28687.319960]  [<ffffffff97ce147e>] ? rcu_check_callbacks+0x6be/0x800
Sep  3 17:24:44 cp4024 kernel: [28687.326954]  [<ffffffff97cf7360>] ? tick_sched_do_timer+0x30/0x30
Sep  3 17:24:44 cp4024 kernel: [28687.333755]  [<ffffffff97ce7e98>] ? update_process_times+0x28/0x50
Sep  3 17:24:44 cp4024 kernel: [28687.340650]  [<ffffffff97cf6d70>] ? tick_sched_handle.isra.13+0x20/0x50
Sep  3 17:24:44 cp4024 kernel: [28687.348030]  [<ffffffff97cf7398>] ? tick_sched_timer+0x38/0x70
Sep  3 17:24:44 cp4024 kernel: [28687.354538]  [<ffffffff97ce8b4c>] ? __hrtimer_run_queues+0xdc/0x250
Sep  3 17:24:44 cp4024 kernel: [28687.361530]  [<ffffffff97ce8f99>] ? hrtimer_interrupt+0x99/0x190
Sep  3 17:24:44 cp4024 kernel: [28687.368233]  [<ffffffff9820a0e9>] ? smp_apic_timer_interrupt+0x39/0x50
Sep  3 17:24:44 cp4024 kernel: [28687.375517]  [<ffffffff98209402>] ? apic_timer_interrupt+0x82/0x90
Sep  3 17:24:44 cp4024 kernel: [28687.382412]  <EOI> 
Sep  3 17:24:44 cp4024 kernel: [28687.384557]  [<ffffffff97e21ecc>] ? __fget+0x3c/0x90
Sep  3 17:24:44 cp4024 kernel: [28687.390102]  [<ffffffff97e22191>] ? __fget_light+0x21/0x50
Sep  3 17:24:44 cp4024 kernel: [28687.396223]  [<ffffffff97e4c63e>] ? SyS_epoll_ctl+0x7e/0xdb0
Sep  3 17:24:44 cp4024 kernel: [28687.402537]  [<ffffffff97ca2be0>] ? wake_up_q+0x60/0x60
Sep  3 17:24:44 cp4024 kernel: [28687.408366]  [<ffffffff982077bb>] ? system_call_fast_compare_end+0xc/0x9b

Mentioned in SAL (#wikimedia-operations) [2017-09-05T10:11:53Z] <ema> reboot cp4024 T174891

BBlack added a subscriber: BBlack.Sep 5 2017, 4:43 PM

I suspect this is a hardware failure, but we should do some more software testing on the other nodes first to confirm this isn't buggy / incorrect behavior being triggered by the (somewhat crazy) NUMA-isolation config on these nodes.

So far, other nodes are testing ok on this front. This is likely a node-specific early hardware failure.

RobH added a subscriber: RobH.Sep 21 2017, 4:41 PM

Dell offers preboot diagnostics as an option (ePSA); the error-code translations are at https://www.dell.com/support/home/us/en/4/pre-boot-analysis

When I'm onsite next Monday, I'll reboot this machine into the hardware tests.

@RobH any updates here on diags?

RobH added a comment.Sep 27 2017, 3:27 PM

I stupidly forgot this machine when doing the rest of the repairs! I'll go back onsite to work on it soon!

Mentioned in SAL (#wikimedia-operations) [2017-09-27T23:39:48Z] <robh> cp4024 is running hardware testing, leave it alone T174891

RobH added a comment.Sep 27 2017, 11:50 PM

Sometimes entering the serial console registers a keystroke, so I have this running via a screen session on iron. That way reconnecting won't input a keystroke and cancel testing. I'll check on this regularly.

RobH added a comment.EditedSep 28 2017, 3:42 PM

This did fail hardware testing.

Service Tag : 3NC7KH2
Error Code : 2000-0251
Validation : 127076

This is a fairly useless error code: http://www.dell.com/support/article/us/en/19/sln115162/resolve-hardware-issues-with-built-in-and-online-diagnostics--epsa-or-psa-error-codes-?lang=en

Basically it states there is an error in the System Event Log (SEL). That error was a power loss/restore event, i.e. an already-handled entry in the log. I've gone ahead and cleared out the entire SEL, since it had no unhandled errors.
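For reference, the same SEL inspection and clear can also be done from the running OS with ipmitool; the filter regex below is my assumption about what power-loss entries typically look like, not output taken from this host.

```shell
# Show SEL entries that look like power loss/restore events, then clear
# the log only once nothing actionable remains.
power_events() { grep -iE 'power (supply|unit)|ac lost'; }

if command -v ipmitool >/dev/null; then
  ipmitool sel elist | power_events || true
  # ipmitool sel clear   # uncomment after reviewing the entries above
fi
```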

Event Log - the log contains failing records.
this relates to BIOS events in SERVERS ONLY.

    Update to the latest BIOS.
    Repeat the PSA diagnostics.
    When the diagnostics still results in an error code, visit our ePSA online tool. You can get more information on possible resolutions to your issue and even get a part dispatched if needed.

https://www.dell.com/support/home/us/en/19/Pre-boot-Analysis

Recommended action
Needs Attention.

A potential error has been found and we can fix the issue by following these steps.

    Please update BIOS
    Restart your computer and rerun the PSA diagnostics.

If you are unable to resolve the issue, please contact Dell Technical Support for additional assistance.

Nice.

Dell.com has a page on these here: http://www.dell.com/support/manuals/us/en/19/poweredge-vrtx/servers_tsg/psaepsa-diagnostics-error-codes?guid=guid-9afeed67-a47c-4afd-83d8-04301ebf3523&lang=en-us

It says that error code means "Event Log - the log contains failing records", and then recommends BIOS update -> retest, which seems strange. Maybe check the event log and see if it logged something new there during diags?

Oh sorry, my comment was redundant to your edit :)

Mentioned in SAL (#wikimedia-operations) [2017-09-28T16:26:03Z] <robh> cp4024 has had ilom and bios firmware updated per epsa error checking, now re-running epsa utility T174891

Mentioned in SAL (#wikimedia-operations) [2017-09-28T19:14:04Z] <robh> cp4024 initial tests done (takes about 2 hours or so) then prompted for more in depth testing (hit yes, no failures so far) T174891

RobH assigned this task to BBlack.Sep 28 2017, 9:35 PM

Followup testing passed without further errors: All tests passed.

@BBlack: Perhaps it was a transient error solved by the BIOS firmware update? It went from BIOS revision 2.4.2 to 2.5.4. Since this was a single release/firmware update, the full list of issues it could have corrected is: http://www.dell.com/support/home/us/en/19/drivers/driversdetails?driverId=R7F8W&fileId=1962935893&osCode=W12R2&productCode=poweredge-r430&languageCode=en&categoryId=BI

Fixes & Enhancements
Fixes

  • Added workaround to address the uncorrectable errors on RDIMMs with specific vendor or the rev Register Clock Driver (RCD).
  • Updated ECRC (end-to-end CRC checking) to prevent the OS from changing the BIOS settings.
  • Fixed CPU machine check issue when the Write Data Cyclic Redundancy Check (CRC) is enabled.

Enhancements
  • Updated the Intel Processor and Memory Reference Code (MRC) to PLR11.
  • Updated the Intel Xeon Processor E5-2600 v4 Product Family Processor Microcode to version 0x21.
  • Updated the Intel Xeon Processor E5-2600 v3 Product Family Processor Microcode to version 0x3A.
  • Secure MOR change to support the Windows 2016 R3 server version.
  • Added support for the Redfish management interfaces.
  • Added improvements to the Secure Boot Custom Policy settings.
  • Added improvements to the S130 storage.

What are your thoughts? It passed the epsa, which is Dell's built in hardware testing suite. I'd suggest we return it to service and see if it has another issue. Right now it is powered down.

Mentioned in SAL (#wikimedia-operations) [2017-09-29T05:20:58Z] <ema> powerup cp4024 T174891

Mentioned in SAL (#wikimedia-operations) [2017-09-29T05:53:09Z] <ema> cp4024 repooled T174891

BBlack moved this task from Caching to Hardware on the Traffic board.Mon, Oct 23, 2:44 PM
BBlack closed this task as Resolved.Mon, Oct 23, 2:49 PM
BBlack reopened this task as Open.Wed, Oct 25, 9:24 PM

cp4024 died randomly today. I've left it alone other than to connect to the console and verify no response there.
21:07 < icinga-wm> PROBLEM - Host cp4024 is DOWN: PING CRITICAL - Packet loss = 100%

I'm going to assume for now this is related, and that the firmware update wasn't a fix.

I left this down because @RobH was due on-site a short while later. He observed no SEL entry while on-site.

Powercycled back to the OS this morning with traffic depooled. No useful observations from syslog prior to the crash, except that a bunch of the logfiles ended up with a span of NUL bytes stuffed in them at crash time (likely an artifact of the filesystem crash->recovery process). The machine then crashed without warning again after roughly 12 minutes of (depooled) uptime. Checked the SEL and there are still no entries. Kicked off an "express" diagnostics job from the serial console, but it hasn't completed yet:

racadm>>jobqueue view -i JID_090421243021 

racadm jobqueue view -i JID_090421243021  
---------------------------- JOB -------------------------
[Job ID=JID_090421243021]
Job Name=Remote Diagnostics
Status=Running
Start Time=[Now]
Expiration Time=[Not Applicable]
Message=[PR20: Job in progress.]
Percent Complete=[NA]
----------------------------------------------------------
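A hedged sketch of polling that job to completion from a shell: the JID is the one above, and the `Status=` parsing assumes the field format shown in the racadm output (the `job_status` helper name is mine).

```shell
# Extract the Status= field from racadm jobqueue output.
job_status() { sed -n 's/^Status=//p'; }

# Poll every minute until the diagnostics job leaves the Running state
# (only meaningful on a host where racadm is available).
if command -v racadm >/dev/null; then
  while [ "$(racadm jobqueue view -i JID_090421243021 | job_status)" = "Running" ]; do
    sleep 60
  done
fi
```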

For now I'm marking this downtimed in icinga for a week, and considering the hardware faulty and in need of fix/replace/whatever.

BBlack reassigned this task from BBlack to RobH.Thu, Oct 26, 2:07 PM

Powering off for now, less confusing for other software-level maintenance.

RobH added a comment.Thu, Oct 26, 7:21 PM

I'm going to disable its switch port and power it back up into diagnostics (also asked @BBlack via IRC if there was any result from the diagnostics he ran).

RobH added a comment.Thu, Oct 26, 10:07 PM

Ran full test today, no errors.

/------------------------------------------------------------------------------\
| Build : 4239.44        ePSA Pre-boot System Assessment    ServiceTag 3NC7KH2 |
| Device: Hard Drive 2-0-1                                Percent Complete 100 |
| Test  : DST Long Test                                       [++++++++++]     |
| Status:                                                                      |
|------------------------------------------------------------------------------|
|Running: 10/26/2017 21:56:38, Elapsed time: 00:52:40                          |
|Test Results : Running                                                        |
|                                                                              |
|                                                                              |
|** Hard Drive 2-0-1 - DST Long/-----------------\                             |
|Started: 10/26/2017 21:03:58  |All tests passed.|                             |
|Running: 10/26/2017 21:57:09, |        Ok       | 11                          |
|Test Results : Running        \-----------------/                             |
|                                                                              |
|                                                                              |
|** Hard Drive 2-0-1 - DST Long Test **                                        |
|Started: 10/26/2017 21:03:58                                                  |
|Ended: 10/26/2017 21:57:41, Elapsed time: 00:53:43                            |
|Test Results : Pass                                                           |
|                                                                              |
|                                                                              |
|                                                                              |
\------------------------ PRESS ESC TO ABORT TESTING --------------------------/

I'm not sure what is failing at this point.

Since diagnostics and SEL don't turn up anything, I'll recap what we've observed:

  • Initially - the system was reporting kernel errors such as kernel:[30568.739024] NMI watchdog: BUG: soft lockup - CPU#41 stuck for 22s! [varnishd:2065] and misbehaving at the software level because of them, unlike its identical peer nodes. This started from the first moment we looked for it after the initial install.
  • Sep 1 + Sep 3 - 2x hard crashes with OS-level reporting of general badness (invalid opcode, unable to handle kernel paging request). We suspected the NUMA-isolation software config, but all peer nodes had it configured as well and weren't misbehaving in this way.
  • Sep 21-28 - a bunch of hardware diagnostics run with no errors. BIOS upgraded due to an unspecific SEL entry pointing at such things.
  • Sep 30 - Oct 24 - everything seems OK (also, NUMA isolation was reverted at the start of this period, making it even less likely to be a cause).
  • Oct 25 - node crashes out, unresponsive over ethernet/console. No SEL entry.
  • Oct 26 - powercycle and boot back to the OS; no OS logs about the crash either. Re-crashes similarly 12 minutes later. Still no SEL entry. Re-run of diagnostics still shows no fault.

I think we have 3 basically-plausible theories here, with made-up statistics about how likely they are:

  1. 70% - Faulty hardware undetectable by diagnostics and not causing SEL entries on crash
  2. 20% - Earliest crashes due to numa_networking config and/or outdated BIOS, these corrupted important on-disk code (kernel/modules?), later crashes due to corruption
  3. 10% - Power quality issues in ulsfo, perhaps specific to the power ports this node is plugged into, triggering random hardware faults. We do have a history here of being unsure about ulsfo power quality.

The first and third options are difficult to deal with (getting Dell to replace parts with no SEL entries or diagnostic failures, or pushing back on ulsfo about power quality and actually testing/observing it directly), while the second can be addressed with a clean reinstall. So let's try that first, and if it crashes out again we'll move on to working with Dell.
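Theory 2 (corrupted on-disk code) could also be checked before reinstalling by verifying installed files against their package checksums; debsums is the usual Debian tool for this, and the helper below sketches the underlying md5 comparison (the `check_file` helper name is mine, not an existing tool).

```shell
# Verify the running kernel package's files against the checksums shipped
# in the .deb (silent unless something mismatches; guarded in case debsums
# isn't installed).
if command -v debsums >/dev/null; then
  debsums -s "linux-image-$(uname -r)" || true
fi

# The underlying idea, for a single file: compare its md5 against a
# recorded value. Returns non-zero on mismatch.
check_file() {  # usage: check_file <md5> <path>
  echo "$1  $2" | md5sum -c --quiet -
}
```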

Script wmf-auto-reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

cp4024.ulsfo.wmnet

The log can be found in /var/log/wmf-auto-reimage/201710271452_bblack_1093_cp4024_ulsfo_wmnet.log.

So, the reinstall attempt failed with another crash during the installer. I think we have to be looking at bad hardware here:

  ┌───────────────────────────┤ Configuring apt ├───────────────────────────┐
  │                                                                         │
  │                                   25%                                   │
  │                                                                         │
  │ Retrieving file 6 of 9                                                  │
  │                                                                         │
  │     <[  601.837177] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[  601.845534] BUG: unable to handle kernel paging request at ffff885ff443c9c0
[  601.853325] IP: [<ffff885ff443c9c0>] 0xffff885ff443c9c0──────────────────┘
[  601.859169] PGD 1af5067 PUD 5ff184b063 PMD 5fedc60063 PTE 8000005ff443c163
[  601.866882] Oops: 0011 [#1] SMP 
[  601.870501] Modules linked in: dm_mod raid1 md_mod xfs jfs btrfs xor raid6_pq ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci libata bnx2x libcrc32c mdio vfat f]
[  601.899366] CPU: 47 PID: 0 Comm: swapper/47 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.39-1
[  601.908591] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.5.4 08/17/2017
[  601.916942] task: ffff884065c71370 ti: ffff884065c74000 task.ti: ffff884065c74000
[  601.925292] RIP: 0010:[<ffff885ff443c9c0>]  [<ffff885ff443c9c0>] 0xffff885ff443c9c0
[  601.933847] RSP: 0018:ffff88607ece3ea8  EFLAGS: 00010206
[  601.939773] RAX: dead000000000200 RBX: ffff885ff463c000 RCX: ffff885ff443c9c0
[  601.947737] RDX: ffff885ff443c9c0 RSI: ffff885ff443c9c0 RDI: ffff885ff443c9c0
[  601.955700] RBP: ffff885ff443c9a0 R08: ffff885ff443c9e0 R09: 0000000000000000
[  601.963664] R10: 0000000000000004 R11: 0000000000000005 R12: 0000000000000100
[  601.971627] R13: ffff885ff443c9c0 R14: ffff885ff443c9c0 R15: 0000000000000000
[  601.979592] FS:  0000000000000000(0000) GS:ffff88607ece0000(0000) knlGS:0000000000000000
[  601.988623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  601.995035] CR2: ffff885ff443c9c0 CR3: 0000000001813000 CR4: 00000000003407e0
[  602.002999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  602.010962] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  602.018925] Stack:
[  602.021168]  ffffffff81073ea1 ffff885ff463c000 ffff885ff443c9a0 0000000000000001
[  602.029458]  ffff885ff443c9c0 ffff885ff443c9c0 ffffffff81075559 ffff885ff463d030
[  602.037748]  ffff885ff463d430 ffff885ff463d830 ffff885ff443c9a0 ffff885ff463c9a0
[  602.046039] Call Trace:
[  602.048766]  <IRQ> 
[  602.050912]  [<ffffffff81073ea1>] ? call_timer_fn+0x31/0x140
[  602.057452]  [<ffffffff81075559>] ? run_timer_softirq+0x1e9/0x2f0
[  602.064254]  [<ffffffff8106d911>] ? __do_softirq+0xf1/0x2d0
[  602.070474]  [<ffffffff8106dd25>] ? irq_exit+0x95/0xa0
[  602.076212]  [<ffffffff8151dcb0>] ? smp_apic_timer_interrupt+0x40/0x50
[  602.083500]  [<ffffffff8151bd3d>] ? apic_timer_interrupt+0x6d/0x80
[  602.090396]  <EOI> 
[  602.092542]  [<ffffffff8101da12>] ? mwait_idle+0x62/0xa0
[  602.098692]  [<ffffffff810a9b74>] ? cpu_startup_entry+0x374/0x470
[  602.105496]  [<ffffffff810d038c>] ? tick_setup_periodic+0x5c/0x90
[  602.112299]  [<ffffffff81043fdf>] ? start_secondary+0x20f/0x2d0
[  602.118905] Code: 88 ff ff 90 c9 43 f4 5f 88 ff ff 00 00 00 00 00 00 00 00 00 02 00 00 00 00 ad de b0 c9 43 f4 5f 88 ff ff b0 c9 43 f4 5f 88 ff ff <c0> c9 43 f4 5f 88 ff ff c0 c9 43 f4 5f 
[  602.140560] RIP  [<ffff885ff443c9c0>] 0xffff885ff443c9c0
[  602.146498]  RSP <ffff88607ece3ea8>
[  602.150388] CR2: ffff885ff443c9c0
[  602.154666] ---[ end trace 817d4a4f3cd04379 ]---
[  602.154671] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  602.154680] IP: [<ffffffff81075437>] run_timer_softirq+0xc7/0x2f0
[  602.154681] PGD 0 
[  602.154684] Oops: 0002 [#2] SMP 
[  602.154699] Modules linked in: dm_mod raid1 md_mod xfs jfs btrfs xor raid6_pq ext4 crc16 mbcache jbd2 sg sd_mod crc_t10dif crct10dif_common ahci libahci libata bnx2x libcrc32c mdio vfat f]
[  602.154703] CPU: 33 PID: 0 Comm: swapper/33 Tainted: G      D       3.16.0-4-amd64 #1 Debian 3.16.39-1
[  602.154704] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.5.4 08/17/2017
[  602.154705] task: ffff884065c24ce0 ti: ffff884065c2c000 task.ti: ffff884065c2c000
[  602.154710] RIP: 0010:[<ffffffff81075437>]  [<ffffffff81075437>] run_timer_softirq+0xc7/0x2f0
[  602.154711] RSP: 0018:ffff88607ec03ee0  EFLAGS: 00010082
[  602.154712] RAX: ffff885ff443c9a0 RBX: ffff885ff443c000 RCX: 0000000000000000
[  602.154713] RDX: ffff885ff443c970 RSI: ffff88607ec03ef8 RDI: ffff885ff443c000
[  602.154715] RBP: 0000000000000097 R08: 0000000000000068 R09: 0000000000000100
[  602.154716] R10: 0000000000000004 R11: 0000000000000005 R12: 0000000000000001
[  602.154717] R13: ffffffff813e2850 R14: ffff885ff222cc00 R15: 0000000000000000
[  602.154718] FS:  0000000000000000(0000) GS:ffff88607ec00000(0000) knlGS:0000000000000000
[  602.154720] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  602.154721] CR2: 0000000000000008 CR3: 0000000001813000 CR4: 00000000003407e0
[  602.154723] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  602.154724] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  602.154725] Stack:
[  602.154727]  ffff885ff443d030 ffff885ff443d430 ffff885ff443d830 ffff88607ec03ef8
[  602.154729]  ffff88607ec03ef8 0000000000000001 ffffffff8180c088 0000000000000001
[  602.154731]  0000000000000100 0000000000000001 00000000ffffffed ffffffff8106d911
[  602.154732] Call Trace:
[  602.154734]  <IRQ> 
[  602.154737]  [<ffffffff8106d911>] ? __do_softirq+0xf1/0x2d0
[  602.154739]  [<ffffffff8106dd25>] ? irq_exit+0x95/0xa0
[  602.154743]  [<ffffffff8151dcb0>] ? smp_apic_timer_interrupt+0x40/0x50
[  602.154748]  [<ffffffff8151bd3d>] ? apic_timer_interrupt+0x6d/0x80
[  602.154750]  <EOI> 
[  602.154754]  [<ffffffff8101da12>] ? mwait_idle+0x62/0xa0
[  602.154758]  [<ffffffff810a9b74>] ? cpu_startup_entry+0x374/0x470
[  602.154762]  [<ffffffff810d038c>] ? tick_setup_periodic+0x5c/0x90
[  602.154765]  [<ffffffff81043fdf>] ? start_secondary+0x20f/0x2d0
[  602.154787] Code: 10 0f 1f 00 0f b6 e8 85 ed 0f 84 a1 01 00 00 48 63 d5 48 83 43 10 01 48 8d 74 24 18 48 c1 e2 04 48 01 da 48 8b 4a 30 48 8d 42 30 <48> 89 71 08 48 89 4c 24 18 48 8b 4a 38 
[  602.154791] RIP  [<ffffffff81075437>] run_timer_softirq+0xc7/0x2f0
[  602.154791]  RSP <ffff88607ec03ee0>
[  602.154792] CR2: 0000000000000008
[  602.154795] ---[ end trace 817d4a4f3cd0437a ]---
[  602.181616] Kernel panic - not syncing: Fatal exception in interrupt
[  603.257963] Shutting down cpus with NMI
[  603.262250] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  603.281205] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Completed auto-reimage of hosts:

['cp4024.ulsfo.wmnet']

Of which those FAILED:

['cp4024.ulsfo.wmnet']

Ping @RobH - this hardware needs replacing. I guess diagnostics aren't perfect, and neither is the SEL, but clearly the node crashes out even during a fresh install.

RobH added a comment.EditedTue, Oct 31, 2:33 PM

Yep, I've emailed our Dell reps explaining our issues and asking about next steps. (Not sure if they'll want to try swapping the mainboard out or just the entire system.)

Change 387805 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Remove borked cp4024 from ipsec nodelists

https://gerrit.wikimedia.org/r/387805

Change 387805 merged by BBlack:
[operations/puppet@production] Remove borked cp4024 from ipsec nodelists

https://gerrit.wikimedia.org/r/387805

RobH added a comment.Tue, Nov 7, 12:00 AM

They advised that we needed to open a support case, so I did (case 956261134). They're following up and will let us know.

RobH added a comment.Tue, Nov 7, 9:17 PM

I've been in discussion with Renny@Dell support. We lowered the active core count from all cores down to just 2 per CPU. The error still happened during the OS install and boot just now:

[    0.346024] kernel BUG at /build/linux-HB2ADw/linux-4.9.51/arch/x86/kernel/cpu/common.c:997!
[    0.355441] invalid opcode: 0000 [#1] SMP                                   
[    0.359910] Modules linked in:                                              
[    0.363325] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-0.bpo.4-amd64 #1 Debian 4.9.51-1~bpo8+1
[    0.373418] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.5.4 08/17/2017
[    0.381766] task: ffffa0f0e6b04040 task.stack: ffffb86371350000             
[    0.388371] RIP: 0010:[<ffffffffb243802d>]  [<ffffffffb243802d>] identify_secondary_cpu+0x4d/0x70
[    0.398283] RSP: 0000:ffffb86371353f20  EFLAGS: 00010086                    
[    0.404209] RAX: 00000000ffffffe4 RBX: ffffa0f17fc0a040 RCX: ffffffffb304c448
[    0.412170] RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046
[    0.420131] RBP: 0000000000000001 R08: 000000000000018c R09: 0000000000000000
[    0.428091] R10: ffffb86371353dec R11: 000000000000018c R12: 0000000000000000
[    0.436052] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    0.444014] FS:  0000000000000000(0000) GS:ffffa0f17fc00000(0000) knlGS:0000000000000000
[    0.453040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033               
[    0.459449] CR2: 0000000000000000 CR3: 0000002e22e07000 CR4: 00000000003406e0
[    0.467410] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.475370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.483331] Stack:                          
[    0.485571]  0000000000000000 0000000000000000 ffffffffb24483f8 0000000000000001
[    0.493858]  73f21b7d73f21937 0000000000000000 0000000000000000 0000000000000000
[    0.502145]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.510432] Call Trace:
[    0.513162]  [<ffffffffb24483f8>] ? start_secondary+0x58/0x190
[    0.519668] Code: 0a 22 bd 4d 89 ef ff 50 68 0f b7 93 d2 00 00 00 39 d0 75 19 0f b7 bb da 00 00 00 89 ee e8 4c fd 00 00 85 c0 75 05 5b 5d c3 0f 0b <0f> 0b 0f b7 8b d4 00 00 00 89 c2 89 ee 48 c7 c7 
[    0.541358] RIP  [<ffffffffb243802d>] identify_secondary_cpu+0x4d/0x70
[    0.548648]  RSP <ffffb86371353f20>
[    0.552541] ---[ end trace 913b18dbea4f2417 ]---
[    0.557691] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.565168] ---[ end Kernel panic - not syncing: Attempted to kill the idle task!

I've updated Dell support (they have been replying quickly, so I don't expect a long wait on follow-up).

Since this already has the latest firmware, I'm going to assume we're likely looking at either CPU replacements (it seems unlikely that both processors are bad), a mainboard replacement (this seems far more likely), or both. Will update as the replacement parts become available.

RobH added a comment.Tue, Nov 7, 11:30 PM

Dell seems to agree, they are dispatching a replacement mainboard. I'll swap and we'll see what happens.

RobH added a comment.EditedThu, Nov 9, 6:54 PM

Swapped mainboard yesterday, but during the installer today got the following:

[ 457.538179] BUG: soft lockup - CPU#19 stuck for 23s! [apt-get:38504]
[ 493.538655] BUG: soft lockup - CPU#19 stuck for 22s! [apt-get:38504]
[ 521.539025] BUG: soft lockup - CPU#19 stuck for 22s! [apt-get:38504]

It then locks up and becomes unresponsive. I've updated Dell support on the issue, awaiting next steps.
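Soft-lockup lines like the ones above can be scraped out of dmesg to track which cores are faulting over time. A minimal sketch (the helper name and sample lines are illustrative, not any existing tooling):

```python
import re

# Matches the CPU number in kernel soft-lockup messages, e.g.
# "BUG: soft lockup - CPU#19 stuck for 23s! [apt-get:38504]"
LOCKUP_RE = re.compile(r"soft lockup - CPU#(\d+) stuck")

def lockup_cpus(dmesg_lines):
    """Return the CPU numbers mentioned in soft-lockup lines, in order."""
    cpus = []
    for line in dmesg_lines:
        m = LOCKUP_RE.search(line)
        if m:
            cpus.append(int(m.group(1)))
    return cpus

sample = [
    "[ 457.538179] BUG: soft lockup - CPU#19 stuck for 23s! [apt-get:38504]",
    "[ 493.538655] BUG: soft lockup - CPU#19 stuck for 22s! [apt-get:38504]",
]
print(lockup_cpus(sample))  # [19, 19]
```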

BBlack added a comment.Thu, Nov 9, 6:59 PM

So, looking at all the crash messages we've managed to record since the beginning of this ticket, the CPU# indicated has had a history of: 41, 23, 47, 47, 1, 19. Given the way Linux numbers the CPU cores, since they're all odd, they're all on the second CPU die (the one with fewer DIMM slots next to it).
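This parity argument assumes logical CPUs are enumerated alternately across the two packages, as is common on these dual-socket boxes (the authoritative mapping lives in /sys/devices/system/cpu/cpuN/topology/physical_package_id on a running system). Under that assumption, the check sketches as:

```python
# CPU numbers reported across the crashes in this task.
observed = [41, 23, 47, 47, 1, 19]

# Assumption: logical CPUs alternate between the two packages,
# so cpu_number % 2 gives the physical die (0 or 1).
dies = {cpu % 2 for cpu in observed}
print(dies)  # {1} -- every faulting core maps to the second die
```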

RobH added a comment.Thu, Nov 9, 7:29 PM

I've updated Dell, and they want me to move it to socket 1 and repeat. I'm asking them to just send me a replacement CPU instead; we'll see what happens.

RobH added a comment.EditedThu, Nov 9, 8:28 PM

They agreed and are dispatching a replacement part. I'll likely go ahead and do the proposed socket swap with the existing CPUs anyway, but having the replacement on hand will eliminate my having to make two trips.

So the plan is as follows:

  • Dell is dispatching a replacement CPU2
  • I'll go ahead and swap the existing CPU 1 & 2 around, and start the reinstall again, trying to reproduce the error.
    • The error will likely follow the CPU to socket 1, and I'll just swap out the malfunctioning CPU at that time.
    • If it doesn't follow the CPU, we'll have to re-evaluate at that time.
    • The error is unlikely to be the mainboard, since we've swapped it out entirely.
  • Test the OS install with the new CPU to ensure there are no issues, and then continue testing via service.
    • While I'll go ahead and run a full ePSA test on the hardware, that testing is not picking up our current CPU issues. We're only seeing the errors via OS or service usage.

Supposedly this was delivered to ulsfo today, but I didn't get any email from UL support. Dropped them an email and will update. If it is onsite, I'll plan to go to ulsfo tomorrow (Tuesday) to work on this.

RobH added a comment.Tue, Nov 14, 7:41 PM

Ok, did the following:

  • pulled CPU 1 entirely because I didn't want to waste thermal compound swapping it to socket 2
  • put the suspect CPU 2 into the CPU 1 socket
  • installed the OS, got the CPU error during install, so it's a bad CPU
  • put CPU 1 back into its socket
  • put the new CPU 2 from the Dell warranty replacement into CPU socket 2
  • installed the OS, no errors so far

So it's likely a bad CPU 2 at this point. The OS is installed; I'll ship the old CPU 2 back to Dell today.

Also, Dell emailed me about the mainboard not having arrived back yet, which is pretty fast since I just replaced it last week. I replied letting them know it was put in the mail yesterday (Friday was a USPS holiday, and USPS is who they provided that return tag with). The CPU is a FedEx return; I'll drop it off later today.

RobH reassigned this task from RobH to BBlack.Tue, Nov 14, 7:44 PM

@BBlack: Assigning this back to you; please reimage or place this system back into service as you see fit. The CPU error hasn't shown up again during OS installs since I swapped out CPU 2.

For now I'm puppetizing it back into the cluster (and ipsec lists), but not repooling yet...

Mentioned in SAL (#wikimedia-operations) [2017-11-15T14:18:56Z] <ema> repool cp4024 T174891

BBlack closed this task as Resolved.Wed, Nov 15, 10:56 PM

Closing for now, assuming no new problems surface. Thanks @RobH :)