Description

cp3056 had a hardware failure when it was first being brought up, and we have not yet been able to image it or put it into service. The information we got over IRC was an amber light on the front panel and some kind of message about a storage failure (SSDs or NVMe?). It needs fixing at some point, but we can live without it temporarily!

Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | wiki_willy | T235805 ESAMS Refresh/Rebuild (October 2019) |
| Resolved | | BBlack | T236497 cp3056 hardware issue |
Event Timeline
@BBlack indeed I was getting an error on the PCIe card. I removed and re-inserted it, and the error is no longer showing. Please try to re-image the server and let me know.
Thanks.
I've tried imaging, and things mostly work, but I have a hard time keeping it online long enough to get through an initial puppet agent run (or two or three), as the kernel keeps panicking somewhere related to the NIC, e.g.
[ 348.395114] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
[ 348.402980] IP: [<ffffffffc00ade06>] bnxt_poll_work+0x3c6/0x520 [bnxt_en]
[ 348.409783] PGD 0
[ 348.411627]
[ 348.413132] Oops: 0000 [#1] SMP
[ 348.416270] Modules linked in: intel_rapl skx_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp sparse_keymap kvm_intel dell_smbiosn
[ 348.490768] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-11-amd64 #1 Debian 4.9.189-3+deb9u1
[ 348.499437] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 2.2.11 06/14/2019
[ 348.506986] task: ffffffffa8a11500 task.stack: ffffffffa8a00000
[ 348.512890] RIP: 0010:[<ffffffffc00ade06>] [<ffffffffc00ade06>] bnxt_poll_work+0x3c6/0x520 [bnxt_en]
[ 348.522108] RSP: 0018:ffff90d53f203e38 EFLAGS: 00010286
[ 348.527408] RAX: 00000000000000fd RBX: ffff90d5236e88c0 RCX: ffffd728bd7ddfdf
[ 348.534523] RDX: ffffffffa8a246a0 RSI: ffff90d51f7b5600 RDI: ffff910522657000
[ 348.541639] RBP: 0000000000000000 R08: 0000000000000001 R09: ffff9e5d349ee7a0
[ 348.548755] R10: ffffffffa815fde0 R11: ffffffffa815ff90 R12: ffff90d514539300
[ 348.555871] R13: ffff9105226570a0 R14: 0000000000000001 R15: 00000000000000fc
[ 348.562991] FS: 0000000000000000(0000) GS:ffff90d53f200000(0000) knlGS:0000000000000000
[ 348.571056] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 348.576789] CR2: 0000000000000080 CR3: 0000002278e08000 CR4: 0000000000760670
[ 348.583904] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 348.591020] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 348.598135] PKRU: 55555554
[ 348.600842] Stack:
[ 348.602856] 0100000000000000 ffff910522657000 ffff9e5d349f7490 0000008d00000036
[ 348.610335] ffff90d521ec03c0 ffff90d514539300 0000000100000187 ffff90d521ec90c0
[ 348.617804] 0001284d00852052 9dd92715d83bde09 0000000000000000 ffff90d521ec90c0
[ 348.625258] Call Trace:
[ 348.627707] <IRQ>
[ 348.629635] [<ffffffffc00adfdf>] ? bnxt_poll+0x7f/0xd0 [bnxt_en]
[ 348.635725] [<ffffffffa83122c6>] ? net_rx_action+0x246/0x380
[ 348.641455] [<ffffffffa84200ad>] ? __do_softirq+0x10d/0x2b0
[ 348.647103] [<ffffffffa7e80e22>] ? irq_exit+0xc2/0xd0
[ 348.652226] [<ffffffffa841f137>] ? do_IRQ+0x57/0xe0
[ 348.657182] [<ffffffffa841ccde>] ? common_interrupt+0x9e/0x9e
[ 348.663001] <EOI>
[ 348.664929] [<ffffffffa82dcba2>] ? cpuidle_enter_state+0xa2/0x2d0
[ 348.671105] [<ffffffffa7ebe294>] ? cpu_startup_entry+0x154/0x240
[ 348.677182] [<ffffffffa8b3ef5e>] ? start_kernel+0x447/0x467
[ 348.682827] [<ffffffffa8b3e120>] ? early_idt_handler_array+0x120/0x120
[ 348.689423] [<ffffffffa8b3e408>] ? x86_64_start_kernel+0x14c/0x170
[ 348.695673] Code: 2f fd ff ff 4d 85 ed 0f 84 2c 01 00 00 48 8b 7c 24 08 48 8b 97 d0 02 00 00 48 85 d2 0f 84 17 01 00 00 4c 8b 5a 28 4d 85 db 74
[ 348.716043] RIP [<ffffffffc00ade06>] bnxt_poll_work+0x3c6/0x520 [bnxt_en]
[ 348.722920] RSP <ffff90d53f203e38>
[ 348.726404] CR2: 0000000000000080
[ 348.729757] ---[ end trace d2ed557127a41c07 ]---
[ 348.738445] Kernel panic - not syncing: Fatal exception in interrupt
[ 348.744911] Kernel Offset: 0x26e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 348.759632] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
[ 348.766749] ------------[ cut here ]------------
[ 348.771360] unchecked MSR access error: WRMSR to 0x83f (tried to write 0x00000000000000f6) at rIP: 0xffffffffa7e5b9b4 (native_write_msr+0x4/0x2)
[ 348.784352] ffffffffa7e529a7 ffffffffa7e2b87b ffffffffa7f60858 ffffffffa7ed29b1
[ 348.791813] 0000000000000046 ffffffffa8cb8568 0000000001000086 0000000000000004
[ 348.799291] 0000000000000024 0000000000000000 0000000000000000 ffffffffa8cba9c2
[ 348.806769] Call Trace:
[ 348.809207] <IRQ>
[ 348.811132] [<ffffffffa7e529a7>] ? native_apic_msr_write+0x27/0x30
[ 348.817389] [<ffffffffa7e2b87b>] ? arch_irq_work_raise+0x2b/0x40
[ 348.823469] [<ffffffffa7f60858>] ? irq_work_queue+0x98/0xa0
[ 348.829116] [<ffffffffa7ed29b1>] ? console_unlock+0x361/0x610
[ 348.834932] [<ffffffffa7ed2f76>] ? vprintk_emit+0x316/0x4d0
[ 348.840580] [<ffffffffa7ea5f8a>] ? try_to_wake_up+0x2fa/0x3c0
[ 348.846398] [<ffffffffa7f81e25>] ? printk+0x5a/0x76
[ 348.851351] [<ffffffffa7ea5f8a>] ? try_to_wake_up+0x2fa/0x3c0
[ 348.857170] [<ffffffffa7e7a7a2>] ? __warn+0x32/0xf0
[ 348.862124] [<ffffffffa7ea5f8a>] ? try_to_wake_up+0x2fa/0x3c0
[ 348.867943] [<ffffffffa7ebdb13>] ? autoremove_wake_function+0x13/0x40
[ 348.874454] [<ffffffffa7ebd53f>] ? __wake_up_common+0x4f/0x90
[ 348.880273] [<ffffffffa7ebd5b4>] ? __wake_up+0x34/0x50
[ 348.885488] [<ffffffffa7f60759>] ? irq_work_run_list+0x49/0x70
[ 348.891394] [<ffffffffa841f378>] ? smp_irq_work_interrupt+0x38/0x40
[ 348.897729] [<ffffffffa841f01e>] ? irq_work_interrupt+0x9e/0xb0
[ 348.903723] [<ffffffffa7f81c04>] ? panic+0x1fc/0x242
[ 348.908765] [<ffffffffa7e298f2>] ? oops_end+0xc2/0xd0
[ 348.913890] [<ffffffffa7e61a51>] ? no_context+0x1b1/0x400
[ 348.919363] [<ffffffffa7e62214>] ? __do_page_fault+0xd4/0x4f0
[ 348.925184] [<ffffffffa7ed98a0>] ? handle_edge_irq+0x90/0x170
[ 348.931003] [<ffffffffa7e80d9e>] ? irq_exit+0x3e/0xd0
[ 348.936129] [<ffffffffa841d7e8>] ? page_fault+0x28/0x30
[ 348.941430] [<ffffffffa815ff90>] ? unmap_single+0x20/0x20
[ 348.946901] [<ffffffffa815fde0>] ? trace_event_raw_event_swiotlb_bounced+0x160/0x160
[ 348.954712] [<ffffffffc00ade06>] ? bnxt_poll_work+0x3c6/0x520 [bnxt_en]
[ 348.961394] [<ffffffffc00adcc2>] ? bnxt_poll_work+0x282/0x520 [bnxt_en]
[ 348.968077] [<ffffffffc00adfdf>] ? bnxt_poll+0x7f/0xd0 [bnxt_en]
[ 348.974157] [<ffffffffa83122c6>] ? net_rx_action+0x246/0x380
[ 348.979888] [<ffffffffa84200ad>] ? __do_softirq+0x10d/0x2b0
[ 348.985535] [<ffffffffa7e80e22>] ? irq_exit+0xc2/0xd0
[ 348.990660] [<ffffffffa841f137>] ? do_IRQ+0x57/0xe0
[ 348.995615] [<ffffffffa841ccde>] ? common_interrupt+0x9e/0x9e
[ 349.001434] <EOI>
[ 349.003362] [<ffffffffa82dcba2>] ? cpuidle_enter_state+0xa2/0x2d0
[ 349.009537] [<ffffffffa7ebe294>] ? cpu_startup_entry+0x154/0x240
[ 349.015616] [<ffffffffa8b3ef5e>] ? start_kernel+0x447/0x467
[ 349.021260] [<ffffffffa8b3e120>] ? early_idt_handler_array+0x120/0x120
[ 349.027858] [<ffffffffa8b3e408>] ? x86_64_start_kernel+0x14c/0x170
[ 349.034108] WARNING: CPU: 0 PID: 0 at /build/linux-sdMcHj/linux-4.9.189/arch/x86/kernel/smp.c:128 try_to_wake_up+0x2fa/0x3c0
[ 349.045286] Modules linked in: intel_rapl skx_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp sparse_keymap kvm_intel dell_smbiosn
[ 349.119700] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D 4.9.0-11-amd64 #1 Debian 4.9.189-3+deb9u1
[ 349.129581] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 2.2.11 06/14/2019
[ 349.137129] 0000000000000000 ffffffffa81353d4 0000000000000000 0000000000000000
[ 349.144608] ffffffffa7e7a83b ffff910517334580 0000000000000001 ffff910517334ca4
[ 349.152086] 0000000000000004 0000000000000046 0000000000018980 ffffffffa7ea5f8a
[ 349.159556] Call Trace:
[ 349.162004] <IRQ>
[ 349.163935] [<ffffffffa81353d4>] ? dump_stack+0x5c/0x78
[ 349.169242] [<ffffffffa7e7a83b>] ? __warn+0xcb/0xf0
[ 349.174196] [<ffffffffa7ea5f8a>] ? try_to_wake_up+0x2fa/0x3c0
[ 349.180015] [<ffffffffa7ebdb13>] ? autoremove_wake_function+0x13/0x40
[ 349.186524] [<ffffffffa7ebd53f>] ? __wake_up_common+0x4f/0x90
[ 349.192346] [<ffffffffa7ebd5b4>] ? __wake_up+0x34/0x50
[ 349.197560] [<ffffffffa7f60759>] ? irq_work_run_list+0x49/0x70
[ 349.203463] [<ffffffffa841f378>] ? smp_irq_work_interrupt+0x38/0x40
[ 349.209802] [<ffffffffa841f01e>] ? irq_work_interrupt+0x9e/0xb0
[ 349.215793] [<ffffffffa7f81c04>] ? panic+0x1fc/0x242
[ 349.220834] [<ffffffffa7e298f2>] ? oops_end+0xc2/0xd0
[ 349.225960] [<ffffffffa7e61a51>] ? no_context+0x1b1/0x400
[ 349.231435] [<ffffffffa7e62214>] ? __do_page_fault+0xd4/0x4f0
[ 349.237254] [<ffffffffa7ed98a0>] ? handle_edge_irq+0x90/0x170
[ 349.243073] [<ffffffffa7e80d9e>] ? irq_exit+0x3e/0xd0
[ 349.248200] [<ffffffffa841d7e8>] ? page_fault+0x28/0x30
[ 349.253500] [<ffffffffa815ff90>] ? unmap_single+0x20/0x20
[ 349.258974] [<ffffffffa815fde0>] ? trace_event_raw_event_swiotlb_bounced+0x160/0x160
[ 349.266782] [<ffffffffc00ade06>] ? bnxt_poll_work+0x3c6/0x520 [bnxt_en]
[ 349.273466] [<ffffffffc00adcc2>] ? bnxt_poll_work+0x282/0x520 [bnxt_en]
[ 349.280148] [<ffffffffc00adfdf>] ? bnxt_poll+0x7f/0xd0 [bnxt_en]
[ 349.286228] [<ffffffffa83122c6>] ? net_rx_action+0x246/0x380
[ 349.291958] [<ffffffffa84200ad>] ? __do_softirq+0x10d/0x2b0
[ 349.297604] [<ffffffffa7e80e22>] ? irq_exit+0xc2/0xd0
[ 349.302731] [<ffffffffa841f137>] ? do_IRQ+0x57/0xe0
[ 349.307686] [<ffffffffa841ccde>] ? common_interrupt+0x9e/0x9e
[ 349.313505] <EOI>
[ 349.315433] [<ffffffffa82dcba2>] ? cpuidle_enter_state+0xa2/0x2d0
[ 349.321606] [<ffffffffa7ebe294>] ? cpu_startup_entry+0x154/0x240
[ 349.327686] [<ffffffffa8b3ef5e>] ? start_kernel+0x447/0x467
[ 349.333332] [<ffffffffa8b3e120>] ? early_idt_handler_array+0x120/0x120
[ 349.339928] [<ffffffffa8b3e408>] ? x86_64_start_kernel+0x14c/0x170
[ 349.346179] ---[ end trace d2ed557127a41c08 ]---
I think we've seen this before, and it could be related to kernel versions and/or NIC firmware level (but these nodes seem to have even newer firmware than cp1075...). Giving up for now. The OS is installed, but it hasn't gotten through its initial puppet runs because the host keeps hanging on me every time after just a few minutes...
Tried again this morning, but the kernel panics happen too fast to make much progress once the agent starts actually using the NIC (I've only ever had one agent run complete successfully before a crash, out of many attempts). The crashes (and preceding dmesg output) are consistently issues with the card and/or driver for the 10G NIC. I'd say this sounds like our familiar firmware-level issue, but I was able to look at ethtool earlier and it reports the same firmware version that is stable on the rest of the new esams cache nodes. Given the history, perhaps it really is some kind of actual system board error (which first affected the PCIe NVMe drive and is now affecting the PCIe NIC?). I'm at a loss on causes, but if it consistently can't make it through a few puppet runs without crashing, something's wrong.
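For reference, a minimal sketch of the kind of ethtool/dmesg check mentioned above; the interface name is a placeholder, not necessarily what cp3056 actually uses:

```
# Hedged sketch: inspect the Broadcom (bnxt_en) NIC's driver and firmware level.
# "eno1" is a placeholder interface name; substitute the host's real 10G interface.
ethtool -i eno1        # driver (bnxt_en), driver version, firmware-version
ethtool eno1           # link state, speed, negotiated settings
dmesg | grep -i bnxt   # any driver-level errors logged before a crash
```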
The SEL doesn't have any entries since back on Oct 25 (so nothing new specific to this week's crashes). I guess it's also possible this is related to some bad setting (e.g. something to do with the state of the internal disabled NICs, some setting in iDRAC for sharing access to the NICs, or really anything related in BIOS setup? I don't know). I've tried various racadm methods of resetting things (powercycle, powerdown -> wait 1 minute -> powerup, racreset, hardreset, etc.), but they don't seem to change the situation. Even if I don't try to log in or run the agent, it generally dies within about 10 minutes.
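For context, a hedged sketch of the racadm resets referenced above, as typically issued against the iDRAC (remote addressing and credentials omitted; exact invocation may differ in our tooling):

```
# Hedged sketch of the racadm actions mentioned above.
racadm serveraction powercycle   # hard power cycle of the host
racadm serveraction powerdown    # ...wait a minute...
racadm serveraction powerup
racadm serveraction hardreset    # warm reset
racadm racreset                  # reset the iDRAC itself
racadm getsel                    # dump the System Event Log
```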
Since it looks like cp3056 might be down for some time, could we remove it from the config until it's fixed? It would be good to let the IPsec checks in Icinga return to green:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Aggregate+IPsec+Tunnel+Status+codfw
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Aggregate+IPsec+Tunnel+Status+eqiad
Change 549474 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] remove cp3056 from service configs due to host hardware problems
Change 549480 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Remove cp3056 from cache::nodes temporarily
Sorry, I missed that you already had a patch! In any case, we only need to comment it out of cache::nodes to fix up this case (there's no good reason to e.g. churn it out of conftool or the various iptables rules defined from the other configs).
Change 549480 merged by BBlack:
[operations/puppet@production] Remove cp3056 from cache::nodes temporarily
Change 549474 abandoned by Herron:
remove cp3056 from service configs due to host hardware problems
Reason:
abandoned in favor of Ia6aded6cab7281895ff6eef6d15c8a87dfbc02a1
Change 550811 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: reimage cp3058 as text_ats
Change 550811 merged by Ema:
[operations/puppet@production] cache: reimage cp3058 as text_ats
Mentioned in SAL (#wikimedia-operations) [2019-11-14T17:46:20Z] <robh> running dell epsa tool on cp3056 per T236497
Please note that part of the ePSA tool's job is checking the SEL, so the SEL has to be cleared before running the test. @BBlack let me know that this server had issues with the storage SSD needing reseating during install, which is reflected in the error log. Note there is no entry that explains the crash log pasted in previous comments from 2019-10-29.
Also, these logs aren't really lost, just 'cleared' from the active log. They are still stored in the system, and can be manually parsed out of the ePSA system reporting export we have to provide to Dell support for hardware repair.
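For reference, the active SEL can also be dumped and cleared with ipmitool; a hedged sketch (the out-of-band host/credential values are placeholders):

```
# Hedged sketch: dump and clear the SEL with ipmitool.
# In-band, from the installed OS:
ipmitool sel elist          # list SEL entries with timestamps
ipmitool sel clear          # clear the active SEL (as required before ePSA)
# Out-of-band via the iDRAC (placeholder address/credentials):
ipmitool -I lanplus -H <idrac-address> -U <user> -P <password> sel elist
```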
| Record | Date/Time | Source | Severity | Description |
|---|---|---|---|---|
| 1 | 09/05/2019 08:37:13 | system | Ok | Log cleared. |
| 2 | 09/05/2019 08:44:57 | system | Ok | C: boot completed. |
| 3 | 09/05/2019 08:44:57 | system | Ok | OEM software event. |
| 4 | 10/22/2019 11:32:08 | system | Critical | The chassis is open while the power is off. |
| 5 | 10/22/2019 11:32:13 | system | Ok | The chassis is closed while the power is off. |
| 6 | 10/24/2019 09:03:37 | system | Critical | A bus fatal error was detected on a component at slot 2. |
| 7 | 10/24/2019 09:03:37 | system | Ok | An OEM diagnostic event occurred. |
| 8 | 10/24/2019 09:03:37 | system | Ok | An OEM diagnostic event occurred. |
| 9 | 10/24/2019 09:03:37 | system | Critical | A fatal error was detected on a component at bus 58 device 0 function 0. |
| 10 | 10/24/2019 09:03:37 | system | Ok | An OEM diagnostic event occurred. |
| 11 | 10/24/2019 12:31:44 | system | Critical | The power input for power supply 1 is lost. |
| 12 | 10/24/2019 12:35:44 | system | Critical | The chassis is open while the power is off. |
| 13 | 10/24/2019 12:35:50 | system | Ok | The chassis is closed while the power is off. |
| 14 | 10/25/2019 13:24:18 | system | Critical | The power input for power supply 2 is lost. |
| 15 | 10/25/2019 13:24:20 | system | Critical | Power supply redundancy is lost. |
| 16 | 10/25/2019 13:24:53 | system | Ok | The input power for power supply 2 has been restored. |
| 17 | 10/25/2019 13:24:55 | system | Ok | The power supplies are redundant. |
All tests passed. Validation Code : 84413
So all testing has passed. I've gone ahead and powered down the host. I'm not sure on next steps; I'll need to sync up with @BBlack. Our earlier IRC conversation made it seem like they may have other tests to run against the NIC directly?
Please advise, and if we need to swap out hardware, assign back to me for followup!
I don't think there's anything else we can do either. We can't keep it alive booted into an OS for very long before we get a Linux kernel crash in the network driver.
We could try re-seating the NIC (since a re-seat of the NVMe did help earlier on, maybe this box was bumped and all the cards are ill-seated?), and we could try replacing the NIC. If neither of those helps, it might be a deeper fault in the board. Either way it's all physical work from here.
Tag, you're it again! :)
I'll be sending the following remote hands request to Iron Mountain via the portal:
Iron Mountain,
We are experiencing transient issues on one of our servers, and would like to have remote hands work on it on our behalf. The server in question is labeled cp3056, Dell service tag BSM4CZ2, in rack OE15, U17.
We would like to have remote hands open this chassis, and reseat all of the memory, the PCIe riser, and the PCIe network card.
Checklist for work:
- Work can proceed at the time of Iron Mountain's choosing, as long as it is within normal business hours in the EU or US timezones.
- Unplug all cables and note their numbers; they must be plugged back in exactly as they were.
- Slide the server out on its rails (it won't have to be unracked) and open the chassis.
- Reseat the memory DIMMs, PCIe riser, and PCIe network card. (The PCIe SSD storage card was already reseated, but feel free to check it.)
- Reassemble the system back into the rack and plug all cables back in.
- Let us know the above has been completed.
Thanks in advance!
OK, I synced up with @BBlack via IRC and he doesn't have a preference on timing.
My above directions have been submitted for remote hands via the portal, case RITM0115394.
SCTASK0128980 is the new case number, confirmed and opened. (I suppose one is the request, and now we have a confirmed remote hands case.)
Please note this has had all the RAM/riser/cards reseated and continues to pass all Dell ePSA tests.
@BBlack: With the reseating of everything, shall we reimage and try using this system or did you want to try anything else first?
Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:
['cp3056.esams.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201911221717_bblack_138506.log.
Completed auto-reimage of hosts:
['cp3056.esams.wmnet']
Of which those FAILED:
['cp3056.esams.wmnet']
Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:
['cp3056.esams.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201911221717_bblack_138644.log.
Attempting reimage (see above). If it fails like before, it won't get very far (certainly not into production use).
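For context, the reimage runs above are launched from the cumin host with the wmf-auto-reimage script; a hedged sketch of the invocation (any flags beyond the hostname are assumptions and may differ from what was actually used):

```
# Hedged sketch: launching a reimage from cumin1001 (extra flags omitted/assumed).
sudo -i wmf-auto-reimage cp3056.esams.wmnet
# Progress is logged under /var/log/wmf-auto-reimage/, as noted in the messages above.
```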
Change 552548 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp3056: re-enable cache::nodes entry
Change 552548 merged by BBlack:
[operations/puppet@production] cp3056: re-enable cache::nodes entry
So far so good - it has completed all the initial puppetization stuff, which is much further than it got before.
Given it's Friday and this node has a fishy history, I'm leaving this ticket open and the host depooled from live service for the weekend. If it doesn't crash before Monday we'll try pooling it into service and see how it goes.
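For reference, a hedged sketch of how the pool/depool state is typically toggled with conftool's confctl; the selector syntax shown is an assumption:

```
# Hedged sketch: check and change the pooled state via conftool (selector syntax assumed).
confctl select 'name=cp3056.esams.wmnet' get
confctl select 'name=cp3056.esams.wmnet' set/pooled=yes   # pool into live service
confctl select 'name=cp3056.esams.wmnet' set/pooled=no    # depool again if problems recur
```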
Seems good so far; it has been up a few days and in full service for about a day without incident. Calling this resolved unless anything changes!