
cp1075-90 - bnxt_en transmit hangs
Closed, ResolvedPublic

Description

Icinga started to alert about cp1080 being unreachable, with the usual flurry of IPsec issues on other hosts connecting to it:

..
15:48 < icinga-wm> PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1080_v4, cp1080_v6
15:48 < icinga-wm> PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1080_v4, cp1080_v6
15:48 < icinga-wm> PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1080_v4, cp1080_v6
..

depooled it:

16:06 < mutante> !log dzahn@neodymium conftool action : set/pooled=no; selector: name=cp1080.eqiad.wmnet| reason: Strongswan CRITICALs fom Icinga (T201174)

could not ssh to it, powercycled it:

16:23 < mutante> !log cp1080 - powercycled - lots of RECOVERY from Icinga for IPsec connections - leaving depooled so far (T201174)

All the Icinga checks recovered, but I left it depooled.

In syslog it all starts a few minutes before the alerts with kernel failures related to the bnxt_en Broadcom module:

Aug 30 19:41:12 cp1080 kernel: [1809273.920929] ------------[ cut here ]------------
Aug 30 19:41:12 cp1080 kernel: [1809273.920966] WARNING: CPU: 29 PID: 0 at /build/linux-iyqref/linux-4.9.110/net/sched/sch_generic.c:316 dev_watchdog+0x233/0x240
Aug 30 19:41:12 cp1080 kernel: [1809273.920971] NETDEV WATCHDOG: enp59s0f0 (bnxt_en): transmit queue 0 timed out
Aug 30 19:41:12 cp1080 kernel: [1809273.920974] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport drbg ansi_cprng seqiv xfrm4_mode_transport cpufreq_userspace cpufreq_conservative cpufreq_powersave xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo intel_rapl skx_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 ttm kvm drm_kms_helper sg irqbypass drm crct10dif_pclmul crc32_pclmul i2c_algo_bit dcdbas mei_me evdev ghash_clmulni_intel iTCO_wdt lpc_ich pcspkr mfd_core mei shpchp iTCO_vendor_support ipmi_si button tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache ses enclosure sd_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ahci libahci crc32c_intel
Aug 30 19:41:12 cp1080 kernel: [1809273.921082] mpt3sas aesni_intel raid_class aes_x86_64 glue_helper lrw gf128mul xhci_pci ablk_helper scsi_transport_sas libata cryptd xhci_hcd nvme bnxt_en i2c_i801 nvme_core i2c_smbus usbcore scsi_mod usb_common
Aug 30 19:41:12 cp1080 kernel: [1809273.921115] CPU: 29 PID: 0 Comm: swapper/29 Not tainted 4.9.0-7-amd64 #1 Debian 4.9.110-3+deb9u1
Aug 30 19:41:12 cp1080 kernel: [1809273.921117] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 1.3.7 02/09/2018
Aug 30 19:41:12 cp1080 kernel: [1809273.921120] 0000000000000000 ffffffff9d530694 ffff993abd783e20 0000000000000000
Aug 30 19:41:12 cp1080 kernel: [1809273.921126] ffffffff9d278d6e 0000000000000000 ffff993abd783e78 ffff990a9f830000
Aug 30 19:41:12 cp1080 kernel: [1809273.921131] 000000000000001d ffff990aa2535bc0 000000000000004a ffffffff9d278def
Aug 30 19:41:12 cp1080 kernel: [1809273.921137] Call Trace:
Aug 30 19:41:12 cp1080 kernel: [1809273.921140] <IRQ>
Aug 30 19:41:12 cp1080 kernel: [1809273.921151] [<ffffffff9d530694>] ? dump_stack+0x5c/0x78
Aug 30 19:41:12 cp1080 kernel: [1809273.921161] [<ffffffff9d278d6e>] ? __warn+0xbe/0xe0
Aug 30 19:41:12 cp1080 kernel: [1809273.921166] [<ffffffff9d278def>] ? warn_slowpath_fmt+0x5f/0x80
Aug 30 19:41:12 cp1080 kernel: [1809273.921175] [<ffffffff9d2b336d>] ? task_tick_fair+0x52d/0x930
Aug 30 19:41:12 cp1080 kernel: [1809273.921181] [<ffffffff9d7377b3>] ? dev_watchdog+0x233/0x240
Aug 30 19:41:12 cp1080 kernel: [1809273.921186] [<ffffffff9d737580>] ? dev_deactivate_queue.constprop.26+0x60/0x60
Aug 30 19:41:12 cp1080 kernel: [1809273.921214] [<ffffffff9d2e6b22>] ? call_timer_fn+0x32/0x120
Aug 30 19:41:12 cp1080 kernel: [1809273.921217] [<ffffffff9d2e6e97>] ? run_timer_softirq+0x1d7/0x420
Aug 30 19:41:12 cp1080 kernel: [1809273.921224] [<ffffffff9d2f7f70>] ? tick_sched_do_timer+0x30/0x30
Aug 30 19:41:12 cp1080 kernel: [1809273.921228] [<ffffffff9d2f7980>] ? tick_sched_handle.isra.12+0x20/0x50
Aug 30 19:41:12 cp1080 kernel: [1809273.921232] [<ffffffff9d2f7fa8>] ? tick_sched_timer+0x38/0x70
Aug 30 19:41:12 cp1080 kernel: [1809273.921242] [<ffffffff9d8186ba>] ? __do_softirq+0x10a/0x29e
Aug 30 19:41:12 cp1080 kernel: [1809273.921246] [<ffffffff9d27eefe>] ? irq_exit+0xae/0xb0
Aug 30 19:41:12 cp1080 kernel: [1809273.921251] [<ffffffff9d818184>] ? smp_apic_timer_interrupt+0x44/0x50
Aug 30 19:41:12 cp1080 kernel: [1809273.921255] [<ffffffff9d816a76>] ? apic_timer_interrupt+0x96/0xa0
Aug 30 19:41:12 cp1080 kernel: [1809273.921257] <EOI>
Aug 30 19:41:12 cp1080 kernel: [1809273.921264] [<ffffffff9d6d7182>] ? cpuidle_enter_state+0xa2/0x2d0
Aug 30 19:41:12 cp1080 kernel: [1809273.921267] [<ffffffff9d6d7170>] ? cpuidle_enter_state+0x90/0x2d0
Aug 30 19:41:12 cp1080 kernel: [1809273.921274] [<ffffffff9d2bbd24>] ? cpu_startup_entry+0x154/0x240
Aug 30 19:41:12 cp1080 kernel: [1809273.921281] [<ffffffff9d248bb0>] ? start_secondary+0x170/0x1b0
Aug 30 19:41:12 cp1080 kernel: [1809273.921285] ---[ end trace ec1ba160f47f9507 ]---
Aug 30 19:41:12 cp1080 kernel: [1809273.921294] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!
Aug 30 19:41:13 cp1080 kernel: [1809274.862415] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:13 cp1080 kernel: [1809274.869297] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:14 cp1080 kernel: [1809275.779642] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:14 cp1080 kernel: [1809275.786524] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:15 cp1080 kernel: [1809276.693792] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:15 cp1080 kernel: [1809276.700669] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:15 cp1080 kernel: [1809277.610223] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:15 cp1080 kernel: [1809277.617118] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:16 cp1080 kernel: [1809278.528568] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:16 cp1080 kernel: [1809278.535437] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:17 cp1080 kernel: [1809279.448344] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:17 cp1080 kernel: [1809279.455243] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:18 cp1080 kernel: [1809280.368844] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:18 cp1080 kernel: [1809280.375729] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:19 cp1080 lldpd[1170]: unable to send packet on real device for enp59s0f0: No buffer space available
Aug 30 19:41:19 cp1080 lldpd[1167]: 2018-08-30T19:41:19 [WARN/lldp] unable to send packet on real device for enp59s0f0: No buffer space available
Aug 30 19:41:19 cp1080 kernel: [1809281.279662] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:19 cp1080 kernel: [1809281.286545] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:20 cp1080 kernel: [1809282.193967] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:20 cp1080 kernel: [1809282.200885] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:21 cp1080 varnishreqstats[2550]: Traceback (most recent call last):
Aug 30 19:41:21 cp1080 varnishreqstats[2550]: File "/usr/local/bin/varnishreqstats", line 121, in <module>
Aug 30 19:41:21 cp1080 varnishreqstats[2550]: ReqStatsSender(sys.argv[1:]).main()
Aug 30 19:41:21 cp1080 varnishreqstats[2550]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 149, in main
Aug 30 19:41:21 cp1080 varnishreqstats[2550]: self.handle_record(line)
Aug 30 19:41:21 cp1080 varnishreqstats[2550]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 133, in handle_record
Aug 30 19:41:21 cp1080 varnishreqstats[2550]: self.resolve_statsd_ip(now)
Aug 30 19:41:21 cp1080 varnishreqstats[2550]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 99, in resolve_statsd_ip
Aug 30 19:41:21 cp1080 varnishreqstats[2550]: statsd_ip = socket.gethostbyname(self.statsd_hostname)
Aug 30 19:41:21 cp1080 varnishreqstats[2550]: socket.gaierror: [Errno -3] Temporary failure in name resolution
Aug 30 19:41:21 cp1080 systemd[1]: varnishreqstats-frontend.service: Main process exited, code=exited, status=1/FAILURE
Aug 30 19:41:21 cp1080 systemd[1]: varnishreqstats-frontend.service: Unit entered failed state.
Aug 30 19:41:21 cp1080 systemd[1]: varnishreqstats-frontend.service: Failed with result 'exit-code'.
Aug 30 19:41:21 cp1080 kernel: [1809283.109547] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:21 cp1080 kernel: [1809283.116410] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:22 cp1080 kernel: [1809284.026761] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:22 cp1080 kernel: [1809284.033622] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:23 cp1080 kernel: [1809284.944888] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:23 cp1080 kernel: [1809284.951749] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1
Aug 30 19:41:24 cp1080 kernel: [1809285.808528] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:24 cp1080 kernel: [1809285.815389] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:25 cp1080 kernel: [1809286.707603] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:25 cp1080 kernel: [1809286.714475] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:25 cp1080 kernel: [1809287.620964] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:25 cp1080 kernel: [1809287.627838] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:26 cp1080 systemd[1]: varnishreqstats-frontend.service: Service hold-off time over, scheduling restart.
Aug 30 19:41:26 cp1080 systemd[1]: Stopped Varnish Request Stats StatsD reporter.
Aug 30 19:41:26 cp1080 systemd[1]: Started Varnish Request Stats StatsD reporter.
Aug 30 19:41:26 cp1080 kernel: [1809288.534169] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:26 cp1080 kernel: [1809288.541049] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:27 cp1080 kernel: [1809289.422863] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:27 cp1080 kernel: [1809289.429727] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:28 cp1080 kernel: [1809290.294522] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:28 cp1080 kernel: [1809290.301379] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:29 cp1080 kernel: [1809291.193288] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:29 cp1080 kernel: [1809291.200174] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:30 cp1080 kernel: [1809292.107024] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:30 cp1080 kernel: [1809292.113911] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:31 cp1080 kernel: [1809293.023559] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:31 cp1080 kernel: [1809293.030449] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:32 cp1080 kernel: [1809293.938329] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:32 cp1080 kernel: [1809293.945185] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:33 cp1080 kernel: [1809294.852956] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:33 cp1080 kernel: [1809294.859812] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:34 cp1080 kernel: [1809295.769910] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:34 cp1080 kernel: [1809295.776791] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:35 cp1080 kernel: [1809296.685620] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:35 cp1080 kernel: [1809296.692482] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:35 cp1080 kernel: [1809297.542229] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:35 cp1080 kernel: [1809297.549100] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:36 cp1080 kernel: [1809298.448563] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:36 cp1080 kernel: [1809298.455434] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:37 cp1080 kernel: [1809299.357676] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:37 cp1080 kernel: [1809299.364552] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:38 cp1080 kernel: [1809300.274744] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:38 cp1080 kernel: [1809300.281624] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:39 cp1080 kernel: [1809301.190199] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:39 cp1080 kernel: [1809301.197062] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:40 cp1080 kernel: [1809302.072518] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:40 cp1080 kernel: [1809302.079395] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:41 cp1080 kernel: [1809302.973032] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:41 cp1080 kernel: [1809302.979909] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:42 cp1080 kernel: [1809303.855811] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:42 cp1080 kernel: [1809303.862693] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:43 cp1080 kernel: [1809304.770523] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:43 cp1080 kernel: [1809304.777430] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:44 cp1080 kernel: [1809305.686517] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:44 cp1080 kernel: [1809305.693378] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:45 cp1080 kernel: [1809306.601203] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51
Aug 30 19:41:45 cp1080 kernel: [1809306.608068] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1
Aug 30 19:41:45 cp1080 kernel: [1809307.517259] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a2} len:0
Aug 30 19:41:45 cp1080 kernel: [1809307.525452] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:46 cp1080 kernel: [1809308.435681] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a3} len:0
Aug 30 19:41:46 cp1080 kernel: [1809308.443848] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:47 cp1080 kernel: [1809309.303532] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a4} len:0
Aug 30 19:41:47 cp1080 kernel: [1809309.311706] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:48 cp1080 kernel: [1809310.210782] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a5} len:0
Aug 30 19:41:48 cp1080 kernel: [1809310.218938] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:49 cp1080 kernel: [1809311.064632] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a6} len:0
Aug 30 19:41:49 cp1080 kernel: [1809311.072806] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:49 cp1080 lldpd[1170]: unable to send packet on real device for enp59s0f0: No buffer space available
Aug 30 19:41:49 cp1080 lldpd[1167]: 2018-08-30T19:41:49 [WARN/lldp] unable to send packet on real device for enp59s0f0: No buffer space available
Aug 30 19:41:50 cp1080 kernel: [1809311.949349] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a7} len:0
Aug 30 19:41:50 cp1080 kernel: [1809311.957509] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:51 cp1080 kernel: [1809312.803086] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a8} len:0
Aug 30 19:41:51 cp1080 kernel: [1809312.811237] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:52 cp1080 kernel: [1809313.655826] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a9} len:0
Aug 30 19:41:52 cp1080 kernel: [1809313.663991] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:52 cp1080 kernel: [1809314.535686] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8aa} len:0
Aug 30 19:41:52 cp1080 kernel: [1809314.543878] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:53 cp1080 kernel: [1809315.456720] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8ab} len:0
Aug 30 19:41:53 cp1080 kernel: [1809315.464885] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:54 cp1080 kernel: [1809316.377057] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8ac} len:0
Aug 30 19:41:54 cp1080 kernel: [1809316.385235] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:55 cp1080 kernel: [1809317.297106] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8ad} len:0
Aug 30 19:41:55 cp1080 kernel: [1809317.305263] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1
Aug 30 19:41:56 cp1080 kernel: [1809318.213093] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x61 0x8ae} len:0
Aug 30 19:41:57 cp1080 kernel: [1809319.184394] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x50 0x8af} len:0
Aug 30 19:41:57 cp1080 kernel: [1809319.192577] bnxt_en 0000:3b:00.0 enp59s0f0: Invalid ring
Aug 30 19:41:57 cp1080 kernel: [1809319.198069] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm ring alloc failure rc: ffffffff
Aug 30 19:41:58 cp1080 kernel: [1809320.089628] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0xb1 0x8b0} len:0
Aug 30 19:41:58 cp1080 kernel: [1809320.097795] bnxt_en 0000:3b:00.0 enp59s0f0: bnxt_init_nic err: ffffffff
Aug 30 19:41:58 cp1080 kernel: [1809320.122391] bnxt_en 0000:3b:00.0 enp59s0f0: nic open fail (rc: ffffffff)
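
To spot the same signature on other hosts, something like the following could be run over a syslog file. This is only a rough sketch: the regular expressions are derived from the message texts quoted in this task, and the function name `summarize` is made up for illustration, not part of any existing tooling.

```python
import re
from collections import Counter

# Patterns taken from the kernel messages pasted in this task.
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>\w+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)
RING_FREE = re.compile(r"hwrm_ring_free (?P<ring>tx|rx|cp) failed")


def summarize(lines):
    """Return (watchdog timeouts, ring-free failure counts per ring type).

    Hypothetical helper: timeouts is a list of (iface, driver, queue) tuples.
    """
    timeouts = []
    ring_failures = Counter()
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            timeouts.append(
                (match["iface"], match["driver"], int(match["queue"]))
            )
            continue
        match = RING_FREE.search(line)
        if match:
            ring_failures[match["ring"]] += 1
    return timeouts, ring_failures


if __name__ == "__main__":
    import sys

    # Usage (path is an example): python3 scan.py /var/log/syslog
    with open(sys.argv[1], errors="replace") as handle:
        found, failures = summarize(handle)
    print("watchdog timeouts:", found)
    print("ring free failures:", dict(failures))
```

A host showing a bnxt_en watchdog timeout followed by tx, rx and cp ring-free failures would match the sequence seen on cp1080.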

and then the network goes down and there are all kinds of issues like:

Aug 30 19:41:59 cp1080 varnishmedia[2545]: Traceback (most recent call last):
Aug 30 19:41:59 cp1080 varnishmedia[2545]: File "/usr/local/bin/varnishmedia", line 61, in <module>
Aug 30 19:41:59 cp1080 varnishmedia[2545]: MediaCacheStatsSender(sys.argv[1:]).main()
Aug 30 19:41:59 cp1080 varnishmedia[2545]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 149, in main
Aug 30 19:41:59 cp1080 varnishmedia[2545]: self.handle_record(line)
Aug 30 19:41:59 cp1080 varnishmedia[2545]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 133, in handle_record
Aug 30 19:41:59 cp1080 varnishmedia[2545]: self.resolve_statsd_ip(now)
Aug 30 19:41:59 cp1080 varnishmedia[2545]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 99, in resolve_statsd_ip
Aug 30 19:41:59 cp1080 varnishmedia[2545]: statsd_ip = socket.gethostbyname(self.statsd_hostname)
Aug 30 19:41:59 cp1080 varnishmedia[2545]: socket.gaierror: [Errno -3] Temporary failure in name resolution
Aug 30 19:41:59 cp1080 systemd[1]: varnishmedia.service: Main process exited, code=exited, status=1/FAILURE
Aug 30 19:41:59 cp1080 systemd[1]: varnishmedia.service: Unit entered failed state.
Aug 30 19:41:59 cp1080 systemd[1]: varnishmedia.service: Failed with result 'exit-code'.
Aug 30 19:41:59 cp1080 kernel: [1809321.075598] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x20 0x8b1} len:0
Aug 30 19:41:59 cp1080 lldpd[1170]: removal request for address of 2620:0:861:102:10:64:16:23%2, but no knowledge of it
Aug 30 19:41:59 cp1080 lldpd[1167]: 2018-08-30T19:41:59 [INFO/netlink] removal request for address of 2620:0:861:102:10:64:16:23%2, but no knowledge of it
Aug 30 19:41:59 cp1080 lldpd[1170]: removal request for address of fe80::20a:f7ff:feef:eba0%2, but no knowledge of it
Aug 30 19:41:59 cp1080 lldpd[1167]: 2018-08-30T19:41:59 [INFO/netlink] removal request for address of fe80::20a:f7ff:feef:eba0%2, but no knowledge of it
Aug 30 19:41:59 cp1080 lldpd[1170]: error while receiving frame on enp59s0f0 (retry: 0): Network is down
Aug 30 19:41:59 cp1080 lldpd[1167]: 2018-08-30T19:41:59 [WARN/interfaces] error while receiving frame on enp59s0f0 (retry: 0): Network is down
Aug 30 19:41:59 cp1080 varnishospital[2547]: --- Logging error ---
Aug 30 19:41:59 cp1080 varnishospital[2547]: Traceback (most recent call last):
Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/lib/python3.5/logging/handlers.py", line 622, in emit
Aug 30 19:41:59 cp1080 varnishospital[2547]: self.send(s)
Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/lib/python3.5/logging/handlers.py", line 680, in send
Aug 30 19:41:59 cp1080 varnishospital[2547]: self.sock.sendto(s, self.address)
Aug 30 19:41:59 cp1080 varnishospital[2547]: socket.gaierror: [Errno -3] Temporary failure in name resolution
Aug 30 19:41:59 cp1080 varnishospital[2547]: Call stack:
Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/local/bin/varnishospital", line 87, in <module>
Aug 30 19:41:59 cp1080 varnishospital[2547]: VarnishHospital(sys.argv[1:]).main()
Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/local/lib/python3.5/dist-packages/wikimedia_varnishlogconsumer.py", line 189, in main
Aug 30 19:41:59 cp1080 varnishospital[2547]: self.handle_line(line)
Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/local/bin/varnishospital", line 83, in handle_line
Aug 30 19:41:59 cp1080 varnishospital[2547]: self.logger.info("{} {}".format(log['origin_server'], log['transition']), extra=log)
Aug 30 19:41:59 cp1080 varnishospital[2547]: Message: 'be_cp1078_eqiad_wmnet Went sick'
Aug 30 19:41:59 cp1080 varnishospital[2547]: Arguments: ()
Aug 30 19:41:59 cp1080 varnishospital[2547]: --- Logging error ---
Aug 30 19:41:59 cp1080 varnishospital[2547]: Traceback (most recent call last):
Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/lib/python3.5/logging/handlers.py", line 622, in emit
Aug 30 19:41:59 cp1080 varnishospital[2547]: self.send(s)
Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/lib/python3.5/logging/handlers.py", line 680, in send
Aug 30 19:41:59 cp1080 varnishospital[2547]: self.sock.sendto(s, self.address)
Aug 30 19:41:59 cp1080 varnishospital[2547]: socket.gaierror: [Errno -3] Temporary failure in name resolution
Aug 30 19:41:59 cp1080 varnishospital[2547]: Call stack:
Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/local/bin/varnishospital", line 87, in <module>
Aug 30 19:41:59 cp1080 varnishospital[2547]: VarnishHospital(sys.argv[1:]).main()
Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/local/lib/python3.5/dist-packages/wikimedia_varnishlogconsumer.py", line 189, in main

Aug 30 19:42:04 cp1080 systemd[1]: varnishstatsd-default.service: Unit entered failed state.
Aug 30 19:42:04 cp1080 systemd[1]: varnishstatsd-default.service: Failed with result 'exit-code'.
Aug 30 19:42:04 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1002.eqiad.wmnet:9093/1002: Failed to resolve 'kafka-jumbo1002.eqiad.wmnet:9093': Temporary failure in name resolution
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1006.eqiad.wmnet:9093/1006: 4 request(s) timed out: disconnect
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1005.eqiad.wmnet:9093/1005: 4 request(s) timed out: disconnect
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1003.eqiad.wmnet:9093/1003: 6 request(s) timed out: disconnect
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1004.eqiad.wmnet:9093/1004: 4 request(s) timed out: disconnect
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1001.eqiad.wmnet:9093/1001: 4 request(s) timed out: disconnect
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-187): 6/6 brokers are down
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1006.eqiad.wmnet:9093/1006: Failed to resolve 'kafka-jumbo1006.eqiad.wmnet:9093': Temporary failure in name resolution
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1005.eqiad.wmnet:9093/1005: Failed to resolve 'kafka-jumbo1005.eqiad.wmnet:9093': Temporary failure in name resolution
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1003.eqiad.wmnet:9093/1003: Failed to resolve 'kafka-jumbo1003.eqiad.wmnet:9093': Temporary failure in name resolution
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1004.eqiad.wmnet:9093/1004: Failed to resolve 'kafka-jumbo1004.eqiad.wmnet:9093': Temporary failure in name resolution
Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1001.eqiad.wmnet:9093/1001: Failed to resolve 'kafka-jumbo1001.eqiad.wmnet:9093': Temporary failure in name resolution

Aug 30 19:42:37 cp1080 varnishslowlog[211237]: --- Logging error ---
Aug 30 19:42:37 cp1080 varnishslowlog[211237]: Traceback (most recent call last):
Aug 30 19:42:37 cp1080 varnishslowlog[211237]: File "/usr/lib/python3.5/logging/handlers.py", line 622, in emit
Aug 30 19:42:37 cp1080 varnishslowlog[211237]: self.send(s)
Aug 30 19:42:37 cp1080 varnishslowlog[211237]: File "/usr/lib/python3.5/logging/handlers.py", line 680, in send
Aug 30 19:42:37 cp1080 varnishslowlog[211237]: self.sock.sendto(s, self.address)
Aug 30 19:42:37 cp1080 varnishslowlog[211237]: socket.gaierror: [Errno -3] Temporary failure in name resolution

Aug 30 20:09:39 cp1080 confd[1054]: 2018-08-30T20:09:39Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Aug 30 20:09:39 cp1080 confd[1054]: 2018-08-30T20:09:39Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Aug 30 20:09:41 cp1080 confd[1054]: 2018-08-30T20:09:41Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Aug 30 20:09:41 cp1080 confd[1054]: 2018-08-30T20:09:41Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Aug 30 20:09:44 cp1080 confd[1054]: 2018-08-30T20:09:44Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Aug 30 20:09:44 cp1080 confd[1054]: 2018-08-30T20:09:44Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Aug 30 20:09:46 cp1080 confd[1054]: 2018-08-30T20:09:46Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Aug 30 20:09:46 cp1080 confd[1054]: 2018-08-30T20:09:46Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

Event Timeline

ArielGlenn triaged this task as Medium priority.Sep 3 2018, 8:26 AM

@BBlack @Dzahn I didn't get any direction with this but I assumed swapping the DAC is the appropriate attempted fix. Please let me know if this resolves the issue

Mentioned in SAL (#wikimedia-operations) [2018-09-05T18:30:52Z] <mutante> cp1080 - pooling again after T203194 appears fixed

@Cmjohnson Thank you. I don't really know if that was the cause, but looking at monitoring now, everything is green. So I repooled the server just now.

Dzahn claimed this task.
Dzahn added a subscriber: ema.

I guess we can close it and simply reopen it if it should happen again. Let me know if you think otherwise @ema @BBlack

BBlack renamed this task from cp1080 - kernel / bnxt_en failures to cp1076-90 - bnxt_en transmit hangs.Oct 31 2018, 5:45 PM
BBlack reopened this task as Open.

cp1085 hit this today, virtually identical in all respects with the sequence of events and kernel/log outputs, etc. That's 2/16 nodes, a little under 2 months apart. Not an epidemic, but maybe worth looking into. Most likely this is a kernel driver bug or firmware bug rather than bad hardware. A simple reboot brought cp1085 back into service afterwards.

BBlack renamed this task from cp1076-90 - bnxt_en transmit hangs to cp1075-90 - bnxt_en transmit hangs.Oct 31 2018, 6:33 PM
Dzahn removed Dzahn as the assignee of this task.Oct 31 2018, 8:49 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-06T23:55:49Z] <mutante> cp1084 - network went down, powercycled, probably T203194

@BBlack same symptoms on cp1084 just now...

bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0xe4f3}
bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1

BBlack added a subscriber: Andrew.

Yet another! cp1078 crash ticket above merged into here.

Change 482658 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] site: Keep lvs2010 as a spare system while T203194 is not solved

https://gerrit.wikimedia.org/r/482658

Change 482658 merged by Vgutierrez:
[operations/puppet@production] site: Keep lvs2010 as a spare system while T203194 is not solved

https://gerrit.wikimedia.org/r/482658

I suspect our bug is fixed by:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=73f21c653f930f438d53eed29b5e4c65c8a0f906

which is included in the official upstream Linux releases from 4.19-rc7 onwards. The bnxt_en driver has a lot of commit traffic though, so I'm not sure it's reasonable to backport this fix very far (it might inadvertently depend on many other commits). Maybe we should look into whether other vendor kernels nearer to what we're running have backported it?

Or we can ask the Broadcom maintainers whether they can provide a backport to 4.9.x; it's one of the officially supported LTS kernel branches after all. And the Debian stretch kernel uses the bnxt_en driver without local patches.

Even better, then we can simply get the 4.9.144-1 kernel from stretch-proposed-updates and test whether that is the correct fix.

Yeah. It's hard to "prove" whether we have this bug fixed other than running a supposed fix on the bnxt_en cp10 fleet for a while as a statistical test, but probably the sooner we start on that the better.

The 4.9.144-1 kernel is fully production-ready; the point releases for Debian are used to rebase the Stretch kernel to the latest set of 4.9.x bug fixes (although depending on the final date for Stretch 9.7 there might be one further update still).

Change 484181 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] aptrepo: add component/kernel-proposed-updates to stretch-wikimedia

https://gerrit.wikimedia.org/r/484181

Change 484181 merged by Vgutierrez:
[operations/puppet@production] aptrepo: add component/kernel-proposed-updates to stretch-wikimedia

https://gerrit.wikimedia.org/r/484181

Change 484199 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] cache: Add kernel-proposed-updates component for cp1075-99

https://gerrit.wikimedia.org/r/484199

Change 484199 merged by Vgutierrez:
[operations/puppet@production] cache: Add kernel-proposed-updates component for cp1075-99

https://gerrit.wikimedia.org/r/484199

Mentioned in SAL (#wikimedia-operations) [2019-01-14T15:02:18Z] <vgutierrez> upgrading kernel in cp1075 to 4.9.144-1 - T203194

Mentioned in SAL (#wikimedia-operations) [2019-01-14T15:33:48Z] <vgutierrez> rolling restart of cp1076-cp1090 to upgrade to kernel 4.9.144 - T203194

kernel upgraded successfully in cp1075-cp1090:

vgutierrez@cumin1001:~$ sudo cumin cp[1075-1090].eqiad.wmnet 'uname -v'
16 hosts will be targeted:
cp[1075-1090].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(16) cp[1075-1090].eqiad.wmnet
----- OUTPUT of 'uname -v' -----
#1 SMP Debian 4.9.144-1 (2018-12-30)
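
When the hosts do not all report the same string, the check above can be scripted. The sketch below pulls the Debian package version out of a `uname -v` line and lists hosts that are not on the expected version. The function names and the hostname-to-output mapping are assumptions for illustration, not something cumin provides in this form.

```python
import re


def debian_kernel_version(uname_v):
    """Extract the Debian package version from `uname -v` output, or None."""
    match = re.search(r"Debian (\S+)", uname_v)
    return match.group(1) if match else None


def hosts_not_on(expected, outputs):
    """Return the sorted hosts whose reported version differs from `expected`.

    `outputs` maps hostname -> raw `uname -v` string (a hypothetical shape).
    """
    return sorted(
        host
        for host, out in outputs.items()
        if debian_kernel_version(out) != expected
    )
```

For the output shown above, `debian_kernel_version` applied to the single reported line yields the `4.9.144-1` version string that the upgrade targeted.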

Mentioned in SAL (#wikimedia-operations) [2019-01-15T01:12:21Z] <mutante> cp1078 - bnxt_en - TX timeout detected - Host cp1078 is DOWN - powercycled via mgmt (T203194)

20:09 <+icinga-wm> PROBLEM - Host cp1078 is DOWN: PING CRITICAL - Packet loss = 100%
...
cp1078 login: [33059.724815] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!
...
..powercycling..
...
20:14 <+icinga-wm> RECOVERY - Host cp1078 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms

Thanks for handling cp1078 @Dzahn. It looks like 4.9.144 is also affected

so far we've seen crashes in the following servers (updated on 16/01/19):

  • cp1078 (twice)
  • cp1080
  • cp1084
  • cp1085
  • cp1088

On the Dell community forum there is a post with several people reporting the same issue. The last one suggests that Dell could be replacing hardware to fix the issue. Maybe it's time to open a support case with Dell?

The reports in that thread are for RHEL 7, which uses 3.10 as the base layer kernel (but with backports for all kinds of drivers, so it's hard to tell how that maps to our 4.9 kernel). One thing we could try is to test the 4.19.12-1~bpo9+1 kernel from stretch-backports. If it still fails in that version, we can easily report it to the upstream maintainers given that 4.19 is the latest LTS branch. Or we point Dell to the thread and ask them to swap the NICs to a known working 10G card.

cp1088 has been affected as well after the kernel upgrade

On the Dell community forum there is a post with several people reporting the same issue. The last one suggests that Dell could be replacing hardware to fix the issue.

I think you're referring to the last comment there at present where a user reports:

Hello, In my case, I am having this issue in 8 servers ( same specs). We have a case opened with dell. They are proposing replace the mainboard. The card is broadcom.

I think this is probably just the standard-issue Dell answer for unknown hardware problems: replace the mainboard and see if it works. Given how widespread the problem is, this is unlikely to help unless they've identified that a specific batch of mainboards has a specific issue causing this.

Did we ever look at NIC firmware levels on these? (I don't remember)... it could be that it needs a combination of bugfixed firmware and the bugfixed Linux driver.

See also this email thread where Michael Chan (Broadcom driver dev) asks for firmware level output, sees the same numbers we have on cp1088, and tells them to upgrade: https://www.spinics.net/lists/netdev/msg519478.html
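
For comparing firmware levels across the fleet, the driver and firmware versions reported by `ethtool -i <iface>` can be parsed into a dictionary. This is a minimal sketch: the parser name is made up, and the version strings in the sample are placeholders rather than the values seen on cp1088.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i <iface>` output into a dict of key -> value."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only: bus-info values contain colons too.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Sample output; X.Y.Z and A.B.C.D are placeholders, not real versions.
SAMPLE = """\
driver: bnxt_en
version: X.Y.Z
firmware-version: A.B.C.D
bus-info: 0000:3b:00.0
"""

info = parse_ethtool_info(SAMPLE)
print(info["driver"], info["firmware-version"])
```

Collecting `info["firmware-version"]` per host before and after an upgrade gives a quick way to confirm that every NIC ended up on the same level.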

Mentioned in SAL (#wikimedia-operations) [2019-01-16T16:45:35Z] <vgutierrez> upgrading NIC firmware on cp1075 - T203194

Mentioned in SAL (#wikimedia-operations) [2019-01-16T17:05:32Z] <vgutierrez> upgrading NIC firmware in cp1076 - T203194

Mentioned in SAL (#wikimedia-operations) [2019-01-16T17:22:17Z] <vgutierrez> rolling NIC firmware upgrade cp[1077-1080] - T203194

Mentioned in SAL (#wikimedia-operations) [2019-01-17T09:02:12Z] <vgutierrez> rolling NIC firmware upgrade cp[1081-1090] - T203194

firmware upgrade completed for all the affected systems.

Change 485044 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] pybal: check for discrepancies in the configured services

https://gerrit.wikimedia.org/r/485044

This is looking way better than the previous attempt. All the systems have been running 19 days without issues since the firmware upgrade:

vgutierrez@cumin1001:~$ sudo cumin cp[1075-1090].eqiad.wmnet "uptime |cut -d' ' -f 4-5"
16 hosts will be targeted:
cp[1075-1090].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(16) cp[1075-1090].eqiad.wmnet
----- OUTPUT of 'uptime |cut -d' ' -f 4-5' -----
19 days
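
As a rough sanity check on how much a quiet period tells us, one can ask how likely it would be to see no hangs at all if nothing had been fixed. The sketch below assumes hangs are independent and arrive at a constant rate per host-day (a Poisson model), which is a simplification. The inputs are approximate counts read off this task, not measured data.

```python
import math


def prob_no_failures(past_failures, past_host_days, quiet_host_days):
    """P(zero failures in the quiet window) if the old failure rate still held.

    Assumes independent failures at a constant rate per host-day.
    """
    rate = past_failures / past_host_days
    return math.exp(-rate * quiet_host_days)


# Rough inputs read off this task (assumptions, not measured data):
# 6 hangs on 16 hosts between 2018-08-30 and 2019-01-16 (139 days),
# then 16 hosts quiet for 19 days after the firmware upgrade.
p = prob_no_failures(6, 16 * 139, 16 * 19)
print(f"chance of 19 quiet days with no fix: {p:.2f}")
```

Under these assumptions the probability comes out a little under one half, so the quiet period becomes more convincing the longer it lasts.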

Change 494671 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "cache: Add kernel-proposed-updates component for cp1075-99"

https://gerrit.wikimedia.org/r/494671

Change 494671 abandoned by Vgutierrez:
Revert "cache: Add kernel-proposed-updates component for cp1075-99"

Reason:
this needs an intermediate commit for proper cleaning

https://gerrit.wikimedia.org/r/494671

Change 494673 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] cache: get rid of the wikimedia-kernel-updates apt component (1/2)

https://gerrit.wikimedia.org/r/494673

Change 494674 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] cache: get rid of the wikimedia-kernel-updates apt component (2/2)

https://gerrit.wikimedia.org/r/494674

Change 494673 merged by Vgutierrez:
[operations/puppet@production] cache: get rid of the wikimedia-kernel-updates apt component (1/2)

https://gerrit.wikimedia.org/r/494673

Change 494674 merged by Vgutierrez:
[operations/puppet@production] cache: get rid of the wikimedia-kernel-updates apt component (2/2)

https://gerrit.wikimedia.org/r/494674

Change 494690 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] aptrepo: Get rid of the no longer needed component/kernel-proposed-updates

https://gerrit.wikimedia.org/r/494690

Change 494690 merged by Vgutierrez:
[operations/puppet@production] aptrepo: Get rid of the no longer needed component/kernel-proposed-updates

https://gerrit.wikimedia.org/r/494690

@Vgutierrez The firmware update on the NICs fixed this for good, right? Can we close this task?

BBlack assigned this task to Vgutierrez.

Yes I think we're done here, AFAIK we haven't had any recurrences lately!

Mentioned in SAL (#wikimedia-operations) [2020-03-02T07:22:57Z] <vgutierrez> upgrading NICs FW on lvs2008 - T196560 T203194