Icinga started to to alert about cp1080 being unreachable with the usual flurry of IPsec issues on other hosts connecting to it:
.. 15:48 < icinga-wm> PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1080_v4, cp1080_v6 15:48 < icinga-wm> PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1080_v4, cp1080_v6 15:48 < icinga-wm> PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 connecting: cp1080_v4, cp1080_v6 ..
depooled it:
16:06 < mutante> !log dzahn@neodymium conftool action : set/pooled=no; selector: name=cp1080.eqiad.wmnet| reason: Strongswan CRITICALs fom Icinga (T201174)
could not ssh to it, powercycled it:
16:23 < mutante> !log cp1080 - powercycled - lots of RECOVERY from Icinga for IPsec connections - leaving depooled so far (T201174)
All the Icinga checks recovered but i left it depooled.
In syslog it all starts a few minutes before the alerts with kernel failures related to the bnxt_en Broadcom module:
1 | 9546 Aug 30 19:41:12 cp1080 kernel: [1809273.920929] ------------[ cut here ]------------ |
---|---|
2 | 9547 Aug 30 19:41:12 cp1080 kernel: [1809273.920966] WARNING: CPU: 29 PID: 0 at /build/linux-iyqref/linux-4.9.110/net/sched/sch_generic.c:316 dev_watchdog+0x233/0x240 |
3 | 9548 Aug 30 19:41:12 cp1080 kernel: [1809273.920971] NETDEV WATCHDOG: enp59s0f0 (bnxt_en): transmit queue 0 timed out |
4 | 9549 Aug 30 19:41:12 cp1080 kernel: [1809273.920974] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport drbg ansi_cprng seqiv xfrm4_mode_transport cpufreq_userspace cpufreq_conser vative cpufreq_powersave xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo intel_rapl skx_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp mg ag200 ttm kvm drm_kms_helper sg irqbypass drm crct10dif_pclmul crc32_pclmul i2c_algo_bit dcdbas mei_me evdev ghash_clmulni_intel iTCO_wdt lpc_ich pcspkr mfd_core mei shpchp iTCO_ve ndor_support ipmi_si button tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache ses enclosure sd_mod raid10 raid456 async_raid6_recov async_mem cpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ahci libahci crc32c_intel |
5 | 9550 Aug 30 19:41:12 cp1080 kernel: [1809273.921082] mpt3sas aesni_intel raid_class aes_x86_64 glue_helper lrw gf128mul xhci_pci ablk_helper scsi_transport_sas libata cryptd xhci_hcd n vme bnxt_en i2c_i801 nvme_core i2c_smbus usbcore scsi_mod usb_common |
6 | 9551 Aug 30 19:41:12 cp1080 kernel: [1809273.921115] CPU: 29 PID: 0 Comm: swapper/29 Not tainted 4.9.0-7-amd64 #1 Debian 4.9.110-3+deb9u1 |
7 | 9552 Aug 30 19:41:12 cp1080 kernel: [1809273.921117] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 1.3.7 02/09/2018 |
8 | 9553 Aug 30 19:41:12 cp1080 kernel: [1809273.921120] 0000000000000000 ffffffff9d530694 ffff993abd783e20 0000000000000000 |
9 | 9554 Aug 30 19:41:12 cp1080 kernel: [1809273.921126] ffffffff9d278d6e 0000000000000000 ffff993abd783e78 ffff990a9f830000 |
10 | 9555 Aug 30 19:41:12 cp1080 kernel: [1809273.921131] 000000000000001d ffff990aa2535bc0 000000000000004a ffffffff9d278def |
11 | 9556 Aug 30 19:41:12 cp1080 kernel: [1809273.921137] Call Trace: |
12 | 9557 Aug 30 19:41:12 cp1080 kernel: [1809273.921140] <IRQ> |
13 | 9558 Aug 30 19:41:12 cp1080 kernel: [1809273.921151] [<ffffffff9d530694>] ? dump_stack+0x5c/0x78 |
14 | 9559 Aug 30 19:41:12 cp1080 kernel: [1809273.921161] [<ffffffff9d278d6e>] ? __warn+0xbe/0xe0 |
15 | 9560 Aug 30 19:41:12 cp1080 kernel: [1809273.921166] [<ffffffff9d278def>] ? warn_slowpath_fmt+0x5f/0x80 |
16 | 9561 Aug 30 19:41:12 cp1080 kernel: [1809273.921175] [<ffffffff9d2b336d>] ? task_tick_fair+0x52d/0x930 |
17 | 9562 Aug 30 19:41:12 cp1080 kernel: [1809273.921181] [<ffffffff9d7377b3>] ? dev_watchdog+0x233/0x240 |
18 | 9563 Aug 30 19:41:12 cp1080 kernel: [1809273.921186] [<ffffffff9d737580>] ? dev_deactivate_queue.constprop.26+0x60/0x60 |
19 | 9564 Aug 30 19:41:12 cp1080 kernel: [1809273.921214] [<ffffffff9d2e6b22>] ? call_timer_fn+0x32/0x120 |
20 | 9565 Aug 30 19:41:12 cp1080 kernel: [1809273.921217] [<ffffffff9d2e6e97>] ? run_timer_softirq+0x1d7/0x420 |
21 | 9566 Aug 30 19:41:12 cp1080 kernel: [1809273.921224] [<ffffffff9d2f7f70>] ? tick_sched_do_timer+0x30/0x30 |
22 | 9567 Aug 30 19:41:12 cp1080 kernel: [1809273.921228] [<ffffffff9d2f7980>] ? tick_sched_handle.isra.12+0x20/0x50 |
23 | 9568 Aug 30 19:41:12 cp1080 kernel: [1809273.921232] [<ffffffff9d2f7fa8>] ? tick_sched_timer+0x38/0x70 |
24 | 9569 Aug 30 19:41:12 cp1080 kernel: [1809273.921242] [<ffffffff9d8186ba>] ? __do_softirq+0x10a/0x29e |
25 | 9570 Aug 30 19:41:12 cp1080 kernel: [1809273.921246] [<ffffffff9d27eefe>] ? irq_exit+0xae/0xb0 |
26 | 9571 Aug 30 19:41:12 cp1080 kernel: [1809273.921251] [<ffffffff9d818184>] ? smp_apic_timer_interrupt+0x44/0x50 |
27 | 9572 Aug 30 19:41:12 cp1080 kernel: [1809273.921255] [<ffffffff9d816a76>] ? apic_timer_interrupt+0x96/0xa0 |
28 | 9573 Aug 30 19:41:12 cp1080 kernel: [1809273.921257] <EOI> |
29 | 9574 Aug 30 19:41:12 cp1080 kernel: [1809273.921264] [<ffffffff9d6d7182>] ? cpuidle_enter_state+0xa2/0x2d0 |
30 | 9575 Aug 30 19:41:12 cp1080 kernel: [1809273.921267] [<ffffffff9d6d7170>] ? cpuidle_enter_state+0x90/0x2d0 |
31 | 9576 Aug 30 19:41:12 cp1080 kernel: [1809273.921274] [<ffffffff9d2bbd24>] ? cpu_startup_entry+0x154/0x240 |
32 | 9577 Aug 30 19:41:12 cp1080 kernel: [1809273.921281] [<ffffffff9d248bb0>] ? start_secondary+0x170/0x1b0 |
33 | 9578 Aug 30 19:41:12 cp1080 kernel: [1809273.921285] ---[ end trace ec1ba160f47f9507 ]--- |
34 | 9579 Aug 30 19:41:12 cp1080 kernel: [1809273.921294] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task! |
35 | 9580 Aug 30 19:41:13 cp1080 kernel: [1809274.862415] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
36 | 9581 Aug 30 19:41:13 cp1080 kernel: [1809274.869297] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
37 | 9582 Aug 30 19:41:14 cp1080 kernel: [1809275.779642] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
38 | 9583 Aug 30 19:41:14 cp1080 kernel: [1809275.786524] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
39 | 9584 Aug 30 19:41:15 cp1080 kernel: [1809276.693792] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
40 | 9585 Aug 30 19:41:15 cp1080 kernel: [1809276.700669] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
41 | 9586 Aug 30 19:41:15 cp1080 kernel: [1809277.610223] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
42 | 9587 Aug 30 19:41:15 cp1080 kernel: [1809277.617118] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
43 | 9588 Aug 30 19:41:16 cp1080 kernel: [1809278.528568] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
44 | 9589 Aug 30 19:41:16 cp1080 kernel: [1809278.535437] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
45 | 9590 Aug 30 19:41:17 cp1080 kernel: [1809279.448344] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
46 | 9591 Aug 30 19:41:17 cp1080 kernel: [1809279.455243] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
47 | 9592 Aug 30 19:41:18 cp1080 kernel: [1809280.368844] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
48 | 9593 Aug 30 19:41:18 cp1080 kernel: [1809280.375729] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
49 | 9594 Aug 30 19:41:19 cp1080 lldpd[1170]: unable to send packet on real device for enp59s0f0: No buffer space available |
50 | 9595 Aug 30 19:41:19 cp1080 lldpd[1167]: 2018-08-30T19:41:19 [WARN/lldp] unable to send packet on real device for enp59s0f0: No buffer space available |
51 | 9596 Aug 30 19:41:19 cp1080 kernel: [1809281.279662] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
52 | 9597 Aug 30 19:41:19 cp1080 kernel: [1809281.286545] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
53 | 9598 Aug 30 19:41:20 cp1080 kernel: [1809282.193967] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
54 | 9599 Aug 30 19:41:20 cp1080 kernel: [1809282.200885] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
55 | 9600 Aug 30 19:41:21 cp1080 varnishreqstats[2550]: Traceback (most recent call last): |
56 | 9601 Aug 30 19:41:21 cp1080 varnishreqstats[2550]: File "/usr/local/bin/varnishreqstats", line 121, in <module> |
57 | 9602 Aug 30 19:41:21 cp1080 varnishreqstats[2550]: ReqStatsSender(sys.argv[1:]).main() |
58 | 9603 Aug 30 19:41:21 cp1080 varnishreqstats[2550]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 149, in main |
59 | 9604 Aug 30 19:41:21 cp1080 varnishreqstats[2550]: self.handle_record(line) |
60 | 9605 Aug 30 19:41:21 cp1080 varnishreqstats[2550]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 133, in handle_record |
61 | 9606 Aug 30 19:41:21 cp1080 varnishreqstats[2550]: self.resolve_statsd_ip(now) |
62 | 9607 Aug 30 19:41:21 cp1080 varnishreqstats[2550]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 99, in resolve_statsd_ip |
63 | 9608 Aug 30 19:41:21 cp1080 varnishreqstats[2550]: statsd_ip = socket.gethostbyname(self.statsd_hostname) |
64 | 9609 Aug 30 19:41:21 cp1080 varnishreqstats[2550]: socket.gaierror: [Errno -3] Temporary failure in name resolution |
65 | 9610 Aug 30 19:41:21 cp1080 systemd[1]: varnishreqstats-frontend.service: Main process exited, code=exited, status=1/FAILURE |
66 | 9611 Aug 30 19:41:21 cp1080 systemd[1]: varnishreqstats-frontend.service: Unit entered failed state. |
67 | 9612 Aug 30 19:41:21 cp1080 systemd[1]: varnishreqstats-frontend.service: Failed with result 'exit-code'. |
68 | 9613 Aug 30 19:41:21 cp1080 kernel: [1809283.109547] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
69 | 9614 Aug 30 19:41:21 cp1080 kernel: [1809283.116410] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
70 | 9615 Aug 30 19:41:22 cp1080 kernel: [1809284.026761] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
71 | 9616 Aug 30 19:41:22 cp1080 kernel: [1809284.033622] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
72 | 9617 Aug 30 19:41:23 cp1080 kernel: [1809284.944888] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
73 | 9618 Aug 30 19:41:23 cp1080 kernel: [1809284.951749] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free tx failed. rc:-1 |
74 | 9619 Aug 30 19:41:24 cp1080 kernel: [1809285.808528] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
75 | 9620 Aug 30 19:41:24 cp1080 kernel: [1809285.815389] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
76 | 9621 Aug 30 19:41:25 cp1080 kernel: [1809286.707603] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
77 | 9622 Aug 30 19:41:25 cp1080 kernel: [1809286.714475] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
78 | 9623 Aug 30 19:41:25 cp1080 kernel: [1809287.620964] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
79 | 9624 Aug 30 19:41:25 cp1080 kernel: [1809287.627838] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
80 | 9625 Aug 30 19:41:26 cp1080 systemd[1]: varnishreqstats-frontend.service: Service hold-off time over, scheduling restart. |
81 | 9626 Aug 30 19:41:26 cp1080 systemd[1]: Stopped Varnish Request Stats StatsD reporter. |
82 | 9627 Aug 30 19:41:26 cp1080 systemd[1]: Started Varnish Request Stats StatsD reporter. |
83 | 9628 Aug 30 19:41:26 cp1080 kernel: [1809288.534169] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
84 | 9629 Aug 30 19:41:26 cp1080 kernel: [1809288.541049] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
85 | 9630 Aug 30 19:41:27 cp1080 kernel: [1809289.422863] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
86 | 9631 Aug 30 19:41:27 cp1080 kernel: [1809289.429727] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
87 | 9632 Aug 30 19:41:28 cp1080 kernel: [1809290.294522] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
88 | 9633 Aug 30 19:41:28 cp1080 kernel: [1809290.301379] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
89 | 9634 Aug 30 19:41:29 cp1080 kernel: [1809291.193288] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
90 | 9635 Aug 30 19:41:29 cp1080 kernel: [1809291.200174] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
91 | 9636 Aug 30 19:41:30 cp1080 kernel: [1809292.107024] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
92 | 9637 Aug 30 19:41:30 cp1080 kernel: [1809292.113911] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
93 | 9638 Aug 30 19:41:31 cp1080 kernel: [1809293.023559] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
94 | 9639 Aug 30 19:41:31 cp1080 kernel: [1809293.030449] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
95 | 9640 Aug 30 19:41:32 cp1080 kernel: [1809293.938329] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
96 | 9641 Aug 30 19:41:32 cp1080 kernel: [1809293.945185] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
97 | 9642 Aug 30 19:41:33 cp1080 kernel: [1809294.852956] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
98 | 9643 Aug 30 19:41:33 cp1080 kernel: [1809294.859812] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
99 | 9644 Aug 30 19:41:34 cp1080 kernel: [1809295.769910] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
100 | 9645 Aug 30 19:41:34 cp1080 kernel: [1809295.776791] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
101 | 9646 Aug 30 19:41:35 cp1080 kernel: [1809296.685620] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
102 | 9647 Aug 30 19:41:35 cp1080 kernel: [1809296.692482] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
103 | 9648 Aug 30 19:41:35 cp1080 kernel: [1809297.542229] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
104 | 9649 Aug 30 19:41:35 cp1080 kernel: [1809297.549100] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
105 | 9650 Aug 30 19:41:36 cp1080 kernel: [1809298.448563] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
106 | 9651 Aug 30 19:41:36 cp1080 kernel: [1809298.455434] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
107 | 9652 Aug 30 19:41:37 cp1080 kernel: [1809299.357676] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
108 | 9653 Aug 30 19:41:37 cp1080 kernel: [1809299.364552] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
109 | 9654 Aug 30 19:41:38 cp1080 kernel: [1809300.274744] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
110 | 9655 Aug 30 19:41:38 cp1080 kernel: [1809300.281624] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
111 | 9656 Aug 30 19:41:39 cp1080 kernel: [1809301.190199] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
112 | 9657 Aug 30 19:41:39 cp1080 kernel: [1809301.197062] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
113 | 9658 Aug 30 19:41:40 cp1080 kernel: [1809302.072518] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
114 | 9659 Aug 30 19:41:40 cp1080 kernel: [1809302.079395] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
115 | 9660 Aug 30 19:41:41 cp1080 kernel: [1809302.973032] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
116 | 9661 Aug 30 19:41:41 cp1080 kernel: [1809302.979909] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
117 | 9662 Aug 30 19:41:42 cp1080 kernel: [1809303.855811] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
118 | 9663 Aug 30 19:41:42 cp1080 kernel: [1809303.862693] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
119 | 9664 Aug 30 19:41:43 cp1080 kernel: [1809304.770523] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
120 | 9665 Aug 30 19:41:43 cp1080 kernel: [1809304.777430] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
121 | 9666 Aug 30 19:41:44 cp1080 kernel: [1809305.686517] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
122 | 9667 Aug 30 19:41:44 cp1080 kernel: [1809305.693378] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
123 | 9668 Aug 30 19:41:45 cp1080 kernel: [1809306.601203] bnxt_en 0000:3b:00.0 enp59s0f0: Resp cmpl intr err msg: 0x51 |
124 | 9669 Aug 30 19:41:45 cp1080 kernel: [1809306.608068] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free rx failed. rc:-1 |
125 | 9670 Aug 30 19:41:45 cp1080 kernel: [1809307.517259] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a2} len:0 |
126 | 9671 Aug 30 19:41:45 cp1080 kernel: [1809307.525452] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
127 | 9672 Aug 30 19:41:46 cp1080 kernel: [1809308.435681] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a3} len:0 |
128 | 9673 Aug 30 19:41:46 cp1080 kernel: [1809308.443848] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
129 | 9674 Aug 30 19:41:47 cp1080 kernel: [1809309.303532] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a4} len:0 |
130 | 9675 Aug 30 19:41:47 cp1080 kernel: [1809309.311706] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
131 | 9676 Aug 30 19:41:48 cp1080 kernel: [1809310.210782] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a5} len:0 |
132 | 9677 Aug 30 19:41:48 cp1080 kernel: [1809310.218938] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
133 | 9678 Aug 30 19:41:49 cp1080 kernel: [1809311.064632] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a6} len:0 |
134 | 9679 Aug 30 19:41:49 cp1080 kernel: [1809311.072806] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
135 | 9680 Aug 30 19:41:49 cp1080 lldpd[1170]: unable to send packet on real device for enp59s0f0: No buffer space available |
136 | 9681 Aug 30 19:41:49 cp1080 lldpd[1167]: 2018-08-30T19:41:49 [WARN/lldp] unable to send packet on real device for enp59s0f0: No buffer space available |
137 | 9682 Aug 30 19:41:50 cp1080 kernel: [1809311.949349] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a7} len:0 |
138 | 9683 Aug 30 19:41:50 cp1080 kernel: [1809311.957509] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
139 | 9684 Aug 30 19:41:51 cp1080 kernel: [1809312.803086] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a8} len:0 |
140 | 9685 Aug 30 19:41:51 cp1080 kernel: [1809312.811237] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
141 | 9686 Aug 30 19:41:52 cp1080 kernel: [1809313.655826] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8a9} len:0 |
142 | 9687 Aug 30 19:41:52 cp1080 kernel: [1809313.663991] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
143 | 9688 Aug 30 19:41:52 cp1080 kernel: [1809314.535686] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8aa} len:0 |
144 | 9689 Aug 30 19:41:52 cp1080 kernel: [1809314.543878] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
145 | 9690 Aug 30 19:41:53 cp1080 kernel: [1809315.456720] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8ab} len:0 |
146 | 9691 Aug 30 19:41:53 cp1080 kernel: [1809315.464885] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
147 | 9692 Aug 30 19:41:54 cp1080 kernel: [1809316.377057] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8ac} len:0 |
148 | 9693 Aug 30 19:41:54 cp1080 kernel: [1809316.385235] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
149 | 9694 Aug 30 19:41:55 cp1080 kernel: [1809317.297106] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x51 0x8ad} len:0 |
150 | 9695 Aug 30 19:41:55 cp1080 kernel: [1809317.305263] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm_ring_free cp failed. rc:-1 |
151 | 9696 Aug 30 19:41:56 cp1080 kernel: [1809318.213093] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x61 0x8ae} len:0 |
152 | 9697 Aug 30 19:41:57 cp1080 kernel: [1809319.184394] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x50 0x8af} len:0 |
153 | 9698 Aug 30 19:41:57 cp1080 kernel: [1809319.192577] bnxt_en 0000:3b:00.0 enp59s0f0: Invalid ring |
154 | 9699 Aug 30 19:41:57 cp1080 kernel: [1809319.198069] bnxt_en 0000:3b:00.0 enp59s0f0: hwrm ring alloc failure rc: ffffffff |
155 | 9700 Aug 30 19:41:58 cp1080 kernel: [1809320.089628] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0xb1 0x8b0} len:0 |
156 | 9701 Aug 30 19:41:58 cp1080 kernel: [1809320.097795] bnxt_en 0000:3b:00.0 enp59s0f0: bnxt_init_nic err: ffffffff |
157 | 9702 Aug 30 19:41:58 cp1080 kernel: [1809320.122391] bnxt_en 0000:3b:00.0 enp59s0f0: nic open fail (rc: ffffffff``` |
and then the network goes down and there are all kinds of issues like:
1 | ) |
---|---|
2 | 9703 Aug 30 19:41:59 cp1080 varnishmedia[2545]: Traceback (most recent call last): |
3 | 9704 Aug 30 19:41:59 cp1080 varnishmedia[2545]: File "/usr/local/bin/varnishmedia", line 61, in <module> |
4 | 9705 Aug 30 19:41:59 cp1080 varnishmedia[2545]: MediaCacheStatsSender(sys.argv[1:]).main() |
5 | 9706 Aug 30 19:41:59 cp1080 varnishmedia[2545]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 149, in main |
6 | 9707 Aug 30 19:41:59 cp1080 varnishmedia[2545]: self.handle_record(line) |
7 | 9708 Aug 30 19:41:59 cp1080 varnishmedia[2545]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 133, in handle_record |
8 | 9709 Aug 30 19:41:59 cp1080 varnishmedia[2545]: self.resolve_statsd_ip(now) |
9 | 9710 Aug 30 19:41:59 cp1080 varnishmedia[2545]: File "/usr/local/lib/python2.7/dist-packages/cachestats.py", line 99, in resolve_statsd_ip |
10 | 9711 Aug 30 19:41:59 cp1080 varnishmedia[2545]: statsd_ip = socket.gethostbyname(self.statsd_hostname) |
11 | 9712 Aug 30 19:41:59 cp1080 varnishmedia[2545]: socket.gaierror: [Errno -3] Temporary failure in name resolution |
12 | 9713 Aug 30 19:41:59 cp1080 systemd[1]: varnishmedia.service: Main process exited, code=exited, status=1/FAILURE |
13 | 9714 Aug 30 19:41:59 cp1080 systemd[1]: varnishmedia.service: Unit entered failed state. |
14 | 9715 Aug 30 19:41:59 cp1080 systemd[1]: varnishmedia.service: Failed with result 'exit-code'. |
15 | 9716 Aug 30 19:41:59 cp1080 kernel: [1809321.075598] bnxt_en 0000:3b:00.0 enp59s0f0: Error (timeout: 500) msg {0x20 0x8b1} len:0 |
16 | 9717 Aug 30 19:41:59 cp1080 lldpd[1170]: removal request for address of 2620:0:861:102:10:64:16:23%2, but no knowledge of it |
17 | 9718 Aug 30 19:41:59 cp1080 lldpd[1167]: 2018-08-30T19:41:59 [INFO/netlink] removal request for address of 2620:0:861:102:10:64:16:23%2, but no knowledge of it |
18 | 9719 Aug 30 19:41:59 cp1080 lldpd[1170]: removal request for address of fe80::20a:f7ff:feef:eba0%2, but no knowledge of it |
19 | 9720 Aug 30 19:41:59 cp1080 lldpd[1167]: 2018-08-30T19:41:59 [INFO/netlink] removal request for address of fe80::20a:f7ff:feef:eba0%2, but no knowledge of it |
20 | 9721 Aug 30 19:41:59 cp1080 lldpd[1170]: error while receiving frame on enp59s0f0 (retry: 0): Network is down |
21 | 9722 Aug 30 19:41:59 cp1080 lldpd[1167]: 2018-08-30T19:41:59 [WARN/interfaces] error while receiving frame on enp59s0f0 (retry: 0): Network is down |
22 | 9723 Aug 30 19:41:59 cp1080 varnishospital[2547]: --- Logging error --- |
23 | 9724 Aug 30 19:41:59 cp1080 varnishospital[2547]: Traceback (most recent call last): |
24 | 9725 Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/lib/python3.5/logging/handlers.py", line 622, in emit |
25 | 9726 Aug 30 19:41:59 cp1080 varnishospital[2547]: self.send(s) |
26 | 9727 Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/lib/python3.5/logging/handlers.py", line 680, in send |
27 | 9728 Aug 30 19:41:59 cp1080 varnishospital[2547]: self.sock.sendto(s, self.address) |
28 | 9729 Aug 30 19:41:59 cp1080 varnishospital[2547]: socket.gaierror: [Errno -3] Temporary failure in name resolution |
29 | 9730 Aug 30 19:41:59 cp1080 varnishospital[2547]: Call stack: |
30 | 9731 Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/local/bin/varnishospital", line 87, in <module> |
31 | 9732 Aug 30 19:41:59 cp1080 varnishospital[2547]: VarnishHospital(sys.argv[1:]).main() |
32 | 9733 Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/local/lib/python3.5/dist-packages/wikimedia_varnishlogconsumer.py", line 189, in main |
33 | 9734 Aug 30 19:41:59 cp1080 varnishospital[2547]: self.handle_line(line) |
34 | 9735 Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/local/bin/varnishospital", line 83, in handle_line |
35 | 9736 Aug 30 19:41:59 cp1080 varnishospital[2547]: self.logger.info("{} {}".format(log['origin_server'], log['transition']), extra=log) |
36 | 9737 Aug 30 19:41:59 cp1080 varnishospital[2547]: Message: 'be_cp1078_eqiad_wmnet Went sick' |
37 | 9738 Aug 30 19:41:59 cp1080 varnishospital[2547]: Arguments: () |
38 | 9739 Aug 30 19:41:59 cp1080 varnishospital[2547]: --- Logging error --- |
39 | 9740 Aug 30 19:41:59 cp1080 varnishospital[2547]: Traceback (most recent call last): |
40 | 9741 Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/lib/python3.5/logging/handlers.py", line 622, in emit |
41 | 9742 Aug 30 19:41:59 cp1080 varnishospital[2547]: self.send(s) |
42 | 9743 Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/lib/python3.5/logging/handlers.py", line 680, in send |
43 | 9744 Aug 30 19:41:59 cp1080 varnishospital[2547]: self.sock.sendto(s, self.address) |
44 | 9745 Aug 30 19:41:59 cp1080 varnishospital[2547]: socket.gaierror: [Errno -3] Temporary failure in name resolution |
45 | 9746 Aug 30 19:41:59 cp1080 varnishospital[2547]: Call stack: |
46 | 9747 Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/local/bin/varnishospital", line 87, in <module> |
47 | 9748 Aug 30 19:41:59 cp1080 varnishospital[2547]: VarnishHospital(sys.argv[1:]).main() |
48 | 9749 Aug 30 19:41:59 cp1080 varnishospital[2547]: File "/usr/local/lib/python3.5/dist-packages/wikimedia_varnishlogconsumer.py", line 189, in main |
49 | |
50 | |
51 | 9938 Aug 30 19:42:04 cp1080 systemd[1]: varnishstatsd-default.service: Unit entered failed state. |
52 | 9939 Aug 30 19:42:04 cp1080 systemd[1]: varnishstatsd-default.service: Failed with result 'exit-code'. |
53 | 9940 Aug 30 19:42:04 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1002.eqiad.wmnet:9093/1002: Failed to resolve 'kafka-jumbo1002.eqiad.wmnet:9093': Temporar y failure in name resolution |
54 | 9941 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1006.eqiad.wmnet:9093/1006: 4 request(s) timed out: disconnect |
55 | 9942 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1005.eqiad.wmnet:9093/1005: 4 request(s) timed out: disconnect |
56 | 9943 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1003.eqiad.wmnet:9093/1003: 6 request(s) timed out: disconnect |
57 | 9944 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1004.eqiad.wmnet:9093/1004: 4 request(s) timed out: disconnect |
58 | 9945 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-185): ssl://kafka-jumbo1001.eqiad.wmnet:9093/1001: 4 request(s) timed out: disconnect |
59 | 9946 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-187): 6/6 brokers are down |
60 | 9947 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1006.eqiad.wmnet:9093/1006: Failed to resolve 'kafka-jumbo1006.eqiad.wmnet:9093': Temporar y failure in name resolution |
61 | 9948 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1005.eqiad.wmnet:9093/1005: Failed to resolve 'kafka-jumbo1005.eqiad.wmnet:9093': Temporar y failure in name resolution |
62 | 9949 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1003.eqiad.wmnet:9093/1003: Failed to resolve 'kafka-jumbo1003.eqiad.wmnet:9093': Temporar y failure in name resolution |
63 | 9950 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1004.eqiad.wmnet:9093/1004: Failed to resolve 'kafka-jumbo1004.eqiad.wmnet:9093': Temporar y failure in name resolution |
64 | 9951 Aug 30 19:42:05 cp1080 varnishkafka[2543]: KAFKAERR: Kafka error (-193): ssl://kafka-jumbo1001.eqiad.wmnet:9093/1001: Failed to resolve 'kafka-jumbo1001.eqiad.wmnet:9093': Temporar y failure in name resolution |
65 | |
66 | |
67 | |
68 | 10139 Aug 30 19:42:37 cp1080 varnishslowlog[211237]: --- Logging error --- |
69 | 10140 Aug 30 19:42:37 cp1080 varnishslowlog[211237]: Traceback (most recent call last): |
70 | 10141 Aug 30 19:42:37 cp1080 varnishslowlog[211237]: File "/usr/lib/python3.5/logging/handlers.py", line 622, in emit |
71 | 10142 Aug 30 19:42:37 cp1080 varnishslowlog[211237]: self.send(s) |
72 | 10143 Aug 30 19:42:37 cp1080 varnishslowlog[211237]: File "/usr/lib/python3.5/logging/handlers.py", line 680, in send |
73 | 10144 Aug 30 19:42:37 cp1080 varnishslowlog[211237]: self.sock.sendto(s, self.address) |
74 | 10145 Aug 30 19:42:37 cp1080 varnishslowlog[211237]: socket.gaierror: [Errno -3] Temporary failure in name resolution |
75 | |
76 | |
77 | |
78 | 12417 Aug 30 20:09:39 cp1080 confd[1054]: 2018-08-30T20:09:39Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and fail ed) [0] |
79 | 12418 Aug 30 20:09:39 cp1080 confd[1054]: 2018-08-30T20:09:39Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and fail ed) [0] |
80 | 12419 Aug 30 20:09:41 cp1080 confd[1054]: 2018-08-30T20:09:41Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and fail ed) [0] |
81 | 12420 Aug 30 20:09:41 cp1080 confd[1054]: 2018-08-30T20:09:41Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and fail ed) [0] |
82 | 12421 Aug 30 20:09:44 cp1080 confd[1054]: 2018-08-30T20:09:44Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and fail ed) [0] |
83 | 12422 Aug 30 20:09:44 cp1080 confd[1054]: 2018-08-30T20:09:44Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and fail ed) [0] |
84 | 12423 Aug 30 20:09:46 cp1080 confd[1054]: 2018-08-30T20:09:46Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and fail ed) [0] |
85 | 12424 Aug 30 20:09:46 cp1080 confd[1054]: 2018-08-30T20:09:46Z cp1080 /usr/bin/confd[1054]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and fail ed) [0] |
86 |