cp1078 crashed earlier today; this resulted in a bunch of strongswan alerts firing.
I rebooted it from mgmt and everything seems to be recovering.
The syslog suggests that the box wasn't actually all the way down before I restarted it, just in distress. Here are its last few minutes:
Nov 18 17:25:52 cp1078 confd[1060]: 2018-11-18T17:25:52Z cp1078 /usr/bin/confd[1060]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Nov 18 17:25:52 cp1078 confd[1060]: 2018-11-18T17:25:52Z cp1078 /usr/bin/confd[1060]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Nov 18 17:25:54 cp1078 charon: 08[IKE] initiating IKE_SA cp4023_v4[7230] to 10.128.0.123
Nov 18 17:25:55 cp1078 charon: 11[IKE] initiating IKE_SA cp2014_v4[7676] to 10.192.32.113
Nov 18 17:25:55 cp1078 kernel: [2000851.822322] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!
Nov 18 17:25:56 cp1078 rsyslogd: cannot resolve hostname 'syslog.codfw.wmnet': Connection timed out [v8.38.0 try http://www.rsyslog.com/e/2027 ]
Nov 18 17:25:58 cp1078 charon: 11[IKE] initiating IKE_SA cp4023_v4[6875] to 10.128.0.123
Nov 18 17:25:59 cp1078 charon: 15[IKE] initiating IKE_SA cp4026_v4[8694] to 10.128.0.126
Nov 18 17:25:59 cp1078 charon: 12[IKE] initiating IKE_SA cp5002_v6[8100] to 2001:df2:e500:101:10:132:0:102
Nov 18 17:26:00 cp1078 charon: 06[IKE] initiating IKE_SA cp4023_v4[8585] to 10.128.0.123
Nov 18 17:26:00 cp1078 kernel: [2000856.930284] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!
Nov 18 17:26:01 cp1078 CRON[103748]: (root) CMD (/usr/bin/logster -o statsd --statsd-host=statsd.eqiad.wmnet:8125 --metric-prefix=varnishkafka.cp1078.webrequest.upload JsonLogster /var/cache/varnishkafka/webrequest.stats.json > /dev/null 2>&1)
Nov 18 17:26:01 cp1078 CRON[103749]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Nov 18 17:26:01 cp1078 CRON[103750]: (root) CMD (systemctl is-active -q vhtcpd && test -s /tmp/vhtcpd.stats && /usr/local/bin/prometheus-vhtcpd-stats --outfile /var/lib/prometheus/node.d/vhtcpd.prom)
Nov 18 17:26:01 cp1078 charon: 12[IKE] initiating IKE_SA cp2011_v6[8186] to 2620:0:860:102:10:192:16:137
Nov 18 17:26:02 cp1078 charon: 06[IKE] initiating IKE_SA cp5002_v6[7603] to 2001:df2:e500:101:10:132:0:102
Nov 18 17:26:06 cp1078 charon: 04[IKE] initiating IKE_SA cp2014_v4[8205] to 10.192.32.113
Nov 18 17:26:06 cp1078 kernel: [2000862.804440] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!
Nov 18 17:26:06 cp1078 charon: 14[IKE] initiating IKE_SA cp3044_v4[7561] to 10.20.0.179
Nov 18 17:26:07 cp1078 charon: 09[IKE] initiating IKE_SA cp2011_v6[6567] to 2620:0:860:102:10:192:16:137
Nov 18 17:26:07 cp1078 lldpd[1164]: unable to send packet on real device for enp59s0f0: No buffer space available
Nov 18 17:26:07 cp1078 lldpd[1162]: 2018-11-18T17:26:07 [WARN/lldp] unable to send packet on real device for enp59s0f0: No buffer space available
Nov 18 17:26:08 cp1078 charon: 05[IKE] initiating IKE_SA cp3044_v4[8332] to 10.20.0.179
Nov 18 17:26:09 cp1078 confd[1060]: 2018-11-18T17:26:09Z cp1078 /usr/bin/confd[1060]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Nov 18 17:26:09 cp1078 confd[1060]: 2018-11-18T17:26:09Z cp1078 /usr/bin/confd[1060]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Nov 18 17:26:10 cp1078 charon: 07[IKE] initiating IKE_SA cp4026_v4[8802] to 10.128.0.126
Nov 18 17:26:11 cp1078 systemd[1]: Stopped target Graphical Interface.
Nov 18 16:27:31 cp1078 kernel: [1997356.199019] ------------[ cut here ]------------
Nov 18 16:27:31 cp1078 kernel: [1997356.199032] WARNING: CPU: 5 PID: 0 at /build/linux-IWeKxA/linux-4.9.110/net/sched/sch_generic.c:316 dev_watchdog+0x233/0x240
Nov 18 16:27:31 cp1078 kernel: [1997356.199034] NETDEV WATCHDOG: enp59s0f0 (bnxt_en): transmit queue 0 timed out
Nov 18 16:27:31 cp1078 kernel: [1997356.199035] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport cpufreq_userspace cpufreq_powersave cpufreq_conservative drbg ansi_cprng seqiv xfrm4_mode_transport xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo intel_rapl skx_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 kvm ttm drm_kms_helper irqbypass drm crct10dif_pclmul crc32_pclmul iTCO_wdt dcdbas evdev ghash_clmulni_intel sg pcspkr mei_me i2c_algo_bit iTCO_vendor_support lpc_ich mei mfd_core shpchp ipmi_si button tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear raid1 md_mod ses enclosure sd_mod crc32c_intel ahci
Nov 18 16:27:31 cp1078 kernel: [1997356.199134] mpt3sas aesni_intel raid_class aes_x86_64 glue_helper lrw libahci gf128mul ablk_helper xhci_pci scsi_transport_sas libata xhci_hcd cryptd bnxt_en nvme i2c_i801 nvme_core i2c_smbus usbcore scsi_mod usb_common
Nov 18 16:27:31 cp1078 kernel: [1997356.199165] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u6
Nov 18 16:27:31 cp1078 kernel: [1997356.199167] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 1.3.7 02/09/2018
Nov 18 16:27:31 cp1078 kernel: [1997356.199170] 0000000000000000 ffffffffb3d31e54 ffff92febd483e20 0000000000000000
Nov 18 16:27:31 cp1078 kernel: [1997356.199176] ffffffffb3a794fe 0000000000000000 ffff92febd483e78 ffff92ce9f8d0000
Nov 18 16:27:31 cp1078 kernel: [1997356.199181] 0000000000000005 ffff92cea249dbc0 000000000000004a ffffffffb3a7957f
Nov 18 16:27:31 cp1078 kernel: [1997356.199186] Call Trace:
Nov 18 16:27:31 cp1078 kernel: [1997356.199189] <IRQ>
Nov 18 16:27:31 cp1078 kernel: [1997356.199199] [<ffffffffb3d31e54>] ? dump_stack+0x5c/0x78
Nov 18 16:27:31 cp1078 kernel: [1997356.199204] [<ffffffffb3a794fe>] ? __warn+0xbe/0xe0
Nov 18 16:27:31 cp1078 kernel: [1997356.199207] [<ffffffffb3a7957f>] ? warn_slowpath_fmt+0x5f/0x80
Nov 18 16:27:31 cp1078 kernel: [1997356.199215] [<ffffffffb3ab1e52>] ? enqueue_task_fair+0x82/0x940
Nov 18 16:27:31 cp1078 kernel: [1997356.199221] [<ffffffffb3f387b3>] ? dev_watchdog+0x233/0x240
Nov 18 16:27:31 cp1078 kernel: [1997356.199226] [<ffffffffb3f38580>] ? dev_deactivate_queue.constprop.26+0x60/0x60
Nov 18 16:27:31 cp1078 kernel: [1997356.199231] [<ffffffffb3ae7622>] ? call_timer_fn+0x32/0x120
Nov 18 16:27:31 cp1078 kernel: [1997356.199234] [<ffffffffb3ae7997>] ? run_timer_softirq+0x1d7/0x420
Nov 18 16:27:31 cp1078 kernel: [1997356.199240] [<ffffffffb3af8a80>] ? tick_sched_do_timer+0x30/0x30
Nov 18 16:27:31 cp1078 kernel: [1997356.199244] [<ffffffffb3d3af54>] ? timerqueue_add+0x54/0xa0
Nov 18 16:27:31 cp1078 kernel: [1997356.199248] [<ffffffffb3ae9678>] ? enqueue_hrtimer+0x38/0x80
Nov 18 16:27:31 cp1078 kernel: [1997356.199256] [<ffffffffb401974d>] ? __do_softirq+0x10d/0x2a5
Nov 18 16:27:31 cp1078 kernel: [1997356.199261] [<ffffffffb3a7f9f0>] ? irq_exit+0xb0/0xc0
Nov 18 16:27:31 cp1078 kernel: [1997356.199266] [<ffffffffb40191cc>] ? smp_apic_timer_interrupt+0x4c/0x60
Nov 18 16:27:31 cp1078 kernel: [1997356.199271] [<ffffffffb4017a66>] ? apic_timer_interrupt+0x96/0xa0
Nov 18 16:27:31 cp1078 kernel: [1997356.199272] <EOI>
Nov 18 16:27:31 cp1078 kernel: [1997356.199279] [<ffffffffb3ed89f2>] ? cpuidle_enter_state+0xa2/0x2d0
Nov 18 16:27:31 cp1078 kernel: [1997356.199282] [<ffffffffb3ed89e0>] ? cpuidle_enter_state+0x90/0x2d0
Nov 18 16:27:31 cp1078 kernel: [1997356.199288] [<ffffffffb3abc824>] ? cpu_startup_entry+0x154/0x240
Nov 18 16:27:31 cp1078 kernel: [1997356.199295] [<ffffffffb3a48db0>] ? start_secondary+0x170/0x1b0
Nov 18 16:27:31 cp1078 kernel: [1997356.199298] ---[ end trace aa84d59894952a2a ]---
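
If anyone wants to check how long the NIC was unhappy before the reboot, something along these lines could pull the bnxt_en timeout and watchdog lines out of a syslog file and count them per minute. This is only a rough sketch: the function names (nic_events, events_per_minute) and the regexes are made up for illustration, and it assumes the classic "Mon DD HH:MM:SS host tag: message" syslog layout seen in the excerpts above.

```python
import re
from collections import Counter

# Assumed layout: "Nov 18 17:25:55 cp1078 kernel: message"
# (optionally "tag[pid]:"). Adjust if rsyslog uses another template.
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^:\[\s]+)(?:\[\d+\])?: (?P<msg>.*)$"
)

# The two kernel messages that show up in the excerpts in this report.
NIC_RE = re.compile(r"TX timeout detected|NETDEV WATCHDOG")


def nic_events(lines):
    """Return (timestamp, message) for each kernel NIC timeout/watchdog line."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and m.group("tag") == "kernel" and NIC_RE.search(m.group("msg")):
            events.append((m.group("ts"), m.group("msg")))
    return events


def events_per_minute(events):
    """Count events per 'Mon DD HH:MM' bucket (seconds stripped)."""
    return Counter(ts[:-3] for ts, _ in events)
```

Feeding it an open syslog file (for example `nic_events(open(path))`, with whatever path applies on the host) and printing the per-minute counts should show whether the TX timeouts started around the 16:27 watchdog warning or only shortly before the reboot.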