cp1078 crash
Closed, Duplicate (Public)

Description

cp1078 crashed earlier today; this resulted in a bunch of strongswan alerts firing.

I rebooted it from mgmt and everything seems to be recovering.

Event Timeline

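The syslog suggests that the box wasn't fully down before I restarted it, just in distress: the NIC (enp59s0f0, bnxt_en) kept hitting TX timeouts and resetting, and confd and rsyslogd could not reach their peers. Here are its last few minutes: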

Nov 18 17:25:52 cp1078 confd[1060]: 2018-11-18T17:25:52Z cp1078 /usr/bin/confd[1060]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Nov 18 17:25:52 cp1078 confd[1060]: 2018-11-18T17:25:52Z cp1078 /usr/bin/confd[1060]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Nov 18 17:25:54 cp1078 charon: 08[IKE] initiating IKE_SA cp4023_v4[7230] to 10.128.0.123
Nov 18 17:25:55 cp1078 charon: 11[IKE] initiating IKE_SA cp2014_v4[7676] to 10.192.32.113
Nov 18 17:25:55 cp1078 kernel: [2000851.822322] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!
Nov 18 17:25:56 cp1078 rsyslogd: cannot resolve hostname 'syslog.codfw.wmnet': Connection timed out [v8.38.0 try http://www.rsyslog.com/e/2027 ]
Nov 18 17:25:58 cp1078 charon: 11[IKE] initiating IKE_SA cp4023_v4[6875] to 10.128.0.123
Nov 18 17:25:59 cp1078 charon: 15[IKE] initiating IKE_SA cp4026_v4[8694] to 10.128.0.126
Nov 18 17:25:59 cp1078 charon: 12[IKE] initiating IKE_SA cp5002_v6[8100] to 2001:df2:e500:101:10:132:0:102
Nov 18 17:26:00 cp1078 charon: 06[IKE] initiating IKE_SA cp4023_v4[8585] to 10.128.0.123
Nov 18 17:26:00 cp1078 kernel: [2000856.930284] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!
Nov 18 17:26:01 cp1078 CRON[103748]: (root) CMD (/usr/bin/logster -o statsd --statsd-host=statsd.eqiad.wmnet:8125 --metric-prefix=varnishkafka.cp1078.webrequest.upload JsonLogster /var/cache/varnishkafka/webrequest.stats.json > /dev/null 2>&1)
Nov 18 17:26:01 cp1078 CRON[103749]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Nov 18 17:26:01 cp1078 CRON[103750]: (root) CMD (systemctl is-active -q vhtcpd && test -s /tmp/vhtcpd.stats && /usr/local/bin/prometheus-vhtcpd-stats --outfile /var/lib/prometheus/node.d/vhtcpd.prom)
Nov 18 17:26:01 cp1078 charon: 12[IKE] initiating IKE_SA cp2011_v6[8186] to 2620:0:860:102:10:192:16:137
Nov 18 17:26:02 cp1078 charon: 06[IKE] initiating IKE_SA cp5002_v6[7603] to 2001:df2:e500:101:10:132:0:102
Nov 18 17:26:06 cp1078 charon: 04[IKE] initiating IKE_SA cp2014_v4[8205] to 10.192.32.113
Nov 18 17:26:06 cp1078 kernel: [2000862.804440] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!
Nov 18 17:26:06 cp1078 charon: 14[IKE] initiating IKE_SA cp3044_v4[7561] to 10.20.0.179
Nov 18 17:26:07 cp1078 charon: 09[IKE] initiating IKE_SA cp2011_v6[6567] to 2620:0:860:102:10:192:16:137
Nov 18 17:26:07 cp1078 lldpd[1164]: unable to send packet on real device for enp59s0f0: No buffer space available
Nov 18 17:26:07 cp1078 lldpd[1162]: 2018-11-18T17:26:07 [WARN/lldp] unable to send packet on real device for enp59s0f0: No buffer space available
Nov 18 17:26:08 cp1078 charon: 05[IKE] initiating IKE_SA cp3044_v4[8332] to 10.20.0.179
Nov 18 17:26:09 cp1078 confd[1060]: 2018-11-18T17:26:09Z cp1078 /usr/bin/confd[1060]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Nov 18 17:26:09 cp1078 confd[1060]: 2018-11-18T17:26:09Z cp1078 /usr/bin/confd[1060]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Nov 18 17:26:10 cp1078 charon: 07[IKE] initiating IKE_SA cp4026_v4[8802] to 10.128.0.126
Nov 18 17:26:11 cp1078 systemd[1]: Stopped target Graphical Interface.
Nov 18 16:27:31 cp1078 kernel: [1997356.199019] ------------[ cut here ]------------
Nov 18 16:27:31 cp1078 kernel: [1997356.199032] WARNING: CPU: 5 PID: 0 at /build/linux-IWeKxA/linux-4.9.110/net/sched/sch_generic.c:316 dev_watchdog+0x233/0x240
Nov 18 16:27:31 cp1078 kernel: [1997356.199034] NETDEV WATCHDOG: enp59s0f0 (bnxt_en): transmit queue 0 timed out
Nov 18 16:27:31 cp1078 kernel: [1997356.199035] Modules linked in: binfmt_misc esp6 xfrm6_mode_transport cpufreq_userspace cpufreq_powersave cpufreq_conservative drbg ansi_cprng seqiv xfrm4_mode_transport xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo intel_rapl skx_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 kvm ttm drm_kms_helper irqbypass drm crct10dif_pclmul crc32_pclmul iTCO_wdt dcdbas evdev ghash_clmulni_intel sg pcspkr mei_me i2c_algo_bit iTCO_vendor_support lpc_ich mei mfd_core shpchp ipmi_si button tcp_bbr sch_fq ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear raid1 md_mod ses enclosure sd_mod crc32c_intel ahci
Nov 18 16:27:31 cp1078 kernel: [1997356.199134]  mpt3sas aesni_intel raid_class aes_x86_64 glue_helper lrw libahci gf128mul ablk_helper xhci_pci scsi_transport_sas libata xhci_hcd cryptd bnxt_en nvme i2c_i801 nvme_core i2c_smbus usbcore scsi_mod usb_common
Nov 18 16:27:31 cp1078 kernel: [1997356.199165] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u6
Nov 18 16:27:31 cp1078 kernel: [1997356.199167] Hardware name: Dell Inc. PowerEdge R440/0WKGTH, BIOS 1.3.7 02/09/2018
Nov 18 16:27:31 cp1078 kernel: [1997356.199170]  0000000000000000 ffffffffb3d31e54 ffff92febd483e20 0000000000000000
Nov 18 16:27:31 cp1078 kernel: [1997356.199176]  ffffffffb3a794fe 0000000000000000 ffff92febd483e78 ffff92ce9f8d0000
Nov 18 16:27:31 cp1078 kernel: [1997356.199181]  0000000000000005 ffff92cea249dbc0 000000000000004a ffffffffb3a7957f
Nov 18 16:27:31 cp1078 kernel: [1997356.199186] Call Trace:
Nov 18 16:27:31 cp1078 kernel: [1997356.199189]  <IRQ>
Nov 18 16:27:31 cp1078 kernel: [1997356.199199]  [<ffffffffb3d31e54>] ? dump_stack+0x5c/0x78
Nov 18 16:27:31 cp1078 kernel: [1997356.199204]  [<ffffffffb3a794fe>] ? __warn+0xbe/0xe0
Nov 18 16:27:31 cp1078 kernel: [1997356.199207]  [<ffffffffb3a7957f>] ? warn_slowpath_fmt+0x5f/0x80
Nov 18 16:27:31 cp1078 kernel: [1997356.199215]  [<ffffffffb3ab1e52>] ? enqueue_task_fair+0x82/0x940
Nov 18 16:27:31 cp1078 kernel: [1997356.199221]  [<ffffffffb3f387b3>] ? dev_watchdog+0x233/0x240
Nov 18 16:27:31 cp1078 kernel: [1997356.199226]  [<ffffffffb3f38580>] ? dev_deactivate_queue.constprop.26+0x60/0x60
Nov 18 16:27:31 cp1078 kernel: [1997356.199231]  [<ffffffffb3ae7622>] ? call_timer_fn+0x32/0x120
Nov 18 16:27:31 cp1078 kernel: [1997356.199234]  [<ffffffffb3ae7997>] ? run_timer_softirq+0x1d7/0x420
Nov 18 16:27:31 cp1078 kernel: [1997356.199240]  [<ffffffffb3af8a80>] ? tick_sched_do_timer+0x30/0x30
Nov 18 16:27:31 cp1078 kernel: [1997356.199244]  [<ffffffffb3d3af54>] ? timerqueue_add+0x54/0xa0
Nov 18 16:27:31 cp1078 kernel: [1997356.199248]  [<ffffffffb3ae9678>] ? enqueue_hrtimer+0x38/0x80
Nov 18 16:27:31 cp1078 kernel: [1997356.199256]  [<ffffffffb401974d>] ? __do_softirq+0x10d/0x2a5
Nov 18 16:27:31 cp1078 kernel: [1997356.199261]  [<ffffffffb3a7f9f0>] ? irq_exit+0xb0/0xc0
Nov 18 16:27:31 cp1078 kernel: [1997356.199266]  [<ffffffffb40191cc>] ? smp_apic_timer_interrupt+0x4c/0x60
Nov 18 16:27:31 cp1078 kernel: [1997356.199271]  [<ffffffffb4017a66>] ? apic_timer_interrupt+0x96/0xa0
Nov 18 16:27:31 cp1078 kernel: [1997356.199272]  <EOI>
Nov 18 16:27:31 cp1078 kernel: [1997356.199279]  [<ffffffffb3ed89f2>] ? cpuidle_enter_state+0xa2/0x2d0
Nov 18 16:27:31 cp1078 kernel: [1997356.199282]  [<ffffffffb3ed89e0>] ? cpuidle_enter_state+0x90/0x2d0
Nov 18 16:27:31 cp1078 kernel: [1997356.199288]  [<ffffffffb3abc824>] ? cpu_startup_entry+0x154/0x240
Nov 18 16:27:31 cp1078 kernel: [1997356.199295]  [<ffffffffb3a48db0>] ? start_secondary+0x170/0x1b0
Nov 18 16:27:31 cp1078 kernel: [1997356.199298] ---[ end trace aa84d59894952a2a ]---
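
For anyone triaging similar incidents, here is a minimal sketch of pulling the bnxt_en "TX timeout detected" events out of a syslog excerpt. The regular expression is an assumption based only on the line format seen in this task; `find_tx_timeouts` and `SAMPLE` are hypothetical names, and the sample lines are copied from the log in this task.

```python
import re
from collections import Counter

# Assumed format, taken from the lines in this task, e.g.:
# "Nov 18 17:25:55 cp1078 kernel: [2000851.822322] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!"
TX_TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<driver>\S+)\s+(?P<pci>\S+)\s+"
    r"(?P<iface>\S+):\s+TX timeout detected"
)


def find_tx_timeouts(lines):
    """Return (timestamp, interface, kernel uptime in seconds) for each TX timeout line."""
    events = []
    for line in lines:
        match = TX_TIMEOUT_RE.match(line)
        if match:
            events.append((match["stamp"], match["iface"], float(match["uptime"])))
    return events


# Sample lines copied from the excerpt in this task (one unrelated line included).
SAMPLE = [
    "Nov 18 17:25:55 cp1078 charon: 11[IKE] initiating IKE_SA cp2014_v4[7676] to 10.192.32.113",
    "Nov 18 17:25:55 cp1078 kernel: [2000851.822322] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!",
    "Nov 18 17:26:00 cp1078 kernel: [2000856.930284] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!",
    "Nov 18 17:26:06 cp1078 kernel: [2000862.804440] bnxt_en 0000:3b:00.0 enp59s0f0: TX timeout detected, starting reset task!",
]

events = find_tx_timeouts(SAMPLE)
counts = Counter(iface for _, iface, _ in events)

for stamp, iface, uptime in events:
    print(f"{stamp}  {iface}  uptime={uptime:.3f}s")
for iface, count in counts.items():
    print(f"{iface}: {count} TX timeout(s)")
```

In practice the same function could be fed the lines of a saved syslog file instead of `SAMPLE`; the gaps between consecutive uptime values give a rough idea of how often the reset task was retriggered.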