
cp2014 host down
Closed, Resolved · Public

Description

cp2014 NIC went down:

root@cp2014:~# tail -250 /var/log/kern.log
Jan 28 23:36:47 cp2014 kernel: [13604258.774217] ------------[ cut here ]------------
Jan 28 23:36:47 cp2014 kernel: [13604258.774236] WARNING: CPU: 22 PID: 0 at /build/linux-AcJpTp/linux-4.9.110/net/sched/sch_generic.c:316 dev_watchdog+0x233/0x240
Jan 28 23:36:47 cp2014 kernel: [13604258.774239] NETDEV WATCHDOG: eno1 (bnx2x): transmit queue 18 timed out
Jan 28 23:36:47 cp2014 kernel: [13604258.774241] Modules linked in: tcp_bbr cpuid sch_fq esp6 xfrm6_mode_transport binfmt_misc jitterentropy_rng drbg ansi_cprng seqiv xfrm4_mode_transport xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo cpufreq_userspace cpufreq_powersave cpufreq_conservative intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 kvm ttm drm_kms_helper irqbypass crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support evdev dcdbas ghash_clmulni_intel pcspkr drm i2c_algo_bit lpc_ich sg mfd_core mei_me mei shpchp wmi ipmi_si button ipmi_devintf ipmi_msghandler ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid0 multipath linear raid1 md_mod sd_mod ahci aesni_intel
Jan 28 23:36:47 cp2014 kernel: [13604258.774296]  libahci aes_x86_64 glue_helper lrw gf128mul ablk_helper ehci_pci bnx2x libata ehci_hcd cryptd ptp pps_core mdio usbcore libcrc32c scsi_mod usb_common crc32c_generic crc32c_intel
Jan 28 23:36:47 cp2014 kernel: [13604258.774311] CPU: 22 PID: 0 Comm: swapper/22 Not tainted 4.9.0-8-amd64 #1 Debian 4.9.110-3+deb9u4
Jan 28 23:36:47 cp2014 kernel: [13604258.774312] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.2.10 03/09/2015
Jan 28 23:36:47 cp2014 kernel: [13604258.774314]  0000000000000000 ffffffff91731e54 ffff949e3fac3e20 0000000000000000
Jan 28 23:36:47 cp2014 kernel: [13604258.774317]  ffffffff9147943e 0000000000000012 ffff949e3fac3e78 ffff949e313dc000
Jan 28 23:36:47 cp2014 kernel: [13604258.774318]  0000000000000016 ffff949e313e7100 000000000000005b ffffffff914794bf
Jan 28 23:36:47 cp2014 kernel: [13604258.774336] Call Trace:
Jan 28 23:36:47 cp2014 kernel: [13604258.774338]  <IRQ>
Jan 28 23:36:47 cp2014 kernel: [13604258.774357]  [<ffffffff91731e54>] ? dump_stack+0x5c/0x78
Jan 28 23:36:47 cp2014 kernel: [13604258.774360]  [<ffffffff9147943e>] ? __warn+0xbe/0xe0
Jan 28 23:36:47 cp2014 kernel: [13604258.774361]  [<ffffffff914794bf>] ? warn_slowpath_fmt+0x5f/0x80
Jan 28 23:36:47 cp2014 kernel: [13604258.774364]  [<ffffffff919387b3>] ? dev_watchdog+0x233/0x240
Jan 28 23:36:47 cp2014 kernel: [13604258.774366]  [<ffffffff91938580>] ? dev_deactivate_queue.constprop.26+0x60/0x60
Jan 28 23:36:47 cp2014 kernel: [13604258.774368]  [<ffffffff914e7562>] ? call_timer_fn+0x32/0x120
Jan 28 23:36:47 cp2014 kernel: [13604258.774369]  [<ffffffff914e78d7>] ? run_timer_softirq+0x1d7/0x420
Jan 28 23:36:47 cp2014 kernel: [13604258.774374]  [<ffffffff914f83c0>] ? tick_sched_handle.isra.12+0x20/0x50
Jan 28 23:36:47 cp2014 kernel: [13604258.774388]  [<ffffffff914f89f8>] ? tick_sched_timer+0x38/0x70
Jan 28 23:36:47 cp2014 kernel: [13604258.774394]  [<ffffffff91a1975d>] ? __do_softirq+0x10d/0x2a5
Jan 28 23:36:47 cp2014 kernel: [13604258.774397]  [<ffffffff9147f930>] ? irq_exit+0xb0/0xc0
Jan 28 23:36:47 cp2014 kernel: [13604258.774398]  [<ffffffff91a191dc>] ? smp_apic_timer_interrupt+0x4c/0x60
Jan 28 23:36:47 cp2014 kernel: [13604258.774400]  [<ffffffff91a17a76>] ? apic_timer_interrupt+0x96/0xa0
Jan 28 23:36:47 cp2014 kernel: [13604258.774401]  <EOI>
Jan 28 23:36:47 cp2014 kernel: [13604258.774404]  [<ffffffff918d89f2>] ? cpuidle_enter_state+0xa2/0x2d0
Jan 28 23:36:47 cp2014 kernel: [13604258.774405]  [<ffffffff918d89e0>] ? cpuidle_enter_state+0x90/0x2d0
Jan 28 23:36:47 cp2014 kernel: [13604258.774409]  [<ffffffff914bc764>] ? cpu_startup_entry+0x154/0x240
Jan 28 23:36:47 cp2014 kernel: [13604258.774414]  [<ffffffff91448db0>] ? start_secondary+0x170/0x1b0
Jan 28 23:36:47 cp2014 kernel: [13604258.774416] ---[ end trace 93f1575b42008d19 ]---
Jan 28 23:36:47 cp2014 kernel: [13604258.793841] bnx2x: [bnx2x_stats_comp:205(eno1)]timeout waiting for stats finished
Jan 28 23:36:47 cp2014 kernel: [13604258.821909] bnx2x: [bnx2x_stats_comp:205(eno1)]timeout waiting for stats finished
Jan 28 23:36:49 cp2014 kernel: [13604260.862580] bnx2x: [bnx2x_clean_tx_queue:1205(eno1)]timeout waiting for queue[0]: txdata->tx_pkt_prod(38411) != txdata->tx_pkt_cons(38061)
Jan 28 23:36:51 cp2014 kernel: [13604262.902730] bnx2x: [bnx2x_clean_tx_queue:1205(eno1)]timeout waiting for queue[12]: txdata->tx_pkt_prod(10783) != txdata->tx_pkt_cons(10612)
Jan 28 23:36:53 cp2014 kernel: [13604264.950677] bnx2x: [bnx2x_clean_tx_queue:1205(eno1)]timeout waiting for queue[24]: txdata->tx_pkt_prod(51166) != txdata->tx_pkt_cons(51024)
Jan 28 23:36:55 cp2014 kernel: [13604267.003700] bnx2x: [bnx2x_clean_tx_queue:1205(eno1)]timeout waiting for queue[1]: txdata->tx_pkt_prod(57592) != txdata->tx_pkt_cons(57311)
Jan 28 23:36:57 cp2014 kernel: [13604269.057476] bnx2x: [bnx2x_clean_tx_queue:1205(eno1)]timeout waiting for queue[13]: txdata->tx_pkt_prod(41715) != txdata->tx_pkt_cons(41616)
Jan 28 23:37:00 cp2014 kernel: [13604271.111261] bnx2x: [bnx2x_clean_tx_queue:1205(eno1)]timeout waiting for queue[25]: txdata->tx_pkt_prod(37617) != txdata->tx_pkt_cons(37499)
Jan 28 23:37:02 cp2014 kernel: [13604273.169753] bnx2x: [bnx2x_clean_tx_queue:1205(eno1)]timeout waiting for queue[2]: txdata->tx_pkt_prod(26463) != txdata->tx_pkt_cons(26204)
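
The `bnx2x_clean_tx_queue` lines above report, per queue, a producer index that never caught up with its consumer index. A minimal sketch for summarising how many packets each queue still had outstanding, assuming both indices are 16-bit wrapping counters (an assumption, not confirmed in this task); the function name `stuck_tx_queues` and the `SAMPLE` list are hypothetical helpers for illustration:

```python
import re

# Matches the driver's "timeout waiting for queue[N]" lines as they appear in kern.log above.
QUEUE_RE = re.compile(
    r"bnx2x_clean_tx_queue:\d+\((?P<iface>\w+)\)\]timeout waiting for "
    r"queue\[(?P<queue>\d+)\]: txdata->tx_pkt_prod\((?P<prod>\d+)\) != "
    r"txdata->tx_pkt_cons\((?P<cons>\d+)\)"
)


def stuck_tx_queues(lines):
    """Yield (interface, queue, outstanding_packets) for each matching log line."""
    for line in lines:
        m = QUEUE_RE.search(line)
        if m:
            prod, cons = int(m["prod"]), int(m["cons"])
            # Assumption: the counters are 16 bits wide, so mask the
            # difference to handle a producer index that has wrapped.
            yield m["iface"], int(m["queue"]), (prod - cons) & 0xFFFF


# Two lines copied from the log in this task.
SAMPLE = [
    "Jan 28 23:36:49 cp2014 kernel: [13604260.862580] bnx2x: [bnx2x_clean_tx_queue:1205(eno1)]timeout waiting for queue[0]: txdata->tx_pkt_prod(38411) != txdata->tx_pkt_cons(38061)",
    "Jan 28 23:36:51 cp2014 kernel: [13604262.902730] bnx2x: [bnx2x_clean_tx_queue:1205(eno1)]timeout waiting for queue[12]: txdata->tx_pkt_prod(10783) != txdata->tx_pkt_cons(10612)",
]

if __name__ == "__main__":
    for iface, queue, outstanding in stuck_tx_queues(SAMPLE):
        print(f"{iface} queue {queue}: {outstanding} packets not completed")
```

To run it against a real host, pass the lines of an open `/var/log/kern.log` file handle instead of `SAMPLE`.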

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 28 2019, 11:50 PM

Mentioned in SAL (#wikimedia-operations) [2019-01-28T23:51:22Z] <vgutierrez> restarting cp2014 - T214872

Vgutierrez moved this task from Triage to Hardware on the Traffic board. · Jan 28 2019, 11:53 PM
Vgutierrez closed this task as Resolved. · Jan 29 2019, 12:13 AM
Vgutierrez claimed this task.

Everything returned to normal after a reboot.