cp3032 ethernet link down (bnx2x dump in the dmesg)
Closed, Resolved, Public

Description

Today Icinga showed (UTC+2 timings):

06:25  <icinga-wm> PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100%
06:31  <icinga-wm> PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6
06:31  <icinga-wm> PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6
06:32  <icinga-wm> PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6
06:32  <icinga-wm> PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6
06:32  <icinga-wm> PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3032_v4, cp3032_v6
06:32  <icinga-wm> PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp3032_v4, cp3032_v6

The host was not reachable via SSH but was still available via the console. dmesg showed several errors related to bnx2x:

[Thu Jun  1 04:21:35 2017] ------------[ cut here ]------------
[Thu Jun  1 04:21:35 2017] WARNING: CPU: 2 PID: 0 at /home/zumbi/linux-4.9.13/net/sched/sch_generic.c:316 dev_watchdog+0x220/0x230
[Thu Jun  1 04:21:35 2017] NETDEV WATCHDOG: eth0 (bnx2x): transmit queue 0 timed out
[Thu Jun  1 04:21:35 2017] Modules linked in: tcp_bbr(E) sch_fq(E) binfmt_misc(E) esp6(E) xfrm6_mode_transport(E) hmac(E) drbg(E) ansi_cprng(E) cpufreq_conservative(E) seqiv(E) cpufreq_userspace(E) xfrm4_mode_transport(E) cpufreq_powersave(E) 8021q(E) garp(E) mrp(E) stp(E) llc(E) xfrm_user(E) xfrm4_tunnel(E) tunnel4(E) ipcomp(E) xfrm_ipcomp(E) esp4(E) ah4(E) af_key(E) xfrm_algo(E) intel_rapl(E) sb_edac(E) edac_core(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm(E) mgag200(E) ttm(E) irqbypass(E) crct10dif_pclmul(E) drm_kms_helper(E) crc32_pclmul(E) ipmi_watchdog(E) iTCO_wdt(E) ghash_clmulni_intel(E) intel_cstate(E) iTCO_vendor_support(E) drm(E) evdev(E) dcdbas(E) i2c_algo_bit(E) lpc_ich(E) mei_me(E) intel_rapl_perf(E) pcspkr(E) mfd_core(E) mei(E) shpchp(E) wmi(E) tpm_tis(E) tpm_tis_core(E) tpm(E)
[Thu Jun  1 04:21:35 2017]  acpi_power_meter(E) button(E) ipmi_si(E) ipmi_poweroff(E) ipmi_devintf(E) ipmi_msghandler(E) autofs4(E) ext4(E) crc16(E) jbd2(E) fscrypto(E) mbcache(E) raid1(E) md_mod(E) sg(E) sd_mod(E) ahci(E) ehci_pci(E) libahci(E) ehci_hcd(E) libata(E) bnx2x(E) aesni_intel(E) aes_x86_64(E) ptp(E) glue_helper(E) pps_core(E) lrw(E) mdio(E) gf128mul(E) ablk_helper(E) libcrc32c(E) cryptd(E) usbcore(E) crc32c_generic(E) scsi_mod(E) usb_common(E) crc32c_intel(E) fjes(E)
[Thu Jun  1 04:21:35 2017] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G            E   4.9.0-0.bpo.2-amd64 #1 Debian 4.9.13-1~bpo8+1
[Thu Jun  1 04:21:35 2017] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.0.4 08/28/2014
[Thu Jun  1 04:21:35 2017]  0000000000000000 ffffffff97529cd5 ffff9f47bf843e38 0000000000000000
[Thu Jun  1 04:21:35 2017]  ffffffff972778a4 0000000000000000 ffff9f47bf843e90 ffff9f47ac47c000
[Thu Jun  1 04:21:35 2017]  0000000000000002 ffff9f47b1267100 000000000000005b ffffffff9727791f
[Thu Jun  1 04:21:35 2017] Call Trace:
[Thu Jun  1 04:21:35 2017]  <IRQ>
[Thu Jun  1 04:21:35 2017]  [<ffffffff97529cd5>] ? dump_stack+0x5c/0x77
[Thu Jun  1 04:21:35 2017]  [<ffffffff972778a4>] ? __warn+0xc4/0xe0
[Thu Jun  1 04:21:35 2017]  [<ffffffff9727791f>] ? warn_slowpath_fmt+0x5f/0x80
[Thu Jun  1 04:21:35 2017]  [<ffffffff97720d70>] ? dev_watchdog+0x220/0x230
[Thu Jun  1 04:21:35 2017]  [<ffffffff97720b50>] ? dev_deactivate_queue.constprop.27+0x60/0x60
[Thu Jun  1 04:21:35 2017]  [<ffffffff972e6240>] ? call_timer_fn+0x30/0x130
[Thu Jun  1 04:21:35 2017]  [<ffffffff972e782c>] ? run_timer_softirq+0x1dc/0x440
[Thu Jun  1 04:21:35 2017]  [<ffffffff972f6c80>] ? tick_sched_handle.isra.13+0x20/0x50
[Thu Jun  1 04:21:35 2017]  [<ffffffff972f72a8>] ? tick_sched_timer+0x38/0x70
[Thu Jun  1 04:21:35 2017]  [<ffffffff977fdf26>] ? __do_softirq+0x106/0x292
[Thu Jun  1 04:21:35 2017]  [<ffffffff9727db28>] ? irq_exit+0x98/0xa0
[Thu Jun  1 04:21:35 2017]  [<ffffffff977fdd2e>] ? smp_apic_timer_interrupt+0x3e/0x50
[Thu Jun  1 04:21:35 2017]  [<ffffffff977fd042>] ? apic_timer_interrupt+0x82/0x90
[Thu Jun  1 04:21:35 2017]  <EOI>
[Thu Jun  1 04:21:35 2017]  [<ffffffff976c2153>] ? cpuidle_enter_state+0x113/0x260
[Thu Jun  1 04:21:35 2017]  [<ffffffff972bbfce>] ? cpu_startup_entry+0x17e/0x260
[Thu Jun  1 04:21:35 2017]  [<ffffffff9724846d>] ? start_secondary+0x14d/0x190
[Thu Jun  1 04:21:35 2017] ---[ end trace db9d931b0691cee2 ]---
[Thu Jun  1 04:21:35 2017] bnx2x: [bnx2x_stats_comp:205(eth0)]timeout waiting for stats finished
[Thu Jun  1 04:21:35 2017] bnx2x: [bnx2x_stats_comp:205(eth0)]timeout waiting for stats finished
[Thu Jun  1 04:21:37 2017] bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[0]: txdata->tx_pkt_prod(16172) != txdata->tx_pkt_cons(16171)
[Thu Jun  1 04:21:39 2017] bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(25551) != txdata->tx_pkt_cons(25334)
[Thu Jun  1 04:21:41 2017] bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(65069) != txdata->tx_pkt_cons(64844)
[..]
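
The `bnx2x_clean_tx_queue` lines above report a producer index that is ahead of the consumer index for each stuck queue. As a rough aid for reading them, the sketch below extracts the per-queue backlog from pasted dmesg text. The helper name `tx_backlog` and the regular expression are illustrative and not part of any existing tooling. The 16-bit wrap-around is an assumption about the width of the driver's counters.

```python
import re

# Matches the bnx2x_clean_tx_queue timeout lines quoted in this task.
TX_TIMEOUT_RE = re.compile(
    r"timeout waiting for queue\[(\d+)\]: "
    r"txdata->tx_pkt_prod\((\d+)\) != txdata->tx_pkt_cons\((\d+)\)"
)

# Assumption: the producer/consumer indices are 16-bit and wrap at 65536.
COUNTER_MODULUS = 1 << 16


def tx_backlog(dmesg_text):
    """Return {queue index: packets produced but not yet consumed}."""
    backlog = {}
    for match in TX_TIMEOUT_RE.finditer(dmesg_text):
        queue, prod, cons = (int(group) for group in match.groups())
        # Modular subtraction stays correct if the producer has wrapped.
        backlog[queue] = (prod - cons) % COUNTER_MODULUS
    return backlog


# Sample input: the three lines from the dmesg excerpt in the description.
SAMPLE = """\
[Thu Jun  1 04:21:37 2017] bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[0]: txdata->tx_pkt_prod(16172) != txdata->tx_pkt_cons(16171)
[Thu Jun  1 04:21:39 2017] bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(25551) != txdata->tx_pkt_cons(25334)
[Thu Jun  1 04:21:41 2017] bnx2x: [bnx2x_clean_tx_queue:1205(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(65069) != txdata->tx_pkt_cons(64844)
"""

print(tx_backlog(SAMPLE))  # {0: 1, 1: 217, 2: 225}
```

On the excerpt above this gives a backlog of 1, 217 and 225 packets for queues 0, 1 and 2, which suggests that completions stopped arriving on every queue rather than on a single one.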

Event Timeline

elukey created this task. Jun 1 2017, 5:56 AM
Restricted Application added a subscriber: Aklapper. Jun 1 2017, 5:56 AM
elukey added a comment. Jun 1 2017, 5:58 AM

I depooled the host manually, then tried to bring the interface down and up again:

root@cp3032:/home/elukey# ifconfig eth0 down
[3097418.717749] bnx2x: [bnx2x_del_all_macs:8501(eth0)]Failed to delete MACs: -5
[3097418.725719] bnx2x: [bnx2x_chip_cleanup:9321(eth0)]Failed to schedule DEL commands for UC MACs list: -5
[3097418.757262] bnx2x: [bnx2x_func_stop:9080(eth0)]FUNC_STOP ramrod failed. Running a dry transaction

root@cp3032:/home/elukey# ifconfig eth0 up
[3097471.715363] bnx2x: [bnx2x_nic_load:2758(eth0)]Function start failed!
SIOCSIFFLAGS: Input/output error

Mentioned in SAL (#wikimedia-operations) [2017-06-01T05:58:32Z] <elukey> powercycle cp3032 - T166758

ema moved this task from Triage to Caching on the Traffic board. Jun 2 2017, 4:02 PM
BBlack closed this task as Resolved. Oct 3 2017, 2:12 PM
BBlack claimed this task.
BBlack added a subscriber: BBlack.

Hasn't recurred AFAIK. Note that this is similar to the bnx2x dmesg output we managed to induce on a bunch of upload@ulsfo machines via bad NUMA tuning. It's probably not a hardware issue.

Dzahn added a subscriber: Dzahn. Feb 13 2019, 8:45 PM

This issue happened today on ms-be2021:

[6234885.896871] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234885.934074] bnx2x: [bnx2x_write_dmae:597(eno49)]DMAE returned failure -1
[6234886.303199] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234886.340738] bnx2x: [bnx2x_write_dmae:597(eno49)]DMAE returned failure -1
[6234886.713413] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234886.750420] bnx2x: [bnx2x_write_dmae:597(eno49)]DMAE returned failure -1
[6234887.120839] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234887.157802] bnx2x: [bnx2x_write_dmae:597(eno49)]DMAE returned failure -1
[6234887.528627] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234887.565181] bnx2x: [bnx2x_write_dmae:597(eno49)]DMAE returned failure -1
[6234887.940079] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234887.977990] bnx2x: [bnx2x_write_dmae:597(eno49)]DMAE returned failure -1

15:40 <+icinga-wm> PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100%

15:43 < mutante> !log ms-be2021 - powercycling
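
The `DMAE timeout!` lines in the ms-be2021 excerpt repeat at a steady pace. For anyone comparing this with future occurrences, here is a minimal sketch that measures the spacing between consecutive timeouts in pasted dmesg text. The function name `dmae_timeout_gaps` and the regular expression are made up for illustration. It assumes the default `[seconds.microseconds]` dmesg timestamp format.

```python
import re

# Captures the uptime timestamp of each "DMAE timeout!" line.
DMAE_TIMEOUT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] bnx2x: \[bnx2x_issue_dmae_with_comp:\d+\(\w+\)\]DMAE timeout!"
)


def dmae_timeout_gaps(dmesg_text):
    """Return the seconds elapsed between consecutive DMAE timeouts."""
    stamps = [float(m.group(1)) for m in DMAE_TIMEOUT_RE.finditer(dmesg_text)]
    # Pair each timestamp with its successor and take the difference.
    return [round(later - earlier, 6) for earlier, later in zip(stamps, stamps[1:])]


# Sample input: the timeout lines from the ms-be2021 excerpt.
SAMPLE_DMAE = """\
[6234885.896871] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234886.303199] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234886.713413] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234887.120839] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234887.528627] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
[6234887.940079] bnx2x: [bnx2x_issue_dmae_with_comp:549(eno49)]DMAE timeout!
"""

for gap in dmae_timeout_gaps(SAMPLE_DMAE):
    print(f"{gap:.3f} s")
```

On the excerpt above every gap comes out at roughly 0.4 seconds, which fits a fixed retry loop in the driver. That reading is an interpretation of the log, not something confirmed in this task.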