cloudgw1002: network interface problem
Closed, Resolved (Public)

Description

Common information

  • address: 185.15.56.237
  • alertname: ProbeDown
  • family: ip4
  • instance: virt.cloudgw.eqiad1.wikimediacloud.org:0
  • job: probes/custom
  • module: icmp_virt_cloudgw_eqiad1_wikimediacloud_org_ip4
  • prometheus: ops
  • severity: critical
  • source: prometheus
  • team: wmcs

Event Timeline

cloudgw1002 had a kernel problem related to the NIC driver:

[Mon Oct  7 07:17:27 2024] ------------[ cut here ]------------
[Mon Oct  7 07:17:27 2024] NETDEV WATCHDOG: enp101s0f0np0 (bnxt_en): transmit queue 4 timed out
[Mon Oct  7 07:17:27 2024] WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x260/0x270
[Mon Oct  7 07:17:27 2024] Modules linked in: cpuid binfmt_misc nf_conntrack_netlink 8021q garp stp mrp llc vrf nft_nat nft_counter nft_chain_nat nf_nat nft_ct intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp ghash_clmulni_intel aesni_intel mgag200 drm_kms_helper libaes crypto_simd cryptd glue_helper mei_me iTCO_wdt cec dell_smbios intel_pmc_bxt iTCO_vendor_support evdev dcdbas rapl dell_wmi_descriptor wmi_bmof pcspkr i2c_algo_bit watchdog mei sg acpi_ipmi ipmi_si button nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipmi_devintf ipmi_msghandler nf_tables nfnetlink fuse drm configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear dm_mod raid1 md_mod sd_mod t10_pi crc_t10dif crct10dif_generic xhci_pci xhci_hcd ahci libahci tg3 libata crct10dif_pclmul libphy crct10dif_common usbcore crc32_pclmul bnxt_en
[Mon Oct  7 07:17:27 2024]  i2c_i801 ptp crc32c_intel scsi_mod lpc_ich i2c_smbus usb_common wmi pps_core
[Mon Oct  7 07:17:27 2024] CPU: 10 PID: 0 Comm: swapper/10 Not tainted 5.10.0-30-amd64 #1 Debian 5.10.218-1
[Mon Oct  7 07:17:27 2024] Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.9.3 09/23/2020
[Mon Oct  7 07:17:27 2024] RIP: 0010:dev_watchdog+0x260/0x270
[Mon Oct  7 07:17:27 2024] Code: eb a9 48 8b 1c 24 c6 05 fc 57 0c 01 01 48 89 df e8 35 74 fa ff 44 89 e9 48 89 de 48 c7 c7 28 07 b7 bc 48 89 c2 e8 0b bc 14 00 <0f> 0b eb 86 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41
[Mon Oct  7 07:17:27 2024] RSP: 0018:ffffb4cc80724eb0 EFLAGS: 00010282
[Mon Oct  7 07:17:27 2024] RAX: 0000000000000000 RBX: ffff8ef7e0efc000 RCX: 0000000000000000
[Mon Oct  7 07:17:27 2024] RDX: ffff8eff200b0620 RSI: ffff8eff200a08c0 RDI: 0000000000000300
[Mon Oct  7 07:17:27 2024] RBP: ffff8ef7e0efc3dc R08: 0000000000000000 R09: ffffb4cc80724cd0
[Mon Oct  7 07:17:27 2024] R10: ffffb4cc80724cc8 R11: ffffffffbd0cb828 R12: ffff8ef7e0f05bc0
[Mon Oct  7 07:17:27 2024] R13: 0000000000000004 R14: ffff8ef7e0efc480 R15: 000000000000004a
[Mon Oct  7 07:17:27 2024] FS:  0000000000000000(0000) GS:ffff8eff20080000(0000) knlGS:0000000000000000
[Mon Oct  7 07:17:27 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Oct  7 07:17:27 2024] CR2: 000000c000417000 CR3: 00000004ea60a001 CR4: 00000000007706e0
[Mon Oct  7 07:17:27 2024] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Mon Oct  7 07:17:27 2024] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Mon Oct  7 07:17:27 2024] PKRU: 55555554
[Mon Oct  7 07:17:27 2024] Call Trace:
[Mon Oct  7 07:17:27 2024]  <IRQ>
[Mon Oct  7 07:17:27 2024]  ? __warn+0x80/0x100
[Mon Oct  7 07:17:27 2024]  ? dev_watchdog+0x260/0x270
[Mon Oct  7 07:17:27 2024]  ? report_bug+0x9e/0xc0
[Mon Oct  7 07:17:27 2024]  ? handle_bug+0x35/0x80
[Mon Oct  7 07:17:27 2024]  ? exc_invalid_op+0x14/0x70
[Mon Oct  7 07:17:27 2024]  ? asm_exc_invalid_op+0x12/0x20
[Mon Oct  7 07:17:27 2024]  ? dev_watchdog+0x260/0x270
[Mon Oct  7 07:17:27 2024]  ? pfifo_fast_enqueue+0x150/0x150
[Mon Oct  7 07:17:27 2024]  call_timer_fn+0x27/0x100
[Mon Oct  7 07:17:27 2024]  __run_timers.part.0+0x1d9/0x250
[Mon Oct  7 07:17:27 2024]  ? ktime_get+0x35/0xa0
[Mon Oct  7 07:17:27 2024]  ? lapic_next_deadline+0x28/0x40
[Mon Oct  7 07:17:27 2024]  ? clockevents_program_event+0x8a/0xf0
[Mon Oct  7 07:17:27 2024]  run_timer_softirq+0x26/0x50
[Mon Oct  7 07:17:27 2024]  __do_softirq+0xc2/0x279
[Mon Oct  7 07:17:27 2024]  asm_call_irq_on_stack+0xf/0x20
[Mon Oct  7 07:17:27 2024]  </IRQ>
[Mon Oct  7 07:17:27 2024]  do_softirq_own_stack+0x37/0x50
[Mon Oct  7 07:17:27 2024]  irq_exit_rcu+0x92/0xc0
[Mon Oct  7 07:17:27 2024]  sysvec_apic_timer_interrupt+0x36/0x80
[Mon Oct  7 07:17:27 2024]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[Mon Oct  7 07:17:27 2024] RIP: 0010:cpuidle_enter_state+0xc7/0x350
[Mon Oct  7 07:17:27 2024] Code: 8b 3d fd b8 f3 43 e8 88 d8 9e ff 49 89 c5 0f 1f 44 00 00 31 ff e8 f9 e3 9e ff 45 84 ff 0f 85 fe 00 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 0a 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d
[Mon Oct  7 07:17:27 2024] RSP: 0018:ffffb4cc80257ea8 EFLAGS: 00000246
[Mon Oct  7 07:17:27 2024] RAX: ffff8eff200b3c40 RBX: 0000000000000003 RCX: 000000000000001f
[Mon Oct  7 07:17:27 2024] RDX: 0000000000000000 RSI: 000000003d1879ab RDI: 0000000000000000
[Mon Oct  7 07:17:27 2024] RBP: ffff8eff200bda00 R08: 00247bcecf1bb246 R09: 0000000000000001
[Mon Oct  7 07:17:27 2024] R10: 0000000000000000 R11: 0000000000001ec9 R12: ffffffffbd1aeec0
[Mon Oct  7 07:17:27 2024] R13: 00247bcecf1bb246 R14: 0000000000000003 R15: 0000000000000000
[Mon Oct  7 07:17:27 2024]  ? cpuidle_enter_state+0xb7/0x350
[Mon Oct  7 07:17:27 2024]  cpuidle_enter+0x29/0x40
[Mon Oct  7 07:17:27 2024]  do_idle+0x1f3/0x2b0
[Mon Oct  7 07:17:27 2024]  cpu_startup_entry+0x19/0x20
[Mon Oct  7 07:17:27 2024]  secondary_startup_64_no_verify+0xb1/0xbb
[Mon Oct  7 07:17:27 2024] ---[ end trace f8d3f0a5000802f0 ]---
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: TX timeout detected, starting reset task!
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [0]: tx{fw_ring: 0 prod: 1dc cons: 1dc}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [0]: rx{fw_ring: 1 prod: fa} rx_agg{fw_ring: 9 agg_prod: 487 sw_agg_prod: 235}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [0]: cp{fw_ring: 0 raw_cons: 646bbb46}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [1]: tx{fw_ring: 1 prod: 182 cons: 182}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [1]: rx{fw_ring: 2 prod: 1c1} rx_agg{fw_ring: 10 agg_prod: 3ea sw_agg_prod: 275}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [1]: cp{fw_ring: 16 raw_cons: c8d964aa}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [2]: tx{fw_ring: 2 prod: 105 cons: 105}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [2]: rx{fw_ring: 3 prod: e1} rx_agg{fw_ring: 11 agg_prod: 703 sw_agg_prod: 1ab}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [2]: cp{fw_ring: 17 raw_cons: a11ad100}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [3]: tx{fw_ring: 3 prod: 1a1 cons: 1a1}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [3]: rx{fw_ring: 4 prod: 18d} rx_agg{fw_ring: 12 agg_prod: 5bc sw_agg_prod: 638}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [3]: cp{fw_ring: 18 raw_cons: f087bac0}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [4]: tx{fw_ring: 4 prod: a7 cons: b9}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [4]: rx{fw_ring: 5 prod: 1e3} rx_agg{fw_ring: 13 agg_prod: 6d4 sw_agg_prod: 6bb}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [4]: cp{fw_ring: 19 raw_cons: e99c9b78}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [5]: tx{fw_ring: 5 prod: 1e7 cons: 1e7}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [5]: rx{fw_ring: 6 prod: 1c3} rx_agg{fw_ring: 14 agg_prod: 17b sw_agg_prod: 2aa}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [5]: cp{fw_ring: 20 raw_cons: ae841f57}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [6]: tx{fw_ring: 6 prod: 185 cons: 185}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [6]: rx{fw_ring: 7 prod: 91} rx_agg{fw_ring: 15 agg_prod: 3d0 sw_agg_prod: 5ed}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [6]: cp{fw_ring: 21 raw_cons: d69c5d60}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [7]: tx{fw_ring: 7 prod: 1c8 cons: 1c8}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [7]: rx{fw_ring: 8 prod: 162} rx_agg{fw_ring: 16 agg_prod: 222 sw_agg_prod: 5d}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [7]: cp{fw_ring: 22 raw_cons: f725646a}
[Mon Oct  7 07:17:28 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
[Mon Oct  7 07:17:28 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
[Mon Oct  7 07:17:29 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
[Mon Oct  7 07:17:29 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
[Mon Oct  7 07:17:30 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
[Mon Oct  7 07:17:30 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: TX timeout detected, starting reset task!
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [0]: tx{fw_ring: 0 prod: 17e cons: 17e}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [0]: rx{fw_ring: 1 prod: 175} rx_agg{fw_ring: 9 agg_prod: 3c5 sw_agg_prod: 3c6}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [0]: cp{fw_ring: 0 raw_cons: 143d0}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [1]: tx{fw_ring: 1 prod: 17e cons: 17e}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [1]: rx{fw_ring: 2 prod: f3} rx_agg{fw_ring: 10 agg_prod: 135 sw_agg_prod: 135}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [1]: cp{fw_ring: 16 raw_cons: c759}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [2]: tx{fw_ring: 2 prod: 8d cons: 8d}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [2]: rx{fw_ring: 3 prod: ff} rx_agg{fw_ring: 11 agg_prod: 652 sw_agg_prod: 652}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [2]: cp{fw_ring: 17 raw_cons: 1b53d}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [3]: tx{fw_ring: 3 prod: 121 cons: 121}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [3]: rx{fw_ring: 4 prod: 48} rx_agg{fw_ring: 12 agg_prod: 66 sw_agg_prod: 66}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [3]: cp{fw_ring: 18 raw_cons: b96b}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [4]: tx{fw_ring: 4 prod: 121 cons: 133}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [4]: rx{fw_ring: 5 prod: 43} rx_agg{fw_ring: 13 agg_prod: 3e2 sw_agg_prod: 3e3}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [4]: cp{fw_ring: 19 raw_cons: 2e77}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [5]: tx{fw_ring: 5 prod: a2 cons: a2}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [5]: rx{fw_ring: 6 prod: d9} rx_agg{fw_ring: 14 agg_prod: 23 sw_agg_prod: 24}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [5]: cp{fw_ring: 20 raw_cons: 195a9}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [6]: tx{fw_ring: 6 prod: 195 cons: 195}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [6]: rx{fw_ring: 7 prod: 1d0} rx_agg{fw_ring: 15 agg_prod: 606 sw_agg_prod: 606}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [6]: cp{fw_ring: 21 raw_cons: eb1a}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [7]: tx{fw_ring: 7 prod: bd cons: bd}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [7]: rx{fw_ring: 8 prod: 15e} rx_agg{fw_ring: 16 agg_prod: 4e sw_agg_prod: 4e}
[Mon Oct  7 07:18:12 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [7]: cp{fw_ring: 22 raw_cons: 1376f}
[Mon Oct  7 07:18:13 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
[Mon Oct  7 07:18:13 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
[Mon Oct  7 07:18:14 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
[Mon Oct  7 07:18:14 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
[Mon Oct  7 07:18:14 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
[Mon Oct  7 07:18:14 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
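
The watchdog message names transmit queue 4, and in both ring dumps above ring [4] is the only TX ring whose prod and cons values differ (a7/b9, then 121/133). A minimal sketch for pulling that out of a saved dmesg capture follows. The function name and regex are illustrative and not part of any tooling used on this task, and treating the values as hexadecimal ring indices is an assumption based on how they appear in the log.

```python
import re

# Matches the per-ring TX state lines printed by bnxt_en in the dump above.
TX_RING = re.compile(
    r"\[(?P<ring>\d+)\]: tx\{fw_ring: \d+ "
    r"prod: (?P<prod>[0-9a-f]+) cons: (?P<cons>[0-9a-f]+)\}"
)


def mismatched_tx_rings(log_text):
    """Return (ring, prod, cons) for every TX ring whose indices differ."""
    found = []
    for line in log_text.splitlines():
        m = TX_RING.search(line)
        if m is None:
            continue
        # Assumption: prod/cons are printed as hexadecimal indices.
        prod = int(m.group("prod"), 16)
        cons = int(m.group("cons"), 16)
        if prod != cons:
            found.append((int(m.group("ring")), prod, cons))
    return found


# Three lines copied from the first ring dump in this task.
SAMPLE = """\
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [3]: tx{fw_ring: 3 prod: 1a1 cons: 1a1}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [4]: tx{fw_ring: 4 prod: a7 cons: b9}
[Mon Oct  7 07:17:27 2024] bnxt_en 0000:65:00.0 enp101s0f0np0: [5]: tx{fw_ring: 5 prod: 1e7 cons: 1e7}
"""

print(mismatched_tx_rings(SAMPLE))  # [(4, 167, 185)]
```

A mismatch is only a hint about which queue stalled, not a diagnosis; here it simply agrees with the queue number reported by NETDEV WATCHDOG.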

Host rebooted by aborrero@cumin1002 with reason: network problem

aborrero renamed this task from ProbeDown to cloudgw1002: ProbeDown. Oct 7 2024, 7:57 AM
aborrero changed the task status from Open to In Progress. Oct 7 2024, 8:02 AM
aborrero claimed this task.
aborrero triaged this task as High priority.
aborrero added a project: User-aborrero.
aborrero renamed this task from cloudgw1002: ProbeDown to cloudgw1002: network interface problem. Oct 7 2024, 9:13 AM
aborrero added a parent task: Unknown Object (Task).
aborrero added subscribers: VRiley-WMF, Jclark-ctr.

Hey @VRiley-WMF or @Jclark-ctr, have you seen this error before on any network card or similar hardware? Does it ring any bells?

Do you think that upgrading the NIC firmware could prevent this error?

Checked the server today. No kernel panic.

@aborrero I just updated the iDRAC from 4.4 to 7.0. That is unrelated, but I did it since I was logged in and it causes no reboot. Let us know if you would like the NIC and BIOS updated as well. The BIOS update requires a reboot, and the NIC update will drop the link.

Yes, please upgrade it. You can do it anytime, because the server is the passive one in the pair, so a reboot should cause no service disruption.

Firmware applied for NIC and BIOS.