Page MenuHomePhabricator

KernelError Server cloudgw1002 may have kernel errors
Closed, ResolvedPublic

Description

Common information

  • alertname: KernelError
  • cluster: wmcs
  • instance: cloudgw1002:9100
  • job: node
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts


Related Objects

Event Timeline

A kernel error was logged at 00:40 UTC last night:

1Dec 16 00:40:47 cloudgw1002 kernel: ------------[ cut here ]------------
2Dec 16 00:40:47 cloudgw1002 kernel: NETDEV WATCHDOG: enp101s0f0np0 (bnxt_en): transmit queue 5 timed out
3Dec 16 00:40:47 cloudgw1002 kernel: WARNING: CPU: 5 PID: 38 at net/sched/sch_generic.c:467 dev_watchdog+0x260/0x270
4Dec 16 00:40:47 cloudgw1002 kernel: Modules linked in: cpuid binfmt_misc nf_conntrack_netlink 8021q garp stp mrp llc vrf nft_nat nft_chain_nat nf_nat nft_counter nft_ct ipmi_ssif intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp ghash_clmulni_intel aesni_intel mgag200 drm_kms_helper libaes crypto_simd cryptd glue_helper mei_me dell_smbios cec iTCO_wdt intel_pmc_bxt iTCO_vendor_support evdev dcdbas dell_wmi_descriptor wmi_bmof rapl pcspkr i2c_algo_bit mei sg watchdog acpi_ipmi ipmi_si button nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipmi_devintf ipmi_msghandler nf_tables nfnetlink drm fuse configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear dm_mod raid1 md_mod sd_mod t10_pi crc_t10dif crct10dif_generic xhci_pci xhci_hcd ahci tg3 libahci libata crct10dif_pclmul libphy crct10dif_common crc32_pclmul ptp usbcore scsi_mod
5Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en crc32c_intel lpc_ich pps_core i2c_i801 wmi i2c_smbus usb_common
6Dec 16 00:40:47 cloudgw1002 kernel: CPU: 5 PID: 38 Comm: ksoftirqd/5 Not tainted 5.10.0-32-amd64 #1 Debian 5.10.223-1
7Dec 16 00:40:47 cloudgw1002 kernel: Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.22.1 09/13/2024
8Dec 16 00:40:47 cloudgw1002 kernel: RIP: 0010:dev_watchdog+0x260/0x270
9Dec 16 00:40:47 cloudgw1002 kernel: Code: eb a9 48 8b 1c 24 c6 05 bc 33 0c 01 01 48 89 df e8 55 72 fa ff 44 89 e9 48 89 de 48 c7 c7 78 13 d7 9c 48 89 c2 e8 3c c6 14 00 <0f> 0b eb 86 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41
10Dec 16 00:40:47 cloudgw1002 kernel: RSP: 0018:ffffaaa880347da0 EFLAGS: 00010286
11Dec 16 00:40:47 cloudgw1002 kernel: RAX: 0000000000000000 RBX: ffff9f7d89e98000 RCX: ffff9f84dff608c8
12Dec 16 00:40:47 cloudgw1002 kernel: RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9f84dff608c0
13Dec 16 00:40:47 cloudgw1002 kernel: RBP: ffff9f7d89e983dc R08: 0000000000000000 R09: ffffaaa880347bc0
14Dec 16 00:40:47 cloudgw1002 kernel: R10: ffffaaa880347bb8 R11: ffffffff9d2cb908 R12: ffff9f7d89ea5bc0
15Dec 16 00:40:47 cloudgw1002 kernel: R13: 0000000000000005 R14: ffff9f7d89e98480 R15: 000000000000004a
16Dec 16 00:40:47 cloudgw1002 kernel: FS: 0000000000000000(0000) GS:ffff9f84dff40000(0000) knlGS:0000000000000000
17Dec 16 00:40:47 cloudgw1002 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
18Dec 16 00:40:47 cloudgw1002 kernel: CR2: 00007fc653e95c98 CR3: 00000001f897e004 CR4: 00000000007706e0
19Dec 16 00:40:47 cloudgw1002 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
20Dec 16 00:40:47 cloudgw1002 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
21Dec 16 00:40:47 cloudgw1002 kernel: PKRU: 55555554
22Dec 16 00:40:47 cloudgw1002 kernel: Call Trace:
23Dec 16 00:40:47 cloudgw1002 kernel: ? __warn+0x80/0x100
24Dec 16 00:40:47 cloudgw1002 kernel: ? dev_watchdog+0x260/0x270
25Dec 16 00:40:47 cloudgw1002 kernel: ? report_bug+0x9e/0xc0
26Dec 16 00:40:47 cloudgw1002 kernel: ? handle_bug+0x35/0x80
27Dec 16 00:40:47 cloudgw1002 kernel: ? exc_invalid_op+0x14/0x70
28Dec 16 00:40:47 cloudgw1002 kernel: ? asm_exc_invalid_op+0x12/0x20
29Dec 16 00:40:47 cloudgw1002 kernel: ? dev_watchdog+0x260/0x270
30Dec 16 00:40:47 cloudgw1002 kernel: ? pfifo_fast_enqueue+0x150/0x150
31Dec 16 00:40:47 cloudgw1002 kernel: call_timer_fn+0x27/0x100
32Dec 16 00:40:47 cloudgw1002 kernel: __run_timers.part.0+0x1d9/0x250
33Dec 16 00:40:47 cloudgw1002 kernel: ? __switch_to_asm+0x3a/0x60
34Dec 16 00:40:47 cloudgw1002 kernel: ? __switch_to+0x114/0x440
35Dec 16 00:40:47 cloudgw1002 kernel: run_timer_softirq+0x26/0x50
36Dec 16 00:40:47 cloudgw1002 kernel: __do_softirq+0xc2/0x279
37Dec 16 00:40:47 cloudgw1002 kernel: run_ksoftirqd+0x2a/0x40
38Dec 16 00:40:47 cloudgw1002 kernel: smpboot_thread_fn+0xc2/0x160
39Dec 16 00:40:47 cloudgw1002 kernel: ? smpboot_register_percpu_thread+0xf0/0xf0
40Dec 16 00:40:47 cloudgw1002 kernel: kthread+0x118/0x140
41Dec 16 00:40:47 cloudgw1002 kernel: ? __kthread_bind_mask+0x60/0x60
42Dec 16 00:40:47 cloudgw1002 kernel: ret_from_fork+0x1f/0x30
43Dec 16 00:40:47 cloudgw1002 kernel: ---[ end trace 07ed9e6ae0d2ef7e ]---
44Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: TX timeout detected, starting reset task!
45Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [0]: tx{fw_ring: 0 prod: 1f cons: d}
46Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [0]: rx{fw_ring: 1 prod: 3e4} rx_agg{fw_ring: 65535 agg_prod: 0 sw_agg_prod: 0}
47Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [0]: cp{fw_ring: 0 raw_cons: 52732068}
48Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [1]: tx{fw_ring: 1 prod: c5 cons: c5}
49Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [1]: rx{fw_ring: 2 prod: 3bf} rx_agg{fw_ring: 65535 agg_prod: 0 sw_agg_prod: 0}
50Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [1]: cp{fw_ring: 16 raw_cons: 5f312156}
51Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [2]: tx{fw_ring: 2 prod: 478 cons: 460}
52Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [2]: rx{fw_ring: 3 prod: 4f6} rx_agg{fw_ring: 65535 agg_prod: 0 sw_agg_prod: 0}
53Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [2]: cp{fw_ring: 17 raw_cons: 41d70a4c}
54Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [3]: tx{fw_ring: 3 prod: 227 cons: 227}
55Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [3]: rx{fw_ring: 4 prod: 344} rx_agg{fw_ring: 65535 agg_prod: 0 sw_agg_prod: 0}
56Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [3]: cp{fw_ring: 18 raw_cons: 82182b93}
57Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [4]: tx{fw_ring: 4 prod: 4ac cons: 4aa}
58Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [4]: rx{fw_ring: 5 prod: 307} rx_agg{fw_ring: 65535 agg_prod: 0 sw_agg_prod: 0}
59Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [4]: cp{fw_ring: 19 raw_cons: b7880f41}
60Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [5]: tx{fw_ring: 5 prod: 23 cons: 25f}
61Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [5]: rx{fw_ring: 6 prod: 4cb} rx_agg{fw_ring: 65535 agg_prod: 0 sw_agg_prod: 0}
62Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [5]: cp{fw_ring: 20 raw_cons: 4ffaa47e}
63Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [6]: tx{fw_ring: 6 prod: 29e cons: 29e}
64Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [6]: rx{fw_ring: 7 prod: 225} rx_agg{fw_ring: 65535 agg_prod: 0 sw_agg_prod: 0}
65Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [6]: cp{fw_ring: 21 raw_cons: 7881499c}
66Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [7]: tx{fw_ring: 7 prod: 154 cons: 154}
67Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [7]: rx{fw_ring: 8 prod: 636} rx_agg{fw_ring: 65535 agg_prod: 0 sw_agg_prod: 0}
68Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: [7]: cp{fw_ring: 22 raw_cons: 6c37d5f8}
69Dec 16 00:40:48 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
70Dec 16 00:40:48 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
71Dec 16 00:40:49 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
72Dec 16 00:40:49 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0

It did not cause a reboot, but a bunch of these were logged afterwards, between 00:40 and 01:07 UTC:

fnegri@cloudgw1002:~$ journalctl -k -p err
-- Journal begins at Sun 2024-11-10 21:27:16 UTC, ends at Mon 2024-12-16 18:05:34 UTC. --
Dec 16 00:40:47 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: TX timeout detected, starting reset task!
Dec 16 00:40:48 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
Dec 16 00:40:48 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
Dec 16 00:40:49 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
Dec 16 00:40:49 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0

This is no longer happening, but is concerning! I'm mentally filing it in the same box as T374830 because it's one more reason I don't totally trust our network setup.

More errors were logged last night, again around 00:00 UTC (a bit earlier than the previous night though):

fnegri@cloudgw1002:~$ sudo journalctl -k --since -1d -perr
-- Journal begins at Sun 2024-11-10 21:27:16 UTC, ends at Tue 2024-12-17 13:54:33 UTC. --
Dec 17 00:08:37 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: TX timeout detected, starting reset task!
Dec 17 00:08:38 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
Dec 17 00:08:38 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
Dec 17 00:08:39 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
Dec 17 00:08:39 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
Dec 17 00:09:00 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: TX timeout detected, starting reset task!
Dec 17 00:09:01 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
Dec 17 00:09:01 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
Dec 17 00:09:02 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
Dec 17 00:09:02 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
Dec 17 00:09:18 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: TX timeout detected, starting reset task!
Dec 17 00:09:19 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
Dec 17 00:09:19 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
Dec 17 00:09:20 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
Dec 17 00:09:20 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
Dec 17 00:09:36 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: TX timeout detected, starting reset task!
Dec 17 00:09:37 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
Dec 17 00:09:37 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
Dec 17 00:09:38 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: Resp cmpl intr err msg: 0x51
Dec 17 00:09:38 cloudgw1002 kernel: bnxt_en 0000:65:00.0 enp101s0f0np0: hwrm_ring_free type 2 failed. rc:fffffff0 err:0

I don't really know what is the cause of all the messages here. Not unlikely this is happening due to brief moments of CPU pressure.

https://blogs.oracle.com/linux/post/network-interface-transmission-queue-timeout

https://lore.kernel.org/netdev/CACKFLikcVv0VrC1CX+BWKGM9n6J+nOSkJcB97m3RpxqO0svJCA@mail.gmail.com/T/

Certainly I would say that the cloudgw is not highly-optimised as a software-based router and possibly could be tuned more. Things like the offloads configured on the NIC etc. could certainly be looked at.

Or indeed we could look at dedicated CG-NAT hardware boxes to scale it's performance up, or scale it horizontally with multiple x86 systems somehow.

I would not overly worry about this, unless we had problems on the network which, realistically, there is scant evidence of. If it's not simply the system being too busy to deal with the very occasional packet it could maybe be an issue with the NIC, DAC cable, DIMMs or something. But tbh I don't really know where to start with all that.

aborrero added a parent task: Unknown Object (Task).Dec 17 2024, 4:31 PM

@aborrero mentioned on IRC that this server had similar issues a couple months ago: T376589: cloudgw1002: network interface problem

fnegri changed the task status from Open to In Progress.Dec 18 2024, 12:06 PM
fnegri closed this task as Resolved.EditedDec 19 2024, 12:44 PM
fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q1-Q2) board.

I rebooted cloudgw1002 yesterday (it was not logged to SAL because stashbot was down). No kernel errors were logged in cloudgw1002 since the reboot.

It might be that the kernel errors are only happening when the server is under load, and no traffic is going to cloudgw1002 at the moment, as cloudgw1001 is the new primary since the failover yesterday.

I will resolve this task, but we should still replace this server as per T382356: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev.