
lvs3009 NIC HW issue (Broadcom, eno8303)
Closed, Resolved · Public

Description

While debugging an unrelated issue with the disk utilization, we noticed this in the dmesg output:

[Tue Apr 29 14:05:38 2025] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Down
[Tue Apr 29 14:05:39 2025] Process accounting resumed
[Tue Apr 29 14:05:39 2025] ipip: IPv4 and MPLS over IPv4 tunneling driver
[Tue Apr 29 14:05:41 2025] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps (NRZ) full duplex, Flow control: none
[Tue Apr 29 14:05:41 2025] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: Clause 74 BaseR

This is further confirmed by the drop in traffic (https://grafana.wikimedia.org/goto/OWs50UbNR?orgId=1) and the getsel output for lvs3009:

Record:      33
Date/Time:   04/29/2025 14:01:43
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 4 device 0 function 0.
-------------------------------------------------------------------------------
Record:      34
sukhe@lvs3009:~$ sudo lspci -s 04:00.0
04:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
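For reference, a minimal sketch of how the link flap in the dmesg excerpt above could be picked out programmatically. It assumes the human-readable timestamp format shown here and the `bnxt_en` message wording from this task; the function names and the `SAMPLE` list are hypothetical and not part of any existing tooling.

```python
import re
from datetime import datetime

# Matches the timestamped bnxt_en link state lines quoted in this task.
# Other drivers word their link messages differently (assumption: bnxt_en only).
LINK_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+bnxt_en\s+(?P<pci>\S+)\s+(?P<iface>\S+):\s+"
    r"NIC Link is (?P<state>Up|Down)"
)


def link_events(lines):
    """Yield (timestamp, interface, state) for each link state change."""
    for line in lines:
        m = LINK_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
            yield ts, m["iface"], m["state"]


def flap_durations(lines):
    """Pair each Down with the next Up on the same interface.

    Returns a list of (interface, down_timestamp, seconds_down).
    """
    down_since = {}
    flaps = []
    for ts, iface, state in link_events(lines):
        if state == "Down":
            down_since[iface] = ts
        elif iface in down_since:
            start = down_since.pop(iface)
            flaps.append((iface, start, (ts - start).total_seconds()))
    return flaps


# Lines copied from the task description.
SAMPLE = [
    "[Tue Apr 29 14:05:38 2025] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Down",
    "[Tue Apr 29 14:05:39 2025] Process accounting resumed",
    "[Tue Apr 29 14:05:41 2025] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps (NRZ) full duplex, Flow control: none",
]

for iface, start, secs in flap_durations(SAMPLE):
    print(f"{iface} was down for {secs:.0f}s from {start}")
# prints: eno12399np0 was down for 3s from 2025-04-29 14:05:38
```

Note that the flap is logged against 0000:4b:00.0, while the SEL record and the `lspci -s 04:00.0` output refer to a different PCI address.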

Event Timeline

Restricted Application added a subscriber: Aklapper.
ssingh renamed this task from lvs3009 NIC possible HW issue to lvs3009 NIC HW issue (Broadcom, eno12399np0). May 7 2025, 4:16 PM
ssingh triaged this task as High priority.

Change #1143153 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: lvs3009: set lower priority (depool)

https://gerrit.wikimedia.org/r/1143153

RobH moved this task from Backlog to Hardware Failure / Repair on the ops-esams board.
RobH subscribed.

I'll open a case with Dell, which will inevitably require the firmware on the NIC, mainboard, and idrac be updated before they'll authorize a replacement.

Was the server under any particular load before the error we can recreate, or did it just randomly fire?

I can see it seems to have randomly fired a few times:

Mon Mar 17 2025 13:32:01  A fatal error was detected on a component at bus 4 device 0 function 0. 
Mon Mar 17 2025 13:54:37  A fatal error was detected on a component at bus 4 device 0 function 0. 
Tue Apr 29 2025 14:01:43  A fatal error was detected on a component at bus 4 device 0 function 0.
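To put numbers on "randomly fired a few times", here is a small sketch that computes the gaps between the three SEL entries listed above. It assumes the timestamp layout shown in those lines; `SEL_LINES`, `event_times`, and `gaps` are hypothetical names used only for illustration.

```python
from datetime import datetime

# The three SEL entries quoted above, copied verbatim.
SEL_LINES = [
    "Mon Mar 17 2025 13:32:01  A fatal error was detected on a component at bus 4 device 0 function 0.",
    "Mon Mar 17 2025 13:54:37  A fatal error was detected on a component at bus 4 device 0 function 0.",
    "Tue Apr 29 2025 14:01:43  A fatal error was detected on a component at bus 4 device 0 function 0.",
]


def event_times(lines):
    """Parse the leading 'Day Mon DD YYYY HH:MM:SS' timestamp of each line."""
    times = []
    for line in lines:
        fields = line.split()
        # The first five whitespace-separated fields form the timestamp.
        stamp = " ".join(fields[:5])
        times.append(datetime.strptime(stamp, "%a %b %d %Y %H:%M:%S"))
    return times


def gaps(times):
    """Return the timedelta between each consecutive pair of events."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


for gap in gaps(event_times(SEL_LINES)):
    print(gap)
```

The first two events are about 22 minutes apart and the third follows roughly 43 days later, so there is no obvious periodicity in this small sample.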

Service Request 209580323 was successfully submitted.

I cannot attach the required TSR report because it's over the 12MB limit on their submission form, so I asked them to provide an upload URL. The cadence of this will likely be:

  • support confirms case via email and sends me an upload url
  • i upload the support collection report (cannot attach in phab, too large at 13mb zip, kept locally for update to ticket)
  • they'll require I update the firmware on idrac, bios, and nic to see if it fixes it
  • we'll have to boot back into the OS and let it run to see if it fails again
  • if it fails, we'll get a dispatch of an engineer and part to esams for the repair

Support request confirmed as 'after hours english support' so I had to fill out my contact details a second time and request the upload url for the support request.

This task is now ongoing.

Change #1143153 merged by Ssingh:

[operations/puppet@production] hiera: lvs3009: set lower priority (depool)

https://gerrit.wikimedia.org/r/1143153

Mentioned in SAL (#wikimedia-operations) [2025-05-07T20:23:06Z] <sukhe> depooling lvs3009 for HW maint: T393616

I'll open a case with Dell, which will inevitably require the firmware on the NIC, mainboard, and idrac be updated before they'll authorize a replacement.

Was the server under any particular load before the error we can recreate, or did it just randomly fire?

I think it randomly fired and at least for the last occurrence, there was no traffic spike or anything to correlate it with.

Mentioned in SAL (#wikimedia-operations) [2025-05-07T20:26:00Z] <sukhe@cumin1002> START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3009*} and A:liberica (T393616)

Mentioned in SAL (#wikimedia-operations) [2025-05-07T20:26:07Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3009*} and A:liberica (T393616)

The host has been depooled so you can reboot or shut it down without checking with us. Thanks for the quick response Rob!

ssingh renamed this task from lvs3009 NIC HW issue (Broadcom, eno12399np0) to lvs3009 NIC HW issue (Broadcom, eno8303). May 8 2025, 1:21 PM

url provided by support so i've uploaded the support collection report for their review

They finally answered, first asking simple questions such as whether the network port or cable is bad (they aren't), and then, after another 48 hours, requesting firmware updates:

BIOS: https://dl.dell.com/FOLDER12595098M/1/BIOS_K3R8K_WN64_1.16.2.EXE
iDRAC : https://dl.dell.com/FOLDER13034219M/1/iDRAC-with-Lifecycle-Controller_Firmware_R8V2F_WN64_7.20.30.00_A00.EXE
NIC: https://dl.dell.com/FOLDER12577207M/2/Network_Firmware_J8X0M_WN64_23.21.6_01.EXE

After the firmware update please clear the SEL logs https://www.dell.com/support/kbdoc/en-in/000226396/poweredge-how-to-view-or-clear-the-system-event-log

Observe the performance and please do let us know if the issue persists.

Applying them now.

idrac updated, applying bios now

bios updated, applying nic firmware update now

NIC updated.

@ssingh: I'll let this sit idle for a day or so to see if it errors. If it doesn't, can we return it to service and check for errors this week while the case is open?

The link remains stable as far as I can see, and there are no errors reported in either the switch-side or host-side stats.

For the record the device is a BCM57414 NIC, in PCIe slot 4b:00.

cmooney@lvs3009:~$ sudo lspci | grep BCM57414
4b:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
4b:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)

As of now it's on firmware 21.85.21.92, which is our "known good" revision.

The BCM5720 listed in the task description is the unused 1G on-board NIC. The "fatal error was detected on a component at bus 4 device 0 function 0" message also relates to the unused 1G NIC (confirmed under iDRAC inventory).
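As a cross-check on the address mismatch, a minimal sketch that converts the SEL wording into the `bb:dd.f` notation `lspci` prints. It assumes the SEL reports bus, device, and function as decimal numbers, which this task does not confirm (bus 4 reads the same in decimal and hex); `sel_to_bdf` is a hypothetical helper name.

```python
import re

# Matches the "bus B device D function F" wording from the SEL record.
SEL_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def sel_to_bdf(message):
    """Convert 'bus B device D function F' into lspci-style 'bb:dd.f'.

    Assumption: the SEL numbers are decimal. lspci prints them in hex.
    """
    m = SEL_RE.search(message)
    if m is None:
        raise ValueError(f"no bus/device/function found in: {message!r}")
    bus, dev, fn = (int(x) for x in m.groups())
    return f"{bus:02x}:{dev:02x}.{fn:x}"


sel = "A fatal error was detected on a component at bus 4 device 0 function 0."
print(sel_to_bdf(sel))               # prints: 04:00.0
print(sel_to_bdf(sel) == "4b:00.0")  # prints: False
```

The result, 04:00.0, is the BCM5720 from the `lspci -s 04:00.0` output in the description, not the BCM57414 at 4b:00.0 that logged the link flap.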

So I think that may be a red herring, unrelated to why the primary link flapped at ~14:05 on April 29th. The good news is (at least from the switch perspective), that's the only time it randomly flapped in the past 30 days:

https://grafana.wikimedia.org/goto/1kcC5I-Ng

Thanks @RobH! @cmooney: yeah, I updated the task description to reflect that, but we thought we should get this checked out anyway, since it's the integrated NIC. Thanks for checking, both!

Mentioned in SAL (#wikimedia-operations) [2025-05-13T19:37:22Z] <brett@cumin2002> START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3009*} and A:liberica (T393616)

Mentioned in SAL (#wikimedia-operations) [2025-05-13T19:37:42Z] <brett@cumin2002> END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3009*} and A:liberica (T393616)

I'm not seeing any errors in the kernel log, anomalies in the graphs, or outputs in getsel. I'll go ahead and resolve this. Thanks, all!