cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem
Closed, Resolved · Public

Description

Problems:

  • cloudnet100[34] are both experiencing the same transient NIC issues. WMCS already did their homework on this, and found others reporting the same issue with this firmware version and kernel version.
    • They attempted to flash firmware updates, but that only updated kernel packages; the actual firmware update has to come from HP software/flashing. Basically they've done all they could before escalating this to DC Ops.
  • HP systems can only remotely flash the BIOS and iLO; any RAID/NIC/PSU/etc. firmware updates require the Service Pack ISO image to be loaded via USB.

Solution:

  • Either Chris or John will need to create the HP SP bootable image on USB, then flash both of these systems.
  • Please note that one system is active and the other is standby. Please ping Arturo (IRC nick: arturo) when you are ready to do this work and he can let you know which is the standby server.
    • The standby server will have its firmware updated first, and then Arturo can fail over to it (once we think it's stable) so the second system can be updated.

Initial filing of issue:

On Jan 3rd 2021 we got a page reporting 100% packet loss; it was a momentary hiccup.

Relevant dmesg: https://phabricator.wikimedia.org/P13635
Journal for ±10 min around the issue: https://phabricator.wikimedia.org/P13636

The network is back up and running, so this is to investigate later and/or keep track in case it happens again.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-01-03T07:06:38Z] <dcaro> Got a network hiccup on cloudnet1004, keeping track here T271058

This just happened again, at Sat 09 Jan 2021 02:37:10 AM UTC.

I was about to make another ticket for it until I saw your comment.

Received this alert:

Notification Type: PROBLEM
Host: cloudnet1004
State: DOWN
Address: 10.64.20.36
Info: PING CRITICAL - Packet loss = 0%, RTA = 2417.45 ms

Date/Time: Sat Jan 9 02:16:07 UTC 2021

It recovered very quickly, but WMCS did get paged. The logs are naturally noisy, but I did catch this:

Jan  9 02:29:07 cloudnet1004 systemd[1]: conntrackd.service: Watchdog timeout (limit 1min)!
Jan  9 02:29:07 cloudnet1004 systemd[1]: conntrackd.service: Killing process 17544 (conntrackd) with signal SIGABRT.
Jan  9 02:29:07 cloudnet1004 systemd[1]: conntrackd.service: Killing process 17544 (conntrackd) with signal SIGKILL.
Jan  9 02:29:07 cloudnet1004 systemd[1]: conntrackd.service: Killing process 17544 (conntrackd) with signal SIGKILL.
Jan  9 02:29:07 cloudnet1004 systemd[1]: conntrackd.service: Failed with result 'watchdog'.
Jan  9 02:29:07 cloudnet1004 systemd[1]: conntrackd.service: Service RestartSec=100ms expired, scheduling restart.
Jan  9 02:29:07 cloudnet1004 systemd[1]: conntrackd.service: Scheduled restart job, restart counter is at 254.
Jan  9 02:29:07 cloudnet1004 systemd[1]: Stopped Conntrack Daemon.
Jan  9 02:29:07 cloudnet1004 systemd[1]: Starting Conntrack Daemon...
Jan  9 02:29:07 cloudnet1004 systemd[1]: conntrackd.service: Supervising process 17732 which is not our child. We'll most likely not notice when it exits.
Jan  9 02:29:07 cloudnet1004 systemd[1]: Started Conntrack Daemon.

That error is happening a fair bit. Dunno if that is related.
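A quick way to gauge how often the watchdog is firing, for anyone curious (a minimal sketch; the one-week window and the exact grep pattern are just assumptions):

# count conntrackd watchdog timeouts logged in the last week
sudo journalctl -u conntrackd.service --since=-7d | grep -c 'Watchdog timeout'
# how many times systemd has restarted the unit since boot
systemctl show conntrackd.service -p NRestarts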

Does it have broken hardware or something? This is from dmesg:

Jan  8 08:32:45 cloudnet1004 kernel: [8628599.505555] bnx2x: [bnx2x_mc_assert:750(eno49)]Chip Revision: everest3, FW Version: 7_13_1
Jan  8 08:32:45 cloudnet1004 kernel: [8628599.545553] bnx2x: [bnx2x_panic_dump:1186(eno49)]end crash dump -----------------
Jan  8 08:32:45 cloudnet1004 kernel: [8628599.582176] bnx2x: [bnx2x_sp_rtnl_task:10349(eno49)]Indicating link is down due to Tx-timeout
Jan  8 08:32:47 cloudnet1004 kernel: [8628601.621546] bnx2x: [bnx2x_clean_tx_queue:1208(eno49)]timeout waiting for queue[0]: txdata->tx_pkt_prod(61676) != txdata->tx_pkt_cons(61638)
Jan  8 08:32:49 cloudnet1004 kernel: [8628603.681537] bnx2x: [bnx2x_clean_tx_queue:1208(eno49)]timeout waiting for queue[0]: txdata->tx_pkt_prod(61676) != txdata->tx_pkt_cons(61638)
Jan  8 08:32:59 cloudnet1004 kernel: [8628613.765543] bnx2x: [bnx2x_state_wait:310(eno49)]timeout waiting for state 9
Jan  8 08:33:36 cloudnet1004 kernel: [8628650.818000] bnx2x 0000:04:00.0 eno49: using MSI-X  IRQs: sp 54  fp[0] 56 ... fp[7] 63
Jan  8 08:33:36 cloudnet1004 kernel: [8628650.897584] bnx2x 0000:04:00.0 eno49: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit

That last one is obviously from much earlier, but that's kinda weird.

We confirmed this is the standby, so it won't impact the cloud during this nonsense (and thus isn't an "unbreak now" or a real outage).
I just checked the web console, and apparently the network adapter's status is "unknown".

Screen Shot 2021-01-08 at 7.57.02 PM.png (414×1 px, 50 KB)

That might just be this version of iLO being Helpful, though. On Monday, if this is under warranty, we could possibly parse the active health log (if it is enabled).

There was a similar alert last night:

Notification Type: PROBLEM

Service: Check nf_conntrack usage in neutron netns
Host: cloudnet1004
Address: 10.64.20.36
State: CRITICAL

Date/Time: Sun Jan 10 09:41:55 UTC 2021

This is possibly (but probably not) related to T271647.

Mentioned in SAL (#wikimedia-cloud) [2021-01-11T10:07:45Z] <arturo> manually cleanup conntrack table in cloudnet1004 (T271058)
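The manual cleanup noted in the SAL entry above would look roughly like the following (a sketch; the qrouter namespace UUID is a placeholder and the conntrack CLI comes from the conntrack package):

# find the neutron router namespace on this cloudnet host
sudo ip netns list
# flush the connection tracking table inside that namespace (UUID below is illustrative)
sudo ip netns exec qrouter-00000000-0000-0000-0000-000000000000 conntrack -F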

Change 655407 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: neutron: l3_agent: double size of conntrack table

https://gerrit.wikimedia.org/r/655407

Change 655407 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: neutron: l3_agent: double size of conntrack table

https://gerrit.wikimedia.org/r/655407
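The patch itself isn't reproduced here, but the knob typically behind "size of the conntrack table" is the nf_conntrack_max sysctl; checking and raising it by hand looks roughly like this (values are illustrative, and on these hosts the real change is managed through puppet):

# current limit and current usage
sudo sysctl net.netfilter.nf_conntrack_max
sudo sysctl net.netfilter.nf_conntrack_count
# double the limit (illustrative value only; puppet owns the real setting)
sudo sysctl -w net.netfilter.nf_conntrack_max=1048576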

It seems we are tracking 2 different issues in this task:

  • cloudnet1004 conntrack table being full
  • cloudnet1004 potentially having HW issues in the NIC

Since the conntrack table being full already got a fix in place, I think we should keep using this task to track the potential HW issue in the NIC, which also matches the task description.

We can open another ticket if we detect more fun with the conntrack table.

Notification Type: PROBLEM
Host: cloudnet1004
State: DOWN
Address: 10.64.20.36
Info: PING CRITICAL - Packet loss = 100%

Date/Time: Tue Jan 12 02:19:15 UTC 2021

This is clearly an issue in the Broadcom driver/firmware. A quick search on the internet shows *a lot* of people reporting similar issues for the same NIC on the same Linux kernel 4.x branch.

aborrero renamed this task from cloudnet1004: Network hiccup to cloudnet1004: network hiccup because broadcom driver/firmware problem. Jan 12 2021, 10:26 AM

Mentioned in SAL (#wikimedia-operations) [2021-01-12T10:28:32Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on cloudnet1004.eqiad.wmnet with reason: T271058

Mentioned in SAL (#wikimedia-operations) [2021-01-12T10:28:36Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudnet1004.eqiad.wmnet with reason: T271058

Mentioned in SAL (#wikimedia-cloud) [2021-01-12T10:32:54Z] <arturo> update firmware-bnx2x from 20190114-2 to 20200918-1~bpo10+1 on cloudnet1004 (T271058)
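For reference, the package bump recorded in that SAL entry amounts to roughly the following (a sketch; it assumes buster-backports is already configured in the APT sources):

sudo apt update
sudo apt install -t buster-backports firmware-bnx2x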

For the record, cloudnet1003 shows the same error messages from the NIC.

Hey @RobH, could you please help us here with the vendor side of the firmware update?

I guess the vendor provides some kind of toolkit to handle this kind of situation.

aborrero renamed this task from cloudnet1004: network hiccup because broadcom driver/firmware problem to cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem. Jan 12 2021, 11:07 AM

So I'm updating this with a summary of what I understand to be our known issues and possible solutions, to ensure we're all on the same page. I've already chatted with Arturo about this in IRC, so nothing below should be a surprise.

Problems:

  • cloudnet100[34] are both experiencing the same transient NIC issues. WMCS already did their homework on this, and found others reporting the same issue with this firmware version and kernel version.
    • They attempted to flash firmware updates, but that only updated kernel packages; the actual firmware update has to come from HP software/flashing. Basically they've done all they could before escalating this to DC Ops.
  • HP systems can only remotely flash the BIOS and iLO; any RAID/NIC/PSU/etc. firmware updates require the Service Pack ISO image to be loaded via USB.

Solution:

  • Either Chris or John will need to create the HP SP bootable image on USB (see the sketch below), then flash both of these systems.
  • Please note that one system is active and the other is standby. Please ping @aborrero (IRC nick: arturo) when you are ready to do this work and he can let you know which is the standby server.
    • The standby server will have its firmware updated first, and then Arturo can fail over to it (once we think it's stable) so the second system can be updated.
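For the USB step, one possible way to prepare the stick is sketched below; the ISO filename and target device are placeholders, and depending on the SPP version the official HPE USB Key Utility may be needed instead of a raw copy:

# double-check the target device name before writing anything
lsblk
# write the downloaded Service Pack for ProLiant ISO to the stick (names are placeholders)
sudo dd if=SPP.iso of=/dev/sdX bs=4M status=progress conv=fsync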

Mentioned in SAL (#wikimedia-cloud) [2021-01-19T10:17:24Z] <arturo> icinga-downtime cloudnet1004 for 1 week (T271058)

Mentioned in SAL (#wikimedia-cloud) [2021-01-27T00:50:08Z] <bstorm> icinga-downtime cloudnet1004 for a week T271058

Mentioned in SAL (#wikimedia-cloud) [2021-02-03T01:50:30Z] <bstorm> icinga-downtime cloudnet1004 for a week T271058

The Service Pack tool is only available for in-warranty devices:

image.png (258×600 px, 34 KB)

Have reached out to Chris/Papaul for guidance on getting the Service Pack tool for an out-of-warranty device. Will follow up this week with HPE; waiting for a Support Account Reference.

Andrew mentioned this in Unknown Object (Task). Feb 9 2021, 9:22 PM
RobH added a subtask: Unknown Object (Task). Feb 10 2021, 10:29 PM
RobH changed the status of subtask Unknown Object (Task) from Open to Stalled.

Mentioned in SAL (#wikimedia-cloud) [2021-02-11T05:37:28Z] <bstorm> downtimed cloudnet1004 for another week T271058

Was able to download the HP Service Pack for ProLiant with help from Papaul. Will be available next week to perform the firmware update.


@Jclark-ctr: no good news, you will have to try using a DVD.

Thanks

Unable to update using DVD: the ISO is 10 GB and a dual-layer DVD is only 8.5 GB. I was able to update the firmware using iLO and a virtual drive. Previous version: 2.50 (Sep 23 2016), current version: 2.74 (May 08 2020).

Mentioned in SAL (#wikimedia-cloud) [2021-02-18T14:50:46Z] <arturo> rebooting cloudnet1004 for T271058

Rebooted the server for a clean start; I see the driver and firmware being loaded. I'll leave the output here for reference:

aborrero@cloudnet1004:~ $ sudo dmesg -T | grep bnx2x
[Thu Feb 18 14:52:35 2021] bnx2x: QLogic 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.712.30-0 (2014/02/10)
[Thu Feb 18 14:52:35 2021] bnx2x 0000:04:00.0: msix capability found
[Thu Feb 18 14:52:35 2021] bnx2x 0000:04:00.0: part number 394D4342-31383735-31543030-47303030
[Thu Feb 18 14:52:36 2021] bnx2x 0000:04:00.0: 32.000 Gb/s available PCIe bandwidth (5 GT/s x8 link)
[Thu Feb 18 14:52:36 2021] bnx2x 0000:04:00.1: msix capability found
[Thu Feb 18 14:52:36 2021] bnx2x 0000:04:00.1: part number 394D4342-31383735-31543030-47303030
[Thu Feb 18 14:52:37 2021] bnx2x 0000:04:00.1: 32.000 Gb/s available PCIe bandwidth (5 GT/s x8 link)
[Thu Feb 18 14:52:37 2021] bnx2x 0000:04:00.0 eno49: renamed from eth0
[Thu Feb 18 14:52:38 2021] bnx2x 0000:04:00.1 eno50: renamed from eth1
[Thu Feb 18 14:52:49 2021] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.1.0.fw
[Thu Feb 18 14:52:50 2021] bnx2x 0000:04:00.0 eno49: using MSI-X  IRQs: sp 53  fp[0] 55 ... fp[7] 62
[Thu Feb 18 14:52:50 2021] bnx2x 0000:04:00.0 eno49: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Thu Feb 18 14:52:57 2021] bnx2x 0000:04:00.1 eno50: using MSI-X  IRQs: sp 64  fp[0] 66 ... fp[7] 75
[Thu Feb 18 14:52:58 2021] bnx2x 0000:04:00.1 eno50: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit

Unfortunately the server still shows the same problems, and even self-rebooted overnight.

The last log before the reboot shows a kernel crash related to the NIC driver:

Feb 19 09:04:24 cloudnet1004 kernel: [12041.893990] bnx2x: [bnx2x_sp_rtnl_task:10349(eno49)]Indicating link is down due to Tx-timeout
Feb 19 09:04:26 cloudnet1004 kernel: [12043.933782] bnx2x: [bnx2x_clean_tx_queue:1208(eno49)]timeout waiting for queue[7]: txdata->tx_pkt_prod(762) != txdata->tx_pkt_cons(741)
Feb 19 09:04:28 cloudnet1004 kernel: [12045.989781] bnx2x: [bnx2x_clean_tx_queue:1208(eno49)]timeout waiting for queue[7]: txdata->tx_pkt_prod(762) != txdata->tx_pkt_cons(741)
Feb 19 09:04:39 cloudnet1004 kernel: [12056.171823] bnx2x: [bnx2x_state_wait:310(eno49)]timeout waiting for state 9
Feb 19 09:06:47 cloudnet1004 kernel: [12184.525924] INFO: task keepalived:2293 blocked for more than 120 seconds.
Feb 19 09:06:47 cloudnet1004 kernel: [12184.560027]       Tainted: G        W         4.19.0-14-amd64 #1 Debian 4.19.171-2
Feb 19 09:06:47 cloudnet1004 kernel: [12184.596536] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634759] keepalived      D    0  2293   2292 0x00000000
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634762] Call Trace:
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634771]  __schedule+0x29f/0x840
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634774]  schedule+0x28/0x80
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634776]  schedule_preempt_disabled+0xa/0x10
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634779]  __mutex_lock.isra.8+0x2b5/0x4a0
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634785]  ? apparmor_capable+0x6b/0xc0
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634791]  ? security_capable+0x38/0x50
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634796]  rtnetlink_rcv_msg+0x264/0x360
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634798]  ? _cond_resched+0x15/0x30
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634800]  ? rtnl_calcit.isra.31+0x100/0x100
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634804]  netlink_rcv_skb+0x4c/0x120
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634806]  netlink_unicast+0x181/0x210
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634814]  netlink_sendmsg+0x204/0x3d0
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634820]  sock_sendmsg+0x36/0x40
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634826]  ___sys_sendmsg+0x295/0x2f0
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634833]  ? sock_sendmsg+0x36/0x40
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634840]  ? __sys_sendto+0xee/0x160
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634846]  __sys_sendmsg+0x57/0xa0
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634858]  do_syscall_64+0x53/0x110
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634862]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634866] RIP: 0033:0x7f2950615431
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634871] Code: Bad RIP value.
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634873] RSP: 002b:00007ffc136533c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634875] RAX: ffffffffffffffda RBX: 00007ffc13653470 RCX: 00007f2950615431
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634876] RDX: 0000000000000000 RSI: 00007ffc136533f0 RDI: 000000000000000a
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634880] RBP: 00005618ed099490 R08: 0000000000000004 R09: 0000000000000020
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634881] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634882] R13: 00005618edc79118 R14: 0000000000000000 R15: 00005618ed0a4fe0
Feb 19 09:06:47 cloudnet1004 kernel: [12184.634906] INFO: task kworker/42:1:35280 blocked for more than 120 seconds.

I propose we try a newer kernel (buster-backports: 5.10.13-1~bpo10+1) to see if that makes any difference.

Ack, let's give it a shot.
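For reference, installing the proposed backports kernel amounts to roughly the following (a sketch; it assumes buster-backports is enabled in APT):

sudo apt update
sudo apt install -t buster-backports linux-image-amd64
sudo reboot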

Mentioned in SAL (#wikimedia-cloud) [2021-02-23T10:48:59Z] <arturo> installing linux-image-amd64 from buster-bpo 5.10.13-1~bpo10+1 in cloudnet1004 (T271058)

Mentioned in SAL (#wikimedia-cloud) [2021-02-23T10:49:45Z] <arturo> rebooting clounet1004 into new kernel from buster-bpo (T271058)

OK, now running the new kernel; I'll leave it running for at least a couple of days and see what happens:

aborrero@cloudnet1004:~ $ sudo dmesg -T | grep bnx
[Tue Feb 23 10:51:49 2021] bnx2x 0000:04:00.0: msix capability found
[Tue Feb 23 10:51:49 2021] bnx2x 0000:04:00.0: part number 394D4342-31383735-31543030-47303030
[Tue Feb 23 10:51:51 2021] bnx2x 0000:04:00.0: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
[Tue Feb 23 10:51:51 2021] bnx2x 0000:04:00.1: msix capability found
[Tue Feb 23 10:51:51 2021] bnx2x 0000:04:00.1: part number 394D4342-31383735-31543030-47303030
[Tue Feb 23 10:51:52 2021] bnx2x 0000:04:00.1: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
[Tue Feb 23 10:51:52 2021] bnx2x 0000:04:00.1 eno50: renamed from eth1
[Tue Feb 23 10:51:52 2021] bnx2x 0000:04:00.0 eno49: renamed from eth0
[Tue Feb 23 10:52:03 2021] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.15.0.fw
[Tue Feb 23 10:52:04 2021] bnx2x 0000:04:00.0 eno49: using MSI-X  IRQs: sp 56  fp[0] 58 ... fp[7] 65
[Tue Feb 23 10:52:04 2021] bnx2x 0000:04:00.0 eno49: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[Tue Feb 23 10:52:09 2021] bnx2x 0000:04:00.1 eno50: using MSI-X  IRQs: sp 66  fp[0] 68 ... fp[7] 77
[Tue Feb 23 10:52:09 2021] bnx2x 0000:04:00.1 eno50: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
aborrero@cloudnet1004:~ $ uname -r
5.10.0-0.bpo.3-amd64

Looking good so far. I'll wait another couple days before drawing more conclusions.

aborrero added a subscriber: Jclark-ctr.

I don't see errors anymore on cloudnet1004 after the kernel upgrade. I think we should do the same in the other server (cloudnet1003).

Requesting permission from the I/F team.

Sounds fine; tracking down which bits would need to be added to 4.19.x would take quite a bit of time.

Mentioned in SAL (#wikimedia-cloud) [2021-03-03T09:30:04Z] <arturo> installing linux kernel 5.10.13-1~bpo10+1 in cloudnet1003 and rebooting it (network failover) (T271058)

Rebooting cloudnet1003 into the new kernel failed to bring interfaces up:

root@cloudnet1003:~# sudo ifup -a
[ 1094.157420] bnx2x 0000:04:00.1: firmware: failed to load bnx2x/bnx2x-e2-7.13.15.0.fw (-2)
[ 1094.196940] bnx2x: [bnx2x_func_hw_init:6004(eno50)]Error loading firmware
[ 1094.229719] bnx2x: [bnx2x_nic_load:2733(eno50)]HW init failed, aborting
RTNETLINK answers: No such file or directory
RTNETLINK answers: Network is down

Waiting for br-external to get ready (MAXWAIT is 32 seconds).
[ 1094.681418] bnx2x 0000:04:00.1: firmware: failed to load bnx2x/bnx2x-e2-7.13.15.0.fw (-2)
[ 1094.721167] bnx2x: [bnx2x_func_hw_init:6004(eno50)]Error loading firmware
[ 1094.753451] bnx2x: [bnx2x_nic_load:2733(eno50)]HW init failed, aborting
RTNETLINK answers: No such file or directory
RTNETLINK answers: Network is down

Waiting for br-internal to get ready (MAXWAIT is 32 seconds).
RTNETLINK answers: Network is down
ifup: failed to bring up eno50.1120
RTNETLINK answers: Network is down
ifup: failed to bring up eno50.1105

We may need the firmware update that @Jclark-ctr did on cloudnet1004.
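A quick way to confirm that kind of mismatch (a sketch): compare the firmware blobs the new kernel's bnx2x module requests against what the installed firmware-bnx2x package actually ships:

# firmware files the bnx2x module wants
modinfo bnx2x | grep '^firmware'
# firmware files the currently installed package provides
dpkg -L firmware-bnx2x | grep 'bnx2x-e2'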

Mentioned in SAL (#wikimedia-cloud) [2021-03-03T09:58:58Z] <arturo> update firmware-bnx2x from 20190114-2 to 20200918-1~bpo10+1 on cloudnet1003 (T271058)

Mentioned in SAL (#wikimedia-cloud) [2021-03-03T10:00:57Z] <arturo> rebooting again cloudnet1003 (no network failover) (T271058)

Mentioned in SAL (#wikimedia-cloud) [2021-03-03T10:01:57Z] <arturo> icinga-downtime cloudnet1003 for 14 days bc potential alerting storm due to firmware issues (T271058)

OK, updating firmware-bnx2x plus upgrading the kernel apparently did the trick.

I'm closing this task now, thanks everyone for your work on this!

RobH closed subtask Unknown Object (Task) as Declined. Apr 22 2021, 9:48 PM