cloudvirt10[25-30] connection issues on primary nic
Closed, Resolved · Public

Description

I just now reimaged cloudvirt1025. The reimage went OK; I launched a canary VM on it and I can ssh to the VM (it is canary1025-01.cloudvirt-canary.eqiad1.wikimedia.cloud).

That was a couple of hours ago. NOW I can't ssh to the host or ping it from cumin1001. Everything looks fine on the console, and I can still ssh to the canary VM.

Icinga also can't reach the host, which is noisy.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-12-03T16:56:33Z] <arturo> rebooting cloudvirt1025 to debug network issue T269313

Mentioned in SAL (#wikimedia-cloud) [2020-12-03T17:01:10Z] <arturo> icinga downtime cloudvirt1025 for 48h to debug network issue T269313

I just reimaged cloudvirt1026 and it's exhibiting this same behavior.

The issue is related to ARP, and I suspect something is wrong with the NIC or the OS: most of the ARP queries from hosts in the same vlan towards cloudvirt1025's IP don't show up in tcpdump.

Some queries occasionally go through.
All the replies (to queries sent out by cloudvirt1025) come back.

Once a MAC is learned (on either side, gateway included), things work fine until the entry times out.
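
One way to watch that timeout from a peer is the kernel neighbor cache (a sketch; the IP is cloudvirt1025's, the flags are standard iproute2):

cloudvirt1035:~$ ip -s neigh show to 10.64.20.49   # entry should cycle REACHABLE -> STALE; with the bug, re-resolution then fails
cloudvirt1035:~$ ip monitor neigh                  # stream the state changes live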

I'm using cloudvirt1035 as a test source, as it is in the same vlan and on the same switch.
Both MACs are learned (and kept up to date) in the switch switching-table, as there is always some traffic going in and out.
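
The switch-side check looks roughly like this (a sketch; the two MACs are the hosts', from the captures below):

cloudsw1-c8-eqiad> show ethernet-switching table | match "bc:97:e1:a7:43:94|bc:97:e1:4a:68:52"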

Running a continuous arping (cloudvirt1035:~$ sudo arping 10.64.20.49) shows one brief request frame in tcpdump (and that was the only one I saw), as well as its reply:

cloudvirt1025
19:31:01.622966 ARP, Request who-has cloudvirt1025.eqiad.wmnet tell cloudvirt1035.eqiad.wmnet, length 46
19:31:01.622995 ARP, Reply cloudvirt1025.eqiad.wmnet is-at bc:97:e1:a7:43:94 (oui Unknown), length 28
19:31:01.623322 ARP, Request who-has cloudvirt1035.eqiad.wmnet tell cloudvirt1025.eqiad.wmnet, length 28
19:31:01.623578 ARP, Reply cloudvirt1035.eqiad.wmnet is-at bc:97:e1:4a:68:52 (oui Unknown), length 46
cloudvirt1035
cloudvirt1035:~$ sudo arping 10.64.20.49
ARPING 10.64.20.49
60 bytes from bc:97:e1:a7:43:94 (10.64.20.49): index=0 time=619.026 msec
Timeout
Timeout
Timeout
Timeout
Timeout
Timeout
Timeout
Timeout
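
For reference, the host-side capture amounts to something like this (a sketch; exact flags assumed, interface name as identified later in the task):

root@cloudvirt1025:~# tcpdump -eni eno1np0 arp and host 10.64.20.49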

I checked that there is no storm control or other threshold configured on the switch.
I also went through all the ARP-related PRs (Juniper Problem Reports) for our switch model/version.

I then added a counter on the switch port facing cloudvirt1025:

[edit interfaces xe-0/0/29]
+    unit 0 {
+        family ethernet-switching {
+            filter {
+                output log-arp;  <- traffic exiting the interface toward the host
+            }
+        }
+    }
[edit firewall]
+    family ethernet-switching {
+        filter log-arp {
+            term arp {
+                from {
+                    source-mac-address {
+                        bc:97:e1:4a:68:52/48;   <- cloudvirt1035 MAC
+                    }
+                    ether-type arp;
+                }
+                then {
+                    accept;
+                    count arp-counter;  <- log/syslog don't work for output filters
+                }
+            }
+            term all {
+                then accept;
+            }
+        }
+    }

This shows the counter increasing (only while the arping is running):

cloudsw1-c8-eqiad# run show firewall counter arp-counter filter log-arp
Filter: log-arp                                                
Counters:
Name                                                Bytes              Packets
arp-counter                                           256                    4

That means the ARP frames are making it out of the switch towards cloudvirt1025, but tcpdump (which captures before iptables) doesn't see them.
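
A reasonable next check on the host side (a sketch; counter names vary by driver, so the grep is approximate) is whether the NIC's own statistics show where the frames are dying:

root@cloudvirt1025:~# ethtool -S eno1np0 | grep -iE 'drop|discard|err'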

As a control, arping from cloudvirt1035 to cloudvirt1034 (on the same switch/vlan) works fine.

cloudvirt1035:~$ sudo arping 10.64.20.76
ARPING 10.64.20.76
60 bytes from bc:97:e1:4a:6d:50 (10.64.20.76): index=0 time=53.795 usec
60 bytes from bc:97:e1:4a:6d:50 (10.64.20.76): index=1 time=43.638 usec
60 bytes from bc:97:e1:4a:6d:50 (10.64.20.76): index=2 time=53.997 usec
60 bytes from bc:97:e1:4a:6d:50 (10.64.20.76): index=3 time=55.776 usec

Maybe we should go through the NIC's settings, or try upgrading its firmware? FYI, LLDP reports it as:

Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_214.0.200.0

Arzhel nerd-sniped me with this.

It seems that all broadcast traffic destined for eno1np0 arrives, untagged, on eno2np1(!), which is why ARP doesn't work. After double-checking the configs on both the switch and the host, the most plausible explanation is that this is either a NIC firmware bug, a kernel driver bug, or a combination of both.
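
The misdelivery can be seen with two captures side by side (a sketch; flags assumed):

root@cloudvirt1025:~# tcpdump -eni eno1np0 arp   # stays silent while the problem is active
root@cloudvirt1025:~# tcpdump -eni eno2np1 arp   # shows the untagged broadcasts meant for eno1np0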

I played around with BIOS settings, disabling EVB; that had no effect. However, either of these two things seems to restore connectivity:

  • root@cloudvirt1025:~# brctl delif brq7425e328-56 eno2np1.1105
  • root@cloudvirt1025:~# ethtool -K eno2np1 rxvlan off

The latter turns off RX VLAN acceleration (which would point to a NIC firmware bug) and makes sense. The former is a bit more surprising, but the two seem to be related, as this also happens:

root@cloudvirt1025:~# brctl addif brq7425e328-56 eno2np1.1105 # connectivity breaks
root@cloudvirt1025:~# ethtool -K eno2np1 rxvlan off # connectivity is restored
root@cloudvirt1025:~# ethtool -K eno2np1 rxvlan on # still works!
root@cloudvirt1025:~# brctl delif brq7425e328-56 eno2np1.1105 # still works
root@cloudvirt1025:~# brctl addif brq7425e328-56 eno2np1.1105 # breaks again

As next steps, I'd recommend that DC-Ops experiment with NIC firmware versions to begin with (start by upgrading; if that fails, try downgrading), especially if we have other boxes with the same HW that work. If that fails, someone from WMCS or I/F can help by trying a newer kernel (e.g. v5.9 from buster-backports) to see if it makes any difference. If all of that fails, we can puppetize rxvlan off, or get new HW if we pinpoint this to a specific chip.
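
If we do end up puppetizing it, one minimal way to persist the workaround with ifupdown (a sketch of the mechanism, not the actual Puppet change; the file placement is hypothetical):

# /etc/network/interfaces.d/eno2np1 (hypothetical)
iface eno2np1 inet manual
    post-up /sbin/ethtool -K eno2np1 rxvlan off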

Thank you @faidon! This is extremely strange.

OK, to add a little more color:

  • The VLAN configuration is not important. brctl addif brq7425e328-56 eno2np1 is enough to reproduce this behavior.
  • I was trying to work out why the bridge would matter (originally thinking hwmode/EVB etc.). I had tried setting promiscuous mode with no effect, but with a clearer mind this morning I tried promisc plus a down/up cycle and managed to reproduce it without a bridge being involved: ip link set promisc on dev eno2np1; ip link set down dev eno2np1; ip link set up eno2np1 reproduces it, and ip link set promisc off dev eno2np1 restores connectivity.

Effectively, the problem boils down to "if the second 10G port is set to promiscuous mode, it grabs & consumes the first port's broadcast traffic". Perhaps this simplified description helps with tracking this down in release notes and whatnot :)

Mentioned in SAL (#wikimedia-cloud) [2020-12-04T09:54:56Z] <arturo> icinga downtime cloudvirt1025 for 6 days (T269313)

Mentioned in SAL (#wikimedia-cloud) [2020-12-04T11:24:34Z] <arturo> icinga downtime cloudvirt1024 for 6 days, to avoid paging noises (T269313)

Mentioned in SAL (#wikimedia-cloud) [2020-12-04T11:25:37Z] <arturo> last log line referencing cloudvirt1024 is a mistake (T269313)

I've upgraded the NIC firmware from 21.40.20.00 to 21.65.33.33 (latest). Handing this back to @Andrew to push back into testing and see if that helped!


I ran @faidon's command on cloudvirt1025 and I still have an ssh session... so that seems promising.

root@cloudvirt1025:~# ip link set promisc on dev eno2np1; ip link set down dev eno2np1; ip link set up eno2np1
root@cloudvirt1025:~#

IRC update: Andrew asked me to also flash cloudvirt1026 to increase their test pool, so it has now gone from 21.40.20.00 to 21.65.33.33.

1025 and 1026 look good! @RobH, please upgrade 1027, 1028, 1029 and 1030 accordingly.

Thank you!

Andrew renamed this task from cloudvirt1025 connection issues on primary nic to cloudvirt10[25-30] connection issues on primary nic.Dec 4 2020, 5:38 PM

cloudvirt1027 nic firmware upgraded from 21.40.20.00 to 21.65.33.33; host has rebooted back into the OS but fails initial boot waiting for: [* ] A start job is running for dev-mapp
cloudvirt1028 nic firmware upgraded from 21.40.20.00 to 21.65.33.33; host has rebooted back into the OS but fails initial boot waiting for: [* ] A start job is running for dev-mapp

Should I bother trying to fix the filesystem on these newly-imaged hosts from the console, or should we just reimage?

IRC update:

they haven't been reimaged since the rack move so I wouldn't expect them to boot

I powered them both down and they are ready for reimage, working on the last two now.

cloudvirt1029 nic firmware upgraded from 21.40.20.00 to 21.65.33.33
cloudvirt1030 nic firmware upgraded from 21.40.20.00 to 21.65.33.33

Since neither host has been reimaged for its move yet, they are now powered down with updated firmware for the NIC.

These are all upgraded, which should clear this issue. I've not resolved the task yet, but if everything is still working by the end of today I will do so.

Mentioned in SAL (#wikimedia-cloud) [2020-12-04T21:06:54Z] <andrewbogott> putting cloudvirt1025 and 1026 back into service because I'm pretty sure they're fixed. T269313

cloudvirt1027 and 1028 are still showing firmware version 21.40.20.00. I can't easily check 1029 or 1030 since they're offline awaiting reimage.

root@cloudvirt1027:~# ethtool -i eno1np0 | grep firmware
firmware-version: 214.0.173.0/pkg 21.40.20.00

cloudvirt1027:

Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:0B:0C 	21.40.20.00
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:0B:0D 	21.40.20.00
Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:0B:0A 	21.60.16
Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:0B:0B 	21.60.16

So it updated the 1G interfaces but not the 10G interfaces; I'll work on this now.

I am no longer certain I did any of these right, so I'm now logging into the entire group and rechecking:

cloudvirt1027 was not updated on its 10G interfaces, just the 1G ones (on a combined NIC, so lovely). I downloaded the firmware for the 10G interface and flashed it successfully:

Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:0B:0A 	21.60.16
Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:0B:0B 	21.60.16
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:0B:0C 	21.65.33.33
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:0B:0D 	21.65.33.33

cloudvirt10[25-30] firmware re-check: it seems that, depending on which file you download, it updates either the 1G or the 10G interfaces, but not both. cloudvirt102[56] have the proper 10G interfaces updated (but not the 1G). The remainder had the 1G updated but not the 10G; I've fixed them, so cloudvirt10[27-30] now have fully updated NICs. Firmware output post-work is below.

cloudvirt1025 - 10G updated, 1G not updated - in service, not updated today:

Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:43:93 	21.40.9
Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:43:92 	21.40.9
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:43:94 	21.65.33.33
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:43:95 	21.65.33.33

cloudvirt1026 - 10G updated, 1G not updated - in service, not updated today:

Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:45:8A 	21.40.9
Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:45:8B 	21.40.9
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:45:8C 	21.65.33.33
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:45:8D 	21.65.33.33

cloudvirt1027 - 10G wasn't updated, so I fixed it today:

Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:0B:0A 	21.60.16
Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:0B:0B 	21.60.16
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:0B:0C 	21.65.33.33
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:0B:0D 	21.65.33.33

cloudvirt1028 - 10G wasn't updated, so I fixed it today:

Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:3B:D6 	21.60.16
Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:3B:D7 	21.60.16
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:3B:D8 	21.65.33.33
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:3B:D9 	21.65.33.33

cloudvirt1029 - 10G wasn't updated, so I fixed it today:

Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:35:A6 	21.60.16
Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:35:A7 	21.60.16
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:35:A8 	21.65.33.33
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:35:A9 	21.65.33.33

cloudvirt1030 - 10G wasn't updated, so I fixed it today:

Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:4B:96 	21.60.16
Broadcom Gigabit Ethernet BCM5720 - BC:97:E1:A7:4B:97 	21.60.16
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:4B:98 	21.65.33.33
Broadcom Adv. Dual 10Gb Ethernet - BC:97:E1:A7:4B:99 	21.65.33.33

cloudvirt10[27-30] have all now been updated and powered down, ready for reimage by the cloud services team.
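
Once they're reimaged, a quick way to recheck the in-OS firmware view across the pool (a sketch; assumes SSH access and the same interface naming on all six):

cumin1001:~$ for h in cloudvirt10{25..30}; do echo -n "$h: "; ssh "$h" 'sudo ethtool -i eno1np0 | grep firmware-version'; done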

@Andrew: reassigning this back to you so you are aware of the discrepancy on cloudvirt102[56]. Since it's the unused 1G interfaces, it's likely not worth taking the machines down to update their firmware.