
cloudgw: suspected network problems
Closed, Resolved · Public

Assigned To
Authored By
aborrero
Nov 28 2024, 10:32 AM
Referenced Files
F57774754: node4_first_packets.pcap
Dec 3 2024, 10:26 AM
F57772729: Screenshot 2024-12-02 at 19.05.48.png
Dec 2 2024, 6:06 PM
F57772716: Screenshot 2024-12-02 at 19.02.02.png
Dec 2 2024, 6:03 PM
F57756613: image.png
Nov 28 2024, 5:32 PM
F57756609: image.png
Nov 28 2024, 5:32 PM
F57755996: image.png
Nov 28 2024, 10:46 AM
F57755984: image.png
Nov 28 2024, 10:35 AM

Description

There are a number of indications that cloudgw servers may be having network problems

Event Timeline

Some of the indications are:

image.png (286×300 px, 46 KB)

  • NIC kernel driver weird messages like bnxt_en 0000:65:00.0 enp101s0f0np0: Received firmware debug notification, data1: 0xdd67, data2: 0x0

Thanks for the task @aborrero

Usage is up over the past few weeks, and in tandem with this we've seen the system dropping packets it receives, usually an indication of a resource constraint that it is unable to keep up with the arrival rate:

image.png (945×962 px, 117 KB)

No single CPU seems to be maxing out, however. At busy times some of them show ~60% usage, so they may be busier at instantaneous moments than the averaged graphs suggest.

I had a look at the basic setup. It's a single-socket system (so no NUMA worries), with 8 physical CPU cores. We have 8 inbound queues set up for RSS which is correct for that, and the CPU usage is evenly spread so at least the basics on that are tuned correctly.

With that said the discards are higher on some inbound queues than others:

cmooney@cloudgw1002:~$ sudo ethtool -S enp101s0f0np0 | egrep rx_discards
     [0]: rx_discards: 652278
     [1]: rx_discards: 62014
     [2]: rx_discards: 7226163
     [3]: rx_discards: 0
     [4]: rx_discards: 1357913
     [5]: rx_discards: 0
     [6]: rx_discards: 2252266
     [7]: rx_discards: 955734
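
A hedged aside, not from the task itself: a quick way to quantify how skewed the drops are is to sum the per-queue counters and print each queue's share. The snippet below uses the counters quoted above as sample input; on the box itself you would pipe `sudo ethtool -S enp101s0f0np0 | grep rx_discards` in instead.

```shell
# Sample data: the per-queue rx_discards counters from this task.
sample='     [0]: rx_discards: 652278
     [1]: rx_discards: 62014
     [2]: rx_discards: 7226163
     [3]: rx_discards: 0
     [4]: rx_discards: 1357913
     [5]: rx_discards: 0
     [6]: rx_discards: 2252266
     [7]: rx_discards: 955734'

# Sum the counters and print each queue's share of the total drops.
echo "$sample" | awk '
  {q[$1] = $3; total += $3}
  END {
    for (i in q) printf "%s %d (%.1f%%)\n", i, q[i], 100 * q[i] / total
    printf "total %d\n", total
  }'
```

With these numbers queue 2 alone accounts for roughly 58% of all drops, which supports the "a few hot flows" reading.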

That may just be because particular flows are the busy ones (a given flow - based on the 5-tuple of packet headers - will always get hashed to the same rx queue). But it might also indicate the recent increase in usage is not a general "across the board" uptick but rather a specific set of endpoints/application flows.

Overall I think it merits us looking deeper to see if we can find exactly where things are going wrong. However, as it seems clear this is due to increased usage, I think we also need to consider how we might scale up the system (we see peaks of ~6Gb/sec on the 10Gb link now, so we need to start thinking that way). Options might include faster servers, more NICs, more servers in parallel (NAT is a headache then), etc.

We should also try to identify the source of the increased traffic, ensure nothing is malfunctioning, and confirm it's traffic we're happy to support.

I agree, the peaks of ~6Gb/sec on the 10Gb link, taking into account the servers perform NAT, may indicate that we are hitting scale limits.

IPv6 deploy may help with this, as it doesn't use NAT.

The rollout of IPv6 is also a "cheap" way to introduce parallelism: we can deploy IPv4 gateway VIPs on one node, IPv6 on the other, so traffic for each protocol flows in a different box. In case of failover, a single box can have both VIPs, so we would still have redundancy.
There are, of course, other more elaborate setups we could explore.

IPv6 deploy may help with this, as it doesn't use NAT.

Indeed!

The rollout of IPv6 is also a "cheap" way to introduce parallelism: we can deploy IPv4 gateway VIPs on one node, IPv6 on the other

Hah I love this thinking! And Happy Eyeballs means that we get a natural balancing across each of them. I actually do something similar with my two home internet connections and it works flawlessly.

We should also try to identify the source of the increased traffic, ensure nothing is malfunctioning, and confirm it's traffic we're happy to support.

Based on the graphs it seems that the PAWS service is responsible for the recent big jump in usage. Graphing them together shows total correlation:

image.png (771×1 px, 326 KB)

image.png (771×1 px, 202 KB)

We should work out if this is expected.

I believe paws is running on the following neutron IPs:

workers:

172.16.5.100
172.16.1.161
172.16.5.229
172.16.0.46
172.16.5.253

controller:

172.16.1.198

The node list from kubectl is formatted slightly differently but shows the same IPs.

The "Network I/O" graph from the Paws usage statistics Grafana dashboard confirms big spikes of "transmit" activity from PAWS. Unfortunately this graph only goes back 2 weeks.

Screenshot 2024-12-02 at 19.05.48.png (1×1 px, 454 KB)

@aborrero might it be an idea to put a special rule on the cloudgw to NAT these IPs to a different outside public IPv4 address? That way we could identify them in our netflow logs of traffic on the CRs. Just a thought.
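
For illustration, such a rule might look like the nftables sketch below. The table/set names, the egress interface, and the public address (a TEST-NET placeholder) are all assumptions, not taken from the actual cloudgw config:

```
table ip cloudgw_nat {
    # The PAWS worker and controller IPs listed above.
    set paws_nodes {
        type ipv4_addr
        elements = { 172.16.5.100, 172.16.1.161, 172.16.5.229,
                     172.16.0.46, 172.16.5.253, 172.16.1.198 }
    }
    chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;
        # SNAT PAWS traffic to a dedicated public IP so it stands out in netflow.
        ip saddr @paws_nodes oifname "enp101s0f0np0" snat to 198.51.100.10 comment "placeholder public IP"
    }
}
```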

I took some liberties and created a dashboard on the wmcloud.org grafana to show stats for each of those instances:

https://grafana.wmcloud.org/goto/9fr1GSVHz

One finding is that the different VMs/pods are not all generating this traffic at the same time. All of the workers show high spikes in bandwidth, but it's happening at different times on each.

As it turns out all of these devices are in rack D5, which explains why looking at the sflow data (which we only have for E4/F4) didn't show me much. In fact they are all on just two cloudvirts:

paws-127a-m3mctzr7itba-node-0  172.16.5.100  fa:16:3e:91:30:18  cloudvirt1036  cloudsw1-d5-eqiad
paws-127a-m3mctzr7itba-node-2  172.16.5.229  fa:16:3e:8d:f0:9a  cloudvirt1036  cloudsw1-d5-eqiad
paws-127a-m3mctzr7itba-node-1  172.16.1.161  fa:16:3e:1e:d1:07  cloudvirt1047  cloudsw2-d5-eqiad
paws-127a-m3mctzr7itba-node-3  172.16.0.46   fa:16:3e:b6:df:d1  cloudvirt1047  cloudsw2-d5-eqiad
paws-127a-m3mctzr7itba-node-4  172.16.5.253  fa:16:3e:fb:45:4c  cloudvirt1047  cloudsw2-d5-eqiad

Right so I was able to capture some of the traffic this morning as I could see PAWS node 4 was sending ~2Gb/sec.

It seems to be a steady stream of 512-byte UDP packets going to IP 20.204.18.216 port 11026.

09:56:31.339663 IP 172.16.5.253.4977 > 20.204.18.216.11026: UDP, length 512

The IP is registered to Microsoft MSN (is this Azure? not sure), and reading between the lines of the reverse DNS in a trace it seems to be located near Pune, India. It doesn't return anything on Shodan, nor does it appear to have any open ports.

The obvious question is: do we still think that's a crypto-miner? It doesn't fit the profile to me. The second question is: what legitimate reason does PAWS have for sending streams of such UDP traffic to public endpoints? Can we not lock down the outbound traffic it is allowed to send to a reasonable set of IPs/ports/services?

PCAP is here of first 1000 captured packets:

There is another 10GB of these captured from cloudnet, and almost 60GB from the cloudvirt itself, grabbed in about 2 mins, if anyone is interested. The contents don't make any sense to me.
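
Since the payload looks like gibberish, one quick triage step (a suggestion, not from the task) is to estimate the Shannon entropy of the bytes: encrypted or random payloads sit close to 8 bits/byte, while structured protocol data scores much lower. A minimal POSIX-shell sketch, assuming you feed it bytes extracted from the pcap:

```shell
# Estimate Shannon entropy (bits/byte) of whatever bytes arrive on stdin.
entropy() {
  od -An -tu1 -v | tr -s ' ' '\n' | awk '
    NF {count[$1]++; n++}
    END {
      for (b in count) {p = count[b] / n; H -= p * log(p) / log(2)}
      printf "%.2f bits/byte\n", H
    }'
}

printf 'AAAAAAAA' | entropy          # prints "0.00 bits/byte": fully structured
head -c 4096 /dev/urandom | entropy  # close to 8 bits/byte: random-looking
```

A result near 8 bits/byte for the UDP payloads would be consistent with encrypted or deliberately obfuscated data rather than a plaintext protocol.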

Change #1100077 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Block PAWS workers nodes from all UDP traffic other than DNS & NTP

https://gerrit.wikimedia.org/r/1100077
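
Going by the commit subject alone, the intent of the change can be sketched in nftables roughly as follows. The table, chain, and set names here are hypothetical; the real puppet-managed rules will differ:

```
table ip filter {
    set paws_workers {
        type ipv4_addr
        elements = { 172.16.5.100, 172.16.1.161, 172.16.5.229,
                     172.16.0.46, 172.16.5.253 }
    }
    chain forward {
        type filter hook forward priority filter; policy accept;
        # DNS and NTP are still allowed out...
        ip saddr @paws_workers udp dport { 53, 123 } accept
        # ...all other UDP from the workers is dropped.
        ip saddr @paws_workers meta l4proto udp drop
    }
}
```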

Change #1100077 merged by Cathal Mooney:

[operations/puppet@production] Block PAWS workers nodes from all UDP traffic other than DNS & NTP

https://gerrit.wikimedia.org/r/1100077

Mentioned in SAL (#wikimedia-operations) [2024-12-03T11:31:51Z] <topranks> pushing new nftables rules to cloudgw1001 to block abuse from paws T381078

Change #1100087 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Fix syntax errors in nft rules

https://gerrit.wikimedia.org/r/1100087

Change #1100087 merged by Cathal Mooney:

[operations/puppet@production] Fix syntax errors in nft rules

https://gerrit.wikimedia.org/r/1100087

fnegri triaged this task as High priority.Dec 3 2024, 3:55 PM
This comment was removed by taavi.
aborrero claimed this task.

The most accepted theory is that we had faulty hardware, which was replaced in T382356: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev