There are a number of indications that cloudgw servers may be having network problems
Some of the indications are:
- Strange NIC kernel driver messages, e.g. bnxt_en 0000:65:00.0 enp101s0f0np0: Received firmware debug notification, data1: 0xdd67, data2: 0x0
Thanks for the task @aborrero
Usage is up over the past few weeks, and in tandem with this we've seen the system dropping packets it receives, usually an indication of a resource constraint preventing it from keeping up with the arrival rate:
No single CPU seems to be maxing out, though at busy times some of them show ~60% usage, so they may be busier at instantaneous moments.
I had a look at the basic setup. It's a single-socket system (so no NUMA worries), with 8 physical CPU cores. We have 8 inbound queues set up for RSS which is correct for that, and the CPU usage is evenly spread so at least the basics on that are tuned correctly.
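As a quick sanity check of that kind of setup, the current RSS queue count can be compared against the core count. A minimal sketch, using sample text in the style of `ethtool -l` output (the sample values below are illustrative, not captured from cloudgw1002):

```python
import re

# Sample output in the style of `ethtool -l <iface>` (illustrative).
sample_ethtool_l = """\
Channel parameters for enp101s0f0np0:
Pre-set maximums:
Combined:	8
Current hardware settings:
Combined:	8
"""

def current_combined_channels(text):
    """Return the 'Combined' value from the current-settings section."""
    matches = re.findall(r"Combined:\s+(\d+)", text)
    # First match is the pre-set maximum, last is the current setting.
    return int(matches[-1])

queues = current_combined_channels(sample_ethtool_l)
cores = 8  # single-socket box with 8 physical cores, per the above
print(f"queues={queues} cores={cores} matched={queues == cores}")
```

With one combined channel per physical core, RSS can spread inbound flows evenly without oversubscribing any core.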
With that said the discards are higher on some inbound queues than others:
cmooney@cloudgw1002:~$ sudo ethtool -S enp101s0f0np0 | egrep rx_discards
[0]: rx_discards: 652278
[1]: rx_discards: 62014
[2]: rx_discards: 7226163
[3]: rx_discards: 0
[4]: rx_discards: 1357913
[5]: rx_discards: 0
[6]: rx_discards: 2252266
[7]: rx_discards: 955734
That may just be because particular flows happen to be the busy ones (a given flow, based on the 5-tuple of packet headers, will always hash to the same rx queue). But it might also indicate that the recent increase in usage is not a general "across the board" uptick, but rather a specific set of endpoints/application flows.
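The pinning behaviour can be sketched like this. It's an illustrative stand-in, not the bnxt_en hash (real NICs use a Toeplitz hash over the 5-tuple), but any deterministic hash shows why one busy flow always lands on the same rx queue:

```python
import hashlib

NUM_QUEUES = 8  # matches the RSS setup described above

def rx_queue(src, dst, sport, dport, proto, num_queues=NUM_QUEUES):
    """Map a flow 5-tuple to an rx queue index deterministically.

    Stand-in hash for illustration; NICs use Toeplitz, but the key
    property is the same: identical 5-tuple -> identical queue.
    """
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return digest[0] % num_queues

flow = ("172.16.5.253", "20.204.18.216", 4977, 11026, "udp")
q1 = rx_queue(*flow)
q2 = rx_queue(*flow)
print(q1, q2, q1 == q2)  # same flow hashes to the same queue every time
```

So a single very high-rate flow can saturate one queue (and its CPU) while the others stay idle, which would produce exactly this kind of uneven rx_discards pattern.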
Overall I think it merits us looking deeper to see if we can find exactly where things are going wrong. However, as it seems clear this is due to increased usage, I think we also need to consider how we might scale up the system (we see peaks of ~6Gb/sec on the 10Gb link now, so we need to start thinking that way). Options might include faster servers, more NICs, more servers in parallel (NAT is a headache then), etc.
We should also try to identify the source of the increased traffic and ensure it's not anything malfunctioning and we're happy we need to support it.
I agree: peaks of ~6Gb/sec on the 10Gb link, taking into account that the servers perform NAT, may indicate that we are hitting scaling limits.
Deploying IPv6 may help with this, as it doesn't use NAT.
The rollout of IPv6 is also a "cheap" way to introduce parallelism: we can deploy IPv4 gateway VIPs on one node and IPv6 on the other, so traffic for each protocol flows through a different box. In case of failover, a single box can hold both VIPs, so we would still have redundancy.
There are, of course, other more elaborate setups we could explore.
Indeed!
The rollout of IPv6 is also a "cheap" way to introduce parallelism: we can deploy IPv4 gateway VIPs on one node, IPv6 on the other
Hah I love this thinking! And Happy Eyeballs means that we get a natural balancing across each of them. I actually do something similar with my two home internet connections and it works flawlessly.
Based on the graphs it seems that the PAWS service is responsible for the recent big jump in usage. Graphing them together shows total correlation:
We should work out if this is expected.
I believe paws is running on the following neutron IPs:
workers:
172.16.5.100 172.16.1.161 172.16.5.229 172.16.0.46 172.16.5.253
controller:
172.16.1.198
The "Network I/O" graph from the Paws usage statistics Grafana dashboard confirms big spikes of "transmit" activity from PAWS. Unfortunately this graph only goes back 2 weeks.
@aborrero might it be an idea to put a special rule on the cloudgw to NAT these IPs to a different outside, public IPv4? That way we could identify them in our netflow logs of traffic on the CRs. Just a thought.
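The idea could look something like the following nftables fragment. This is only a sketch of the concept: the table/chain names, the outbound interface, and the public address (192.0.2.10, a documentation address) are placeholders, not our actual config:

```
# Hypothetical sketch: source-NAT the PAWS worker/controller IPs to a
# dedicated public address so their flows stand out in CR netflow logs.
table ip nat {
    chain postrouting {
        type nat hook postrouting priority srcnat;
        ip saddr { 172.16.5.100, 172.16.1.161, 172.16.5.229,
                   172.16.0.46, 172.16.5.253, 172.16.1.198 }
            oifname "eno1" snat to 192.0.2.10
    }
}
```

A more specific rule like this would match before the general masquerade/SNAT rule for the rest of the 172.16.0.0/16 range, so only the PAWS traffic would get the distinctive public IP.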
I took some liberties and created a dashboard on the wmcloud.org grafana to show stats for each of those instances:
https://grafana.wmcloud.org/goto/9fr1GSVHz
One finding is that the different VMs/pods are not all generating this traffic at the same time. All of the workers show high spikes in bandwidth, but it's happening at different times on each.
As it turns out all of these devices are in rack D5, which explains why looking at the sflow data (which we only have for E4/F4) didn't show me much. In fact they are all on just two cloudvirts:
paws-127a-m3mctzr7itba-node-0 172.16.5.100 fa:16:3e:91:30:18 cloudvirt1036 cloudsw1-d5-eqiad
paws-127a-m3mctzr7itba-node-2 172.16.5.229 fa:16:3e:8d:f0:9a cloudvirt1036 cloudsw1-d5-eqiad
paws-127a-m3mctzr7itba-node-1 172.16.1.161 fa:16:3e:1e:d1:07 cloudvirt1047 cloudsw2-d5-eqiad
paws-127a-m3mctzr7itba-node-3 172.16.0.46 fa:16:3e:b6:df:d1 cloudvirt1047 cloudsw2-d5-eqiad
paws-127a-m3mctzr7itba-node-4 172.16.5.253 fa:16:3e:fb:45:4c cloudvirt1047 cloudsw2-d5-eqiad
Right so I was able to capture some of the traffic this morning as I could see PAWS node 4 was sending ~2Gb/sec.
It seems to be a steady stream of 512-byte UDP packets going to IP 20.204.18.216 port 11026.
09:56:31.339663 IP 172.16.5.253.4977 > 20.204.18.216.11026: UDP, length 512
The IP is registered to Microsoft MSN (is this Azure? not sure), and reading between the lines of the reverse DNS in a traceroute, it seems to be located near Pune, India. It doesn't return anything on Shodan, nor does it appear to have any open ports.
The obvious question is: do we still think this is a crypto-miner? It doesn't fit the profile to me. The second question is: what is the legitimate reason for PAWS to be sending streams of UDP traffic like this to public endpoints? Can we not lock down the outbound traffic it is allowed to send to a reasonable set of IPs/ports/services?
PCAP is here of first 1000 captured packets:
There is another 10GB of these captured from cloudnet, and almost 60GB from the cloudvirt itself, grabbed in about 2 minutes, if anyone is interested. The contents don't make any sense to me.
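For anyone digging into the captures, a quick tally of per-destination UDP payload bytes from tcpdump's one-line output is enough to size the stream. A minimal sketch (the first sample line is the real one quoted above; the second is illustrative):

```python
import re
from collections import Counter

# Sample lines in the one-line format tcpdump printed above.
lines = [
    "09:56:31.339663 IP 172.16.5.253.4977 > 20.204.18.216.11026: UDP, length 512",
    "09:56:31.339912 IP 172.16.5.253.4977 > 20.204.18.216.11026: UDP, length 512",
]

# Match "IP <src>.<sport> > <dst>.<dport>: UDP, length <n>"
pat = re.compile(r"IP \S+ > (\d+\.\d+\.\d+\.\d+)\.(\d+): UDP, length (\d+)")

bytes_per_dst = Counter()
for line in lines:
    m = pat.search(line)
    if m:
        dst, port, length = m.group(1), m.group(2), int(m.group(3))
        bytes_per_dst[(dst, port)] += length

for (dst, port), total in bytes_per_dst.items():
    print(f"{dst}:{port} -> {total} UDP payload bytes")
```

Run against the full capture (e.g. piped from `tcpdump -nr <file>`), dividing the totals by the capture duration gives the per-destination rate, which should line up with the ~2Gb/sec seen from node 4.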
Change #1100077 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):
[operations/puppet@production] Block PAWS workers nodes from all UDP traffic other than DNS & NTP
Change #1100077 merged by Cathal Mooney:
[operations/puppet@production] Block PAWS workers nodes from all UDP traffic other than DNS & NTP
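The intent of the patch, expressed roughly in nftables terms, is below. This is a sketch only, not the actual puppet-managed ruleset; the table/chain names are illustrative:

```
# Hypothetical sketch: drop UDP from the PAWS workers except DNS (53)
# and NTP (123). Names and structure are illustrative, not the real rules.
table inet cloudgw {
    chain forward {
        type filter hook forward priority filter;
        ip saddr { 172.16.5.100, 172.16.1.161, 172.16.5.229,
                   172.16.0.46, 172.16.5.253 } udp dport { 53, 123 } accept
        ip saddr { 172.16.5.100, 172.16.1.161, 172.16.5.229,
                   172.16.0.46, 172.16.5.253 } meta l4proto udp drop
    }
}
```

Ordering matters: the accept for DNS/NTP has to come before the blanket UDP drop, since nftables evaluates rules in a chain sequentially.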
Mentioned in SAL (#wikimedia-operations) [2024-12-03T11:31:51Z] <topranks> pushing new nftables rules to cloudgw1001 to block abuse from paws T381078
Change #1100087 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):
[operations/puppet@production] Fix syntax errors in nft rules
Change #1100087 merged by Cathal Mooney:
[operations/puppet@production] Fix syntax errors in nft rules
The most accepted theory is that we had faulty hardware, which was replaced in T382356: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev




