
Handling inbound IPIP traffic on low traffic LVS k8s based realservers
Open, Medium, Public

Description

Traffic is currently experimenting with IPIP encapsulation on IPVS, using tcp-mss-clamper to perform MSS clamping and the Linux networking stack's IPIP and IP6IP6 support to handle inbound IPIP traffic.

We should analyze the viability of this approach for realservers running on Kubernetes.

Event Timeline


@akosiaris as mentioned in the meeting, we need the following questions answered:

  • Is it OK to clamp all egress traffic on a k8s node?
  • IPIP encapsulation needs reverse-path filtering disabled on the ipip/ip6ip6 interface in order to work; is that something calico supports?

Regarding reverse-path filtering it's enough to disable it on "all" and ipip0/ipip60, per Linux kernel documentation:

The max value from conf/{all,interface}/rp_filter is used when doing source validation on the {interface}.

vgutierrez@lvs1014:~$ sudo sysctl -a |grep \.rp_filter |grep -v arp
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.enp4s0f0.rp_filter = 1
net.ipv4.conf.enp4s0f1.rp_filter = 1
net.ipv4.conf.enp5s0f0.rp_filter = 1
net.ipv4.conf.enp5s0f1.rp_filter = 1
net.ipv4.conf.ipip0.rp_filter = 0
net.ipv4.conf.lo.rp_filter = 1
net.ipv4.conf.tunl0.rp_filter = 1

This configuration is enough to have incoming IPIP traffic on ipip0 and send responses via enp4s0f0 without reverse-path filtering blocking any traffic.
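The max-of-two rule quoted above can be sketched in a few lines of Python (the values mirror the lvs1014 sysctl output; this is an illustration of the rule, not how the kernel implements it):

```python
# Effective rp_filter per the kernel rule: max(conf/all, conf/<iface>).
# Values mirror the lvs1014 output above; illustration only.
RP_FILTER = {
    "all": 0,        # net.ipv4.conf.all.rp_filter
    "enp4s0f0": 1,   # physical NIC keeps strict mode
    "ipip0": 0,      # decap interface, filtering disabled
}

def effective_rp_filter(interface: str) -> int:
    """Value the kernel actually enforces for source validation."""
    return max(RP_FILTER["all"], RP_FILTER[interface])

# With conf/all at 0, the per-interface values decide:
print(effective_rp_filter("enp4s0f0"))  # 1 (strict stays on the NIC)
print(effective_rp_filter("ipip0"))     # 0 (IPIP traffic not blocked)
```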

I see that I did not put this here, sorry.

In the IPIP mail thread we suggested setting a fixed, smaller MTU for all Pod traffic, in order not to have to introduce additional iptables rules or the tcp-mss-clamper, as this seems like a pretty straightforward configuration which is supported by calico directly (https://docs.tigera.io/calico/latest/networking/configuring/mtu). That would affect all pod traffic, but as it's in a prominent place and supported by the CNI, it might be the cleanest option.

As mentioned on the email thread, that sounds like a viable option for us.

Agreed. If the POD (both ends of the veth linking it to the main netns) has a lower MTU, but the K8s host physical interface retains a 1500 MTU, it will work.

SYN/ACK from within the pod will have MSS appropriate to its (lower) MTU, ensuring the 1500 available on the host itself will be enough to receive any response + 20 byte IPIP overhead. The IPIP header is stripped off by the host and the remaining packet is small enough to get to the POD with its lower MTU.
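A quick sanity check of that budget, as a sketch (header sizes are the standard ones; the 1460 pod MTU is an assumed example value):

```python
# Sketch of the MSS/MTU budget for IPv4 IPIP encapsulation.
# Header sizes are standard; the 1460 pod MTU is an assumed example.
IPV4_HDR = 20          # one IPv4 header
TCP_HDR = 20
IPIP_OVERHEAD = 20     # extra outer IPv4 header added by encapsulation

pod_mtu = 1460
host_mtu = 1500

# MSS the pod advertises in its SYN/ACK: its MTU minus IP and TCP headers.
mss = pod_mtu - IPV4_HDR - TCP_HDR
print(mss)  # 1420

# Largest IPIP-encapsulated packet arriving at the host for that MSS:
encapsulated = mss + TCP_HDR + IPV4_HDR + IPIP_OVERHEAD
print(encapsulated)              # 1480
assert encapsulated <= host_mtu  # fits the host's 1500 with room to spare
```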

Thanks @cmooney, do you have any idea of when you could proceed with this, @JMeybohm?

To add some more information on the rp_filter setting: apparently, starting with calico version 3.23.0 (we currently run 3.23.3), per https://github.com/projectcalico/calico/commit/a69f24a9848c8c8350cceb1c71bf7f4097d5d3b7, it is possible to configure felix (the component that sets rp_filter=1) to allow the pods to spoof their addresses. We most certainly don't want this, so we won't be setting it, but... in that patch the concept of a default rp_filter was introduced, and the default is figured out by the following code

	defaultRPFilter, err := os.ReadFile("/proc/sys/net/ipv4/conf/default/rp_filter")
	if err != nil {
		log.Warn("could not determine default rp_filter setting, defaulting to strict")
		defaultRPFilter = []byte{'1'}
	}

which means that calico just carries over the rp_filter setting that we get from https://github.com/wikimedia/operations-puppet/blob/production/modules/base/manifests/sysctl.pp, which has the $enable_rp_filter parameter defaulting to true. In theory, we should be able to switch that value via puppet (to 2 if anything, which will require exposing more of the rp_filter values to the puppet interface) and not have felix override it these days.

Now per sysctl.txt

	The max value from conf/{all,interface}/rp_filter is used
	when doing source validation on the {interface}.

which means we can NOT just set rp_filter to 0 for ipip0; we would also have to set it for all. We could, however, apparently set it to 2 for ipip0 and it should work? That would save us from some puppet refactoring and it would definitely be much easier to test, as it would keep the blast radius of the change very limited. @Vgutierrez would that be acceptable? Is there already some way to set rp_filter per interface?

I don't think it should matter to have the same setting for all interfaces on the box. As I understand it we can break these down as follows:

Effect of rp_filter per interface:

  • Host physical: Doesn't do anything. If rp_filter is off everything is allowed; if it is on, everything is still allowed due to the default route out this interface.
  • Pod-side of veth (eth0): Same as above; there is a default route out eth0 in the pod, so everything will be allowed in regardless of the rp_filter setting (also note the setting is independent in each netns).
  • Host ipip0: Needs to have rp_filter off (0) or in "loose" mode (2), as packets from clients will appear to arrive in on this interface after decap.
  • Host-side of veth (caliXXXX): Needs to have rp_filter off (0) or in "loose" mode (2), as pods want to send packets from the service VIP, which is not routed out this interface (packets get forwarded to pods based on iptables mangling afaik).

So as I understand it a global setting of 0 or 2 should be ok.

it is possible to configure felix (the component that sets rp_filter=1) to allow the pods to spoof their addresses. We most certainly don't want this,

Why not? We need to allow the pods to "spoof" their IP, so they can send packets from the service VIP address, right?

With rp_filter alone I don't think we can allow pods to send packets from the service VIP, but block any other spoofed packets from them.

Needs to have rp_filter off (0) or in "loose" mode (2) as pods want to send packets from the service VIP, which is not routed out this interface (packets get forwarded to pods based on iptables mangling afaik)

and

Why not? We need to allow the pods to "spoof" their IP, so they can send packets from the service VIP address, right?
With rp_filter alone I don't think we can allow pods to send packets from the service VIP, but block any other spoofed packets from them.

This isn't true. Pods do not see the service VIP ever. Traffic reaching them has a destination IP that matches their own IP, due to the DNAT that happens on the node layer to implement the probabilistic load balancing. They reply with their own IP as the source IP to the node that received the original flow (which might very well not be the node they are on), and it's that node that does the inverse process (SNAT) on the packets it receives and sends them to the originating client.
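As a toy illustration of that flow (the addresses are made up; the real mechanism is iptables/conntrack programmed on the node, not Python):

```python
# Toy model of the node-level NAT described above: the pod never sees the
# service VIP. Addresses are illustrative assumptions.
SERVICE_VIP = "10.2.1.10"
POD_IP = "10.64.65.59"
CLIENT_IP = "10.64.0.5"

def node_dnat(packet: dict) -> dict:
    """Inbound on the node: rewrite the service VIP to the chosen pod IP."""
    if packet["dst"] == SERVICE_VIP:
        return {**packet, "dst": POD_IP}
    return packet

def node_snat(reply: dict) -> dict:
    """Outbound on the node: rewrite the pod IP back to the service VIP."""
    if reply["src"] == POD_IP:
        return {**reply, "src": SERVICE_VIP}
    return reply

inbound = node_dnat({"src": CLIENT_IP, "dst": SERVICE_VIP})
outbound = node_snat({"src": POD_IP, "dst": CLIENT_IP})
print(inbound["dst"])   # the pod sees its own IP as the destination
print(outbound["src"])  # the client sees the VIP as the source
```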

This isn't true. Pods do not see the service VIP ever. Traffic reaching them has a destination IP that matches their own IP, due to the DNAT that happens on the node layer

Ah ok ok. Right well then we can leave rp_filter=1 on the caliXXXX ints and prevent IP spoofing from within a POD.

The only place I can see we'd have problems then is on the ipip0 interface, we need to set rp_filter to 0 or 2 there.

@akosiaris is there any update on this one?

If I recall correctly from our discussion at the SRE Summit the current plans are:

  • LVS will send IPIP encapsulated packets to an address on the physical K8s host itself
  • The host will have the ipip0 and ipip60 virtual interfaces on it, where the decapsulated packets arrive
  • The decapsulated packets are routed to the correct POD from the host
  • The veth pairs connecting PODs to the default netns on the hosts will have MTU set at 1460
    • This means the PODs will never announce a TCP MSS greater than 1420.

Please correct me if any of that is incorrect!

@cmooney that also implies increasing MTU on the LVS host as well, right?

@cmooney that also implies increasing MTU on the LVS host as well, right?

I edited my comment. It would, but I had mis-remembered the plan, we'll use smaller MTU on the PODs afaik.

@cmooney that also implies increasing MTU on the LVS host as well, right?

I edited my comment. It would, but I had mis-remembered the plan, we'll use smaller MTU on the PODs afaik.

That's my recollection as well. We lower the PODs' MTU so that their packets (encapsulated) fit the default MTU and we don't have to modify the MTU or MSS-clamp on the nodes.

That would be enough to accommodate IPv4 and IPv6? We currently clamp at 1440 bytes for ipv4 and at 1400 bytes for ipv6

Nevermind, we only do ipv4 for low-traffic/internal services

That would be enough to accommodate IPv4 and IPv6? We currently clamp at 1440 bytes for ipv4 and at 1400 bytes for ipv6

That's actually a good question. If the IPv6 packets are sent from LVS -> realserver over IPv4 it would be ok:

1420 TCP Segment + 20 byte TCP header + 40 byte IPv6 header + 20 byte IPIP(v4) header = 1500

But if we use IPv6 transport between the LVS and realserver that means IPIP header is 40 bytes, so we need an MSS of 1400, meaning a POD interface MTU of 1460.

1400 TCP Segment + 20 byte TCP header + 40 byte IPv6 header + 40 byte IPIP(v6) header = 1500
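Both budgets can be computed side by side; a sketch using the standard header sizes:

```python
# Sketch of the two encapsulation budgets above: the largest TCP MSS so
# that an IPv6 flow, encapsulated toward the realserver, still fits in
# the host's 1500-byte MTU, and the pod MTU implied by that MSS.
HOST_MTU = 1500
TCP_HDR = 20
IPV6_HDR = 40

budgets = {}
for outer_hdr, transport in ((20, "IPv4 outer (IPIP)"),
                             (40, "IPv6 outer (IP6IP6)")):
    max_mss = HOST_MTU - outer_hdr - IPV6_HDR - TCP_HDR
    pod_mtu = max_mss + TCP_HDR + IPV6_HDR  # MTU that advertises this MSS
    budgets[transport] = (max_mss, pod_mtu)
    print(f"{transport}: max MSS {max_mss}, pod MTU {pod_mtu}")
```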

@akosiaris is there any update on this one?

If I recall correctly from our discussion at the SRE Summit the current plans are:

  • LVS will send IPIP encapsulated packets to an address on the physical K8s host itself
  • The host will have the ipip0 and ipip60 virtual interfaces on it, where the decapsulated packets arrive
  • The decapsulated packets are routed to the correct POD from the host
  • The veth pairs connecting PODs to the default netns on the hosts will have MTU set at 1480
    • This means the PODs will never announce a TCP MSS greater than 1440.

Please correct me if any of that is incorrect!

Per my memory, this is indeed correct.

Change #1145981 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] calico: Allow to override the MTU via values files

https://gerrit.wikimedia.org/r/1145981

Change #1145982 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] calico: Set veth_mtu to 1480 for staging-codfw

https://gerrit.wikimedia.org/r/1145982

Change #1145981 merged by jenkins-bot:

[operations/deployment-charts@master] calico: Allow to override the MTU via values files

https://gerrit.wikimedia.org/r/1145981

Change #1145982 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] calico: Set veth_mtu to 1480 for {,ml}-staging-codfw

https://gerrit.wikimedia.org/r/1145982

Unfortunately the 2 patches above didn't work. For ml-staging-codfw, just because it's still, by virtue of helmfile.d/admin_ng/values/common.yaml, locked to 0.2.10. It did not work for staging-codfw either because, while the upstream manifests do indeed have support, that support is implemented by having the CNI config managed by calico, something we did not import in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1112058, sticking with CALICO_MANAGE_CNI: false. We manage this via puppet instead. It should be noted that our puppet implementation does allow differentiating per cluster, same as the chart approach. There is no real functional difference between the two ways.

So, the 2 above patches are (almost) moot.

So, what next? Long term vs short term:

Short-term

We can just use Puppet's support to proceed with our tests.

Long-term

We probably want to minimize our diff from upstream manifests in order to allow easier upgrades in the future. We can gradually move away from managing CNI in puppet now that upstream has support for this, making our lives easier in the long term.

I'll indeed test that the MTU works with a Puppet patch, but pursue the long-term plan as well.

Change #1148326 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] staging-codfw: Specify MTU of 1460

https://gerrit.wikimedia.org/r/1148326

Change #1148326 merged by Alexandros Kosiaris:

[operations/puppet@production] staging-codfw: Specify MTU of 1460

https://gerrit.wikimedia.org/r/1148326

nobody@wmfdebug:/$ ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0@if69: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether ae:16:b8:c3:6d:75 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Note the 1460 MTU for eth0

The puppet approach worked in staging-codfw. I don't see a reason why it wouldn't against all clusters, but we probably want to proceed slowly and carefully.

Long-term

We probably want to minimize our diff from upstream manifests in order to allow easier upgrades in the future. We can gradually move away from managing CNI in puppet now that upstream has support for this, making our lives easier in the long term.

While this remains a sane long-term plan, it got torpedoed by the fact that installing the config file requires the install CNI binary (which isn't really a CNI plugin) to exist, and we don't currently ship it in our debian package. It also seems to imply we'll need to move the entire management of CNI plugins from Puppet to Calico, which sounds scary, and thus we don't yet have a consensus (if we ever will) that it should happen.

Change #1148795 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] staging-eqiad: Specify MTU of 1460

https://gerrit.wikimedia.org/r/1148795

Change #1148795 merged by Alexandros Kosiaris:

[operations/puppet@production] staging-eqiad: Specify MTU of 1460

https://gerrit.wikimedia.org/r/1148795


great! thanks @akosiaris

staging-eqiad with an MTU of 1460 as well.

2: eth0@if841: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether b6:17:bc:c1:05:ad brd ff:ff:ff:ff:ff:ff link-netnsid 0

I am gonna leave this be for a few days. My basic tests didn't exhibit anything weird, but we also don't have a lot of health checks on the staging cluster. I don't think my tests were very thorough (nor am I sure it would make sense to make them so), so I'm falling back to time to prove this isn't going to cause problems. Next ones are probably the aux clusters and then codfw.

What might be worth testing is if PMTUD works.

i.e. send a 1500-byte UDP packet to a POD IP with the do-not-fragment bit set. The host should generate a 'packet too big' ICMP back to the source when it finds it can't route it over the veth to the POD, i.e. the ICMP-based path-mtu-discovery mechanism will work. Most things are TCP, so it's very much an edge case, but it's best if this works too.

I did run the simple one

deploy1003:~# ping -M do -s 1433 10.64.65.59
PING 10.64.65.59 (10.64.65.59) 1433(1461) bytes of data.
ping: local error: message too long, mtu=1460

And ofc the 1432 (1432 + 8 for the ICMP header + 20 for the IP header = 1460) works fine,

so PMTUD itself works quite fine.
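The payload arithmetic behind those ping results, as a short sketch (header sizes are the standard ones):

```python
# ping payload limit for the 1460 pod MTU: the MTU must carry the
# payload plus the 8-byte ICMP header plus the 20-byte IPv4 header.
MTU = 1460
ICMP_HDR = 8
IPV4_HDR = 20

max_payload = MTU - ICMP_HDR - IPV4_HDR
print(max_payload)  # 1432: so `-s 1432` fits while `-s 1433` is rejected
```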

This is also corroborated by an nmap TCP scan with the path-mtu.nse script (port 4002 is the one open for that workload)

deploy1003:~$ sudo nmap -p 4002 10.64.65.59 --script /usr/share/nmap/scripts/path-mtu.nse
Starting Nmap 7.80 ( https://nmap.org ) at 2025-05-22 14:55 UTC
Nmap scan report for 10-64-65-59.cxserver-staging-tls-service.cxserver.svc.cluster.local (10.64.65.59)
Host is up (0.00021s latency).

PORT     STATE SERVICE
4002/tcp open  mlchat-proxy

Host script results:
|_path-mtu: PMTU == 1460

Nmap done: 1 IP address (1 host up) scanned in 0.43 seconds

This nmap script unfortunately doesn't work with UDP (at least judging from output, I decided I didn't want to read the Lua code). So falling back to traceroute

deploy1003:~$ traceroute -FU -p 53 10.64.64.63 1461
traceroute to 10.64.64.63 (10.64.64.63), 30 hops max, 1461 byte packets
 1  ae2-1018.cr1-eqiad.wikimedia.org (10.64.16.2)  1.312 ms  1.266 ms  1.330 ms
 2  kubestage1006.eqiad.wmnet (10.64.0.218)  0.365 ms * *
 3  kubestage1006.eqiad.wmnet (10.64.0.218)  0.360 ms !F-1460  0.366 ms !F-1460  0.360 ms !F-1460

and traceroute's --mtu option apparently works as well (if the above wasn't enough)

deploy1003:~$ traceroute -U --mtu -p 53 10.64.64.63
traceroute to 10.64.64.63 (10.64.64.63), 30 hops max, 65000 byte packets
 1  ae2-1018.cr1-eqiad.wikimedia.org (10.64.16.2)  0.288 ms F=1500  0.253 ms  0.250 ms
 2  kubestage1006.eqiad.wmnet (10.64.0.218)  0.219 ms  0.186 ms  0.181 ms
 3  10-64-64-63.kube-dns.kube-system.svc.cluster.local (10.64.64.63)  0.359 ms F=1460  0.369 ms  0.242 ms
deploy1003:~$ traceroute --mtu -p 4002 10.64.65.59
traceroute to 10.64.65.59 (10.64.65.59), 30 hops max, 65000 byte packets
 1  ae2-1018.cr1-eqiad.wikimedia.org (10.64.16.2)  0.546 ms F=1500  0.505 ms  1.089 ms
 2  kubestage1003.eqiad.wmnet (10.64.16.55)  0.253 ms  0.186 ms  0.174 ms
 3  * F=1460 *

Not sure what else to run tbh.

Thanks @akosiaris that's great. TIL about that nmap script, that's really useful. Also the traceroute mtu flag :)

FWIW I repeated it while running a tcpdump, pcap is here:

Packet 11 there is what's important. Packet 10 was the client sending 1500 bytes, which won't make it; the important thing is that the ICMP comes back to alert the client to that fact. Even though the test is with TCP, this mechanism works for IP in general, so it will be valid for UDP, ICMP or anything else. I think we can be confident things are ok.

After having to deal a bit with a staging-eqiad calico upgrade yesterday, I did find 1 thing that will break. This is a bit complex:

  • We have a set of calico GlobalNetworkPolicies applied. They do basic stuff like allowing ICMP, allow DNS, allow pods to egress to any other pod (ingress rules are still needed on the workload side).
  • There is as well some WMF-specific stuff, like access to the MediaWiki API, RESTBase and the urldownloaders. Those are there to centralize egress to those services, freeing deployers from having to do it themselves. These are unimportant to the rest of the story.
  • Applying any kind of network policy to a workload, either via a calico GlobalNetworkPolicy or a k8s NetworkPolicy, is an allowlist with a default-deny approach.
  • All of the above are applied on all namespaces except kube-system. That namespace doesn't have any kind of GlobalNetworkPolicy applied, thus being the only one where we have default-allow. That means that any new workload in kube-system, unlike new workloads in other namespaces, is free to talk to anything and receive traffic from anywhere. This is fine; workload spawning in kube-system is well guarded and only SREs can do it.
  • However, there are a couple of workloads in kube-system, namely CoreDNS, eventrouter and helm-state-metrics, that have their own NetworkPolicy. Those trigger the default-deny approach. Unfortunately k8s NetworkPolicy resources do not allow specifying ICMP, meaning that ICMP traffic to e.g. coredns gets dropped. Yesterday I identified that and temporarily edited the GlobalNetworkPolicy for my tests, applying the allow-all-icmp one to kube-system as well. I was planning to properly deploy the change today, but... this effectively switched kube-system from default-allow to default-deny for all workloads (that are not sharing the network namespace of the host). After running the tests above, I decided to undo my change by helmfile syncing the latest version of the calico chart. This broke because one of the workloads, calico-kube-controllers, could no longer talk to anything due to the switch to default-deny. I did not anticipate this and did not notice the failure due to it being staging.
  • Today I put 2+2 together and fixed it after it was reported.

However, that puts us in an interesting conundrum. PMTU for DNS (whether it's over TCP or UDP) is currently broken because ICMP is blocked.

How could this cause issues? Overall, DNS requests and responses are small. For requests and responses inside the cluster, the MTU is 1460 across all pods, so we shouldn't notice anything. Packets from clients are going to be honoring it, responses from CoreDNS are going to be honoring it; we should be OK.

For requests originating outside the cluster, however, we might see issues in the future. Not if the response is too big, as the CoreDNS pod has a 1460 MTU, which is smaller than the 1500 default. Requests to the pod are also gonna be fragmented in IPv4, so probably no issue there either? But not IPv6, which we don't currently have. We might see difficult-to-debug issues there?

FWIW, I am not keen on switching the entire kube-system namespace to default-deny; we've been functioning with default-allow for quite some time now.

Hmm ok. Yes the big problem with pmtud is that it relies on ICMP. This is why it regularly doesn't work across the internet due to people dropping ICMP. I'd assumed this wasn't a problem for us as it was all internal.

But like you say, most DNS packets are small. The only places I am aware of where packets can get big are zone transfers or responses containing DNSSEC signatures, keys etc. Neither of which we have here, I think.

Requests to the pod are also gonna be fragmented in IPv4, so probably no issue there either?

Indeed, though many hosts will never send a query of more than 512 bytes in UDP. In theory I think resolvers supporting EDNS (for instance with option edns0 in /etc/resolv.conf) could send larger queries; however, the maximum size of a DNS name is still 255 bytes, so I don't think there is any valid query you could construct longer than 512 bytes.

Responses in EDNS0 are limited to the 'udp message size' the server is configured for. If a response is larger than that the server returns a response with the 'TC bit' set, signalling to the client to retry the query over TCP. It is better that this udp message size is smaller than our MTU, which means the server won't generate larger packets and then fragment them at the IP layer, but instead will force TCP fallback. We should set this to 1280 or similar I think.
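A toy sketch of that TC-bit behaviour (the 1280-byte limit mirrors the suggestion above; real servers implement this inside their DNS engine, this only models the decision):

```python
# Decision logic of EDNS0 UDP size handling: responses larger than the
# server's configured UDP message size are truncated with the TC bit set,
# signalling the client to retry over TCP. 1280 is the suggested limit.
SERVER_UDP_SIZE = 1280

def udp_response(response_len: int) -> str:
    """Return how a DNS server would deliver a response of this size."""
    if response_len <= SERVER_UDP_SIZE:
        return "full answer over UDP"
    return "TC bit set, client retries over TCP"

print(udp_response(300))   # full answer over UDP
print(udp_response(1400))  # TC bit set, client retries over TCP
```

Keeping the limit below the MTU means the server never emits UDP packets that need IP-layer fragmentation; oversized answers fall back to TCP instead.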

Overall I think it should be ok, but might be worth running it past Brandon too.

@akosiaris a quick question about this:

meaning that ICMP traffic to e.g. coredns gets dropped

In terms of pmtud that means that if coredns sends large UDP packets - which get dropped elsewhere - it won't get the ICMP "packet too big" messages back. But that is not really a worry. The CoreDNS PODs have a lower MTU than pretty much everything on the network, they are not going to send packets that are too large for anything else.

Are the physical K8s hosts blocked from sending ICMP? So for instance if a 1500-byte UDP packet was sent to a pod IP - and couldn't get there because we have reduced the MTU on the veth interface connecting the POD - can the host send an ICMP back to the client?

@akosiaris a quick question about this:

meaning that ICMP traffic to e.g. coredns gets dropped

In terms of pmtud that means that if coredns sends large UDP packets - which get dropped elsewhere - it won't get the ICMP "packet too big" messages back. But that is not really a worry. The CoreDNS PODs have a lower MTU than pretty much everything on the network, they are not going to send packets that are too large for anything else.

Agreed on this.

Are the physical K8s hosts blocked from sending ICMP? So for instance if a 1500-byte UDP packet was sent to a pod IP - and couldn't get there because we have reduced the MTU on the veth interface connecting the POD - can the host send an ICMP back to the client?

No, they are not blocked. Indeed the host could send that ICMP back instead. I didn't see that happening in my tests, however I also didn't specifically try this out, we can test that.

@akosiaris thanks for confirming. So overall my thinking is:

  • Path MTU discovery should work for hosts sending traffic to PODs, which is the direction we could have problems
  • DNS queries to coredns will be less than 512 bytes anyway
  • Most traffic is TCP and the MSS the PODs send will ensure packets aren't too big anyway (we don't rely on pmtud)

Indeed the host could send that ICMP back instead. I didn't see that happening in my tests, however I also didn't specifically try this out, we can test that.

Might be worth testing to confirm our thinking yes.

Change #1155543 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] aux-k8s: Switch MTU to 1460

https://gerrit.wikimedia.org/r/1155543

Change #1155543 merged by Alexandros Kosiaris:

[operations/puppet@production] aux-k8s: Switch MTU to 1460

https://gerrit.wikimedia.org/r/1155543

I've gone ahead and switched all of aux-k8s to MTU 1460. This time around, I went for a more hands-off approach, namely:

  • I did NOT restart calico-node
  • I let puppet converge on most nodes on its own
  • Piggybacked on a deployment to verify the MTU was applied

And it was. It's also gradually being applied as pods get restarted (for whatever reason, including deployments)

Leaving this around for a couple of weeks. If nothing croaks, the next step is to apply it everywhere.

Change #1166763 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Kubernetes: Switch MTU for all clusters to 1460

https://gerrit.wikimedia.org/r/1166763

Change #1166763 merged by Alexandros Kosiaris:

[operations/puppet@production] Kubernetes: Switch MTU for all clusters to 1460

https://gerrit.wikimedia.org/r/1166763

All kubernetes clusters are now configured to use MTU 1460. This will take some time (weeks) to fully propagate, as this requires a pod restart. Deployments, node maintenance, evictions and other events that end up restarting or rescheduling pods will trigger it. In a few weeks we should be in a position to look at the few pods left hanging and manually restart those.

@akosiaris I think we could start considering enabling inbound IPIP traffic on the staging environment; deploying IPIP interfaces (assuming you'll be using the regular kernel networking stack and not some eBPF "magic") shouldn't affect the ability to handle non-encapsulated traffic.

As soon as IPIP encapsulated traffic is handled, we can validate that it's working as expected without impacting the traffic coming from the load balancers. We used the sre.loadbalancer.migrate-service-ipip cookbook to perform this validation for T373020 (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/loadbalancer/migrate-service-ipip.py#182) and we could use a similar one here.

@akosiaris I think we could start considering enabling inbound IPIP traffic on the staging environment; deploying IPIP interfaces (assuming you'll be using the regular kernel networking stack and not some eBPF "magic") shouldn't affect the ability to handle non-encapsulated traffic.

As soon as IPIP encapsulated traffic is handled, we can validate that it's working as expected without impacting the traffic coming from the load balancers. We used the sre.loadbalancer.migrate-service-ipip cookbook to perform this validation for T373020 (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/loadbalancer/migrate-service-ipip.py#182) and we could use a similar one here.

Yup, scheduling it for the weeks of either August 11th or August 18th.

Yup, scheduling it for the weeks of either August 11th or August 18th.

gentle ping, do you need something from my side?

Yup, scheduling it for the weeks of either August 11th or August 18th.

gentle ping, do you need something from my side?

Maybe a way to verify everything is working fine? What's the way to go about this?

You have that available as part of the sre.loadbalancer.migrate-service-ipip cookbook, at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/loadbalancer/migrate-service-ipip.py#131:

# Excerpt from the cookbook; it relies on these imports:
#   from socket import AF_INET, SOCK_STREAM, socket
#   from scapy.all import IP, TCP, sr1
def _ipip_traffic_accepted(self, *,
                           outer_src_ip: str, outer_dst_ip: str,
                           inner_src_ip: str, inner_dst_ip: str,
                           dport: int) -> bool:
    """Send a single SYN packet using IPIP encapsulation"""
    # Bind a kernel socket to obtain a free local port for the probe
    s = socket(AF_INET, SOCK_STREAM)
    s.bind((inner_src_ip, 0))
    sport = s.getsockname()[1]
    syn_packet = (
        IP(src=outer_src_ip, dst=outer_dst_ip) /  # outer (encapsulation) header
        IP(src=inner_src_ip, dst=inner_dst_ip) /  # inner header, as LVS would build it
        TCP(sport=sport, dport=dport, flags="S", seq=1000)
    )
    response = sr1(syn_packet, timeout=3, verbose=self.dry_run)
    s.close()
    return response is not None
cmooney mentioned this in Unknown Object (Task).Aug 20 2025, 7:38 PM

Hi @akosiaris: Following up on this after a discussion during Traffic's planning with @Vgutierrez, and on behalf of the team.

We were curious to know when you would be able to take this on, with the understanding that things are busy and we don't expect it to happen immediately. From Traffic's end, we have decided to triage the rollout for Q3 rather than Q2, given that it is a short quarter and we are unlikely to roll out such a big change, which will affect the core sites, before December. (Liberica is already running on all PoPs.) So that's at least our position.

Does that seem fine to you and the planning for Serviceops? Do note that, as per @Vgutierrez's last comment, we already have a check in place in the migration cookbook, so you do not have to worry about that.

Thanks!

Hi @akosiaris: Following up on this after a discussion during Traffic's planning with @Vgutierrez, and on behalf of the team.

We were curious to know when you would be able to take this on, with the understanding that things are busy and we don't expect it to happen immediately. From Traffic's end, we have decided to triage the rollout for Q3 rather than Q2, given that it is a short quarter and we are unlikely to roll out such a big change, which will affect the core sites, before December. (Liberica is already running on all PoPs.) So that's at least our position.

Does that seem fine to you and the planning for Serviceops? Do note that, as per @Vgutierrez's last comment, we already have a check in place in the migration cookbook, so you do not have to worry about that.

Thanks!

Hey, sorry for not replying sooner, this fell through the cracks. As far as I know, ServiceOps is fully booked for Q2 anyway, so it sounds improbable that anything regarding this can be scheduled before Q3. So overall, yes, this sounds ok for now.

Thanks @akosiaris, that sounds good. We would like to get this done in Q3 to resolve this blocker and to deploy Liberica everywhere, so please do factor that in for your planning. Thank you!

JMeybohm mentioned this in Unknown Object (Task).Nov 7 2025, 8:42 AM

Change #1228582 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] base::sysctl: Allow more finegrained rp_filter behavior

https://gerrit.wikimedia.org/r/1228582

Change #1228583 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] base::sysctl: Switch priority of the ubuntu-defaults stanza

https://gerrit.wikimedia.org/r/1228583

Summarizing the current state and our recent discussion about this:

  • All calico (Pod) interfaces have the MTU set to 1460
  • Bookworm k8s workers have net.ipv4.conf.{all,default,cali*}.rp_filter=1
  • Status quo for IPIP enabled hosts (non k8s) is to set net.ipv4.conf.{all,default}.rp_filter=0 via modules/base/manifests/sysctl.pp and explicitly set net.ipv4.conf.{ipip0,ipip60}.rp_filter=0 via modules/profile/manifests/lvs/realserver/ipip.pp.
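
For reference, the status quo described above could be expressed as a single sysctl.d-style fragment along these lines (the file path and combined layout are illustrative; the real settings are split between modules/base/manifests/sysctl.pp and modules/profile/manifests/lvs/realserver/ipip.pp). Note that systemd-sysctl reads /etc/sysctl.d/, /run/sysctl.d/ and /usr/lib/sysctl.d/, with same-named files in /etc taking precedence, which is relevant to the override issue on trixie.

```
# Illustrative only: the combined effect of the current Puppet-managed
# settings on IPIP-enabled (non-k8s) hosts, written as one sysctl.d fragment.
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.ipip0.rp_filter = 0
net.ipv4.conf.ipip60.rp_filter = 0
```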

During that discussion we realized that trixie hosts currently use different values and that our settings from Puppet no longer apply, as they get overridden by /usr/lib/sysctl.d/50-default.conf.

It's not clear why net.ipv4.conf.default.rp_filter would need to change to 0 for IPIP, and we would like not to do that on k8s nodes, to prevent IP spoofing from inside containers (although that would require CAP_NET_ADMIN). If we could keep net.ipv4.conf.default.rp_filter=1, the dynamically created calico interfaces would still inherit that setting, while net.ipv4.conf.all.rp_filter=0 would allow the ipip interfaces to also have rp_filter=0 set.

This also means the change required is far less invasive, and we can probably get away with basic testing of the IPIP encapsulation (T352956#11086373) plus some additional workload tests. The latter could be done on a node that is fenced off from regular workload and that is not an LVS target (i.e. not part of the kubesvc cluster in conftool-data/nodes/{codfw,eqiad}.yaml).

It's not clear why net.ipv4.conf.default.rp_filter would need to change to 0 for IPIP, and we would like not to do that on k8s nodes, to prevent IP spoofing from inside containers (although that would require CAP_NET_ADMIN). If we could keep net.ipv4.conf.default.rp_filter=1, the dynamically created calico interfaces would still inherit that setting, while net.ipv4.conf.all.rp_filter=0 would allow the ipip interfaces to also have rp_filter=0 set.

We’ve been overly conservative and disabled rp_filter too broadly when enabling IPIP encapsulation.
In practice, most realservers have a single NIC, so keeping rp_filter enabled provides limited additional protection in that context (this explicitly does not apply to Kubernetes nodes).
From a technical point of view, it should be sufficient to disable rp_filter only on the ipip0 interface. There is no need to disable it on ipip60 as the Linux kernel does not implement rp_filter for IPv6.
This would allow us to keep net.ipv4.conf.default.rp_filter=1, ensuring that dynamically created interfaces (such as Calico interfaces on k8s nodes) inherit a safe default, while selectively relaxing the setting where IPIP requires it.
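
This follows from the kernel rule quoted earlier in this task: the effective rp_filter used when source-validating traffic on an interface is the maximum of conf/all/rp_filter and conf/{interface}/rp_filter. A tiny shell sketch of that rule (the helper function is hypothetical, purely to illustrate the arithmetic):

```shell
# Hypothetical helper mirroring the kernel's source-validation rule:
# effective rp_filter = max(conf/all/rp_filter, conf/<iface>/rp_filter).
effective_rp_filter() {
  local all=$1 iface=$2
  if [ "$all" -gt "$iface" ]; then echo "$all"; else echo "$iface"; fi
}

# With net.ipv4.conf.all.rp_filter=0 and net.ipv4.conf.default.rp_filter=1:
effective_rp_filter 0 1   # calico interface inheriting default=1 -> 1 (strict)
effective_rp_filter 0 0   # ipip0 explicitly set to 0          -> 0 (disabled)
```

So keeping default=1 protects dynamically created interfaces, while it is the explicit per-interface 0 on ipip0, together with all=0, that actually disables the check where IPIP needs it.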

Summarizing from notes during an informal SRE summit session.

  • We've progressed a bit; there's already a number of patches uploaded and ready to merge.
  • We'll need to patch profile::lvs::realserver::ipip to not install the eBPF MSS clamper, as we don't need it for kubernetes low-traffic clusters. We can also probably skip the ipip60 interface given that low-traffic doesn't support IPv6 (this might be counter-productive if there are plans to adopt IPv6 for kubernetes based services).
  • We'll need a patch to apply profile::lvs::realserver::ipip to the staging kubernetes cluster role.
  • Given that IPIP is enabled per service, we'll need to enable it for an entire service, so k8s-ingress-staging is the one to go for.
  • In wikikube, regardless of the fact that, conftool-wise, all services are "backed" by kubesvc, we will be migrating service by service. It's IP+PORT based.
    • Technically the only thing needed is in the service catalog: adding the ipip encapsulation config option.
    • To facilitate the move to Liberica (and the deprecation of PyBal), we'll need, later on, to move to the mh (Maglev) scheduler in service::catalog. Again, this is decoupled from the previous items.
  • We will have to update the cookbook to perform the migration.
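
The service-catalog side of the steps above might look roughly like this (key names and values are assumptions based on this discussion, not the actual hiera schema; the real option name is whatever the ipip encapsulation config ends up being called):

```
# Hypothetical service::catalog fragment: enabling IPIP encapsulation for
# one service, and (as a later, decoupled step) switching the scheduler
# to mh. Key names are illustrative.
k8s-ingress-staging:
  lvs:
    scheduler: mh            # later step: Maglev hashing, independent of IPIP
    ipip_encapsulation: true # assumed option name
```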

We can also probably skip the ipip60 interface given that low-traffic doesn't support IPv6

Unlike the eBPF MSS clamper, I think the presence of the ipip60 interface on its own shouldn't impact performance? Maybe we should create it anyway in that case: one less thing to change if we do want to support v6 for low-traffic.

Change #1228582 merged by Alexandros Kosiaris:

[operations/puppet@production] base::sysctl: Allow more finegrained rp_filter behavior

https://gerrit.wikimedia.org/r/1228582

Change #1228583 merged by Alexandros Kosiaris:

[operations/puppet@production] base::sysctl: Switch priority of the ubuntu-defaults stanza

https://gerrit.wikimedia.org/r/1228583

Change #1237277 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] liberica: Enable it in staging cluster

https://gerrit.wikimedia.org/r/1237277

Change #1237280 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] k8s-staging: Set ipip_encapsulation in service::catalog

https://gerrit.wikimedia.org/r/1237280

Change #1237467 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] lvs: Allow disabling TCP MSS clamping for IPIP realservers

https://gerrit.wikimedia.org/r/1237467