Investigate IPVS IPIP encapsulation support
Closed, ResolvedPublic
Assigned To

Authored By

	Vgutierrez
	Oct 13 2023, 8:14 AM

Description

As pointed out by @jhathaway, IPVS supports IPIP encapsulation, so PyBaL should be able to benefit from it.

This is interesting because it would let us move from the current IPVS DSR support, which requires L2 connectivity, to IPIP encapsulation, closing the gap between PyBaL and Liberica and effectively reducing the risk of that migration.

Vagrantfile PoC: https://phabricator.wikimedia.org/P52928

how to use it:

$ vagrant up
$ VIP=$(fgrep "vip =" Vagrantfile | cut -f2 -d'"')
$ LB=$(fgrep "lb_ip =" Vagrantfile | cut -f2 -d'"')
$ sudo ip route add $VIP via $LB
$ curl -s -v -o /dev/null $VIP
*   Trying 10.10.10.10:80...
* Connected to 10.10.10.10 (10.10.10.10) port 80 (#0)
> GET / HTTP/1.1
> Host: 10.10.10.10
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Server: nginx/1.22.1
< Date: Fri, 13 Oct 2023 10:56:11 GMT
< Content-Type: text/html
< Content-Length: 615
< Last-Modified: Fri, 13 Oct 2023 10:48:34 GMT
< Connection: keep-alive
< ETag: "65292082-267"
< backend: 192.168.42.100
< Accept-Ranges: bytes
< 
{ [615 bytes data]
* Connection #0 to host 10.10.10.10 left intact

Using tshark on one of the real servers (vagrant ssh backend[01]) shows that requests arrive via ipip0 and responses go back via eth1:

vagrant@bookworm:~$ sudo -i tshark -o tcp.analyze_sequence_numbers:FALSE -i ipip0 -i eth1 -z proto,colinfo,frame.interface_name,frame.interface_name port 80
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ipip0' and 'eth1'
 ** (tshark:3518) 11:06:41.001064 [Main MESSAGE] -- Capture started.
 ** (tshark:3518) 11:06:41.001265 [Main MESSAGE] -- File: "/tmp/wireshark_2_interfacesLZ2QC2.pcapng"
    1 0.000000000 192.168.42.1 → 10.10.10.10  TCP 60 47078 → 80 [SYN] Seq=1618396380 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=4138444511 TSecr=0 WS=128  frame.interface_name == "ipip0"
    2 0.000283971 192.168.42.1 → 10.10.10.10  TCP 52 47078 → 80 [ACK] Seq=1618396381 Ack=1890327363 Win=64256 Len=0 TSval=4138444512 TSecr=274137172  frame.interface_name == "ipip0"
    3 0.000284019 192.168.42.1 → 10.10.10.10  HTTP 127 GET / HTTP/1.1   frame.interface_name == "ipip0"
    4 0.000697318 192.168.42.1 → 10.10.10.10  TCP 52 47078 → 80 [ACK] Seq=1618396456 Ack=1890328241 Win=64128 Len=0 TSval=4138444512 TSecr=274137173  frame.interface_name == "ipip0"
    5 0.000713903 192.168.42.1 → 10.10.10.10  TCP 52 47078 → 80 [FIN, ACK] Seq=1618396456 Ack=1890328241 Win=64128 Len=0 TSval=4138444512 TSecr=274137173  frame.interface_name == "ipip0"
    6 0.000833913 192.168.42.1 → 10.10.10.10  TCP 52 47078 → 80 [ACK] Seq=1618396457 Ack=1890328242 Win=64128 Len=0 TSval=4138444513 TSecr=274137173  frame.interface_name == "ipip0"
    7 0.000032012  10.10.10.10 → 192.168.42.1 TCP 74 80 → 47078 [SYN, ACK] Seq=1890327362 Ack=1618396381 Win=65160 Len=0 MSS=1460 SACK_PERM TSval=274137172 TSecr=4138444511 WS=64  frame.interface_name == "eth1"
    8 0.000311346  10.10.10.10 → 192.168.42.1 TCP 66 80 → 47078 [ACK] Seq=1890327363 Ack=1618396456 Win=65088 Len=0 TSval=274137173 TSecr=4138444512  frame.interface_name == "eth1"
    9 0.000440283  10.10.10.10 → 192.168.42.1 HTTP 944 HTTP/1.1 200 OK  (text/html)  frame.interface_name == "eth1"
   10 0.000727290  10.10.10.10 → 192.168.42.1 TCP 66 80 → 47078 [FIN, ACK] Seq=1890328241 Ack=1618396457 Win=65088 Len=0 TSval=274137173 TSecr=4138444512  frame.interface_name == "eth1"

By default ipip0 gets configured with MTU 1480 and eth1 with MTU 1500. If we use curl to trigger a request bigger than the MTU, we can see how fragmentation happens and is handled:

$ curl -H "Foo: $(python3 -c 'print(chr(0x42)*1600)')" 10.10.10.10 -v -o /dev/null -s
*   Trying 10.10.10.10:80...
* Connected to 10.10.10.10 (10.10.10.10) port 80 (#0)
> GET / HTTP/1.1
> Host: 10.10.10.10
> User-Agent: curl/7.88.1
> Accept: */*
> Foo: B[x1600, you get the idea]
> 
< HTTP/1.1 200 OK
< Server: nginx/1.22.1
< Date: Fri, 13 Oct 2023 12:37:30 GMT
< Content-Type: text/html
< Content-Length: 615
< Last-Modified: Fri, 13 Oct 2023 10:48:34 GMT
< Connection: keep-alive
< ETag: "65292082-267"
< backend: 192.168.42.100
< Accept-Ranges: bytes
< 
{ [615 bytes data]
* Connection #0 to host 10.10.10.10 left intact

tshark shows the fragmentation as expected (after disabling the segmentation offload with ethtool):

13 315.823694565 192.168.42.1 → 10.10.10.10  TCP 60 36532 → 80 [SYN] Seq=1816054424 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=4143892384 TSecr=0 WS=128
14 315.823889655 192.168.42.1 → 10.10.10.10  TCP 52 36532 → 80 [ACK] Seq=1816054425 Ack=3606676205 Win=64256 Len=0 TSval=4143892384 TSecr=279585060
15 315.823916323 192.168.42.1 → 10.10.10.10  TCP 1500 GET / HTTP/1.1  [TCP segment of a reassembled PDU]
16 315.823916350 192.168.42.1 → 10.10.10.10  HTTP 286 GET / HTTP/1.1 
17 315.824210855 192.168.42.1 → 10.10.10.10  TCP 52 36532 → 80 [ACK] Seq=1816056107 Ack=3606677083 Win=64128 Len=0 TSval=4143892385 TSecr=279585061
18 315.824301302 192.168.42.1 → 10.10.10.10  TCP 52 36532 → 80 [FIN, ACK] Seq=1816056107 Ack=3606677083 Win=64128 Len=0 TSval=4143892385 TSecr=279585061
19 315.824361163 192.168.42.1 → 10.10.10.10  TCP 52 36532 → 80 [ACK] Seq=1816056108 Ack=3606677084 Win=64128 Len=0 TSval=4143892385 TSecr=279585061
20 315.823438198 192.168.42.1 → 10.10.10.10  TCP 74 36532 → 80 [SYN] Seq=1816054424 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=4143892384 TSecr=0 WS=128
21 315.823726231  10.10.10.10 → 192.168.42.1 TCP 74 80 → 36532 [SYN, ACK] Seq=3606676204 Ack=1816054425 Win=65160 Len=0 MSS=1460 SACK_PERM TSval=279585060 TSecr=4143892384 WS=64
22 315.823996195  10.10.10.10 → 192.168.42.1 TCP 66 80 → 36532 [ACK] Seq=3606676205 Ack=1816055873 Win=63744 Len=0 TSval=279585061 TSecr=4143892384
23 315.824004052  10.10.10.10 → 192.168.42.1 TCP 66 80 → 36532 [ACK] Seq=3606676205 Ack=1816056107 Win=63552 Len=0 TSval=279585061 TSecr=4143892384
24 315.824103640  10.10.10.10 → 192.168.42.1 HTTP 944 HTTP/1.1 200 OK  (text/html)
25 315.824315534  10.10.10.10 → 192.168.42.1 TCP 66 80 → 36532 [FIN, ACK] Seq=3606677083 Ack=1816056108 Win=64128 Len=0 TSval=279585061 TSecr=4143892385

Details

Subject	Repo	Branch	Lines +/-
pybal: do not install from component	operations/puppet	production	+1 -6
Release 1.15.14	operations/debs/pybal	1.15-stretch	+6 -0
Add support for IPIP encapsulation	operations/debs/pybal	1.15-stretch	+66 -15


	Title	Reference	Author	Source Branch	Dest Branch
	Release 1.15.14	repos/sre/pybal!3	vgutierrez	release-bullseye	bullseye-wikimedia
	Release 1.15.14	repos/sre/pybal!1	vgutierrez	release-1.14	main


Related Objects

Status	Assigned	Task
In Progress	Vgutierrez	T332027 Replace current L4LB with with Katran-based alternative
Resolved	Vgutierrez	T348837 Investigate IPVS IPIP encapsulation support
Resolved	Vgutierrez	T351069 Enable IPIP encapsulation for ncredir
Resolved	Vgutierrez	T352143 Firewall rules prevent IPIP/IP6IP6 encapsulated traffic from reaching realservers
Resolved	Vgutierrez	T352160 RP filtering drops requests incoming via IPIP tunnels on ncredir realservers

Event Timeline

Vgutierrez created this task.Oct 13 2023, 8:14 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 13 2023, 8:14 AM

Vgutierrez triaged this task as Medium priority.Oct 13 2023, 8:15 AM

Vgutierrez moved this task from Backlog to Traffic team actively servicing on the Traffic board.

Vgutierrez added a parent task: T332027: Replace current L4LB with with Katran-based alternative.

ayounsi updated the task description. (Show Details)Oct 13 2023, 8:19 AM

Maintenance_bot added a project: SRE.Oct 13 2023, 8:29 AM

Fabfur subscribed.Oct 13 2023, 9:14 AM

Alternative to consider: injecting REDIRECTs for traffic meant for a VIP. See the second section at http://www.linuxvirtualserver.org/docs/arp.html. I haven't tested it and it requires some sort of Netfilter implementation on the realservers, but it avoids MTU-related issues (when tunneling traffic). Never mind, the ARP problem is already solved at Wikimedia by not announcing ARP. MTU is a challenge when using any type of encapsulation (in this case IPIP), but that's a different issue :)

Vgutierrez updated the task description. (Show Details)Oct 13 2023, 12:39 PM

In T348837#9249192, @Southparkfan wrote:

Alternative to consider: injecting REDIRECTs for traffic meant for a VIP. See the second section at http://www.linuxvirtualserver.org/docs/arp.html. I haven't tested it and it requires some sort of Netfilter implementation on the realservers, but it avoids MTU-related issues (when tunneling traffic).

regarding ARP we already handle this with:

$ cat /usr/share/wikimedia-lvs-realserver/sysctl.conf 
# Ignore and do not announce ARP for the LVS service IP
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2

Regarding MTU-related issues, we are analyzing the best approach to avoid relying on IP fragmentation.

@Vgutierrez thanks for opening this ticket and investigating IPIP support in IPVS. Another alternative would be GUE encapsulation, which is also supported by Katran. Evidently UDP encapsulation may have performance benefits because routers are tuned to support it; the patch for Foo-over-UDP, which is similar to GUE, posted some performance numbers: https://lwn.net/Articles/614433/.

Change 965763 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/debs/pybal@1.15-stretch] Add support for IPIP encapsulation

https://gerrit.wikimedia.org/r/965763

gerritbot added a project: Patch-For-Review.Oct 13 2023, 3:11 PM

In T348837#9250190, @jhathaway wrote:

[...] Evidently UDP encapsulation may have performance benefits because routers are tuned to support it [...]

At the endpoints, there might be performance differences between encapsulation protocols (especially if there is encryption, or if they're using older libraries, etc.).

In between (in the network) encapsulated packets are forwarded/routed at the same speed. There is one difference though regarding load balancing.

IP packets encapsulated in UDP have more outer headers fields (especially src/dst ports) to use in the router/switches ECMP hashing algorithm than IP in IP.

We want the packets of a given flow (e.g. a client HTTP GET) to be routed through the same path between the LVS and the realserver, since different paths could cause packets to arrive in the wrong order. But we want multiple sessions between the same LVS and realserver to be spread over as many links as possible.
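As a rough illustration of the hashing difference described above, here is a toy sketch (not how any particular router hashes). The addresses, the number of links, the UDP destination port and the way the outer source port is derived are all assumptions made for the example.

```python
import hashlib
import zlib
from collections import Counter

N_LINKS = 4                                   # hypothetical number of equal-cost links
LB, REALSERVER = "192.0.2.1", "192.0.2.100"   # hypothetical tunnel endpoints
GUE_PORT = 6080                               # assumed UDP destination port


def ecmp_link(outer_fields, n_links=N_LINKS):
    """Toy ECMP: hash the outer header fields a router can see, pick a link."""
    digest = hashlib.sha256(repr(outer_fields).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def udp_source_port(flow):
    """Derive a stable outer UDP source port from the inner flow (assumed scheme)."""
    return 49152 + zlib.crc32(repr(flow).encode()) % 16384


# 1000 client flows: (client IP, client port, VIP, VIP port)
flows = [("198.51.100.7", 10000 + i, "10.10.10.10", 80) for i in range(1000)]

# IPIP: every flow presents the same outer (src, dst, protocol 4) tuple.
ipip_links = Counter(ecmp_link((LB, REALSERVER, 4)) for _ in flows)

# UDP encapsulation: the outer source port varies per flow, so the hash input does too.
udp_links = Counter(
    ecmp_link((LB, REALSERVER, 17, udp_source_port(f), GUE_PORT)) for f in flows
)

print("links used with IPIP:", len(ipip_links))
print("links used with UDP encapsulation:", len(udp_links))
```

With IPIP all flows between one LB and one realserver land on a single link, while each individual flow still stays on one path in both cases because the hash input is stable per flow.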

It's theoretically possible for a router to start digging and get data from the inner packet, which Juniper does for other rare protocols but not for regular IP. Some protocols also have additional headers that could be used for that (eg. IPv6 flow label, tunnel-session-identifier).

In our infrastructure though it's not too much of an issue as:

Most inbound flows are not large (the direct return part is the larger)
Real servers are numerous and spread over the infra (destination IP entropy)
Liberica will have multiple active/active LBs (more source IP entropy)
We have enough capacity on the infra links

Some more doc:

Regarding the UDP encapsulation it's an interesting idea, and is a reminder that currently our switches distribute flows based on source internet IP, which gives us lots of entropy. With IPIP this is seriously diminished, we'd need a sufficiently large number of L4LB's and realservers to get a good traffic balance.

Either way, katran/liberica uses IPIP, so having the option for GUE in IPVS doesn't solve that problem if we hit it. I think we can probably stick with IPIP for that reason.

In T348837#9249898, @Vgutierrez wrote:

regarding MTU related issues we are analyzing the best approach for us to avoid relying on IP fragmentation

It would be great if we could make use of the fact that we support jumbo frames across the network. But it's non-trivial to change all realservers, K8s and Ganeti hosts to support them. In that scenario you also need to ensure the servers don't send oversized packets back to internet hosts. There are a few ways we could look at doing that. One option is to configure the server interface for jumbo MTU, and then set a different MTU on individual routes, for instance:

ip route add default via x.x.x.x mtu 1500
ip route add 10.0.0.0/8 via x.x.x.x mtu 9000

Alternately we could leave interface MTUs at 1500, and restrict/rewrite the TCP MSS the L4LB/realservers send, so clients never send us packets larger than 1480.

Regarding MTU: we MUST NOT need to fragment any v4 packet, and we MUST reduce the need for IPv6 PMTUD as much as possible.

There are 3 main options:

1/ increase the MTU on all the relevant hosts (for example to 9000) as Cathal mentioned
The risk/downside here is that we would have to start chasing many edge cases:

For internet traffic (which requires static MTU routes to be defined for all our subnets)
For routes learned over BGP (e.g. k8s, even though only until all our infra is on the new network design)

The advantages are mostly for bulk internal traffic. For example, internal hosts talking to each other would benefit from jumbo frames (through a VIP or not).

2/ Decrease the MSS on the realservers (any host where a tunnel can terminate)
In a TCP handshake each side tells its peer what its MSS is. Here the two sides are, for example, the user somewhere on the internet and the realserver.

So we can lower the MSS value sent with handshakes to internet peers so they never send packets too large for our tunnels.

This can be done with iptables, or with an ip command similar to the one above:
ip route add default via x.x.x.x advmss 1436

Note that both can also be combined: if we set an MTU of 9000, we can lower the MSS only for internet traffic and keep a larger one internally (which requires static MSS routes to be defined for all our subnets):
ip route add 10.0.0.0/8 via x.x.x.x advmss 8960
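The MSS figures in this discussion follow from simple header arithmetic. The sketch below assumes option-less IPv4 (20 bytes), IPv6 (40 bytes) and TCP (20 bytes) headers; the function name and the overhead values passed to it are illustrative, and the thread itself does not say which tunnel type the 1436 value was chosen for.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed header only
TCP_HEADER = 20   # bytes, assuming no TCP options


def advertised_mss(mtu, tunnel_overhead=0, ip_header=IPV4_HEADER):
    """Largest TCP payload that fits in one packet once tunnel overhead is reserved."""
    return mtu - tunnel_overhead - ip_header - TCP_HEADER


# Plain 1500-byte MTU: the MSS=1460 seen in the SYN packets of the captures.
print(advertised_mss(1500))                   # 1460
# Jumbo frames: the advmss 8960 used for the internal route.
print(advertised_mss(9000))                   # 8960
# IPv4 in IPv4: one extra 20-byte outer header.
print(advertised_mss(1500, IPV4_HEADER))      # 1440
# IPv6 in IPv6: one extra 40-byte outer header.
print(advertised_mss(1500, IPV6_HEADER, IPV6_HEADER))  # 1400
# The 1436 value corresponds to reserving 24 bytes of overhead.
print(advertised_mss(1500, 24))               # 1436
```

Any value at or below the IPIP result leaves room for the outer header without fragmentation on a 1500-byte path.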

3/ Rewrite the MSS at the routers
To handle the Cloudflare tunnels, we clamp the MSS value to 1436 at the router by intercepting (MITM) all the outbound SYN/SYN-ACK packets and rewriting their MSS on the fly.
Even though we've been using it for a long time, it is a brittle solution with a potential performance hit. It also locks us into vendor-specific features. Using option 2 we could stop using it.

About IPv6, Cloudflare uses an MTU lower than the default: https://blog.cloudflare.com/increasing-ipv6-mtu/

We choose 1400 because we think it's the next best value to use after 1280. With 1400 we believe 93.2% of IPv6 connections will not need to rely on Path MTU Detection/ICMP. In the near future we plan to increase this value further. We won't settle on 1500 though - we want to leave a couple of bytes for IPv4 encapsulation, to allow the most popular tunnels to keep working without suffering poor latency when Path MTU Detection kicks in.

So my preference here would go to option (2) limit the MSS on the relevant end hosts (instead of MITM it), and maybe, later on, start increasing their MTU once their MSS is not dependent on MTU anymore.

@Vgutierrez as a side note you might be interested by this blogpost https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/

Could we take the opposite approach with the MTU fixup for the tunneling, and arrange the host/interface settings on both sides (the LBs and the target hosts) such that they only use a >1500 MTU on the specific unicast routes for the tunnels, but default to their current 1500 for all other traffic? If per-route MTU can usefully be set higher than base interface MTU, this seems trivial, but even if not, surely with some set of ip commands we could set the iface MTU to the higher value, while clamping it back down to 1500 for all cases except the tunnel.

There are 2 main options:
2/ Decrease the MSS on the realservers (any host where a tunnel can terminate)
In a TCP handshake each side tells its peer what its MSS is, here the two sides are for example the user somewhere on the internet and the realserver.

ip route add default via x.x.x.x advmss 1436

Huh, I did not know that was an option. I agree there are probably fewer moving parts going this way, so I'd lean towards option 2 also.

How to configure it may be a challenge. I'm not sure if we can just add a second default to certain hosts via some post-up script, or if we can add some way (metric=0) so it would be preferred? Otherwise we might need to avoid setting the "gateway" in /etc/network/interfaces so we can instead add the default with iproute2 as a "post-up".

@Vgutierrez as a side note you might be interested by this blogpost https://blog.cloudflare.com/lost-in-transit-debugging-dropped-packets-from-negative-header-lengths/

Reading through this earlier I wasn't sure if the issue they hit with GUE would also apply to IPVS-encapsulated IPIP packets. I think if we are clamping the MSS elsewhere perhaps we won't trigger it. But good to be aware either way and test everything.

In T348837#9253673, @BBlack wrote:

If per-route MTU can usefully be set higher than base interface MTU, this seems trivial,

While you can set MTUs on a route, afaik any frames exceeding interface/driver MTU will still be dropped when they hit the NIC.

but even if not, surely with some set of ip commands we could set the iface MTU to the higher value, while clamping it back down to 1500 for all cases except the tunnel.

The one thing you may not be able to control with mtu/advmss on a route is traffic to the local subnet, as that route is added by the kernel when the IP is added to the interface. Not sure if that can be modified to differ from the interface MTU.

Aside from that, you could certainly have:

ip route add default via x.x.x.x mtu 1500
ip route add 10.0.0.0/8 via x.x.x.x mtu 9000

Or replace the second statement with a specific host route for every configured realserver.

The realservers probably only need a higher interface-mtu (to allow receipt of tunneled packets), and 1500 on a default route (so they don't send any jumbos themselves). I think where this maybe gets more complicated is with Kubernetes, where we learn some routes from switches, and Ganeti, where there is internal routing through interfaces on the host to VM. All probably solvable but we need to weigh up the complexity.

In T348837#9253425, @cmooney wrote:

Regarding the UDP encapsulation it's an interesting idea, and is a reminder that currently our switches distribute flows based on source internet IP, which gives us lots of entropy. With IPIP this is seriously diminished, we'd need a sufficiently large number of L4LB's and realservers to get a good traffic balance.

Please note that Katran mitigates this issue by randomizing the source IP and port of the outer IPIP header (https://github.com/facebookincubator/katran/blob/3556504b7c744ced5e022c6c6fcf4e74160f4774/katran/lib/bpf/pckt_encap.h#L128). By default it uses 172.16/10 for IPv4 and 0100::/64 for IPv6. This is done deterministically per flow, so as long as the client's source IP and port don't change, the randomized IP/port doesn't change either.
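To make the "deterministic per flow" property concrete, here is a minimal sketch of the idea. It is not Katran's actual bit manipulation: the hash function, the `outer_source` helper and the exact prefixes passed to it are assumptions chosen for illustration.

```python
import hashlib
import ipaddress


def outer_source(prefix, client_ip, client_port):
    """Map an inner flow to a stable pseudo-random address inside `prefix`."""
    net = ipaddress.ip_network(prefix)
    key = f"{client_ip}|{client_port}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    # Index into the prefix so the result always stays inside it.
    return net[h % net.num_addresses]


# Hypothetical prefixes; the same flow always maps to the same outer source.
a = outer_source("172.16.0.0/16", "198.51.100.7", 47078)
b = outer_source("172.16.0.0/16", "198.51.100.7", 47078)
c = outer_source("172.16.0.0/16", "198.51.100.7", 36532)
print(a == b)   # True: stable for a given client IP and port
print(a, c)     # a different client port usually yields a different outer source

print(outer_source("100::/64", "2001:db8::1", 47078))
```

Because the outer source address now varies per client flow, routers and NIC receive queues that hash on the outer header regain most of the entropy that plain IPIP hides, without reordering packets within a flow.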

In T348837#9253720, @cmooney wrote:

The one thing you may not be able to control with mtu/advmss on a route is traffic to the local subnet, as that route is added by the kernel when the IP is added to the interface. Not sure if that can be modified to differ from the interface MTU.

Yeah, that's gonna be a thorn to deal with. Maybe it's possible to override it with an explicit route for the subnet that happens later in the up-commands of /etc/network/interfaces?

Aside from that, you could certainly have:
ip route add default via x.x.x.x mtu 1500
ip route add 10.0.0.0/8 via x.x.x.x mtu 9000
Or replace the second statement with a specific host route for every configured realserver.

The realservers probably only need a higher interface-mtu (to allow receipt of tunneled packets), and 1500 on a default route (so they don't send any jumbos themselves).

My worry with leaving a higher MTU in effect for anything other than the tunnel, is it tends to lead to subtle problems with other unrelated traffic. Even in the scenario above (a realserver with raised iface MTU and a 1500 default route) the realserver would be exchanging direct traffic with other hosts on its subnet. TCP can probe for that case and advmss can make it even simpler, but other protocols (UDP for logstash, DNS, etc?) would not.

The other traditional angle of attack on this is to try to upgrade all host interfaces on all of our internal VLANs to the higher MTU (and then maybe use 1500 on default route when there's concern about the direct return to public networks, and then also possibly couple that later with larger MTUs to our own internal subnets as an optimization). However, in the past I've found that trying to upgrade the MTU of all the hosts on a network gets problematic. There's always some edge cases (some hardware device that doesn't support a larger MTU and then can't talk to the icinga server properly, or something doesn't work out ok during the initial bootstrap/imaging of a server before it gets the MTU set right, etc).

I think where this maybe gets more complicated is with Kubernetes, where we learn some routes from switches, and Ganeti, where there is internal routing through interfaces on the host to VM. All probably solvable but we need to weigh up the complexity.

And yeah, then there's all of these kinds of things. If we can't solve the easier cases, we certainly can't solve these :)

BTW This is also the approach recommended by Katran

In T348837#9253591, @ayounsi wrote:

So my preference here would go to option (2) limit the MSS on the relevant end hosts (instead of MITM it), and maybe, later on, start increasing their MTU once their MSS is not dependent on MTU anymore.

jbond subscribed.Oct 16 2023, 2:38 PM

One potential issue with relying solely on MSS reduction is that, obviously, it only affects TCP. For now this is fine, as long as we're only using LVS (or future liberica) for TCP traffic (I think that's currently the case for LVS anyways!), but we could add UDP-based things in the future (e.g. DNS and QUIC/HTTP3), at which point we'll have to solve these problems differently.

for QUIC there are ongoing efforts like https://datatracker.ietf.org/doc/draft-pskim-passive-probing-pmtud/

In T348837#9253425, @cmooney wrote:

Either way, katran/liberica uses IPIP, so having the option for GUE in IPVS doesn't solve that problem if we hit it. I think we can probably stick with IPIP for that reason.

Though not documented, Katran supports GUE as well, https://github.com/facebookincubator/katran/commit/74c3338c2f7ea4d305e2f9440a668d4454643235. That said, I don't have a great deal of knowledge of the trade offs when choosing an encapsulation method, thanks @cmooney and @ayounsi for your detailed replies which furthered my understanding.

In T348837#9254127, @BBlack wrote:

One potential issue with relying solely on MSS reduction is that, obviously, it only affects TCP. For now this is fine, as long as we're only using LVS (or future liberica) for TCP traffic (I think that's currently the case for LVS anyways!), but we could add UDP-based things in the future (e.g. DNS and QUIC/HTTP3), at which point we'll have to solve these problems differently.

We used MSS clamping with IPVS at my last job and it mostly just worked, but we had a handful of instances where it did not. Those instances often resulted in days and sometimes weeks of painful debugging with clients. In our case the clients had a vested interest in solving the problem, but that will probably not always be the case for our much wider client base. My preference would be to not rely on MSS, which depends on well-behaved clients we don't control, and instead solve the MTU problem internally, which is something we can completely control. Solving the MTU issue would hopefully be mainly a one-time cost, whereas clamping MSS may result in a trickle of issues until the heat death of the universe 😬. The horizon for QUIC and DNS load balancing also seems near enough that they should be in scope for the new design.

The issue is that MSS and MTU are tightly coupled. If we increase the MTU on the realservers to allow for the encapsulation overhead (eg. to 9000), the realservers will advertise their new increased MSS to the client (eg 8960).
Most of the time, the client will have a standard MTU (1500), and thus won't send packets larger than that. But by communicating a higher MSS to the client we remove a safeguard. If the client is misconfigured (e.g. an interface with an MTU of 1600) then it will try to send packets that big (as we can receive them) but the network in between won't be able to forward them.
That's why I would expect some edge-case failures if we just increase our MTU without lowering the MSS sent to, at least, clients on the public internet.
Thanks to the Cloudflare MSS clamping we do at a few POPs, it looks like clients behave properly (they respect the lower MSS we send them).
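The safeguard described above can be sketched with a simplified model of how a TCP sender sizes its packets: it uses the smaller of its own MTU-derived limit and the MSS its peer advertised. This ignores TCP options and PMTUD, and the function name and the example values are illustrative assumptions.

```python
IP_HEADER = 20   # bytes, IPv4 without options
TCP_HEADER = 20  # bytes, without options


def largest_packet_sent(sender_mtu, peer_advertised_mss):
    """Largest IP packet a sender emits in this simplified model."""
    own_limit = sender_mtu - IP_HEADER - TCP_HEADER
    segment = min(own_limit, peer_advertised_mss)
    return segment + IP_HEADER + TCP_HEADER


PATH_MTU = 1500  # what the network between client and realserver can carry

# Standard client, realserver advertising the jumbo MSS: still fine.
print(largest_packet_sent(1500, 8960))   # 1500
# Misconfigured client (MTU 1600), jumbo MSS advertised: exceeds the path MTU.
print(largest_packet_sent(1600, 8960))   # 1600
# Same client, but the realserver advertises a lowered MSS of 1440:
# the packet is 1480 bytes and still fits once the 20-byte IPIP header is added.
print(largest_packet_sent(1600, 1440))   # 1480
```

In this model the lowered MSS protects the path even when the client's own MTU is wrong, which is the reason for combining a larger internal MTU with a reduced MSS towards internet clients.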

Additionally, a quick look shows that Cloudflare servers use a MSS of 1400 over IPv4, Google 1412.
For IPv6, Cloudflare is at 1360 and Google at 1440. (reminds me of T283058: Consider lowering IPv6 TCP MSS)
This is usually to accommodate clients that are being tunneled "transparently" somewhere in the path, and reducing the need for fragmentation or IPv6 PMTUD. And in the case of Cloudflare at least, accommodate their internal tunneling.

Lowering the MSS for internet clients seems like the best option to me. If we need to increase it (along with the MTU) for the internal flows it can be tracked as a separate project.

Regarding UDP based protocols, DNS over UDP is usually capped at 512 bytes.

QUIC is somewhere in between 1200 and 1350, https://blog.apnic.net/2022/07/11/a-look-at-quic-use/ (see QUIC packet sizes)

@ayounsi thanks for the detailed replies and the linked blog posts. Given that additional data, I am substantially less concerned about using MSS clamping.

In T348837#9253425, @cmooney wrote:

Regarding the UDP encapsulation it's an interesting idea, and is a reminder that currently our switches distribute flows based on source internet IP, which gives us lots of entropy. With IPIP this is seriously diminished, we'd need a sufficiently large number of L4LB's and realservers to get a good traffic balance.

I've implemented a source IP/IPv6 randomizer for IPIP/IP6tnl packets (same way as katran does it) to alleviate this issue. The code can be found here https://gitlab.wikimedia.org/vgutierrez/ipip-multiqueue-optimizer

Regarding UDP based protocols, DNS over UDP is usually capped at 512 bytes.

While that was true at one stage, most DNS implementations now support EDNS and thus larger DNS packets over UDP. A key driver there is supporting larger packets containing DNSSEC keys and signatures. The industry seems mostly agreed that using 4096-byte EDNS(0) buffers (potentially requiring fragmentation) is impractical, and setting the EDNS buffer size to somewhere between 1232 and 1472 bytes is common. This internet draft proposes to standardize on 1400. Either way, large resolver implementations (which likely account for the majority of traffic to our authdns) are unlikely to still be restricting themselves to 512 bytes over UDP.

In T348837#9267626, @Vgutierrez wrote:

I've implemented a source IP/IPv6 randomizer for IPIP/IP6tnl packets (same way as katran does it) to alleviate this issue. The code can be found here https://gitlab.wikimedia.org/vgutierrez/ipip-multiqueue-optimizer

Wow nice work!

Mentioned in SAL (#wikimedia-operations) [2023-10-24T09:16:25Z] <vgutierrez> upload golang-github-florianl-go-tc to apt.wm.o (bookworm) - T348837

ayounsi mentioned this in T350462: Provide a TCP MSS clamping mechanism for real servers.Nov 7 2023, 2:43 PM

Change 965763 merged by Vgutierrez:

[operations/debs/pybal@1.15-stretch] Add support for IPIP encapsulation

https://gerrit.wikimedia.org/r/965763

Maintenance_bot removed a project: Patch-For-Review.Nov 13 2023, 10:11 AM

Change 973732 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/debs/pybal@1.15-stretch] Release 1.15.14

https://gerrit.wikimedia.org/r/973732

gerritbot added a project: Patch-For-Review.Nov 13 2023, 10:40 AM

Change 973732 abandoned by Vgutierrez: