As pointed out by @jhathaway, IPVS supports IPIP encapsulation, so PyBaL should be able to benefit from it.
This is interesting because we could move from the current IPVS DSR setup, which requires L2 connectivity between the load balancer and the real servers, to IPIP encapsulation. That would close the gap between PyBaL and Liberica and reduce the risk of that migration.
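As a rough illustration of what the switch means at the IPVS level, here is a minimal dry-run sketch. It is not taken from the PoC: the `wrr` scheduler, the weight, holding the VIP on `lo`, and the `rp_filter` tweak are assumptions, and `run`, `lb_setup` and `rs_setup` are hypothetical helper names. Only the addresses and the `ipip0` device name come from the output in this task.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set.
# Assumed, not taken from the PoC: wrr scheduler, weight 1, lo as the VIP
# holder, and the rp_filter tweak.

run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Load balancer: virtual service plus one tunnel-mode real server per argument.
lb_setup() {
    vip=$1
    shift
    run ipvsadm -A -t "${vip}:80" -s wrr
    for rs in "$@"; do
        # -i = IPIP tunnelling; -g would be direct routing (needs shared L2).
        run ipvsadm -a -t "${vip}:80" -r "${rs}:80" -i -w 1
    done
}

# Real server: decapsulate on ipip0, own the VIP, reply directly to the client.
rs_setup() {
    vip=$1
    rs=$2
    run modprobe ipip
    run ip tunnel add ipip0 mode ipip local "$rs"
    run ip link set ipip0 up
    run ip addr add "${vip}/32" dev lo
    # Requests arrive on ipip0 while the route back to the client is via
    # another interface, so strict reverse-path filtering would drop the
    # incoming requests (net.ipv4.conf.all.rp_filter is also taken into account).
    run sysctl -w net.ipv4.conf.ipip0.rp_filter=0
}

lb_setup 10.10.10.10 192.168.42.100
rs_setup 10.10.10.10 192.168.42.100
```

With the default `DRY_RUN=1` nothing is changed on the host; the script only lists what it would run. The only difference from a DSR service on the load balancer side is `-i` instead of `-g`.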
Vagrantfile PoC: https://phabricator.wikimedia.org/P52928
How to use it:
  $ vagrant up
  $ VIP=$(fgrep "vip =" Vagrantfile | cut -f2 -d'"')
  $ LB=$(fgrep "lb_ip =" Vagrantfile | cut -f2 -d'"')
  $ sudo ip route add $VIP via $LB
  $ curl -s -v -o /dev/null $VIP
  *   Trying 10.10.10.10:80...
  * Connected to 10.10.10.10 (10.10.10.10) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 10.10.10.10
  > User-Agent: curl/7.88.1
  > Accept: */*
  >
  < HTTP/1.1 200 OK
  < Server: nginx/1.22.1
  < Date: Fri, 13 Oct 2023 10:56:11 GMT
  < Content-Type: text/html
  < Content-Length: 615
  < Last-Modified: Fri, 13 Oct 2023 10:48:34 GMT
  < Connection: keep-alive
  < ETag: "65292082-267"
  < backend: 192.168.42.100
  < Accept-Ranges: bytes
  <
  { [615 bytes data]
  * Connection #0 to host 10.10.10.10 left intact
Running tshark on one of the real servers (vagrant ssh backend[01]) shows that requests arrive via ipip0 and responses go back via eth1:
  vagrant@bookworm:~$ sudo -i tshark -o tcp.analyze_sequence_numbers:FALSE -i ipip0 -i eth1 -z proto,colinfo,frame.interface_name,frame.interface_name port 80
  Running as user "root" and group "root". This could be dangerous.
  Capturing on 'ipip0' and 'eth1'
   ** (tshark:3518) 11:06:41.001064 [Main MESSAGE] -- Capture started.
   ** (tshark:3518) 11:06:41.001265 [Main MESSAGE] -- File: "/tmp/wireshark_2_interfacesLZ2QC2.pcapng"
      1 0.000000000 192.168.42.1 → 10.10.10.10 TCP 60 47078 → 80 [SYN] Seq=1618396380 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=4138444511 TSecr=0 WS=128 frame.interface_name == "ipip0"
      2 0.000283971 192.168.42.1 → 10.10.10.10 TCP 52 47078 → 80 [ACK] Seq=1618396381 Ack=1890327363 Win=64256 Len=0 TSval=4138444512 TSecr=274137172 frame.interface_name == "ipip0"
      3 0.000284019 192.168.42.1 → 10.10.10.10 HTTP 127 GET / HTTP/1.1 frame.interface_name == "ipip0"
      4 0.000697318 192.168.42.1 → 10.10.10.10 TCP 52 47078 → 80 [ACK] Seq=1618396456 Ack=1890328241 Win=64128 Len=0 TSval=4138444512 TSecr=274137173 frame.interface_name == "ipip0"
      5 0.000713903 192.168.42.1 → 10.10.10.10 TCP 52 47078 → 80 [FIN, ACK] Seq=1618396456 Ack=1890328241 Win=64128 Len=0 TSval=4138444512 TSecr=274137173 frame.interface_name == "ipip0"
      6 0.000833913 192.168.42.1 → 10.10.10.10 TCP 52 47078 → 80 [ACK] Seq=1618396457 Ack=1890328242 Win=64128 Len=0 TSval=4138444513 TSecr=274137173 frame.interface_name == "ipip0"
      7 0.000032012 10.10.10.10 → 192.168.42.1 TCP 74 80 → 47078 [SYN, ACK] Seq=1890327362 Ack=1618396381 Win=65160 Len=0 MSS=1460 SACK_PERM TSval=274137172 TSecr=4138444511 WS=64 frame.interface_name == "eth1"
      8 0.000311346 10.10.10.10 → 192.168.42.1 TCP 66 80 → 47078 [ACK] Seq=1890327363 Ack=1618396456 Win=65088 Len=0 TSval=274137173 TSecr=4138444512 frame.interface_name == "eth1"
      9 0.000440283 10.10.10.10 → 192.168.42.1 HTTP 944 HTTP/1.1 200 OK (text/html) frame.interface_name == "eth1"
     10 0.000727290 10.10.10.10 → 192.168.42.1 TCP 66 80 → 47078 [FIN, ACK] Seq=1890328241 Ack=1618396457 Win=65088 Len=0 TSval=274137173 TSecr=4138444512 frame.interface_name == "eth1"
By default ipip0 gets configured with MTU 1480 and eth1 with MTU 1500. If we use curl to send a request bigger than the MTU, we can see how fragmentation happens and is handled:
  $ curl -H "Foo: $(python3 -c 'print(chr(0x42)*1600)')" 10.10.10.10 -v -o /dev/null -s
  *   Trying 10.10.10.10:80...
  * Connected to 10.10.10.10 (10.10.10.10) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 10.10.10.10
  > User-Agent: curl/7.88.1
  > Accept: */*
  > Foo: B[x1600, you get the idea]
  >
  < HTTP/1.1 200 OK
  < Server: nginx/1.22.1
  < Date: Fri, 13 Oct 2023 12:37:30 GMT
  < Content-Type: text/html
  < Content-Length: 615
  < Last-Modified: Fri, 13 Oct 2023 10:48:34 GMT
  < Connection: keep-alive
  < ETag: "65292082-267"
  < backend: 192.168.42.100
  < Accept-Ranges: bytes
  <
  { [615 bytes data]
  * Connection #0 to host 10.10.10.10 left intact
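The MTU numbers above follow from the 20-byte outer IPv4 header that IPIP adds. A small arithmetic sketch, assuming IPv4 and TCP headers without options (the helper names are hypothetical), shows why a full-size packet from a client that negotiated MSS 1460 no longer fits once encapsulated:

```shell
#!/bin/sh
# Header sizes in bytes (IPv4 and TCP, both without options).
IPV4_HDR=20
TCP_HDR=20

# MTU left for the inner packet once the outer IPv4 header is added.
tunnel_mtu() { echo $(( $1 - IPV4_HDR )); }

# Size of an inner packet after IPIP encapsulation.
encap_size() { echo $(( $1 + IPV4_HDR )); }

# Largest TCP MSS whose packets fit in the given MTU.
max_mss() { echo $(( $1 - IPV4_HDR - TCP_HDR )); }

echo "ipip0 MTU:                $(tunnel_mtu 1500)"   # 1480
echo "MSS fitting eth1 (1500):  $(max_mss 1500)"      # 1460
echo "MSS fitting ipip0 (1480): $(max_mss 1480)"      # 1440
echo "1500-byte packet in IPIP: $(encap_size 1500)"   # 1520, 20 bytes over eth1's MTU
```

How the extra 20 bytes are dealt with on the path between the load balancer and the real server is outside this sketch; it only covers the size arithmetic.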
tshark shows the fragmentation as expected (after disabling the segmentation offload with ethtool):
     13 315.823694565 192.168.42.1 → 10.10.10.10 TCP 60 36532 → 80 [SYN] Seq=1816054424 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=4143892384 TSecr=0 WS=128
     14 315.823889655 192.168.42.1 → 10.10.10.10 TCP 52 36532 → 80 [ACK] Seq=1816054425 Ack=3606676205 Win=64256 Len=0 TSval=4143892384 TSecr=279585060
     15 315.823916323 192.168.42.1 → 10.10.10.10 TCP 1500 GET / HTTP/1.1 [TCP segment of a reassembled PDU]
     16 315.823916350 192.168.42.1 → 10.10.10.10 HTTP 286 GET / HTTP/1.1
     17 315.824210855 192.168.42.1 → 10.10.10.10 TCP 52 36532 → 80 [ACK] Seq=1816056107 Ack=3606677083 Win=64128 Len=0 TSval=4143892385 TSecr=279585061
     18 315.824301302 192.168.42.1 → 10.10.10.10 TCP 52 36532 → 80 [FIN, ACK] Seq=1816056107 Ack=3606677083 Win=64128 Len=0 TSval=4143892385 TSecr=279585061
     19 315.824361163 192.168.42.1 → 10.10.10.10 TCP 52 36532 → 80 [ACK] Seq=1816056108 Ack=3606677084 Win=64128 Len=0 TSval=4143892385 TSecr=279585061
     20 315.823438198 192.168.42.1 → 10.10.10.10 TCP 74 36532 → 80 [SYN] Seq=1816054424 Win=64240 Len=0 MSS=1460 SACK_PERM TSval=4143892384 TSecr=0 WS=128
     21 315.823726231 10.10.10.10 → 192.168.42.1 TCP 74 80 → 36532 [SYN, ACK] Seq=3606676204 Ack=1816054425 Win=65160 Len=0 MSS=1460 SACK_PERM TSval=279585060 TSecr=4143892384 WS=64
     22 315.823996195 10.10.10.10 → 192.168.42.1 TCP 66 80 → 36532 [ACK] Seq=3606676205 Ack=1816055873 Win=63744 Len=0 TSval=279585061 TSecr=4143892384
     23 315.824004052 10.10.10.10 → 192.168.42.1 TCP 66 80 → 36532 [ACK] Seq=3606676205 Ack=1816056107 Win=63552 Len=0 TSval=279585061 TSecr=4143892384
     24 315.824103640 10.10.10.10 → 192.168.42.1 HTTP 944 HTTP/1.1 200 OK (text/html)
     25 315.824315534 10.10.10.10 → 192.168.42.1 TCP 66 80 → 36532 [FIN, ACK] Seq=3606677083 Ack=1816056108 Win=64128 Len=0 TSval=279585061 TSecr=4143892385
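The capture above was taken with segmentation offload disabled, since otherwise tshark can see merged or not-yet-split packets that do not reflect what goes on the wire. A minimal sketch of that step follows; which features were actually turned off in the PoC is not stated, so the `tso`/`gso`/`gro` selection and the choice of eth1 are assumptions, and `run` and `disable_seg_offload` are hypothetical helpers.

```shell
#!/bin/sh
# Sketch only: prints the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Turn off TCP segmentation offload, generic segmentation offload and
# generic receive offload on the given interface (assumed feature set).
disable_seg_offload() {
    run ethtool -K "$1" tso off gso off gro off
}

disable_seg_offload eth1
```

`ethtool -k eth1` (lowercase k) lists the current offload settings, which is a quick way to confirm the change took effect.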