All our core router -> server BGP sessions currently have "multihop" enabled. This supports EBGP connections from servers that the routers see as directly connected, but whose packets may take an extra IP hop to reach the CR (via the other CR, when that one is the VRRP master).
On the server side all the sessions are also set up as "multihop", i.e. the servers send BGP packets with a TTL greater than 1 so the connection works. The Kubernetes and PyBal nodes use the default TTL of 64:
cmooney@kubernetes1023:~$ sudo tcpdump -v -i eno1 -l -p -nn tcp port 179 and src host 10.64.32.21
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:55:18.567530 IP (tos 0xc0, ttl 64, id 31002, offset 0, flags [DF], proto TCP (6), length 71)
    10.64.32.21.54163 > 208.80.154.197.179: Flags [P.], cksum 0x95a4 (incorrect -> 0x3079), seq 433376052:433376071, ack 4133911694, win 83, options [nop,nop,TS val 3997879078 ecr 2361014759], length 19: BGP Keepalive Message (4), length: 19

cmooney@lvs1017:~$ sudo tcpdump -i eno1np0 -v -l -p -nn src host 10.64.0.80 and tcp port 179
tcpdump: listening on eno1np0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:57:48.607257 IP (tos 0x0, ttl 64, id 43759, offset 0, flags [DF], proto TCP (6), length 71)
    10.64.0.80.44614 > 208.80.154.197.179: Flags [P.], cksum 0x75df (incorrect -> 0x7b37), seq 968772036:968772055, ack 2770000857, win 83, options [nop,nop,TS val 2552897587 ecr 2361166036], length 19: BGP Keepalive Message (4), length: 19
Where we manually configure multihop, however, we always specify the TTL to use. Our Anycast/Bird servers are configured to set the TTL to 2 for BGP packets:
root@dns2004:/etc/bird# grep -A1 "protocol bgp" bird.conf
protocol bgp {
    multihop 2;
--
protocol bgp {
    multihop 2;

root@dns2004:~# tcpdump -i eno8303 -l -p -nn -v host 208.80.153.48 and tcp port 179
tcpdump: listening on eno8303, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:14:27.436596 IP (tos 0xc0, ttl 2, id 3498, offset 0, flags [DF], proto TCP (6), length 71)
    208.80.153.48.56247 > 208.80.153.192.179: Flags [P.], cksum 0xd3cb (incorrect -> 0xd3b4), seq 2969821142:2969821161, ack 2426692537, win 83, options [nop,nop,TS val 3043247674 ecr 3753479673], length 19: BGP Keepalive Message (4), length: 19
17:14:27.540170 IP (tos 0xc0, ttl 193, id 43160, offset 0, flags [none], proto TCP (6), length 52)
    208.80.153.192.179 > 208.80.153.48.56247: Flags [.], cksum 0x7c2d (correct), ack 19, win 16384, options [nop,nop,TS val 3753486830 ecr 3043247674], length 0
On the router side, all our server peerings also set the TTL to 2, except for the Anycast groups, which use 193 (see T209989):
set protocols bgp group PyBal multihop ttl 2
set protocols bgp group Kubernetes4 multihop ttl 2
set protocols bgp group Kubernetes6 multihop ttl 2
set protocols bgp group Anycast4 multihop ttl 193
set protocols bgp group Kubestage4 multihop ttl 2
set protocols bgp group Kubestage6 multihop ttl 2
set protocols bgp group Kubemlserve4 multihop ttl 2
set protocols bgp group Kubemlserve6 multihop ttl 2
set protocols bgp group Anycast6 multihop ttl 193
set protocols bgp group Kubemlstaging4 multihop ttl 2
set protocols bgp group Kubemlstaging6 multihop ttl 2
I have been looking at this because, with the codfw switch upgrade, the potential path for BGP packets from servers on existing vlans could now take up to 2 extra hops, i.e. SERVER -> LEAF -> SPINE -> CR1 -> CR2.
To support this we could increase the TTL from 2 to 4 where we have it configured, but I really wonder what the point is of such strict control of the TTL on the server sessions. It makes a lot more sense to me, on both the Bird and Juniper side, to simply configure "multihop" without a value, which will cause the devices to use the system's default TTL for outgoing packets and accept any TTL on incoming ones.
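To make the TTL arithmetic concrete, here is a minimal sketch of how a packet's TTL changes along the two paths. It assumes the worst case described above, where every device between the server and the destination CR routes the packet and so decrements the TTL. The function name and the path labels are hypothetical and only for illustration.

```python
def ttl_on_arrival(initial_ttl, routing_hops):
    """Return the TTL a packet has when delivered, or None if it expires en route.

    routing_hops is the ordered list of devices that route (not just switch)
    the packet before it reaches its destination.
    """
    ttl = initial_ttl
    for _hop in routing_hops:
        ttl -= 1  # each routing hop decrements the TTL before forwarding
        if ttl <= 0:
            return None  # TTL expired in transit, so this hop drops the packet
    return ttl


# Hypothetical labels for the intermediate routing hops; the destination is CR2.
CURRENT_PATH = ["CR1"]                    # SERVER -> CR1 -> CR2
UPGRADED_PATH = ["LEAF", "SPINE", "CR1"]  # SERVER -> LEAF -> SPINE -> CR1 -> CR2

for ttl in (2, 4, 64):
    print(ttl, ttl_on_arrival(ttl, CURRENT_PATH), ttl_on_arrival(ttl, UPGRADED_PATH))
```

Under these assumptions a TTL of 2 is enough for the current path but expires at the spine on the longer one, a TTL of 4 only just reaches CR2, and the default of 64 leaves plenty of headroom on either path.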
The classic use-case for TTL security mechanisms is to prevent a TCP reset attack, in which a remote malicious user spoofs a TCP RST packet from a source address matching a BGP peer. Enforcing a TTL of 1 prevents this working off-link, and a slightly higher TTL means the attacker has to be very close to you. This is perfectly valid on the WAN, but I don't think we need to worry about it on our peerings to servers.