
Use default BGP multihop TTL between CRs and servers
Closed, Resolved, Public

Description

All our core router -> server BGP sessions currently have "multihop" enabled. This supports EBGP connections from servers that the routers see as directly connected, but whose packets may take an extra IP hop to reach the CR (via the other CR, when that one is the VRRP master).

On the server side all the sessions are set up as "multihop", i.e. they send their BGP packets with a TTL greater than 1 so the connection works. The Kubernetes and PyBal nodes use the default TTL of 64:

cmooney@kubernetes1023:~$ sudo tcpdump -v -i eno1 -l -p -nn tcp port 179 and src host 10.64.32.21 
tcpdump: listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:55:18.567530 IP (tos 0xc0, ttl 64, id 31002, offset 0, flags [DF], proto TCP (6), length 71)
    10.64.32.21.54163 > 208.80.154.197.179: Flags [P.], cksum 0x95a4 (incorrect -> 0x3079), seq 433376052:433376071, ack 4133911694, win 83, options [nop,nop,TS val 3997879078 ecr 2361014759], length 19: BGP
	Keepalive Message (4), length: 19
cmooney@lvs1017:~$ sudo tcpdump -i eno1np0 -v -l -p -nn src host 10.64.0.80 and tcp port 179 
tcpdump: listening on eno1np0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:57:48.607257 IP (tos 0x0, ttl 64, id 43759, offset 0, flags [DF], proto TCP (6), length 71)
    10.64.0.80.44614 > 208.80.154.197.179: Flags [P.], cksum 0x75df (incorrect -> 0x7b37), seq 968772036:968772055, ack 2770000857, win 83, options [nop,nop,TS val 2552897587 ecr 2361166036], length 19: BGP
	Keepalive Message (4), length: 19
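
As a cross-check on the captures above, the default TTL a host applies to new sockets can also be read without sniffing traffic. This is a minimal sketch, assuming Python 3 is available on the host; the function name `default_ipv4_ttl` is made up for illustration. On Linux the value comes from the `net.ipv4.ip_default_ttl` sysctl.

```python
import socket


def default_ipv4_ttl() -> int:
    """Return the TTL the OS assigns to a fresh IPv4 TCP socket."""
    # No connection is made; we only read the socket's initial IP_TTL option.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.getsockopt(socket.IPPROTO_IP, socket.IP_TTL)


if __name__ == "__main__":
    print(default_ipv4_ttl())
```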

Where we manually configure multihop, however, we specify a TTL in every case. Our Anycast/Bird servers are configured to set the TTL to 2 on their BGP packets:

root@dns2004:/etc/bird# grep -A1 "protocol bgp" bird.conf 
protocol bgp {
    multihop 2;
--
protocol bgp {
    multihop 2;
root@dns2004:~# tcpdump -i eno8303 -l -p -nn -v host 208.80.153.48 and tcp port 179 
tcpdump: listening on eno8303, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:14:27.436596 IP (tos 0xc0, ttl 2, id 3498, offset 0, flags [DF], proto TCP (6), length 71)
    208.80.153.48.56247 > 208.80.153.192.179: Flags [P.], cksum 0xd3cb (incorrect -> 0xd3b4), seq 2969821142:2969821161, ack 2426692537, win 83, options [nop,nop,TS val 3043247674 ecr 3753479673], length 19: BGP
	Keepalive Message (4), length: 19
17:14:27.540170 IP (tos 0xc0, ttl 193, id 43160, offset 0, flags [none], proto TCP (6), length 52)
    208.80.153.192.179 > 208.80.153.48.56247: Flags [.], cksum 0x7c2d (correct), ack 19, win 16384, options [nop,nop,TS val 3753486830 ecr 3043247674], length 0

On the router side all our server peerings also have the TTL set to 2, except for the Anycast groups, which use 193 (see T209989):

set protocols bgp group PyBal multihop ttl 2
set protocols bgp group Kubernetes4 multihop ttl 2
set protocols bgp group Kubernetes6 multihop ttl 2
set protocols bgp group Anycast4 multihop ttl 193
set protocols bgp group Kubestage4 multihop ttl 2
set protocols bgp group Kubestage6 multihop ttl 2
set protocols bgp group Kubemlserve4 multihop ttl 2
set protocols bgp group Kubemlserve6 multihop ttl 2
set protocols bgp group Anycast6 multihop ttl 193
set protocols bgp group Kubemlstaging4 multihop ttl 2
set protocols bgp group Kubemlstaging6 multihop ttl 2

I have been looking at this because, with the codfw switch upgrade, the path for BGP packets from servers on existing vlans could now take up to 2 extra hops, i.e. SERVER -> LEAF -> SPINE -> CR1 -> CR2.

To support this we could increase the TTL from 2 to 4 where we have it configured, but I really wonder about the point of all this very strict control of TTL on the server sessions. It makes a lot more sense to me, both on the Bird and Juniper side, to simply configure "multihop", which will cause the devices to use the system's default TTL for outgoing packets, and accept any TTL on incoming ones.
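
The hop arithmetic behind "2 to 4" can be sketched as below. This is an illustrative model only, assuming every device between the server and the far CR forwards the packet at layer 3 and decrements the TTL by one. The function name `ttl_on_arrival` and the path lists are hypothetical; the device names are taken from the path described above.

```python
from typing import List, Optional


def ttl_on_arrival(initial_ttl: int, path: List[str]) -> Optional[int]:
    """Return the TTL seen by the last device in `path`, or None if dropped.

    `path` lists devices from sender to receiver. Every device in between
    is assumed to route the packet and decrement its TTL by one.
    """
    ttl = initial_ttl
    for _forwarder in path[1:-1]:
        ttl -= 1
        if ttl <= 0:
            return None  # TTL expired in transit; the packet is discarded
    return ttl


# Worst-case paths before and after the switch upgrade (assumed).
old_path = ["SERVER", "CR1", "CR2"]
new_path = ["SERVER", "LEAF", "SPINE", "CR1", "CR2"]

if __name__ == "__main__":
    for ttl in (2, 4, 64):
        print(ttl, ttl_on_arrival(ttl, old_path), ttl_on_arrival(ttl, new_path))
```

Under these assumptions a TTL of 2 survives the old worst-case path but expires on the new one, 4 is the minimum that still reaches CR2, and the system default of 64 leaves ample headroom.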

The classic use-case for TTL security mechanisms is to prevent the "ping of death" from a remote malicious user who spoofs a TCP RST packet from a source address matching a BGP peer. Enforcing a TTL of 1 prevents this working off-link, and a slightly higher TTL means the attacker has to be very close to you. This is perfectly valid on the WAN, but I don't think we need to worry about it on our peerings to servers.

Event Timeline

cmooney triaged this task as Medium priority. Nov 3 2023, 2:16 PM
cmooney created this task.

Change 971488 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove specific TTL values from server BGP groups

https://gerrit.wikimedia.org/r/971488

Change 971490 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Change Bird multihop command to use default system TTL

https://gerrit.wikimedia.org/r/971490

Change 971498 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Block incoming packets on the edge for CR loopbacks on TCP 179

https://gerrit.wikimedia.org/r/971498

For cross-site router-to-router sessions we use the TTL value to eventually take down the BGP session if it takes too long a path. It's clearly not optimal, and streamlining it would be part of T167841: Cleanup confed BGP peerings and policies.
For server-facing BGP sessions I don't know why a specific TTL was used (it predates me): either to replicate the same setup, or for the security aspect of it.

It makes sense to me to remove it; just make sure that BFD properly establishes once the explicit TTL is removed.

> For cross-site router-to-router sessions we use the TTL value to eventually take down the BGP session if it takes too long a path. It's clearly not optimal, and streamlining it would be part of T167841: Cleanup confed BGP peerings and policies.

Yes, it is definitely a useful knob to have there; I'm not suggesting we remove it on transport links.

> For server-facing BGP sessions I don't know why a specific TTL was used (it predates me): either to replicate the same setup, or for the security aspect of it.
>
> It makes sense to me to remove it; just make sure that BFD properly establishes once the explicit TTL is removed.

Yeah, we need to be careful introducing it and double-check all is ok. I've done extensive lab testing and I'm not worried, but we still need to proceed carefully. Worth noting that the BFD packets sent by BIRD right now all use the default system TTL (only the BGP packets have TTL 2 on them; the "multihop" stanza in the BFD config doesn't have a ttl on it).

Eh not sure how I accidentally set this to resolved!

Change 971498 abandoned by Cathal Mooney:

[operations/homer/public@master] Block incoming packets on the edge for CR loopbacks on TCP 179

Reason:

Agreed that we're better without the additional config given low risk. We can keep for reference.

https://gerrit.wikimedia.org/r/971498

Change 971488 merged by jenkins-bot:

[operations/homer/public@master] Remove specific TTL values from server BGP groups

https://gerrit.wikimedia.org/r/971488

Mentioned in SAL (#wikimedia-operations) [2023-11-15T18:36:17Z] <topranks> remove TTL setting on server-facing BGP peerings on cr3-ulsfo T350488

Mentioned in SAL (#wikimedia-operations) [2023-11-15T18:42:53Z] <topranks> Reset BGP to lvs4010 from cr3-ulsfo to validate new config T350488

Change 971490 merged by Cathal Mooney:

[operations/puppet@production] Change Bird multihop command to use default system TTL

https://gerrit.wikimedia.org/r/971490

Mentioned in SAL (#wikimedia-operations) [2023-11-15T19:10:31Z] <topranks> merging patch to remove TTL restriction on Bird Anycast BGP peerings (T350488)

Mentioned in SAL (#wikimedia-operations) [2023-11-15T19:39:11Z] <topranks> re-enabling puppet on DNS hosts to adjust TTL setting in BIRD (T350488)

Patches merged, all looking ok.

For example, on dns5004 this was the situation before, with the server using TTL 2 and the CR using 193:

19:27:22.338917 IP (tos 0xc0, ttl 2, id 2340, offset 0, flags [DF], proto TCP (6), length 71)
    103.102.166.10.179 > 103.102.166.131.64331: Flags [P.], cksum 0x1b94 (incorrect -> 0x831e), seq 140972607:140972626, ack 1513291613, win 85, options [nop,nop,TS val 1865306745 ecr 1025991869], length 19: BGP
	Keepalive Message (4), length: 19
19:27:22.441139 IP (tos 0xc0, ttl 193, id 30028, offset 0, flags [none], proto TCP (6), length 52)
    103.102.166.131.64331 > 103.102.166.10.179: Flags [.], cksum 0xe41f (correct), seq 1, ack 19, win 16384, options [nop,nop,TS val 1026017323 ecr 1865306745], length 0

Now the server is using 64 and the CR 255:

19:37:38.182849 IP (tos 0xc0, ttl 64, id 3937, offset 0, flags [DF], proto TCP (6), length 71)
    103.102.166.10.39373 > 103.102.166.130.179: Flags [P.], cksum 0x1b93 (incorrect -> 0xeda8), seq 3668898001:3668898020, ack 3525283050, win 83, options [nop,nop,TS val 2962555939 ecr 1028436521], length 19: BGP
	Keepalive Message (4), length: 19
19:37:38.287391 IP (tos 0xc0, ttl 255, id 42734, offset 0, flags [none], proto TCP (6), length 52)
    103.102.166.130.179 > 103.102.166.10.39373: Flags [.], cksum 0x88f7 (correct), seq 1, ack 19, win 16384, options [nop,nop,TS val 1028447048 ecr 2962555939], length 0

BFD unaffected (it was already using 64 server side).
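
A before/after comparison like the one above could be scripted rather than read by eye. The following is a rough sketch, assuming IPv4 output in the `tcpdump -v -nn` layout shown in this task; the function name `ttl_by_source` and the regular expressions are illustrative, and the sample lines are trimmed from the dns5004 capture above.

```python
import re
from typing import List, Tuple

# First line of each packet: timestamp, then "IP (tos ..., ttl N, ...".
HEADER = re.compile(r"^\S+ IP \(tos \S+, ttl (\d+),")
# Indented second line: "src_ip.src_port > dst_ip.dst_port:" (IPv4 only).
ADDRS = re.compile(r"^\s+(\d+(?:\.\d+){3})\.\d+ > (\d+(?:\.\d+){3})\.\d+:")


def ttl_by_source(capture: str) -> List[Tuple[str, int]]:
    """Return (source IP, TTL) for each packet found in tcpdump -v text."""
    results = []
    pending_ttl = None
    for line in capture.splitlines():
        header = HEADER.match(line)
        if header:
            pending_ttl = int(header.group(1))
            continue
        addrs = ADDRS.match(line)
        if addrs and pending_ttl is not None:
            results.append((addrs.group(1), pending_ttl))
            pending_ttl = None
    return results


SAMPLE = "\n".join([
    "19:37:38.182849 IP (tos 0xc0, ttl 64, id 3937, offset 0, flags [DF], proto TCP (6), length 71)",
    "    103.102.166.10.39373 > 103.102.166.130.179: Flags [P.], length 19: BGP",
    "\tKeepalive Message (4), length: 19",
    "19:37:38.287391 IP (tos 0xc0, ttl 255, id 42734, offset 0, flags [none], proto TCP (6), length 52)",
    "    103.102.166.130.179 > 103.102.166.10.39373: Flags [.], length 0",
])

if __name__ == "__main__":
    for src, ttl in ttl_by_source(SAMPLE):
        print(src, ttl)
```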

cmooney renamed this task from Use default BGP multihop TTL between devices to Use default BGP multihop TTL between CRs and servers. Nov 16 2023, 12:46 PM