
Anycast: consistent routers->servers routing
Open, Medium, Public

Description

The scenario we need to prevent is the following:

  1. A user establishes a TCP session to an anycasted VIP. Packets flow, for example, through: user -> transit_A -> cr1 -> server1
  2. A routing change happens on the Internet and the user's path becomes: user -> transit_B -> cr2 -> server2

This would cause the TCP session to break as server2 has no knowledge of that client.

The two possible ways for the infra to handle that use case are:

  • A) user -> transit_B -> cr2 -> server1 consistently
  • B) user -> transit_B -> cr2 -> cr1 -> server1

As hashing is done on source/dest L3 headers, the cr1->server1 step is consistent.
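To illustrate why the router-to-server step is stable on a single router but not across routers, here is a minimal sketch of hash-based next-hop selection. It is not Juniper's algorithm: `ecmp_pick`, the SHA-256 hash, the `seed` parameter (standing in for whatever differs between two routers) and the documentation-range addresses are all assumptions made for the example.

```python
import hashlib
import ipaddress


def ecmp_pick(src: str, dst: str, next_hops: list[str], seed: bytes = b"") -> str:
    """Pick a next hop from a hash of the source/destination L3 addresses.

    `seed` is a hypothetical stand-in for per-router differences
    (hash function, next-hop ordering, firmware behaviour).
    """
    key = seed + ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:8], "big") % len(next_hops)]


servers = ["server1", "server2"]
client, vip = "198.51.100.7", "192.0.2.53"  # illustrative addresses

# Same router, same flow: every packet of the session gets the same server.
cr1_choice = ecmp_pick(client, vip, servers, seed=b"cr1")
assert all(ecmp_pick(client, vip, servers, seed=b"cr1") == cr1_choice for _ in range(100))

# Another router hashing the same flow may or may not agree with cr1.
cr2_choice = ecmp_pick(client, vip, servers, seed=b"cr2")
print("cr1 ->", cr1_choice, "| cr2 ->", cr2_choice)
```

Whether the two choices match depends entirely on the seed, which is the point: nothing in per-router hashing by itself guarantees option A.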

Regarding A), I have so far found no mention of it in Juniper's consistent hashing docs (here and there), which means we would need to test it to find out whether it behaves as we want; that seems unlikely.
Regardless, Juniper's consistent LB only works with single-hop eBGP, while we currently do multi-hop (as the servers peer with the routers' loopbacks). It is also only supported on MPCs, so maybe not on MX204s.

As a side note, Juniper's consistent LB means that if the server pool is >2, adding/removing a server will not reshuffle all the sessions. We don't need to care about that right now, as we only have 2 servers per site.
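The reshuffling effect can be sketched by comparing plain modulo hashing with a consistent scheme. The example below uses rendezvous (highest-random-weight) hashing as the consistent scheme; this is a swapped-in technique chosen for brevity, not a description of Juniper's implementation. The function names and flow strings are made up for the illustration.

```python
import hashlib


def _score(*parts: str) -> int:
    """Deterministic 64-bit score derived from the given strings."""
    data = "|".join(parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def modulo_pick(flow: str, servers: list[str]) -> str:
    # Plain hashing: the result depends on the pool size.
    return servers[_score(flow) % len(servers)]


def rendezvous_pick(flow: str, servers: list[str]) -> str:
    # Rendezvous hashing: each (flow, server) pair gets a score, highest wins.
    return max(servers, key=lambda s: _score(flow, s))


def moved(pick, flows, before, after) -> int:
    """Count flows that change server although their old server is still in the pool."""
    return sum(
        1 for f in flows
        if pick(f, before) in after and pick(f, before) != pick(f, after)
    )


flows = [f"198.51.100.{i}->192.0.2.53" for i in range(200)]
before = ["dns1001", "dns1002", "authdns1001"]
after = ["dns1001", "dns1002"]  # one server withdrawn

print("modulo, needlessly moved flows:    ", moved(modulo_pick, flows, before, after))
print("rendezvous, needlessly moved flows:", moved(rendezvous_pick, flows, before, after))
```

With rendezvous hashing the second count is always 0, because a flow's winning server stays the winner as long as it remains in the pool. With only 2 servers the question does not arise: removing one leaves a single choice.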

However, it is possible to use BGP MEDs to achieve B), by having all servers set a higher MED on a prefix when they advertise that prefix to the same router.
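A toy model of how the MED approach could produce path B) is sketched below. It assumes (none of this is confirmed by the task) that servers send MED 0 to cr1 and MED 100 to cr2, that cr1 re-advertises its best path to cr2 over iBGP with the MED preserved, and that every earlier step of BGP best-path selection (local preference, AS-path length, origin) ties. `Route` and `best_paths` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    next_hop: str
    med: int
    learned_via: str  # "ebgp" or "ibgp"


def best_paths(routes: list[Route]) -> list[Route]:
    """Simplified selection: lowest MED wins, ties remain as ECMP candidates.

    Assumes local-pref, AS-path length and origin already tie, and that all
    routes come from the same neighbouring AS so their MEDs are comparable.
    """
    lowest = min(r.med for r in routes)
    return [r for r in routes if r.med == lowest]


# Hypothetical MED values, chosen only for the example.
cr1_rib = [Route("server1", 0, "ebgp"), Route("server2", 0, "ebgp")]
cr2_rib = [
    Route("server1", 100, "ebgp"),
    Route("server2", 100, "ebgp"),
    Route("cr1", 0, "ibgp"),  # cr1's path, modelled as "forward to cr1"
]

print([r.next_hop for r in best_paths(cr1_rib)])  # ['server1', 'server2']
print([r.next_hop for r in best_paths(cr2_rib)])  # ['cr1']
```

In this model cr2 hands everything to cr1, so only cr1's hash decides which server receives a given flow, whichever transit the packet arrived through.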

Edit: according to https://apps.juniper.net/feature-explorer/feature-info.html?fKey=6434&fn=Consistent%20load%20balancing%20for%20ECMP%20groups consistent-hash is available on MX204s.

Event Timeline

ayounsi triaged this task as Medium priority. May 26 2020, 5:23 PM
ayounsi created this task.
Restricted Application added a project: Operations. May 26 2020, 5:23 PM
Restricted Application added a subscriber: Aklapper.
ayounsi added a subscriber: faidon. May 26 2020, 5:24 PM

Even if we could experimentally verify option A, we probably can't trust it across future firmware differences (between sites, or between two routers in a site). Option B via MEDs sounds like a good path forward for now, though!

Related: we have the issue of ICMP Packet-Too-Big routing. AFAIK Juniper doesn't even try to route a PTB from an intermediate router to the same server as the primary traffic it refers to. This probably isn't a major issue for the authdns case, because (a) the client recursors should mostly be on server (rather than eyeball) networks with full MTU, and (b) the overwhelming majority of all traffic is UDP with packet sizes small enough to fit any reasonable network. However, it would be nice to be correct for edge cases like recursors in eyeball networks with MTU problems, and to be future-proof against increasing TCP usage (for cookie init and other blind-injection-avoidance, and also DoTLS and future DNSSEC packet size increases). Cloudflare's generic answer to this problem has been https://github.com/cloudflare/pmtud , but there might be different and/or simpler approaches we want to try as well.
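To make the PTB problem concrete: an ICMP error carries, in its payload, the header of the packet that triggered it, so the original flow can in principle be recovered from it. The sketch below contrasts hashing on the outer header with hashing on that embedded header. It is only an illustration of the mismatch, not a description of what Juniper routers or pmtud do; all type names, function names and addresses are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IPHeader:
    src: str
    dst: str


@dataclass(frozen=True)
class PacketTooBig:
    outer: IPHeader     # intermediate router -> VIP
    embedded: IPHeader  # header of the oversized packet: VIP -> client


def naive_key(ptb: PacketTooBig) -> tuple[str, str]:
    # What a hash on the outer source/dest addresses sees.
    return (ptb.outer.src, ptb.outer.dst)


def flow_aware_key(ptb: PacketTooBig) -> tuple[str, str]:
    # The embedded header is the server's outbound packet; swapping its
    # addresses gives the key of the client's inbound flow.
    return (ptb.embedded.dst, ptb.embedded.src)


client, vip, router = "198.51.100.7", "192.0.2.53", "203.0.113.1"
client_flow_key = (client, vip)
ptb = PacketTooBig(outer=IPHeader(router, vip), embedded=IPHeader(vip, client))

print(naive_key(ptb) == client_flow_key)       # False
print(flow_aware_key(ptb) == client_flow_key)  # True
```

Since the outer key differs from the session's key, a hash on it has no reason to select the server that holds the TCP session.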

Also:

As a side note, Juniper's consistent LB means that if the server pool is >2, adding/removing a server will not reshuffle all the sessions, which we don't need to care about right now as we only have 2 servers per site.

In the core sites, we have 3 servers for these (dns1001, dns1002, authdns1001, and similarly in codfw).

Change 598836 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Anycast: introduce new "deterministic" variable

https://gerrit.wikimedia.org/r/598836

Mentioned in SAL (#wikimedia-operations) [2020-05-27T07:54:51Z] <XioNoX> test new bird conf on dns4001 - T253666

Option B via MEDs sounds like a good path forward for now, though!

https://gerrit.wikimedia.org/r/598836 has been tested and is ready to be merged.

Related: we have the issue of ICMP Packet-Too-Big routing

I created a dedicated task for that issue: T253732

In the core sites, we have 3 servers for these (dns1001, dns1002, authdns1001, and similarly in codfw).

Will that be the final/permanent state?
If so, note that we only made Bird peer with the routers' loopbacks for ease of configuration:
2 neighbors to set per site, vs. a neighbor list per server (as they are in different VLANs).

ema moved this task from Triage to Network on the Traffic board. May 27 2020, 11:43 AM
ayounsi updated the task description. Fri, Jul 3, 7:24 AM