
Anycast: consistent routers->servers routing
Open, Stalled, Low, Public

Description

The scenario we need to prevent is the following:

  1. A user establishes a TCP session to an anycasted VIP; packets flow, for example, through: user -> transit_A -> cr1 -> server1
  2. A routing change on the Internet happens and the user's path becomes: user -> transit_B -> cr2 -> server2

This would cause the TCP session to break as server2 has no knowledge of that client.

The two possible ways for the infra to handle that use case are:

  • A) user -> transit_B -> cr2 -> server1 consistently
  • B) user -> transit_B -> cr2 -> cr1 -> server1

As hashing is done on source/dest L3 headers, the cr1->server1 step is consistent.
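To make the per-router consistency concrete, here is a minimal sketch of hash-based next-hop selection keyed on the L3 source/destination pair. It is an illustration only: the hash function, the addresses (documentation ranges) and the `pick_next_hop` helper are assumptions, not Juniper's actual algorithm.

```python
import hashlib


def pick_next_hop(src_ip: str, dst_ip: str, next_hops: list[str]) -> str:
    """Pick a next hop from a hash of the L3 source/destination pair."""
    digest = hashlib.sha256(f"{src_ip}>{dst_ip}".encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical values: two servers behind one router, one client, one VIP.
servers = ["server1", "server2"]
client = "203.0.113.7"
vip = "198.51.100.1"

first = pick_next_hop(client, vip, servers)

# Every packet of the same flow hashes to the same server on this router,
# as long as the set of next hops does not change.
assert all(pick_next_hop(client, vip, servers) == first for _ in range(1000))
print(f"{client} -> {vip} always lands on {first}")
```

The guarantee only holds per router and per next-hop set; nothing here says that a second router, with its own hash seed or next-hop ordering, would pick the same server.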

About A): so far I can't find any mention of it in Juniper's consistent hashing docs (here and there), which means we would need to test it to find out whether it works the way we want, and that seems unlikely.
Regardless, Juniper's consistent LB only works with single-hop eBGP, while we currently do multi-hop (as the servers peer with the routers' loopbacks). It is also only supported on MPCs, so maybe not on MX204s.

As a side note, Juniper's consistent LB means that if the server pool is >2, adding/removing a server will not reshuffle all the sessions, which we don't need to care about right now as we only have 2 servers per site.
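The difference between plain and consistent hashing when a pool shrinks can be sketched as follows. This uses rendezvous (highest-random-weight) hashing as a stand-in for "consistent" behaviour; it is not a description of how Juniper implements it, and the server names and flow keys are hypothetical.

```python
import hashlib


def _score(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def modulo_pick(flow: str, servers: list[str]) -> str:
    """Plain hashing: index into the pool with hash modulo pool size."""
    return servers[_score(flow) % len(servers)]


def rendezvous_pick(flow: str, servers: list[str]) -> str:
    """Rendezvous hashing: each server gets a per-flow score, highest wins."""
    return max(servers, key=lambda s: _score(f"{flow}|{s}"))


def moved_survivors(pick, flows, before, after) -> int:
    """Count flows that change server although their original server is still in the pool."""
    return sum(
        1 for f in flows
        if pick(f, before) in after and pick(f, before) != pick(f, after)
    )


flows = [f"203.0.113.{i}>198.51.100.1" for i in range(1, 201)]
before = ["server1", "server2", "server3"]
after = ["server1", "server2"]  # server3 withdrawn

print("modulo, reshuffled survivors:    ", moved_survivors(modulo_pick, flows, before, after))
print("rendezvous, reshuffled survivors:", moved_survivors(rendezvous_pick, flows, before, after))
```

With rendezvous hashing, only the flows that were on the withdrawn server move; with modulo hashing, flows on the surviving servers can move as well.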

However, it is possible to use BGP MEDs to achieve B), by having all servers attach a higher MED to a prefix when they advertise that prefix to the same router.
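The MED idea can be illustrated with a heavily simplified model of BGP best-path selection (local preference, then AS-path length, then MED, then eBGP over iBGP). The assumptions are mine: cr2 is the router that receives the higher MED, the MED values 0 and 100 are arbitrary, all servers share one AS so that their MEDs are comparable, and multipath is enabled so that equal routes form an ECMP set. Real implementations have more tie-breakers than shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    learned_from: str
    local_pref: int
    as_path_len: int
    med: int
    is_ebgp: bool


def preference_key(route: Route) -> tuple:
    """Smaller tuple means more preferred (simplified decision process)."""
    return (-route.local_pref, route.as_path_len, route.med, 0 if route.is_ebgp else 1)


def best_paths(routes: list[Route]) -> list[Route]:
    """Return every route tied for best, i.e. the ECMP set under multipath."""
    best = min(preference_key(r) for r in routes)
    return [r for r in routes if preference_key(r) == best]


# cr1 hears the VIP from both servers with the low MED, and from cr2 over iBGP.
cr1_view = [
    Route("server1", 100, 1, med=0, is_ebgp=True),
    Route("server2", 100, 1, med=0, is_ebgp=True),
    Route("cr2", 100, 1, med=100, is_ebgp=False),
]
# cr2 hears the VIP from both servers with the high MED, and from cr1 over iBGP.
cr2_view = [
    Route("server1", 100, 1, med=100, is_ebgp=True),
    Route("server2", 100, 1, med=100, is_ebgp=True),
    Route("cr1", 100, 1, med=0, is_ebgp=False),
]

print("cr1 forwards to:", [r.learned_from for r in best_paths(cr1_view)])
print("cr2 forwards to:", [r.learned_from for r in best_paths(cr2_view)])
```

Because MED is evaluated before the eBGP-over-iBGP step, cr2 prefers the path through cr1 in this model, so only cr1 ever hashes flows onto servers, which is option B.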

Edit: according to https://apps.juniper.net/feature-explorer/feature-info.html?fKey=6434&fn=Consistent%20load%20balancing%20for%20ECMP%20groups, consistent-hash is available on MX204s.

Event Timeline

ayounsi triaged this task as Medium priority. May 26 2020, 5:23 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Even if we could experimentally verify option A, we probably can't trust it across future firmware differences (between sites, or between two routers in a site). Option B via MEDs sounds like a good path forward for now, though!

Related: we have the issue of ICMP Packet-Too-Big routing: AFAIK Juniper doesn't even try to route a PTB from an intermediate router to the same server as the primary traffic it was referencing. This probably isn't a major issue for the authdns case, because (a) the client recursors should mostly be on server (rather than eyeball) networks with full MTU and (b) the overwhelming majority of all traffic is UDP with packet sizes small enough to fit any reasonable network. However, it would be nice to be correct for edge cases like recursors in eyeball networks with MTU problems, and to be future-proof against increasing TCP usage (for cookie init and other blind-injection-avoidance, and also DoTLS and future DNSSEC packet size increases). Cloudflare's generic answer to this problem has been https://github.com/cloudflare/pmtud, but there might be different and/or simpler approaches we want to try as well.
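A small sketch of why a PTB can land on the wrong server, assuming the router hashes on the outer source/destination addresses only. The hash, the addresses and the dictionary layout are hypothetical. Hashing on the quoted header is shown purely to illustrate what "same server as the primary traffic" would require; it is not a claim about what Juniper or pmtud does.

```python
import hashlib


def pick_server(src_ip: str, dst_ip: str, servers: list[str]) -> str:
    digest = hashlib.sha256(f"{src_ip}>{dst_ip}".encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["server1", "server2"]
client, vip, hop = "203.0.113.7", "198.51.100.1", "192.0.2.254"

# Normal client traffic toward the VIP.
flow_server = pick_server(client, vip, servers)

# A PTB sent by an intermediate hop in response to a server -> client packet:
# the outer header is hop -> VIP, and the ICMP payload quotes the original
# VIP -> client header.
ptb = {"outer_src": hop, "outer_dst": vip, "quoted_src": vip, "quoted_dst": client}

# Hashing on the outer header uses the hop's address, not the client's,
# so the result is unrelated to where the flow itself is pinned.
outer_choice = pick_server(ptb["outer_src"], ptb["outer_dst"], servers)

# Hashing on the quoted header (reversed back to client -> VIP) reproduces
# the flow's own hash input and therefore the same server.
quoted_choice = pick_server(ptb["quoted_dst"], ptb["quoted_src"], servers)

print("flow pinned to:       ", flow_server)
print("PTB via outer header: ", outer_choice)
print("PTB via quoted header:", quoted_choice)
```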

Also:

> As a side note, Juniper's consistent LB means that if the server pool is >2, adding/removing a server will not reshuffle all the sessions, which we don't need to care about right now as we only have 2 servers per sites.

In the core sites, we have 3 servers for these (dns1001, dns1002, authdns1001, and similarly in codfw).

Change 598836 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Anycast: introduce new "deterministic" variable

https://gerrit.wikimedia.org/r/598836

Mentioned in SAL (#wikimedia-operations) [2020-05-27T07:54:51Z] <XioNoX> test new bird conf on dns4001 - T253666

> Option B via MEDs sounds like a good path forward for now, though!

https://gerrit.wikimedia.org/r/598836 has been tested and is ready to be merged.

> Related: we have the issue of ICMP Packet-Too-Big routing

I created a dedicated task for that issue: T253732

> In the core sites, we have 3 servers for these (dns1001, dns1002, authdns1001, and similarly in codfw).

Will that be the final/permanent state?
If so, note that we only made Bird peer with the routers' loopbacks for ease of configuration: 2 neighbors to set per site, vs. a list of neighbors per server (as they are in different VLANs).

ayounsi changed the task status from Open to Stalled. Aug 3 2020, 6:59 AM
ayounsi lowered the priority of this task from Medium to Low.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!