
Anycast AuthDNS
Open, LowPublic

Description

As our network of PoPs expands, it makes sense to start thinking about distributing critical services other than pure HTTP(S) to them. The obvious one that immediately comes to mind is authoritative DNS, as it is critical for serving user traffic efficiently. DNS can take a significant chunk of page load time when the recursor's cache is cold or expired (something that is unlikely for our sites, though).

Right now, we have three nameservers (ns0-1-2), one in each of eqiad/codfw/esams. ns0/1/2 are service IPs, each local to its PoP and pointed at a single server. Downtime on one of them, e.g. for a server reboot, means that clients (recursors) will have to repeat a query, imposing additional latency. Downtime on all three of them would be catastrophic. The three nameservers are in different parts of the world, and some recursors are/can be smart about selecting the lowest-latency authoritative server for a domain (SRTT); not all of them implement this, though, and those that do don't always implement it well.

Rather than add yet another nameserver to ulsfo and possibly our future PoPs, it makes more sense to start thinking about setting up anycast for our DNS service IPs.

I've thought about this a bit and here's what I have come up with so far:

  • We designate an IPv4 /24 (& IPv6 /48?) from our (limited) unused IP space as an anycast IP space that we will advertise from all of our PoPs. Either 198.35.27.0/24 or a /24 from 185.15.56.0/22 can be used for that; the latter may have been a martian before, and there may be some risk in using it exclusively. If we're feeling generous and very risk-averse, we can assign two /24s from disparate subnets for extra protection against routing failures.
  • Out of these 1-2 /24s, we assign 2-4 IPs for ns0-[13] as service IPs. The rest will remain unused unless we come up with some other useful service to be anycasted.
  • We set up two servers or VMs per PoP to serve as AuthDNS (among others?), each listening on *all* 2-4 IPs.
  • We set up those (now global across our network!) service IPs behind LVS in all of our sites. Pybal does BGP anyway and already supports DNS monitoring (for recdns).
  • We add an option to Pybal to not advertise the IP over BGP if all of the realservers are marked as down. This would ensure that if all servers in one PoP are down, traffic would be rerouted (internally) to another PoP. This is essentially an alternative action to depool-threshold (rather than stop depooling, stop announcing the IP) and could be generally useful for anycasting even TCP services internally (see the sketch after this list).
  • We configure static routes, less preferred than the BGP-learned routes (i.e. a fallback), pointing to one of the realservers, so that if all servers across all sites are marked as down (because of a misconfiguration or a broken Pybal version), the nameservers would still be reachable. We do this already for other Pybal endpoints as well, but it is even more important here because of the absence of depool-threshold.
  • Optional: in Pybal, we weight e.g. ns0's traffic 90/10 between box1 and box2, and the reverse (10/90) for ns1; this ensures that a) traffic is load-balanced between the two servers, b) each of them gets the bulk of the traffic for half of the IPs (easier troubleshooting, DDoS protection), and c) each of them also gets a small portion of the other IP's traffic, to make sure that everything is working when the time comes.
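
Below is a minimal sketch (in Python, which is what Pybal is written in) of the "stop announcing when every realserver is down" behaviour proposed above. The class names, the healthcheck callback and the BGP stand-in are purely illustrative assumptions, not actual Pybal code:

```
# Hypothetical sketch of the proposed Pybal option: rather than refusing to
# depool below depool-threshold, withdraw the BGP route for the service IP
# when *all* realservers are marked down, so anycast routing shifts traffic
# to another PoP. AnycastService/BGPSession are illustrative, not real Pybal classes.
from dataclasses import dataclass, field


@dataclass
class BGPSession:
    """Stand-in for the BGP speaker; real Pybal talks to the routers itself."""
    announced: set = field(default_factory=set)

    def announce(self, prefix: str) -> None:
        if prefix not in self.announced:
            self.announced.add(prefix)
            print(f"announcing {prefix}")

    def withdraw(self, prefix: str) -> None:
        if prefix in self.announced:
            self.announced.remove(prefix)
            print(f"withdrawing {prefix}")


@dataclass
class AnycastService:
    prefix: str        # service IP announced via BGP, e.g. "198.35.27.1/32" (placeholder)
    realservers: dict  # server name -> is_up
    bgp: BGPSession

    def on_healthcheck(self, server: str, is_up: bool) -> None:
        """Called after each monitoring result for one realserver."""
        self.realservers[server] = is_up
        if any(self.realservers.values()):
            # At least one backend is healthy: keep (or restore) the announcement.
            self.bgp.announce(self.prefix)
        else:
            # All backends down: stop announcing so another PoP absorbs the traffic.
            self.bgp.withdraw(self.prefix)


if __name__ == "__main__":
    svc = AnycastService("198.35.27.1/32", {"box1": True, "box2": True}, BGPSession())
    svc.on_healthcheck("box1", False)  # still announced: box2 is up
    svc.on_healthcheck("box2", False)  # both down: route withdrawn
    svc.on_healthcheck("box1", True)   # back up: route re-announced
```

Combined with the static fallback routes in the previous bullet, a withdrawn announcement only ever shifts traffic; it never blackholes it.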

Event Timeline

faidon created this task. May 4 2015, 2:23 PM
faidon raised the priority of this task to Low.
faidon updated the task description.
faidon added projects: acl*sre-team, Traffic.
faidon added subscribers: faidon, BBlack.
Restricted Application added a subscriber: Aklapper. May 4 2015, 2:23 PM
BBlack added a comment (edited). May 6 2015, 5:30 PM

Just tracking some stuff from an IRC conversation:

  • The past-martianness of 185.15.56.0/22 probably isn't a pragmatic issue and can be ignored. It was only a past-martian because it was unallocated; it was allocated to RIPE and removed from such lists back in Feb 2011. In general, many such networks are now in active use due to address depletion, so by now most network admins should have figured out that they can't use outdated lists for this stuff.
  • We've only got, effectively, 4x /24 left in our current spaces to allocate to this and upcoming future PoPs: 198.35.27.0/24 + 185.15.5[678].0/24. 185.15.59.0/24 is currently in use for knams<->esams stuff, although that could maybe be moved to free up another if necessary.
  • It's probably ok to go ahead and be generous/risk-averse and plan to use 2x disparate /24's for anycast, which would be 198.35.27.0/24 + 185.15.56.0/24, leaving room to create addressing for our next two (or maybe three) PoPs before we have to request/buy more space somewhere.
elukey added a subscriber: elukey. Apr 21 2016, 3:36 PM

Change 286066 had a related patch set uploaded (by BBlack):
note future anycast networks

https://gerrit.wikimedia.org/r/286066

Actually, amending the network thoughts above: we should use 198.35.27.0/24 + 185.15.58.0/24. Using 58 instead of 56 leaves us a contiguous /23 for more future flexibility, instead of two broken-up /24's.
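
As a quick sanity check of the subnet arithmetic (just a throwaway sketch using Python's ipaddress module, not part of any tooling):

```
import ipaddress

# 185.15.56.0/22 splits into four /24s; .59 is already used for knams<->esams.
block = ipaddress.ip_network("185.15.56.0/22")
print(list(block.subnets(new_prefix=24)))
# [185.15.56.0/24, 185.15.57.0/24, 185.15.58.0/24, 185.15.59.0/24]

# Taking .58 for anycast leaves .56 + .57, which collapse into one aligned /23:
print(list(ipaddress.collapse_addresses(
    [ipaddress.ip_network("185.15.56.0/24"), ipaddress.ip_network("185.15.57.0/24")])))
# [185.15.56.0/23]

# Taking .56 instead would leave .57 + .58, which cannot form a /23
# (a /23 has to start on an even /24 boundary):
print(list(ipaddress.collapse_addresses(
    [ipaddress.ip_network("185.15.57.0/24"), ipaddress.ip_network("185.15.58.0/24")])))
# [185.15.57.0/24, 185.15.58.0/24]
```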

Change 286066 merged by BBlack:
note future anycast networks

https://gerrit.wikimedia.org/r/286066

mark added a subscriber: mark. Jun 3 2016, 1:58 PM
BBlack moved this task from Triage to DNS Infra on the Traffic board. Sep 30 2016, 2:11 PM
ayounsi added a subscriber: ayounsi. Apr 3 2017, 7:16 PM
ayounsi moved this task from Backlog to Configuration on the netops board. Jun 27 2017, 2:51 PM

Change 391149 had a related patch set uploaded (by BBlack; owner: Ayounsi):
[operations/puppet@production] Have every rdns advertise a private anycast VIP

https://gerrit.wikimedia.org/r/391149

Change 392635 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] dnsrecursor: send hostname in version responses

https://gerrit.wikimedia.org/r/392635

Change 391149 merged by BBlack:
[operations/puppet@production] Have every rdns advertise a private anycast VIP

https://gerrit.wikimedia.org/r/391149

Change 392635 merged by BBlack:
[operations/puppet@production] dnsrecursor: send hostname in version responses

https://gerrit.wikimedia.org/r/392635

Change 393668 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Bird: add monitoring to the VIP and bird process

https://gerrit.wikimedia.org/r/393668

Change 393668 merged by Ayounsi:
[operations/puppet@production] Bird: add monitoring to the VIP and bird process

https://gerrit.wikimedia.org/r/393668

faidon moved this task from Configuration to Troubleshooting on the netops board. Aug 3 2018, 1:57 AM

Some interesting stuff here (see also the Mailing Lists link there in the datatracker for discussion): https://datatracker.ietf.org/doc/draft-moura-dnsop-authoritative-recommendations/?include_text=1

jbond added a subscriber: jbond. Mar 13 2019, 2:23 PM
jbond added a comment. Mar 13 2019, 6:53 PM

Some comments for consideration

We set up those (now global across our network!) service IPs behind LVS in all of our sites.

Is auth DNS already behind LVS? Putting anything in front of production auth DNS is always a red flag to me, especially if it keeps state. If possible I would have the DNS servers talk BGP directly to the edge routers and have the edge routers configured with ECMP (disclaimer: I'm not familiar with LVS, so this could be an unfounded fear).

We configure static routes, less preferred than the BGP-learned routes, pointing to one of the realservers

Another option: if you have a contiguous /23 to use for the anycast prefix, then you can have all instances advertise both the /24 and the /23. To depool a server you simply withdraw the /24 prefix. BGP will always prefer the more specific prefix, so any nodes advertising only the /23 will not receive traffic; and in the case that depooling goes wrong and depools all servers (withdraws the /24 everywhere), routing falls back to the /23, meaning everything serves traffic. We used this effectively and tested it heavily at a previous job. Note you would also need a /47 for IPv6 to do the same thing there as well.
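
A toy illustration of that longest-prefix-match behaviour (the prefixes and site names below are placeholders, and the dict lookup is only a stand-in for what the routers do):

```
import ipaddress

def best_route(dst: str, routes: dict):
    """Return the longest matching prefix and its advertisers, like a FIB lookup."""
    addr = ipaddress.ip_address(dst)
    matches = [p for p in routes if addr in p]
    if not matches:
        return None
    best = max(matches, key=lambda p: p.prefixlen)
    return best, routes[best]

covering = ipaddress.ip_network("10.99.0.0/23")  # placeholder: advertised by every instance
specific = ipaddress.ip_network("10.99.0.0/24")  # placeholder: advertised only by pooled instances

routes = {covering: ["site-a", "site-b", "site-c"], specific: ["site-a", "site-b"]}
print(best_route("10.99.0.10", routes))  # -> the /24: only pooled instances get traffic

del routes[specific]                     # depool gone wrong: /24 withdrawn everywhere
print(best_route("10.99.0.10", routes))  # -> falls back to the /23: everything still serves
```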

Out of these 1-2 /24s, we assign 2-4 IPs for ns0-[13]

I think we should knock this down to two servers; I don't see an advantage of maintaining 4 NS servers going forward, it only adds complexity.

Thanks for your comments!

Is auth DNS already behind LVS?

AuthDNS is not behind LVS; we currently have static routes on the routers to redirect the VIPs to the proper machines.
See for example: https://wikitech.wikimedia.org/wiki/Service_restarts#Authoritative_DNS

If possible I would have the DNS servers talk BGP directly to the edge routers and have the edge routers configured with ECMP (disclaimer: I'm not familiar with LVS, so this could be an unfounded fear).

Indeed. That's what we're currently (slowly) experimenting with, for the recursive DNS servers, see https://wikitech.wikimedia.org/wiki/Anycast_recursive_DNS
If it's successful for the internal recursive DNS, we're considering doing the same with the public Authoritative DNS.

Another option: if you have a contiguous /23 to use for the anycast prefix, then you can have all instances advertise both the /24 and the /23. To depool a server you simply withdraw the /24 prefix.

The idea of using two distinct /24s is to reduce the risk of a single BGP typo on the internet taking down/redirecting/etc. all of our DNS. The downside is that we're "wasting" many IPs.
Afaik, no decision has been made about using one or two /24s, but a contiguous /23 would have the downside without the advantage.
Also, a static route is a "last resort" for any kind of dynamic-system failure (be it BGP or Pybal, etc.). Advertising a /23 plus /24s is an interesting depool idea though!

I think we should knock this down to two servers; I don't see an advantage of maintaining 4 NS servers going forward, it only adds complexity.

With anycast, the more distributed the servers, the better, as they get closer to users and reduce latency. So we should also have them in PoPs.

In terms of NS records, there are different options with performance/redundancy/"cost" tradeoffs.
If for example we set:
ns0: Anycast IP from one /24
ns1: Anycast IP from the second /24
We get the best in terms of performance (only anycast IPs) and redundancy (IPs from two different prefixes), but at the "cost" of more wasted IPs.

If we only set:
ns0: Anycast IP from one /24
we get performance and low cost, but no redundancy.

Last, if we set:
ns0: Anycast IP from one /24
ns1: eqiad NS server IP
We get redundancy and low cost, but we sacrifice performance, as in theory only half of the clients will hit the closest anycast server.

Volans added a subscriber: Volans. Mar 13 2019, 11:41 PM

Thanks for the response

In the last option the anycast prefix should get more than 50% of the traffic due to the SRTT algorithm mentioned by bblack, but I take your point. Also worth mentioning that RSSAC has a working group[1] which should update the study referenced in the first post, and I believe it has some of the authors from that original study.

i don't see an advantage of maintaining 4 NS servers going forward

I think I'll retract this, as I now see the other option:
ns0-2: remain as they are
ns3: new anycast prefix

[1] https://www.icann.org/en/system/files/files/rssac-sow-resolver-behaviors-07aug18-en.pdf

jijiki added a subscriber: jijiki. Jul 29 2019, 4:21 PM
BBlack added a comment (edited). Aug 15 2019, 2:48 PM

General status updates and planning, for this very old ticket which is still on the radar!

T186550 and T228190 cover anycasting our internal recdns, which is nearing completion and will probably be mostly done by EOQ. The basic model there is that the dnsX00N hosts (two per site, at every site including the far-flung edges) run pdns-recursor, bird, some healthchecking stuff and systemd rules (e.g. the BGP daemon requires the recdns daemon to already be running, etc.), and advertise 10.3.0.1 via BGP to the local routers.
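
As a rough illustration of the healthchecking side (the actual setup relies on existing tooling and the systemd dependencies mentioned above; the probe target, query name and exit-code contract below are assumptions, not the deployed configuration), the idea is simply "only keep advertising 10.3.0.1 while the local recursor actually answers queries":

```
#!/usr/bin/env python3
# Minimal DNS liveness probe: send one UDP query to the local recursor and
# exit 0/1 depending on whether any well-formed response comes back. A wrapper
# (healthcheck tooling / systemd) would use the exit code to decide whether the
# anycast VIP should keep being advertised by bird.
import random
import socket
import struct
import sys


def dns_probe(server: str, qname: str, timeout: float = 2.0) -> bool:
    txid = random.randint(0, 0xFFFF)
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)  # RD=1, one question
    question = b"".join(bytes([len(p)]) + p.encode() for p in qname.split(".")) + b"\x00"
    question += struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(header + question, (server, 53))
            data, _ = s.recvfrom(512)
        except OSError:
            return False
    # The response must echo our transaction ID and have the QR (response) bit set.
    return len(data) >= 4 and data[:2] == struct.pack(">H", txid) and bool(data[2] & 0x80)


if __name__ == "__main__":
    sys.exit(0 if dns_probe("127.0.0.1", "wikimedia.org") else 1)
```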

The current draft plan for working towards anycasting AuthDNS in stages looks something like this:

  1. Parallelize and/or otherwise improve authdns-update. Currently this runs serially through the 3x authdns servers one by one, sshing to each and executing all the check->deploy steps. Since the rest of the anycast authdns plan involves many more authdns servers (temporarily as many as 13 initially, eventually settling down to 10, but then +2 for each future edge site down the road...), we'll need to at least parallelize the DNS patch deployment process (see the fanout sketch after this list), and possibly look at other related bits for dns-discovery, and at tolerating site-isolation failures in better ways, etc.
  2. Bring up local authdns instances in all of the current recdns clusters globally (the dnsX00N machines), with our current ns[012] IPs defined on the loopback as usual, but without any advertised routing for public authdns traffic. This basically spins up 10 more authdnses (in addition to the current public three), whose sole purpose is to locally answer authdns requests over the loopback for all of our anycasted recdnses, making them more performant and reliable at their own job, which is mostly resolving our own domains via queries to our authdns. We'll need to set up dependencies and/or healthchecks here to ensure that the local authdns daemons are running reliably as a prerequisite to starting the recdns daemon and/or advertising the internal recdns anycast.
  3. Start advertising the current unicast ns[012] IPs via bird with healthchecking, from the dnsX00N hosts to their local routers. Use ECMP at the router for these, and then deprioritize and later withdraw the current static routes to the legacy authdns machines. This moves public authdns resolution to the dnsX00N hosts' authdns daemons as well, but using the existing unicast public IPs, which are only advertised from codfw, eqiad, and esams respectively. It also allows us to decom those legacy authdns machines without replacement, making dnsX00N the DNS clusters for both recdns and authdns, and it improves the resiliency of our existing unicast authdns by having a pair of machines active at each of the 3 sites with the current unicast authdns IPs.
  4. Define/allocate our public authdns anycast IP(s), including resolving the design questions around how many /24's and which ones; define these on the DNS cluster machines' loopbacks with bird routing advertisements towards the routers as well, and begin advertising the anycast authdns space(s) from all of our edge routers to the world.
  5. Decide on a target solution for our delegation NS records upstream at the TLD servers / whois, and the steps to get there. For example, we might initially add a fourth (and fifth, if we go with 2x anycast IPs) NS IP to the set without touching the existing ns[012] IPs, and then later, as our comfort level grows, withdraw the unicasts over time until it's all-anycast. There's also some naming bikeshedding to do here about the new 1-2 nameserver names, so we don't end up with just ns3.wikimedia.org and ns4.wikimedia.org inexplicably as the names of our only public NS IPs. I like nsa.wikimedia.org and nsb.wikimedia.org myself, since having the hostname nsa is kind of amusing :)
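
For step 1, a rough sketch of what parallelizing the authdns-update fanout could look like (host names, the remote command and the error handling are placeholders; the real tool's per-host check->deploy steps are more involved):

```
# Hypothetical parallel ssh fanout for deploying a DNS change to many authdns
# hosts at once instead of serially; everything named here is a placeholder.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

AUTHDNS_HOSTS = [  # placeholder host list; the real one would come from config
    "dns1001.example.org", "dns1002.example.org",
    "dns2001.example.org", "dns2002.example.org",
]


def deploy(host: str) -> tuple[str, bool, str]:
    """Run the (placeholder) check+deploy step on one host over ssh."""
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "authdns-check-and-deploy"],
            capture_output=True, text=True, timeout=300,
        )
    except subprocess.TimeoutExpired:
        return host, False, "timed out"
    return host, proc.returncode == 0, proc.stdout + proc.stderr


def main() -> int:
    failures = 0
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {pool.submit(deploy, h): h for h in AUTHDNS_HOSTS}
        for fut in as_completed(futures):
            host, ok, output = fut.result()
            print(f"{'OK  ' if ok else 'FAIL'} {host}")
            if not ok:
                failures += 1
                print(output)
    return 1 if failures else 0


if __name__ == "__main__":
    raise SystemExit(main())
```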
BBlack renamed this task from Anycast (Auth)DNS to Anycast AuthDNS. Tue, Nov 5, 6:08 PM