As our network of PoPs expands, it makes sense to start thinking about distributing critical services other than plain HTTP(S) to them. The obvious one that immediately comes to mind is authoritative DNS, as it is critical for serving user traffic efficiently. DNS can take a significant chunk of page load time with a cold or expired recursor cache (something that is unlikely for our sites, though).
Right now, we have three nameservers (ns0-1-2), one in each of eqiad/codfw/esams. ns0/1/2 are service IPs, each local to its PoP and pointing to a single server. Downtime on one of them, e.g. for a server reboot, means that clients (recursors) will have to retry their queries, imposing additional latency. Downtime on all three of them would be catastrophic. The three nameservers are in different parts of the world, and some recursors are (or can be) smart about selecting the lowest-latency authoritative server for a domain (SRTT); not all of them implement this, though, and those that do don't always [[ http://irl.cs.ucla.edu/data/files/papers/res_ns_selection.pdf | implement it well ]].
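
To illustrate what SRTT-based selection means for us, here is a minimal sketch of how a recursor might choose between ns0/1/2. This is a toy model and not the algorithm of any particular resolver: the class name, the smoothing weight `ALPHA` and the RTT figures are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

# Assumed weight given to the previous estimate; real recursors use
# different values and different decay rules.
ALPHA = 0.7


@dataclass
class SrttSelector:
    """Toy recursor-side model: track a smoothed RTT per nameserver."""

    srtt_ms: dict = field(default_factory=dict)

    def record(self, server: str, rtt_ms: float) -> None:
        """Fold a new RTT sample into the smoothed estimate for `server`."""
        old = self.srtt_ms.get(server)
        if old is None:
            self.srtt_ms[server] = rtt_ms
        else:
            self.srtt_ms[server] = ALPHA * old + (1 - ALPHA) * rtt_ms

    def pick(self, servers):
        """Pick the server with the lowest estimate.

        Servers without any sample yet count as 0 ms so that they get
        probed at least once.
        """
        return min(servers, key=lambda s: self.srtt_ms.get(s, 0.0))


selector = SrttSelector()
# Hypothetical RTTs as seen from one recursor.
selector.record("ns0", 20.0)
selector.record("ns1", 100.0)
selector.record("ns2", 150.0)
first_choice = selector.pick(["ns0", "ns1", "ns2"])

# ns0 is rebooted: the recursor waits for a (hypothetical) 2000 ms timeout,
# which is the extra latency the client sees, and only then moves on.
selector.record("ns0", 2000.0)
after_timeout = selector.pick(["ns0", "ns1", "ns2"])
```

In this model the recursor only moves away from ns0 after paying for a timeout, and a recursor without any such logic may keep hitting the dead server for a share of its queries.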
Rather than add yet another nameserver to ulsfo and possibly our future PoPs, it makes more sense to start thinking about setting up anycast for our DNS service IPs.
I've thought about this a bit and here's what I have come up with so far:
- We designate an IPv4 /24 (& IPv6 /48?) from our (limited) unused IP space as an anycast IP space that we will advertise from all of our PoPs. Either 198.35.27.0/24 or a /24 from 185.15.56.0/22 can be used for that; the latter may have been a martian before and there may be some risk in using it exclusively. If we're feeling generous and very risk-averse, we can assign two /24s from disparate subnets for extra protection against routing failures.
- Out of these 1-2 /24s, we assign 2-4 IPs for ns0-[13] as service IPs. The rest will remain unused unless we come up with some other useful service to be anycasted.
- We set up two servers or VMs per PoP to serve as AuthDNS (among others?), each listening to *all* 2-4 IPs.
- We set up those (//now global across our network!//) service IPs behind LVS in all of our sites. Pybal does BGP anyway and already supports DNS monitoring (for recdns).
- We add an option to Pybal to not advertise the IP over BGP if all of the realservers are marked as down. This would ensure that if all servers in one PoP are down, traffic would be rerouted (internally) to another PoP. This may be essentially an alternative action to depool-threshold (rather than stop depooling, stop announcing the IP) and could be generally useful for anycast'ing even TCP services internally.
- We configure static routes of lower metric than BGP to point to one of the realservers, so that if all servers across all sites are marked as down (because of a misconfiguration or broken Pybal version), nameservers would still be reachable. We do this already for other Pybal endpoints as well, but this is even more important here because of the absence of depool-threshold.
- Optional: on Pybal, we weight traffic asymmetrically, e.g. for ns0 we send 90% to box1 and 10% to box2, and vice versa for ns1. This ensures that a) traffic is load-balanced between the two servers, b) each of them gets a significant portion of the traffic for half of the IPs (easier troubleshooting, DDoS protection), and c) each of them also gets a small portion of the traffic for the other IP, to make sure that everything is working when the time comes.
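
The per-PoP behaviour described in the last few points can be sketched roughly as follows. This is only a model of the decision logic, not Pybal code: `WEIGHTS`, `plan_pop` and the box names are hypothetical, and the real feature would live in Pybal's BGP and pooling code.

```python
# Hypothetical per-service weights inside one PoP: ns0 prefers box1,
# ns1 prefers box2, and each box still sees a little of the other IP.
WEIGHTS = {
    "ns0": {"box1": 90, "box2": 10},
    "ns1": {"box1": 10, "box2": 90},
}


def plan_pop(weights, healthy):
    """Decide, per service IP, whether this PoP announces it and how
    traffic is split between the realservers that pass monitoring.

    Returns {service: (announce, {box: share})}. When no realserver is
    healthy, the IP is withdrawn (announce=False) instead of keeping
    broken servers pooled, so that BGP reroutes traffic to another PoP.
    """
    plan = {}
    for service, boxes in weights.items():
        live = {box: w for box, w in boxes.items() if box in healthy}
        total = sum(live.values())
        if total == 0:
            plan[service] = (False, {})
        else:
            shares = {box: w / total for box, w in live.items()}
            plan[service] = (True, shares)
    return plan


both_up = plan_pop(WEIGHTS, {"box1", "box2"})
box1_down = plan_pop(WEIGHTS, {"box2"})
all_down = plan_pop(WEIGHTS, set())
```

With both boxes up, each service IP is announced with the 90/10 split; with box1 down, box2 takes all traffic for both IPs and the PoP keeps announcing; with both down, the PoP withdraws both IPs. The static fallback routes would sit outside this logic, on the routers.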