Introduction
This is the parent task for reducing toil involved in management of DNS/NTP hosts by removing manual configuration processes and the reliance on the Puppet repository to define their pooled state.
For a refresher: every DNS host is also an NTP host, serves the anycasted internal recursor, and is an authoritative DNS host, so a single box serves three roles (the rec and auth roles were merged in T330670). We have three nameservers, ns[0-2]. ns0 points to dns100[4-6] and ns1 points to dns200[4-6] via static routes on the core routers in the respective site. ns2 is anycasted and announced via bird from all sites, so ns2 traffic is essentially spread over all DNS hosts.
We currently have 14 DNS hosts: three hosts each in the core sites and two each in the edge sites.
Progress
- NTP
- ns[0-2] routes automation
- Replacing authdns_servers, recdns, ntp (Debian installer) with confctl
Problem Statement
As of today, when we have to perform maintenance work on a DNS host, such as a reboot or reimage, the process involved is:
- If nsX points to the host, update the static routes to remove the route to the host in question.
- This involves changes on both core routers in the given site.
- There is no review in this process; most Traffic members rely on checking the diff output before committing the change.
- Update DNS records for ntp.$site.wikimedia.org, such as ntp.eqiad.wikimedia.org, used by the install servers, to point them away from the host in question.
$ dig ntp.eqiad.wikimedia.org +short
dns1004.wikimedia.org.
208.80.154.6
Taking the above example, if we have to perform maintenance on dns1004, we update this CNAME to point to another DNS host in eqiad instead. This record has a TTL of one hour so we usually wait for an hour or perform this step in advance.
The changes involved are again manual: update this record in the DNS repository, run authdns-update and then revert when done.
- Remove the host from the Puppet repository, specifically from the authdns_servers key in hieradata/common.yaml.
authdns_servers:
  'dns1004.wikimedia.org': 208.80.154.6
  'dns1005.wikimedia.org': 208.80.154.153
  'dns1006.wikimedia.org': 208.80.154.77
  ...
- Run the Puppet agent on the following hosts to ensure complete removal of the above host:
sudo cumin 'A:cumin or A:dns-rec or A:netbox' 'run-puppet-agent'
- Removing a host additionally involves stopping the bird service to depool it from the anycast network.
- Complete maintenance and then revert the steps.
The above is a slow and error-prone process that takes up the full attention of a single engineer: once we start working on a DNS host we have to finish, because a host that is down or unavailable blocks other processes such as cookbooks and auth DNS updates. With the recent increase in the number of reboots and reimages, we felt the need to improve this and reduce the toil in the Traffic team.
Solution
We need to automate the above process and remove the manual configuration that defines which DNS host is pooled (or not) and also ensure that all other relevant configuration bits such as the NTP settings and static routes are included and adjusted automatically. The goal here is to do all of this via a cookbook: running a single command to do a rolling reboot or reimage of the DNS hosts without any human intervention and no manual Puppet changes :)
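To make the goal concrete, here is a minimal sketch of the control flow such a cookbook could follow. It is not the actual cookbook: the function name and the depool, action and repool callables are hypothetical placeholders for whatever mechanisms are eventually chosen.

```python
from typing import Callable, Iterable, List


def rolling_operation(
    hosts: Iterable[str],
    depool: Callable[[str], None],
    action: Callable[[str], None],
    repool: Callable[[str], None],
) -> List[str]:
    """Run `action` on one host at a time, depooling before and repooling after.

    Returns the hosts that were fully processed. If any step raises, the
    exception propagates, the failing host is left depooled, and no further
    hosts are touched.
    """
    done: List[str] = []
    for host in hosts:
        depool(host)   # hypothetical: remove the host from service
        action(host)   # hypothetical: reboot or reimage
        repool(host)   # hypothetical: put the host back in service
        done.append(host)
    return done


if __name__ == "__main__":
    log: List[str] = []
    rolling_operation(
        ["dns1004", "dns1005"],
        depool=lambda h: log.append(f"depool {h}"),
        action=lambda h: log.append(f"reboot {h}"),
        repool=lambda h: log.append(f"pool {h}"),
    )
    print("\n".join(log))
```

Processing strictly one host at a time, and stopping on the first failure, keeps at most one DNS host out of service due to the cookbook.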
We plan to achieve this by automating the following, step-by-step:
NTP automation
Instead of having individual records per site, we should point all autoinstall files to a single domain, ntp.anycast.wmnet, the equivalent of recdns.anycast.wmnet. This should point to the same anycast IP as the recdns hosts, 10.3.0.1. By doing this we no longer care when a single DNS host is down, as the next available one in the same site should be reached instead.
Once this is in operation and has been tested, we can remove all existing ntp.$site.wikimedia.org DNS records and thus no longer need to worry about updating this record.
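Before removing the per-site records, it may help to verify that the new name resolves as intended. The following is a small sketch of such a check, assuming the name and IP given above; the function names are hypothetical, and the resolver is injectable so that the check can be exercised without network access.

```python
import socket
from typing import Callable, List

NTP_NAME = "ntp.anycast.wmnet"  # name proposed in this task
ANYCAST_IP = "10.3.0.1"         # recdns anycast IP mentioned in this task


def resolve_ipv4(name: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses for `name`."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (address, port) tuple.
    return sorted({info[4][0] for info in infos})


def points_at_anycast(
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> bool:
    """True if NTP_NAME resolves to exactly the anycast IP."""
    return resolver(NTP_NAME) == [ANYCAST_IP]
```

Calling points_at_anycast() with no arguments uses the system resolver, so it only gives a meaningful answer from a host that can resolve .wmnet names.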
ns[0-2] routes automation
Instead of using static routes for the nameservers, we should use bird and announce them via BGP. This automates the handling of a host being down and removes the static routes as the primary path, which makes maintenance easier and improves visibility when a route goes down.
We can still keep the static routes as a backup, but they don't need to be the primary source of truth. No changes are required for ns2 as it is already anycasted.
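The intended behaviour can be illustrated with a toy model. This is not router or bird configuration, and the function name is hypothetical: it only expresses the idea that hosts currently announcing a nameserver IP via BGP are preferred, with the static routes used only when nobody is announcing.

```python
from typing import List, Set


def active_next_hops(bgp_announcers: Set[str], static_backup: List[str]) -> List[str]:
    """Hosts that would receive traffic for one nameserver IP in this toy model.

    Hosts announcing the IP via BGP win; the static routes are only a fallback.
    """
    if bgp_announcers:
        return sorted(bgp_announcers)
    return list(static_backup)


if __name__ == "__main__":
    static = ["dns1004", "dns1005", "dns1006"]
    announcers = {"dns1004", "dns1005", "dns1006"}

    # Stopping bird on dns1004 withdraws its announcement; no router change needed.
    announcers.discard("dns1004")
    print(active_next_hops(announcers, static))

    # With no announcers left, the static backup takes over.
    print(active_next_hops(set(), static))
```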
Replacing authdns_servers with confctl
The most important part of this automation is removing the dependency on the authdns_servers key in the Puppet repo as the source of pooled state for the DNS hosts. This key is used in a number of other places as well, but most notably in modules/profile/templates/dns/auth/wikimedia-authdns.conf.erb:
NAMESERVERS="<%= @authdns_servers.keys.join(' ') %>"
For example, NAMESERVERS above should not be derived from the Puppet config, which necessitates a manual change and Puppet agent runs, but from a more dynamic source such as confctl. Essentially, instead of relying on authdns_servers, we should shift to confctl so that we can control this list dynamically without Puppet changes.
- This requires more research on our end to see what is feasible and how it will be implemented.
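As a starting point for that research, here is a sketch of how the NAMESERVERS line could be rendered from pooled state rather than from hiera. The input mapping of hostname to a pooled flag is an assumption made for illustration and is not the actual confctl output format; the function name is hypothetical too.

```python
from typing import Mapping


def nameservers_line(pooled: Mapping[str, bool]) -> str:
    """Render the NAMESERVERS line using only the hosts marked as pooled.

    `pooled` maps hostname -> pooled flag (an assumed shape, not confctl's).
    """
    hosts = sorted(host for host, is_pooled in pooled.items() if is_pooled)
    return 'NAMESERVERS="{}"'.format(" ".join(hosts))


if __name__ == "__main__":
    state = {
        "dns1004.wikimedia.org": False,  # depooled for maintenance
        "dns1005.wikimedia.org": True,
        "dns1006.wikimedia.org": True,
    }
    print(nameservers_line(state))
```

The hosts are sorted here so that the rendered line is stable between runs; the current ERB template instead follows the key order in hiera.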
Consumers
In automating the above, the following projects will be affected and thus we need to consult with each of them:
- Existing cookbooks that push to the DNS hosts, such as the Netbox DNS update cookbook that calls authdns-update.
- The authdns-update script itself, which is used by a large number of SREs.
- NTP changes should affect FR-Tech and Debian installs [CC @Dwisehaupt / @MoritzMuehlenhoff]
- What about ntp_servers in homer/config/comm.yaml? We can maintain this list manually, as we don't modify it for regular maintenance work.
- Announcing ns[0-1] via bird/BGP instead of static routes [CC @ayounsi / @cmooney].
- Anything in the Puppet repository that uses the authdns_servers key in hieradata/common.yaml and thus considers it to be the source of truth for the definition of a pooled/active DNS host.
Challenges
- This is a significant change from our current setup, so we will roll it out slowly and gradually, similar to how we did it for T340479.
- The NTP peers for systemd-timesyncd, generated via modules/profile/manifests/systemd/timesyncd.pp, and resolv.conf for the DNS hosts themselves, generated in modules/profile/manifests/dns/recursor.pp, both consume the authdns_servers key. We need to make sure that we can maintain the existing automation there, especially when decommissioning and commissioning hosts.
Timeline
We are looking to work on this immediately with the understanding that we will not be pushing any changes beyond November, given that this is the last quarter of the year.
Ideal Goals
- We should also include a way to ease/automate the DNS depooling of sites that is currently performed via the DNS repository and involves these steps: a single-line code change, pushing to the repository and then running authdns-update. This step takes a while and cannot be performed immediately in case of an emergency because of the dependence on the Git commit. Automating this to be done through a single command would be ideal.