In continuation of the work carried out under T347054, we want to reduce the toil and SPOFs around our DNS work. This also includes our NTP setup, given that we run ntpd on the DNS boxes and systemd-timesyncd as the NTP client.
Introduction
We have three important uses/current implementations around our NTP setup:
- For the Debian installer, we use the anycast address ntp.anycast.wmnet as the NTP server under modules/install_server/files/autoinstall/common.cfg:
d-i clock-setup/ntp-server string ntp.anycast.wmnet
This address is announced by all the DNS boxes so the install server should connect to the one closest to it. There is nothing more to be done here.
- For the DNS hosts themselves, where we are running ntpd, we generate the list of NTP peers automatically in modules/profile/manifests/dns/recursor.pp. This list is generated statically from the authdns_servers Hiera key. We do want to get rid of this as well at some point but that's for another task. Note that this peer list is only for the DNS hosts and does not directly affect the main consumers, which are all the other hosts.
- For the clients themselves, we use P:systemd::timesyncd as the NTP client and this is what this task is about. The current list of these hosts is:
sukhe@cumin1002:~$ sudo cumin "P:systemd::timesyncd"
2184 hosts will be targeted:
Current problem with the P:systemd::timesyncd NTP servers list
The current NTP servers are generated in modules/profile/manifests/systemd/timesyncd.pp and look like:
# For historical context, this array was manually managed via
# hieradata/$::site/profile/systemd/timesyncd.yaml.
#
# To set ntp_servers in a site, use the ntp_peers under it and the peers of
# the closest core site, which we determine from $::datacenters_tree.
if $ntp_servers == undef {
  $_ntp_servers = [$ntp_peers[$::site], $ntp_peers[$site_nearest_core[$::site]]].flatten
} else {
  $_ntp_servers = $ntp_servers
}

class {'systemd::timesyncd':
  ensure      => $ensure,
  ntp_servers => $_ntp_servers,
}
The logic here is pretty simple: for a host in a given site, the list of these servers is the list of the DNS boxes in that site plus the list of the DNS boxes in the nearest core site. For say cp7001 in magru, this list looks like:
sukhe@cp7001:~$ cat /etc/systemd/timesyncd.conf
## THIS FILE IS MANAGED BY PUPPET
[Time]
Servers=dns7001.wikimedia.org dns7002.wikimedia.org dns1004.wikimedia.org dns1005.wikimedia.org dns1006.wikimedia.org
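To make the derivation explicit, here is a minimal Python sketch of the same logic (not the Puppet code itself). The dictionaries NTP_PEERS and SITE_NEAREST_CORE are hypothetical stand-ins for $ntp_peers and $site_nearest_core, filled in only with the hosts from the cp7001 example; the site name "eqiad" for the dns100x hosts is an assumption of this sketch.

```python
# Hypothetical stand-ins for the Puppet data ($ntp_peers, $site_nearest_core).
# Only the hosts from the cp7001 example are listed; "eqiad" is assumed to be
# the core site that hosts dns1004-dns1006.
NTP_PEERS = {
    "magru": ["dns7001.wikimedia.org", "dns7002.wikimedia.org"],
    "eqiad": [
        "dns1004.wikimedia.org",
        "dns1005.wikimedia.org",
        "dns1006.wikimedia.org",
    ],
}
SITE_NEAREST_CORE = {"magru": "eqiad"}


def ntp_servers_for(site, override=None):
    """Return the override if one is set, else site peers + nearest core peers."""
    if override is not None:
        return list(override)
    return NTP_PEERS[site] + NTP_PEERS[SITE_NEAREST_CORE[site]]


def render_servers_line(servers):
    """Render the Servers= line as it appears in the file shown above."""
    return "Servers=" + " ".join(servers)


magru_servers = ntp_servers_for("magru")
servers_line = render_servers_line(magru_servers)
```

The result depends only on data known at Puppet compile time, so nothing in this list changes when a DNS host is depooled at runtime.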
Note that the above is generated statically by Puppet. But in T347054, we manage the state of the DNS hosts themselves dynamically via confd. This can result in a situation where a given DNS host has been depooled but, unless it is also removed from Puppet (which we don't do unless we decommission the host), it will remain in this list even though it may be unavailable, powered down, or rebooting.
By itself, this is not a serious problem, as NTP is meant to work with multiple servers (which is why we have the redundancy in our setup as well) and a single host being down is not an issue. But from our POV, this is not ideal: more than one host in the list can be down at a time, and we would have to make that change in Puppet and roll it out for systemd-timesyncd to be aware of it. It's also not a correct reflection of the state of the DNS boxes and we should fix that.
Solutions
Using confd to manage this list
The simplest solution is to template and manage /etc/systemd/timesyncd.conf via confd and use the current state of the DNS boxes as reported by etcd/confctl. This is fairly easy to do, as we are already doing this in a bunch of other places on the DNS hosts themselves. However, it involves rolling out confd to all the other 2184 hosts, which may not be ideal.
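As a rough illustration of what such a template would have to do, the Python sketch below filters the candidate list by pooled state and renders the file. It is not confd template syntax; the pooled_state mapping and its "yes"/"no" values are assumptions about how the etcd/confctl state would be exposed to the template.

```python
def pooled_servers(candidates, pooled_state):
    """Keep only hosts whose state is exactly "yes", preserving order.

    Hosts missing from pooled_state are treated as not pooled.
    """
    return [host for host in candidates if pooled_state.get(host) == "yes"]


def render_timesyncd_conf(servers):
    """Render a timesyncd.conf body in the same shape as the current file."""
    return "[Time]\nServers=" + " ".join(servers) + "\n"


# Candidate list taken from the cp7001 example in this task.
candidates = [
    "dns7001.wikimedia.org",
    "dns7002.wikimedia.org",
    "dns1004.wikimedia.org",
    "dns1005.wikimedia.org",
    "dns1006.wikimedia.org",
]

# Hypothetical state: dns7002 has been depooled, everything else is pooled.
pooled_state = {
    "dns7001.wikimedia.org": "yes",
    "dns7002.wikimedia.org": "no",
    "dns1004.wikimedia.org": "yes",
    "dns1005.wikimedia.org": "yes",
    "dns1006.wikimedia.org": "yes",
}

conf = render_timesyncd_conf(pooled_servers(candidates, pooled_state))
```

With this state, dns7002 drops out of the rendered file without any Puppet change, which is the behaviour we are after.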
Anycast NTP
The motivation behind this task is to anycast the list of NTP servers. This will allow us to manage the servers more dynamically, as the DNS boxes can simply enter and exit the pool of available NTP servers when desired without a need to update Puppet. This is also one of the main reasons why we did the current NTP anycast for the Debian installer.
To do so, we will introduce three new NTP addresses, ntp-[abc].anycast.wmnet, and then configure the clients to use these instead of the current list. In advertising these, we will follow the current logic:
ntp-a.anycast.wmnet: announced from all sites
ntp-b.anycast.wmnet: announced from all sites
ntp-c.anycast.wmnet: announced from only the core sites
ntp-[ab].anycast.wmnet are announced from all sites, but only one DNS box per site announces each. So in magru, dns7001 will advertise ntp-a.anycast.wmnet while dns7002 will advertise ntp-b.anycast.wmnet. The core sites have 3x DNS boxes, so it's their third servers (dns1006.wikimedia.org and dns2006.wikimedia.org) which will be the only ones in the network advertising ntp-c.anycast.wmnet.
This way we can still map the current setup and maintain the same redundancy as desired but with a dynamic anycast setup. The output of this exercise will then look like:
[Time]
Servers=ntp-a.anycast.wmnet ntp-b.anycast.wmnet ntp-c.anycast.wmnet
There should be no further updates to this file from this point forward.
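The announcement scheme described above can be sketched in Python as follows. This only illustrates the mapping and is not bird or Puppet configuration. Assigning names by sorted position within a site is an assumption of this sketch, as is the site name "eqiad" for the dns100x hosts; only hosts named in this task are used.

```python
ANYCAST_NAMES = [
    "ntp-a.anycast.wmnet",
    "ntp-b.anycast.wmnet",
    "ntp-c.anycast.wmnet",
]
CORE_ONLY = {"ntp-c.anycast.wmnet"}


def announcements(dns_hosts_by_site, core_sites):
    """Map each DNS host to the single anycast name it would announce.

    Hosts are ordered by name within a site: the first announces ntp-a, the
    second ntp-b, and the third ntp-c, but only if the site is a core site.
    """
    result = {}
    for site, hosts in dns_hosts_by_site.items():
        for name, host in zip(ANYCAST_NAMES, sorted(hosts)):
            if name in CORE_ONLY and site not in core_sites:
                continue
            result[host] = name
    return result


# Hypothetical inventory built from the hosts mentioned in this task.
inventory = {
    "magru": ["dns7001.wikimedia.org", "dns7002.wikimedia.org"],
    "eqiad": [
        "dns1004.wikimedia.org",
        "dns1005.wikimedia.org",
        "dns1006.wikimedia.org",
    ],
}

mapping = announcements(inventory, core_sites={"eqiad"})
```

Here dns7001 and dns7002 map to ntp-a and ntp-b respectively, and dns1006 is the only host in the inventory that maps to ntp-c, matching the description above.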
Concerns
- The drawback of this approach, both in general and relative to the confd one, is that the setup might be more complex, especially given that we will have to come up with a smart way to distribute the ntp-[ab] announcements across different hosts. I think this is not a huge blocker as we are already doing similar things (for the unicast ns0/1 announcements).
- This brings yet another thing under the BGP/bird setup, increasing the critical services we have under it.
- We will need to deprecate ntp.anycast.wmnet and switch it over to something like ntp-[ab].anycast.wmnet. No issues there.
- This is a big change and we should be confident in moving towards this setup. We can start small and do it on a few hosts but eventually it will touch all 2184 hosts and more.
- We will need to update or improve the monitoring to adapt to this change.
netops [ @ayounsi / @cmooney ] your input required as always, thank you.