
Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work
Closed, Resolved (Public)

Description

Introduction

This is the parent task for reducing toil involved in management of DNS/NTP hosts by removing manual configuration processes and the reliance on the Puppet repository to define their pooled state.

For a refresher: every DNS host is also an NTP host, provides the anycasted internal recursor, and acts as an authoritative DNS server, so a single box serves three roles (the rec and auth roles were merged in T330670). We have three nameservers, ns[0-2]. ns0 points to dns100[4-6] and ns1 points to dns200[4-6], in both cases via static routes on the core routers in the respective site. ns2 is anycasted and announced via bird from all sites, so ns2 traffic should be spread across all DNS hosts.

We currently have 14 DNS hosts: three hosts each in the core sites and two each in the edge sites.

Progress

  • NTP
  • ns[0-2] routes automation
  • Replacing authdns_servers, recdns, ntp (Debian installer) with confctl

Problem Statement

As of today, when we have to perform maintenance work on a DNS host such as a reboot or reimage the process involved is:

  1. If it is a host to which nsX points, update the static routes to remove the route to the host in question.
    • This involves changes on both core routers in the given site.
    • There is no review in this process and most Traffic members rely on the diff of the output before committing the change.
  2. Update DNS records for ntp.$site.wikimedia.org, such as ntp.eqiad.wikimedia.org, used by the install servers, to point them away from the host in question.
$ dig ntp.eqiad.wikimedia.org +short
dns1004.wikimedia.org.
208.80.154.6

Taking the above example, if we have to perform maintenance on dns1004, we update this CNAME to point to another DNS host in eqiad instead. This record has a TTL of one hour so we usually wait for an hour or perform this step in advance.

The changes involved are again manual: update this record in the DNS repository, run authdns-update and then revert when done.

  3. Remove the host from the Puppet repository, specifically from the authdns_servers key in hieradata/common.yaml.
authdns_servers:
  'dns1004.wikimedia.org': 208.80.154.6
  'dns1005.wikimedia.org': 208.80.154.153
  'dns1006.wikimedia.org': 208.80.154.77
  ...
  4. Run agent on the following hosts to ensure complete removal of the above host:
sudo cumin 'A:cumin or A:dns-rec or A:netbox' 'run-puppet-agent'
  5. Removing a host additionally involves stopping the bird service to depool it from the anycast network.
  6. Complete maintenance and then revert the steps.

The above is a slow and error-prone process that takes up the full attention of one engineer: once we start working on a DNS host we have to finish, because a host that is down or unavailable blocks other processes, such as cookbooks and auth DNS updates. With the recent increase in the number of reboots and reimages, we felt the need to improve this to reduce the toil in the Traffic team.

Solution

We need to automate the above process and remove the manual configuration that defines which DNS host is pooled (or not) and also ensure that all other relevant configuration bits such as the NTP settings and static routes are included and adjusted automatically. The goal here is to do all of this via a cookbook: running a single command to do a rolling reboot or reimage of the DNS hosts without any human intervention and no manual Puppet changes :)

We plan to achieve this by automating the following, step-by-step:

NTP automation

Instead of having individual records per site, we should instead point all autoinstall files to a single domain, ntp.anycast.wmnet, the equivalent of recdns.anycast.wmnet. This should point to the same anycast IP as the recdns hosts, 10.3.0.1. By doing this we no longer care when a single DNS host is down as the next available one will (should) be reached, in the same site.

Once this is in operation and has been tested, we can remove all existing ntp.$site.wikimedia.org DNS records and thus no longer need to worry about updating this record.

ns[0-2] routes automation

Instead of using static routes for the nameservers, we should use bird and announce them via BGP. This automates the handling of a host being down and also removes the primary static routes, thus ensuring easier maintenance and visibility in case a route goes down.

We can still continue to have the static routes for backup but they don't need to be the primary source of truth. No changes are required for ns2 as it's already anycasted and nothing changes there.

Replacing authdns_servers with confctl

The most important part of this automation would be removing the dependency on the authdns_servers key in the Puppet repo as the source of pooled state for the DNS hosts. This key is used in a bunch of other places as well but specifically in modules/profile/templates/dns/auth/wikimedia-authdns.conf.erb.

NAMESERVERS="<%= @authdns_servers.keys.join(' ') %>"

NAMESERVERS above, for example, should be derived not from the Puppet config (which necessitates a manual change and Puppet agent runs) but from a more dynamic source such as confctl. Essentially, instead of relying on authdns_servers, we should shift to confctl so that we can control this list dynamically without Puppet changes.

  • This requires more research on our end to see what is feasible and how it will be implemented.
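
As a rough illustration of the idea, the sketch below derives the NAMESERVERS value from pooled state rather than from hieradata. It assumes one JSON object per line, in the shape of the confctl output quoted later in this task. The sample pooled values and the function names are made up for illustration, and this is not the actual confd template.

```python
import json

# Hypothetical sample input: one JSON object per line, shaped like the
# confctl output quoted later in this task. The pooled values are made up.
SAMPLE = [
    '{"dns1004.wikimedia.org": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=dnsbox,service=authdns-update"}',
    '{"dns1005.wikimedia.org": {"weight": 100, "pooled": "no"}, "tags": "dc=eqiad,cluster=dnsbox,service=authdns-update"}',
    '{"dns1006.wikimedia.org": {"weight": 100, "pooled": "yes"}, "tags": "dc=eqiad,cluster=dnsbox,service=authdns-update"}',
]


def pooled_hosts(lines):
    """Return the sorted hostnames whose record says pooled == "yes"."""
    hosts = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        for name, value in record.items():
            # The "tags" key describes the record; every other key is a host.
            if name != "tags" and value.get("pooled") == "yes":
                hosts.append(name)
    return sorted(hosts)


def nameservers_line(lines):
    """Render the NAMESERVERS assignment used in wikimedia-authdns.conf."""
    return 'NAMESERVERS="{}"'.format(" ".join(pooled_hosts(lines)))


print(nameservers_line(SAMPLE))
```

With this shape, depooling a host only changes the input data; nothing in the Puppet repository needs to be edited for the list to shrink.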

Consumers

In automating the above, the following projects will be affected and thus we need to consult with each of them:

  1. Existing cookbooks that push to the DNS hosts, such as the Netbox DNS update cookbook that calls authdns-update.
    1. Moving the hosts behind confctl should make this process quite easy and ideal for the cookbooks at least!
    2. There might be more cookbook-related changes here that result from this; the IF team can better comment and are the experts [CC @Volans / @jbond].
  2. The authdns-update script itself, that is used by a large number of SREs.
  3. NTP changes should affect FR-Tech and Debian installs [CC @Dwisehaupt / @MoritzMuehlenhoff]
    • What about ntp_servers in homer/config/comm.yaml? We can maintain this list manually, we don't modify this for regular maintenance work.
  4. Announcing the ns[0-1] static routes via bird/BGP [CC @ayounsi / @cmooney].
  5. Anything in the Puppet repository that utilizes the authdns_servers key in hieradata/common.yaml and thus considers it to be the source of truth for the definition of a pooled/active DNS host.

Challenges

  1. This is a big change from our current setup and we will roll out the changes slowly and gradually, similar to how we did it for T340479. Nevertheless, this will be a significant change.
  2. The automation of the NTP peers for systemd-timesyncd, generated via modules/profile/manifests/systemd/timesyncd.pp, and of resolv.conf for the DNS hosts themselves, generated in modules/profile/manifests/dns/recursor.pp: both consume the authdns_servers key and we need to make sure that we can maintain the existing automation there, especially when decommissioning and commissioning new hosts.

Timeline

We are looking to work on this immediately with the understanding that we will not be pushing any changes beyond November, given that this is the last quarter of the year.

Ideal Goals

  1. We should also include a way to ease/automate the DNS depooling of sites that is currently performed via the DNS repository and involves these steps: a single-line code change, pushing to the repository and then running authdns-update. This step takes a while and cannot be performed immediately in case of an emergency because of the dependence on the Git commit. Automating this to be done through a single command would be ideal.

Details

Repo                  Branch      Lines +/-
operations/cookbooks  master      +29 -6
operations/cookbooks  master      +62 -0
operations/puppet     production  +21 -6
operations/puppet     production  +42 -59
operations/puppet     production  +1 -1
operations/puppet     production  +6 -6
operations/puppet     production  +80 -0
operations/puppet     production  +6 -18
operations/puppet     production  +1 -1
operations/puppet     production  +1 -2
operations/puppet     production  +6 -3
operations/puppet     production  +0 -17
operations/puppet     production  +9 -9
operations/puppet     production  +6 -4
operations/puppet     production  +34 -17
operations/puppet     production  +20 -0
operations/puppet     production  +10 -5
operations/puppet     production  +25 -0
operations/puppet     production  +1 -1
operations/puppet     production  +0 -2
operations/puppet     production  +1 -1
operations/puppet     production  +2 -2
operations/puppet     production  +20 -9
operations/puppet     production  +24 -9
operations/puppet     production  +29 -6
operations/puppet     production  +9 -0
operations/puppet     production  +5 -11
operations/puppet     production  +0 -1
operations/dns        master      +6 -6
operations/puppet     production  +1 -118
operations/puppet     production  +16 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 972883 abandoned by Ssingh:

[operations/puppet@production] hiera: update authdns_servers (PCC test commit, DO NOT MERGE)

Reason:

https://gerrit.wikimedia.org/r/972883

Change 975009 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] conftool: introduce schema and host file for dnsboxes

https://gerrit.wikimedia.org/r/975009

Change 975843 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: add support for setting ferm rules via confd

https://gerrit.wikimedia.org/r/975843

Change 976254 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: add support for authdns-update hosts via confd

https://gerrit.wikimedia.org/r/976254

Change 977101 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: add support for generating .ssh/config via confd

https://gerrit.wikimedia.org/r/977101

Change 977252 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] admin: reserve uid/gid for authdns user

https://gerrit.wikimedia.org/r/977252

Change 977259 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update::account: switch to systemd::sysuser

https://gerrit.wikimedia.org/r/977259

Change 977259 abandoned by Ssingh:

[operations/puppet@production] P:dns::auth::update::account: switch to systemd::sysuser

Reason:

for now

https://gerrit.wikimedia.org/r/977259

Change 977252 merged by Ssingh:

[operations/puppet@production] admin: reserve uid/gid for authdns user

https://gerrit.wikimedia.org/r/977252

Change 975009 merged by Ssingh:

[operations/puppet@production] conftool: introduce schema and host file for dnsboxes

https://gerrit.wikimedia.org/r/975009

Mentioned in SAL (#wikimedia-operations) [2023-11-29T16:53:07Z] <sukhe> sudo confctl --object-type dnsbox select 'dc=.*' set/pooled=yes T347054

@ssingh what's your timeline to switch to use this new method to get what DNS hosts are pooled? As you know we need to adjust spicerack/cookbooks accordingly.

Hi @Volans: Good question. We are still figuring this out internally, so I will update this ticket as we go along. For now, we are just testing it on one host, dns6001, and only for the ferm rule. We will keep the conditional profile::dns::auth::confd_enabled for now, then remove it and fully switch to etcd-based data in January. So we might roll it out but will let Puppet be the source of truth before the holidays.

Let me know if you have any questions or feedback about the above approach.

Mentioned in SAL (#wikimedia-operations) [2023-11-29T17:14:00Z] <sukhe> disable puppet on A:dns-rec to roll out CR 975843: T347054

Change 975843 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update: add support for setting ferm rules via confd

https://gerrit.wikimedia.org/r/975843

For awareness: on dns6001, we have rolled out setting the ferm rules for authdns-update via the confd-managed file and have tested it by toggling the pooled state of a DNS host. We will let this change persist on dns6001 as we roll out other changes as well.

Mentioned in SAL (#wikimedia-operations) [2023-11-29T18:31:46Z] <sukhe> disable puppet on A:dns-rec to roll out CR 976254: T347054

Change 976254 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update: add support for authdns-update hosts via confd

https://gerrit.wikimedia.org/r/976254

Mentioned in SAL (#wikimedia-operations) [2023-11-29T19:25:59Z] <sukhe> disable puppet on A:dns-rec to roll out CR 977101: T347054

Change 977101 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update: add support for generating .ssh/config via confd

https://gerrit.wikimedia.org/r/977101

Change 978680 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: use correct key name ("ip" not "IP")

https://gerrit.wikimedia.org/r/978680

Change 978680 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update: use correct key name ("ip" not "IP")

https://gerrit.wikimedia.org/r/978680

Change 978685 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth: pass .ssh/config to authdns-update

https://gerrit.wikimedia.org/r/978685

Change 978685 merged by Ssingh:

[operations/puppet@production] P:dns::auth: pass .ssh/config to authdns-update

https://gerrit.wikimedia.org/r/978685

Mentioned in SAL (#wikimedia-operations) [2023-11-29T20:25:27Z] <sukhe> sudo cumin -s1 -b60 "A:dns-rec and not P{dns6001*}" "enable-puppet 'do not enable' && run-puppet-agent": T347054

Mentioned in SAL (#wikimedia-operations) [2023-11-29T20:25:44Z] <sukhe> [correction] sudo cumin -b1 -s60 "A:dns-rec and not P{dns6001*}" "enable-puppet 'do not enable' && run-puppet-agent": T347054

Summary of changes today:

  • For just dns6001, the ferm rules for the DNS hosts and the list of servers in authdns-update are now managed by confd.
  • On all DNS hosts, the /etc/resolv.conf data that was generated by host_core is now managed by confd as well. This applies to all hosts and not just dns6001.

Mentioned in SAL (#wikimedia-operations) [2023-11-30T15:33:39Z] <sukhe> clean-up /etc/hosts on A:dns-rec to remove entries populated by host_core: T347054

Change 979159 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: dnsbox: remove anycast-hc dependency on pdns-rec

https://gerrit.wikimedia.org/r/979159

Change 980427 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth: add support for depooling recdns via confd

https://gerrit.wikimedia.org/r/980427

Change 980929 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] dnsrecursor: forward_zones for wikimedia.org, too

https://gerrit.wikimedia.org/r/980929

Change 979159 merged by Ssingh:

[operations/puppet@production] hiera: dnsbox: remove anycast-hc dependency on pdns-rec

https://gerrit.wikimedia.org/r/979159

Change 980929 merged by Ssingh:

[operations/puppet@production] dnsrecursor: forward_zones for wikimedia.org, too

https://gerrit.wikimedia.org/r/980929

Change 1004205 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] conftool: update schema for dnsbox for anycast authdns setups

https://gerrit.wikimedia.org/r/1004205

Change 1005559 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth: update confd keys to reflect new schema

https://gerrit.wikimedia.org/r/1005559

Change 1004205 abandoned by Ssingh:

[operations/puppet@production] conftool: update schema for dnsbox for anycast authdns setups

Reason:

no longer going ahead with the custom dnsbox schema

https://gerrit.wikimedia.org/r/1004205

Change 1006021 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] conftool-data: add dnsbox hosts data

https://gerrit.wikimedia.org/r/1006021

Change 1006057 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] dns6001: set confd_enabled to false

https://gerrit.wikimedia.org/r/1006057

Change 1006532 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth:: switch confd .ssh/config back to Puppet

https://gerrit.wikimedia.org/r/1006532

Mentioned in SAL (#wikimedia-operations) [2024-02-26T16:22:56Z] <sukhe> etcd: purging /conftool/v1/dnsbox: old schema, deprecated: T347054

Change 1006021 merged by Ssingh:

[operations/puppet@production] conftool-data: add dnsbox hosts data

https://gerrit.wikimedia.org/r/1006021

Change 1006532 merged by Ssingh:

[operations/puppet@production] P:dns::auth:: switch confd .ssh/config back to Puppet

https://gerrit.wikimedia.org/r/1006532

Change 1006548 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::update::account: add .ssh/config file

https://gerrit.wikimedia.org/r/1006548

Change 1006548 merged by Ssingh:

[operations/puppet@production] P:dns::update::account: add .ssh/config file

https://gerrit.wikimedia.org/r/1006548

Change 1006549 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: revert to using ferm rules from Puppet

https://gerrit.wikimedia.org/r/1006549

Change 1006549 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update: revert to using ferm rules from Puppet

https://gerrit.wikimedia.org/r/1006549

Change 1006554 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::upate: absent confd management of ferm rules

https://gerrit.wikimedia.org/r/1006554

Change 1006554 merged by Ssingh:

[operations/puppet@production] P:dns::auth::upate: absent confd management of ferm rules

https://gerrit.wikimedia.org/r/1006554

Change 1006558 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: update authdns-update for new confctl changes

https://gerrit.wikimedia.org/r/1006558

Change 1006558 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update: update authdns-update for new confctl changes

https://gerrit.wikimedia.org/r/1006558

Change 1006573 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: remove redundant conftool_prefix

https://gerrit.wikimedia.org/r/1006573

Change 1006573 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update: remove redundant conftool_prefix

https://gerrit.wikimedia.org/r/1006573

Change 1006576 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] wikimedia-authdns.conf.tpl.erb: get the correct key path

https://gerrit.wikimedia.org/r/1006576

Change 1006576 merged by Ssingh:

[operations/puppet@production] wikimedia-authdns.conf.tpl.erb: get the correct key path

https://gerrit.wikimedia.org/r/1006576

Change 1006955 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: move authdns-update state to confd

https://gerrit.wikimedia.org/r/1006955

Change 1006955 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update: move authdns-update state to confd

https://gerrit.wikimedia.org/r/1006955

Status as of today: we are now managing authdns-update state, that is, the list of hosts in /etc/wikimedia-authdns.conf, via confd. To depool a host so that it doesn't receive authdns updates (state, not service) you can do something like:

sudo confctl select 'name=dns6001.wikimedia.org,service=authdns-update' set/pooled=no

Doing the above, for example, will stop dns6001 from receiving authdns-update runs from all other hosts.

Even from itself? As in what happens if an operator runs authdns-update on a depooled host?

Do we need to change anything on the sre.dns.netbox cookbook?
It currently runs:

cd {git} && utils/deploy-check.py -g {netbox} --deploy

Even from itself? As in what happens if an operator runs authdns-update on a depooled host?

The NAMESERVERS list that is populated by confd affects only the hosts to which we SSH and run authdns-update, not the host itself. So if you depool dns1004 and run authdns-update from there, nothing changes. If you run authdns-update from dns1005 (or anywhere else), it won't touch dns1004. On my end, I think this behaviour makes sense. But is it fine from your perspective of automation and cookbooks?

Do we need to change anything on the sre.dns.netbox cookbook?
It currently runs:

cd {git} && utils/deploy-check.py -g {netbox} --deploy

So not right now, as we are just assuming everything is pooled and there is still state management done through Puppet (for services). I plan on finishing this as soon as possible and then yes, we should set up some time to talk about how this impacts you.

Sorry about not doing this earlier but things were in flux as you can see in this commit and it took us some re-design to get here (especially with one-but-more-in-future anycast auth DNS setups), which we didn't account for in round one.

Even from itself? As in what happens if an operator runs authdns-update on a depooled host?

The NAMESERVERS list that is populated by confd affects only the hosts to which we SSH and run authdns-update, not the host itself. So if you depool dns1004 and run authdns-update from there, nothing changes. If you run authdns-update from dns1005 (or anywhere else), it won't touch dns1004. On my end, I think this behaviour makes sense. But is it fine from your perspective of automation and cookbooks?

I'm more worried for the human side of the problem. In my experience people rely a lot on muscle memory and shell history, so they tend to go always to the same host to run a given command. So I think it would be very easy that someone will run the authdns-update on dns1001 even if depooled just because they're used to run it there.
As for the automation it's currently using the cumin alias but can totally be adapted to use confctl if that's what needs to be used.

Even from itself? As in what happens if an operator runs authdns-update on a depooled host?

The NAMESERVERS list that is populated by confd affects only the hosts to which we SSH and run authdns-update, not the host itself. So if you depool dns1004 and run authdns-update from there, nothing changes. If you run authdns-update from dns1005 (or anywhere else), it won't touch dns1004. On my end, I think this behaviour makes sense. But is it fine from your perspective of automation and cookbooks?

I'm more worried for the human side of the problem. In my experience people rely a lot on muscle memory and shell history, so they tend to go always to the same host to run a given command. So I think it would be very easy that someone will run the authdns-update on dns1001 even if depooled just because they're used to run it there.

Fair enough, let me think a bit more about this and see if it makes sense to skip running authdns-update from the host itself as well.

As for the automation it's currently using the cumin alias but can totally be adapted to use confctl if that's what needs to be used.

Yes, we will need to do that once the changes are rolled out, baked in and we are working with the assumption that state management is strictly via confd. Which so far only holds true for just authdns-update but that is set to change for recdns, ntp, and all authdns-nsX. I will set up some time to sync up sooner than later, given most of the design is finalized.

Thanks!

Change 980427 merged by Ssingh:

[operations/puppet@production] P:dns::auth: add support for depooling services via confd/confctl

https://gerrit.wikimedia.org/r/980427

Update: We have merged the service depooling change on dns6001. This means that service depooling -- recdns, ntp, authdns-ns2 -- on dns6001, is now managed via confd.

Doing:

sudo confctl select 'name=dns6001.wikimedia.org,service=recdns' set/pooled=no

will withdraw the route for 10.3.0.1 from dns6001 and thus depool it from servicing anycast recdns.
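
For tooling that needs to issue these commands, a small helper along the following lines could build the argument list. This is only a sketch: the function name is hypothetical, it uses only the selector and set/pooled forms that appear in this task, and it accepts only the two pooled values shown here.

```python
def confctl_set_pooled(host, pooled, service=None):
    """Build the argv for a confctl pooled-state change.

    If service is None, the selector names only the host; otherwise it
    is narrowed to that one service, as in the recdns example above.
    """
    if pooled not in ("yes", "no"):
        raise ValueError("pooled must be 'yes' or 'no'")
    selector = "name={}".format(host)
    if service is not None:
        selector += ",service={}".format(service)
    return ["sudo", "confctl", "select", selector, "set/pooled={}".format(pooled)]


cmd = confctl_set_pooled("dns6001.wikimedia.org", "no", service="recdns")
print(" ".join(cmd))
```

Returning a list rather than a string keeps the selector as a single argument, so no shell quoting is needed when the list is passed to something like subprocess.run.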

We will roll this out fleet-wide on Monday.

Change 1007918 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] dns::auth: move all service statement management to confd

https://gerrit.wikimedia.org/r/1007918

Change 1005559 abandoned by Ssingh:

[operations/puppet@production] P:dns::auth: update confd keys to reflect new schema

Reason:

https://gerrit.wikimedia.org/r/1005559

Change 1006057 abandoned by Ssingh:

[operations/puppet@production] dns6001: set confd_enabled to false

Reason:

no longer required

https://gerrit.wikimedia.org/r/1006057

Mentioned in SAL (#wikimedia-operations) [2024-03-04T16:12:08Z] <sukhe> sudo cumin "A:dns-rec" "disable-puppet 'merging CR 1007918'": T347054

Change 1007918 merged by Ssingh:

[operations/puppet@production] dns::auth: move all service state management to confd

https://gerrit.wikimedia.org/r/1007918

Mentioned in SAL (#wikimedia-operations) [2024-03-04T16:59:21Z] <sukhe> sudo cumin -b1 -s120 "A:dns-rec" "run-puppet-agent --enable 'merging CR 1007918'": finish rolling out confd state management: T347054

ssingh closed this task as Resolved. (Edited Mar 4 2024, 6:55 PM)
ssingh claimed this task.

Final Update

We have finished rolling out the changes today, so all state management -- authdns-update, recdns, NTP (Debian installer), authdns-ns[0-2] -- is now managed via confd/confctl, for all DNS hosts.

This means to depool a DNS host for maintenance, we can now do:

sudo confctl select 'name=dns2004.wikimedia.org' set/pooled=no

This will depool dns2004 for:

{"/codfw/dnsbox/ntp": ["dns2004.wikimedia.org"], "/codfw/dnsbox/authdns-ns1": ["dns2004.wikimedia.org"], "/codfw/dnsbox/authdns-ns2": ["dns2004.wikimedia.org"], "/codfw/dnsbox/authdns-update": ["dns2004.wikimedia.org"], "/codfw/dnsbox/recdns": ["dns2004.wikimedia.org"]}

That is, the host is depooled from the authdns-ns1, authdns-ns2, recdns, and ntp services, and from authdns-update so that it no longer receives zone file updates.
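
To check programmatically which services a depool touched, the JSON shown above can be parsed directly. A minimal sketch, assuming the output is captured as a string; the function name is made up for illustration.

```python
import json

# The output quoted above, captured verbatim as a string.
OUTPUT = (
    '{"/codfw/dnsbox/ntp": ["dns2004.wikimedia.org"], '
    '"/codfw/dnsbox/authdns-ns1": ["dns2004.wikimedia.org"], '
    '"/codfw/dnsbox/authdns-ns2": ["dns2004.wikimedia.org"], '
    '"/codfw/dnsbox/authdns-update": ["dns2004.wikimedia.org"], '
    '"/codfw/dnsbox/recdns": ["dns2004.wikimedia.org"]}'
)


def affected_services(output, host):
    """Return the sorted service names whose host list includes host."""
    services = []
    for path, hosts in json.loads(output).items():
        if host in hosts:
            # The service name is the last component of the key path.
            services.append(path.rsplit("/", 1)[-1])
    return sorted(services)


print(affected_services(OUTPUT, "dns2004.wikimedia.org"))
```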

Operationally, this means that we can automate DNS host maintenance for reboots, which used to be a manual, at times error-prone process that looked something like:

Old Process

  1. Stop bird on the host, depooling all services (not just selectively).
  2. Remove the static routes manually from the core routers and make the NTP DNS changes above. This step used to take the most time in this process.
  3. Remove host from authdns_servers and merge.
  4. Run Puppet agent on all A:dns-rec, A:cumin, and A:netbox hosts so that the particular DNS host is not in the list of "active" hosts (see caveat below).
  5. Perform maintenance.
  6. Reverse the steps above.

The time it used to take for a single host to be depooled and rebooted this way was around 30-45 minutes. For a series of reboots such as kernel updates, it used to take Traffic 3-4 days to reboot the entire DNS cluster (14 boxes) one by one, with one engineer fully dedicated to making sure that nothing was broken, given the critical nature of the DNS hosts.
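
As a back-of-the-envelope check on those figures, the per-host time alone adds up as follows. This counts only the 30-45 minutes per host quoted above and none of the waiting between hosts, so it is a lower bound on hands-on time rather than an account of the full 3-4 days.

```python
HOSTS = 14                   # DNS boxes in the cluster
MINUTES_PER_HOST = (30, 45)  # low and high estimates per host


def hands_on_hours(hosts, minutes_per_host):
    """Total per-host maintenance time in hours."""
    return hosts * minutes_per_host / 60


low, high = (hands_on_hours(HOSTS, m) for m in MINUTES_PER_HOST)
print("{:.1f}-{:.1f} hours".format(low, high))  # prints "7.0-10.5 hours"
```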

With this change, the above has been reduced to:

New (Current) Process

sudo confctl select 'name=dns2004.wikimedia.org' set/pooled=no

This can allow us to quickly depool a DNS host when required, without any user intervention.
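
With depooling reduced to a single call, a rolling maintenance loop becomes straightforward to express. The sketch below is hypothetical and is not the roll-reboot cookbook: the callables set_pooled, maintain and is_healthy are stand-ins that a real implementation would have to supply.

```python
def rolling_maintenance(hosts, set_pooled, maintain, is_healthy):
    """Depool, maintain, check and repool each host, strictly one at a time.

    Stops at the first host that fails its check and leaves it depooled.
    Returns the list of hosts that were repooled.
    """
    repooled = []
    for host in hosts:
        set_pooled(host, "no")
        maintain(host)
        if not is_healthy(host):
            raise RuntimeError(
                "{} failed its health check; left depooled".format(host)
            )
        set_pooled(host, "yes")
        repooled.append(host)
    return repooled


# Usage with stand-in callables that only record what would happen.
events = []
rolling_maintenance(
    ["dns2004.wikimedia.org", "dns2005.wikimedia.org"],
    set_pooled=lambda host, state: events.append((host, "pooled=" + state)),
    maintain=lambda host: events.append((host, "reboot")),
    is_healthy=lambda host: True,
)
for event in events:
    print(event)
```

Processing one host at a time and stopping on the first failed check means that at most one host is out of rotation at any point.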

Relatedly, the old style of maintenance ran into race conditions between the state as known to us and the state as perceived by the cookbooks, where a DNS host was marked as "active" when it had actually been depooled. The new change alleviates that concern as well, once the cookbooks are updated (that's for another task).

What's left?

There are a few related things still left that are not part of this task, as this task was about regular maintenance and not about replacing every occurrence of authdns_servers in Puppet, which still exists.

  1. For the hosts themselves and systemd-timesyncd, the list of NTP servers looks like (example from a host in eqiad):
[Time]
Servers=dns1004.wikimedia.org dns1005.wikimedia.org dns1006.wikimedia.org dns2004.wikimedia.org dns2005.wikimedia.org dns2006.wikimedia.org

This list is still generated by Puppet through the code in modules/profile/manifests/systemd/timesyncd.pp. This will not change for now but it is fine as the maintenance of the hosts will be one by one. We do have plans to replace this with an anycast-based setup, but that's also for another task.

  2. There are some references to authdns_servers that can be found with a quick grep. For those again, we are working on the above assumption and will remove them one by one; nothing changes for now.
  3. The cookbooks will need to be updated to not use A:dns-rec but to query the current active state from confctl instead.

Change 1009261 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth: skipping running authdns-update on host if not pooled

https://gerrit.wikimedia.org/r/1009261

Change 1009539 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] DNS-related cookbooks: adapt for conftool state

https://gerrit.wikimedia.org/r/1009539

Change 1012719 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/cookbooks@master] cookbooks.sre.dns: add roll-reboot cookbook

https://gerrit.wikimedia.org/r/1012719

Change 1009261 merged by Ssingh:

[operations/puppet@production] P:dns::auth: skipping running authdns-update on host if not pooled

https://gerrit.wikimedia.org/r/1009261

Even from itself? As in what happens if an operator runs authdns-update on a depooled host?

Thanks for the feedback! We have fixed this by not allowing authdns-update to be run on a given host if it is depooled.

{"dns6001.wikimedia.org": {"weight": 100, "pooled": "no"}, "tags": "dc=drmrs,cluster=dnsbox,service=authdns-update"}
sukhe@dns6001:~$ sudo -i authdns-update 
Unable to run authdns-update from this host: dns6001.wikimedia.org is not pooled. Please pick another host!
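
The guard can be illustrated as follows. This is a Python sketch of the behaviour only, not the code that was merged; it assumes the local host's record is available as a JSON line like the one above, and the function and exception names are made up.

```python
import json


class NotPooledError(RuntimeError):
    """Raised when the local host must not run authdns-update."""


def ensure_pooled(local_host, confctl_line):
    """Raise NotPooledError unless local_host is recorded as pooled."""
    record = json.loads(confctl_line)
    # A host missing from the record is treated as not pooled.
    state = record.get(local_host, {}).get("pooled")
    if state != "yes":
        raise NotPooledError(
            "Unable to run authdns-update from this host: "
            "{} is not pooled. Please pick another host!".format(local_host)
        )


# The record quoted above, verbatim.
LINE = (
    '{"dns6001.wikimedia.org": {"weight": 100, "pooled": "no"}, '
    '"tags": "dc=drmrs,cluster=dnsbox,service=authdns-update"}'
)

try:
    ensure_pooled("dns6001.wikimedia.org", LINE)
except NotPooledError as err:
    print(err)
```

Treating a missing record as not pooled makes the check fail closed, which seems the safer default for a host whose state is unknown.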

Change #1012719 merged by Ssingh:

[operations/cookbooks@master] cookbooks.sre.dns: add roll-reboot cookbook

https://gerrit.wikimedia.org/r/1012719

Change #1009539 merged by jenkins-bot:

[operations/cookbooks@master] DNS-related cookbooks: adapt for conftool state

https://gerrit.wikimedia.org/r/1009539