
AuthDNS CM/CI refactor
Closed, Resolved · Public

Description

With the addition of DNS Discovery to AuthDNS, some of the complexity and pain-points of our current split CM (and implicitly, CI) arrangement are becoming more-apparent.

I'd like to address this by re-architecting all related things a bit. There's an evolving Proof-of-Concept commit here: https://gerrit.wikimedia.org/r/#/c/342887/ .

From a naive POV, the first question one would ask is "Why aren't we managing this from the puppet repo like any other service, with all configuration and control deployed directly by puppet with appropriate reloads, restarts, and CI checks built in?". The best answer is that it's simply too critical a service, as it underlies almost everything else. The current design has some advantages in that regard: it allows deploying DNS data (and some config) changes independently of puppet during puppet-affecting infrastructure outages; it coordinates DNS data updates to happen close together in time on all DNS servers (instead of randomly spread across puppet runs); and it ensures they all update to identical data (by pulling the config from the dns repo on only one server and then syncing the others from that server, guarding against split-brain, reachability, or staleness issues on the puppetmasters). We need to preserve those advantages.

These are the perceived problem areas in the current design:

  • Split-repo design issues:
    • While most of the DNS configuration (geoip mapping, zonefiles) is managed via the ops/dns repo, that isn't all of it. The core config file is managed from the puppet repo, as are all the basics like the package installation, basic systems service definition, etc. The scripts which operate on the ops/dns data are also managed and deployed from ops/puppet.
    • Sometimes this makes changes challenging, involving commits on both sides of the fence that are tricky to coordinate (e.g. a DNS software update that wants an ops/puppet change to one of the ops/dns-affecting scripts coordinated with an ops/dns change to a config file). We routinely face these even for work that isn't changing the DNS infra itself (how often do we have matching ops/puppet + ops/dns commits with manual commit-message notes about deploying one before the other?). Efforts like DNS-Discovery want more DNS server config derived from our central shared config data in the puppet repo (e.g. hieradata), which makes this problem more apparent (we're now duplicating mock definitions on the ops/dns side just to make CI pass, which is fakery and means lint-success can still lead to deploy failure). It also comes up any time we talk about further simplifying and automating the deployment of new services.
    • CI (jenkins) only does authdns-lint on the contents of an ops/dns change. It is oblivious to config changes on the ops/puppet side, even though those can break things. Again, this wasn't much of an issue before, but with more DNS config on the puppet side tied to hieradata for DNS-Discovery, it becomes one. More generally, it seems like a fundamentally flawed design to have two separate repos contributing to one set of configuration that needs to be CI'd together for sanity: it always implies the possibility that two or more unmerged changes from two separate repos must be brought together for one CI check before deployment. Even in a perfect world, how would one automate that for jenkins checks?
  • The authdns scripts (script code managed from ops/puppet, invoked to manage ops/dns) re-implement a small slice of core puppet functionality, but differently and with missing features. There are also other general issues with them that should be addressed either by design changes or new work on the existing scripts:
    • authdns-gen-zones.py is a custom templating engine built around Python's jinja2. It basically gives ops/dns the equivalent of puppet's ERB templating, but we've got a separate complex script to drive it, and a separate templating language from the one used for the bulk of our templating needs elsewhere in ops. That the templating is in the ops/dns repo of course also implies we can't use the rich data we have in ops/puppet (e.g. hieradata LVS definitions, subnet definitions, etc) to template related bits in ops/dns at all.
    • authdns-git-pull is a custom 91-line shell script for syncing and reviewing ops/dns commits to one of the authdns masters. This in many ways overlaps with "puppet-merge" on the pure-puppet side of things (but again isn't managing the merge of the puppet-driven config file(s), only the ops/dns ones).
    • The current scripts driving authdns-gen-zones for the "normal" data update process don't handle certain kinds of data updates properly. For example, if a commit only affects a helper template file (e.g. to add new language codes for subdomains), it doesn't re-template all affected zones.
    • The current final admin-review (the equivalent of puppet-merge's confirmation, which you see when running authdns-update) only reviews the actual commitdiff from the ops/dns repo (as puppet-merge would have done). This is basically a re-review of the diff output the committer and any reviewer have already seen. What's missing here is a diff review of the final changes to the live DNS config itself, *after* all templating is complete (be it jinja2/authdns-gen-zones or otherwise).
  • The handling of DNS changes under infra outages (e.g. gerrit/git outage in the current case) and recovery afterwards is a bit more complex than it needs to be and not well-documented. This is mostly a documentation problem, but AFAICS the current basic workflow would be (honestly, I've not tested this, but it's my best guess with a few minutes to stare at the scripts):
    • Make a new local commit in /srv/authdns/git on one of the authdns servers (let's say "radon.wikimedia.org")
    • Invoke "authdns-local-update radon.wikimedia.org" on all AuthDNS servers manually one-by-one, including (starting with?) radon itself (no final review on any of them)
    • Recovery to normal state afterwards: stage the temporary commit through the now-working git/gerrit, manually reset the /srv/authdns/git clone on all authdns servers (back to the last gerrit-driven commit), then go through an authdns-update cycle to bring everything back into upstream sync.
    • This seems over-complex and under-documented, especially given that this ability is one of the core reasons for this complex infrastructure in the first place. (The other is avoiding puppetmaster outages, but honestly git/gerrit outages (Java!) seem more likely than failures of our redundant puppetmaster infra.)
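As a concrete (toy) illustration of the templating issue described above: authdns-gen-zones.py drives jinja2 to expand shared data into zone records. The sketch below uses Python's stdlib string.Template in place of jinja2, with invented language/record data, purely to show the shape of the problem — the same data that should come from hieradata has to be duplicated on the ops/dns side today:

```python
from string import Template

# Hypothetical shared data: in the refactored design this would come
# from ops/puppet hieradata; today it must be mocked on the ops/dns
# side just for CI to pass.
langs = ["de", "en", "fr"]

# Toy zone-snippet template (the real script uses jinja2 for this).
record = Template("${lang} 300 IN CNAME dyna.wikimedia.org.")

def gen_zone_snippet(langs):
    """Render one CNAME record per language subdomain."""
    return "\n".join(record.substitute(lang=l) for l in langs)

print(gen_zone_snippet(langs))
```

Note also the update problem from the list above: if the real workflow fails to re-template every zone that includes such a snippet when `langs` changes, the live zones silently diverge from the data.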

The refactored design attempts to move past almost all of the above issues. In short, the following changes:

  • Repo unification: ops/dns repo is eliminated, with the data therein moved to the ops/puppet repo.
    • No more split changes where we have to coordinate a logical change across a pair of commits to ops/puppet + ops/dns in the general case. Also, where the ordering of such updates is important, the committer can now define serial, dependent commits in the same repo.
    • It's now possible to have a jenkins CI job fully validate all authdns config before merging, even if we're bringing in complex puppet-data-driven config for services/discovery.
    • This fundamentally removes the custom authdns-git-pull code.
  • Templating unification: regular puppet ERB templating drives all config/zone templating for all authdns changes.
    • Now we use the same template language for DNS that we use for everything else.
    • This fundamentally removes the need for authdns-gen-zones.py custom code.
    • Access to hieradata and manifest variables in templates, etc.
    • Symlinking system for zone aliases is replaced by proper data-driven aliasing.
    • Data updates always work (e.g. adding a language deploys properly to all affected zones under the normal workflow).
  • Puppet further divorced from authdns server impact
    • Puppet now doesn't operate on the live configuration *at all*. All of the config/zone data templated out by puppet lands in a separate directory (/srv/authdns/staging-puppet) which the live DNS server knows nothing about. There is no need or desire to ever have puppet drive a service restart. This staging directory serves as the handoff point to the normal update process using "authdns-update".
  • The pre-flight administrative review (confirmation during "authdns-update") now operates post-templating - the diff you see is the full diff of the actual live server config/zones, not the template changes that create them.
  • The normal process when all infrastructure is online remains nearly the same as the original: merge the dns-affecting puppet change, run "puppet-merge", then choose one authdns server on which to run "authdns-update" (which mostly just iterates the defined list of authdns servers, sshing into each to run "authdns-local-update", as before).
  • The authdns-local-update script now operates on a two-tier staging design:
    • In the default/argument-free operation mode (how authdns-update invokes it on the initial server), it first safely runs "puppet agent" to deploy the latest merged changes into /srv/authdns/staging-puppet/
    • These changes are then rsynced (with puppet agent disabled to avoid update races) to the local /srv/authdns/staging/
    • Diff review/confirm is done based on the diffs between /srv/authdns/staging/ and the live config (in /etc/)
    • There is a puppet-defined rsync server on all authdns nodes. It exposes /srv/authdns/staging/ as a readonly share to all of our production networks.
    • authdns-local-update $other_server pulls its data from the remote /srv/authdns/staging/ to the local /srv/authdns/staging/, then deploys to live config.
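The two-tier flow above can be modeled minimally in Python, with shutil standing in for rsync/puppet-agent output, temporary directories for the /srv/authdns/* paths, and a difflib-based stand-in for the pre-flight diff review (all names illustrative, not the real script):

```python
import difflib
import os
import shutil
import tempfile

def sync_tree(src, dst):
    """Stand-in for an rsync-with-delete copy: dst becomes an exact
    copy of src."""
    if os.path.exists(dst):
        shutil.rmtree(dst)
    shutil.copytree(src, dst)

def review_diff(staging, live, name="config"):
    """The post-templating diff an admin would confirm: staging vs live."""
    def read(d):
        path = os.path.join(d, name)
        return open(path).readlines() if os.path.exists(path) else []
    return list(difflib.unified_diff(read(live), read(staging),
                                     fromfile="live/" + name,
                                     tofile="staging/" + name))

def local_update(staging_puppet, staging, live):
    """Model of argument-free authdns-local-update:
    staging-puppet/ -> staging/ -> (diff review) -> live config."""
    sync_tree(staging_puppet, staging)   # tier 1: puppet output -> staging
    diff = review_diff(staging, live)    # admin confirms this diff
    sync_tree(staging, live)             # tier 2: staging -> live config
    return diff

# Tiny demo with temporary directories:
base = tempfile.mkdtemp()
sp, st, lv = (os.path.join(base, d)
              for d in ("staging-puppet", "staging", "live"))
os.makedirs(sp)
with open(os.path.join(sp, "config"), "w") as f:
    f.write("new-option\n")
diff = local_update(sp, st, lv)
```

The point of the model: puppet only ever writes to its own staging tier; everything the live server sees passes through the reviewed staging directory.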

Notes on the negative tradeoffs in the process change:

  • CI will get slower (because it will be part of the normal puppet-repo CI runs which do lots of other things). This could be fixed on the CI end eventually with more infra, etc.
  • Updates under normal conditions are now a little slower to execute. We have to wait on a puppet-merge to complete, as well as a single puppet agent -t on the authdns server one chooses to run "authdns-update" from.
  • The list of dependencies for "normal conditions" grows: before it was that git/gerrit were working (as well as other basics like recdns, network, ssh). Now it's also that puppetmasters are working.

Neutral-to-positive-ish tradeoffs in the process change:

  • Updates are still guaranteed to stay in sync across servers - puppet's data is only referenced by one node and then shared to the others directly.
  • As before, a final gdnsd checkconfig is done before any attempted reload/restart, but things are simplified and safer. Before, the script would make backups of live config, push the new config into place, then run checkconfig. If checkconfig failed, it would restore the backups. The new method does the final checkconf on the staging directory contents before copying, and simply fails-in-place without copying on failure.
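The fail-in-place behavior can be sketched as follows, with an injectable validator standing in for "gdnsd checkconf" and plain temp directories for the staging/live paths (all names illustrative):

```python
import os
import shutil
import tempfile

def deploy(staging, live, checkconf):
    """Validate the staging contents *before* touching live config.
    On validation failure, return False and leave live untouched
    (fail-in-place) -- no backup/restore dance required."""
    if not checkconf(staging):
        return False
    if os.path.exists(live):
        shutil.rmtree(live)
    shutil.copytree(staging, live)
    return True

# Demo with temporary directories:
base = tempfile.mkdtemp()
staging = os.path.join(base, "staging")
live = os.path.join(base, "live")
os.makedirs(staging)
os.makedirs(live)
with open(os.path.join(live, "config"), "w") as f:
    f.write("old\n")
with open(os.path.join(staging, "config"), "w") as f:
    f.write("new\n")

# A failed checkconf leaves the live config untouched:
failed = deploy(staging, live, checkconf=lambda d: False)
```

Contrast with the old flow, where live config was already replaced before checkconfig ran, so a failure required restoring backups of the live tree.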

The methods for handling abnormal conditions are simpler and capable of handling a wider range of scenarios:

  • inter-server ssh access with the custom-deployed keys is only used for authdns-update's automated traversal of the server lists to trigger updates. It is not used to manually update one server from another using authdns-local-update (that uses anonymous rsync).
  • Scenario 1: All is normal, deploying a git change that was linted and reviewed:
    • On 1x puppetmaster: puppet-merge
    • On 1x authdns: authdns-update
  • Scenario 2: Just puppet and/or git is broken (but server set unchanged, inter-server ssh works fine, etc), and we need to do a critical dns config/data update during:
    • On 1x authdns: Manually edit dns config/data outputs (which is post-templating) in /srv/authdns/staging/
    • On same server: Run authdns-update --no-puppet (it does all the same review/validate/push logic as normal, but skips the agent run and the syncing of staging-puppet/ over staging/ at the beginning)
    • Recovery afterwards: Make whatever changes are necessary via normal git/gerrit (if just undoing the temporary changes, there's nothing to do here. If keeping them, upload a similar change). Run the normal update procedure. It will review/confirm a diff vs live (so if no new changes have been uploaded, it will show the diff that removes the temporary local changes. If the change was mirrored to git, it should confirm an empty diff).
  • Scenario 3: Worst-case - puppet/git are dead, puppet may have screwed up inter-authdns ssh keys, and/or we may be bringing up or failing over to new authdns instances that weren't even in the server list the last time puppet was able to run. Our only working assumptions: admin access to authdns servers, rsyncd still running on one of them, basic inter-node networking:
    • On 1x authdns: Manually edit the dns config/data outputs (which is post-templating) in /srv/authdns/staging/
    • On same server: Run authdns-local-update $(hostname -f); this does a puppet-less review/validate/push on the local server only.
    • On other authdns servers: Run authdns-local-update $other_server with the name of the server above, they'll pull from its reviewed rsync contents and deploy as normal. Relies on $other_server still having a working rsync export of the staging directory, but doesn't rely on the inter-node ssh keys or that any servers have pre-configured knowledge of each other.
    • This can also be done for split-brain scenarios: you can do this procedure to update one or more split subsets of servers from one in each split group.
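Scenario 3's pull-based propagation (and the in-sync guarantee noted in the tradeoffs above) can be modeled simply: every other server copies the one reviewed staging directory, so all end up byte-identical regardless of what puppet last told them. A sketch with shutil standing in for the anonymous rsync pull, and invented paths throughout:

```python
import os
import shutil
import tempfile

def pull_from_peer(peer_staging, local_staging, local_live):
    """Model of 'authdns-local-update $other_server': pull the peer's
    already-reviewed staging contents, then deploy them locally."""
    for d in (local_staging, local_live):
        if os.path.exists(d):
            shutil.rmtree(d)
    shutil.copytree(peer_staging, local_staging)  # anonymous rsync pull
    shutil.copytree(local_staging, local_live)    # deploy to live config

# Demo: one manually-edited server, two peers pulling from it.
base = tempfile.mkdtemp()
peer = os.path.join(base, "peer-staging")
os.makedirs(peer)
with open(os.path.join(peer, "zone"), "w") as f:
    f.write("edited\n")

contents = []
for name in ("server-b", "server-c"):
    st = os.path.join(base, name, "staging")
    lv = os.path.join(base, name, "live")
    pull_from_peer(peer, st, lv)
    contents.append(open(os.path.join(lv, "zone")).read())
```

Because the pull needs only a reachable read-only rsync export, it works even when ssh keys are broken or the puller was never in the configured server list.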

Event Timeline

BBlack claimed this task.

Resolving this, as recent work has fixed a lot of it (other than discovery issues specifically), and at this point all the text above is woefully outdated and pointing in wrong directions. We can file some smaller-scope tasks about remaining cleanups in this space.