
Automate generation of Management DNS records from Netbox
Open, NormalPublic

Description

Get to the "testing" phase of automated generation:

  • Read Netbox API
  • Dump includable records and reverse records for Management interfaces
  • Test and verify produced records against manually maintained records

Details

Related Gerrit Patches:
operations/software/netbox-deploy (master): Add script to generate DNS records from Netbox
operations/puppet (production): netbox: Setup automated DNS generation
operations/puppet (production): profile::authdns: Add automation framework

Event Timeline

crusnov created this task. Sep 18 2019, 3:49 AM
Restricted Application added a subscriber: Aklapper. Sep 18 2019, 3:49 AM

Change 537576 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/puppet@production] profile::authdns: Add automation framework

https://gerrit.wikimedia.org/r/537576

Change 537576 abandoned by CRusnov:
profile::authdns: Add automation framework

Reason:
We will not be deploying the system in this way.

https://gerrit.wikimedia.org/r/537576

Change 539013 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-deploy@master] Add script to generate DNS records from Netbox

https://gerrit.wikimedia.org/r/539013

Change 539182 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/puppet@production] netbox: Setup automated DNS generation

https://gerrit.wikimedia.org/r/539182

Volans added a subscriber: BBlack. Oct 3 2019, 10:44 AM

Moving discussion from https://gerrit.wikimedia.org/r/c/operations/software/netbox-deploy/+/539013 here (+ Brandon)

@Volans wrote:

We need to add more validation steps, not only on the gdnsd side but on the content, to ensure that each $ORIGIN has plausibly good data (at least N records, etc.).
The zone_validator check on the DNS side can only ensure that we have consistent data; it would not protect us from missing data.
We also probably need some validation that all generated files are referenced in the DNS repo and that all references point to an existing file. And then we need to define the procedure to add a new $ORIGIN block, to avoid chicken-and-egg issues.

In general it's taking quite a bit of effort to follow the logic of the whole thing. Is there a way to test it in a safe environment?

@crusnov wrote:

The actual generation step can be run from netbox1001; it uses the read-only token and outputs to a specified directory, so it is inherently safe. I'm still caught up on validation mechanics, so if there is a specific set of validations that you'd like to see, please enumerate them (I have sent Brandon an email going over my investigations and thoughts on this).
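
(For illustration only, not the actual script: a generation run of this shape, querying the Netbox REST API with a read-only token and printing forward/reverse mgmt records, could look roughly like the following. The endpoint, the query filter, the use of the dns_name field and the output layout are all assumptions.)

#!/usr/bin/env python3
# Illustrative sketch, not the actual generator: pull management IPs from the
# Netbox REST API with a read-only token and print forward/reverse record lines.
# The endpoint, the query filter and the output format are assumptions.
import ipaddress
import requests

NETBOX_API = "https://netbox.example.org/api"  # hypothetical endpoint
TOKEN = "read-only-token"                      # hypothetical read-only token


def mgmt_ip_addresses():
    """Yield (dns_name, ip) pairs for management IPs, following API pagination."""
    url = f"{NETBOX_API}/ipam/ip-addresses/?limit=250"  # a real filter would narrow this to mgmt
    headers = {"Authorization": f"Token {TOKEN}"}
    while url:
        page = requests.get(url, headers=headers).json()
        for entry in page["results"]:
            if entry.get("dns_name"):
                yield entry["dns_name"], ipaddress.ip_interface(entry["address"]).ip
        url = page["next"]


if __name__ == "__main__":
    for name, ip in mgmt_ip_addresses():
        label = name.split(".")[0]
        print(f"{label} IN A {ip}")                      # forward record, relative to $ORIGIN
        print(f"{ip.reverse_pointer}. IN PTR {name}.")   # matching reverse record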

Ok, I'll test it there, thanks.
As for the validations, here is a rough idea of a workflow that includes the addition of a new $ORIGIN. AFAIK gdnsd keeps the comments in the generated files too, and I'm using this as an assumption.

  • For each mgmt network with appropriate state (I guess container and active) in Netbox we generate a snippet file:
    • each snippet file that is empty *must not* be included in the DNS repo (this might not be needed in the end but seems a safe one to have for now)
    • for each snippet file we count the entries in the previously generated one and the entries in the new one and fail/alert/something if the delta in % is too high (threshold TBD)
    • for each snippet file, if possible, we get the count of hosts from Netbox and ensure that the generated record count is the same
    • probably some per-site check is good too; given that each DC has its own well-defined space, it should be easy to add (a sketch of these per-snippet checks follows this list)
  • In each file we add a comment at the top to identify it; it will probably be useful to add the generation datetime too
  • In the DNS repo we'll add includes in each mgmt $ORIGIN, including the corresponding snippet as the first line after the origin definition, to keep them close to the $ORIGIN definition.
    • each include *must* reference an existing snippet file that *is not* empty
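
(A minimal sketch of the per-snippet checks listed above, under the assumption that the previous run is kept around for comparison; the paths, the delta threshold and the glob pattern are placeholders.)

#!/usr/bin/env python3
# Illustrative sketch of the per-snippet sanity checks discussed above.
# Paths, the delta threshold and the expected-count source are assumptions.
import sys
from pathlib import Path

NEW_DIR = Path("/tmp/netbox-dns-new")       # hypothetical freshly generated output
OLD_DIR = Path("/srv/netbox-dns-exports")   # hypothetical previous output
MAX_DELTA_PCT = 20                          # threshold TBD, per the discussion


def count_records(path):
    """Count non-empty, non-comment lines in a snippet file."""
    if not path.exists():
        return 0
    return sum(1 for line in path.read_text().splitlines()
               if line.strip() and not line.lstrip().startswith(";"))


def check_snippet(new_file, expected_from_netbox=None):
    errors = []
    new_count = count_records(new_file)
    old_count = count_records(OLD_DIR / new_file.name)
    if new_count == 0:
        errors.append(f"{new_file.name}: generated snippet is empty")
    if old_count:
        delta_pct = abs(new_count - old_count) * 100 / old_count
        if delta_pct > MAX_DELTA_PCT:
            errors.append(f"{new_file.name}: record count changed by {delta_pct:.0f}%")
    if expected_from_netbox is not None and new_count != expected_from_netbox:
        errors.append(f"{new_file.name}: {new_count} records vs {expected_from_netbox} hosts in Netbox")
    return errors


if __name__ == "__main__":
    failures = [e for snippet in sorted(NEW_DIR.glob("mgmt-*")) for e in check_snippet(snippet)]
    if failures:
        print("\n".join(failures))
        sys.exit(1)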

All the include checks could be done on the generated files, using the comments at the top of each snippet, or in the source files, checking the includes. Let's see what Brandon prefers based on the gdnsd update workflow.
Those are for the final state.
For the transition period I'd say that we could have something that checks that each mgmt entry in the DNS repo is present in the appropriate snippet and that each snippet entry is present in the DNS repo.
My 2 cents.

After going over the code a bit, I'm not fully convinced that zone_validator is exactly what we want in the transitional stage. Since we're generating many of the same records that are already present in the DNS repository, it complains heavily when presented with the repository plus the generated records (as would be expected). Anyway, work is ongoing.

I'm not sure I'm following: of course when a snippet file is included we'll need to remove the corresponding entries from the repo, otherwise we end up with duplicated entries. So whatever comparison mechanism we decide on, the version that includes the snippets will need to have the hardcoded entries removed.

Thanks for the extensive feedback & validation suggestions! I'll see what I can come up with.

BBlack added a subscriber: jbond. Oct 3 2019, 6:50 PM

I've been pushing this to my back burner for a few days because it's complicated. My current $0.03 on all related things:

$ORIGIN issues and empty $INCLUDE files, etc

... aren't a huge deal, I think. We have mechanisms for dealing with relative/absolute origin-switching in the outer including file, even for aliased zones. I'm guessing we're not at a point (and may never be) where we'd say a whole sub-domain is owned by Netbox exclusively, and there may always need to be room for manual add-ons. I don't think pre-defining currently-empty include files and deploying them as empty files is actually an issue, but I will have to double-check that (worst case, it should be fine if the "empty" include file just contains a comment header). I'm picturing an include scenario like:

The "static" wmnet zone file in ops/dns:

[...]
$ORIGIN mgmt.eqiad.wmnet.
$INCLUDE netbox-exports/mgmt-eqiad-wmnet
foo IN A 192.0.2.1 ; some necessary corner-case manual record

$ORIGIN mgmt.codfw.wmnet.
$INCLUDE netbox-exports/mgmt-codfw-wmnet
; codfw doesn't have any manual records, but same pattern

$ORIGIN whatever
[...]

On the generation and validation of the data on the netbox host

I think there should be some level of validation when the output zonefile fragments are generated on the netbox host. To put that in concrete terms, I would expect a flow something like:

  1. Netbox activity triggers a regeneration because of a data change.
  2. Some script pulls data from netbox and generates all the zonefile fragments to some temporary output directory.
  3. Some kind of sanity-check / validation happens on that temporary output (and if it fails, the process stops here and then we can talk about surfacing and reacting to the issue...)
  4. The sanity-checked outputs are moved atomically into place as the new netbox output (moved into some persistent output directory, say /srv/netbox-dns-exports/ or whatever it is, which is the canonical output space that other parts of this process operate on).
  5. Possibly at this point, trigger some distribution mechanism.
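
(A rough sketch of steps 2-4 above, assuming the canonical output directory just mentioned; the generate() and validate() callables are placeholders for the real steps.)

#!/usr/bin/env python3
# Sketch of the generate / validate / swap-into-place flow described above.
# The path and the generate()/validate() callables are placeholders.
import os
import shutil
import tempfile

CANONICAL = "/srv/netbox-dns-exports"   # canonical output space (name is an assumption)


def regenerate(generate, validate):
    # Step 2: write all fragments into a temp dir on the same filesystem as the target.
    tmpdir = tempfile.mkdtemp(prefix="netbox-dns-", dir=os.path.dirname(CANONICAL))
    generate(tmpdir)
    # Step 3: sanity-check the temporary output; stop here on failure.
    if not validate(tmpdir):
        raise RuntimeError(f"validation failed, leaving {tmpdir} for inspection")
    # Step 4: move the validated output into place, keeping the previous run aside.
    previous = CANONICAL + ".previous"
    shutil.rmtree(previous, ignore_errors=True)
    if os.path.isdir(CANONICAL):
        os.rename(CANONICAL, previous)
    os.rename(tmpdir, CANONICAL)
    # Step 5: a distribution mechanism could be triggered from here.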

On the subject of how we validate those files on the netbox host before moving them into the canonical location: mocking their surroundings is probably unreliable and fraught with issues. It's probably better to run the CI from the main ops/dns repo against the combined dataset here on the netbox host. In other words, the validation step means getting a fresh checkout of ops/dns, deploying it plus the newly-generated outputs into a temporary directory structure together, and running utils/deploy_check.py on that (which requires a non-running installation of the gdnsd package on the netbox host and such), which I think is an acceptable trade-off. This is basically the same thing that will happen at the end of the rest of the machinery that manages deploying them together, so sharing CI/validation here saves a lot of pain and mismatch.
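
(One possible shape for that validation step, under the assumption that the generated fragments are copied into a fresh ops/dns checkout before running its utils/deploy_check.py; the clone URL, the destination subdirectory and the script invocation are guesses. This could serve as the validate() callable in the previous sketch.)

#!/usr/bin/env python3
# Sketch of validating fresh fragments against a checkout of the ops/dns repo.
# The clone URL, the fragment destination and the deploy_check.py invocation are guesses.
import subprocess
import tempfile
from pathlib import Path

DNS_REPO = "https://gerrit.wikimedia.org/r/operations/dns"   # assumed clone URL
FRAGMENTS = Path("/tmp/netbox-dns-new")                      # freshly generated output


def validate_against_dns_repo(fragments=FRAGMENTS):
    with tempfile.TemporaryDirectory() as workdir:
        checkout = Path(workdir) / "dns"
        subprocess.run(["git", "clone", "--depth=1", DNS_REPO, str(checkout)], check=True)
        # Drop the generated fragments where the zonefiles would $INCLUDE them (assumed path).
        dest = checkout / "templates" / "netbox-exports"
        dest.mkdir(parents=True, exist_ok=True)
        for snippet in fragments.iterdir():
            (dest / snippet.name).write_text(snippet.read_text())
        # Run the repo's own CI entry point against the combined dataset.
        result = subprocess.run(["python3", "utils/deploy_check.py"], cwd=checkout)
        return result.returncode == 0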

There could be additional pre/post-validation you'd want to do that's more netbox-specific as well (e.g. the stuff mentioned earlier - audit/cross-check host counts, look for massive change percentages, look for empty outputs where unexpected).

On distribution, push/pull, integration with static data, etc

There's a potential chicken-and-egg between defining new include fragments and setting up their inclusion in the ops/dns repo. I think this part works out fine as long as we adhere to a simple workflow: always define the new output file on the netbox side (hopefully with minimal or zero contents) before setting up the $INCLUDE on the dns repo side, and things will work fine even if the initial generation of the new file is effectively un-checked at the DNS level on the netbox side of things for lack of inclusion (yet).

On the push-vs-pull front, an important factor to keep in mind is that our set of public authdns servers is likely to expand dramatically in count: one can imagine 2-3 of them per site, and more sites coming online in the future. Design-wise, think about a future world like "I expect there could be ~24 public authdns hosts across 8 wide-area sites", versus the 3-ish we have today. Fixing some related issues with the current authdns-update is already on my short-to-medium-term radar (making it operate in parallel with less-verbose output in non-failing cases, etc.). So this affects one's thinking a bit, too (but then see also further below about splitting up authdns...).

At this point, it's helpful to look at this through the lens of what happens today on each authdns server as authdns-update rolls through the (currently small) fleet with the static data updates. It's a little complicated to break down, but it goes something like this:

  1. A human runs authdns-update on any authdns server of their choosing (we'll call this the temporary "deployment master" for this one execution)
  2. authdns-update is fairly simple: it basically runs ssh $deployment_master authdns-local-update (yes, to itself, to make sure the ssh execution environment is the same), then iterates over all the remote authdns hosts and runs ssh $other_authdns authdns-local-update $deployment_master.
  3. authdns-local-update, as executed on the deployment master with no argument, does a pull from the upstream ops/dns gerrit repo to a local git clone, reviews the diffs (interactively, puppet-merge style), copies the git data plus other locally-deployed data (e.g. puppet-deployed Discovery DNS files) into a temporary directory, runs the CI checks against this combination of data, and then deploys the data from the temporary directory to the live production dataset on this host. If this fails, the other remotes aren't touched at all (and neither is the local live running instance/data).
  4. authdns-local-update on the other hosts, given the $deployment_master argument of the host being deployed from, does almost the exact same thing, but (a) skips the human diff-review step and (b) pulls its git clone directly from the deployment master's git checkout instead of the real upstream gerrit. We do it this way (vs having them all pull git directly) so that we can be sure that a single run of authdns-update deploys the same data to all the hosts (vs racing with several quick updates to the upstream git repo, where one person's update run might deploy different heads to different servers).
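
(For readability, the flow just described, restated as a simplified sketch; the real authdns-update is shell tooling and differs in detail, and the host list is hypothetical.)

#!/usr/bin/env python3
# Simplified restatement of the authdns-update flow described above, for reading
# convenience only; the real tooling is shell and differs in detail.
import subprocess

AUTHDNS_HOSTS = ["authdns-a.example", "authdns-b.example", "authdns-c.example"]  # hypothetical


def authdns_update(deployment_master):
    # Run the local update on the deployment master first, via ssh to itself,
    # so the execution environment matches the remote runs.
    subprocess.run(["ssh", deployment_master, "authdns-local-update"], check=True)
    # Then roll through the remaining hosts, pointing them at the deployment master
    # so they pull the exact same data it just deployed.
    for host in AUTHDNS_HOSTS:
        if host == deployment_master:
            continue
        subprocess.run(["ssh", host, "authdns-local-update", deployment_master], check=True)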

In my mind there's two basic directions you can go from here to integrate this with the netbox-driven updates and files:

  1. Add some new data source to the above system so that it can pull the latest netbox files from the netbox server, maybe via scp (and, similar to what we do with git, we'd pull them directly only on the temporary deployment master, then sync them from there to the others via e.g. scp for consistency), and have the final step of the netbox regeneration ssh to an authdns box and kick off a non-human-reviewed authdns-update. The downside here is that an automated run may deploy an ops/dns change that a human was about to deploy, but we could perhaps add an argument to skip the git pull part and just do the rest, for these automated runs driven by netbox.
  2. We could alternatively have authdns-update treat the netbox-exported files as "something else deployed on this host outside of my control, which I need to pull in for local verification", just like the Discovery DNS data currently pushed by puppet, and have a completely different set of scripts/mechanisms for getting those files pushed onto the hosts. This could be cleaner and more modular (and it's definitely ok to have multiple actors running concurrent gdnsd reload/zones-reload commands, etc.), but it might result in a lot of duplication of code/effort versus option 1? I tend to think option (1) lends itself well to sharing a lot more validation/code between all of these things, including the local verification check against ops/dns on the netbox host before the push (or pull) with authdns.

Other Directions and Loose ends

  • It might be nice in general, for all kinds of reasons, for the netbox output (the set of zonefile fragment files) to have some history and some revert-ability, instead of just being a single-shot output directory of whatever happened in the most recent run. You could do a simple version of this just by keeping historical copies (e.g. the deploy/verify script copies old datasets to /srv/netbox-history/2019-09-09T13:44:21/ or whatever, and manages cleanup of ancient ones, etc.). You could also (yes, I hear you groaning) store the outputs as new commits to a local git repo on the host, which only ever gets these automated commits, which also makes for an easier sync mechanism than scp, perhaps. One could even imagine (here comes the bigger groan) simply having this repo (which isn't in gerrit, although I guess it could be mirror-pushed up there as a secondary place) as a submodule of the ops/dns repo, so that the deployment master then naturally pulls in that subdirectory of data...
  • There has been talk before (hi @jbond ) about splitting our authdns cluster up, which I think is a totally sane idea to explore. The basic shape of these discussions has been the idea that the existing public authdns cluster carries on like before, but only handles the truly-public domains like wikipedia.org, whereas wmnet and such live in a separate authdns cluster which is entirely internal and maybe only lives in the core DCs, and possibly runs completely-different and more generically-capable software like PowerDNS or Knot or BIND or whatever, and potentially unlocks some features that gdnsd will probably never have (like Dynamic DNS support and simple zone data slaving, and native support for certain esoteric RR types, ....). There's some wiggle-room on details of this (e.g. where does wikimedia.org land with all its netboxy things?) which maybe are really highlighting orthogonal issues about naming changes we should be making anyways (wikimedia.net anyone? .. but then also are we stalling to see the output of the Branding discussions first for any impact there?). Bottom line is that if this notion still has traction, it may have an outsized impact on all the above (maybe makes it easier, or harder).
Restricted Application added a project: Operations. Oct 3 2019, 6:51 PM
Volans added a comment. Oct 6 2019, 5:02 PM

Thanks @BBlack for the very detailed and precise summary.

$ORIGIN issues and empty $INCLUDE files, etc

+1

On the generation and validation of the data on the netbox host

I agree here too; I was thinking more or less the same: reuse the existing CI, using a local gdnsd package to perform the full CI run as it does in production when running authdns-update.

  1. Netbox activity triggers a regeneration because of a data change.
  2. Some script pulls data from netbox and generates all the zonefile fragments to some temporary output directory.

We probably need some "lock" here to make sure concurrent data changes don't step on each other with parallel runs of the regeneration. Alternatively we could start with a timer that runs this every X minutes and continues only if there is any data change.
Going forward there is also the possibility that the addition of a new host into Netbox will be driven by some cookbook to simplify the whole operation; in that case the same cookbook could trigger the generation, which at that point would be human-triggered, also resolving some of the security concerns.
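
(The "lock" mentioned here could be as simple as a non-blocking exclusive flock around the regeneration; a minimal sketch, with an arbitrary lock path and regenerate() standing in for the real work.)

#!/usr/bin/env python3
# Sketch of serializing regeneration runs with an exclusive file lock.
# The lock path is arbitrary; regenerate() stands in for the real generation step.
import fcntl
import sys

LOCK_FILE = "/var/lock/netbox-dns-regenerate.lock"   # hypothetical path


def run_exclusively(regenerate):
    with open(LOCK_FILE, "w") as lock:
        try:
            # Non-blocking: if another run holds the lock, bail out instead of queueing.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("another regeneration is already running, skipping")
            sys.exit(0)
        regenerate()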

On distribution, push/pull, integration with static data, etc

There's a potential chicken-and-egg between defining new include fragments and setting up their inclusion in the ops/dns repo. I think this part works out fine as long as we adhere to a simple workflow: always define the new output file on the netbox side (hopefully with minimal or zero contents) before setting up the $INCLUDE on the dns repo side, and things will work fine even if the initial generation of the new file is effectively un-checked at the DNS level on the netbox side of things for lack of inclusion (yet).

Totally agree with you: first generate the files on netbox and then add the include. The validation that makes sure all generated files are included might either be relaxed to a warning (given it should only be temporary before the inclusion in the zonefile) or alert only if the new snippet is not included in the zonefiles after X days/weeks (TBD how to check it).

In my mind there's two basic directions you can go from here to integrate this with the netbox-driven updates and files:

  1. Add some new data source to the above system so that it can pull the latest netbox files from the netbox server, maybe via scp (and, similar to what we do with git, we'd pull them directly only on the temporary deployment master, then sync them from there to the others via e.g. scp for consistency), and have the final step of the netbox regeneration ssh to an authdns box and kick off a non-human-reviewed authdns-update. The downside here is that an automated run may deploy an ops/dns change that a human was about to deploy, but we could perhaps add an argument to skip the git pull part and just do the rest, for these automated runs driven by netbox.
  2. We could alternatively have authdns-update treat the netbox-exported files as "something else deployed on this host outside of my control, which I need to pull in for local verification", just like the Discovery DNS data currently pushed by puppet, and have a completely different set of scripts/mechanisms for getting those files pushed onto the hosts. This could be cleaner and more modular (and it's definitely ok to have multiple actors running concurrent gdnsd reload/zones-reload commands, etc.), but it might result in a lot of duplication of code/effort versus option 1? I tend to think option (1) lends itself well to sharing a lot more validation/code between all of these things, including the local verification check against ops/dns on the netbox host before the push (or pull) with authdns.

I agree with (1) if we skip any human-made modifications, to avoid deploying those automatically and without human supervision.

Other Directions and Loose ends

  • It might be nice in general, for all kinds of reasons, for the netbox output (the set of zonefile fragment files) to have some history and some revert-ability, instead of just being a single-shot output directory of whatever happened in the most recent run. You could do a simple version of this just by keeping historical copies (e.g. the deploy/verify script copies old datasets to /srv/netbox-history/2019-09-09T13:44:21/ or whatever, and manages cleanup of ancient ones, etc.). You could also (yes, I hear you groaning) store the outputs as new commits to a local git repo on the host, which only ever gets these automated commits, which also makes for an easier sync mechanism than scp, perhaps. One could even imagine (here comes the bigger groan) simply having this repo (which isn't in gerrit, although I guess it could be mirror-pushed up there as a secondary place) as a submodule of the ops/dns repo, so that the deployment master then naturally pulls in that subdirectory of data...

+1 for the history; it was more or less already in the plan in some form. TBD how much history we want. A local git repo, to which only the generation process commits and which is replicated across the two Netbox hosts, might be a good solution; nothing against it from me 😉

  • There has been talk before (hi @jbond ) about splitting our authdns cluster up, which I think is a totally sane idea to explore. The basic shape of these discussions has been the idea that the existing public authdns cluster carries on like before, but only handles the truly-public domains like wikipedia.org, whereas wmnet and such live in a separate authdns cluster which is entirely internal and maybe only lives in the core DCs, and possibly runs completely-different and more generically-capable software like PowerDNS or Knot or BIND or whatever, and potentially unlocks some features that gdnsd will probably never have (like Dynamic DNS support and simple zone data slaving, and native support for certain esoteric RR types, ....). There's some wiggle-room on details of this (e.g. where does wikimedia.org land with all its netboxy things?) which maybe are really highlighting orthogonal issues about naming changes we should be making anyways (wikimedia.net anyone? .. but then also are we stalling to see the output of the Branding discussions first for any impact there?). Bottom line is that if this notion still has traction, it may have an outsized impact on all the above (maybe makes it easier, or harder).

This should mostly affect only the distribution part, so I don't think it would be a big effort if we change something in that space in the future, unless we want to move towards dynamic notification of the DNS servers when the data changes.

Some other aspects to take into account:

  • Jenkins CI: to be as effective as it is now, it should be able to pull the latest real Netbox-generated data too; just mocking it seems to significantly reduce the benefit of checks like zone_validator, IMHO
  • public vs. private data: currently all this data is public in the DNS repo; moving it to Netbox would effectively make it private. We could of course expose this data. If exposed publicly, the CI problem above would be solved automatically.
Krenair added a subscriber: Krenair. Oct 6 2019, 5:21 PM

After discussing this a bit and thinking about it quite a lot, I'm strongly in favor of a machine-managed git repo for the generated side. This has a nice side benefit: we can easily expose it to the network via HTTPS on the Netbox servers (and thus both publicly and to the DNS servers).

I have implemented some git repository manipulation in the prototype generator script, which I'm pretty happy with and which would facilitate this.
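
(Not the prototype script itself, but git manipulation of this shape is straightforward; a minimal sketch, with a made-up working-copy path and commit message.)

#!/usr/bin/env python3
# Illustrative only: commit freshly generated snippets to a local git repo.
# This is not the prototype script mentioned above; path and message are made up.
import subprocess
from pathlib import Path

REPO = Path("/srv/netbox-dns-exports")   # hypothetical working copy owned by the automation


def commit_generated(message="Automated DNS snippet update from Netbox"):
    def git(*args):
        return subprocess.run(["git", "-C", str(REPO), *args],
                              check=True, capture_output=True, text=True)

    git("add", "--all")
    # Only create a commit if something actually changed (diff --cached --quiet exits 1 on changes).
    if subprocess.run(["git", "-C", str(REPO), "diff", "--cached", "--quiet"]).returncode != 0:
        git("commit", "-m", message)
    return git("rev-parse", "HEAD").stdout.strip()   # SHA1 of the current tip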

ema moved this task from Triage to Watching on the Traffic board. Oct 14 2019, 6:32 PM
jbond moved this task from Unsorted 💣 to Watching 👀 on the User-jbond board. Wed, Oct 30, 6:07 PM
Volans added a comment. Tue, Nov 5, 7:43 PM

@BBlack the current proposal is:

  • On the netbox host(s) there will be a script to generate the snippet files, which will perform some standalone validation of the snippets themselves
  • The script will then commit those files into a local git repository
  • The local git repository is automatically synced with the other Netbox host for redundancy like the private puppet repo (post-commit/receive hooks via ssh)
  • The SHA1 of the commit is written to etcd
  • The local git repository is exposed via HTTPS (read only) so then:
    • CI can then clone the repo and use it to continue to perform CI on the dns repo as it's doing right now (always HEAD)
    • the authdns servers would watch the etcd key and, when the SHA1 changes, pull the new commit, perform validation, and reload

Thoughts?
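
(For the "SHA1 is written to etcd" step, a minimal sketch assuming an etcd v2-style HTTP keys API; the endpoint and key path are made up.)

#!/usr/bin/env python3
# Sketch of publishing the new commit SHA1 to etcd after the local git commit.
# Assumes an etcd v2-style HTTP keys API; endpoint and key path are made up.
import requests

ETCD_URL = "https://etcd.example.org:2379/v2/keys/dns/netbox-snippets/sha1"  # hypothetical


def publish_sha1(sha1):
    # etcd v2 keys API: a PUT with a form-encoded "value" creates or updates the key.
    resp = requests.put(ETCD_URL, data={"value": sha1}, timeout=5)
    resp.raise_for_status()
    return resp.json()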

Volans triaged this task as Normal priority. Tue, Nov 5, 7:43 PM
Volans moved this task from Backlog to In Progress on the SRE-tools board. Wed, Nov 6, 3:18 PM

Seems sane! The only thing I'm a little iffy about is from the "SHA1 written to etcd" step onwards. I'm not sure it's a bad approach, but I'm not sure I've thought through all the implications either. I think the key thing to think about in that part of the flow, which might be missing, is emergency updates to the netbox-defined data, e.g. if all the things are borked and we need to manually edit a DNS entry in the netbox-derived zonefile fragments... is there a way we can do that from the authdns servers with authdns-update? (e.g. a local commit and an override of the SHA1 argument?)

Seems sane! The only thing I'm a little iffy about is from the "SHA1 written to etcd" step onwards. I'm not sure it's a bad approach, but I'm not sure I've thought through all the implications either. I think the key thing to think about in that part of the flow, which might be missing, is emergency updates to the netbox-defined data, e.g. if all the things are borked and we need to manually edit a DNS entry in the netbox-derived zonefile fragments... is there a way we can do that from the authdns servers with authdns-update? (e.g. a local commit and an override of the SHA1 argument?)

Thanks for the feedback.

We should have updated this ticket after a discussion in the foundations meeting: we've essentially settled on the push of the change being a manual step, similar to authdns-update, run from a cumin server or the like. Everything before the etcd/automated part would essentially remain the same.

Seems sane! The only thing I'm a little iffy about is from the "SHA1 written to etcd" step onwards. I'm not sure it's a bad approach, but I'm not sure I've thought through all the implications either. I think the key thing to think about in that part of the flow, which might be missing, is emergency updates to the netbox-defined data, e.g. if all the things are borked and we need to manually edit a DNS entry in the netbox-derived zonefile fragments... is there a way we can do that from the authdns servers with authdns-update? (e.g. a local commit and an override of the SHA1 argument?)

So, the script that generates the snippets will (for now) be manually triggered for security reasons, and everything else after that will be automated.
The only part that is not totally clear to me so far, and on which I'd like input, is the authdns-update part.
As every authdns host will independently pull the new data when the SHA1 changes in etcd, and given that the snippets will not use (for now?) any Jinja templating, I'm wondering whether each gdnsd could independently just do a simple reload when the SHA1 changes, without running authdns-update; or whether it is important to run authdns-update anyway, and in that case how we ensure it runs on only one host and waits until all hosts have fetched the new SHA1 locally.
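
(A sketch of the per-host watcher being discussed: poll the etcd key and, on change, check out that commit and reload. The endpoint, clone path and reload step are placeholders; whether a plain reload is sufficient is exactly the open question above.)

#!/usr/bin/env python3
# Sketch of a per-authdns-host watcher: poll the etcd key and, when the SHA1
# changes, pull that commit and reload. Endpoint, paths and reload step are placeholders.
import subprocess
import time

import requests

ETCD_URL = "https://etcd.example.org:2379/v2/keys/dns/netbox-snippets/sha1"  # hypothetical
CLONE = "/srv/git/netbox-dns-snippets"                                       # hypothetical


def watch(reload_zones, interval=30):
    current = None
    while True:
        sha1 = requests.get(ETCD_URL, timeout=5).json()["node"]["value"]
        if sha1 != current:
            subprocess.run(["git", "-C", CLONE, "fetch", "origin"], check=True)
            subprocess.run(["git", "-C", CLONE, "checkout", sha1], check=True)
            reload_zones()   # validation plus whatever makes gdnsd pick up the new zones
            current = sha1
        time.sleep(interval)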

As for failure scenarios, here are some examples/options:

  • In case of borked data in Netbox, we could do a local commit on the netbox host's "master" copy of the git repo and push a new SHA1 to etcd, and everything will roll out automatically
  • In case of a more borked situation in which the authdns servers are not able to talk to the Netbox-hosted repo of snippets, we can do one of the following:
    • a local change on all the authdns servers plus an authdns-update
    • if the authdns servers keep the capability to ssh between each other: commit the change locally on one, push/pull it to the others, and then run authdns-update
    • have in etcd both the SHA1 of the commit and the URL to pull from, which by default would be the netbox host via HTTPS but could be changed to any of the authdns servers via SSH. In this case the procedure would be to change the remote URL in etcd plus a local commit on one of the authdns servers.

A simplified version could be to use a cookbook to tie things together:

  • have a script on the netbox hosts that generates the snippets from the API and saves them locally, committing them to the local working copy
  • have a cookbook that sshes into the netbox host, runs a git diff between the remote (the bare repo) and the local checkout, ensures there aren't any unexpected local modifications, and shows the user the diff, asking for confirmation
  • the user confirms or aborts
  • the cookbook then pushes the local checkout to the local remote (the bare repo)
  • it sshes into the authdns servers, git-pulls the new SHA1, and then does whatever is needed to make gdnsd reload the zonefiles (implementation TBD)
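
(A compact sketch of the cookbook flow listed above: show the pending diff on the Netbox host, ask for confirmation, push, then roll out to the authdns hosts. Host names, paths and the reload command are placeholders; a real cookbook would use the existing SRE automation tooling rather than raw ssh.)

#!/usr/bin/env python3
# Compact sketch of the cookbook flow listed above. Host names, paths and the
# reload command are placeholders, not real infrastructure names.
import subprocess

NETBOX_HOST = "netbox-host.example"        # placeholder
AUTHDNS_HOSTS = ["authdns-a.example"]      # placeholder
REPO = "/srv/netbox-dns-exports"           # placeholder working copy


def ssh(host, *cmd):
    return subprocess.run(["ssh", host, *cmd], check=True, capture_output=True, text=True)


def run_cookbook():
    # Show the diff between the local working copy and the bare repo it pushes to.
    diff = ssh(NETBOX_HOST, "git", "-C", REPO, "diff", "origin/master").stdout
    if not diff:
        print("nothing to deploy")
        return
    print(diff)
    if input("push and deploy this diff? [y/N] ").lower() != "y":
        print("aborted")
        return
    ssh(NETBOX_HOST, "git", "-C", REPO, "push", "origin", "HEAD")
    for host in AUTHDNS_HOSTS:
        ssh(host, "git", "-C", "/srv/git/netbox-dns-snippets", "pull")
        ssh(host, "reload-dns-zones")   # placeholder for "whatever reloads gdnsd"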

Change 539182 abandoned by CRusnov:
netbox: Setup automated DNS generation

Reason:
This plan has been superseded by a separate deployment plan, see T233183.

https://gerrit.wikimedia.org/r/539182