TL;DR: we currently depool sites in DNS by managing the admin_state file through Git in the operations/dns repository. We plan to move this to confctl to allow quicker pool/depool operations and to streamline the process overall, removing the need for Git commits for both planned and unplanned depools.
Input is welcome as always -- we have not finalized this, but we plan to start work on it soon.
Problem Statement
Currently, we manage the admin_state file through Git in the operations/dns repository. This means that in case of a site depool, we need to perform the following steps:
- create a patch that sets the site in question to be depooled in admin_state
- wait for CI to finish (the change is, after all, written by hand, and sometimes during an emergency)
- merge the patch and then run authdns-update to finalize the change
The above takes time, and in some past cases the bystander effect kicked in, with each of us waiting for someone else to create the patch and push it via authdns-update. The TTL for dyna.wikimedia.org is 5 minutes, so most traffic should start failing over to the next site within that time. But the time to create the patch to depool a site and push it through authdns-update can exceed the dyna TTL itself *, which is not ideal.
* - yes, this has actually happened
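To make the timing concern concrete, here is a minimal sketch that compares the time needed to publish a depool with the dyna TTL. Only the 5-minute TTL comes from the text above; the per-step durations and the function names are hypothetical placeholders for illustration, not measurements.

```python
DYNA_TTL_SECONDS = 5 * 60  # TTL for dyna.wikimedia.org, as stated above


def publish_seconds(steps):
    """Total operator time before the DNS change is live."""
    return sum(steps.values())


def worst_case_failover_seconds(steps, ttl=DYNA_TTL_SECONDS):
    """Rough bound: publish time plus one full TTL, for a resolver
    that cached the old answer just before the change went live."""
    return publish_seconds(steps) + ttl


# Hypothetical durations in seconds, for illustration only.
git_workflow = {
    "create_patch": 180,
    "wait_for_ci": 120,
    "merge_and_authdns_update": 120,
}

if __name__ == "__main__":
    total = publish_seconds(git_workflow)
    print(f"publish time: {total}s, exceeds TTL: {total > DYNA_TTL_SECONDS}")
    print(f"worst-case failover: {worst_case_failover_seconds(git_workflow)}s")
```

Under these assumed numbers the human steps alone take longer than the TTL, which is the situation described above: the TTL stops being the limiting factor and the workflow becomes the bottleneck.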
Most of our depools are planned and can be safely managed via Git. But unplanned and emergency depools should be part of our workflow and the Git-based system does not provide an easy way for us to handle them quickly.
Solution
We should move the state management of admin_state to confctl/confd. This will allow us to depool a site without any Git changes, making depools faster and more efficient when required, especially unplanned ones. The depool steps above can then be reduced to:
confctl select dc=eqiad,cluster=geodns set/pooled=no
... with more control, such as for text-addrs, upload-addrs, etc. and perhaps a wrapper for the above command to make it more succinct.
There should be no need to separately run authdns-update manually.
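As a rough idea of what the wrapper mentioned above could look like, the sketch below only builds the confctl command line shown earlier from a site name and a target state. The function names, the input validation, and the restriction to yes/no are assumptions made for illustration; the cluster=geodns selector is taken from the example above and the final object layout is not yet decided.

```python
import shlex

# Assumption: only these two states are needed for site pooling.
VALID_STATES = ("yes", "no")


def build_confctl_command(site, pooled, cluster="geodns"):
    """Return the confctl argv that sets the pooled state of a site."""
    if pooled not in VALID_STATES:
        raise ValueError(f"pooled must be one of {VALID_STATES}, got {pooled!r}")
    if not site.isalnum():
        raise ValueError(f"unexpected site name: {site!r}")
    selector = f"dc={site},cluster={cluster}"
    return ["confctl", "select", selector, f"set/pooled={pooled}"]


def depool_site_command(site):
    """Hypothetical shorthand for the common depool case."""
    return build_confctl_command(site, "no")


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(depool_site_command("eqiad")))
```

Returning an argument list rather than a shell string keeps quoting out of the picture if the wrapper later runs the command via subprocess.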
Other Notes
- We want to keep the schema for this under the Node definition, which means there is no way for us to provide a reason for a change as we currently do in admin_state by adding a comment above the DOWN line (see example below). To address that, see Idf4d0b85, where we introduce support for passing a reason string in confctl. This means that when depooling a site, we can pass the reason and it will be logged to IRC and SAL.
- The alternative, if we really want to have a reason line in the file itself (if the SAL is not enough), would then involve coming up with a custom schema, which might not be ideal for other reasons.
# T365763: eqsin text cluster drive upgrade <-- reason
geoip/generic-map/eqsin => DOWN
- All state will be managed via confctl going forward. There was some feedback on allowing the state to be managed via both confctl and Git but we have decided against that.
- The DNS hosts themselves and the states of various services (recdns, ntp, authdns-update) are managed via confctl. This means that we will have to factor in those states as well when this is migrated.
- Example: if a DNS host is depooled for authdns-update, how should it handle admin_state? The other services should not matter here, but what about authdns-nsX? (Possibly the same way we handle them after a host is depooled -- running authdns-update manually to bring in the new changes).
- Are there any cookbook interactions around this? We will need to check.