Page MenuHomePhabricator

Migrate DNS depooling of sites from operations/dns (git) to confctl
Closed, ResolvedPublic

Description

TL;DR: we are currently DNS depooling sites and managing the state of the admin_state file through Git and in the operations/dns repository. We are planning to move this to confctl to allow for quicker pool/depool and in general streamline this process, alleviating the need for Git commits for both planned and unplanned depools.

Inputs welcome as always -- we have not finalized this but we plan to start work on this soon.

Problem Statement

Currently, we manage the admin_state file through Git in the operations/dns repository. This means that in case of a site depool, we need to perform the following steps:

  • create a patch that sets the site in question to be depooled in admin_state
  • wait for CI to finish (the change is manually added after all and sometimes during an emergency)
  • merge the patch and then run authdns-update to finalize the change

The above takes some time and in some past cases, bystander effect kicked in where we are waiting on each other to create the patch and push it via authdns_update. The TTL for the dyna.wikimedia.org is 5 minutes so most traffic should start falling over to the next site within that time. But the time to create the patch to depool a site and push it through authdns-update can exceed the dyna TTL itself * and this is not ideal.

* - yes, this has actually happened

Most of our depools are planned and can be safely managed via Git. But unplanned and emergency depools should be part of our workflow and the Git-based system does not provide an easy way for us to handle them quickly.

Solution

We should move the state management of admin_state and control it via confctl/confd. This willl allow us to depool a site without any Git changes, allowing for faster and more efficient depooling of sites when required, especially in cases of unplanned depools. The above depool steps can then look like:

confctl select dc=eqiad,cluster=geodns set/pooled=no

... with more control, such as for text-addrs, upload-addrs, etc. and perhaps a wrapper for the above command to make it more succinct.

There should be no need to separately run authdns-update manually.

Other Notes

  • We want to keep the schema for this under the Node definition and that means there is no way for us to provide a reason for a change and as we currently do in admin_state by adding a comment above the DOWN line (see example below). To address that, see Idf4d0b85 where we introduce support for passing a reason string in confctl. Which means when depooling a site, we can pass the reason and it will be logged to IRC and SAL.
    • The alternative if we really want to have a reason line in the file itself ((if the SAL is not enough) would then involve coming up with a custom schema, which might not be ideal for other reasons.
# T365763: eqsin text cluster drive upgrade <-- reason
geoip/generic-map/eqsin => DOWN
  • All state will be managed via confctl going forward. There was some feedback on allowing the state to be managed via both confctl and Git but we have decided against that.
  • The DNS hosts themselves and the states of various services (recdns, ntp, authdns-update) are managed via confctl. This means that we will have to factor in those states as well when this is migrated.
    • Example: if a DNS host is depooled for authdns-udpate, how should we have it handle the admin_state? The other services should not matter here but what about authdns-nsX? (Possibly the same way we handle them after a host is depooled -- running authdns-update manually to bring in the new changes).
  • Are there are any cookbook interactions around this? We will need to check.

Event Timeline

Change #1052307 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/software/conftool@master] conftool/cli: add option to log actions with a reason string

https://gerrit.wikimedia.org/r/1052307

ssingh renamed this task from Migrate DNS depooling of sites from operation/dns (git) to confctl to Migrate DNS depooling of sites from operations/dns (git) to confctl.Jul 5 2024, 2:56 PM
ssingh triaged this task as Medium priority.

Change #1052307 merged by Ssingh:

[operations/software/conftool@master] conftool/cli: add option to log actions with a reason string

https://gerrit.wikimedia.org/r/1052307

Change #1052803 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] conftool-data: add geodns schema

https://gerrit.wikimedia.org/r/1052803

Change #1052804 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: maintain admin_state via confd

https://gerrit.wikimedia.org/r/1052804

Change #1052997 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: move all DNS hosts admin_state to confctl/confd

https://gerrit.wikimedia.org/r/1052997

Thanks for the excellent / detailed write-up, @ssingh!

I have a couple of concerns with reuse the node entity type to represent geodns resources as "logical" hosts.

From a practical perspective, as things are right now, potential issues that come to mind are:

  1. possible confusion for folks who think of node entities as "always" representing hosts backing LVS pools
  2. documentation or comments that might assert #1 and need updated
  3. fields that are unused (weight) or may have schema-valid but unclear semantics for geodns (e.g., pooled = inactive)

But, taking a step back, doing this would couple evolution of these two arguably distinct concepts going forward. As in, it would be non-ideal if node starts to gain a mix of admin-state specific fields and additional LVS-pools specific fields.

It feels like the right solution here is a new entity / schema type, rather than reusing node. I don't think doing so would significantly inflate the engineering effort involved.

Thanks for the feedback, @Scott_French! Comments in-line:

Thanks for the excellent / detailed write-up, @ssingh!

I have a couple of concerns with reuse the node entity type to represent geodns resources as "logical" hosts.

Yes, that's fair and it has been on my mind as well. The challenge though was trying to make it work with the existing Node entity so as to avoid having to pass --object-type geodns and using the existing muscle memory we have built so far for confctl (though one can argue that this new change somewhat breaks that). There has been some thought on having a wrapper script for DNS depooling -- so doing dnsdepool eqiad text instead of using confctl set -- but even in Traffic internally, opinions around this vary a bit. We can ponder over these a bit more.

From a practical perspective, as things are right now, potential issues that come to mind are:

  1. possible confusion for folks who think of node entities as "always" representing hosts backing LVS pools

This is a concern indeed.

  1. documentation or comments that might assert #1 and need updated
  2. fields that are unused (weight) or may have schema-valid but unclear semantics for geodns (e.g., pooled = inactive)

As far as weight goes, that doesn't have much relevance for the current dnsbox schema as well. Such as:

sukhe@puppetmaster1001:~$ confctl select "name=dns6001.wikimedia.org"  get
{"dns6001.wikimedia.org": {"weight": 100, "pooled": "yes"}, "tags": "dc=drmrs,cluster=dnsbox,service=authdns-update"}
{"dns6001.wikimedia.org": {"weight": 100, "pooled": "yes"}, "tags": "dc=drmrs,cluster=dnsbox,service=authdns-ns2"}
{"dns6001.wikimedia.org": {"weight": 100, "pooled": "yes"}, "tags": "dc=drmrs,cluster=dnsbox,service=ntp-a"}
{"dns6001.wikimedia.org": {"weight": 100, "pooled": "yes"}, "tags": "dc=drmrs,cluster=dnsbox,service=recdns"}

We assign a weight because it's part of the schema and we might use it in in the future but it can be 0 as well right now and nothing changes. For the current geodns schema, we are only assuming pooled no means DOWN in admin_state (site/resource is depooled); anything else including yes or inactive means the site is up.

But, taking a step back, doing this would couple evolution of these two arguably distinct concepts going forward. As in, it would be non-ideal if node starts to gain a mix of admin-state specific fields and additional LVS-pools specific fields.

It feels like the right solution here is a new entity / schema type, rather than reusing node. I don't think doing so would significantly inflate the engineering effort involved.

On the engineering side, in the first iteration of dnsbox schema, we had a custom one that we reverted for the reason described above. But in that case at least the mapping of hostname matches to what the Node entity should be. I think my worry is mostly having to pass --object-type that most people will not know or forget, so maybe the solution there is to wrap this around a general dnsdepool command.

I will wait for submitting the patches before we decide on the best path forward on this and if there are more opinions.

Is your only concern about a new type the bad UX of needing to pass --object-type, @ssingh ?

If so, I think that we could and should do something about that.

Is your only concern about a new type the bad UX of needing to pass --object-type, @ssingh ?

If so, I think that we could and should do something about that.

I think that's the main concern as it breaks the muscle memory most people have around confctl (which I can be wrong about!). On the flip side though, if we do a wrapper for this, then what happens underneath is not relevant to the operator anyway so we don't have to worry about passing --object-type and can do so in the wrapper instead.

So I guess it comes down to what we think about using Node for this and the name being text-addrs as example and not dns1004.wikimedia.org or some host. If we think that is egregiously wrong (I think it's not nice but personally, I can live with it), then we should do a custom schema, by either alleviating the need for passing --object-type geodns or having a wrapper script. If this is "acceptable", then the current depol for example eqiad looks like (with the current Node-based definition):

confctl --reason "emergency eqiad depool" "name=text-addrs,dc=eqiad" set/pooled=no

Which ignoring the "wrong name" bit here around confctl usage does not seem to require any wrapper scripts and fits in nicely with the current tooling. At least that's the original thought I had and which forms the basis of the current patches -- very much open to your input around confctl conventions and such.

Reworking my own thought process on this, the current definition for this change (assuming Node) looks like:

eqiad:
  geodns:
    generic-map: [geodns]
    text-addrs: [geodns]
    text-next: [geodns]
    upload-addrs: [geodns]
    ncredir-addrs: [geodns]

Which should map out to etcd:

/conftool/v1/pools/eqiad/geodns/geodns/generic-map

This is course notwithstanding the redundancy around the cluster/service being geodns and the wrong usage of the name as expected by conftool/node.py:

Format is:                                                             
dc:                                                                    
  cluster:                                                             
    hostname: <-- :) this is actually the resource here, text-addrs, upload-addrs, etc.                                                        
      - serviceA                                                       
      - serviceB

But on the other hand, the name part for this (hostname) does fit in nicely without any additional custom schema or wrapper scripts, and hence this decision.

(I am not trying to sell this over the proposed custom schema which I am happy to do but just sharing why this was done!)

Thanks, @ssingh!

Ah, interesting - I wasn't aware of the prior art with dnsbox. Indeed, reusing node for a fundamentally "host shaped thing" where (1) you anticipate eventually using as-yet unused fields and (2) do not anticipate ever needing to enrich node with new fields, seems less concerning.

I might be biased by how I'm approaching this, but passing --object-type is already part of the procedure for operating on things like discovery - albeit rarer than operating on node objects. Stated differently, it doesn't seem like a big leap that --object-type is needed.

At the same time, actions that are rarer and / or have a bigger blast radius seem like those that would benefit from following a runbook rather than muscle memory in any case (or use some higher-level tool with guardrails bolted on, like a cookbook).

As @CDanis points out, if indeed --object-type is the tricky part here, then we can probably come up with a way to smooth over that.

Thanks, @ssingh!

Ah, interesting - I wasn't aware of the prior art with dnsbox. Indeed, reusing node for a fundamentally "host shaped thing" where (1) you anticipate eventually using as-yet unused fields and (2) do not anticipate ever needing to enrich node with new fields, seems less concerning.

I might be biased by how I'm approaching this, but passing --object-type is already part of the procedure for operating on things like discovery - albeit rarer than operating on node objects. Stated differently, it doesn't seem like a big leap that --object-type is needed.

That's true, I do see usage of it around discovery records, so it's not as if it is not being used at all (which is what I thought incorrectly).

At the same time, actions that are rarer and / or have a bigger blast radius seem like those that would benefit from following a runbook rather than muscle memory in any case (or use some higher-level tool with guardrails bolted on, like a cookbook).

As @CDanis points out, if indeed --object-type is the tricky part here, then we can probably come up with a way to smooth over that.

Thanks that's good to know and helps.

I think for me it is trying to use the existing Node definition which doesn't require a custom schema, goes in with the existing dnsbox cluster definition (and matches what we are doing for that for the DNS boxes service states). I guess the question is how egregious would using name as the resource be instead of the hostname here is what I am trying to grapple with personally as well.

If the concern is simply that they "always" represent LVS pools, the dnsbox cluster is an exception to that and I am thinking if geodns also should go with that as another exception. If the concern is that this should be a hostname because it should match what Node dictates, then I think I can't see any exception and might as well do the custom schema.

One of the things that I would like to understand for the custom geodns case is what name should mean. Because in some cases (Node), name is the hostname. In other cases (discovery records), it seems to indicate the dc as well, example:

confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false

Thanks, @ssingh!

In short, and I realize this doesn't help much, my understanding is that what makes sense as an object name vs. an object tag is really up to you (e.g., ergonomics of tag selectors for common operations).

For example, if you expect that operating on all resources in a site is the most common operation (e.g., full site depool), then which is clearer:

  1. confctl --object-type geodns select 'name=eqsin' set/pooled=false
  2. confctl --object-type geodns select 'dc=eqsin' set/pooled=false

My vote would be #2 (but only weakly). That would align with other conftool use cases where an object represents a semi-global concept (here, the resource) while tags provide scoping by consumer (e.g., similar to mediawiki-config's dbconfig).

A more interesting question might be what the schema looks like for the object itself. I guess this might initially be pooled (boolean) and maybe(?) an optional reason (string) that can be used to record why (i.e., similar to the comment previously possible in the admin_state file).

Thanks, @ssingh!

In short, and I realize this doesn't help much, my understanding is that what makes sense as an object name vs. an object tag is really up to you (e.g., ergonomics of tag selectors for common operations).

For example, if you expect that operating on all resources in a site is the most common operation (e.g., full site depool), then which is clearer:

  1. confctl --object-type geodns select 'name=eqsin' set/pooled=false
  2. confctl --object-type geodns select 'dc=eqsin' set/pooled=false

My vote would be #2 (but only weakly). That would align with other conftool use cases where an object represents a semi-global concept (here, the resource) while tags provide scoping by consumer (e.g., similar to mediawiki-config's dbconfig).

A more interesting question might be what the schema looks like for the object itself. I guess this might initially be pooled (boolean) and maybe(?) an optional reason (string) that can be used to record why (i.e., similar to the comment previously possible in the admin_state file).

Thanks! I think if we want to do a custom schema, it would look something like this:

geodns:
  path: "geodns"
  tags:
    - service
  schema:
    pooled:
      type: "enum:yes|no"
      default: "yes"
    weight:
      type: "int"
      default: 100

And then the data file ( say conftool-data/geodns/services.yaml):

generic-map: &sites [eqiad, codfw, esams, ulsfo, eqsin, drmrs, magru]
text-addrs: *sites
upload-addrs: *sites

This should then give us service=generic-map,name=<site> unless I am mistaken.

As for the reason, I think the concern there is that we cannot do set/pooled=yes with set/reason=foo in one line? But even other than that, we decided to do https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/3 instead where we allow passing of reason, so the final form with the custom schema should look like:

confctl --object-type geodns --reason 'depool eqiad' select 'service=generic-map,name=eqsin' set/pooled=false

Correction: @Joe tells me we can do reason/pooled in one line with: set/pooled=yes;reason=foo.

Change #1052803 abandoned by Ssingh:

[operations/puppet@production] conftool-data: add geodns schema

Reason:

revising this since we will use a custom schema

https://gerrit.wikimedia.org/r/1052803

Change #1052804 abandoned by Ssingh:

[operations/puppet@production] P:dns::auth::update: maintain admin_state via confd

https://gerrit.wikimedia.org/r/1052804

Change #1052997 abandoned by Ssingh:

[operations/puppet@production] P:dns::auth::update: move all DNS hosts admin_state to confctl/confd

https://gerrit.wikimedia.org/r/1052997

Change #1053323 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:conftool: add schema for geodns

https://gerrit.wikimedia.org/r/1053323

Some fragmented thoughts below!


Something worth pointing out: the original, early documentation suggests this usage of Node.

the 'node' object (that is used to describe a server/service within a pool)

is in the beginning of the README.md

Maybe Node would have been better named as a 'Poolable' or similar, to suggest that aspect of the usage without committing to other underlying details. Or it would be nice to have a 'Poolable' type as a sort of mix-in for other entities. (Sorry, this is not really a helpful comment, just food for thought for a 'conftool v2'.)


I think I slightly prefer the dc= tagging that Scott suggested, if only because it's evocative of what you're doing when typing that command. (But see also the next section for a contrary idea.)

I also think that --reason is fine to be an ephemeral thing that is only logged.

That being said -- @ssingh, the syntax is arcane, but you *can* do multiple mutations on one line:

confctl --object-type geodns select 'dc=eqiad' set/pooled=false:reason="well now I guess we have to worry about nested quoting, hmmm..."

We should figure out if that works or not, and possibly change the whole syntax IMO 😅 -- but it's a possibility.


Finally: One other idea I had was adding some support to confctl to autodetect --object-type based on a 'primary' tag name.

For instance we already have dnsdisc= as the primary tag in the discovery object-type space, and I don't think we use it anywhere else.

We could similarly have here a geodns= tag (generic-map / text-addrs / upload-addrs) that we use only in this object type.

Thanks @CDanis!

Some fragmented thoughts below!


Something worth pointing out: the original, early documentation suggests this usage of Node.

the 'node' object (that is used to describe a server/service within a pool)

is in the beginning of the README.md

Maybe Node would have been better named as a 'Poolable' or similar, to suggest that aspect of the usage without committing to other underlying details. Or it would be nice to have a 'Poolable' type as a sort of mix-in for other entities. (Sorry, this is not really a helpful comment, just food for thought for a 'conftool v2'.)


I think I slightly prefer the dc= tagging that Scott suggested, if only because it's evocative of what you're doing when typing that command. (But see also the next section for a contrary idea.)

I also think that --reason is fine to be an ephemeral thing that is only logged.

That being said -- @ssingh, the syntax is arcane, but you *can* do multiple mutations on one line:

confctl --object-type geodns select 'dc=eqiad' set/pooled=false:reason="well now I guess we have to worry about nested quoting, hmmm..."

We should figure out if that works or not, and possibly change the whole syntax IMO 😅 -- but it's a possibility.

@Joe also thinks this should work and I have updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053323 to assume it does; we can revisit it later if required. If this doesn't work, simply passing --reason should also be sufficient -- just as long as we have some record of why a site was depooled.


Finally: One other idea I had was adding some support to confctl to autodetect --object-type based on a 'primary' tag name.

For instance we have dnsdisc= as the primary tag in the discovery object-type space, and I don't think we use it anywhere else.

We could similarly have here a geodns= tag (generic-map / text-addrs / upload-addrs) that we use only in this object type.

This should be very helpful I think so my +1 FWIW :) It also helps shorten the invocation; this is probably me overthinking this but the idea behind this change is to significantly decrease the time it takes for an emergency site depool (the best we can do, knowing that not all traffic will move over in five minutes anyway) and I think having to avoid specifying --object-type helps. I think based on the above discussion, I was wrong that this is not being done anywhere else so I am fine with it but it would be nice to not have to do this.

Cool, it sounds like the conversation has evolved to using a dedicated schema, and we're on the same page that a multi-value set should work (to accommodate reason).

Also, no strong opinion on, e.g., service=generic-map,name=eqsin vs. dc=eqsin,name=generic-map.

Now that I think about it, since a full site depool presumably wouldn't use a site-only select (i.e., operate on all services/maps in etcd) and would instead specify generic-map, the select is going to be sufficiently verbose that it's quite clear what you're operating on.

That also bodes well for the suggestion @CDanis made about potentially auto-detecting the object type based on matching tags with those in schema (as long as the tag name is suitably unique).

I'll follow up on the schema patch shortly :)

Following up on this after some discussion with the Traffic folks:

It seems like our preferred version for this is something like:

confctl --object-type geodns select 'service=generic-map,dc=eqiad' set/pooled=false

The reason for this is that this feels more natural and the labels are clear: service is generic-map (or text-addrs, upload-addrs, etc) and dc is eqiad. There is no ambiguity around the various labels. This also matches the current structure for the cp hosts/Node. I am going to update the patchset and we can review it there, unless you think we are overlooking something here.

Following up on this after some discussion with the Traffic folks:

It seems like our preferred version for this is something like:

confctl --object-type geodns select 'service=generic-map,dc=eqiad' set/pooled=false

The reason for this is that this feels more natural and the labels are clear: service is generic-map (or text-addrs, upload-addrs, etc) and dc is eqiad. There is no ambiguity around the various labels. This also matches the current structure for the cp hosts/Node. I am going to update the patchset and we can review it there, unless you think we are overlooking something here.

How about instead of service we use geodns ?

(service is so generic a word it seems hopeless to ever remove the need for --object-type.)

+1 to using a more descriptive name for the resource operated on.

Since service + dc 100% overlaps with the node schema:

  1. There's not a lot of hope for inferring the object type from tags, as @CDanis notes.
  1. If one forgets to provide --object-type and the default of node is assumed, things get kind of weird. In short, I suspect conftool won't complain - i.e., you don't get a "I don't know this tag" error and instead get a noop, since no entities match (in the happy case, that is; in the unhappy theoretical case, where there is a service name collision, you get a surprise - luckily with at confirmation dialog).

Thanks folks, agreement in Traffic as well so updating the patch to do geodns + dc (geodns is getting a bit redundant here in that sense but I have no better ideas :).

@BBlack is out this week so we might revise it again but for now going with this.

Change #1053929 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: maintain admin_state via confd

https://gerrit.wikimedia.org/r/1053929

A couple of notes:

  • Overriding name to always be the same, and just one object per tag group, makes syncing and querying less efficient and don't really like it. The output of the tool will be also ugly, which I don't think is really what you want. If your concern is the command line you'll use, the solution for that is a cli wrapper (see below) short term.
  • I don't think weight is needed in this schema, or am I missing something?

To clarify a bit more, conftool is always been thought as the low-level swiss army knife with the somewhat awkward but flexible syntax, and I expected people to write wrappers around it to make operations simpler for them.
I would expect a simple geodnsctl script (or whatever name you want) that can be a 10 line wrapper around conftool would be much more user friendly than any option we'd have by stretching how conftool schemas work.

Having said that, I also see that being able to change the label for the object name in a schema could be desirable. I'd welcome patches to add this capability to conftool.

Thanks for the feedback @Joe!

A couple of notes:

  • Overriding name to always be the same, and just one object per tag group, makes syncing and querying less efficient and don't really like it. The output of the tool will be also ugly, which I don't think is really what you want. If your concern is the command line you'll use, the solution for that is a cli wrapper (see below) short term.

Yeah that's fair and one of the solutions proposed in the task. Though the thing I was trying to achieve was doing this without a wrapper (and balancing everything else) since just using confctl seemed to serve the purpose. At that point I felt that there is value in not bringing in a new command and just letting people use something they are already familiar with. There was also a clear preference for geodns=generic-map,dc=site and so I structured it like that.

  • I don't think weight is needed in this schema, or am I missing something?

I added weight for two reasons:

  1. We want to be consistent with all the other Traffic schemas -- cp and dnsbox -- that both have a concept of weight even though it's not used in the dnsbox case, currently.
  2. The weight can come in handy if there is some sort of a priority requirement, both in the templating for admin_state and also in general for the geodns case.

To clarify a bit more, conftool is always been thought as the low-level swiss army knife with the somewhat awkward but flexible syntax, and I expected people to write wrappers around it to make operations simpler for them.
I would expect a simple geodnsctl script (or whatever name you want) that can be a 10 line wrapper around conftool would be much more user friendly than any option we'd have by stretching how conftool schemas work.

Having said that, I also see that being able to change the label for the object name in a schema could be desirable. I'd welcome patches to add this capability to conftool.

IMO, if we do geodns as the tag, that can help achieve what we discussed and @CDanis mentioned -- helping alleviate the need for --object-type geodns. So I think the final output can look like:

confctl --object-type geodns select 'geodns=generic-map,name=eqiad' get

Which if we detect the tag geodns and it being unique enough, can help us avoid --object-type geodns and also the redundancy.

Final (famous last words) form:

confctl --object-type geodns select 'geodns=generic-map,name=eqiad' get

with the key being:

/conftool/v1/geodns/generic-map/eqiad

Sorry if I'm late to the task, I discovered it just today as I was not subscribed to it.

Allow me to be really sad that in this whole discussion about doing a higher level operation (depool a whole site from the DNS) the current existing and well tested framework for orchestrating higher level operations (cookbooks) was timidly mentioned only once as a possible solution for a wrapper instead of being part of the discussion from the get go.

As Giuseppe pointed out conftool has always meant to be a lower level tool for modifying objects in etcd. Abusing it's Node schema just to make the CLI a little shorter seems quite wrong to me.
If there are safe solutions to prevent the need to pass --object-type because it gets automatically guessed via other means, could be a nice addition to conftool in general, but that's unrelated to this specific use case and solution.

That said I strongly think that the higher level operation "depooling a site from DNS" should be a cookbook because it would allow to:

  • Have any SRE be able to run it safely and blindly in any circumstances without too much thinking in an emergency because the tooling itself will prevent them to do harm
  • Implement custom logic to add any safety net you need to prevent harm, like depooling at the same time specific sites (esams and drms for example), requiring to switch to a different geo-maps mapping (like we had for esams depool in the past) or interaction with current a/p services that might be active on the site to be depooled just to name a few.
  • Add now or in the future any additional steps, checks, data gathering, etc. as needed more easily
  • Have any custom SAL message included
  • Have distributed locking
  • Have a simple CLI without the need to force the tool (like cookbook sre.dns.depool esams or cookbook sre.dns.admin depool eqiad or similar). The message to be logged could be a required option or just asked to the user if not passed via the arguments..
  • Have a base CLI that every SRE knows and uses every day
  • Have rollback scenarios support
  • Have DRY-RUN support

Sorry if I'm late to the task, I discovered it just today as I was not subscribed to it.

Allow me to be really sad that in this whole discussion about doing a higher level operation (depool a whole site from the DNS) the current existing and well tested framework for orchestrating higher level operations (cookbooks) was timidly mentioned only once as a possible solution for a wrapper instead of being part of the discussion from the get go.

As Giuseppe pointed out conftool has always meant to be a lower level tool for modifying objects in etcd. Abusing it's Node schema just to make the CLI a little shorter seems quite wrong to me.
If there are safe solutions to prevent the need to pass --object-type because it gets automatically guessed via other means, could be a nice addition to conftool in general, but that's unrelated to this specific use case and solution.

That said I strongly think that the higher level operation "depooling a site from DNS" should be a cookbook because it would allow to:

  • Have any SRE be able to run it safely and blindly in any circumstances without too much thinking in an emergency because the tooling itself will prevent them to do harm
  • Implement custom logic to add any safety net you need to prevent harm, like depooling at the same time specific sites (esams and drms for example), requiring to switch to a different geo-maps mapping (like we had for esams depool in the past) or interaction with current a/p services that might be active on the site to be depooled just to name a few.
  • Add now or in the future any additional steps, checks, data gathering, etc. as needed more easily
  • Have any custom SAL message included
  • Have distributed locking
  • Have a simple CLI without the need to force the tool (like cookbook sre.dns.depool esams or cookbook sre.dns.admin depool eqiad or similar). The message to be logged could be a required option or just asked to the user if not passed via the arguments..
  • Have a base CLI that every SRE knows and uses every day
  • Have rollback scenarios support
  • Have DRY-RUN support

Thanks for the feedback @Volans! I will try to address the main point by saying that the cookbook was the end goal (for when everything else was done) but not explicitly stated because given the extent of this change, there were some intermediate steps involved and an intentional focus on confctl. But given your comment, I think we can revisit that.

Just for more clarity: I think this is a big change and as part of this, I wanted to fix some other stuff related to how we pull in DNS changes as well. Because of that, I wanted to break this up in small parts and those parts were: finalize the schema (which I never expected will take this much time but understandable), move users to something new that is not Git-based changes and then end with a cookbook of some sorts, if desired and if there was something that just confctl was not providing. The other intermediate changes were to look into the Netbox DNS changes issues related to our general DNS workflow. As you might have seen in some other cases as well related to work on the DNS changes that the cookbook comes in the end, perhaps as a natural orders as well. In this case, the question was open-ended.

The point about this not being a cookbook from the very start is fair. In my mind at least, the focus for this task has been to move away from Git to confctl -- which most SREs are familiar with -- and thus the focus on that and trying to come up with a way that makes this shift more natural and allows for the quickest depool of a site, with conftcl currently performing that function and not cookbooks for other hosts. You can argue that cookbooks allow us to do all that and it is true, however, the focus was to make a mapping to what is being currently done now when we want to depool something and the natural solution to that felt like confctl and not a cookbook in this round.

I will discuss with the Traffic team to get more input and we can revisit this. The final design on this very much open-ended so your perspective and input is appreciated. I will discuss this a bit more internally as well and comment on this task again once the confd/confctl bits (the current patches) are finalized.

Thanks for the clarification. I didn't meant to imply that you didn't want a cookbook as end goal (although it was not mentioned).

Because of that, I wanted to break this up in small parts and those parts were: finalize the schema (which I never expected will take this much time but understandable), move users to something new that is not Git-based changes and then end with a cookbook of some sorts, if desired and if there was something that just confctl was not providing.
[...snip...]
In my mind at least, the focus for this task has been to move away from Git to confctl -- which most SREs are familiar with -- and thus the focus on that and trying to come up with a way that makes this shift more natural and allows for the quickest depool of a site, with conftcl currently performing that function and not cookbooks for other hosts.

There are some drawbacks that I see from not including the final solution in the design and planning phase, such as:

  • mis-directed focus: for example a lof of focus in this task has gone towards simplifying the confctl CLI because it has to be used in a simple way. Efforts that would not be really useful if a cookbook will be the final solution. In that case a dedicated object type could have been agreed upon very quickly.
  • multiple transitions: moving from git to confctl to a cookbook requires two transitions for a rather large team that will need to fight their muscle memory twice in a somewhat short amount of time, risking additional confusion and errors. In addition the second transition will not have the previous one fail (once you move from confctl to cookbook, confctl will still "work", just without any additional steps done by the cookbook)

Including the final solution into the design phase allows to design from the get go the basic cookbook UI, requiring just one transition, even if on day one the cookbook will just perform the confctl command. With just that it will already have for free some small additional benefit compared to confctl and will allow to add features gradually and transparently, without the need to coordinate that with the larger SRE team.

Anyway I don't want to derail anymore the original intent of the task :)

Thanks for the clarification. I didn't meant to imply that you didn't want a cookbook as end goal (although it was not mentioned).

That's totally fine and I don't think that was implied. Re: not mentioned, yes, that's correct because like I said, the question of if this should be a full cookbook or confctl was still open ended for me at least, for a variety of reasons. Even in some of the internal discussions within Traffic, we decided that confctl was fine for now, given the usual pairing of such changes: confd for the templating/updates and confctl to control the pooled/depooled state. The cookbook was a thought but it was not as strong a thought as simply using confctl, which also explains all the discussion above.

That being said, after your comment about the cookbook and thinking about it more especially about the positives and muscle memory from that vs using confctl/wrapper, I think this is worth a revisit. I will discuss this with Traffic and follow up here.

  • mis-directed focus: for example a lof of focus in this task has gone towards simplifying the confctl CLI because it has to be used in a simple way. Efforts that would not be really useful if a cookbook will be the final solution. In that case a dedicated object type could have been agreed upon very quickly.
  • multiple transitions: moving from git to confctl to a cookbook requires two transitions for a rather large team that will need to fight their muscle memory twice in a somewhat short amount of time, risking additional confusion and errors. In addition the second transition will not have the previous one fail (once you move from confctl to cookbook, confctl will still "work", just without any additional steps done by the cookbook)

Including the final solution into the design phase allows to design from the get go the basic cookbook UI, requiring just one transition, even if on day one the cookbook will just perform the confctl command. With just that it will already have for free some small additional benefit compared to confctl and will allow to add features gradually and transparently, without the need to coordinate that with the larger SRE team.

Anyway I don't want to derail anymore the original intent of the task :)

This task is meant to a general discussion of this work including input from others in SRE so it's absolutely fine.

Change #1060914 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/cookbooks@master] sre.dns.admin: add cookbook for GeoDNS pool/depool

https://gerrit.wikimedia.org/r/1060914

Change #1053323 merged by Ssingh:

[operations/puppet@production] P:conftool: add schema for geodns

https://gerrit.wikimedia.org/r/1053323

Change #1060914 merged by Ssingh:

[operations/cookbooks@master] sre.dns.admin: add cookbook for GeoDNS pool/depool

https://gerrit.wikimedia.org/r/1060914

Change #1062429 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] Remove admin_state from repository (managed via confd)

https://gerrit.wikimedia.org/r/1062429

Change #1062429 abandoned by Ssingh:

[operations/dns@master] Remove admin_state from repository (managed via confd)

Reason:

already present in the linked patch

https://gerrit.wikimedia.org/r/1062429

Mentioned in SAL (#wikimedia-operations) [2024-08-14T17:30:44Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: testing cookbook, T369366]

Mentioned in SAL (#wikimedia-operations) [2024-08-14T17:30:52Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: testing cookbook, T369366]

Mentioned in SAL (#wikimedia-operations) [2024-08-15T13:51:56Z] <sukhe> sudo cumin "A:dnsbox" 'disable-puppet "merging CR 1053929 T369366"'

Change #1053929 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update: maintain admin_state via confd

https://gerrit.wikimedia.org/r/1053929

Change #1063009 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update: fix location of admin_state file

https://gerrit.wikimedia.org/r/1063009

Change #1063009 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update: fix location of admin_state file

https://gerrit.wikimedia.org/r/1063009

Change #1063012 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::update set confd_admin_state true for all DNS boxes

https://gerrit.wikimedia.org/r/1063012

Change #1063012 merged by Ssingh:

[operations/puppet@production] P:dns::auth::update set confd_admin_state true for all DNS boxes

https://gerrit.wikimedia.org/r/1063012

Mentioned in SAL (#wikimedia-operations) [2024-08-15T15:52:48Z] <sukhe> sudo cumin -b1 -s60 "A:dnsbox" "run-puppet-agent --enable 'merging CR 1053929 T369366'": T369366

Mentioned in SAL (#wikimedia-operations) [2024-08-15T17:07:24Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: testing live change, T369366]

Mentioned in SAL (#wikimedia-operations) [2024-08-15T17:07:40Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site ulsfo [reason: testing live change, T369366]

Mentioned in SAL (#wikimedia-operations) [2024-08-15T17:22:17Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: testing done, T369366]

Mentioned in SAL (#wikimedia-operations) [2024-08-15T17:22:33Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: testing done, T369366]