
Cookbook to depool a site in AuthDNS
Closed, Resolved · Public

Description

To depool a site prior to maintenance or during an outage, it's currently required to send a Gerrit patch in the DNS repo.
This is both slow and error-prone.

Instead, there should be a cookbook that has safeguards (e.g. check that not too many sites are depooled; for eqiad/codfw, check that the local appservers are not depooled; etc.) and abstracts the depool for SREs.

Event Timeline

Restricted Application added a subscriber: Aklapper.
BCornwall subscribed.

@ayounsi Thanks for the report! I have a naive question: Would it be possible/more correct to interface with confctl/etcd rather than use a cookbook? That (by my observation) seems to be the typical tooling for that.

I like this direction (etcd). It's not super-trivial, but we've complained a lot even internally about the lack of etcd support for depooling whole sites at the public edge.

The Discovery stuff for services already has a setup close to what we'd need (confd driving files loaded by the gdnsd extfile plugin), but the structural stuff is a little different for this case, especially if we want per-address control (there are 4x public addresses presently that we do this kind of failover for: text-addrs, text-next, upload-addrs, and ncredir-addrs). Then we could relegate admin_state to emergency-use-only, basically (for some unforeseen circumstances).

One other minor point:

[check] if eqiad/codfw check that the local appservers are not depooled

I don't think we want or need to check this, as the front edge pooling is independent of applayer pooling (e.g. eqiad caches will hit codfw applayer if eqiad applayer is depooled in discovery). The "too many sites depooled" thing is potentially useful, although I don't know that it warrants a cookbook just to add that. The threshold would have to be >=3 sites depooled anyway, as we do sometimes operationally take out two sites.
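For illustration only, the "too many sites depooled" safeguard could look roughly like the sketch below. The function name, the constant, and the placeholder site names other than eqiad/codfw are assumptions, not existing tooling; the only thing taken from this thread is that two depooled sites is operationally normal and a third should be refused.

```python
# Hypothetical sketch of the "too many sites depooled" check.
# Nothing here is an existing cookbook or confctl API.

# Two sites out at once is operationally normal; a third is refused.
MAX_DEPOOLED_SITES = 2


def can_depool(site, all_sites, depooled):
    """Return (ok, reason) for a request to depool `site`.

    all_sites: set of every site name known to the edge.
    depooled:  set of site names that are already depooled.
    """
    if site not in all_sites:
        return False, f"unknown site: {site}"
    if site in depooled:
        # Idempotent: depooling an already depooled site changes nothing.
        return True, f"{site} is already depooled"
    would_be = len(depooled | {site})
    if would_be > MAX_DEPOOLED_SITES:
        return False, (
            f"refusing: {would_be} sites would be depooled "
            f"(limit {MAX_DEPOOLED_SITES})"
        )
    return True, "ok"


if __name__ == "__main__":
    # Placeholder site list; only eqiad and codfw are named in this task.
    sites = {"eqiad", "codfw", "site3", "site4", "site5"}
    print(can_depool("site3", sites, {"eqiad", "codfw"}))
```

Whether this check lives in a cookbook or next to the etcd tooling is a separate question; the check itself does not depend on either.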

My suggestion to use a cookbook is because it's what SREs are familiar with, centralized in one place, can be nested for larger scope automation, provide abstraction, etc.

It could just be a wrapper around whichever is the internal mechanism (etcd).
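A rough sketch of what "just a wrapper" might mean, assuming a backend object that exposes a single `set_pooled` call. That interface, `InMemoryBackend`, and `depool_site` are all hypothetical stand-ins for whatever the etcd-side mechanism ends up being; only the four address names come from this thread.

```python
# Hypothetical thin wrapper over an injected pooling backend.
# PoolBackend is an assumed interface, not the confctl/etcd API.
from typing import Protocol

# The four public addresses mentioned earlier in this task.
RESOURCES = ("text-addrs", "text-next", "upload-addrs", "ncredir-addrs")


class PoolBackend(Protocol):
    def set_pooled(self, site: str, resource: str, pooled: bool) -> None: ...


class InMemoryBackend:
    """Stand-in backend so the sketch runs without any real service."""

    def __init__(self):
        self.state = {}

    def set_pooled(self, site, resource, pooled):
        self.state[(site, resource)] = pooled


def depool_site(backend, site, resources=RESOURCES, dry_run=False):
    """Depool `site` for each resource; return the planned (site, resource) pairs."""
    unknown = [r for r in resources if r not in RESOURCES]
    if unknown:
        raise ValueError(f"unknown resources: {unknown}")
    planned = [(site, r) for r in resources]
    if not dry_run:
        for s, r in planned:
            backend.set_pooled(s, r, False)
    return planned


if __name__ == "__main__":
    backend = InMemoryBackend()
    print(depool_site(backend, "eqiad", dry_run=True))
```

Keeping the backend injected means the same function could be called from a larger cookbook or directly, without the wrapper knowing how the state is stored.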

I'm hesitant about the idea of creating an abstraction over an abstraction - I may be an outlier but my experience with depooling has been with confctl rather than cookbooks - for me, cookbooks depool as a part of a larger process (e.g. reimaging) rather than just being one action.

ssingh claimed this task.

Thanks to @ayounsi for reporting this originally! Duplicate is T369366 which is now resolved as well.