
Support listing pooled / active authdns hosts (rather than all)
Open, Medium, Public

Description

Spicerack already provides an authdns_servers accessor that returns a map from hostname to IP (along with its RemoteHosts equivalent, authdns_active_hosts).

This is used in a couple of places, including initializing the set of Resolver instances internal to dnsdisc.Discovery, and in its current form, returns a static set of hosts read from a puppet-managed config file.

In discussion on https://gerrit.wikimedia.org/r/1072612, @ssingh noted that it would be good to provide a way to exclude depooled authdns hosts in certain use cases, particularly ones asserting that DNS resolution returns an expected value - e.g., this came up in the context of the sre.discovery.datacenter cookbook, but the same applies to Discovery.check_ttl and similar checks.

As a starting point, it would be good if Spicerack offered an accessor that returned only those authdns hosts that are (currently) pooled.

@ssingh pointed out some of the changes made in https://gerrit.wikimedia.org/r/1009539 as an example of how this might be done.

Beyond that, there's the question of where / how that should be wired into Discovery (for checks like those described above). In the ideal case, actually querying conftool would happen at time of use, rather than just prior to construction (e.g., for the benefit of long-running cookbooks that might work with a long-lived Discovery instance).
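As a rough illustration of the accessor described above, the sketch below filters a static host-to-IP map (mirroring what authdns_servers returns from the puppet-managed config file) down to hosts whose conftool state is pooled. All names and data here are illustrative stand-ins, not Spicerack's actual API; the pooled-state mapping stands in for a real confctl/etcd query.

```python
# Hypothetical sketch: filtering the static authdns host map down to pooled hosts.
# The host -> IP map mirrors authdns_servers (read from a puppet-managed file);
# conftool_state stands in for what a confctl/etcd query would return.

def pooled_authdns_servers(all_servers, conftool_state):
    """Return only the hosts whose conftool 'pooled' state is 'yes'."""
    return {
        host: ip
        for host, ip in all_servers.items()
        if conftool_state.get(host) == "yes"
    }

# Fabricated example data:
servers = {"dns1001": "10.0.0.1", "dns2001": "10.0.0.2", "dns3001": "10.0.0.3"}
state = {"dns1001": "yes", "dns2001": "no", "dns3001": "yes"}
print(pooled_authdns_servers(servers, state))  # dns2001 is excluded
```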

Event Timeline

Restricted Application added a subscriber: Aklapper.
Volans triaged this task as Medium priority. · Sep 18 2024, 9:37 AM

Thanks for the task. I think the main decision to make is how fresh the data needs to be. If we opt for refreshing it every time it's called, then we need to think carefully about how to inject it into Spicerack, as it would not be possible to just have it as a helper in the Spicerack class.

I've had a look at the current usage in Spicerack: authdns_active_hosts is not used, while authdns_servers is used by both the discovery and service_catalog accessors and is also cached.

In the cookbooks there is only one use case so far:

  • cookbooks/sre/hosts/decommission.py and cookbooks/sre/hosts/move-vlan.py

They use self.authdns_hosts = spicerack.authdns_active_hosts, but in reality they just need one authdns host that is pooled and up to date. The current implementation unnecessarily checks all hosts.

If we want it to be more dynamic, as we probably should IMHO, then we could have a Spicerack accessor that queries them in real time and pass that method as a callable where needed, so that it is invoked at time of use.
I'll try to send a patch later today.
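The "pass that method as a callable" idea could be sketched roughly as below. DiscoveryLike and currently_pooled are hypothetical stand-ins (not Spicerack's Discovery API): the consumer holds a zero-argument callable instead of a fixed host list, so long-running cookbooks always see the current pooled set rather than a snapshot taken at construction time.

```python
# Illustrative sketch only: a Discovery-style consumer that defers the authdns
# lookup to a callable, invoked at time of use rather than at construction.
from typing import Callable, Dict


class DiscoveryLike:
    """Hypothetical stand-in for a Discovery-style class."""

    def __init__(self, authdns_provider: Callable[[], Dict[str, str]]):
        self._authdns_provider = authdns_provider

    def resolvers(self):
        # Re-queried on every call, so a long-lived instance never
        # works from a stale list captured at construction time.
        return sorted(self._authdns_provider())


state = {"dns1001": "yes", "dns2001": "no"}


def currently_pooled():
    # Stand-in for a real confctl/etcd query (IPs fabricated).
    return {h: "198.51.100.1" for h, s in state.items() if s == "yes"}


d = DiscoveryLike(currently_pooled)
print(d.resolvers())      # ['dns1001']
state["dns2001"] = "yes"  # simulate a repool
print(d.resolvers())      # ['dns1001', 'dns2001']
```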

Thanks for taking a look, Riccardo. I should mention, this isn't blocking anything on our end, as I can always do something like https://gerrit.wikimedia.org/r/1073524 (and likely will) in the meantime.

As I mentioned, adding an accessor that's conftool-state aware would be a great starting point, and optionally using it in the discovery accessor.

As you point out, I think the main spot where this gets tricky is the case where the authdns hosts are captured for later use, e.g., the way the service_catalog caches them and the returned Catalog can later be used to construct Discovery instances using the stale list.

Thanks @Scott_French for documenting this by filing this task!

I had a brief chat about this with @Volans yesterday on IRC. I will leave the execution details and Spicerack related stuff to his expertise but summarizing the discussion and adding a few new points:

  • If you are looking for a list of active DNS hosts to query against (for any service), then that list should be fetched from confctl/etcd.
    • The above statement becomes a little more specific depending on the type of query you are performing. If you want something from the internal recursor, you should check for the pooled hosts in cluster dnsbox under the recdns service. If you are looking to make zone file changes or anything related to those, you should be looking for the hosts under the authdns-update service. (Such as the netbox cookbook does.)
    • I don't think you will ever have a need for this but note that because of how we announce/define the anycast NTP setup, it warrants that you check against ntp.* and not ntp. (Hosts in the core announce ntp-[abc] while hosts in edges announce ntp-[ab] alternately.)
    • There is some overlap between these services. Some of it was intentional but some of it was unavoidable. For example, if a host is depooled for authdns-update, should it really be responding to queries under recdns or authdns? Similarly, what if it is pooled for recdns but depooled for authdns?
      • To work around these cases, Traffic does not depool services individually unless really required, and only if they have no dependencies. This means that if a host is down for a reboot or other long-term maintenance, it is depooled for all services. But since you can't "query a host", only individual services, it still makes sense to check the specific service you are running the query against: if that service was depooled individually, Traffic will have depooled just that one; if not, all services will have been depooled, so you will still see the correct state for the one you care about.
  • confctl/etcd does not have the IP information for the DNS hosts themselves and that is an intentional decision. Those are statically defined in Puppet, under hieradata/common.yaml and the authdns_servers key.
    • A host will not exist in confd/etcd if it does not exist in that list. When we decommission a host, it is removed from both etcd and Puppet at the same time. This is also true for long-term maintenance: the host will be removed from both places.
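The split described in the last two bullets, with pool state in confctl/etcd and host-to-IP mappings in puppet (hieradata/common.yaml, authdns_servers key), can be sketched as a simple join. The data below is fabricated for illustration; the invariant encoded in the assert is the one stated above: any host present in etcd must also be present in the static puppet list.

```python
# Sketch of the data split described above (all values fabricated):
# confctl/etcd carries only pool state; host -> IP lives in puppet
# (hieradata/common.yaml, under the authdns_servers key).

puppet_authdns_servers = {"dns1001": "10.0.0.1", "dns2001": "10.0.0.2"}
etcd_state = {"dns1001": "yes", "dns2001": "no"}  # pooled state only, no IPs

# Invariant: etcd hosts are a subset of the puppet-managed list, since a host
# is removed from both places together on decommission or long-term maintenance.
assert set(etcd_state) <= set(puppet_authdns_servers)

# Joining the two yields the pooled hosts with their IPs:
pooled = {h: puppet_authdns_servers[h] for h, s in etcd_state.items() if s == "yes"}
print(pooled)
```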

Thanks for the summary @ssingh. I have a local proposal that I will send out when ready. There is one main point to decide, and that is the "caching" time of this information:

  • [no cache] always query confctl and read the yaml file for every DNS query done by the discovery module
  • [time cache] re-read the data from confctl+yaml if the cached one is older than X minutes
  • [explicit cache] expose a "refresh recursors" method to explicitly refresh them

Do you have any preference?
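For concreteness, the [time cache] and [explicit cache] options could both be covered by a small wrapper like the sketch below. CachedAuthdns and fetch_authdns are hypothetical names, and the real fetch would read confctl plus the yaml file; this only illustrates the caching shape, not Spicerack's implementation.

```python
# Hypothetical sketch of the [time cache] option, with a refresh() method
# covering the [explicit cache] variant. fetch_authdns stands in for the
# real confctl+yaml read.
import time


class CachedAuthdns:
    def __init__(self, fetch, ttl_seconds=300):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._cached = None
        self._fetched_at = 0.0

    def get(self):
        # Re-read only if there is no cached copy or it is older than the TTL.
        now = time.monotonic()
        if self._cached is None or now - self._fetched_at > self._ttl:
            self._cached = self._fetch()
            self._fetched_at = now
        return self._cached

    def refresh(self):
        # [explicit cache]: callers force a re-read themselves.
        self._cached = None
        return self.get()


calls = []


def fetch_authdns():
    calls.append(1)  # count how often the backend is actually queried
    return {"dns1001": "10.0.0.1"}


cache = CachedAuthdns(fetch_authdns, ttl_seconds=3600)
cache.get()
cache.get()        # served from cache, no second fetch
print(len(calls))  # 1
cache.refresh()    # explicit refresh forces a re-read
print(len(calls))  # 2
```

The [no cache] option is just the degenerate case of calling the fetch directly on every use.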


I think [no cache] is fine, given that the lookups will most likely be infrequent. Is that assumption correct? That also seems to be the easiest way forward for you, I think.

While it's easy to come up with a number for long-term depools (shutdowns, hardware maintenance), short-term depools are harder to predict and might result in various corner cases. Such corner cases also exist for the no-cache option, to be clear, but I think the probability is lower.

My two cents anyway, since I think you know more about all the possible use cases of this.

Change #1077410 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] authdns: add new module to get the list of authdns

https://gerrit.wikimedia.org/r/1077410

@ssingh what do you think of the above draft patch proposal? If that works for you I'll complete it and get it included into spicerack.

Thanks for working on the patch -- the changes look good and are as we discussed! The cluster and service type are correct.

Hi @Volans, can I confirm the status of this task?

As noted in the description, this would be beneficial for the cookbooks we use in our upcoming switchover.

Thanks

@MLechvien-WMF I haven't worked on this since the original unplanned effort that generated the above patch. As I have also moved to a different team since then, I don't think I will be able to work on this. The existing patch can surely be taken over and carried forward by interested parties in the affected teams (serviceops, traffic, I/F) if that approach still makes sense.