Background
This is a follow-up from discussion on https://gerrit.wikimedia.org/r/1072612 and is closely related to T375014.
In short, for each service operated on, the sre.discovery.datacenter cookbook polls all authdns servers to validate the expected state is returned, before then flushing the discovery record from all recursors.
We should make these operations aware of the conftool pooled state of the dnsbox hosts. Otherwise, a host that is depooled and unavailable for maintenance would cause either operation to fail. Specifically:
- When polling, we should ignore-but-warn when resolution fails or stale state is returned by a depooled host.
- When flushing, we should ignore-but-warn when the remote command to flush fails for a depooled host.
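The ignore-but-warn behavior for both cases could share one helper, roughly as sketched below. This is only an illustration under assumptions: `run_on_hosts`, `action`, and `is_pooled` are hypothetical names, not existing spicerack or conftool APIs, and the host names are placeholders. The pooled-state lookup is passed in as a callable so that whatever T375014 ends up providing could be plugged in.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


def run_on_hosts(
    hosts: Iterable[str],
    action: Callable[[str], None],
    is_pooled: Callable[[str], bool],
) -> List[str]:
    """Run action on each host, tolerating failures only on depooled hosts.

    action is a hypothetical per-host operation (poll or flush) that raises
    on failure or stale state. is_pooled is a hypothetical lookup of the
    conftool pooled state. Returns the depooled hosts whose failures were
    ignored.
    """
    ignored = []
    for host in hosts:
        try:
            action(host)
        except Exception as exc:
            if is_pooled(host):
                # A pooled host must succeed, so propagate the failure.
                raise
            logger.warning("Ignoring failure on depooled host %s: %s", host, exc)
            ignored.append(host)
    return ignored


# Placeholder data: one pooled host and one depooled host under maintenance.
pooled_state = {"dnsbox-a": True, "dnsbox-b": False}


def fake_flush(host: str) -> None:
    """Stand-in for the remote flush command; fails on the depooled host."""
    if host == "dnsbox-b":
        raise RuntimeError("host unreachable")


ignored_hosts = run_on_hosts(pooled_state, fake_flush, pooled_state.__getitem__)
```

Checking the pooled state only after a failure keeps the common path free of extra lookups, but it also means a depooled host that happens to respond is still polled or flushed. Whether that is the desired behavior is an open design choice.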
Where this connects back to T375014 is that there's a lot of overlap between what we need in order to achieve the above and what spicerack might consider supporting (and what we could then adopt).
In addition, there's also a lot of overlap between logic that exists in spicerack.dnsdisc.Discovery and sre.discovery.datacenter that we can potentially deduplicate in this process (e.g., resolve_with_client_ip and Discovery.resolve_with_client_ip) if indeed we want similar behavior.
Current status
As of January 2026, the current plan is to wait for Spicerack to provide some way of listing only pooled dnsbox hosts (T375014). Once that lands, we integrate it into sre.discovery.datacenter to implement the polling and flushing behavior described above.
Alternatively, we could revive the work in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1074551, which implements that functionality directly.