sre.discovery.datacenter should handle depooled dnsbox hosts
Open, Low, Public

Description

Background

This is a follow-up from discussion on https://gerrit.wikimedia.org/r/1072612 and is closely related to T375014.

In short, for each service operated on, the sre.discovery.datacenter cookbook polls all authdns servers to validate the expected state is returned, before then flushing the discovery record from all recursors.

We should make these operations aware of the conftool pooled-state of the dnsbox hosts, otherwise a host that is unavailable (but depooled) for maintenance would cause either operation to fail. Specifically:

  • When polling, we should ignore-but-warn when resolution fails or stale state is returned by a depooled host.
  • When flushing, we should ignore-but-warn when the remote command to flush fails for a depooled host.
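The ignore-but-warn behavior in both bullets can be sketched as a single helper. This is only an illustration under stated assumptions, not the cookbook's or Spicerack's actual code: `DnsBox`, `run_on_hosts`, and `make_poll_check` are hypothetical names, and the `pooled` flag stands in for whatever conftool-aware accessor ends up providing the pooled state.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class DnsBox:
    """Hypothetical stand-in for a dnsbox host and its conftool pooled state."""

    name: str
    pooled: bool


def run_on_hosts(
    hosts: Iterable[DnsBox], action: Callable[[DnsBox], None], label: str
) -> List[str]:
    """Run action on every host, tolerating failures only on depooled hosts.

    A failure on a pooled host is re-raised. A failure on a depooled host is
    logged as a warning and the host name is added to the returned list.
    """
    ignored: List[str] = []
    for host in hosts:
        try:
            action(host)
        except Exception as exc:  # broad on purpose: any failure type counts
            if host.pooled:
                raise
            logger.warning(
                "Ignoring %s failure on depooled host %s: %s", label, host.name, exc
            )
            ignored.append(host.name)
    return ignored


def make_poll_check(
    resolve: Callable[[DnsBox], str], expected: str
) -> Callable[[DnsBox], None]:
    """Build a polling action that raises when a host returns stale state.

    resolve is an assumed callable that queries one host and returns its answer.
    """

    def check(host: DnsBox) -> None:
        actual = resolve(host)
        if actual != expected:
            raise RuntimeError(
                f"{host.name} returned {actual!r}, expected {expected!r}"
            )

    return check
```

The same `run_on_hosts` wrapper would cover the flush case by passing an action that runs the remote flush command and raises on failure. Returning the list of ignored hosts lets the caller summarize what was skipped at the end of the run.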

Where this connects back to T375014 is that there's a lot of overlap between what we need in order to achieve the above and what spicerack might consider supporting (and what we could then adopt).

In addition, there's also a lot of overlap between logic that exists in spicerack.dnsdisc.Discovery and sre.discovery.datacenter that we can potentially deduplicate in this process (e.g., resolve_with_client_ip and Discovery.resolve_with_client_ip) if indeed we want similar behavior.

Current status

As of January 2026, the current plan is to wait for Spicerack to gain some way of listing only pooled dnsbox hosts (T375014) and, once that lands, integrate it into sre.discovery.datacenter to implement the polling and flushing behavior described above.

Alternatively, we could revive the work in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1074551, which implements that functionality directly.

Event Timeline

Change #1073524 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/cookbooks@master] sre.discovery.datacenter: restrict checks to active authdns hosts

https://gerrit.wikimedia.org/r/1073524

Change #1074551 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/cookbooks@master] sre.discovery.datacenter: handle depooled dnsboxen

https://gerrit.wikimedia.org/r/1074551

If this is not super urgent, do you think it could wait for an "upstream" solution in spicerack as discussed in T375014?

@Volans - Yes, in fact that would be ideal. I went ahead and drafted https://gerrit.wikimedia.org/r/1074551 mainly to sketch out the specific behaviors we're looking for in sre.discovery.datacenter, but that patch should be able to easily adapt to the conftool-aware accessors in spicerack once they're ready. Thanks!

@Blake @Scott_French could you assess what part of this is still valid (please edit the description) and see what we can schedule this quarter?

From what I can see in the linked tickets, it looks like there are 2 relevant, outstanding patches. Scott's makes the pooled-state check in the cookbook, and Riccardo's makes the check as part of a new Spicerack accessor.

It sounds like the consensus is that completing Riccardo's patch is the long-term path forward, so it's not clear to me that there's outstanding Serviceops work here at the moment (unless it's urgent that we finish Scott's patch to have an interim solution deployed).

Makes sense, then I'll close this one as duplicate of https://phabricator.wikimedia.org/T375014 and follow up on the other ticket. @Scott_French or @Blake feel free to reopen if you disagree

Reopening, since we'll likely need to make changes to sre.discovery.datacenter to adopt the functionality discussed in T375014 once it lands in Spicerack. I'm making that dependency explicit now, and will update the description shortly.

Scott_French triaged this task as Low priority.

Triaging as "Low" since, in practice, the main issue we've run into historically is DNS hosts that do not respond at all, which is (now) addressed by setting proper query timeouts. Also moving to backlog.

Scott_French renamed this task from "sre.discovery.datacenter should handle depooled authdns hosts" to "sre.discovery.datacenter should handle depooled dnsbox hosts". Wed, Jan 21, 8:06 PM
Scott_French updated the task description.

Change #1073524 abandoned by Scott French:

[operations/cookbooks@master] sre.discovery.datacenter: restrict checks to active authdns hosts

Reason:

Only addresses the authdns polling case, not recdns flush case. See I67e2f950631c2ed53fab9fb022ca7d27a6467315 for an alternative.

https://gerrit.wikimedia.org/r/1073524