
Implement machine-local forwarding DNS caches
Open, Medium, Public

Description

I think we should move forward on this.

The arguments we've heard against this are:

  1. We already have (or will have, in the case of ulsfo) DC-local recursive caches which are well under a millisecond away from all hosts, so there's little speed advantage.
  2. We occasionally have to make an unplanned change to local DNS data and then run rec_control to purge a record quickly from the DC-local pdns recursors; that problem gets much trickier if we also have to purge it from host-level caches.

The arguments for it at this point are:

  1. We do hit issues in various places in our software stacks where code spams DNS requests and/or is very latency/failure-sensitive about them. It's much easier to mitigate these issues en masse with host-local caches than it is to chase down every such case (cf. T171048, T151643, T171318, etc.).
  2. As low as our DC-local latency is, it's still slower than local host memory (and again, various bits of our software stacks likely have synchronous DNS stalls, spam DNS requests, or even both in combination).
  3. Without host-level caches, it's hard to work around the common issues with glibc resolver failover behavior (long delays in common operations like gethostbyname() on minor packet loss when more than one IP is specified).
  4. If any of our processes (even emergency ones) really require rec_control purging, we have two saner ways to deal with that:
    1. Design our DNS data and use of it better, so that TTLs work appropriately, and turn them down ahead of planned changes.
    2. Use pdns_recursor or another solution with per-record purging for the host-local implementation and use cumin when we need to purge widely.
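
As a concrete illustration of option 4.2, a wide purge could look something like the sketch below. This assumes pdns_recursor is the host-local cache; the cumin alias and hostname are made-up placeholders, not a settled design.

```
# Sketch only: wipe one record from every host-local pdns_recursor via cumin.
# 'A:all-hosts' is a hypothetical alias for the whole fleet, and
# foo.eqiad.wmnet is a placeholder name; rec_control's wipe-cache
# subcommand takes one or more domain names to purge.
sudo cumin 'A:all-hosts' 'rec_control wipe-cache foo.eqiad.wmnet'
```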

There are many forwarding cache options out there in the world, but barring strong arguments for them the simplest path would be either to configure systemd-resolved's caching, or to deploy forwarding-cache configurations of pdns_recursor (software we already run) everywhere.
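
For the pdns_recursor route, a forwarding-cache configuration is fairly small; roughly the sketch below (the addresses and cache size are illustrative placeholders, not a decided configuration):

```
# /etc/powerdns/recursor.conf -- sketch of a host-local forwarding cache.
# Listen only on the loopback for local clients.
local-address=127.0.0.1
local-port=53
# Don't recurse ourselves: forward everything to the DC-local shared
# recursors (placeholder IPs).
forward-zones-recurse=.=10.3.0.1;10.3.0.2
# Keep the cache modest for a per-host stub.
max-cache-entries=100000
```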

Any other debate points here? Does anyone strongly feel we really shouldn't go down this road?

Event Timeline

I think this is a good idea overall and that we should be doing that. A few points:

  • I'm worried a little bit that this will hide issues like the ones you mentioned under the carpet. The cases where services are latency/failure-sensitive especially are issues we should be fixing. I'm worried that with a local recursor we'll just make them manifest even less often and in even more corner-cases :/
  • For the other case of services flooding our recursors, we should probably be gathering statistics from the local recursor and monitor them in a similar fashion as we do in the "central" recursors, right?
  • The glibc resolver issues with multiple recursors/timeouts are something we can't get around addressing, I think :( The local recursor can fail (and will regularly fail when e.g. restarting it), so the system needs to operate even without it...
  • I think designing our DNS data in a way where we never need to flush caches is a bit too optimistic, but I think the proposed solution of just using cumin for this use case sounds like a perfect fit. I wonder if we could get away with just flushing the whole cache altogether rather than flushing specific records and thus potentially put systemd-resolved back on the table?
  • I'm worried a little bit that this will hide issues like the ones you mentioned under the carpet. The cases where services are latency/failure-sensitive especially are issues we should be fixing. I'm worried that with a local recursor we'll just make them manifest even less often and in even more corner-cases :/

I think it probably will, but I just don't think it's realistic that we can chase all of these down (or that we even care, so long as it doesn't rise to a notable issue). We have a ton of ancillary software in play, and it's common for software authors to take a naive view of the DNS, causing scenarios like these. It's just a lot of long-tail deep work for very little real-world gain.

  • For the other case of services flooding our recursors, we should probably be gathering statistics from the local recursor and monitor them in a similar fashion as we do in the "central" recursors, right?

Yeah, we could. Just like we gather per-host TCP metrics and so on, there's no reason we shouldn't be gathering per-host recursor stats (assuming the chosen solution even exposes them).
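
For instance (assuming we end up with pdns_recursor as the host-local cache), its counters are easy to pull locally and feed into whatever per-host metrics pipeline we already run:

```
# Dump all of the local recursor's statistics counters (cache hits/misses,
# outgoing queries, timeouts, ...) for scraping into per-host metrics.
rec_control get-all

# Or fetch just the counters of interest.
rec_control get cache-hits cache-misses
```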

  • The glibc resolver issues with multiple recursors/timeouts are something we can't get around addressing, I think :( The local recursor can fail (and will regularly fail when e.g. restarting it), so the system needs to operate even without it...

I'll address this separately in a follow-up post in a few, along with systemd-resolved to some degree...

  • I think designing our DNS data in a way where we never need to flush caches is a bit too optimistic

Perhaps "never" is a strong word - there will probably always be emergency cases. But we can at least avoid designs that explicitly rely on purging for routine operations (e.g. planned DC switches).

but I think the proposed solution of just using cumin for this use case sounds like a perfect fit. I wonder if we could get away with just flushing the whole cache altogether rather than flushing specific records and thus potentially put systemd-resolved back on the table?

I think for the per-host stub caches, flushing the whole cache should be fine.
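
For example, if systemd-resolved ends up being the stub, a fleet-wide whole-cache flush would be roughly the one-liner below (the cumin alias is again a hypothetical placeholder):

```
# Flush the entire local systemd-resolved cache on every host.
sudo cumin 'A:all-hosts' 'resolvectl flush-caches'
```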

I think this is actually fairly orthogonal to some of the other improvements. Not sure what current/modern thinking is on this either; it probably needs re-evaluation. My gut feeling is to lean against bothering with this right now.

In the past couple of weeks we've had a real about-face on this issue, and I think there's a pretty strong consensus and rationale to pursue some kind of host-level caching, but there are details to sort out. Some data points to bring this argument up to speed:

  • Our eqiad recursors were recently seeing query traffic on the order of a combined ~150K/sec under "normal"-ish conditions. This is a high enough pps rate that it was causing issues for the bog-standard onboard Tigon3 card of a single recdns box while the other was depooled for reimaging. [This has been mitigated for now by stopping one of the single largest sources of that traffic in T239862, and one of the more proximate causes was that php isn't doing the internal DNS caching that hhvm did, but either way it's a more general problem than this one case in the long term.]
  • Our DNS infra has in general been refactored and re-architected a bit lately. All recursor machines now have host-local authservers on the same box. This means the shared recursors incur very little (almost negligible) cost on cache misses for our own internal domainnames.
  • Given the above, it's now reasonable to dramatically lower authdns TTLs (carefully over time just in case) for internal-only domains. Lowering these TTLs doesn't change the rate at which non-caching end hosts query the shared recursors, only the rate at which the shared recursors query the authservers (which is now always over the loopback interface and super-cheap).
  • With sufficiently low TTLs for these internal domains (I'm sure we can get sub-30s without issue, maybe as low as 5-10s?), the cache-wiping problems for per-host caching stubs mostly evaporate as well. We'd still have that problem for public hostnames (e.g. in wikimedia.org), but that's a smaller problem, and frankly our own rec_control wipes were never a great solution there either vs managing them properly (because they do have public cache visibility outside of our control). While per-host stub caching of even a very short full TTL doesn't cause wiping problems, what it does do is prevent the common spam cases where an application daemon makes hundreds of lookups per second on the same hostname, reducing those queries to the shared recursors down to ~1/TTL/host.
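
To make the TTL point concrete, here's a hypothetical zone-file sketch (the name and address are made up); the only change is the TTL field, applied at the authservers:

```
; Before: an hour-long TTL means stale data can linger in every cache.
foo.svc.eqiad.wmnet.    3600    IN  A   10.2.2.42

; After: a ~10s TTL bounds staleness, while a per-host caching stub still
; collapses hundreds of app lookups/sec on this name down to roughly one
; upstream query per TTL per host.
foo.svc.eqiad.wmnet.    10      IN  A   10.2.2.42
```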

All of that being said, the next question becomes the host-level caching stub implementation. Injecting full-featured DNS cache software on every host is theoretically doable, but it gets a little messy: such caches are more full-featured and resource-hungry than we need, and hooking them up over some loopback IP via /etc/resolv.conf gets tricky to do in a way that avoids racy failures on machine start. systemd-resolved has a number of nice properties in this regard: it can be used via NSS hookups for standard glibc gethostbyname() traffic and such without touching resolv.conf, and the NSS hookup mechanism allows glibc to fall back quickly to querying the resolv.conf remote server when systemd-resolved isn't yet running or crashes or whatever. One of the downsides we've talked about with this approach is that it has almost no configurability, including no way to apply a maximum TTL cap to cached records (but with the above internal TTL reductions, I think that problem is sufficiently squished). The documentation also talks about some strange behaviors that probably only make sense for laptops and might hurt us (e.g. if queried for a single-label hostname, it may spam multicast DNS lookups out the interfaces). We're not sure yet whether there are indirect ways to tell it to never try these multicast things, or how much of a real problem they are.
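
To make the systemd-resolved option more concrete, a minimal sketch of the hookup might look like the below; the option values and upstream IPs are illustrative assumptions, not a decided configuration:

```
# /etc/nsswitch.conf (hosts line only): consult systemd-resolved first, but
# fall straight through to files/dns if resolved isn't running or has crashed.
hosts: resolve [!UNAVAIL=return] files dns

# /etc/systemd/resolved.conf: keep caching on, point at the DC-local shared
# recursors (placeholder IPs), and turn off the laptop-oriented multicast
# behaviours (LLMNR / mDNS) that worry us for single-label lookups.
[Resolve]
DNS=10.3.0.1 10.3.0.2
Cache=yes
LLMNR=no
MulticastDNS=no
```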

I think it's worth investigating the host stub cache scene with some priority for now, while concurrently also working on internal domain TTL reductions.

Change 678907 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] (WIP) systemd::resolved: start work on puppet module for systemd-resolved

https://gerrit.wikimedia.org/r/678907

Change 678907 merged by Jbond:

[operations/puppet@production] (WIP) systemd::resolved: start work on puppet module for systemd-resolved

https://gerrit.wikimedia.org/r/678907

Change 690515 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] O:base::resolving: drop the domain keyword and use the domain fact

https://gerrit.wikimedia.org/r/690515

Change 690529 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] O:base::resolving: make nameservers mandatory

https://gerrit.wikimedia.org/r/690529

Change 690522 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] O:base::resolver: unify resolv.conf templates

https://gerrit.wikimedia.org/r/690522

Change 691080 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] resolvconf: create new class

https://gerrit.wikimedia.org/r/691080

jbond moved this task from Patch for Review to planned on the User-jbond board.

Change 690515 merged by Jbond:

[operations/puppet@production] O:base::resolving: drop the domain keyword and use the domain fact

https://gerrit.wikimedia.org/r/690515

Change 690529 merged by Jbond:

[operations/puppet@production] O:base::resolving: make nameservers mandatory

https://gerrit.wikimedia.org/r/690529

Change 690522 merged by Jbond:

[operations/puppet@production] O:base::resolver: unify resolv.conf templates

https://gerrit.wikimedia.org/r/690522

Change 691080 merged by Jbond:

[operations/puppet@production] resolvconf: create new class

https://gerrit.wikimedia.org/r/691080

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!