I think we should move forward on this.
The arguments we've heard against this are:
- We already have (or will have, in the case of ulsfo) DC-local recursive caches which are <1ms away from all hosts, so there's little speed advantage.
- We occasionally have to make an unplanned change to local DNS data and then run rec_control to purge a record quickly from the DC-local pdns recursors, and that problem gets much trickier if we also have to purge it from host-level caches.
The arguments for it at this point are:
- We do hit issues in various places in our software stacks where code spams DNS requests and/or is very latency- or failure-sensitive about them. It's much easier to mitigate these issues en masse with host-local caches than it is to chase down every such case (cf. T171048, T151643, T171318, etc.).
- As low as our DC-local latency is, it's still slower than local host memory (and again, various bits of our software stacks likely have synchronous DNS stalls, spam DNS requests, or even both in combination).
- Without host-level caches, it's hard to work around the common issues with glibc resolver failover behavior (long delays in common operations like gethostbyname() on minor packet loss when more than one IP is specified).
- If any of our processes (even emergency ones) really require rec_control purging, we have two saner ways to deal with that:
  - Design our DNS data and our use of it better, so that TTLs work appropriately, and turn them down ahead of planned changes.
  - Use pdns_recursor or another solution with per-record purging for the host-local implementation, and use cumin when we need to purge widely.
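
To make the purge trade-off concrete, here is a toy sketch of what a host-local cache does with a record whose upstream data changes unexpectedly. It is not how pdns_recursor or systemd-resolved are implemented; the `HostLocalCache` class, its `purge` method, and the `db.example.` record are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class HostLocalCache:
    """Toy TTL cache standing in for a host-local forwarding resolver (hypothetical)."""

    def __init__(self,
                 upstream: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._upstream = upstream  # returns (address, ttl_seconds) for a name
        self._clock = clock        # injectable so the demo can control time
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry)

    def resolve(self, name: str) -> str:
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and hit[1] > now:
            return hit[0]  # answered from host memory, no network round trip
        address, ttl = self._upstream(name)
        self._entries[name] = (address, now + ttl)
        return address

    def purge(self, name: str) -> bool:
        """Drop a single record; True if something was actually cached."""
        return self._entries.pop(name, None) is not None


if __name__ == "__main__":
    records = {"db.example.": ("10.0.0.1", 300.0)}  # stand-in for the DC-local recursor
    cache = HostLocalCache(lambda name: records[name])

    print(cache.resolve("db.example."))            # fetched from upstream and cached
    records["db.example."] = ("10.0.0.2", 300.0)   # unplanned change upstream
    print(cache.resolve("db.example."))            # still the old answer until the TTL expires
    cache.purge("db.example.")                     # per-record purge on this host
    print(cache.resolve("db.example."))            # now reflects the change
```

The middle lookup is the case the objection is about: every host keeps serving the old answer until the TTL runs out or someone purges that record on that host. Short TTLs ahead of planned changes shrink that window, and a per-record purge run fleet-wide covers the unplanned ones.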
There are many forwarding cache options out there in the world, but barring strong arguments for one of them, the simplest path would be either to configure systemd-resolved's caching or to deploy forwarding-cache configurations of pdns_recursor (software we already run) everywhere.
Any other debate points here? Does anyone strongly feel we really shouldn't go down this road?