
Investigate better DNS cache/lookup solutions
Closed, Resolved · Public

Description

Copied from T103921:

Given we're in a closed, controlled environment and only harming our own DNS caches, I think we'd be better served by a more aggressive and redundant strategy where we don't use LVS, specify the recdns servers in resolv.conf directly, and have our resolver fire off parallel queries to them with aggressive timeouts, accepting the first legitimate response it gets. I think the reason glibc doesn't implement an option for this is to protect all the random caches in the world from excess load.
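As a rough illustration of that strategy, here's a minimal sketch of the parallel-query, first-answer-wins behavior, assuming the dnspython library; the server IPs and timeout value are placeholders, not our real recdns addresses:

```
from concurrent.futures import ThreadPoolExecutor, as_completed

import dns.message
import dns.query

# Hypothetical DC-local recursors and an aggressive per-query timeout;
# these values are placeholders for illustration only.
RECDNS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
TIMEOUT = 0.3


def resolve_first(qname, qtype="A"):
    """Send the same query to every listed recursor at once and return the
    first well-formed response; ignore the stragglers."""
    query = dns.message.make_query(qname, qtype)
    with ThreadPoolExecutor(max_workers=len(RECDNS)) as pool:
        futures = [pool.submit(dns.query.udp, query, ip, timeout=TIMEOUT)
                   for ip in RECDNS]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception:
                continue  # timeout or bad reply from this server; wait for another
    raise RuntimeError("no recursor answered within %.1fs" % TIMEOUT)


if __name__ == "__main__":
    print(resolve_first("en.wikipedia.org").answer)
```

glibc's stock resolver tries the nameservers in resolv.conf sequentially rather than in parallel, which is part of what the NSS-module idea below is meant to work around.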

In related IRC discussion last night, @MoritzMuehlenhoff pointed out an existing NSS module for similar stuff here: https://github.com/grobian/dnspq . That code is a little dated and only supports A records (it falls back to glibc for anything else; we'd need at least AAAA support added). But it's very short and simple. We could audit it and update it a bit in a GitHub fork to be sure it does exactly what we want and need (and send the changes back upstream as a pull request in case the author wants them too), then experiment with packaging and using it on the fleet.

Related Objects

Status    Assigned
Resolved  BBlack
Resolved  RobH
Resolved  BBlack
Declined  None
Resolved  BBlack
Resolved  RobH
Resolved  RobH
Resolved  BBlack
Resolved  BBlack
Resolved  BBlack
Resolved  RobH
Resolved  BBlack
Resolved  RobH
Resolved  RobH
Declined  None
Declined  None
Declined  None
Resolved  RobH
Resolved  None
Resolved  fgiunchedi
Resolved  RobH
Resolved  BBlack
Resolved  RobH
Open      None
Resolved  ayounsi
Resolved  ayounsi
Resolved  ayounsi

Event Timeline

BBlack raised the priority of this task to Medium.
BBlack updated the task description.
BBlack added a project: acl*sre-team.
BBlack added subscribers: BBlack, MoritzMuehlenhoff.

Re: dnspq, the upstream is responsive and open to improvements; I've sent some myself for carbon-c-relay. Re: the general topic, would a local full-fledged caching resolver help (with 127.0.0.1 in resolv.conf)?

So, to recap some implicit points: in the general case, we definitely don't want to put recursive resolvers on all the machines. Most of the machines don't have public routing for that anyway, and it would be pretty wasteful and inefficient regardless.

It would arguably be beneficial to have forwarding-only caching resolvers locally on each host: for repeat lookups within the TTL, they would give very slightly faster responses, and they would reduce (but not eliminate, because data still has to refresh when TTLs expire) our exposure to random UDP loss, which can lead to very slow lookup times for the calling application while it works through its retry/timeout logic.
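To make that trade-off concrete, here is a minimal sketch (again assuming dnspython, with a placeholder upstream IP) of what a host-local forwarding-only cache buys: repeat lookups within the TTL skip the network entirely, but the data still has to be re-fetched once the TTL runs out.

```
import time

import dns.message
import dns.query

UPSTREAM = "10.0.0.1"  # placeholder DC-local recursor, not a real address
_cache = {}            # (qname, qtype) -> (expiry timestamp, cached response)


def cached_resolve(qname, qtype="A"):
    key = (qname, qtype)
    now = time.time()
    hit = _cache.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]  # still within TTL: no network round trip, no UDP-loss exposure
    query = dns.message.make_query(qname, qtype)
    response = dns.query.udp(query, UPSTREAM, timeout=1.0)
    ttl = min((rrset.ttl for rrset in response.answer), default=60)
    _cache[key] = (now + ttl, response)  # cache only until the shortest TTL expires
    return response
```

A cache like this is also exactly the kind of extra state that would need host-local purging, which is the downside discussed next.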

One of the downsides of doing that, however, is that it would be much harder to purge local caches. For example, sometimes we switch a record in our authdns with a 1H TTL and then use salt against the recdns boxes to purge it, so we can move faster and more easily on some local work. We'd now have to include purging any relevant host-local caches in that process as well, and we'd have to get the ordering right (purge the real caches first, so that the host-local resolvers don't immediately pick up the old record again in a race).

IMHO, the better option is still to solve this at the glibc level. Ideally glibc would have some built-in settings to operate a little better in a "datacenter mode", but barring that, we're looking at an NSS module. If we had that working, we could probably eliminate recdns from LVS as well (which, from a certain point of view, is also basically a hack to reduce the impact of glibc's slow timeout/fallback behavior).

I think in this ideal world, we'd never configure a fallback recdns in a remote DC in resolv.conf like we do today. Instead we'd directly configure the 2-3 local recdns machines in resolv.conf (or equivalent), and the resolver's behavior would be aggressive and redundant: fire off queries to all listed servers immediately, back off on timing if there's no response at all, and always take whichever answer arrives first. The backoff would also be tuned down to values appropriate for DC-local latencies, so that the first few retries on random loss are very fast and we never wait an absurdly long time before simply failing.
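For illustration, a sketch of how that tuned backoff might look, building on a parallel query_all helper like the one sketched earlier; the schedule values are made up for DC-local latencies, not a concrete proposal:

```
RETRY_TIMEOUTS = [0.05, 0.1, 0.2, 0.4, 0.8]  # hypothetical per-round timeouts, in seconds


def resolve_dc_local(qname, query_all):
    """query_all(qname, timeout) is assumed to query every DC-local recursor
    in parallel and return the first answer, or None if all stay silent."""
    for timeout in RETRY_TIMEOUTS:
        answer = query_all(qname, timeout)
        if answer is not None:
            return answer
    # Fail after ~1.5s total rather than waiting tens of seconds as glibc can.
    raise RuntimeError("no recursor answered after %.2fs" % sum(RETRY_TIMEOUTS))
```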

Forwarding-only caching resolvers would help with issues such as T171048 and T151643.

Add T171318 to the list too. There's doubtless a long tail of issues we'll never fully identify that would be helped by work here. Part of the reason this ticket has been idling for so long is that it doesn't offer any simple path forward, just problems and problematic solutions. So let's step through things here:

  1. Per-site DNS caches: ulsfo will eventually get DC-level forwarding caches with its hardware refresh (the hardware has already arrived, but the re-deploy is a bit stalled at the moment), and all future edge DCs should get them as well as part of a standard design; this is already covered in T164327 and related tasks.
  2. Anycasting the redundant per-site DNS caches: this will eventually happen and is covered in T98006.
  3. Machine-local caches: there are a couple of arguments against this that have kept us from pursuing it, but at this point I think I'm in favor of it. I'll open a separate subtask about the pros/cons/work for this.
  4. glibc resolver failover strategy issues: probably not worth pursuing at all. We don't need to hack on glibc/NSS issues if we fix all of the above (or really, even just any two of the three fixes above).

So to recap a small part of IRC discussion today in the wake of issues with rebooting hydrogen, I think our short-term improvement plan looks like this:

  1. Implement OPS (one-packet scheduling) in pybal (already merged to master at https://gerrit.wikimedia.org/r/#/c/367903), package and deploy that, and configure it for dns_rec_udp. This should eliminate at least one class of issues we currently have when depooling and/or taking down recdns servers.
  2. We should push through a puppet patch to use $nameserver_override on the recdns boxes themselves, much like we do for the LVS boxes, but excluding themselves. In other words, each should have two IPs: the other local recdns box and the LVS recdns IP from the opposite DC (see the sketch after this list). This way they don't depend on themselves during startup or daemon restart and subtly break things.
  3. We should document clearly that, for now, we need to carefully edit (e.g. via puppet, or manually if that's faster during an outage) the nameservers_override values in resolv.conf on the LVSes and recdnses to avoid referencing a recdns machine that is down (or about to be taken down/rebooted).
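To illustrate step 2 above, here is a hypothetical sketch of the nameserver selection it describes; the hostnames and IPs are placeholders, not our real addressing:

```
# Placeholder data: each site's recdns hosts and its LVS recdns service IP.
RECDNS = {
    "eqiad": {"recdns1001": "10.64.0.10", "recdns1002": "10.64.0.11"},
    "codfw": {"recdns2001": "10.192.0.10", "recdns2002": "10.192.0.11"},
}
LVS_RECDNS = {"eqiad": "10.64.0.100", "codfw": "10.192.0.100"}
OPPOSITE = {"eqiad": "codfw", "codfw": "eqiad"}


def nameserver_override(host, site):
    """Build the resolv.conf nameserver list for a recdns host: its local
    peer(s), then the opposite DC's LVS recdns IP, and never the host itself."""
    peers = [ip for name, ip in RECDNS[site].items() if name != host]
    return peers + [LVS_RECDNS[OPPOSITE[site]]]


print(nameserver_override("recdns1001", "eqiad"))
# -> ['10.64.0.11', '10.192.0.100']
```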

Hopefully this will mitigate a lot of the worst fallout we're seeing, and simplify further debugging while we work on longer-term solutions to get off LVS-based recdns in general.

Change 367924 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] 1.13.10: Add support for One-packet scheduling (OPS)

https://gerrit.wikimedia.org/r/367924

Change 367925 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.13] 1.13.10: Add support for One-packet scheduling (OPS)

https://gerrit.wikimedia.org/r/367925

Change 367927 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] recdns: do not use self in local resolv.conf

https://gerrit.wikimedia.org/r/367927

Change 367924 merged by Ema:
[operations/debs/pybal@master] 1.13.10: Add support for One-packet scheduling (OPS)

https://gerrit.wikimedia.org/r/367924

Change 367925 merged by Ema:
[operations/debs/pybal@1.13] 1.13.10: Add support for One-packet scheduling (OPS)

https://gerrit.wikimedia.org/r/367925

Mentioned in SAL (#wikimedia-operations) [2017-07-27T09:48:49Z] <ema> pybal 1.13.10 (one-packet-scheduling) built and uploaded to apt.w.o T104442

Change 368162 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] pybal: one-packet-scheduling for dns_rec_udp

https://gerrit.wikimedia.org/r/368162

Mentioned in SAL (#wikimedia-operations) [2017-07-27T13:31:56Z] <ema> lvs1009, lvs1010: upgrade to pybal 1.13.10 (one-packet-scheduling) T104442

Change 367927 merged by BBlack:
[operations/puppet@production] recdns: do not use self in local resolv.conf

https://gerrit.wikimedia.org/r/367927

Change 368162 merged by Ema:
[operations/puppet@production] pybal: one-packet-scheduling for dns_rec_udp

https://gerrit.wikimedia.org/r/368162

Mentioned in SAL (#wikimedia-operations) [2017-08-01T07:35:34Z] <ema> lvs4003, lvs4004 (ulsfo secondaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T07:40:44Z] <ema> lvs4001, lvs4002 (ulsfo primaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T08:03:23Z] <ema> lvs3*: upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T08:32:04Z] <ema> lvs2004-2006 (codfw secondaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T08:55:34Z] <ema> lvs2001-2003 (codfw primaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T14:28:08Z] <ema> lvs1004-1006 (eqiad secondaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T14:45:18Z] <ema> lvs1001-1003 (eqiad primaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-09-01T09:04:05Z] <ema> lvs1007 upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

BBlack claimed this task.

With anycast recdns deployed at all sites, with fallback routing towards the cores (or to the opposite core, as the case may be), I think we're in pretty good shape here at this point. If there are other specific improvements we want to make, they should probably be re-evaluated in the current context and handled in smaller-scoped tickets like T171498.