
Investigate better DNS cache/lookup solutions
Closed, Resolved · Public

Description

Copied from T103921:

Given we're in a closed, controlled environment and only harming our own DNS caches, I think we'd be better served by a more aggressive and redundant strategy where we don't use LVS, specify the recdns servers in resolv.conf directly, and have our resolver fire off parallel queries to them with aggressive timeouts, accepting the first legitimate response it gets. I think the reason glibc doesn't implement an option for this is to protect all the random caches in the world from excess load.
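As a rough illustration of that strategy, here's a minimal sketch of the parallel-query, first-answer-wins behavior, assuming the dnspython library; the server IPs and timeout value are placeholders, not our real recdns addresses:

```
from concurrent.futures import ThreadPoolExecutor, as_completed

import dns.message
import dns.query

# Hypothetical DC-local recursors and an aggressive per-query timeout;
# these values are placeholders for illustration only.
RECDNS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
TIMEOUT = 0.3


def resolve_first(qname, qtype="A"):
    """Send the same query to every listed recursor at once and return the
    first well-formed response; ignore the stragglers."""
    query = dns.message.make_query(qname, qtype)
    with ThreadPoolExecutor(max_workers=len(RECDNS)) as pool:
        futures = [pool.submit(dns.query.udp, query, ip, timeout=TIMEOUT)
                   for ip in RECDNS]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception:
                continue  # timeout or bad reply from this server; wait for another
    raise RuntimeError("no recursor answered within %.1fs" % TIMEOUT)


if __name__ == "__main__":
    print(resolve_first("en.wikipedia.org").answer)
```

glibc's stock resolver tries the nameservers in resolv.conf sequentially rather than in parallel, which is part of what the NSS-module idea below is meant to work around.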

In related IRC discussion last night, @MoritzMuehlenhoff pointed out an existing NSS module for similar stuff here: https://github.com/grobian/dnspq . That code is a little dated and only supports A records (it falls back to glibc for anything else; we'd need at least AAAA support added). But it's very short and simple. We could audit it and update it a bit in a GitHub fork to be sure it does exactly what we want and need (and send the changes back upstream as a pull request in case the author wants them too), then experiment with packaging and using it on the fleet.

Related Objects

Status    Assigned
Resolved  BBlack
Resolved  RobH
Resolved  BBlack
Declined  None
Resolved  BBlack
Resolved  RobH
Resolved  RobH
Resolved  BBlack
Resolved  BBlack
Resolved  BBlack
Resolved  RobH
Resolved  BBlack
Resolved  RobH
Resolved  RobH
Declined  None
Declined  None
Declined  None
Resolved  RobH
Resolved  None
Resolved  fgiunchedi
Resolved  RobH
Resolved  BBlack
Resolved  RobH
Open      None
Resolved  ayounsi
Resolved  ayounsi
Resolved  ayounsi

Event Timeline

BBlack raised the priority of this task to Medium.
BBlack updated the task description.
BBlack added a project: acl*sre-team.
BBlack added subscribers: BBlack, MoritzMuehlenhoff.

Re: dnspq, the upstream is responsive and open to improvements; I've sent some myself for carbon-c-relay. Re: the general topic, would a local full-fledged caching resolver help (with 127.0.0.1 in resolv.conf)?

So, to recap some implicit points: in the general case, we definitely don't want to put recursive resolvers on all the machines. Most of the machines don't have public routing for that anyway, and it would be pretty wasteful and inefficient regardless.

It would arguably be beneficial to have forwarding-only caching resolvers locally on each host: for repeat lookups within the TTL, they would give very slightly faster responses, and they would reduce (but not eliminate, because data still has to refresh when TTLs expire) our exposure to random UDP loss, which can lead to very slow lookup times for the calling application while it works through its retry/timeout logic.
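To make that trade-off concrete, here is a minimal sketch (again assuming dnspython, with a placeholder upstream IP) of what a host-local forwarding-only cache buys: repeat lookups within the TTL skip the network entirely, but the data still has to be re-fetched once the TTL runs out.

```
import time

import dns.message
import dns.query

UPSTREAM = "10.0.0.1"  # placeholder DC-local recursor, not a real address
_cache = {}            # (qname, qtype) -> (expiry timestamp, cached response)


def cached_resolve(qname, qtype="A"):
    key = (qname, qtype)
    now = time.time()
    hit = _cache.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]  # still within TTL: no network round trip, no UDP-loss exposure
    query = dns.message.make_query(qname, qtype)
    response = dns.query.udp(query, UPSTREAM, timeout=1.0)
    ttl = min((rrset.ttl for rrset in response.answer), default=60)
    _cache[key] = (now + ttl, response)  # cache only until the shortest TTL expires
    return response
```

A cache like this is also exactly the kind of extra state that would need host-local purging, which is the downside discussed next.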

One of the downsides of doing that, however, is that it would be much harder to purge local caches. For example, sometimes we switch a record in our authdns with a 1H TTL and then use salt against the recdns boxes to purge it, so we can move faster and more easily on some local work. We'd now have to include purging any relevant host-local caches in that process as well, and we'd have to get the ordering right (purge the real caches first, so that the host-local resolvers don't immediately pick up the old record again in a race).

IMHO, the better option is still to solve this at the glibc level. Ideally glibc would have some built-in settings to operate a little better in a "datacenter mode", but barring that, we're looking at an NSS module. If we had that working, we could probably eliminate recdns from LVS as well (which, from a certain point of view, is also basically a hack to reduce the impact of glibc's slow timeout/fallback behavior).

I think in this ideal world, we'd never configure a fallback recdns in a remote DC in resolv.conf like we do today. Instead we'd directly configure the 2-3 local recdns machines in resolv.conf (or equivalent), and the resolver's behavior would be aggressive and redundant: fire off queries to all listed servers immediately, back off on timing if there's no response at all, and always take whichever answer arrives first. The backoff would also be tuned down to values appropriate for DC-local latencies, so that the first few retries on random loss are very fast and we never wait an absurdly long time before simply failing.
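For illustration, a sketch of how that tuned backoff might look, building on a parallel query_all helper like the one sketched earlier; the schedule values are made up for DC-local latencies, not a concrete proposal:

```
RETRY_TIMEOUTS = [0.05, 0.1, 0.2, 0.4, 0.8]  # hypothetical per-round timeouts, in seconds


def resolve_dc_local(qname, query_all):
    """query_all(qname, timeout) is assumed to query every DC-local recursor
    in parallel and return the first answer, or None if all stay silent."""
    for timeout in RETRY_TIMEOUTS:
        answer = query_all(qname, timeout)
        if answer is not None:
            return answer
    # Fail after ~1.5s total rather than waiting tens of seconds as glibc can.
    raise RuntimeError("no recursor answered after %.2fs" % sum(RETRY_TIMEOUTS))
```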

Forwarding-only caching resolvers would help with issues such as T171048 and T151643.

Add T171318 to the list too. There's doubtless a long tail of issues we'll never fully identify that would be helped by work here. Part of the reason this ticket has been idling for so long is that it doesn't offer any simple path forward, just problems and problematic solutions. So let's step through things here:

  1. Per-site DNS caches: ulsfo will eventually get DC-level forwarding caches with its hardware refresh (the hardware has already arrived, but the re-deploy is a bit stalled at the moment), and all future edge DCs should get them as well as part of a standard design; this is already covered in T164327 and related tasks.
  2. Anycasting the redundant per-site DNS caches: this will eventually happen and is covered in T98006.
  3. Machine-local caches: there are a couple of arguments against this that have kept us from pursuing it, but at this point I think I'm in favor of it. I'll open a separate subtask about the pros/cons/work for this.
  4. glibc resolver failover strategy issues: probably not worth pursuing at all. We don't need to hack on glibc/NSS issues if we fix all of the above (or really, even just any two of the three fixes above).

So to recap a small part of IRC discussion today in the wake of issues with rebooting hydrogen, I think our short-term improvement plan looks like this:

  1. Implement OPS (one-packet scheduling) in pybal (already merged to master at https://gerrit.wikimedia.org/r/#/c/367903), package and deploy that, and configure it for dns_rec_udp. This should eliminate at least one class of issues we currently have when depooling and/or taking down recdns servers.
  2. We should push through a puppet patch to use $nameserver_override on the recdns boxes themselves, much like we do for the LVS boxes, but excluding themselves. In other words, each should have two IPs: the other local recdns box and the LVS recdns IP from the opposite DC (see the sketch after this list). This way they don't depend on themselves during startup or daemon restart and subtly break things.
  3. We should document clearly that, for now, we need to carefully edit (e.g. via puppet, or manually if that's faster during an outage) the nameservers_override values in resolv.conf on the LVSes and recdnses to avoid referencing a recdns machine that is down (or about to be taken down/rebooted).
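To illustrate step 2 above, here is a hypothetical sketch of the nameserver selection it describes; the hostnames and IPs are placeholders, not our real addressing:

```
# Placeholder data: each site's recdns hosts and its LVS recdns service IP.
RECDNS = {
    "eqiad": {"recdns1001": "10.64.0.10", "recdns1002": "10.64.0.11"},
    "codfw": {"recdns2001": "10.192.0.10", "recdns2002": "10.192.0.11"},
}
LVS_RECDNS = {"eqiad": "10.64.0.100", "codfw": "10.192.0.100"}
OPPOSITE = {"eqiad": "codfw", "codfw": "eqiad"}


def nameserver_override(host, site):
    """Build the resolv.conf nameserver list for a recdns host: its local
    peer(s), then the opposite DC's LVS recdns IP, and never the host itself."""
    peers = [ip for name, ip in RECDNS[site].items() if name != host]
    return peers + [LVS_RECDNS[OPPOSITE[site]]]


print(nameserver_override("recdns1001", "eqiad"))
# -> ['10.64.0.11', '10.192.0.100']
```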

Hopefully this will mitigate a lot of the worst fallout we're seeing, and simplify further debugging while we work on longer-term solutions to get off LVS-based recdns in general.

Change 367924 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] 1.13.10: Add support for One-packet scheduling (OPS)

https://gerrit.wikimedia.org/r/367924

Change 367925 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.13] 1.13.10: Add support for One-packet scheduling (OPS)

https://gerrit.wikimedia.org/r/367925

Change 367927 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] recdns: do not use self in local resolv.conf

https://gerrit.wikimedia.org/r/367927

Change 367924 merged by Ema:
[operations/debs/pybal@master] 1.13.10: Add support for One-packet scheduling (OPS)

https://gerrit.wikimedia.org/r/367924

Change 367925 merged by Ema:
[operations/debs/pybal@1.13] 1.13.10: Add support for One-packet scheduling (OPS)

https://gerrit.wikimedia.org/r/367925

Mentioned in SAL (#wikimedia-operations) [2017-07-27T09:48:49Z] <ema> pybal 1.13.10 (one-packet-scheduling) built and uploaded to apt.w.o T104442

Change 368162 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] pybal: one-packet-scheduling for dns_rec_udp

https://gerrit.wikimedia.org/r/368162

Mentioned in SAL (#wikimedia-operations) [2017-07-27T13:31:56Z] <ema> lvs1009, lvs1010: upgrade to pybal 1.13.10 (one-packet-scheduling) T104442

Change 367927 merged by BBlack:
[operations/puppet@production] recdns: do not use self in local resolv.conf

https://gerrit.wikimedia.org/r/367927

Change 368162 merged by Ema:
[operations/puppet@production] pybal: one-packet-scheduling for dns_rec_udp

https://gerrit.wikimedia.org/r/368162

Mentioned in SAL (#wikimedia-operations) [2017-08-01T07:35:34Z] <ema> lvs4003, lvs4004 (ulsfo secondaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T07:40:44Z] <ema> lvs4001, lvs4002 (ulsfo primaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T08:03:23Z] <ema> lvs3*: upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T08:32:04Z] <ema> lvs2004-2006 (codfw secondaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T08:55:34Z] <ema> lvs2001-2003 (codfw primaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T14:28:08Z] <ema> lvs1004-1006 (eqiad secondaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-08-01T14:45:18Z] <ema> lvs1001-1003 (eqiad primaries): upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

Mentioned in SAL (#wikimedia-operations) [2017-09-01T09:04:05Z] <ema> lvs1007 upgrade to pybal 1.13.11 - one-packet-scheduling, instrumentation fixes. T104442, T103882

BBlack claimed this task.

With anycast recdns deployed at all sites, with fallback routing towards the cores (or to the opposite core, as the case may be), I think we're in pretty good shape here at this point. If there are other specific improvements we want to make, they should probably be re-evaluated in the current context and handled in smaller-scoped tickets like T171498.