I've been trying to track down the source of intermittent PyBal failures in the IdleConnection monitor healthchecks. The backend servers don't seem to be at fault here. The failures hit ~1-3 random idle monitors every few minutes on the eqiad high-traffic1 LVS (I haven't looked much elsewhere, but at a quick glance esams doesn't seem to have the same problem...). Usually the affected servers come back a few seconds later and repool. Once in a blue moon, whatever is causing this gets into some catastrophic coincidence scenario and tries to depool multiple servers from one pool all at once, which then temporarily trips depool_threshold, etc. All in all, it's not good.
This heavily-filtered log snippet captures a single minute in time, for just textsvc_80/cp1065, when one of these events happened (plus another around the same timeframe, unrelated):
```
2015-06-25 20:03:05.751902 [textsvc_80 IdleConnection] cp1065.eqiad.wmnet (enabled/up/pooled): Connection established.
2015-06-25 20:03:11.954722 [textsvc_80 IdleConnection] cp1065.eqiad.wmnet (enabled/up/pooled): Connection established.
2015-06-25 20:03:18.559001 [textsvc_80 IdleConnection] cp1065.eqiad.wmnet (enabled/up/pooled): Connection established.
2015-06-25 20:03:24.763018 [textsvc_80 IdleConnection] cp1065.eqiad.wmnet (enabled/up/pooled): Connection established.
2015-06-25 20:03:34.167234 [textsvc_80] Monitoring instance IdleConnection reports server cp1065.eqiad.wmnet (enabled/up/pooled) down: User timeout caused connection failure.
2015-06-25 20:03:34.167455 ['-d -t 10.2.2.25:80 -r 10.64.0.102']
2015-06-25 20:03:34.174287 [textsvc_80 IdleConnection] cp1065.eqiad.wmnet (enabled/down/not pooled): Connection failed.
2015-06-25 20:03:36.533444 [textsvc_80] Server cp1065.eqiad.wmnet (enabled/partially up/not pooled) is up
2015-06-25 20:03:36.533681 ['-a -t 10.2.2.25:80 -r 10.64.0.102 -w 1']
2015-06-25 20:03:36.541297 [textsvc_80 IdleConnection] cp1065.eqiad.wmnet (enabled/up/pooled): Connection established.
2015-06-25 20:03:42.574250 [textsvc_80 IdleConnection] cp1065.eqiad.wmnet (enabled/up/pooled): Connection established.
2015-06-25 20:03:48.980416 [textsvc_80 IdleConnection] cp1065.eqiad.wmnet (enabled/up/pooled): Connection established.
2015-06-25 20:03:55.182072 [textsvc_80 IdleConnection] cp1065.eqiad.wmnet (enabled/up/pooled): Connection established.
```
I've had many wild theories over the past day while staring at related aspects of this. The only one that has held up well is that this is somehow related to pybal making runtime DNS queries. I can't say which DNS queries, or for what; it may be that delays in other unrelated parts of pybal, or in unrelated healthchecks, are just slowing the event loop down enough to cause random harm to unrelated IdleConnection monitors.
What I have been able to very reliably reproduce is the following test on lvs100[14] (high-traffic1 in eqiad):
- Under its normal resolv.conf settings (eqiad then codfw recdns LVS addresses), the intermittent failures are there
- Editing resolv.conf to just the eqiad recdns LVS, still failures are there
- Editing resolv.conf to just the codfw recdns LVS, no failure (!)
- Editing resolv.conf to the eqiad recursors directly (chromium, hydrogen), no failure
- Editing resolv.conf to the eqiad recursors directly (hydrogen, chromium): I've seen 1-2 failures here, but it's far better
I've re-run those scenarios countless times and the results are very consistent. Also, the change takes effect without a pybal restart, implying it's libc-level resolver behavior that's making the difference, I think. Hydrogen seems to be at fault here somehow, and somehow more impactful when reached via LVS? Honestly, there are probably too many other variables over the test period to know. What is consistent is that dropping eqiad-LVS-recdns from the list helps a lot, usually completely.
One layer of bug here is that pybal is apparently very sensitive to whatever this DNS hiccup is. We could maybe address that in future code revs. The other layer is finding and solving the hiccup in the pdns_recursor service on hydrogen (I haven't tried restarting it yet, because I want to be able to dig further into this later).
We could also explore making DNS resolution more reliable than the way glibc does it, for the whole fleet. glibc is conservative: it tries the first server on the list, times out after 3 seconds, then tries the next, and so on. Given random machine/service hiccups, unreliable UDP, random packet loss, etc., that's not a great strategy, with or without LVS helping us.
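To make the cost of that sequential strategy concrete, here's a toy model (mine, not glibc's actual code): every dead or lossy server ahead of a working one burns a full timeout before the resolver moves down the list, so even one dropped packet to the first nameserver stalls the lookup for seconds.

```python
def sequential_lookup_delay(servers_ok, timeout=3.0):
    """Toy model of a glibc-style sequential resolver.

    servers_ok: one boolean per resolv.conf nameserver, True if that
    server answers. Returns seconds spent waiting before an answer
    arrives (answer latency itself modeled as ~0), or None if every
    server on the list timed out.
    """
    elapsed = 0.0
    for ok in servers_ok:
        if ok:
            return elapsed      # this server answers; we're done
        elapsed += timeout      # full timeout burned, try the next one
    return None

# One flaky first server adds a 3-second stall to every affected lookup:
# sequential_lookup_delay([False, True]) -> 3.0
```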
Given we're in a closed, controlled environment and only harming our own DNS caches, I think we'd be better served by a more aggressive and redundant strategy where we don't use LVS, specify the recdns servers in resolv.conf directly, and our resolver fires off parallel queries to them with aggressive timeouts and accepts the first legit response it gets. I think the reason glibc doesn't implement an option for this is to protect all the random caches in the world from excess load.
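As a rough sketch of that "parallel queries, aggressive timeout, first legit answer wins" idea (this is my illustration, not dnspq or any existing module; the packet building is a minimal hand-rolled DNS query and `resolve_first` is a hypothetical name):

```python
import socket
import select
import struct

def build_query(qname, qtype=1, qid=0x1234):
    """Build a minimal DNS query packet (QCLASS=IN, RD flag set)."""
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    question = b"".join(
        bytes([len(label)]) + label.encode() for label in qname.split(".")
    ) + b"\x00"
    question += struct.pack(">HH", qtype, 1)  # QTYPE, QCLASS
    return header + question

def resolve_first(qname, servers, timeout=0.3, attempts=3):
    """Send the same query to all servers at once; return the first reply.

    Instead of walking the list sequentially, we race every recursor
    and keep whichever answer arrives first, retrying the whole batch
    on timeout with a short per-attempt deadline.
    """
    pkt = build_query(qname)
    for _ in range(attempts):
        socks = []
        for srv in servers:
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.setblocking(False)
            s.sendto(pkt, (srv, 53))
            socks.append(s)
        try:
            ready, _, _ = select.select(socks, [], [], timeout)
            for s in ready:
                data, _addr = s.recvfrom(512)
                if data[:2] == pkt[:2]:  # query ID matches
                    return data
        finally:
            for s in socks:
                s.close()
    return None  # all servers failed all attempts
```

With a 0.3s per-attempt timeout and 3 attempts, the worst case is under a second even when one recursor is silently dropping packets, versus a multi-second stall with the sequential approach.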
In related IRC discussion last night, @MoritzMuehlenhoff pointed out an existing NSS module for similar stuff: https://github.com/grobian/dnspq . That code is a little dated, and only supports A records (it falls back to glibc for anything else; we'd need at least AAAA added). But it's very short and simple. We could fork it on github, audit and update it to do exactly what we want and need (and send the changes back as a pull request in case the author wants them too), and experiment with packaging and using it on the fleet.
In the meantime, I'd like to propose a couple of fixes to clean up the LVS fail-rate in eqiad regardless of the above:
- Reduce the resolv.conf timeout from 3s to 1s, so that it takes less time to fail down the list.
- Switch all the LVSes to using local recursors directly rather than via-LVS, as that seems more stable and direct. The downside is that if the first recdns machine on the list fails, the fallout might be worse than the current status quo with an LVS depool. The LVSes that actually serve recdns are already set up direct; just not the others.
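Concretely, the resolv.conf on the LVSes would look something like this (a sketch; the nameserver lines are placeholders for the actual recursor addresses, ordered per-site):

```
# /etc/resolv.conf (sketch): direct recursors, 1s failover instead of 3s
options timeout:1 attempts:2
nameserver <chromium's IP>
nameserver <hydrogen's IP>
```

`options timeout:n` and `attempts:n` are standard glibc resolver options, so this part needs no new tooling; it just shortens the stall when the first server doesn't answer.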