Page MenuHomePhabricator

Pybal not happy with DNS delays
Closed, ResolvedPublic

Description

During the T154758 incident, I noticed that lvs2006 was marking all of the backends down, not just the row A ones. While investigating, I noticed that the first nameserver in resolv.conf is acamar, which is in row A and is thus unreachable by LVSes (which sit on all of the subnets). I guessed this may be causing issues, so I commented that out, and things started to recover for Pybal shortly after that.

This didn't cause any actual problems this time but the situation would be different if this was A2 instead (and so lvs2003) and would result into multiple services failing at codfw.

So, our resolv.conf has timeout:1 attempts:3 but it seems that even that isn't enough for Pybal. I'd say that this is something that we should probably fix in Pybal (being more resilient to DNS delays) but perhaps we can do something to alleviate this in the infrastructure as well. (Before anyone says "anycast", note that we have anycasted dns-rec-lb but that goes via LVS, so LVS servers are the only ones not using it because of the catch-22).

Event Timeline

Pybal not failing over to the next DNS server in resolv.conf has been mentioned in T83662 as well.

From what I understand, this happened again today during asw-a-codfw's upgrade (T168462), i.e. similar circumstances -> similar symptoms. We should probably prioritize a fix or workaround for this, otherwise we'll just hit again at the next scheduled upgrade (or worse, during an outage).

Twisted uses ThreadedResolver (internet/base.py) as the default resolver (I got sidetracked for a while and thought that ResolverBase in names/common.py was the one, but no). The getHostByName implementation of ThreadedResolver is as follows as of Twisted 16.2.0:

def getHostByName(self, name, timeout = (1, 3, 11, 45)):
    """
    See L{twisted.internet.interfaces.IResolverSimple.getHostByName}.

    Note that the elements of C{timeout} are summed and the result is used
    as a timeout for the lookup.  Any intermediate timeout or retry logic
    is left up to the platform via L{socket.gethostbyname}.
    """
    if timeout:
        timeoutDelay = sum(timeout)
    else:
        timeoutDelay = 60
    userDeferred = defer.Deferred()
    lookupDeferred = threads.deferToThreadPool(
        self.reactor, self.reactor.getThreadPool(),
        socket.gethostbyname, name)
    cancelCall = self.reactor.callLater(
        timeoutDelay, self._cleanup, name, lookupDeferred)
    self._runningQueries[lookupDeferred] = (userDeferred, cancelCall)
    lookupDeferred.addBoth(self._checkTimeout, name, lookupDeferred)
    return userDeferred

So essentially socket.gethostbyname is used, which correctly reads the various resolv.conf(5) options and falls back after 1s to the second nameserver listed in resolv.conf if the first one is unreachable. I've confirmed the behavior on pybal-test2001.

Obviously, this is not what we have seen in production. Perhaps the issue only shows up with higher load / number of checks? Is the default thread pool size (10 I think) a problem here? Do we need to look into replacing this code with a non-threaded getHostByName implementation?

Meanwhile, following @faidon's idea of looking for infra-level approaches to alleviate the problem, here's a few options:

  • add options rotate to resolv.conf to reduce a bit the impact of one name server going down and to balance load among resolvers (wouldn't this be a good thing for all other hosts anyways?)
  • whenever a resolver is depooled for maintenance, remove it from $nameservers_override in manifests/site.pp
  • read $nameservers_override from etcd instead of hardcoding the list in site.pp

If we use rotate in resolv.conf, I guess we'd need to couple it with "options timeout:1", otherwise it wouldn't be very effeective with the default timeout of 5 seconds.

If we use rotate in resolv.conf, I guess we'd need to couple it with "options timeout:1", otherwise it wouldn't be very effeective with the default timeout of 5 seconds.

We already do: https://github.com/wikimedia/puppet/blob/production/modules/base/templates/resolv.conf.erb#L7

@mark's comment on a related issue (T103921#1417034) made me go dig a bit deeper into our IdleConnection monitor implementation. The version currently in prod tries to get a random IPv4 address before calling connectTCP. It should thus avoid a DNS lookup altogether. However, we also fallback to the hostname if we fail getting a random IP. I've patched idleconnection.py on pybal-test2001 to make it log any exceptions thrown there, and surprisingly we do consistently fail getting a random IP with 'set' object does not support indexing. Hence, it looks like we do resolve the hostname at every check.

An IPv6-related patch merged in master does change the behavior mentioned above, but it's not in prod yet. Note that the patch also fixes T82747.

Change 363188 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] Add IPv6 support to all monitors, improve IdleConnection logging

https://gerrit.wikimedia.org/r/363188

Change 363188 merged by Ema:
[operations/debs/pybal@master] Add IPv6 support to all monitors, improve IdleConnection logging

https://gerrit.wikimedia.org/r/363188

Change 363193 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.13] Add IPv6 support to all monitors, improve IdleConnection logging

https://gerrit.wikimedia.org/r/363193

Change 363193 merged by Ema:
[operations/debs/pybal@1.13] Add IPv6 support to all monitors, improve IdleConnection logging

https://gerrit.wikimedia.org/r/363193

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2017-07-04T14:57:00Z] <ema> pybal 1.13.7 uploaded to apt.w.o, testing it on pybal-test2001 T82747 T154759

Mentioned in SAL (#wikimedia-operations) [2017-07-12T16:06:20Z] <ema> lvs1006, lvs1010: upgrade pybal to 1.13.7 T82747 T154759

Mentioned in SAL (#wikimedia-operations) [2017-07-12T16:14:45Z] <ema> downgrade pybal to 1.13.6 on lvs1010 T82747 T154759 (1.13.7 throwing exceptions)

Mentioned in SAL (#wikimedia-operations) [2017-07-13T10:04:56Z] <ema> lvs[12]006: upgrade pybal to 1.13.9 T82747 T154759

Mentioned in SAL (#wikimedia-operations) [2017-07-17T08:17:15Z] <ema> lvs100[39]: upgrade pybal to 1.13.9 T82747 T154759

Mentioned in SAL (#wikimedia-operations) [2017-07-18T10:32:34Z] <ema> lvs100[45] upgrade pybal to 1.13.9 T82747 T154759

Mentioned in SAL (#wikimedia-operations) [2017-07-18T10:43:57Z] <ema> lvs100[12] upgrade pybal to 1.13.9 T82747 T154759

Mentioned in SAL (#wikimedia-operations) [2017-07-18T10:52:23Z] <ema> lvs400[34] upgrade pybal to 1.13.9 T82747 T154759

Mentioned in SAL (#wikimedia-operations) [2017-07-18T10:54:16Z] <ema> lvs400[12] upgrade pybal to 1.13.9 T82747 T154759

Mentioned in SAL (#wikimedia-operations) [2017-07-18T10:59:15Z] <ema> lvs200[45] upgrade pybal to 1.13.9 T82747 T154759

Mentioned in SAL (#wikimedia-operations) [2017-07-18T11:00:36Z] <ema> lvs200[12] upgrade pybal to 1.13.9 T82747 T154759

ema claimed this task.

All LVSs upgraded to pybal 1.13.9, which fixes this bug.