Page MenuHomePhabricator

Internal DNS look-ups fail every once in a while
Closed, ResolvedPublic

Description

DNS look-ups from Labs instances for the IP addresses of Labs instances fail every once in a while (all times UTC):

tools-webproxy.eqiad.wmflabs : Aug 24 20:09:33 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
tools-webproxy.eqiad.wmflabs : Aug 25 18:23:00 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
tools-webproxy.eqiad.wmflabs : Aug 25 23:28:06 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
Date: Sat, 23 Aug 2014 20:27:11 +0000 (3 days, 5 hours ago)
error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name
Date: Tue, 26 Aug 2014 06:39:11 +0000 (19 hours, 30 minutes ago)
error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name

Version: unspecified
Severity: normal

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:30 AM
bzimport added a project: Cloud-VPS.
bzimport set Reference to bz70076.

This happens several times each day during the beta-scap-eqiad Jenkins job [0] as well:

ssh: Could not resolve hostname deployment-mediawiki02.eqiad.wmflabs: Name or service not known

Ths failing host name varies from run to run and generally works if the job is re-run immediately upon notification of failure.

[0]: https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/

Today the frequency of those occurences has increased quite a bit at Tools:

tools-webproxy.eqiad.wmflabs : Aug 27 00:03:35 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
tools-webproxy.eqiad.wmflabs : Aug 27 01:25:37 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
tools-webproxy.eqiad.wmflabs : Aug 27 02:25:43 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
tools-webproxy.eqiad.wmflabs : Aug 27 03:40:40 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
tools-webproxy.eqiad.wmflabs : Aug 27 06:24:43 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
tools-webproxy.eqiad.wmflabs : Aug 27 11:39:49 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
tools-webproxy.eqiad.wmflabs : Aug 27 14:25:02 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs
Date: Wed, 27 Aug 2014 07:05:11 +0000 (8 hours, 15 minutes ago)
error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name

I don't remember consciously that hosts besides tools-master and tools-webproxy were affected in the past in a major way, and with tools-webproxy probably doing a lot of lookups for the log files, my assumption is that the DNS server/OpenStack's/Ubuntu's network layer has some throttling in place per client host.

It may be worth a test to install a caching dnsmasq locally to see if that solves the problem. In that case, non-cached queries need to be forwarded to the Labs DNS server so that the special rewrites in openstack::network-service are honoured.

dnsmasq is a [bleep]ing piece of unreliable [bleep] that crumbles under the lightest load. I've been meaning to have a real DNS server in labs for a while now, and this increase in failures just bumped that up in priority.

I ran "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all Tools instances to get an idea of the order of magnitude between tools-master, tools-webproxy and the rest. To display:

scfc@tools-login:~$ sudo iptables -nvxL
Chain INPUT (policy ACCEPT 20004 packets, 10193318 bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 19480 packets, 4383218 bytes)
pkts bytes target prot opt in out source destination
139 9403 udp -- * * 0.0.0.0/0 10.68.16.1 udp dpt:53
scfc@tools-login:~$

Replacing dnsmasq is... more complicated than reasonable because of the way it's being invoked and managed by Openstack.

As a first pass, I'm going to enable local caching of name resolution; I expect this will lighten the load on it by an order of magnitude or more and will make resolution more robust even if it does falter.

Change 157816 had a related patch set uploaded by coren:
Labs: provide saner nscd defaults

https://gerrit.wikimedia.org/r/157816

Change 157816 merged by coren:
Labs: provide saner nscd defaults

https://gerrit.wikimedia.org/r/157816

I reset the counters at 16:15Z because after the merge data from before and after is hard to compare :-).

I suggest that we revisit this after Thursday (2014-09-04); if the error mails have stopped then (or significantly decreased), I would consider this issue fixed.

I haven't seen any errors since Tuesday morning, so the change of the nscd configuration seems to have fixed the issue.

(In reply to jeremyb from comment #10)

Just recurred

To be precise:

[...]
Date: Sat, 06 Sep 2014 19:02:02 +0000 (50 minutes, 59 seconds ago)
error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name
[...]

If that would remain the only occurence, I still would consider this fixed.

We've seen frequent errors with the beta-scap-eqiad again in the last week or two that look to be related to DNS failures for internal hosts:

22:05:23 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', 'deployment-rsync01.eqiad.wmflabs']
  on deployment-videoscaler01.eqiad.wmflabs returned [255]: 
  ssh: Could not resolve hostname deployment-videoscaler01.eqiad.wmflabs: Temporary failure in name resolution

I looked at Coren original patch for labs nscd.conf https://gerrit.wikimedia.org/r/#/c/157816/1/modules/ldap/files/nscd-labs.conf

On deployment-bastion we have:

enable-cache          hosts    yes
positive-time-to-live hosts    3600
negative-time-to-live hosts    30

suggested-size        hosts    983

check-files           hosts    yes
persistent            hosts    yes
shared                hosts    yes
max-db-size           hosts    33554432

So as I understand it, successful requests are cached for 3600 seconds (1 hour) and the hash cache can holds 983 entries up to ~ 33MBytes

The statistics for the hosts cache can be obtained via sudo nscd --statistics and yields:

hosts cache:

            yes  cache is enabled
            yes  cache is persistent
            yes  cache is shared
            983  suggested size
        1006592  total data pool size
           2384  used data pool size
           3600  seconds time to live for positive entries
             30  seconds time to live for negative entries
         456908  cache hits on positive entries
              1  cache hits on negative entries
         125451  cache misses on positive entries
          18320  cache misses on negative entries
             76% cache hit rate
             15  current number of cached values
             29  maximum number of cached values
              5  maximum chain length searched
              8  number of delays on rdlock
             45  number of delays on wrlock
              0  memory allocations failed
            yes  check /etc/hosts for changes

I am not sure how to interprets them though. There is a lot of negative entries for sure.

Seems to be failing some more - lots of transient puppet failures because of that.

Those stats surprise me a little, as I would not have expected DNS queries to be that varied - if I had to venture a guess I would suspect reverse queries for hostnames of connecting clients is the source of most of those, and that may have cache pollution effects.

This means that, in practice, increasing the size of the cache is not going to improve things.

The cache hit rate of ~75% is pretty good, and if representative it means we reduced the load on the DNS subsystem by that much - which explains why things got better for a long time but are now degrading again (labs use has nearly doubled since that patch).

They're failing a *lot* more now - so much puppet failure spam, and intermittent issues with tools that can't resolve domains (IRC reports). Bumping priority.

Change 183544 had a related patch set uploaded (by Yuvipanda):
labs: Make dnsmasq return non-zero TTL for lookups

https://gerrit.wikimedia.org/r/183544

Patch-For-Review

Change 183544 merged by Yuvipanda:
labs: Make dnsmasq return non-zero TTL for lookups

https://gerrit.wikimedia.org/r/183544

dnsmasq was overloaded because it was setting TTL to 0 for all responses, which sounds rather terrible and stupid. Have set it to 300s now, and let's see how this goes.

Indeed, TTL of 0 lies deep in the middle of batsh*t crazy territory - even single requests can be duplicated needlessly (because glue records).

There will be some failures for the next hour or so - if local nscd has cached an nxdomain for something in the few seconds the dns server was down. You can wait for the cache to expire, or invalidate it with

sudo nscd --invalidate=hosts

And try again.

yuvipanda raised the priority of this task from High to Unbreak Now!.Jan 26 2015, 6:10 PM

Seems to be acting up again

I increased the limit for connection tracking with sysctl -w net.netfilter.nf_conntrack_max=131072 (doubling the value). That seems to have stabilized somewhere around 86066 and reduced the DNS failures as well.

The hand tuning has made things fairly stable. Still need to puppetize, though. @akosiaris can you help? Need to set notrack on the DNS ports on a machine without ferm...

yuvipanda lowered the priority of this task from Unbreak Now! to Medium.Feb 1 2015, 3:41 AM

Update from @akosiaris - he said he will try to get to this after Tuesday, which is today!

The TTL issue seems to have fixed the DNS response for a while now.

Since last Friday, it is flapping against causing multiple issue on beta cluster and continuous integration projects. I am assuming it is related to virt1000 being overloaded by too many services. I could not find the task that ask to redistribute services out of virt1000, should be made a blocker of this DNS issue and acted promptly imho.

coren raised the priority of this task from Medium to Unbreak Now!.Feb 11 2015, 2:12 PM

This has gotten to "serious showstopper" level and I'm going to drop what I'm doing to fix this.

coren edited projects, added Cloud-Services; removed Patch-For-Review.
coren moved this task from Triage to In Progress on the Cloud-Services board.

Just an update on the sysctl side. I worked with @yuvipanda to puppetize a firewall rule for NOTRACK for DNS so that the net.netfilter.nf_conntrack_max solution is not needed. Unfortunately seems like nova-network on a network node (labnet1001) does not allow us to emded custom firewall rules which makes it impossible to do that. So the fallback scenario now is to puppetize @yuvipanda's sysctl change, assuming of course it has indeed fixed things which may not be the case.

I've managed to reduce the pressure on dnsmasq enough to unbreak the immediate major fail, but there are still issues (there was a bounce storm on the toolserver.org MX caused by a joe job; I've tightened the rules a bit to help)

@coren, I am not sure I follow, how did you reduce the pressure on dnsmasq ? And which rules did you tighten to account for the toolserver.org MX joe job ?

A huge pile of incoming bounces were stuck in the queue waiting to be forwarded to mostly bad email addresses alongside genuine toollab users, making exim do a lot of bad resolution (mostly in .tw) - I've flushed those bounces from the queue and closed off the overly permissive default exim4.minimal.erb

Inspection of the networking stack's live state show that @yuvipanda's tweak to the size of connection tracking helps a lot for the typical case, but remains vulnerable to bursts of DNS requests to highly variable domains (as this causes many more "connections" to be established) - it's likely that general network activity also exacerbates the issue to some degree.

This also means that adding a distinct recursor may mitigate the issue, but won't fix it entirely either since traffic to the recursor necessarily must go through labnet1001 as well.

Since absolutely all traffic from labs creates a conntrack entry on labnet1001, this is a problem that will recur and get more frequent as total usage increases. I'm going to experimentally increase the size of the conntrack table by another order of magnitude and create a monitoring rule for it so that we can keep an eye on it.

gerritbot added a subscriber: gerritbot.

Change 190214 had a related patch set uploaded (by coren):
Labs: Increase labnet1001 conntrack tables

https://gerrit.wikimedia.org/r/190214

Patch-For-Review

coren lowered the priority of this task from Unbreak Now! to High.Feb 12 2015, 3:23 PM

(The above patch is hotfixed on labnet1001)

This stabilizes things, and DNS resolves properly in Labs at this time. Still needs monitoring, and a careful eye on future behaviour but is no longer a panic point.

Slightly related, whenever OpenStack's dnsmasq is manually restarted, there is a short interval where various stuff complains because the sole DNS server isn't there. So for this alone it would be nice to set up a second, caching/mirroring DNS server that could be added to /etc/resolv.conf (perhaps the order determined by the oddness of $ipaddress) and then would offer the opportunity to shut down one server for maintenance without an adrenalin rush.

Buf if the underlying issue is in networking, I assume that needs to be resolved first.

I guess we want a task to request a second DNS resolver in the wmflabs :]

Done with T90234, here as a blocked task on the assumption that this task is about fixing the networking bottleneck.

Change 190214 merged by coren:
Labs: Increase labnet1001 conntrack tables

https://gerrit.wikimedia.org/r/190214

coren removed coren as the assignee of this task.Mar 26 2015, 1:45 PM

could this have caused T92351 ?

No, that's dnsmasq being a bad DNS server in general.

Either way, @Andrew's work on Designate is about to solve both problems for good.

coren claimed this task.

That should no longer be an issue as the DNS server has been replaced and appears to be working correctly. Reopen if issues recur.