
dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN
Closed, Resolved (Public)

Description

This has been consistently happening for the oojs-core-npm gate job for several days now:

https://integration.wikimedia.org/ci/job/oojs-core-npm/449/console

00:01:06.099 Running "karma:ci1" (karma) task
00:01:06.108 INFO [karma]: Karma v0.12.31 server started at http://localhost:9876/
00:01:06.113 INFO [launcher]: Starting browser chrome on SauceLabs
00:01:06.115 INFO [launcher]: Starting browser firefox on SauceLabs
00:01:06.117 INFO [launcher]: Starting browser internet explorer 11 (Windows 7) on SauceLabs
00:01:08.166 ERROR [launcher.sauce]: Can not start chrome
00:01:08.589   Failed to start Sauce Connect:
00:01:08.589   Error: GET https://saucelabs.com/rest/v1/oojs/tunnels?full=1: Couldn't resolve host name.
00:01:08.589 ERROR [launcher.sauce]: Can not start firefox
00:01:08.590   Failed to start Sauce Connect:
00:01:08.590   Error: GET https://saucelabs.com/rest/v1/oojs/tunnels?full=1: Couldn't resolve host name.
00:01:08.590 ERROR [launcher.sauce]: Can not start internet explorer 11 (Windows 7)
00:01:08.591   Failed to start Sauce Connect:
00:01:08.591   Error: GET https://saucelabs.com/rest/v1/oojs/tunnels?full=1: Couldn't resolve host name.
00:01:08.591 Warning: Task "karma:ci1" failed. Use --force to continue.
00:01:08.591 
00:01:08.591 Aborted due to warnings.

I was able to consistently reproduce this when running grunt ci on latest master of oojs-core via jenkins-deploy@integration-slave1401:/mnt/jenkins-workspace/workspace/oojs-core-npm.

However it's working fine from localhost for @Jdforrester-WMF and myself.

Did something change recently in the firewall or DNS configuration of WMFLabs or Eqiad that might cause this?

It last worked on Feb 24. This first failed on March 6.

Event Timeline

Krinkle raised the priority of this task from to Unbreak Now!.
Krinkle updated the task description.
Krinkle added subscribers: Krinkle, Jdforrester-WMF.
Krinkle set Security to None.

related to the DNS work on labs i would suspect.

https://phabricator.wikimedia.org/T72076#1059041 or related

the timeframe you describe when it stopped working and the last comment/merge there roughly match

I don't think so because that was merged earlier.

But on March 6th https://gerrit.wikimedia.org/r/#/c/194858/ was merged; however a):

scfc@tools-login:~$ cat /etc/resolv.conf; host saucelabs.com
## THIS FILE IS MANAGED BY PUPPET
##
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain eqiad.wmflabs
search eqiad.wmflabs labs.eqiad.wmnet
options timeout:5 ndots:2
nameserver 10.68.16.1
saucelabs.com has address 162.222.73.28
saucelabs.com mail is handled by 5 ALT1.ASPMX.L.GOOGLE.com.
saucelabs.com mail is handled by 10 ASPMX3.GOOGLEMAIL.com.
saucelabs.com mail is handled by 1 ASPMX.L.GOOGLE.com.
saucelabs.com mail is handled by 5 ALT2.ASPMX.L.GOOGLE.com.
saucelabs.com mail is handled by 10 ASPMX2.GOOGLEMAIL.com.
scfc@tools-login:~$

and b) if that change had caused DNS failures, it would be very strange for them to be limited to saucelabs.com, and I haven't heard any more complaints about DNS failures in general since that change got merged.

The only net effect the change can make is that iff the fqdn has exactly one dot (as is the case here) then the local domains would be checked before (but not instead of) the DNS roots. It can't make a name that once resolved no longer do so.
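To make the ordering described above concrete, here is a minimal sketch of how a stub resolver might build its list of candidate names from ndots and the search list. The function name candidate_queries is hypothetical and this is a simplified model of the rule documented in resolv.conf(5), not the libresolv implementation.

```python
def candidate_queries(name, search, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order.

    Simplified model: a name with at least `ndots` dots is tried as-is
    first; otherwise the search domains are tried first and the name
    as-is comes last. A trailing dot marks the name as absolute.
    """
    if name.endswith("."):
        return [name]  # absolute name: the search list is skipped entirely
    with_search = [f"{name}.{domain}." for domain in search]
    as_is = name + "."
    if name.count(".") >= ndots:
        return [as_is] + with_search
    return with_search + [as_is]


# Search list taken from the resolv.conf shown in this task.
SEARCH = ["eqiad.wmflabs", "labs.eqiad.wmnet"]

for ndots in (1, 2):
    print(ndots, candidate_queries("saucelabs.com", SEARCH, ndots=ndots))
```

With ndots set to 2, saucelabs.com (one dot) is first tried as saucelabs.com.eqiad.wmflabs., while www.saucelabs.com (two dots) and saucelabs.com. (trailing dot) are tried as-is first.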

[16:03 CET] krinkle at KrinkleMac in ~
$ host saucelabs.com
saucelabs.com has address 162.222.73.28
saucelabs.com mail is handled by 10 ASPMX3.GOOGLEMAIL.com.
saucelabs.com mail is handled by 10 ASPMX2.GOOGLEMAIL.com.
saucelabs.com mail is handled by 5 ALT2.ASPMX.L.GOOGLE.com.
saucelabs.com mail is handled by 5 ALT1.ASPMX.L.GOOGLE.com.
saucelabs.com mail is handled by 1 ASPMX.L.GOOGLE.com.
$
[00:08 UTC] krinkle at integration-slave1401.eqiad.wmflabs in ~
$ cat /etc/resolv.conf
## THIS FILE IS MANAGED BY PUPPET
##
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain eqiad.wmflabs
search eqiad.wmflabs labs.eqiad.wmnet
options timeout:5 ndots:2
nameserver 10.68.16.1

$ host saucelabs.com
Host saucelabs.com.eqiad.wmflabs not found: 2(SERVFAIL)
1 $

I found the root cause to be this option in /etc/resolv.conf

options timeout:5 ndots:2

specifically it's the "ndots" thing. if i comment that out, it works:

 host saucelabs.com
saucelabs.com has address 162.222.73.28

in combination with

domain eqiad.wmflabs
search eqiad.wmflabs labs.eqiad.wmnet

it is the reason why it is appending eqiad.wmflabs.

This is a puppetized file though, so needs a patch.


snippet from man page of resolv.conf

options ndots:n below to avoid man-in-the-middle attacks and unnecessary traffic for the root-dns-servers. Note that this process may be slow and will generate a lot of network traffic if the servers for the listed domains are not local, and that queries will time out if no server is available for one of the domains.

another option: if you just replace saucelabs.com with www.saucelabs.com, it is the same IP and works without changes.

Change 196731 had a related patch set uploaded (by Dzahn):
don't use 'ndots: 2' in labs resolv.conf

https://gerrit.wikimedia.org/r/196731

ndots:2 is necessary for something else; the actual bug is that the dnsmasq server should emphatically not respond with SERVFAIL. An NXDOMAIN would allow resolution to proceed.

*headdesk* *mumble, mumble, dnsmasq*

Can you use www.saucelabs.com as a workaround? I'm going to bump up the priority of having a proper DNS server for Labs that is anything but dnsmasq.

The error doesn't seem to lie with dnsmasq. On tools-login, the look-up succeeds, on tools-trusty, it fails. So it seems to be related to some change in Ubuntu Trusty.

No, it's just that the Precise libresolv seems to be a little more forgiving and skips over the SERVFAIL - which just hides the actual issue that dnsmasq really shouldn't be returning a SERVFAIL at all in the first place.

To wit:

marc@tools-trusty:~$ host notexist
Host notexist.eqiad.wmflabs not found: 2(SERVFAIL)

Which is clearly incorrect, and happens because:

marc@tools-login:~$ dig @10.68.16.1 notexist.eqiad.wmflabs

; <<>> DiG 9.8.1-P1 <<>> @10.68.16.1 notexist.eqiad.wmflabs
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 2875
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
[...]

Which is completely broken.
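The difference between a resolver that skips over SERVFAIL and one that stops on it can be sketched with a toy model. Everything here is an assumption for illustration: resolve and make_server are hypothetical names, the fake server returns NXDOMAIN for the labs.eqiad.wmnet candidate, and the two policies only mirror the behaviour described in this thread, not the actual libresolv code of either Ubuntu release.

```python
def resolve(candidates, ask, stop_on_servfail):
    """Try each candidate in order; `ask` returns an address or a status string."""
    last_failure = "NXDOMAIN"
    for fqdn in candidates:
        result = ask(fqdn)
        if result == "SERVFAIL" and stop_on_servfail:
            return "SERVFAIL"      # strict policy: a server failure ends the search
        if result in ("SERVFAIL", "NXDOMAIN"):
            last_failure = result  # forgiving policy: move on to the next candidate
            continue
        return result              # got an address
    return last_failure


def make_server(unknown_local_status):
    """Fake DNS server; the status for unknown *.eqiad.wmflabs names is configurable."""
    def ask(fqdn):
        if fqdn == "saucelabs.com.":
            return "162.222.73.28"
        if fqdn.endswith(".eqiad.wmflabs."):
            return unknown_local_status
        return "NXDOMAIN"          # assumption: every other unknown name
    return ask


# Candidate order for saucelabs.com with ndots:2 and the search list from this task.
CANDIDATES = [
    "saucelabs.com.eqiad.wmflabs.",
    "saucelabs.com.labs.eqiad.wmnet.",
    "saucelabs.com.",
]

print(resolve(CANDIDATES, make_server("SERVFAIL"), stop_on_servfail=True))
print(resolve(CANDIDATES, make_server("SERVFAIL"), stop_on_servfail=False))
print(resolve(CANDIDATES, make_server("NXDOMAIN"), stop_on_servfail=True))
```

In this model only the first combination fails: a server that answers SERVFAIL for unknown local names, queried by a client that stops on SERVFAIL. A server that answers NXDOMAIN works with either client policy.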

I suspect this error got introduced when I switched over the CI pool from the Trusty instances we created October/December 2014 to the re-created ones from this month.

@coren: But there you are querying the Labs server, and (I think) dnsmasq just passes the request upstream:

scfc@tools-login:~$ dig @10.68.16.1 notexist.eqiad.wmflabs

; <<>> DiG 9.8.1-P1 <<>> @10.68.16.1 notexist.eqiad.wmflabs
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1760
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;notexist.eqiad.wmflabs.                IN      A

;; Query time: 1447 msec
;; SERVER: 10.68.16.1#53(10.68.16.1)
;; WHEN: Sat Mar 14 01:33:57 2015
;; MSG SIZE  rcvd: 40

scfc@tools-login:~$ dig @10.68.16.1 notexist.eqiad.wmflabs.de

; <<>> DiG 9.8.1-P1 <<>> @10.68.16.1 notexist.eqiad.wmflabs.de
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 54424
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;notexist.eqiad.wmflabs.de.     IN      A

;; AUTHORITY SECTION:
de.                     3600    IN      SOA     f.nic.de. its.denic.de. 2015031405 7200 7200 3600000 7200

;; Query time: 2 msec
;; SERVER: 10.68.16.1#53(10.68.16.1)
;; WHEN: Sat Mar 14 01:34:09 2015
;; MSG SIZE  rcvd: 95

scfc@tools-login:~$

So (from a distance) it appears as if the WMF nameserver that dnsmasq is referring to for .wmflabs is returning SERVFAIL.

And:

scfc@tools-login:~$ dig @10.68.16.1 tools-login.eqiad.wmflabs

; <<>> DiG 9.8.1-P1 <<>> @10.68.16.1 tools-login.eqiad.wmflabs
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37165
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;tools-login.eqiad.wmflabs.     IN      A

;; ANSWER SECTION:
tools-login.eqiad.wmflabs. 300  IN      A       10.68.16.7

;; Query time: 0 msec
;; SERVER: 10.68.16.1#53(10.68.16.1)
;; WHEN: Sat Mar 14 01:37:41 2015
;; MSG SIZE  rcvd: 59

scfc@tools-login:~$

Could it be that .eqiad.wmflabs is configured to query LDAP/OpenStack, and if it is passed a name with a dot in it, that causes a failure => SERVFAIL?

(Or a host name that does not exist.)
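For reference, the status values in the dig output above (and the number in host's "2(SERVFAIL)") come from the RCODE field in the low four bits of the DNS header flags, as defined in RFC 1035. The following is a minimal sketch of decoding that 12-byte header; decode_header is a hypothetical helper, and the sample headers are constructed by hand to match the flags shown above rather than captured from the wire.

```python
import struct

# RCODE values from RFC 1035, section 4.1.1.
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def decode_header(header):
    """Decode the fixed 12-byte DNS header into the fields dig prints."""
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", header[:12])
    rcode = flags & 0x000F
    return {
        "id": ident,
        "qr": bool(flags & 0x8000),  # response
        "aa": bool(flags & 0x0400),  # authoritative answer
        "rd": bool(flags & 0x0100),  # recursion desired
        "ra": bool(flags & 0x0080),  # recursion available
        "status": RCODES.get(rcode, "RCODE%d" % rcode),
        "answers": ancount,
    }


# Hand-built headers: "qr rd ra" with SERVFAIL, and "qr aa rd ra" with NOERROR.
servfail = struct.pack("!6H", 2875, 0x8182, 1, 0, 0, 0)
noerror = struct.pack("!6H", 37165, 0x8580, 1, 1, 0, 0)

print(decode_header(servfail)["status"])  # SERVFAIL
print(decode_header(noerror)["aa"])       # True
```

Note that the SERVFAIL replies shown above carry no aa flag, while the successful reply for tools-login.eqiad.wmflabs does.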

So (from a distance) it appears as if the WMF nameserver that dnsmasq is referring to for .wmflabs is returning SERVFAIL.

It's supposed to be authoritative for it. :-)

http://www.linuxquestions.org/questions/linux-networking-3/powerdns-servfail-945615/ (NB: MySQL) suggests that an SOA record in LDAP may be missing; AFAICS, default-soa-name is set to labs-ns0.wikimedia.org.

about workarounds:

/etc/nsswitch.conf says: hosts: files dns
so it first checks files and then dns

if we add saucelabs.com to /etc/hosts:

162.222.73.28 saucelabs.com

that will not fix it with the host command or dig because they will ask DNS directly anyways, but anything using gethostbyname() should use the entry from the file. would that unbreak the oojs-core-npm gate job for now?

getent ahosts saucelabs.com
162.222.73.28   STREAM saucelabs.com
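The "files, then dns" order can be illustrated with a small sketch. The names parse_hosts and lookup are hypothetical, the hosts content is passed in as a string instead of reading /etc/hosts, and the DNS step is a stand-in callable, so this only models the ordering and is not how NSS is implemented.

```python
def parse_hosts(text):
    """Parse hosts-file text into a {name: address} table (first entry wins)."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), address)
    return table


def lookup(name, hosts_text, dns_lookup):
    """Model of 'hosts: files dns': consult the hosts table first, then DNS."""
    address = parse_hosts(hosts_text).get(name.lower())
    if address is not None:
        return address, "files"
    return dns_lookup(name), "dns"


HOSTS = """\
127.0.0.1 localhost
162.222.73.28 saucelabs.com  # temporary workaround
"""


def broken_dns(name):
    # Stand-in for the failing resolver path described in this task.
    return "SERVFAIL"


print(lookup("saucelabs.com", HOSTS, broken_dns))
print(lookup("example.org", HOSTS, broken_dns))
```

A name listed in the hosts text never reaches the DNS step, which is why the entry helps callers that go through the system resolver but not tools such as host or dig that query DNS directly.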

what also works: adding a trailing dot:

root@integration-slave1401:~# host saucelabs.com.
saucelabs.com has address 162.222.73.28

but i guess that is the same problem as with adding www.

Change 196775 had a related patch set uploaded (by Yuvipanda):
labs: set resolf.conf ndots to 1

https://gerrit.wikimedia.org/r/196775

Change 196775 abandoned by Yuvipanda:
labs: set resolf.conf ndots to 1

https://gerrit.wikimedia.org/r/196775

Has someone looked at whether there is an SOA record in LDAP? If that is the source of the problem, fixing it would be much easier than working around it with resolv.conf.

coren lowered the priority of this task from Unbreak Now! to Medium. Mar 16 2015, 1:42 PM

The issue has been worked around on the CI side, but the underlying issue with dnsmasq giving incorrect answers remains. Changing priority and editing ticket to match.

coren renamed this task from Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." to dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN. Mar 16 2015, 1:43 PM

Interestingly, dig saucelabs.com on tools-trusty works fine.

@scfc The SOA records are there; though it's not immediately clear that they work properly either way.

@yuvipanda: That's not so much "interesting" as "expected". Dig ignores the search order so it never asks dnsmasq for a name that doesn't exist and never gets SERVFAIL. :-)

So what does dig notexist.eqiad.wmflabs return on the server where dnsmasq is running?

If I query labs-ns0/labs-ns1 externally, they return NXDOMAIN, e. g.:

[tim@passepartout ~]$ dig @labs-ns0.wikimedia.org notexist.eqiad.wmflabs

; <<>> DiG 9.9.3-rl.13207.22-P2-RedHat-9.9.3-16.P2.fc19 <<>> @labs-ns0.wikimedia.org notexist.eqiad.wmflabs
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 6277
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;notexist.eqiad.wmflabs.                IN      A

;; AUTHORITY SECTION:
eqiad.wmflabs.          3600    IN      SOA     virt1000.wikimedia.org. hostmaster.wikimedia.org. 1426372504 1800 3600 86400 7200

;; Query time: 148 msec
;; SERVER: 208.80.154.19#53(208.80.154.19)
;; WHEN: Mo Mär 16 14:49:39 UTC 2015
;; MSG SIZE  rcvd: 109

[tim@passepartout ~]$

Change 196731 abandoned by Dzahn:
don't use 'ndots: 2' in labs resolv.conf

https://gerrit.wikimedia.org/r/196731

This has been worked around in beta, and the new DNS server (see T87280) will make bugs in dnsmasq irrelevant.

scfc changed the task status from Resolved to Declined. Mar 26 2015, 11:26 PM

That may be, but this is certainly not resolved, and whether it is a bug in dnsmasq is unclear at least to me (and if, it should be filed upstream, etc.).

That may still be an option, once we have at least one that actually works right. :-)

https://integration.wikimedia.org/ci/job/npm/2590/console

00:01:18.606 ERROR [launcher.sauce]: Can not start chrome
00:01:18.606   Failed to start Sauce Connect:
00:01:18.606   14 Apr 15:16:40 - Error: GET https://saucelabs.com/rest/v1/oojs/tunnels?full=1: Couldn't resolve host name.
00:01:18.607 14 Apr 15:16:40 - Goodbye.

Change 196731 restored by Hashar:
don't use 'ndots: 2' in labs resolv.conf

https://gerrit.wikimedia.org/r/196731

On CI we do DNS requests for public DNS entries, so we apparently need to remove the ndots:2 option. Hence @Krinkle reapplied https://gerrit.wikimedia.org/r/196731 on the integration puppet master.

scfc changed the task status from Resolved to Declined. Jun 5 2015, 1:52 AM

(AFAIUI, the underlying issue has not been researched or resolved.)

AFAICT, this problem solved itself (as expected) since we switched to a properly functioning DNS server.

coren changed the task status from Declined to Resolved. Jul 8 2015, 7:55 PM

Indeed it has:

marc@tools-bastion-01:~$ host notexist
Host notexist not found: 3(NXDOMAIN)

and

marc@tools-bastion-01:~$ host saucelabs.com
saucelabs.com has address 162.222.73.28
[...]

which is the correct behaviour (and ndots:2 is on)

Too bad for the readers Google will bring here in the future: Nearly four months of investigation, no explanation why dnsmasq (allegedly) misbehaves, no link to a bug report upstream, just "use a properly functioning DNS server" as take-away. But probably worse for the developers whose free software we were using successfully for so long.

Change 196731 abandoned by Dzahn:
base: Don't use 'ndots: 2' in labs resolv.conf

https://gerrit.wikimedia.org/r/196731

We still have the Gerrit patch applied on the integration labs project. Filed T105297 to get it removed and thus re-enable ndots: 2.

Too bad for the readers Google will bring here in the future: Nearly four months of investigation, no explanation why dnsmasq (allegedly) misbehaves, no link to a bug report upstream, just "use a properly functioning DNS server" as take-away. But probably worse for the developers whose free software we were using successfully for so long.

Your comment is not helping anyone. We long wanted to drop dnsmasq because it has a bunch of other issues, and for the purpose of this bug the workaround was quite easy: just remove ndots:2.

Overall this issue and others prompted ops to finally migrate out of dnsmasq to a better DNS system/architecture for labs. So that is a net win for us. I don't think it was worth our time to investigate a software we already planned to drop entirely.

I wouldn't mind if ops hadn't investigated the issue. But doing so, claiming that dnsmasq is at fault, but not offering any explanation or chance for developers to fix a (perceived) error is slanderous to me.

So I do hope that my comment is helpful so that for example when someone finds an (alleged) error in one of your scripts and just deletes it with the comment "@hashar wrote it", you know that outside the bubble you won't be harassed.

But doing so, claiming that dnsmasq is at fault, but not offering any explanation or chance for developers to fix a (perceived) error is slanderous to me.

We did enough debugging to determine without a doubt that dnsmasq returned SERVFAIL rather than NXDOMAIN for entries it did not know of in a domain it claimed to be authoritative for. This is incorrect behaviour.

We did not investigate the cause of that incorrect behaviour because dnsmasq was already causing a number of other (unrelated) issues, and it was known that it would be discarded and replaced in the short term. Spending time debugging a system you know is on its way out is, at best, futile when a workaround exists.