
check_dns needs to be rewritten
Closed, Resolved, Public

Description

As part of T124680 I noticed that check_dns (nagios-plugins-standard) always succeeds when checking resolution against the pdns server.

This check exists:

labs-ns0.wikimedia.org
Auth DNS for labs pdns	OK	2016-04-27 17:35:24	2d 1h 26m 6s	1/3	DNS OK: 0.064 seconds response time. nagiostest.eqiad.wmflabs returns

however:

dig nagiostest.eqiad.wmflabs @labs-ns0.wikimedia.org
->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5263

Our current check happens to query a hostname that does not exist, yet it still reports OK. I think one of the reasons the pdns failures in labs have been so difficult to understand is that our monitoring is faulty.


After much digging and confirmation with @BBlack that I'm not losing it:

  • /usr/lib/nagios/plugins/check_dns shells out to nslookup for actual resolution
execve("/usr/lib/nagios/plugins/check_dns", ["/usr/lib/nagios/plugins/check_dn"..., "-H", "foo.eqiad.wmflabs", "-s", "labs-ns0.wikimedia.org"], [/* 19 vars */]) = 0
[pid 32512] execve("/usr/bin/nslookup", ["/usr/bin/nslookup", "-sil", "foo.eqiad.wmflabs", "labs-ns0.wikimedia.org"], [/* 1 var */]) = 0

nslookup returns exit status 0 pretty much no matter what, as long as the command itself is valid. The exit status seems to have nothing to do with the contextual outcome of the lookup:

nslookup foo; echo $?
0

  • To work around this nslookup behavior, check_dns detects failures by matching a whole list of known error strings in the output. The pdns output for a record that is not found is worded just differently enough that none of them match, so the check always succeeds.

https://sourcecodebrowser.com/nagios-plugins/1.4.11/check__dns_8c.html

/usr/bin/nslookup -sil foo.eqiad.wmflabs labs-ns0.wikimedia.org; echo $?
Server:		labs-ns0.wikimedia.org
Address:	208.80.155.117#53

Non-authoritative answer:
*** Can't find foo.eqiad.wmflabs: No answer

0

vs

/usr/bin/nslookup -sil foo.wikimedia.org ns0.wikimedia.org; echo $?
Server:		ns0.wikimedia.org
Address:	208.80.154.238#53

** server can't find foo.wikimedia.org: NXDOMAIN

0

Excerpt from check_dns.c:

/* Connection was refused */
else if (strstr (input_buffer, "Connection refused") ||
         strstr (input_buffer, "Couldn't find server") ||
         strstr (input_buffer, "Refused") ||
         (strstr (input_buffer, "** server can't find") &&
          strstr (input_buffer, ": REFUSED")))
  die (STATE_CRITICAL, _("Connection to DNS %s was refused\n"), dns_server);

  • Any failures check_dns does detect are basically incidental: it only catches the failure modes whose output strings it already expects.
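
To make the failure mode concrete, here is a deliberately simplified Python illustration of string-matched failure detection. It is not the real check_dns logic: the `FAILURE_PATTERNS` list and the `classify` function are hypothetical, with needles loosely taken from the excerpt and the nslookup output quoted above. It only shows how an unanticipated wording falls through to OK.

```python
# Simplified illustration only: NOT the actual check_dns implementation.
# It mimics the idea of scanning nslookup's text output for known failure
# phrases and treating everything else as success.

# Hypothetical pattern list, loosely based on strings quoted in this task.
FAILURE_PATTERNS = [
    "Connection refused",
    "Couldn't find server",
    "Refused",
    "NXDOMAIN",
]


def classify(nslookup_output: str) -> str:
    """Return 'CRITICAL' if any known failure phrase appears, else 'OK'."""
    for needle in FAILURE_PATTERNS:
        if needle in nslookup_output:
            return "CRITICAL"
    # Nothing matched: the check assumes the lookup worked.
    return "OK"


# nslookup output for a missing name, as returned via ns0.wikimedia.org.
PROD_OUTPUT = (
    "Server:\t\tns0.wikimedia.org\n"
    "Address:\t208.80.154.238#53\n"
    "\n"
    "** server can't find foo.wikimedia.org: NXDOMAIN\n"
)

# nslookup output for a missing name, as returned via labs-ns0.wikimedia.org.
LABS_OUTPUT = (
    "Server:\t\tlabs-ns0.wikimedia.org\n"
    "Address:\t208.80.155.117#53\n"
    "\n"
    "Non-authoritative answer:\n"
    "*** Can't find foo.eqiad.wmflabs: No answer\n"
)

print(classify(PROD_OUTPUT))  # CRITICAL
print(classify(LABS_OUTPUT))  # OK, although the name does not exist
```

The second result is the bug in miniature: "No answer" is not in the list, so a failed lookup is reported as success.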

dig

dig nagiostest.eqiad.wmflabs @labs-ns0.wikimedia.org

; <<>> DiG 9.8.3-P1 <<>> nagiostest.eqiad.wmflabs @labs-ns0.wikimedia.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 27370
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

dig foo.wikimedia.org @ns0.wikimedia.org

; <<>> DiG 9.8.3-P1 <<>> foo.wikimedia.org @ns0.wikimedia.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 56236
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available

vs check_dns

/usr/lib/nagios/plugins/check_dns -H foo.wikimedia.org -s ns0.wikimedia.org -v
/usr/bin/nslookup -sil foo.wikimedia.org ns0.wikimedia.org
Server:		ns0.wikimedia.org
Address:	208.80.154.238#53

** server can't find foo.wikimedia.org: NXDOMAIN
Domain foo.wikimedia.org was not found by the server

/usr/lib/nagios/plugins/check_dns -H foo.eqiad.wmflabs -s labs-ns0.wikimedia.org -v
/usr/bin/nslookup -sil foo.eqiad.wmflabs labs-ns0.wikimedia.org
Server:		labs-ns0.wikimedia.org
Address:	208.80.155.117#53

Non-authoritative answer:
*** Can't find foo.eqiad.wmflabs: No answer

DNS OK: 0.024 seconds response time. foo.eqiad.wmflabs returns |time=0.023596s;;;0.000000
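
Unlike the nslookup text, the dig output shown earlier exposes the response code and the answer count in fixed header lines, which are easier to test than free-form messages. The following is a rough sketch of reading those two fields. The `parse_dig` function and both regular expressions are hypothetical helpers written for this illustration, and they assume the header layout printed by the dig version quoted in this task.

```python
import re

# Matches the status field of dig's ";; ->>HEADER<<-" line.
STATUS_RE = re.compile(r"status:\s*(?P<status>[A-Z]+),\s*id:\s*\d+")
# Matches the ANSWER count of dig's ";; flags:" line.
ANSWER_RE = re.compile(r"ANSWER:\s*(?P<answer>\d+)")


def parse_dig(output: str):
    """Return (status, answer_count) extracted from dig's header lines."""
    status = STATUS_RE.search(output)
    answer = ANSWER_RE.search(output)
    if status is None or answer is None:
        # No recognizable header: let the caller decide how to report it.
        raise ValueError("no DNS header found in dig output")
    return status.group("status"), int(answer.group("answer"))


# Header lines copied from the dig run against labs-ns0.wikimedia.org.
SAMPLE = (
    ";; Got answer:\n"
    ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 27370\n"
    ";; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0\n"
)

print(parse_dig(SAMPLE))  # ('NXDOMAIN', 0)
```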

Event Timeline

A small addendum in case someone else runs into it. I was initially confused by the difference in behavior here:

dig blah @labs-ns0.wikimedia.org

; <<>> DiG 9.8.3-P1 <<>> blah @labs-ns0.wikimedia.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43435
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;blah.				IN	A

;; Query time: 46 msec
;; SERVER: 208.80.155.117#53(208.80.155.117)
;; WHEN: Wed Apr 27 12:55:08 2016
;; MSG SIZE  rcvd: 22

vs.

dig blah @ns0.wikimedia.org

; <<>> DiG 9.8.3-P1 <<>> blah @ns0.wikimedia.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 42989
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;blah.				IN	A

;; Query time: 39 msec
;; SERVER: 208.80.154.238#53(208.80.154.238)
;; WHEN: Wed Apr 27 12:55:15 2016
;; MSG SIZE  rcvd: 22

i.e. NOERROR vs REFUSED

PowerDNS describes this behavior as expected: https://blog.powerdns.com/2015/03/02/from-noerror-to-refused/

In the short term it may make sense to just switch to /usr/lib/nagios/plugins/check_dig, which behaves sanely enough for our current needs, even though it also shells out to dig and uses a similar methodology (https://sourcecodebrowser.com/nagios-plugins/1.4.11/check__dig_8c.html). A check built around dnspython (http://www.dnspython.org/examples.html) would be nice.
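
As a rough idea of what a dnspython-based check could look like, here is a minimal sketch. It is not an existing plugin. The function names, the output format, and the policy "only NOERROR with at least one answer RRset is OK" are assumptions made for illustration. It also assumes the third-party dnspython package is installed and that only A records are checked.

```python
import socket
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nagios_state(rcode_text: str, answer_rrsets: int) -> int:
    """Assumed policy: only NOERROR with at least one answer RRset is OK.

    NXDOMAIN, REFUSED, SERVFAIL and NOERROR with an empty answer section
    all become CRITICAL, so the result never depends on message wording.
    """
    if rcode_text == "NOERROR" and answer_rrsets > 0:
        return OK
    return CRITICAL


def query(name: str, server: str, timeout: float = 5.0):
    """Send one A query to `server` and return (rcode text, answer RRsets)."""
    # dnspython is third-party; it is imported here so that the policy
    # function above stays usable (and testable) without it.
    import dns.message
    import dns.query
    import dns.rcode

    server_ip = socket.gethostbyname(server)  # dns.query.udp needs an IP
    request = dns.message.make_query(name, "A")
    response = dns.query.udp(request, server_ip, timeout=timeout)
    return dns.rcode.to_text(response.rcode()), len(response.answer)


def main(argv) -> int:
    name, server = argv[1], argv[2]
    try:
        rcode_text, answers = query(name, server)
    except ImportError as exc:
        print(f"DNS UNKNOWN: {exc}")
        return UNKNOWN
    except Exception as exc:  # timeouts, socket errors, malformed replies
        print(f"DNS CRITICAL: query to {server} failed: {exc}")
        return CRITICAL
    state = nagios_state(rcode_text, answers)
    label = "OK" if state == OK else "CRITICAL"
    print(f"DNS {label}: {name} @{server} rcode={rcode_text} answers={answers}")
    return state


# Only runs when invoked as: <script> NAME SERVER
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv))
```

Because the decision is made on the response code and answer count rather than on nslookup's text, both the NXDOMAIN and the empty NOERROR responses seen in this task would be reported as CRITICAL.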

Sticking the Traffic tag on because this affects monitoring of the production DNS authservers too, and that check_dns utility is awful to be relying on for monitoring something so critical.

ema claimed this task.
ema subscribed.

check_dns v1.5 (nagios-plugins 1.5) seems to be doing the right thing currently:

14:25:09 ema@labservices1001.wikimedia.org:~
$ /usr/lib/nagios/plugins/check_dns -H foo.eqiad.wmflabs -s labs-ns0.wikimedia.org -v
/usr/bin/nslookup -sil foo.eqiad.wmflabs labs-ns0.wikimedia.org
Server:		labs-ns0.wikimedia.org
Address:	208.80.155.117#53

** server can't find foo.eqiad.wmflabs: NXDOMAIN
Domain foo.eqiad.wmflabs was not found by the server
14:25:39 ema@labservices1001.wikimedia.org:~
$ echo $?
2
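
Exit status 2 is the Nagios CRITICAL state (0 is OK, 1 WARNING, 3 UNKNOWN). Since the original problem went unnoticed for so long, a small regression test that asserts the check fails for a name known not to exist might be worthwhile. The sketch below is only an illustration: `check_state` and `assert_fails_for_missing_name` are hypothetical helpers, and a stand-in command replaces the real plugin so the example runs anywhere.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes and their state names.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def check_state(command) -> str:
    """Run a plugin command and map its exit status to a Nagios state name."""
    result = subprocess.run(command, capture_output=True, text=True)
    return NAGIOS_STATES.get(result.returncode, "UNKNOWN")


def assert_fails_for_missing_name(command) -> None:
    """Raise if a check of a nonexistent name reports anything but CRITICAL."""
    state = check_state(command)
    if state != "CRITICAL":
        raise AssertionError(
            f"expected CRITICAL for a nonexistent name, got {state}"
        )


# Stand-in for the real plugin: a subprocess that simply exits with 2.
fake_plugin = [sys.executable, "-c", "import sys; sys.exit(2)"]
assert_fails_for_missing_name(fake_plugin)
print(check_state(fake_plugin))  # CRITICAL
```

On a host with the plugin installed, the command would be the one from the session above: `["/usr/lib/nagios/plugins/check_dns", "-H", "foo.eqiad.wmflabs", "-s", "labs-ns0.wikimedia.org"]`.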

Closing as agreed with @chasemp on IRC.