As part of T124680 I noticed that check_dns (nagios-plugins-standard) always succeeds when checking resolution against the pdns server.
labs-ns0.wikimedia.org
Auth DNS for labs pdns OK 2016-04-27 17:35:24 2d 1h 26m 6s 1/3 DNS OK: 0.064 seconds response time. nagiostest.eqiad.wmflabs returns
however:
dig nagiostest.eqiad.wmflabs @labs-ns0.wikimedia.org
->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5263
By coincidence, our current check looks up a hostname that does not exist. I think one of the reasons the pdns failures in labs have been so difficult to understand is that our monitoring is faulty.
After much digging and confirmation with @BBlack that I'm not losing it:
- /usr/lib/nagios/plugins/check_dns shells out to nslookup for actual resolution
execve("/usr/lib/nagios/plugins/check_dns", ["/usr/lib/nagios/plugins/check_dn"..., "-H", "foo.eqiad.wmflabs", "-s", "labs-ns0.wikimedia.org"], [/* 19 vars */]) = 0
[pid 32512] execve("/usr/bin/nslookup", ["/usr/bin/nslookup", "-sil", "foo.eqiad.wmflabs", "labs-ns0.wikimedia.org"], [/* 1 var */]) = 0
and nslookup exits 0 pretty much no matter what, as long as the command itself is valid. The exit code seems to have nothing to do with the contextual outcome of the lookup.
nslookup foo; echo $?
0
- To counter this nslookup behavior, check_dns detects failures by string matching the output against a whole list of known error messages. The pdns output for a record that is not found differs just enough to match none of them, so the check always succeeds.
https://sourcecodebrowser.com/nagios-plugins/1.4.11/check__dns_8c.html
/usr/bin/nslookup -sil foo.eqiad.wmflabs labs-ns0.wikimedia.org; echo $?
Server: labs-ns0.wikimedia.org
Address: 208.80.155.117#53
Non-authoritative answer:
*** Can't find foo.eqiad.wmflabs: No answer
0
vs
/usr/bin/nslookup -sil foo.wikimedia.org ns0.wikimedia.org; echo $?
Server: ns0.wikimedia.org
Address: 208.80.154.238#53
** server can't find foo.wikimedia.org: NXDOMAIN
0
check_dns.c clip
/* Connection was refused */
else if (strstr (input_buffer, "Connection refused") ||
         strstr (input_buffer, "Couldn't find server") ||
         strstr (input_buffer, "Refused") ||
         (strstr (input_buffer, "** server can't find") &&
          strstr (input_buffer, ": REFUSED")))
  die (STATE_CRITICAL, _("Connection to DNS %s was refused\n"), dns_server);
- Any failure check_dns does report is basically incidental: it is caught only because the output happens to match one of the expected failure strings.
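To make the fall-through concrete, here is a rough Python model of that kind of string matching. It is a sketch, not the plugin's code: the "refused" patterns are taken from the clip above, the NXDOMAIN pattern is an assumption inferred from the verbose check_dns output, and the function name `classify` is made up for illustration. The real plugin has more patterns and also parses addresses.

```python
# Simplified, hypothetical model of check_dns-style output matching.
# Only the "refused" patterns come from the check_dns.c clip; the NXDOMAIN
# pattern is an assumption, and the real plugin does more than this.

def classify(nslookup_output: str) -> str:
    """Return a Nagios-style state for a block of nslookup output."""
    text = nslookup_output
    if (
        "Connection refused" in text
        or "Couldn't find server" in text
        or "Refused" in text
        or ("** server can't find" in text and ": REFUSED" in text)
    ):
        return "CRITICAL: connection refused"
    if "** server can't find" in text and ": NXDOMAIN" in text:
        return "CRITICAL: domain not found"
    # No known failure string matched, so the output counts as a success.
    return "OK"


# nslookup output as reported in this task for the two servers.
AUTH_OUTPUT = (
    "Server: ns0.wikimedia.org\n"
    "Address: 208.80.154.238#53\n"
    "** server can't find foo.wikimedia.org: NXDOMAIN\n"
)
PDNS_OUTPUT = (
    "Server: labs-ns0.wikimedia.org\n"
    "Address: 208.80.155.117#53\n"
    "Non-authoritative answer:\n"
    "*** Can't find foo.eqiad.wmflabs: No answer\n"
)

if __name__ == "__main__":
    print("ns0:     ", classify(AUTH_OUTPUT))
    print("labs-ns0:", classify(PDNS_OUTPUT))
```

The pdns message starts with `*** Can't find` rather than `** server can't find`, so in this model nothing matches and the result falls through to OK.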
dig
dig nagiostest.eqiad.wmflabs @labs-ns0.wikimedia.org
; <<>> DiG 9.8.3-P1 <<>> nagiostest.eqiad.wmflabs @labs-ns0.wikimedia.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 27370
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

dig foo.wikimedia.org @ns0.wikimedia.org
; <<>> DiG 9.8.3-P1 <<>> foo.wikimedia.org @ns0.wikimedia.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 56236
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available
vs check_dns
/usr/lib/nagios/plugins/check_dns -H foo.wikimedia.org -s ns0.wikimedia.org -v
/usr/bin/nslookup -sil foo.wikimedia.org ns0.wikimedia.org
Server: ns0.wikimedia.org
Address: 208.80.154.238#53
** server can't find foo.wikimedia.org: NXDOMAIN
Domain foo.wikimedia.org was not found by the server

/usr/lib/nagios/plugins/check_dns -H foo.eqiad.wmflabs -s labs-ns0.wikimedia.org -v
/usr/bin/nslookup -sil foo.eqiad.wmflabs labs-ns0.wikimedia.org
Server: labs-ns0.wikimedia.org
Address: 208.80.155.117#53
Non-authoritative answer:
*** Can't find foo.eqiad.wmflabs: No answer
DNS OK: 0.024 seconds response time. foo.eqiad.wmflabs returns |time=0.023596s;;;0.000000
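Since dig reports `status: NXDOMAIN` for both servers, one possible direction is a check keyed on the response header rather than on nslookup's wording. The following is only a sketch under assumptions: it parses dig's text output as pasted above, the function name `check_dig_output` and the messages are hypothetical, and a real check would more likely inspect the response code through a DNS library than scrape text.

```python
import re
from typing import Tuple

# Header fields as they appear in the dig output quoted in this task.
STATUS_RE = re.compile(r"status:\s*([A-Z]+)")
ANSWER_RE = re.compile(r"ANSWER:\s*(\d+)")

# Conventional Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def check_dig_output(dig_output: str) -> Tuple[int, str]:
    """Map dig's header status and answer count to a Nagios state (sketch)."""
    status = STATUS_RE.search(dig_output)
    answers = ANSWER_RE.search(dig_output)
    if status is None or answers is None:
        return UNKNOWN, "DNS UNKNOWN: could not parse dig header"
    rcode = status.group(1)
    count = int(answers.group(1))
    # Succeed only when the server answered without error and returned records.
    if rcode == "NOERROR" and count > 0:
        return OK, f"DNS OK: {count} answer(s)"
    return CRITICAL, f"DNS CRITICAL: status {rcode}, {count} answer(s)"


# Header lines copied from the two dig runs above.
LABS_DIG = (
    ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 27370\n"
    ";; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0\n"
)
AUTH_DIG = (
    ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 56236\n"
    ";; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0\n"
)

if __name__ == "__main__":
    print(check_dig_output(LABS_DIG))
    print(check_dig_output(AUTH_DIG))
```

With this approach both servers are treated the same way for a missing record, because the decision no longer depends on which error string the resolver client prints.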