Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs
Open, NormalPublic

Description

https://gerrit.wikimedia.org/r/327734 and https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&type=revision&diff=1168635&oldid=1125290 added some ferm rules which appear on 12 deployment-prep boxes (deployment-imagescaler[01-02].deployment-prep.eqiad.wmflabs,deployment-jobrunner02.deployment-prep.eqiad.wmflabs,deployment-mediawiki[04-07].deployment-prep.eqiad.wmflabs,deployment-mira.deployment-prep.eqiad.wmflabs,deployment-snapshot01.deployment-prep.eqiad.wmflabs,deployment-tin.deployment-prep.eqiad.wmflabs,deployment-tmh01.deployment-prep.eqiad.wmflabs,deployment-videoscaler01.deployment-prep.eqiad.wmflabs):

10_prometheus-apache_exporter:&R_SERVICE(tcp, 9117, (@resolve((deployment-prometheus01.deployment-prep.eqiad.wmflabs)) @resolve((deployment-prometheus01.deployment-prep.eqiad.wmflabs), AAAA)));
10_prometheus-hhvm-exporter:&R_SERVICE(tcp, 9192, (@resolve((deployment-prometheus01.deployment-prep.eqiad.wmflabs)) @resolve((deployment-prometheus01.deployment-prep.eqiad.wmflabs), AAAA)));

Which all seems fine, but puppet showed ferm failing to start. Indeed it fails manually too:

root@deployment-mediawiki06:~# /usr/sbin/ferm /etc/ferm/ferm.conf
Error in /etc/ferm/conf.d/10_prometheus-apache_exporter line 4:
                deployment-prometheus01.deployment-prep.eqiad.wmflabs 
            ) 
            , AAAA 
        ) 
        <--
DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs' failed: NXDOMAIN

But, that query doesn't return NXDOMAIN:

root@deployment-mediawiki06:/etc/ferm/conf.d# host deployment-prometheus01.deployment-prep.eqiad.wmflabs
deployment-prometheus01.deployment-prep.eqiad.wmflabs has address 10.68.20.247
Krenair created this task.Dec 16 2016, 4:14 PM
Restricted Application added a project: Traffic. · View Herald TranscriptDec 16 2016, 4:14 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Krenair added a comment.EditedDec 16 2016, 4:18 PM
root@deployment-mediawiki06:/etc/ferm/conf.d# perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'AAAA'); print \$resolver->errorstring"
NXDOMAIN
root@deployment-mediawiki06:/etc/ferm/conf.d# perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'A'); print \$resolver->errorstring"
NOERROR

but...

root@deployment-mediawiki06:/etc/ferm/conf.d# dig deployment-prometheus01.deployment-prep.eqiad.wmflabs AAAA
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53078
root@deployment-mediawiki06:/etc/ferm/conf.d# dig deployment-prometheus01.deployment-prep.eqiad.wmflabs A
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64731

So I added debug => 1 to the Net::DNS::Resolver constructor call and found it gets NOERROR from the actual domain, but NXDOMAIN from deployment-prometheus01.deployment-prep.eqiad.wmflabs.deployment-prep.eqiad.wmflabs and deployment-prometheus01.deployment-prep.eqiad.wmflabs.eqiad.wmflabs - those are done because of lack of a trailing full stop.

But then I added a trailing full stop to the configured hostname for the AAAA lookup, and:

root@deployment-mediawiki06:/etc/ferm/conf.d# /usr/sbin/ferm /etc/ferm/ferm.conf
Error in /etc/ferm/conf.d/10_prometheus-apache_exporter line 4:
                deployment-prometheus01.deployment-prep.eqiad.wmflabs. 
            ) 
            , AAAA 
        ) 
        <--
DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs.' failed: NOERROR

wtf?

krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$query = \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'A'); print \$query eq undef"
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$query = \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'AAAA'); print \$query eq undef"
1

It feels like Net::DNS is taking a NOERROR-but-empty-resultset response as an indicator it should try adding the search domains (if no trailing full stop)?
It also feels like Net::DNS is taking a NOERROR-but-empty-resultset response as an excuse to return undef from Net::DNS::Resolver->search when it should give an object that has an empty answer list?

(My Perl terminology may be off here)

Restricted Application added a project: Operations. · View Herald TranscriptDec 16 2016, 4:18 PM
Krenair added a comment.EditedDec 16 2016, 5:00 PM

Ferm will shut up if you make this change:

diff --git a/src/ferm b/src/ferm
index 67f5c89..5874642 100755
--- a/src/ferm
+++ b/src/ferm
@@ -1127,7 +1127,8 @@ sub resolve($\@$) {
 
         my $query = $resolver->search($hostname, $type);
         error("DNS query for '$hostname' failed: " . $resolver->errorstring)
-          unless $query;
+          unless $query || $resolver->errorstring eq 'NOERROR';
+        next unless $query;
 
         foreach my $rr ($query->answer) {
             next unless $rr->type eq $type;

This is assuming the trailing full stop config change, which I think would require a Net::DNS change to do better...

ema moved this task from Triage to DNS Infra on the Traffic board.Dec 19 2016, 9:45 AM

Could report/send the patch upstream?

Krenair added a project: Upstream.EditedJan 17 2017, 1:18 PM

Yeah. The real bug seems to be in Net::DNS, the trailing full stop config change + patch above for ferm is just a workaround. I don't really do much perl yet and don't plan to jump right into the depths of that library, but I could email net-dns-users I guess.

You have successfully confirmed your subscription request to the mailing list net-dns-users, however final approval is required from the list moderator before you will be subscribed. Your request has been forwarded to the list moderator, and you will be notified of the moderator's decision.

MoritzMuehlenhoff triaged this task as Normal priority.Jan 19 2017, 1:07 PM

Gave up waiting for that (it's been almost a year), sent a message anyway and it's been held for moderation.

Krenair renamed this task from Ferm/DNS library weirdness on deployment-mediawiki boxes to Ferm/DNS library weirdness causing puppet errors on 12 deployment-prep instances.Jan 5 2018, 11:03 PM
Krenair updated the task description. (Show Details)
Krenair renamed this task from Ferm/DNS library weirdness causing puppet errors on 12 deployment-prep instances to Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances.Jan 5 2018, 11:30 PM

Change 381073 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] prometheus: make ferm DNS record type configurable

https://gerrit.wikimedia.org/r/381073

hashar added a subscriber: hashar.Jan 8 2018, 11:15 AM

I had the issue a while back T176314#3640963 and went with a workaround of s/AAAA/A/ https://gerrit.wikimedia.org/r/#/c/381073/

hashar removed a subscriber: hashar.Mar 19 2018, 10:03 PM

Change 381073 abandoned by Hashar:
prometheus: make ferm DNS record type configurable

https://gerrit.wikimedia.org/r/381073

Change 381073 restored by Krinkle:
prometheus: make ferm DNS record type configurable

Reason:
Restoring because a version of this is still live on beta.

https://gerrit.wikimedia.org/r/381073

Mentioned in SAL (#wikimedia-releng) [2018-09-22T18:00:20Z] <Krenair> rm deployment-snapshot01:/etc/ferm/conf.d/10_prometheus-nutcracker-exporter as it was breaking ferm from starting (T153468), puppet has not re-created it so I assume it was historical (shouldn't puppet be purging such files?)

Mentioned in SAL (#wikimedia-releng) [2018-09-22T18:01:57Z] <Krenair> rm deployment-maps03:/etc/ferm/conf.d/10_redis_exporter_6379 as it was breaking ferm from starting (T153468), puppet has not re-created it so I assume it was historical (shouldn't puppet be purging such files?)

Mentioned in SAL (#wikimedia-releng) [2018-09-22T18:20:58Z] <Krenair> removed ferm package from deployment-snapshot01 as it appeared unmanaged by puppet and was causing problems with SSH access from the current deployment hosts (previous logs referenced T153468, this just explains why puppet hadn't purged stuff)

Krenair added a comment.EditedSep 23 2018, 2:19 PM

It looks like my net-dns-users subscription got approved on the 20th of August this year. I've made a post about this there: https://www.nlnetlabs.nl/pipermail/net-dns-users/2018/000288.html

And actually now that I've done this and investigated some more I'm confident enough to just report this as a straight up bug(s). :|

Krenair added a comment.EditedSep 23 2018, 3:34 PM

Okay here's what I'm gonna send to upstream bug tracking:

Allow me to prefix this by saying that I'm not a Perl programmer so apologise if I've screwed up terminology etc.
After a bit of investigation in https://phabricator.wikimedia.org/T153468, I believe the following when Net::DNS::Resolver->search gets a NOERROR result with no records:
* if there is no trailing full stop in the domain, it tries adding the search domain(s) to the end, leading to NXDOMAIN.
* if there is a trailing full stop in the domain, it returns undef, when it should give a Net::DNS::Packet=HASH with an empty ->answer.

Here's some examples. You should be able to get these records from labs-ns0.wikimedia.org while outside our network.
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'A'); print \$resolver->errorstring"
NOERROR
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'AAAA'); print \$resolver->errorstring"
NXDOMAIN
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver(debug => 1); \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'AAAA'); print \$resolver->errorstring"
;; search( deployment-prometheus01.deployment-prep.eqiad.wmflabs AAAA )
[...]
;;	ra = 1	z  = 0	ad = 0	cd = 0	rcode  = NOERROR
[...]
;; deployment-prometheus01.deployment-prep.eqiad.wmflabs.	IN	AAAA
;; ANSWER SECTION (0 records)
[...]
;; search( deployment-prometheus01.deployment-prep.eqiad.wmflabs.deployment-prep.eqiad.wmflabs AAAA )
[...]
;;	ra = 1	z  = 0	ad = 0	cd = 0	rcode  = NXDOMAIN
[...]
;; deployment-prometheus01.deployment-prep.eqiad.wmflabs.deployment-prep.eqiad.wmflabs.	IN	AAAA
[...]
;; search( deployment-prometheus01.deployment-prep.eqiad.wmflabs.eqiad.wmflabs AAAA )
[...]
;;	ra = 1	z  = 0	ad = 0	cd = 0	rcode  = NXDOMAIN
[...]
;; deployment-prometheus01.deployment-prep.eqiad.wmflabs.eqiad.wmflabs.	IN	AAAA
[...]
NXDOMAIN

krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver(); \$query = \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'A'); print \$query"
Net::DNS::Packet=HASH(0x562dad5f1318)
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver(); \$query = \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'AAAA'); print \$resolver->errorstring . \"\n\" . (\$query eq undef)"
NOERROR
1
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver(debug => 1); \$query = \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'AAAA'); print \$resolver->errorstring . \"\n\" . (\$query eq undef)"
;; search( deployment-prometheus01.deployment-prep.eqiad.wmflabs. AAAA )
[...]
;;	ra = 1	z  = 0	ad = 0	cd = 0	rcode  = NOERROR
[...]
;; ANSWER SECTION (0 records)
[...]
NOERROR
1
Krenair moved this task from Backlog to Reported Upstream on the Upstream board.EditedSep 23 2018, 3:37 PM

Upstreamed, skipping ferm and going straight to Net::DNS as it doesn't appear to be a problem with what ferm is doing: https://rt.cpan.org/Ticket/Display.html?id=127182

Krenair renamed this task from Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances to Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs.Sep 23 2018, 3:38 PM
Krenair added a comment.EditedSep 23 2018, 9:31 PM

A comment on the Net::DNS ticket seems to indicate we may have to pursue at least one of the problems (return of undef when answer section is empty) with the Ferm developers. Seeking clarification on the other (addition of search domains when answer section is empty).

Ferm PR: https://github.com/MaxKellermann/ferm/pull/37
Still need to find out whether the search domain addition is Net::DNS upstream problem or not. Even if it is I suggest we append trailing full stops via puppet seeing as we likely wouldn't see a patched version for years through the debian packages.

Krenair added a comment.EditedOct 26 2018, 4:59 PM

Ferm merged my pull request for handling of NOERROR empty responses. So now if we can get the next release of ferm, and add the trailing full stops (it doesn't look like there'll be any movement on that problem inside Net::DNS any time soon), we should be good.

Krenair renamed this task from Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs to Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs.
fgiunchedi moved this task from Backlog to Radar on the monitoring board.Oct 29 2018, 2:00 PM

Change 381073 abandoned by Hashar:
prometheus: make ferm DNS record type configurable

Reason:
That is probably outdated, I lack bandwith to rebase it properly.

https://gerrit.wikimedia.org/r/381073

Ferm merged my pull request for handling of NOERROR empty responses. So now if we can get the next release of ferm, and add the trailing full stops (it doesn't look like there'll be any movement on that problem inside Net::DNS any time soon, if ever), we should be good.

What do you mean with trailing full stops? What change would need to be made to a ferm rule to fall back gracefully in combination with the patch of yours which was merged upstream?

Ferm merged my pull request for handling of NOERROR empty responses. So now if we can get the next release of ferm, and add the trailing full stops (it doesn't look like there'll be any movement on that problem inside Net::DNS any time soon, if ever), we should be good.

What do you mean with trailing full stops? What change would need to be made to a ferm rule to fall back gracefully in combination with the patch of yours which was merged upstream?

the dot at the end of a FQDN, e.g. test-instance.eqiad.wmflabs. instead of test-instance.eqiad.wmflabs
this prevents Net::DNS from seeing the empty NOERROR response and going through the search domains, which will lead to NXDOMAIN

Krenair added a comment.EditedNov 9 2018, 9:01 PM

it doesn't look like there'll be any movement on that problem inside Net::DNS any time soon

I was wrong! it didn't look like there was RT activity but I just got https://www.nlnetlabs.nl/pipermail/net-dns-users/2018/000290.html

Haven't tested it though

So that seems to work. If we can get updated ferm and libnet-dns-perl packages this might be done.

Indeed, thanks! I've tested this on mw2151 with a the ferm upstream change applied on top of our current package and a backport of libnet-dns-perl 1.17 plus upstream revision r1717. Going to to a few more tests with libnet-dns-perl reverse dependencies and then I'll rollout this fleet-wide.

Status update: Faidon created a patch for Ferm to address this at https://github.com/MaxKellermann/ferm/pull/41. It's not yet reviewed/merged upstream and too intrusive to backport. We'll revisit when a new upstream release with the patches is available (and ideally it lands in buster as well).

This was already addressed in ferm upstream. I think this problem is solvable now :/