Page MenuHomePhabricator

Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs
Closed, ResolvedPublic

Description

https://gerrit.wikimedia.org/r/327734 and https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&type=revision&diff=1168635&oldid=1125290 added some ferm rules which appear on 12 deployment-prep boxes (deployment-imagescaler[01-02].deployment-prep.eqiad.wmflabs,deployment-jobrunner02.deployment-prep.eqiad.wmflabs,deployment-mediawiki[04-07].deployment-prep.eqiad.wmflabs,deployment-mira.deployment-prep.eqiad.wmflabs,deployment-snapshot01.deployment-prep.eqiad.wmflabs,deployment-tin.deployment-prep.eqiad.wmflabs,deployment-tmh01.deployment-prep.eqiad.wmflabs,deployment-videoscaler01.deployment-prep.eqiad.wmflabs):

10_prometheus-apache_exporter:&R_SERVICE(tcp, 9117, (@resolve((deployment-prometheus01.deployment-prep.eqiad.wmflabs)) @resolve((deployment-prometheus01.deployment-prep.eqiad.wmflabs), AAAA)));
10_prometheus-hhvm-exporter:&R_SERVICE(tcp, 9192, (@resolve((deployment-prometheus01.deployment-prep.eqiad.wmflabs)) @resolve((deployment-prometheus01.deployment-prep.eqiad.wmflabs), AAAA)));

Which all seems fine, but puppet showed ferm failing to start. Indeed it fails manually too:

root@deployment-mediawiki06:~# /usr/sbin/ferm /etc/ferm/ferm.conf
Error in /etc/ferm/conf.d/10_prometheus-apache_exporter line 4:
                deployment-prometheus01.deployment-prep.eqiad.wmflabs 
            ) 
            , AAAA 
        ) 
        <--
DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs' failed: NXDOMAIN

But, that query doesn't return NXDOMAIN:

root@deployment-mediawiki06:/etc/ferm/conf.d# host deployment-prometheus01.deployment-prep.eqiad.wmflabs
deployment-prometheus01.deployment-prep.eqiad.wmflabs has address 10.68.20.247

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
root@deployment-mediawiki06:/etc/ferm/conf.d# perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'AAAA'); print \$resolver->errorstring"
NXDOMAIN
root@deployment-mediawiki06:/etc/ferm/conf.d# perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'A'); print \$resolver->errorstring"
NOERROR

but...

root@deployment-mediawiki06:/etc/ferm/conf.d# dig deployment-prometheus01.deployment-prep.eqiad.wmflabs AAAA
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53078
root@deployment-mediawiki06:/etc/ferm/conf.d# dig deployment-prometheus01.deployment-prep.eqiad.wmflabs A
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64731

So I added debug => 1 to the Net::DNS::Resolver constructor call and found it gets NOERROR from the actual domain, but NXDOMAIN from deployment-prometheus01.deployment-prep.eqiad.wmflabs.deployment-prep.eqiad.wmflabs and deployment-prometheus01.deployment-prep.eqiad.wmflabs.eqiad.wmflabs - those are done because of lack of a trailing full stop.

But then I added a trailing full stop to the configured hostname for the AAAA lookup, and:

root@deployment-mediawiki06:/etc/ferm/conf.d# /usr/sbin/ferm /etc/ferm/ferm.conf
Error in /etc/ferm/conf.d/10_prometheus-apache_exporter line 4:
                deployment-prometheus01.deployment-prep.eqiad.wmflabs. 
            ) 
            , AAAA 
        ) 
        <--
DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs.' failed: NOERROR

wtf?

krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$query = \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'A'); print \$query eq undef"
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$query = \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'AAAA'); print \$query eq undef"
1

It feels like Net::DNS is taking a NOERROR-but-empty-resultset response as an indicator it should try adding the search domains (if no trailing full stop)?
It also feels like Net::DNS is taking a NOERROR-but-empty-resultset response as an excuse to return undef from Net::DNS::Resolver->search when it should give an object that has an empty answer list?

(My Perl terminology may be off here)

Ferm will shut up if you make this change:

diff --git a/src/ferm b/src/ferm
index 67f5c89..5874642 100755
--- a/src/ferm
+++ b/src/ferm
@@ -1127,7 +1127,8 @@ sub resolve($\@$) {
 
         my $query = $resolver->search($hostname, $type);
         error("DNS query for '$hostname' failed: " . $resolver->errorstring)
-          unless $query;
+          unless $query || $resolver->errorstring eq 'NOERROR';
+        next unless $query;
 
         foreach my $rr ($query->answer) {
             next unless $rr->type eq $type;

This is assuming the trailing full stop config change, which I think would require a Net::DNS change to do better...

Yeah. The real bug seems to be in Net::DNS, the trailing full stop config change + patch above for ferm is just a workaround. I don't really do much perl yet and don't plan to jump right into the depths of that library, but I could email net-dns-users I guess.

You have successfully confirmed your subscription request to the mailing list net-dns-users, however final approval is required from the list moderator before you will be subscribed. Your request has been forwarded to the list moderator, and you will be notified of the moderator's decision.

Gave up waiting for that (it's been almost a year), sent a message anyway and it's been held for moderation.

Krenair renamed this task from Ferm/DNS library weirdness on deployment-mediawiki boxes to Ferm/DNS library weirdness causing puppet errors on 12 deployment-prep instances.Jan 5 2018, 11:03 PM
Krenair updated the task description. (Show Details)
Krenair renamed this task from Ferm/DNS library weirdness causing puppet errors on 12 deployment-prep instances to Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances.Jan 5 2018, 11:30 PM

Change 381073 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] prometheus: make ferm DNS record type configurable

https://gerrit.wikimedia.org/r/381073

I had the issue a while back T176314#3640963 and went with a workaround of s/AAAA/A/ https://gerrit.wikimedia.org/r/#/c/381073/

Change 381073 abandoned by Hashar:
prometheus: make ferm DNS record type configurable

https://gerrit.wikimedia.org/r/381073

Change 381073 restored by Krinkle:
prometheus: make ferm DNS record type configurable

Reason:
Restoring because a version of this is still live on beta.

https://gerrit.wikimedia.org/r/381073

Mentioned in SAL (#wikimedia-releng) [2018-09-22T18:00:20Z] <Krenair> rm deployment-snapshot01:/etc/ferm/conf.d/10_prometheus-nutcracker-exporter as it was breaking ferm from starting (T153468), puppet has not re-created it so I assume it was historical (shouldn't puppet be purging such files?)

Mentioned in SAL (#wikimedia-releng) [2018-09-22T18:01:57Z] <Krenair> rm deployment-maps03:/etc/ferm/conf.d/10_redis_exporter_6379 as it was breaking ferm from starting (T153468), puppet has not re-created it so I assume it was historical (shouldn't puppet be purging such files?)

Mentioned in SAL (#wikimedia-releng) [2018-09-22T18:20:58Z] <Krenair> removed ferm package from deployment-snapshot01 as it appeared unmanaged by puppet and was causing problems with SSH access from the current deployment hosts (previous logs referenced T153468, this just explains why puppet hadn't purged stuff)

It looks like my net-dns-users subscription got approved on the 20th of August this year. I've made a post about this there: https://www.nlnetlabs.nl/pipermail/net-dns-users/2018/000288.html

And actually now that I've done this and investigated some more I'm confident enough to just report this as a straight up bug(s). :|

Okay here's what I'm gonna send to upstream bug tracking:

Allow me to prefix this by saying that I'm not a Perl programmer so apologise if I've screwed up terminology etc.
After a bit of investigation in https://phabricator.wikimedia.org/T153468, I believe the following when Net::DNS::Resolver->search gets a NOERROR result with no records:
* if there is no trailing full stop in the domain, it tries adding the search domain(s) to the end, leading to NXDOMAIN.
* if there is a trailing full stop in the domain, it returns undef, when it should give a Net::DNS::Packet=HASH with an empty ->answer.

Here's some examples. You should be able to get these records from labs-ns0.wikimedia.org while outside our network.
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'A'); print \$resolver->errorstring"
NOERROR
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver; \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'AAAA'); print \$resolver->errorstring"
NXDOMAIN
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver(debug => 1); \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs', 'AAAA'); print \$resolver->errorstring"
;; search( deployment-prometheus01.deployment-prep.eqiad.wmflabs AAAA )
[...]
;;	ra = 1	z  = 0	ad = 0	cd = 0	rcode  = NOERROR
[...]
;; deployment-prometheus01.deployment-prep.eqiad.wmflabs.	IN	AAAA
;; ANSWER SECTION (0 records)
[...]
;; search( deployment-prometheus01.deployment-prep.eqiad.wmflabs.deployment-prep.eqiad.wmflabs AAAA )
[...]
;;	ra = 1	z  = 0	ad = 0	cd = 0	rcode  = NXDOMAIN
[...]
;; deployment-prometheus01.deployment-prep.eqiad.wmflabs.deployment-prep.eqiad.wmflabs.	IN	AAAA
[...]
;; search( deployment-prometheus01.deployment-prep.eqiad.wmflabs.eqiad.wmflabs AAAA )
[...]
;;	ra = 1	z  = 0	ad = 0	cd = 0	rcode  = NXDOMAIN
[...]
;; deployment-prometheus01.deployment-prep.eqiad.wmflabs.eqiad.wmflabs.	IN	AAAA
[...]
NXDOMAIN

krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver(); \$query = \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'A'); print \$query"
Net::DNS::Packet=HASH(0x562dad5f1318)
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver(); \$query = \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'AAAA'); print \$resolver->errorstring . \"\n\" . (\$query eq undef)"
NOERROR
1
krenair@deployment-deploy01:~$ perl -e "require Net::DNS; my \$resolver = new Net::DNS::Resolver(debug => 1); \$query = \$resolver->search('deployment-prometheus01.deployment-prep.eqiad.wmflabs.', 'AAAA'); print \$resolver->errorstring . \"\n\" . (\$query eq undef)"
;; search( deployment-prometheus01.deployment-prep.eqiad.wmflabs. AAAA )
[...]
;;	ra = 1	z  = 0	ad = 0	cd = 0	rcode  = NOERROR
[...]
;; ANSWER SECTION (0 records)
[...]
NOERROR
1

Upstreamed, skipping ferm and going straight to Net::DNS as it doesn't appear to be a problem with what ferm is doing: https://rt.cpan.org/Ticket/Display.html?id=127182

Krenair renamed this task from Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances to Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs.Sep 23 2018, 3:38 PM

A comment on the Net::DNS ticket seems to indicate we may have to pursue at least one of the problems (return of undef when answer section is empty) with the Ferm developers. Seeking clarification on the other (addition of search domains when answer section is empty).

Ferm PR: https://github.com/MaxKellermann/ferm/pull/37
Still need to find out whether the search domain addition is Net::DNS upstream problem or not. Even if it is I suggest we append trailing full stops via puppet seeing as we likely wouldn't see a patched version for years through the debian packages.

Ferm merged my pull request for handling of NOERROR empty responses. So now if we can get the next release of ferm, and add the trailing full stops (it doesn't look like there'll be any movement on that problem inside Net::DNS any time soon), we should be good.

Krenair renamed this task from Ferm's upstream Net::DNS Perl library bad handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs to Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs.Oct 26 2018, 5:06 PM
Krenair moved this task from Reported Upstream to Patch merged upstream on the Upstream board.

Change 381073 abandoned by Hashar:
prometheus: make ferm DNS record type configurable

Reason:
That is probably outdated, I lack bandwith to rebase it properly.

https://gerrit.wikimedia.org/r/381073

Ferm merged my pull request for handling of NOERROR empty responses. So now if we can get the next release of ferm, and add the trailing full stops (it doesn't look like there'll be any movement on that problem inside Net::DNS any time soon, if ever), we should be good.

What do you mean with trailing full stops? What change would need to be made to a ferm rule to fall back gracefully in combination with the patch of yours which was merged upstream?

Ferm merged my pull request for handling of NOERROR empty responses. So now if we can get the next release of ferm, and add the trailing full stops (it doesn't look like there'll be any movement on that problem inside Net::DNS any time soon, if ever), we should be good.

What do you mean with trailing full stops? What change would need to be made to a ferm rule to fall back gracefully in combination with the patch of yours which was merged upstream?

the dot at the end of a FQDN, e.g. test-instance.eqiad.wmflabs. instead of test-instance.eqiad.wmflabs
this prevents Net::DNS from seeing the empty NOERROR response and going through the search domains, which will lead to NXDOMAIN

it doesn't look like there'll be any movement on that problem inside Net::DNS any time soon

I was wrong! it didn't look like there was RT activity but I just got https://www.nlnetlabs.nl/pipermail/net-dns-users/2018/000290.html

Haven't tested it though

So that seems to work. If we can get updated ferm and libnet-dns-perl packages this might be done.

Indeed, thanks! I've tested this on mw2151 with a the ferm upstream change applied on top of our current package and a backport of libnet-dns-perl 1.17 plus upstream revision r1717. Going to to a few more tests with libnet-dns-perl reverse dependencies and then I'll rollout this fleet-wide.

Status update: Faidon created a patch for Ferm to address this at https://github.com/MaxKellermann/ferm/pull/41. It's not yet reviewed/merged upstream and too intrusive to backport. We'll revisit when a new upstream release with the patches is available (and ideally it lands in buster as well).

This was already addressed in ferm upstream. I think this problem is solvable now :/

This is still causing problems. The upstream patch is merged.

Added a new deployment-prep puppet cherry-pick until our copy of ferm is fixed:

commit e50dbbf2514b1e78e729bf3441af247ce98769e0
Author: Alex Monk <krenair@gmail.com>
Date:   Wed Feb 27 19:19:50 2019 +0000

    Crappy T153468 workaround: Drop AAAA rules
    
    This is so other Ferm rules can be applied successfully.

diff --git a/modules/profile/manifests/mediawiki/php/monitoring.pp b/modules/profile/manifests/mediawiki/php/monitoring.pp
index 0e1c0dcb4e..15346e8994 100644
--- a/modules/profile/manifests/mediawiki/php/monitoring.pp
+++ b/modules/profile/manifests/mediawiki/php/monitoring.pp
@@ -39,7 +39,7 @@ class profile::mediawiki::php::monitoring(
         mode    => '0440'
     }
 
-    $ferm_srange = "(@resolve((${prometheus_nodes_str})) @resolve((${prometheus_nodes_str}), AAAA))"
+    $ferm_srange = "(@resolve((${prometheus_nodes_str})))"
     ferm::service { 'prometheus-php-cache-exporter':
         proto  => 'tcp',
         port   => $admin_port,
diff --git a/modules/profile/manifests/prometheus/elasticsearch_exporter.pp b/modules/profile/manifests/prometheus/elasticsearch_exporter.pp
index c2f789d410..b5b8dd34f3 100644
--- a/modules/profile/manifests/prometheus/elasticsearch_exporter.pp
+++ b/modules/profile/manifests/prometheus/elasticsearch_exporter.pp
@@ -27,7 +27,7 @@ define profile::prometheus::elasticsearch_exporter(
     ferm::service { "prometheus_elasticsearch_exporter_${prometheus_port}":
         proto  => 'tcp',
         port   => $prometheus_port,
-        srange => "(@resolve((${prometheus_nodes_ferm})) @resolve((${prometheus_nodes_ferm}), AAAA))",
+        srange => "(@resolve((${prometheus_nodes_ferm})))",
     }
 
 }
diff --git a/modules/profile/manifests/prometheus/php_fpm_exporter.pp b/modules/profile/manifests/prometheus/php_fpm_exporter.pp
index 5830e76d95..a7848767f9 100644
--- a/modules/profile/manifests/prometheus/php_fpm_exporter.pp
+++ b/modules/profile/manifests/prometheus/php_fpm_exporter.pp
@@ -18,7 +18,7 @@ class profile::prometheus::php_fpm_exporter (
         fcgi_endpoint => $fcgi_endpoint
     }
     $prometheus_ferm_nodes = join($prometheus_nodes, ' ')
-    $ferm_srange = "(@resolve((${prometheus_ferm_nodes})) @resolve((${prometheus_ferm_nodes}), AAAA))"
+    $ferm_srange = "(@resolve((${prometheus_ferm_nodes})))"
 
     ferm::service { 'prometheus-php-fpm-exporter':
         proto  => 'tcp',
diff --git a/modules/profile/manifests/prometheus/statsd_exporter.pp b/modules/profile/manifests/prometheus/statsd_exporter.pp
index 9aab84526b..d97104b8bc 100644
--- a/modules/profile/manifests/prometheus/statsd_exporter.pp
+++ b/modules/profile/manifests/prometheus/statsd_exporter.pp
@@ -9,7 +9,7 @@ class profile::prometheus::statsd_exporter (
     }
 
     $prometheus_ferm_nodes = join($prometheus_nodes, ' ')
-    $ferm_srange = "(@resolve((${prometheus_ferm_nodes})) @resolve((${prometheus_ferm_nodes}), AAAA))"
+    $ferm_srange = "(@resolve((${prometheus_ferm_nodes})))"
 
     ferm::service { 'prometheus-statsd-exporter':
         proto  => 'tcp',

Ran into this in T171188: Move the main WMCS puppetmaster into the Labs realm. Luckily I have an internal puppetmaster controlling that where I can remove the AAAA entry too. Also I think the patch above for deployment-prep is outdated as I recall having to add more stuff to it.

Mentioned in SAL (#wikimedia-operations) [2019-04-23T10:44:20Z] <moritzm> uploaded ferm 2.4-1+wmf2+deb10u1 to buster-wikimedia (T153468)

All buster hosts have been upgraded to a package of the current 2.4.2pre git with the patches made by Faidon which fixes this for good. I'll probably also upgrade stretch to 2.4.2, but that needs a few more tests as stretch currently uses 2.3.

Change 381073 restored by Dzahn:
prometheus: make ferm DNS record type configurable

https://gerrit.wikimedia.org/r/381073

I've built the 2.4.2pre package for stretch-wikimedia and tested it on a few servers successfully (also comparing iptables output to spot any potential regression from the 2.3->2.4 move). I'll upload that on Monday as it needs some coordination with Arturo to syncronise the rollout to Cloud VPS (as the ferm package ships ferm.conf as a conffile, to rule out issues with unattended-upgrades overwriting the puppetised version)

Jessie won't be fixed, it's going away in the next six months and the difference to 2.4.2pre is even bigger (plus it doesn't use a systemd unit yet)

Mentioned in SAL (#wikimedia-operations) [2019-09-30T09:30:48Z] <moritzm> uploading ferm 2.4.1+wmf2+deb9u1 for stretch-wikimedia, fixes AAAA lookups (T153468)

Mentioned in SAL (#wikimedia-cloud) [2019-09-30T09:33:08Z] <arturo> force update ferm cloud-wide (in all VMs) for T153468

Mentioned in SAL (#wikimedia-operations) [2019-10-01T09:19:50Z] <moritzm> upgrading ferm on a number of systems to 2.4.2pre T153468

Mentioned in SAL (#wikimedia-operations) [2019-10-01T10:17:15Z] <moritzm> upgrading ferm on remaining mw servers 2.4.2pre T153468

MoritzMuehlenhoff claimed this task.

This is fixed for Buster and Stretch, the remaining ~ 100 Jessie hosts won't get fixed, they'll vanish in the next six months anyway.

Change 381073 abandoned by Hashar:
prometheus: make ferm DNS record type configurable

https://gerrit.wikimedia.org/r/381073

Mentioned in SAL (#wikimedia-releng) [2021-12-23T12:59:09Z] <majavah> dropped T153468 workaround patch from deployment-prep as unnecessary with modern ferm versions