
Should VPS puppetmasters include labs-recursor0/ns-1 in their resolv.confs?
Open, Medium, Public

Description

Sometimes a puppet class (or .erb file) wants to resolve a hostname. That works fine in production, and works fine on a VPS-local puppetmaster. But when we try to compile a manifest on labs-puppetmaster, it can't resolve anything under .wmflabs.

It would probably be simple to have the labs puppetmasters know to look at labs-ns* to resolve things under .wmflabs. Is there any reason not to do that?

Event Timeline

Andrew created this task. Oct 11 2017, 3:53 PM
Restricted Application added a subscriber: Aklapper. Oct 11 2017, 3:53 PM

If lab-ns* servers are down and labpuppetmaster can't resolve anything, what would be the impact?

In any case, /etc/resolv.conf supports up to 3 nameservers, so I think it should be fine to list lab-ns0, lab-ns1 and, as a failover, one of the traditional recursive resolvers (currently 208.80.154.254 and 208.80.153.254). That failover would offer degraded functionality because it can't resolve *.wmflabs. It might violate the principle of least surprise, but someone troubleshooting DNS failures would in theory test each of the three resolvers and see that the first two are broken and the last one can't resolve .wmflabs. Maybe add a comment in /etc/resolv.conf about that.
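
To illustrate the three-nameserver limit, here is a minimal sketch that parses resolv.conf-style text and keeps only the entries a glibc-style resolver would use. The helper name `active_nameservers` and the `SAMPLE` text are assumptions made for illustration (the addresses are the ones that appear in the test further down this thread). It is not how puppet or the resolver library actually implements this.

```python
MAXNS = 3  # glibc honours at most three "nameserver" lines


def active_nameservers(resolv_conf_text):
    """Return the nameserver addresses that would be used, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Commented-out lines begin with '#' (or ';'), so their first
        # field is never exactly "nameserver" and they are skipped.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers[:MAXNS]


# Hypothetical sample for illustration only.
SAMPLE = """\
search wikimedia.org
options timeout:1 attempts:3
nameserver 208.80.155.118
nameserver 208.80.154.20
nameserver 208.80.154.254
#nameserver 208.80.153.254
"""

print(active_nameservers(SAMPLE))
```

The commented-out fourth entry is ignored, and an uncommented fourth entry would be dropped as well, which is why only one production recursor can serve as the fallback.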

If lab-ns* servers are down and labpuppetmaster can't resolve anything, what would be the impact?

I'm pretty sure that the behavior in this case would be the same as the behavior we're seeing now, if the lookups are indeed hierarchical as you suggest. Things outside of .wmflabs would get resolved by the normal (prod-ish) recursors and things in .wmflabs would fail, which as far as I know they're doing now anyway.

To test this I had to use labs-recursor* because recursive queries don't seem to be enabled on labs-ns*.

Here's the output of my test (blocking each labs-recursor* in sequence):

root@labpuppetmaster1001:~# cat /etc/resolv.conf 
#####################################################################
#### THIS FILE IS MANAGED BY PUPPET
####  as template('base/resolv.conf.erb')
#####################################################################
# Resolver configuration for site eqiad
search wikimedia.org
options timeout:1 attempts:3
nameserver 208.80.155.118
nameserver 208.80.154.20
nameserver 208.80.154.254
#nameserver 208.80.153.254

root@labpuppetmaster1001:~# nslookup shinken-02.shinken.eqiad.wmflabs
Server:		208.80.155.118
Address:	208.80.155.118#53

Non-authoritative answer:
Name:	shinken-02.shinken.eqiad.wmflabs
Address: 10.68.21.15

root@labpuppetmaster1001:~# iptables -A OUTPUT -d 208.80.155.118 -p udp --dport 53 -j REJECT

root@labpuppetmaster1001:~# nslookup shinken-02.shinken.eqiad.wmflabs
Server:		208.80.154.20
Address:	208.80.154.20#53

Non-authoritative answer:
Name:	shinken-02.shinken.eqiad.wmflabs
Address: 10.68.21.15

root@labpuppetmaster1001:~# iptables -A OUTPUT -d 208.80.154.20 -p udp --dport 53 -j REJECT

root@labpuppetmaster1001:~# nslookup shinken-02.shinken.eqiad.wmflabs
Server:		208.80.154.254
Address:	208.80.154.254#53

** server can't find shinken-02.shinken.eqiad.wmflabs: NXDOMAIN
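
The three lookups above can be summarised with a toy model. This sketch assumes the resolver moves on to the next nameserver only when the current one is unreachable, and treats the first reply it gets (including NXDOMAIN) as final, which is consistent with the output above. The names `resolve`, `NAMESERVERS` and `BLIND_SPOTS` are hypothetical, and real DNS behaviour has more cases than this.

```python
# Order taken from the resolv.conf shown in the test above.
NAMESERVERS = ["208.80.155.118", "208.80.154.20", "208.80.154.254"]

# Assumption for this model: the production recursor cannot see .wmflabs.
BLIND_SPOTS = {"208.80.154.254": (".wmflabs",)}


def resolve(name, nameservers, unreachable, blind_spots):
    """Return (server, result) for the first nameserver that replies."""
    for server in nameservers:
        if server in unreachable:
            continue  # no reply: wait for the timeout, then try the next one
        if any(name.endswith(suffix) for suffix in blind_spots.get(server, ())):
            return server, "NXDOMAIN"  # a definitive answer, no further fallback
        return server, "ANSWER"
    return None, "TIMEOUT"


host = "shinken-02.shinken.eqiad.wmflabs"
print(resolve(host, NAMESERVERS, set(), BLIND_SPOTS))
print(resolve(host, NAMESERVERS, {"208.80.155.118"}, BLIND_SPOTS))
print(resolve(host, NAMESERVERS, {"208.80.155.118", "208.80.154.20"}, BLIND_SPOTS))
```

With both labs recursors blocked, the model lands on 208.80.154.254 and returns NXDOMAIN for the .wmflabs name, while a name outside .wmflabs would still get an answer from that same server.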

Now, when both labs-recursor* servers are down and we have to use the 3rd nameserver, there is an added delay while the resolver waits for each timeout before moving on.

root@labpuppetmaster1001:~# time nslookup wikimedia.org
Server:		208.80.154.254
Address:	208.80.154.254#53

Non-authoritative answer:
Name:	wikimedia.org
Address: 208.80.154.224


real	0m2.015s
user	0m0.004s
sys	0m0.012s

root@labpuppetmaster1001:~# iptables -D OUTPUT -d 208.80.155.118 -p udp --dport 53 -j REJECT
root@labpuppetmaster1001:~# iptables -D OUTPUT -d 208.80.154.20 -p udp --dport 53 -j REJECT

root@labpuppetmaster1001:~# time nslookup wikimedia.org
Server:		208.80.155.118
Address:	208.80.155.118#53

Non-authoritative answer:
Name:	wikimedia.org
Address: 208.80.154.224


real	0m0.016s
user	0m0.012s
sys	0m0.004s
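
The roughly 2 second delay is what one would expect from `options timeout:1` with two unreachable servers ahead of the one that answers. A minimal sketch of that arithmetic follows. The function name `failover_delay` is hypothetical, and the model ignores retries and network latency.

```python
import re


def failover_delay(resolv_conf_text, unreachable_before_answer):
    """Seconds spent waiting on dead servers before a live one replies."""
    match = re.search(r"^options\b.*\btimeout:(\d+)", resolv_conf_text, re.MULTILINE)
    # resolv.conf(5) documents a default timeout of 5 seconds.
    timeout = int(match.group(1)) if match else 5
    return unreachable_before_answer * timeout


print(failover_delay("options timeout:1 attempts:3\n", 2))  # prints 2
```

Two blocked recursors at one second each gives 2 seconds, close to the measured 2.015s. With the default timeout the same outage would cost about 10 seconds per lookup.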

Hm, I vaguely think that we should always use the recursors rather than the auth in this case since we're generating IPs for use on a VM, so any IP-swizzling that we do in puppet should be the same as on the VM (which only knows about the recursors).

What was happening in your test case that got you NXDOMAIN?

When I got NXDOMAIN, it was querying the 3rd nameserver (dns-rec-lb.eqiad.wikimedia.org -- 208.80.154.254), which is the first one in resolv.conf today.

Ah, ok. So it sounds like this works! Do you have any concerns?

Andrew renamed this task from Should VPS puppetmasters include labs-ns0/ns-1 in their resolv.confs? to Should VPS puppetmasters include labs-recursor0/ns-1 in their resolv.confs?. Oct 24 2018, 3:29 PM

No concerns, this seems reasonable.

GTirloni claimed this task. Nov 6 2018, 3:03 PM
GTirloni triaged this task as Medium priority.
GTirloni moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Change 474923 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] labpuppetmaster: Resolve .wmflabs addresses

https://gerrit.wikimedia.org/r/474923

Change 474923 merged by GTirloni:
[operations/puppet@production] labpuppetmaster: Resolve .wmflabs addresses

https://gerrit.wikimedia.org/r/474923

faidon added a subscriber: faidon. Nov 21 2018, 6:44 PM

If this is about labspuppetmaster1xxx, I have concerns with having a production host use a non-standard recursor, as well as having cross-realm DNS queries like that. I can't offer any practical attack vectors right now, but I'd like to ask to block this for now -- preferably until puppetmasters themselves move to WMCS and this gets implicitly fixed by extension :)

Change 475129 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] Revert "labpuppetmaster: Resolve .wmflabs addresses"

https://gerrit.wikimedia.org/r/475129

Change 475129 merged by GTirloni:
[operations/puppet@production] Revert "labpuppetmaster: Resolve .wmflabs addresses"

https://gerrit.wikimedia.org/r/475129

I've reverted the change for now based on these concerns. This seems related to what's being discussed in T171188 and T207536.

GTirloni removed GTirloni as the assignee of this task. Dec 11 2018, 10:21 AM
GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:11 PM