
Some VPS instances still using ns-recursor0
Open, Medium, Public

Description

While looking at another issue, I noticed that some CloudVPS instances are still sending DNS queries to the ns-recursor0.openstack.eqiad1.wikimediacloud.org IP:

root@cloudnet1005:~# tcpdump -i qr-defc9d1d-40 -l -p -nn host 172.16.0.17 and port 53 
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on qr-defc9d1d-40, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:05:40.835546 IP 172.16.0.17.39310 > 208.80.154.143.53: 37405+ A? syslogaudit2.svc.eqiad1.wikimedia.cloud. (57)
23:05:40.835856 IP 208.80.154.143.53 > 172.16.0.17.39310: 37405 1/0/0 A 172.16.5.118 (73)

These need to be changed to use ns-recursor.openstack.eqiad1.wikimediacloud.org / 172.20.255.1 before we move cloudservices1005.

Related Objects

35 related tasks: 27 resolved, 7 open, 1 in progress (assignees: aborrero, cmooney, taavi, Andrew, ayounsi, Papaul; task titles not captured in this export).

Event Timeline

cmooney triaged this task as Medium priority. Sep 15 2023, 8:29 AM
cmooney created this task.

I did a bit of research and found a bunch of VMs that haven't run Puppet in a long time and therefore never updated their /etc/resolv.conf.

Example:

arturo@nostromo:~ $ ssh root@dns1.ldap-dev.eqiad1.wikimedia.cloud
Linux dns1 5.16.0-0.bpo.4-cloud-amd64 #1 SMP PREEMPT Debian 5.16.12-1~bpo11+1 (2022-03-08) x86_64
Debian GNU/Linux 11 (bullseye)
dns1 is a Authoritative DNS server (dns::auth)
The last Puppet run was at Wed Jul 20 20:38:37 UTC 2022 (606984 minutes ago). Puppet is disabled. blargh - jhathaway
Last puppet commit: (ad699665880) root - moar hacks
Last login: Mon Sep 11 19:04:52 2023

root@dns1:~# cat /etc/resolv.conf 
#####################################################################
#### THIS FILE IS MANAGED BY PUPPET
####  as template('resolvconf/resolv.conf.erb')
#####################################################################
search ldap-dev.eqiad1.wikimedia.cloud 
options timeout:1 attempts:3 ndots:1
nameserver 208.80.154.143
nameserver 208.80.154.24

I guess the options here are:

  • contact the owners and let them know we have a new DNS recursor
  • fix them ourselves
  • let them break
  • some combination of the above

Change 957902 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: route using NAT queries to legacy DNS recursors to the new

https://gerrit.wikimedia.org/r/957902

Change 957902 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: route using NAT queries to legacy DNS recursors to the new

https://gerrit.wikimedia.org/r/957902

Mentioned in SAL (#wikimedia-cloud) [2023-09-15T11:43:46Z] <arturo> merging NAT change for T346426 in cloudgw

Change 957909 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: don't restrict compat DNS NAT to VMs without floating IPs

https://gerrit.wikimedia.org/r/957909

Change 957909 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: don't restrict compat DNS NAT to VMs without floating IPs

https://gerrit.wikimedia.org/r/957909

aborrero claimed this task.

We should undo the NAT changes at some point.

Sent an email requesting users to fix their VMs: https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/5LLLYEOSYV7GVW5RZUJEGTXH2PNLSSGP/

I updated resolv.conf in all VMs that cumin can reach.
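
A hypothetical sketch of the kind of substitution involved (the cumin query and exact sed expression here are assumptions, not the command actually run); the edit is demonstrated against a scratch copy so it can be dry-run safely:

```shell
# Scratch copy of a legacy resolv.conf so the substitution can be tested
# without touching the real file.
cat > /tmp/resolv.conf.test <<'EOF'
search ldap-dev.eqiad1.wikimedia.cloud
options timeout:1 attempts:3 ndots:1
nameserver 208.80.154.143
nameserver 208.80.154.24
EOF

# Rewrite both legacy recursor IPs to the new address...
sed -i 's/^nameserver 208\.80\.154\.\(143\|24\)$/nameserver 172.20.255.1/' /tmp/resolv.conf.test
# ...then drop the resulting duplicate line, preserving order.
awk '!seen[$0]++' /tmp/resolv.conf.test > /tmp/resolv.conf.new

cat /tmp/resolv.conf.new
# Fleet-wide, something along these lines (hypothetical cumin invocation):
#   sudo cumin 'A:all' "sed -i 's/208.80.154.143/172.20.255.1/' /etc/resolv.conf"
```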

Change 998780 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: drop temporal NAT for legacy DNS resolvers

https://gerrit.wikimedia.org/r/998780

Re-opening as we clearly still have traffic using the old recursor IPs

cmooney@cloudgw1001:~$ sudo nft list ruleset | grep 208.80.154.143
		ip daddr { 208.80.154.24, 208.80.154.143 } udp dport 53 counter packets 3860328 bytes 316392714 dnat ip to 172.20.255.1 comment "compat DNS resolver"

@aborrero I'm not sure what the best thing to do is. Possibly we can log the matching packets via nftables to identify them?
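
One way to do that (a sketch, not a rule actually deployed: the chain and hook on cloudgw are assumptions to be checked against `nft list ruleset` first) would be to add a rate-limited log statement to the compat DNAT rule, so the remaining client IPs show up in the kernel log:

```
ip daddr { 208.80.154.24, 208.80.154.143 } udp dport 53 \
    limit rate 10/minute log prefix "legacy-dns: " \
    counter dnat ip to 172.20.255.1 comment "compat DNS resolver"
```

The source addresses would then be visible via `journalctl -k | grep legacy-dns:`.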

Actually, I just did a quick tcpdump and at least these hosts are using the old IPs, so we can start there.

172.16.0.81	pontoon-os-collector-01.monitoring.eqiad1.wikimedia.cloud.
172.16.1.41	mwoffliner3.mwoffliner.eqiad1.wikimedia.cloud.
172.16.1.248	traffic-cache-atstext-buster.traffic.eqiad1.wikimedia.cloud.
172.16.2.80	rel2.search.eqiad1.wikimedia.cloud.
172.16.2.197	mwoffliner1.mwoffliner.eqiad1.wikimedia.cloud.
172.16.2.233	dns1.ldap-dev.eqiad1.wikimedia.cloud.
172.16.3.34	k8s-master-01.appservers.eqiad1.wikimedia.cloud.
172.16.3.68	webserver.qrank.eqiad1.wikimedia.cloud.
172.16.3.137	pontoon-arclamp-01.monitoring.eqiad1.wikimedia.cloud.
172.16.3.168	glitchtip.twl.eqiad1.wikimedia.cloud.
172.16.3.180	neon.rcm.eqiad1.wikimedia.cloud.
172.16.4.93	quarry-nfs-dev-02.quarry.eqiad1.wikimedia.cloud.
172.16.4.124	k8s-pontoonlb-01.appservers.eqiad1.wikimedia.cloud.
172.16.4.182	k8s-node-01.appservers.eqiad1.wikimedia.cloud.
172.16.5.11	hupu2.wikidocumentaries.eqiad1.wikimedia.cloud.
172.16.5.108	wikilabels-03.wikilabels.eqiad1.wikimedia.cloud.
172.16.5.109	
172.16.6.46	integration-agent-docker-1040.integration.eqiad1.wikimedia.cloud.
172.16.6.208	mwoffliner4.mwoffliner.eqiad1.wikimedia.cloud.
172.16.6.234	ty-analytics.teyora.eqiad1.wikimedia.cloud.
172.16.7.147	striker-docker-01.striker.eqiad1.wikimedia.cloud.

Can we assume the affected VMs that you discovered are either unmaintained, or have some special configuration? Maybe we just broke them when we introduced the DNS change back in the day.

They were likely broken enough for cumin to not reach them. I'll nonetheless work through that list a bit.

According to cumin:

  • Two of those hosts still have ns0 in their resolv.conf
  • One is unreachable (quarry-nfs-dev-02.quarry.eqiad1.wikimedia.cloud)
  • 13 have the right value (172.20.255.1) in resolv.conf
  • 5 have 127.0.0.53 in resolv.conf

I will fix the first two.

The five with 127.0.0.53 are on their own -- they're probably Pontoon or some other SRE-guided self-reliance setup.

The 13 with the proper resolv.conf presumably have nameservers cached or configured somewhere else. I'm not sure anything can/should be done about them. They are:

dns1.ldap-dev.eqiad1.wikimedia.cloud,glitchtip.twl.eqiad1.wikimedia.cloud,hupu2.wikidocumentaries.eqiad1.wikimedia.cloud,integration-agent-docker-1040.integration.eqiad1.wikimedia.cloud,mwoffliner[1,3-4].mwoffliner.eqiad1.wikimedia.cloud,neon.rcm.eqiad1.wikimedia.cloud,striker-docker-01.striker.eqiad1.wikimedia.cloud,traffic-cache-atstext-buster.traffic.eqiad1.wikimedia.cloud,ty-analytics.teyora.eqiad1.wikimedia.cloud,webserver.qrank.eqiad1.wikimedia.cloud,wikilabels-03.wikilabels.eqiad1.wikimedia.cloud

I think we can remove the redirects here. If someone has Puppet broken for months and did not react to the cloud-announce email when this was originally announced I'd say they're on their own.

Change 1002446 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] Remove ns-recursor0.openstack.eqiad.wikimediacloud.org names

https://gerrit.wikimedia.org/r/1002446

I think we can remove the redirects here. If someone has Puppet broken for months and did not react to the cloud-announce email when this was originally announced I'd say they're on their own.

I agree.

Change 998780 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: drop temporal NAT for legacy DNS resolvers

https://gerrit.wikimedia.org/r/998780

Change 1002446 merged by Majavah:

[operations/dns@master] Remove ns-recursor0.openstack.eqiad.wikimediacloud.org names

https://gerrit.wikimedia.org/r/1002446

I noticed this morning that this broke new VMs based on images built before the new resolver IP was added. To fix it, I rebuilt and installed a new Bullseye base image and built a new Buster base image in 'testlabs', but disabled the existing Buster image in Toolforge because I'm hopeful that we won't be building any new Buster VMs.

  • 5 have 127.0.0.53 in resolv.conf

Those are probably running systemd-resolved, which runs a caching stub resolver on localhost and forwards queries upstream. You should find the actual IP they're forwarding queries to in /etc/systemd/resolved.conf.
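
A quick way to check (a sketch; the sample file below is illustrative -- on a live VM you'd grep the real /etc/systemd/resolved.conf or run `resolvectl status`):

```shell
# Sample file standing in for /etc/systemd/resolved.conf so the check can
# be tried anywhere; the DNS= value shown is the new recursor address.
cat > /tmp/resolved.conf.sample <<'EOF'
[Resolve]
DNS=172.20.255.1
EOF

# The statically configured upstream, if any:
grep -E '^DNS=' /tmp/resolved.conf.sample
# On a live host, `resolvectl status` additionally shows per-link servers
# learned from DHCP/systemd-networkd.
```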

The 13 with the proper reesolv.conf presumably have nameservers cached or configured someplace else. I'm not sure anything can/should be done about them.

Yeah probably some user-space thing doing DNS directly. Not really sure what you can do.

I agree that removing the redirect/NAT might be the best option so those things break, forcing any that are still maintained to adjust their config.