
Some VPS instances still using ns-recursor0
Open, Medium, Public

Description

While looking at another issue, I noticed that some CloudVPS instances are still sending DNS queries to the ns-recursor0.openstack.eqiad1.wikimediacloud.org IP:

root@cloudnet1005:~# tcpdump -i qr-defc9d1d-40 -l -p -nn host 172.16.0.17 and port 53 
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on qr-defc9d1d-40, link-type EN10MB (Ethernet), snapshot length 262144 bytes
23:05:40.835546 IP 172.16.0.17.39310 > 208.80.154.143.53: 37405+ A? syslogaudit2.svc.eqiad1.wikimedia.cloud. (57)
23:05:40.835856 IP 208.80.154.143.53 > 172.16.0.17.39310: 37405 1/0/0 A 172.16.5.118 (73)

These need to be changed to use ns-recursor.openstack.eqiad1.wikimediacloud.org / 172.20.255.1 before we move cloudservices1005.

Related Objects

35 related tasks: 27 resolved, 7 open, 1 in progress (assignees: aborrero, cmooney, taavi, Andrew, ayounsi, Papaul; task titles not captured in this export).

Event Timeline

cmooney triaged this task as Medium priority. Sep 15 2023, 8:29 AM
cmooney created this task.

I did a bit of research and found a bunch of VMs that haven't run Puppet in a long time and therefore never updated their /etc/resolv.conf.

Example:

arturo@nostromo:~ $ ssh root@dns1.ldap-dev.eqiad1.wikimedia.cloud
Linux dns1 5.16.0-0.bpo.4-cloud-amd64 #1 SMP PREEMPT Debian 5.16.12-1~bpo11+1 (2022-03-08) x86_64
Debian GNU/Linux 11 (bullseye)
dns1 is a Authoritative DNS server (dns::auth)
The last Puppet run was at Wed Jul 20 20:38:37 UTC 2022 (606984 minutes ago). Puppet is disabled. blargh - jhathaway
Last puppet commit: (ad699665880) root - moar hacks
Last login: Mon Sep 11 19:04:52 2023

root@dns1:~# cat /etc/resolv.conf 
#####################################################################
#### THIS FILE IS MANAGED BY PUPPET
####  as template('resolvconf/resolv.conf.erb')
#####################################################################
search ldap-dev.eqiad1.wikimedia.cloud 
options timeout:1 attempts:3 ndots:1
nameserver 208.80.154.143
nameserver 208.80.154.24

I guess the options here are:

  • contact the owners and let them know we have a new DNS recursor
  • fix them ourselves
  • let them break
  • some combination of the above

Change 957902 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: route using NAT queries to legacy DNS recursors to the new

https://gerrit.wikimedia.org/r/957902

Change 957902 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: route using NAT queries to legacy DNS recursors to the new

https://gerrit.wikimedia.org/r/957902

Mentioned in SAL (#wikimedia-cloud) [2023-09-15T11:43:46Z] <arturo> merging NAT change for T346426 in cloudgw

Change 957909 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: don't restrict compat DNS NAT to VMs without floating IPs

https://gerrit.wikimedia.org/r/957909

Change 957909 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: don't restrict compat DNS NAT to VMs without floating IPs

https://gerrit.wikimedia.org/r/957909

aborrero claimed this task.

We should undo the NAT changes at some point.

Sent an email requesting users to fix their VMs: https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/5LLLYEOSYV7GVW5RZUJEGTXH2PNLSSGP/

I updated resolv.conf in all VMs that cumin can reach.
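
A hypothetical sketch of the kind of substitution involved (the cumin query and exact sed expression here are assumptions, not the command actually run); the edit is demonstrated against a scratch copy so it can be dry-run safely:

```shell
# Scratch copy of a legacy resolv.conf so the substitution can be tested
# without touching the real file.
cat > /tmp/resolv.conf.test <<'EOF'
search ldap-dev.eqiad1.wikimedia.cloud
options timeout:1 attempts:3 ndots:1
nameserver 208.80.154.143
nameserver 208.80.154.24
EOF

# Rewrite both legacy recursor IPs to the new address...
sed -i 's/^nameserver 208\.80\.154\.\(143\|24\)$/nameserver 172.20.255.1/' /tmp/resolv.conf.test
# ...then drop the resulting duplicate line, preserving order.
awk '!seen[$0]++' /tmp/resolv.conf.test > /tmp/resolv.conf.new

cat /tmp/resolv.conf.new
# Fleet-wide, something along these lines (hypothetical cumin invocation):
#   sudo cumin 'A:all' "sed -i 's/208.80.154.143/172.20.255.1/' /etc/resolv.conf"
```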

Change 998780 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: drop temporal NAT for legacy DNS resolvers

https://gerrit.wikimedia.org/r/998780

Re-opening as we clearly still have traffic using the old recursor IPs

cmooney@cloudgw1001:~$ sudo nft list ruleset | grep 208.80.154.143
		ip daddr { 208.80.154.24, 208.80.154.143 } udp dport 53 counter packets 3860328 bytes 316392714 dnat ip to 172.20.255.1 comment "compat DNS resolver"

@aborrero I'm not sure what the best thing to do is. Possibly we can log the matching packets via nftables to identify them?
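
One way to do that (a sketch, not a rule actually deployed: the chain and hook on cloudgw are assumptions to be checked against `nft list ruleset` first) would be to add a rate-limited log statement to the compat DNAT rule, so the remaining client IPs show up in the kernel log:

```
ip daddr { 208.80.154.24, 208.80.154.143 } udp dport 53 \
    limit rate 10/minute log prefix "legacy-dns: " \
    counter dnat ip to 172.20.255.1 comment "compat DNS resolver"
```

The source addresses would then be visible via `journalctl -k | grep legacy-dns:`.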

Actually, I just did a quick tcpdump and at least these hosts are using the old IPs, so we can start there.

172.16.0.81	pontoon-os-collector-01.monitoring.eqiad1.wikimedia.cloud.
172.16.1.41	mwoffliner3.mwoffliner.eqiad1.wikimedia.cloud.
172.16.1.248	traffic-cache-atstext-buster.traffic.eqiad1.wikimedia.cloud.
172.16.2.80	rel2.search.eqiad1.wikimedia.cloud.
172.16.2.197	mwoffliner1.mwoffliner.eqiad1.wikimedia.cloud.
172.16.2.233	dns1.ldap-dev.eqiad1.wikimedia.cloud.
172.16.3.34	k8s-master-01.appservers.eqiad1.wikimedia.cloud.
172.16.3.68	webserver.qrank.eqiad1.wikimedia.cloud.
172.16.3.137	pontoon-arclamp-01.monitoring.eqiad1.wikimedia.cloud.
172.16.3.168	glitchtip.twl.eqiad1.wikimedia.cloud.
172.16.3.180	neon.rcm.eqiad1.wikimedia.cloud.
172.16.4.93	quarry-nfs-dev-02.quarry.eqiad1.wikimedia.cloud.
172.16.4.124	k8s-pontoonlb-01.appservers.eqiad1.wikimedia.cloud.
172.16.4.182	k8s-node-01.appservers.eqiad1.wikimedia.cloud.
172.16.5.11	hupu2.wikidocumentaries.eqiad1.wikimedia.cloud.
172.16.5.108	wikilabels-03.wikilabels.eqiad1.wikimedia.cloud.
172.16.5.109	
172.16.6.46	integration-agent-docker-1040.integration.eqiad1.wikimedia.cloud.
172.16.6.208	mwoffliner4.mwoffliner.eqiad1.wikimedia.cloud.
172.16.6.234	ty-analytics.teyora.eqiad1.wikimedia.cloud.
172.16.7.147	striker-docker-01.striker.eqiad1.wikimedia.cloud.

Can we assume the affected VMs that you discovered are either unmaintained, or have some special configuration? Maybe we just broke them when we introduced the DNS change back in the day.

They were likely broken enough for cumin to not reach them. I'll nonetheless work through that list a bit.

According to cumin:

  • Two of those hosts still have ns0 in their resolv.conf
  • One is unreachable (quarry-nfs-dev-02.quarry.eqiad1.wikimedia.cloud)
  • 13 have the right value (172.20.255.1) in resolv.conf
  • 5 have 127.0.0.53 in resolv.conf

I will fix the first two.

The five with 127.0.0.53 are on their own -- they're probably Pontoon or some other SRE-guided self-reliance setup.

The 13 with the proper resolv.conf presumably have nameservers cached or configured somewhere else. I'm not sure anything can/should be done about them. They are:

dns1.ldap-dev.eqiad1.wikimedia.cloud,glitchtip.twl.eqiad1.wikimedia.cloud,hupu2.wikidocumentaries.eqiad1.wikimedia.cloud,integration-agent-docker-1040.integration.eqiad1.wikimedia.cloud,mwoffliner[1,3-4].mwoffliner.eqiad1.wikimedia.cloud,neon.rcm.eqiad1.wikimedia.cloud,striker-docker-01.striker.eqiad1.wikimedia.cloud,traffic-cache-atstext-buster.traffic.eqiad1.wikimedia.cloud,ty-analytics.teyora.eqiad1.wikimedia.cloud,webserver.qrank.eqiad1.wikimedia.cloud,wikilabels-03.wikilabels.eqiad1.wikimedia.cloud

I think we can remove the redirects here. If someone has Puppet broken for months and did not react to the cloud-announce email when this was originally announced I'd say they're on their own.

Change 1002446 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] Remove ns-recursor0.openstack.eqiad.wikimediacloud.org names

https://gerrit.wikimedia.org/r/1002446

I think we can remove the redirects here. If someone has Puppet broken for months and did not react to the cloud-announce email when this was originally announced I'd say they're on their own.

I agree.

Change 998780 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: drop temporal NAT for legacy DNS resolvers

https://gerrit.wikimedia.org/r/998780

Change 1002446 merged by Majavah:

[operations/dns@master] Remove ns-recursor0.openstack.eqiad.wikimediacloud.org names

https://gerrit.wikimedia.org/r/1002446

I noticed this morning that this broke new VMs based on images built before the new resolver IP was added. To fix it, I rebuilt and installed a new Bullseye base image and built a new Buster base image in 'testlabs', but disabled the existing Buster image in Toolforge because I'm hopeful that we won't be building any new Buster VMs.

  • 5 have 127.0.0.53 in resolv.conf

Those are probably running systemd-resolved, which runs a caching stub resolver on localhost and forwards queries upstream. You should find the actual IP they're forwarding queries to in /etc/systemd/resolved.conf.
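
A quick way to check (a sketch; the sample file below is illustrative -- on a live VM you'd grep the real /etc/systemd/resolved.conf or run `resolvectl status`):

```shell
# Sample file standing in for /etc/systemd/resolved.conf so the check can
# be tried anywhere; the DNS= value shown is the new recursor address.
cat > /tmp/resolved.conf.sample <<'EOF'
[Resolve]
DNS=172.20.255.1
EOF

# The statically configured upstream, if any:
grep -E '^DNS=' /tmp/resolved.conf.sample
# On a live host, `resolvectl status` additionally shows per-link servers
# learned from DHCP/systemd-networkd.
```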

The 13 with the proper reesolv.conf presumably have nameservers cached or configured someplace else. I'm not sure anything can/should be done about them.

Yeah probably some user-space thing doing DNS directly. Not really sure what you can do.

I agree that removing the redirect/NAT might be the best option so those things break, forcing any that are still maintained to adjust their config.