cloud-private subnet: introduce new domain
Closed, ResolvedPublic

Description

In T332191: Decision request - Choose a subdomain for new cloud-private subnets, it was decided that a new domain will be introduced for the cloud-private subnet/vlan.

This task is to track work on the implementation.

Once implemented, this should help us tidy up the puppet code, which currently hardcodes IP addresses where it could use FQDNs instead.

Event Timeline

aborrero triaged this task as Medium priority. May 2 2023, 11:05 AM
aborrero created this task.

Change 914310 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimedia.cloud: add new codfw.hw.wikimedia.cloud addresses

https://gerrit.wikimedia.org/r/914310

aborrero renamed this task from "cloudlb: introduce new domain" to "cloud-private subnet: introduce new domain". May 2 2023, 11:35 AM

Quick IRC chat with @Volans about automating this from netbox.

The following things will need to be updated:

  • in the provision netbox script that assigns IPs and sets their FQDNs in netbox
  • in the netbox->dns generation script as it might have assumptions on the number of DNS records per host and similar things
  • [just to check] the decommission cookbook to ensure it clears out those additional IPs too, but I think it should do it

also cc'ing @jbond and @ayounsi for any additional insights.

For the sake of unblocking the BGP work, I'll merge the patch as is, while we work on the netbox integration in parallel.

Change 914310 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimedia.cloud: add new codfw.hw.wikimedia.cloud addresses

https://gerrit.wikimedia.org/r/914310

Another topic is reverse PTR record generation.

The addressing is 172.16.x.y and that is under designate @ openstack control, which means we would need to delegate the CIDR from designate into ns[0-2].wikimedia.org.

It seems the cloud DNS is only authoritative for 16.172.in-addr.arpa:

cathal@officepc:~$ dig +short SOA 172.in-addr.arpa @ns0.openstack.codfw1dev.wikimediacloud.org.
cathal@officepc:~$ dig +short SOA 16.172.in-addr.arpa @ns0.openstack.codfw1dev.wikimediacloud.org.
ns0.openstack.codfw1dev.wikimediacloud.org. root.wmflabs.org. 1683029987 3505 600 86400 3600

So adding NS entries for 20.172.in-addr.arpa pointing to the prod DNS should be feasible.
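
For reference, the zone and record names involved can be derived mechanically from the addressing. The snippet below is only a sketch: the helper names `reverse_zone` and `ptr_name` are made up for illustration and are not part of any existing script, and it assumes IPv4 prefixes on an octet boundary (/8, /16, /24).

```python
import ipaddress


def reverse_zone(cidr: str) -> str:
    """Return the in-addr.arpa zone name for an octet-aligned IPv4 prefix.

    Hypothetical helper for illustration; only /8, /16 and /24 are handled.
    """
    net = ipaddress.ip_network(cidr)
    if net.version != 4 or net.prefixlen not in (8, 16, 24):
        raise ValueError(f"unsupported prefix: {cidr}")
    # Keep only the octets covered by the prefix, then reverse their order.
    octets = str(net.network_address).split(".")[: net.prefixlen // 8]
    return ".".join(reversed(octets)) + ".in-addr.arpa"


def ptr_name(address: str) -> str:
    """Return the PTR owner name for a single IPv4 address."""
    return ipaddress.ip_address(address).reverse_pointer


# The range designate answers for, and the one to delegate to prod DNS.
print(reverse_zone("172.16.0.0/16"))  # 16.172.in-addr.arpa
print(reverse_zone("172.20.0.0/16"))  # 20.172.in-addr.arpa
print(ptr_name("172.20.5.1"))         # 1.5.20.172.in-addr.arpa
```

Since 1.5.20.172.in-addr.arpa sits under 20.172.in-addr.arpa and not under 16.172.in-addr.arpa, a separate delegation for the 172.20 range does not collide with the zone designate already serves.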

The following things will need to be updated:

  • in the provision netbox script that assigns IPs and sets their FQDNs in netbox

Yeah. We have several systems (LVS, Ganeti and Cloud servers) that have additional interfaces that only get added to Netbox from PuppetDB after the OS install and first puppet run.

For the others, DNS has not been an issue, but here it is: the DNS records need to be in place before the first puppet run if Puppet is going to use them when configuring the host (Bird local IP etc.)

Once T296832 is closed out I had planned to investigate how to change the Netbox server provision script to add options for these special-case servers, pre-populating the additional interface and assigning IPs to them. Ultimately if we get that far we are in a position where all network config can be done at OS install time, based on Netbox vars, and the PuppetDB -> Netbox import script merely renames interfaces if needed.

As an alternative to having that script populate IPs and DNS names in advance, I tested 'reserving' IPs in Netbox manually. It seems that if a reserved IP has a DNS name, it's not created by the cookbook (makes sense), so even doing that wouldn't get around the chicken-and-egg problem.

Change 914751 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] templates: add 20.172.in-addr.arpa

https://gerrit.wikimedia.org/r/914751

Change 917296 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimedia.cloud: refresh cloud-private vlan subdomain

https://gerrit.wikimedia.org/r/917296

Change 917297 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud-private: refresh domain

https://gerrit.wikimedia.org/r/917297

Change 917296 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimedia.cloud: refresh cloud-private vlan subdomain

https://gerrit.wikimedia.org/r/917296

Change 917297 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud-private: refresh domain

https://gerrit.wikimedia.org/r/917297

So @Volans did the netbox integration during the Wikimedia Hackathon 2023 in Athens.

Additional question: do we integrate with the host provisioning script in netbox for additional automation? This should be helpful if we are really planning to go beyond this PoC and have the next hundred hosts provisioned with a cloud-private address in the next few years.

note: merge this patch next week when we are back from the hackathon: https://gerrit.wikimedia.org/r/c/operations/dns/+/914751/

Change 914751 merged by Cathal Mooney:

[operations/dns@master] templates: convert 172.20.5.0/24 to Netbox

https://gerrit.wikimedia.org/r/914751

@aborrero DNS patch is now merged, Netbox-generated records being served as expected:

cmooney@wikilap:~$ dig +noall +answer PTR 1.5.20.172.in-addr.arpa @ns0.wikimedia.org
1.5.20.172.in-addr.arpa. 3600	IN	PTR	cloudsw.private.codfw.wikimedia.cloud.
cmooney@wikilap:~$ dig +noall +answer A cloudsw.private.codfw.wikimedia.cloud. @ns0.wikimedia.org
cloudsw.private.codfw.wikimedia.cloud. 3600 IN A 172.20.5.1

Something is still not working as expected:

aborrero@cloudlb2001-dev:~ $ host cloudsw.private.codfw.wikimedia.cloud
cloudsw.private.codfw.wikimedia.cloud has address 172.20.5.1
aborrero@cloudlb2001-dev:~ $ host 172.20.5.1
Host 1.5.20.172.in-addr.arpa. not found: 3(NXDOMAIN)

I haven't had time to debug this further yet.
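
The same forward/reverse round trip can be scripted, so it is easy to repeat from several hosts. This is a rough sketch, not an existing tool: `check_round_trip` and the two default resolver helpers are names invented here, and the defaults simply use whatever resolver the host is configured with.

```python
import socket
from typing import Callable


def system_forward(name: str) -> str:
    """Resolve a name to an IPv4 address via the host's configured resolver."""
    return socket.gethostbyname(name)


def system_reverse(address: str) -> str:
    """Resolve an address back to its primary hostname."""
    return socket.gethostbyaddr(address)[0]


def check_round_trip(
    name: str,
    forward: Callable[[str], str] = system_forward,
    reverse: Callable[[str], str] = system_reverse,
) -> str:
    """Report whether name -> address -> name resolves consistently."""
    try:
        address = forward(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return f"forward lookup failed for {name}"
    try:
        ptr = reverse(address)
    except OSError:  # socket.herror is a subclass of OSError
        return f"reverse lookup failed for {address}"
    # Compare case-insensitively and ignore a trailing root dot.
    if ptr.rstrip(".").lower() != name.rstrip(".").lower():
        return f"mismatch: {address} points back to {ptr}"
    return "ok"
```

For example, `check_round_trip("cloudsw.private.codfw.wikimedia.cloud")` run on cloudlb2001-dev would be expected to report the reverse lookup failure seen above.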

Forward records work as expected via our recursive DNS:

cmooney@cloudlb2001-dev:~$ dig +nsid A cloudsw.private.codfw.wikimedia.cloud

; <<>> DiG 9.16.37-Debian <<>> +nsid A cloudsw.private.codfw.wikimedia.cloud
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12955
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; NSID: 64 6e 73 32 30 30 36 ("dns2006")
;; QUESTION SECTION:
;cloudsw.private.codfw.wikimedia.cloud. IN A

;; ANSWER SECTION:
cloudsw.private.codfw.wikimedia.cloud. 3595 IN A 172.20.5.1

;; Query time: 4 msec
;; SERVER: 10.3.0.1#53(10.3.0.1)
;; WHEN: Wed May 31 13:33:22 UTC 2023
;; MSG SIZE  rcvd: 93

Reverse records aren't working unless you query the authdns servers directly though. I suspect we may need to change something here to tell recdns that nsX.wikimedia.org is auth for 20.172.in-addr.arpa.

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/dnsrecursor/manifests/init.pp#68
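
To illustrate the kind of change meant here: if the recursor is PowerDNS Recursor (an assumption, this task does not say so), a zone can be pointed at specific authoritative servers with the `forward-zones` setting, which takes `zone=addr;addr` entries separated by commas. The sketch below only renders such a line; `render_forward_zones` is a made-up helper, and the server addresses are documentation placeholders (192.0.2.0/24), not the real nsX.wikimedia.org addresses. What the dnsrecursor puppet module actually does may differ.

```python
from typing import Mapping, Sequence


def render_forward_zones(zones: Mapping[str, Sequence[str]]) -> str:
    """Render a forward-zones line: zone=addr;addr entries joined by commas.

    Hypothetical helper; the output follows the PowerDNS Recursor
    forward-zones syntax, which is assumed to be the relevant config here.
    """
    entries = []
    for zone, servers in sorted(zones.items()):
        if not servers:
            raise ValueError(f"no servers given for zone {zone}")
        entries.append(f"{zone}={';'.join(servers)}")
    return "forward-zones=" + ", ".join(entries)


# Placeholder addresses standing in for the authoritative servers.
print(render_forward_zones({
    "20.172.in-addr.arpa": ["192.0.2.1", "192.0.2.2", "192.0.2.3"],
}))
```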

Change 924954 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] dnsrecursor: add reverse DNS for cloud-private

https://gerrit.wikimedia.org/r/924954

Change 924954 merged by Cathal Mooney:

[operations/puppet@production] dnsrecursor: add reverse DNS for cloud-private

https://gerrit.wikimedia.org/r/924954

All looks good after the merge, closing task.

cmooney@cloudlb2001-dev:~$ dig +nsid NS 20.172.in-addr.arpa 

; <<>> DiG 9.16.37-Debian <<>> +nsid NS 20.172.in-addr.arpa
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63781
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; NSID: 64 6e 73 32 30 30 36 ("dns2006")
;; QUESTION SECTION:
;20.172.in-addr.arpa.		IN	NS

;; ANSWER SECTION:
20.172.in-addr.arpa.	86400	IN	NS	ns1.wikimedia.org.
20.172.in-addr.arpa.	86400	IN	NS	ns0.wikimedia.org.
20.172.in-addr.arpa.	86400	IN	NS	ns2.wikimedia.org.

;; Query time: 0 msec
;; SERVER: 10.3.0.1#53(10.3.0.1)
;; WHEN: Wed May 31 15:15:51 UTC 2023
;; MSG SIZE  rcvd: 126
cmooney@cloudlb2001-dev:~$ dig +nsid -x 172.20.5.1

; <<>> DiG 9.16.37-Debian <<>> +nsid -x 172.20.5.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44473
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; NSID: 64 6e 73 32 30 30 36 ("dns2006")
;; QUESTION SECTION:
;1.5.20.172.in-addr.arpa.	IN	PTR

;; ANSWER SECTION:
1.5.20.172.in-addr.arpa. 3600	IN	PTR	cloudsw.private.codfw.wikimedia.cloud.

;; Query time: 0 msec
;; SERVER: 10.3.0.1#53(10.3.0.1)
;; WHEN: Wed May 31 15:16:04 UTC 2023
;; MSG SIZE  rcvd: 114

aborrero added a subscriber: taavi.

Thank you both, @taavi and @cmooney.

However, something is still not quite right with this:

aborrero@puppetmaster2001:~ $ host cloudcontrol2004-dev.private.codfw.wikimedia.cloud
Host cloudcontrol2004-dev.private.codfw.wikimedia.cloud not found: 3(NXDOMAIN)
aborrero@cloudlb2002-dev:~ $ host cloudcontrol2004-dev.private.codfw.wikimedia.cloud
cloudcontrol2004-dev.private.codfw.wikimedia.cloud has address 172.20.5.6
aborrero@cloudcontrol2001-dev:~ $ host cloudcontrol2004-dev.private.codfw.wikimedia.cloud
cloudcontrol2004-dev.private.codfw.wikimedia.cloud has address 172.20.5.6

Never mind, it works now. It just took a long time for the puppetmaster to see the record.

Re-opening to add new reverse delegation for 172.20.254.0/24

Change 928543 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add reverse DNS zone for 172.20.254.0/24 cloud-private vip range

https://gerrit.wikimedia.org/r/928543

Change 928543 merged by Cathal Mooney:

[operations/dns@master] Add include in 20.172.in-addr.arpa for 172.20.254.0/24 netbox records

https://gerrit.wikimedia.org/r/928543