
cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005
Closed, Resolved, Public

Description

Background

While working through some of the Cloud DNS stuff with @Andrew we noticed that cloudservices1006 is using private IP 10.64.151.4 to send DNS NOTIFY messages to cloudservices1005. We probably should have anticipated this, as that's the IP cloudservices1006 uses by default when trying to reach any external address, including the current cloudservices1005 IP of 208.80.154.148.

In an ideal scenario mdns would use the local public IP for all comms. We've configured the local-address and query-local-address pdns options to force that, but they don't appear to affect DNS NOTIFYs. Checking through the pdns options, I couldn't see a similar way to control the source IP for NOTIFYs (someone with more pdns experience might want to double-check). (EDIT: mdns, not pdns, produces the update / NOTIFY messages and sends them to pdns.)

Right now, when cloudservices1006 sends a NOTIFY to itself, it's using 185.15.56.163 ('lo' interface). But when sending them to cloudservices1005 it picks 10.64.151.4 ('eno12399np0' interface). This is probably just normal kernel IP selection based on outbound interface. Either way it's not using the same IP for all NOTIFYs.
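The kernel's choice can be checked from the host without sending any traffic. A minimal sketch, assuming Python 3 is available on the host; source_address_for is a hypothetical helper name, and the addresses in the closing comment are the ones quoted above, not output captured from the hosts:

```python
import socket

def source_address_for(dest_ip: str, port: int = 53) -> str:
    """Return the local IPv4 address the kernel would pick to reach dest_ip.

    connect() on a UDP socket sends no packets; it only makes the kernel
    do the route lookup and source address selection for that destination.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((dest_ip, port))
        return s.getsockname()[0]

# Example calls to run on cloudservices1006 (results depend on the host's
# routing table, so none are claimed here):
#   source_address_for("185.15.56.163")   # NOTIFY to itself
#   source_address_for("208.80.154.148")  # NOTIFY to cloudservices1005
```

This only shows what an application gets when it does not bind to a specific address before sending; whether mdns can be told to bind to one is a separate question.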

Current Setup

Andrew updated the list of IPs defined as "master" on each host earlier, to ensure updates were allowed both from the local system and its remote peer. This cleared up errors we observed whereby cloudservices1006 was rejecting its own NOTIFYs sent from 185.15.56.163. Right now the masters are set up as follows and all updates are being accepted:

Host               Allowed Masters
cloudservices1005  208.80.154.148 (itself), 10.64.151.4 (cloudservices1006)
cloudservices1006  185.15.56.163 (itself), 208.80.154.148 (cloudservices1005)

Eventual Setup

When we have moved and reimaged cloudservices1005, making it live again for the new ns0 IP 185.15.56.162, we probably need to have the following setup:

Host               Allowed Masters
cloudservices1005  185.15.56.162 (itself), 172.20.1.5 (cloudservices1006)
cloudservices1006  185.15.56.163 (itself), 172.20.2.4 (cloudservices1005)

Ideally we could just have 185.15.56.162 and 185.15.56.163 on both, covering the local and remote system in either case. But instead we need a different pair of IPs on each, as the systems are using different source addresses for local vs. remote updates. Perhaps we could include all 4 IPs on both, if it didn't cause any other issue?
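To make the mismatch concrete, here is a small sketch that models the eventual setup. The source addresses come from the table above; the function names and the idea of checking a NOTIFY against a per-host set are illustrative assumptions, not how pdns implements the check:

```python
HOSTS = ("cloudservices1005", "cloudservices1006")

# (sender, receiver) -> source IP the NOTIFY is expected to arrive from,
# taken from the "Eventual Setup" table in the description.
NOTIFY_SOURCE = {
    ("cloudservices1005", "cloudservices1005"): "185.15.56.162",
    ("cloudservices1006", "cloudservices1005"): "172.20.1.5",
    ("cloudservices1006", "cloudservices1006"): "185.15.56.163",
    ("cloudservices1005", "cloudservices1006"): "172.20.2.4",
}

def required_masters(receiver):
    """Source IPs a receiver must accept so that no NOTIFY is rejected."""
    return {src for (_, rcv), src in NOTIFY_SOURCE.items() if rcv == receiver}

def rejected_notifies(allowed):
    """(sender, receiver) pairs whose source IP is not in the receiver's list."""
    return sorted(
        pair for pair, src in NOTIFY_SOURCE.items() if src not in allowed[pair[1]]
    )

# The "ideal" config: only the two public IPs on both hosts.
public_only = {h: {"185.15.56.162", "185.15.56.163"} for h in HOSTS}

# The fallback suggested above: all four IPs on both hosts.
all_four = {h: set(NOTIFY_SOURCE.values()) for h in HOSTS}
```

Under these assumptions, rejected_notifies(public_only) returns both remote pairs, while rejected_notifies(all_four) returns an empty list.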

NOTE: In the new setup, the destination addresses for updates will be 185.15.56.162 (ns0 / 1005) and 185.15.56.163 (ns1 / 1006). As those addresses are in the 185.15.56.0/24 network, the hosts will use their cloud-private interface to get there, hence the 172.20.x addressing rather than 10.x. This is causing some configuration issues that are tracked in the follow-up task T350995: [openstack] cloudservices + Designate are using different source addresses for local vs. remote updates.

Event Timeline

cmooney triaged this task as Medium priority. (Sep 14 2023, 6:24 PM)
cmooney created this task.
cmooney renamed this task from "cloudservices1006 using eqiad.wmnet address to send NOTIFY updates to cloudservices1005" to "cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005". (Sep 14 2023, 6:30 PM)
cmooney updated the task description.

Btw I'm assuming pdns is actually generating all of these packets. I'm not very familiar with the overall setup and how designate pushes out changes to the two servers. It's entirely possible that some other process is generating the update to the local host, and pdns is only sending the update to the remote one.

Even if it is some other process, without a way to tell it to use the public IP we still need a setup as described in the description.

Background:

Each host has three DNS servers: mdns (which is managed by OpenStack Designate), pdns (auth for the outside world) and pdns-recursor (for VM requests).

The traffic you're seeing is between mdns and pdns. Either mdns service can send updates to both pdns servers, which prompts the pdns servers to make axfr syncs with the mdns that sent the notification.
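For reference, a NOTIFY is an ordinary DNS message with opcode 4 and a single SOA question for the zone (RFC 1996). Below is a minimal hand-built sketch of such a packet; the helper names are hypothetical and this is not how Designate's mdns constructs its messages. Note that the source address this task is about is not part of the payload at all: it comes from the IP header, which is why it depends on how the sending socket is bound and routed.

```python
import struct

OPCODE_NOTIFY = 4   # RFC 1996
QTYPE_SOA = 6
QCLASS_IN = 1

def encode_name(zone: str) -> bytes:
    """Encode a zone name as length-prefixed DNS labels ending in a zero byte."""
    labels = [label for label in zone.rstrip(".").split(".") if label]
    encoded = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels)
    return encoded + b"\x00"

def build_notify(zone: str, msg_id: int = 0x1234) -> bytes:
    """Build a bare NOTIFY request: 12-byte header plus one SOA question."""
    flags = (OPCODE_NOTIFY << 11) | (1 << 10)  # opcode in bits 11-14, AA bit set
    header = struct.pack("!HHHHHH", msg_id, flags, 1, 0, 0, 0)
    question = encode_name(zone) + struct.pack("!HH", QTYPE_SOA, QCLASS_IN)
    return header + question

def opcode_of(packet: bytes) -> int:
    """Extract the opcode from the flags field of a DNS message."""
    (flags,) = struct.unpack("!H", packet[2:4])
    return (flags >> 11) & 0xF
```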

What's misconfigured:

Designate has a 'pool' config that specifies who the mdns servers should notify, and also what the mdns servers should check to convince themselves that the axfr sync happened correctly. This pool config is puppetized as /etc/designate/pool.yaml; it's not a real-time config, though: actually updating Designate requires running a command.

That 10. address is in the current pool config. It's probably wrong, but also everything is changing constantly so I'm inclined to ignore this issue until y'all stop renumbering and reracking servers.

Thanks for the context @Andrew. I was thinking it was something like that; thanks for filling in the gaps.

I guess the big question I have is: is there any way to control what IP mdns uses for updates? Otherwise, is it possible to define the pools.yaml file such that the "allowed masters" include the right IPs on each host, as I mention in the description?

"That 10. address is in the current pool config. It's probably wrong, but also everything is changing constantly so I'm inclined to ignore this issue until y'all stop renumbering and reracking servers."

I think now that cloudservices1005 is offline it doesn't need to be there. But I agree, no point making random changes, let's think it through fully and work out what we need to have in place to arrive at the final config we need.

cmooney updated the task description.

I ran the following statement on the pdns DB on both cloudservices1005 and cloudservices1006:

update domains set master="172.20.1.5:5354 172.20.2.4:5354 185.15.56.162:5354 185.15.56.163:5354";
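Since this manual step may need repeating (for example after a reimage), here is a hedged sketch of generating the same master value from a list of IPs. The domains table and master column come from the statement above; everything else is an assumption: the function names are hypothetical, and sqlite3 is used only so the sketch runs as-is. The real pdns database backend is not stated in this task, and other drivers use a different placeholder syntax than "?".

```python
import sqlite3

PORT = 5354  # port used in the statement above
MASTER_IPS = ["172.20.1.5", "172.20.2.4", "185.15.56.162", "185.15.56.163"]

def master_field(ips, port=PORT):
    """Join ip:port pairs with spaces, matching the format used above."""
    return " ".join(f"{ip}:{port}" for ip in ips)

def set_masters(conn, ips):
    """Set the same master list on every row of the domains table.

    Returns the number of rows updated.
    """
    cur = conn.execute("UPDATE domains SET master = ?", (master_field(ips),))
    conn.commit()
    return cur.rowcount
```

With MASTER_IPS as defined, master_field(MASTER_IPS) reproduces the string from the statement above.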

Change 959377 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] pdns_server: make the webserver address configurable

https://gerrit.wikimedia.org/r/959377

Change 959378 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Update pdns web server to use private IPs

https://gerrit.wikimedia.org/r/959378

Change 959379 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] designate pools.yaml: contact pdns webserver on private IP

https://gerrit.wikimedia.org/r/959379

Change 959377 merged by Andrew Bogott:

[operations/puppet@production] pdns_server: make the webserver address configurable

https://gerrit.wikimedia.org/r/959377

Change 959378 merged by Andrew Bogott:

[operations/puppet@production] Update pdns web server to use private IPs

https://gerrit.wikimedia.org/r/959378

Change 960624 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudservices1006: remove old listen-on address

https://gerrit.wikimedia.org/r/960624

Change 960624 merged by Andrew Bogott:

[operations/puppet@production] cloudservices1006: remove old listen-on address

https://gerrit.wikimedia.org/r/960624

Change 960755 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] designate pools.yaml: contact pdns webserver on private IP

https://gerrit.wikimedia.org/r/960755

Change 959379 merged by Andrew Bogott:

[operations/puppet@production] designate/pdns: refactor a bunch of address settings

https://gerrit.wikimedia.org/r/959379

Change 960755 merged by Andrew Bogott:

[operations/puppet@production] designate pools.yaml: contact pdns webserver on private IP

https://gerrit.wikimedia.org/r/960755

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

I think this is fixed now, right?

I think this has been edited to indicate a different problem that still exists:

"the systems are using different source addresses for local vs. remote updates"

This means that the config in /etc/designate/pool.yaml is setting incorrect values in the database, and "Allowed Masters" need to be reset manually, after a reimage and probably (to be verified) after a new domain is created.

I will update the title of this task.

Or maybe it's easier to create a new task and resolve this one :)