
Mailman PTR records
Open, Needs Triage · Public

Description

During the migration yesterday, we removed the need for two IPs on the lists host that serves mailman. The short story is that lists.wikimedia.org and lists1004.wikimedia.org now point to the same IP, and that IP has PTR records for both names, which might not be ideal.

The old setup

Pre-migration, lists1001 had two IP addresses. One was the primary IP assigned to the host by netbox, and the other was reserved for use by mailman. The forward/reverse records for lists1001.wm.o pointed to the primary IP, and the forward/reverse records for lists.wm.o pointed to the second IP.

The current setup

We initially wanted to move the service IPs across to the new host, but this wouldn't work because the hosts were in different VLANs. Instead, we opted to remove the service IP entirely and only have one v4/v6 address on the new host.

We did a simple replace on the lists.wm.o records and pointed them to the IPs of lists1004 (the new host). This left us with two PTR records on the same IP, one answering lists1004.wm.o and one answering lists.wm.o.

Is this a problem?

In short, we don't really know yet.

We've run into at least one instance with rsync that needed changing: rsyncd does a reverse lookup on the IP address, and since it sometimes got lists1004 and sometimes got lists, it was failing intermittently and needed this fix.

The RFC seems to suggest that having two names is not recommended:

Both the gateway pointers at network nodes and the normal host pointers at full address nodes use the PTR RR to point back to the primary domain names of the corresponding hosts.

and later

Gateways will often have two names in separate domains, only one of which can be primary.

But largely, the concern is email deliverability, and whether having mismatched IPs/PTRs or multiple PTRs would affect either reputation or spam detection.

The big question is whether we need to do anything now, or if it's something we can live with for a number of weeks/months until a better option is found.

Potential solutions

The ideal outcome here is for the IP to have only one PTR, pointing to the hostname (lists1004.wm.o), and to use that name as the outbound MX, while still being able to serve the web UI from lists.wikimedia.org.

Simply changing the MX record to be lists1004 and removing the lists.wm.o PTR doesn't work, since this fails the DNS CI checks (as seen in this change).
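To make the constraint concrete, here is a minimal Python sketch of the kind of consistency check that would reject this change. It assumes the check requires every forward (A/AAAA) entry to have a PTR on the same IP naming the same host; the actual DNS CI checker may work differently. The function name `missing_ptrs` and the address 192.0.2.10 (a documentation address, not the real host IP) are hypothetical.

```python
from collections import defaultdict


def missing_ptrs(forward, reverse):
    """Return the (name, ip) forward entries that have no PTR on `ip` naming `name`.

    forward: list of (name, ip) pairs, i.e. A/AAAA records.
    reverse: list of (ip, name) pairs, i.e. PTR records.
    """
    ptr_names = defaultdict(set)
    for ip, name in reverse:
        ptr_names[ip].add(name)
    return [(name, ip) for name, ip in forward if name not in ptr_names[ip]]


# Hypothetical data: both names resolve to the same (made-up) address.
forward = [
    ("lists1004.wikimedia.org", "192.0.2.10"),
    ("lists.wikimedia.org", "192.0.2.10"),
]
# Current state: one PTR per name on the shared IP.
both_ptrs = [
    ("192.0.2.10", "lists1004.wikimedia.org"),
    ("192.0.2.10", "lists.wikimedia.org"),
]
# Desired state: only the host PTR remains.
host_ptr_only = [("192.0.2.10", "lists1004.wikimedia.org")]

print(missing_ptrs(forward, both_ptrs))      # nothing flagged
print(missing_ptrs(forward, host_ptr_only))  # lists.wikimedia.org is flagged
```

Under that assumption, the forward entry for lists.wikimedia.org is flagged as soon as its PTR is removed, even though the IP still has a valid PTR for the host name.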

Move the UI behind the CDN

This seems to be the best solution long term. We would end up with:

lists.wm.o -> CDN
lists.wm.o MX -> lists1004.wm.o

We would also get the benefit of caching for list archives. It's a non-trivial amount of work to do though.

Re-add service IPs

Returning to the previous setup, with one IP for the host and a separate IP for lists.wm.o, would make the DNS side of this simplest, but it isn't ideal from a network automation point of view.

Ignoring the CI error

The CI error mostly complains that there's no PTR for an address that's added in the zone file. But since we include the zone files from netbox, there will be a PTR for the host on that IP anyway. I'm not sure how easy it is to get the checker to ignore errors like this, though.

Event Timeline

Prior related discussions are in T278495: Figure out plan for mailman IP situation, which can likely be closed once this task is resolved.

We've run into at least one instance with rsync that needed changing: rsyncd does a reverse lookup on the IP address, and since it sometimes got lists1004 and sometimes got lists, it was failing intermittently and needed this fix.

Hmm, it seems rsync's reverse lookup isn't very smart. It is apparently checking only one of the records returned to see if it matches the forward entry.
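A simplified Python model of why that would fail only intermittently. This is not rsync's actual code; it just assumes the daemon compares a single name from the reverse lookup against the configured host, and that the order of PTR records in the answer can vary between queries. The function names and the choice of lists.wikimedia.org as the configured host are hypothetical.

```python
def first_name_allowed(ptr_answer, allowed_host):
    """Model a check that looks only at the first name in the PTR answer."""
    return bool(ptr_answer) and ptr_answer[0] == allowed_host


def any_name_allowed(ptr_answer, allowed_host):
    """Model a check that accepts a match on any name in the PTR answer."""
    return allowed_host in ptr_answer


names = ["lists1004.wikimedia.org", "lists.wikimedia.org"]
allowed = "lists.wikimedia.org"  # hypothetical "hosts allow" entry

# The same two PTR records, returned in either order.
for answer in (names, names[::-1]):
    print(answer[0], first_name_allowed(answer, allowed), any_name_allowed(answer, allowed))
```

In this model the first-name check passes only when the configured name happens to come back first, while a check against every returned name passes regardless of order.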

Two solutions may be:

  1. Resolve the hostname in puppet and pass the IP to the rsync template (also valid in "hosts allow"), so the hostname does not need to be directly in rsyncd.conf
  2. Add a toggle to the puppet class for rsync that allows adding reverse lookup = no to the conf file where desired

Either way I don't think this behaviour in rsync should cause us to make any assumptions about how mail servers behave.

The RFC seems to suggest that having two names is not recommended:

This RFC from the dawn of time probably has been superseded in most respects, certainly when it comes to email and the more recent development of SPF, DKIM etc.

Both the gateway pointers at network nodes and the normal host pointers at full address nodes use the PTR RR to point back to the primary domain names of the corresponding hosts.

That doesn't seem to specifically address multiple PTRs on an IP?

Gateways will often have two names in separate domains, only one of which can be primary.

"Gateway" here means router, so not relevant to this use on a host.

The big question is whether we need to do anything now, or if it's something we can live with for a number of weeks/months until a better option is found.

To my mind it doesn't make sense to take action unless we identify a problem or someone who knows more about this advises us there is one. I see no reason the current configuration cannot remain long term.

In terms of PTRs and mail servers, the closest I could find to a standard is RFC 8601, which we comply with as things stand:

Expressed as an algorithm: If the client peer's IP address is I, the
list of names to which I maps (after a "PTR" query) is the set N, and
the union of IP addresses to which each member of N maps (after
corresponding "A" and "AAAA" queries) is L, then this test is
successful if I is an element of L.

Basically you start with the IP, get all the PTRs, and check that the forward entry for at least one of those hostnames points back at the starting IP. In our case all the hostnames returned in the PTR query point back at the original IP.
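The quoted algorithm can be sketched in Python as follows. The resolver functions are passed in as parameters so the logic can be shown without live DNS; `iprev_passes`, the dictionary-backed resolvers, and the 192.0.2.x addresses (documentation range, not the real host IPs) are all hypothetical.

```python
import ipaddress


def iprev_passes(ip, lookup_ptr, lookup_addrs):
    """Return True if `ip` is among the addresses its PTR names resolve to.

    lookup_ptr(ip) -> iterable of names (the set N in the RFC text).
    lookup_addrs(name) -> iterable of A/AAAA addresses for that name.
    """
    client = ipaddress.ip_address(ip)  # I
    names = lookup_ptr(ip)             # N
    # L: union of all addresses that the names in N resolve to.
    forward = {ipaddress.ip_address(a) for n in names for a in lookup_addrs(n)}
    return client in forward


# Hypothetical records mirroring the current setup: two PTRs on one IP,
# and both names resolving back to that IP.
PTR = {
    "192.0.2.10": ["lists1004.wikimedia.org", "lists.wikimedia.org"],
    "192.0.2.20": ["other.example.org"],
}
ADDRS = {
    "lists1004.wikimedia.org": ["192.0.2.10"],
    "lists.wikimedia.org": ["192.0.2.10"],
    "other.example.org": ["192.0.2.99"],
}


def ptr(ip):
    return PTR.get(ip, [])


def addrs(name):
    return ADDRS.get(name, [])


print(iprev_passes("192.0.2.10", ptr, addrs))  # both names map back to the IP
print(iprev_passes("192.0.2.20", ptr, addrs))  # PTR name maps elsewhere
```

With two PTRs that both resolve back to the starting IP, the test succeeds; it would also succeed if only one of them did, since a single match is enough.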

Simply changing the MX record to be lists1004 and removing the lists.wm.o PTR doesn't work, since this fails the DNS CI checks

Ah ok, I see what you mean. That is a little unfortunate. Ultimately it means we cannot have two forward entries (so that lists1004.wikimedia.org and lists.wikimedia.org both exist) with only a single PTR entry. I expect this is why a second IP was used previously :(