
[tofu] [designate] [pdns] Swapping a CNAME and an A record can cause a loop
Open, Medium · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Start with two DNS records, one with type A and the other one with type CNAME. The CNAME record should point to the A record. Make sure both are managed by OpenTofu.
  • Create a Tofu change that swaps the two records: the CNAME record becomes an A record, and the A record becomes a CNAME record pointing to the new A record.
  • Run tofu apply

What happens?:

  • the OpenStack CLI and Horizon show the correct new values
  • DNS queries for both records fail
  • pdns logs show "got a CNAME referral (from cache) that causes a loop"
  • restarting the pdns recursor fixes the issue

What should have happened instead?:

  • the new records should resolve correctly without a manual restart

Other information:

This happened during T352206: [toolsdb] Upgrade to MariaDB 10.6; the Tofu change was merge_requests/142.

It should be easy to reproduce with test records, to see if it consistently fails or if it's a race condition. I haven't tried to reproduce it yet.

Full pdns error log:

Nov 25 13:51:02 cloudservices1005 pdns-recursor[2773294]: msg="Sending SERVFAIL during resolve" error="got a CNAME referral (from cache) that causes a loop" subsystem="syncres" level="0" prio="Notice" tid="3" ts="1732542662.251" ecs="" mtid="102916031" proto="udp" qname="tools.db.svc.wikimedia.cloud" qtype="A" remote="185.15.56.63:39724"

Event Timeline

Restricted Application added a subscriber: Aklapper.
fnegri triaged this task as Medium priority. · Dec 3 2024, 3:54 PM
fnegri added a subscriber: Andrew.

@Andrew suggested this might be expected: the pdns recursor caches the old record until its TTL expires, and the TTL was quite high for the CNAME record that was deleted in merge_requests/142.
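To make the caching hypothesis concrete, here is a toy model of a caching resolver. It is only a sketch and not how pdns-recursor is actually implemented; the record names, addresses and TTLs (`a.example`, `b.example`, 60 s, 3600 s) are made up for illustration. It shows how an unexpired cached CNAME from before the swap, combined with the new CNAME served after the swap, could produce a loop that clears once the old entry expires.

```python
# Toy caching resolver: a sketch of the hypothesis, NOT pdns-recursor internals.
# All names, addresses and TTLs below are hypothetical.

class LoopError(Exception):
    pass


def resolve(name, cache, authoritative, now):
    """Follow CNAMEs, preferring unexpired cache entries over authoritative data."""
    seen = set()
    while True:
        if name in seen:
            raise LoopError(f"CNAME loop at {name}")
        seen.add(name)
        entry = cache.get(name)
        if entry is None or entry["expires"] <= now:
            # Cache miss or expired entry: ask the authoritative data and cache it.
            rtype, value, ttl = authoritative[name]
            entry = {"type": rtype, "value": value, "expires": now + ttl}
            cache[name] = entry
        if entry["type"] == "A":
            return entry["value"]
        name = entry["value"]  # CNAME: continue with the target name


def simulate():
    cache = {}
    # Before the swap: b is a CNAME (long TTL) pointing to the A record a.
    before = {"a.example": ("A", "192.0.2.1", 60),
              "b.example": ("CNAME", "a.example", 3600)}
    # After the swap: b is the A record, a is a CNAME pointing to b.
    after = {"b.example": ("A", "192.0.2.2", 60),
             "a.example": ("CNAME", "b.example", 3600)}

    results = {"t=0 b.example": resolve("b.example", cache, before, now=0)}
    for name in ("a.example", "b.example"):
        try:
            results[f"t=100 {name}"] = resolve(name, cache, after, now=100)
        except LoopError:
            results[f"t=100 {name}"] = "loop"
    for name in ("a.example", "b.example"):
        # By now the stale CNAME cached at t=0 has expired.
        results[f"t=4000 {name}"] = resolve(name, cache, after, now=4000)
    return results


if __name__ == "__main__":
    for key, value in simulate().items():
        print(key, "->", value)
```

In this model both names fail with a loop shortly after the swap and both resolve again once the old CNAME's TTL has passed, which is the behaviour the hypothesis predicts. Whether the real recursor behaves the same way is exactly what still needs to be tested.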

It should be possible to reproduce this scenario using test records with a shorter TTL, and to check whether the error clears once the TTL expires, without restarting the pdns recursor.
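A minimal sketch of how the check could be scripted, assuming a test record name of your choosing (`swap-test.example` below is a placeholder, not a real record). It polls the name through the system resolver and reports how long it took to resolve again, so the result can be compared with the record's TTL. Note that `socket.gethostbyname` goes through whatever resolver the host is configured to use, so any local caching layer in front of the pdns recursor could skew the measurement.

```python
import socket
import time


def seconds_until_resolves(name, timeout, interval=5.0,
                           resolve=socket.gethostbyname,
                           clock=time.monotonic, sleep=time.sleep):
    """Poll `name` until it resolves.

    Returns the elapsed seconds, or None if `timeout` seconds pass first.
    `resolve`, `clock` and `sleep` are injectable so the loop can be tested
    without real DNS queries.
    """
    start = clock()
    while True:
        try:
            resolve(name)
            return clock() - start
        except OSError:
            # socket.gaierror (raised on lookup failure) is a subclass of OSError.
            pass
        if clock() - start >= timeout:
            return None
        sleep(interval)


if __name__ == "__main__":
    # Placeholder name and timeout; pick a timeout somewhat above the test TTL.
    elapsed = seconds_until_resolves("swap-test.example", timeout=600)
    if elapsed is None:
        print("still failing after timeout")
    else:
        print(f"resolved after {elapsed:.0f}s")
```

If the elapsed time is roughly the old CNAME's TTL and no restart was needed, that would support the caching explanation; if the name keeps failing well past the TTL, something else is going on.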