Improving the time it takes to run authdns-update
Closed, Resolved, Public

Description

authdns-update has been getting noticeably slower over the past few months. I don't have raw numbers since we don't really measure this, but we did add SAL logging for START and END authdns-update events, so you can see these in https://sal.toolforge.org/production?p=0&q=authdns-update&d=. On average, authdns-update takes about 2-3 minutes to complete, and DNS changes are not considered live until it has finished running on all 16 DNS hosts. So for a change to go live, there is a base delay of 2-3 minutes on top of whatever TTL applies, which is not ideal.
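Since the START and END events are already logged, run durations could in principle be derived from them. The following is a minimal sketch, not an existing tool: it assumes the events have already been extracted into (timestamp, key, kind) tuples, where the key identifies who or what started the run. That tuple layout and the function name `pair_durations` are hypothetical; how to extract the events from SAL is not covered here.

```python
from datetime import datetime


def pair_durations(events):
    """Pair each START with the next END for the same key.

    `events` is an iterable of (iso_timestamp, key, kind) tuples, with kind
    being "START" or "END". This layout is an assumption for illustration.
    Returns the duration of each completed run in seconds.
    """
    open_runs = {}
    durations = []
    # ISO 8601 timestamps in the same format sort chronologically as strings.
    for stamp, key, kind in sorted(events):
        # Accept a trailing "Z" on Python versions older than 3.11.
        when = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if kind == "START":
            # A later START replaces an earlier one that never finished.
            open_runs[key] = when
        elif kind == "END" and key in open_runs:
            durations.append((when - open_runs.pop(key)).total_seconds())
    return durations
```

An END without a matching START is ignored, so partial log windows do not produce bogus durations.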

The DNS admin cookbook to depool sites (sre.dns.admin) works differently (confd + direct gdnsd reload), so we are not worried about the time it takes to depool sites in an emergency. But in general, running authdns-update should not take this long, and since it has become noticeably slower over the past few months, we should look into why and try to optimize it.

In I31abc9e26e0e006096d43c9c4f997001ff68da39, we bumped the clush timeout to 90 seconds because the previous 45 seconds was not enough for authdns-local-update to complete, so this has been a problem for a while.

Part of the reason is simply the increase in the number of zone files. At the time of writing:

Assembling and testing data in /tmp/dns-check.7umy3nq_
 -- Generating zonefiles from zone templates
 -- Processed 625 zones into directory /tmp/dns-check.7umy3nq_/zones
real	2m32.029s
user	0m0.740s
sys	0m0.255s
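One thing the timing above already shows is how little of the wall-clock time is CPU time. As a rough sketch (the helper names `parse_time_output` and `cpu_fraction` are made up for illustration), the three lines can be parsed and compared like this:

```python
import re

# Matches lines such as "real\t2m32.029s" in the shell's `time` output format.
_TIME_LINE = re.compile(r"^(real|user|sys)\s+(?:(\d+)m)?([\d.]+)s$")


def parse_time_output(text):
    """Return a dict of seconds keyed by 'real', 'user' and 'sys'."""
    result = {}
    for line in text.splitlines():
        match = _TIME_LINE.match(line.strip())
        if match:
            name, minutes, seconds = match.groups()
            result[name] = int(minutes or 0) * 60 + float(seconds)
    return result


def cpu_fraction(times):
    """Share of wall-clock time accounted for by user + sys CPU time."""
    return (times["user"] + times["sys"]) / times["real"]


# The figures quoted above.
sample = "real\t2m32.029s\nuser\t0m0.740s\nsys\t0m0.255s\n"
times = parse_time_output(sample)
print(f"wall {times['real']:.3f}s, CPU share {cpu_fraction(times):.2%}")
```

With about 1 second of user + sys time against roughly 152 seconds of wall-clock time, the timed command appears to spend nearly all of its time waiting rather than computing. That does not say what it is waiting on, only that raw CPU work in the timed process is probably not the bottleneck.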

How much of the problem is due to the simple increase in zone files, and how much to the $INCLUDE snippets for Netbox, remains to be seen. This research and any resulting improvements will form the bulk of the work for this task and are also documented in T362985.

The other thing we can do is try some Git cleanup by running git maintenance run and seeing if that helps. We ran it on dns1004 with some success, but we haven't run it anywhere else.
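To judge whether the cleanup changes anything on a given host, the repository state could be recorded before and after the run. This is a sketch only: it relies on `git count-objects -v`, which prints `key: value` lines such as `count`, `in-pack` and `packs`, and the helper names `parse_count_objects` and `repo_stats` are hypothetical.

```python
import subprocess


def parse_count_objects(text):
    """Parse `git count-objects -v` output into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Keep only well-formed "key: <integer>" lines.
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value.strip())
    return stats


def repo_stats(repo_path):
    """Run `git count-objects -v` in repo_path and return the parsed counters."""
    completed = subprocess.run(
        ["git", "-C", repo_path, "count-objects", "-v"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_count_objects(completed.stdout)
```

Calling `repo_stats` on the repository before and after git maintenance run and comparing `count` (loose objects) and `packs` would show how much the run consolidated. Whether those counters track the authdns-update runtime is an assumption that would need checking against actual timings.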

Event Timeline

ssingh triaged this task as Medium priority. May 7 2025, 2:23 PM

Mentioned in SAL (#wikimedia-operations) [2025-05-07T15:06:14Z] <sukhe> sudo cumin -b1 -s10 'A:dnsbox' 'sudo -u authdns git -C /srv/authdns/git maintenance run' T393602

So we have trimmed it down even with a simple git maintenance run:

real	1m2.831s
user	0m0.703s
sys	0m0.263s
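For reference, the improvement can be put into numbers. The arithmetic below uses only the two `real` values quoted in this task and assumes both timings come from comparable runs of the same command:

```python
def to_seconds(minutes, seconds):
    """Convert the "XmY.YYYs" form used by `time` into seconds."""
    return minutes * 60 + seconds


before = to_seconds(2, 32.029)  # real 2m32.029s, before git maintenance run
after = to_seconds(1, 2.831)    # real 1m2.831s, after git maintenance run

saved = before - after
reduction = saved / before
speedup = before / after

print(f"before {before:.3f}s, after {after:.3f}s, saved {saved:.3f}s")
print(f"reduction {reduction:.1%}, speedup {speedup:.2f}x")
```

That is a saving of about 89 seconds, or roughly 2.4 times faster. User and sys times are essentially unchanged between the two runs, so the saving is almost entirely in wall-clock waiting time.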

This is definitely some progress but we should continue looking.

Thanks for the task!

On the general issue, tbh I don't have much insight into what is causing the increase. Regarding T362985, I'd note that implementing it will NOT reduce the number of zones or the total number of records. It will significantly reduce the number of INCLUDE statements (referencing other files) in our zones, so if that is a factor, it ought to help.

Change #1143593 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns:auth::update: add timer for monthly git maintenance run

https://gerrit.wikimedia.org/r/1143593

Change #1143593 merged by Ssingh:

[operations/puppet@production] P:dns:auth::update: add timer for monthly git maintenance run

https://gerrit.wikimedia.org/r/1143593

ssingh claimed this task.

The gc-authdns-git-repo.timer has been running monthly and has significantly reduced the time it takes to run authdns-update. Given that, together with the sre.dns.admin cookbook for site depools and the reduction of the dyna TTL, I think we can mark this as resolved for now and come back to it later if required. It now takes about a minute to run, and I think that's acceptable.