authdns-update has become noticeably slower over the past few months. We don't have raw numbers because we don't measure this directly, but we did add SAL logging for authdns-update START and END events, which you can see at https://sal.toolforge.org/production?p=0&q=authdns-update&d=. On average, authdns-update takes about 2-3 minutes to complete, and DNS changes are not considered live until it has finished running on all 16 DNS hosts. So for a change to go live, there is a base delay of 2-3 minutes on top of whatever TTL applies, which is not ideal.
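Since the SAL START and END entries are the only timing data we have, one way to get rough numbers is to pair them up and compute the elapsed time per run. The following is a minimal sketch, assuming the entries have been exported as (timestamp, user, message) tuples in chronological order. The message wording, the user names, and the sample rows are made up for illustration and are not real SAL entries or measurements.

```python
from datetime import datetime


def update_durations(events):
    """Pair START/END authdns-update events per user and return durations in seconds.

    `events` is an iterable of (iso_timestamp, user, message) tuples in
    chronological order. The tuple layout and message prefixes are assumptions.
    """
    open_runs = {}   # user -> datetime of the most recent unmatched START
    durations = []
    for ts, user, message in events:
        if "authdns-update" not in message:
            continue
        when = datetime.fromisoformat(ts)
        if message.startswith("START"):
            open_runs[user] = when
        elif message.startswith("END") and user in open_runs:
            durations.append((when - open_runs.pop(user)).total_seconds())
    return durations


# Hypothetical sample data, not taken from the real SAL.
sample = [
    ("2024-05-01T10:00:00", "alice", "START - running authdns-update"),
    ("2024-05-01T10:02:30", "alice", "END - running authdns-update"),
    ("2024-05-02T09:00:00", "bob", "START - running authdns-update"),
    ("2024-05-02T09:03:00", "bob", "END - running authdns-update"),
]

durations = update_durations(sample)
print(f"runs: {len(durations)}, mean: {sum(durations) / len(durations):.1f}s")
```

Keying the open runs by user keeps overlapping runs from different people from being paired with each other. An END without a matching START is ignored.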
The DNS admin cookbook for depooling sites (sre.dns.admin) works differently (confd plus a direct gdnsd reload), so we are not worried about the time it takes to depool a site in an emergency. But in general, running authdns-update should not take this long, and since it has become noticeably slower over the past few months, we should find out why and try to optimize it.
In I31abc9e26e0e006096d43c9c4f997001ff68da39, we bumped the clush timeout to 90 seconds because 45 seconds was no longer enough for authdns-local-update to complete, so this has been a problem for a while.
Part of the reason is simply the growth in the number of zone files. At the time of writing:
Assembling and testing data in /tmp/dns-check.7umy3nq_
 -- Generating zonefiles from zone templates
 -- Processed 625 zones into directory /tmp/dns-check.7umy3nq_/zones

real 2m32.029s
user 0m0.740s
sys 0m0.255s
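One thing worth noting in the timing above is how small user and sys are compared to real. The sketch below only does the arithmetic on those three figures. The interpretation is an assumption rather than a finding: a low CPU share suggests the run mostly waits (on I/O, the network, or work not attributed to the timed process) instead of computing, but it does not say where.

```python
import re


def parse_time(value):
    """Convert a shell `time` value such as '2m32.029s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Figures copied from the output above.
real = parse_time("2m32.029s")
user = parse_time("0m0.740s")
sys_time = parse_time("0m0.255s")

cpu_seconds = user + sys_time
cpu_fraction = cpu_seconds / real

print(f"wall clock: {real:.3f}s, CPU: {cpu_seconds:.3f}s, "
      f"CPU share: {cpu_fraction:.2%}")
```

On these numbers the CPU share comes out below 1%, so profiling CPU usage of the timed command alone is unlikely to explain the 2.5 minutes.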
How much of the slowdown comes from the simple increase in zone files, and how much from the $INCLUDE snippets for Netbox, remains to be seen. Researching this and making the resulting improvements will form the bulk of the work for this task; both are also documented in T362985.
The other thing we can try is some Git cleanup: run git maintenance run and see if it helps. We ran it on dns1004 with some reported success, but we haven't run it anywhere else.
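To turn "some success" into a number on the remaining hosts, we could time a Git operation before and after the maintenance run. This is a rough sketch, not what was done on dns1004. The probe command (git status) is a stand-in assumption, since it may not be the Git operation that dominates authdns-update, and the repository path is whatever checkout the host uses.

```python
import subprocess
import time


def time_command(cmd, cwd):
    """Run `cmd` in `cwd` and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True, capture_output=True)
    return time.perf_counter() - start


def compare_maintenance(repo, probe=("git", "status")):
    """Time `probe` in `repo` before and after `git maintenance run`.

    `probe` is a hypothetical stand-in for whichever Git step is slow.
    Returns a (before_seconds, after_seconds) tuple.
    """
    before = time_command(list(probe), repo)
    time_command(["git", "maintenance", "run"], repo)
    after = time_command(list(probe), repo)
    return before, after
```

Calling compare_maintenance() with the path to the DNS repository checkout on a host returns the two timings. A single sample is noisy (the second probe also benefits from a warm cache), so repeating the probe a few times on each side would give a fairer comparison.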