Verify ATS handling of DNS TTLs
Closed, Resolved, Public

Description

To confirm that we can reduce the amount of time needed to move CDN request routing away from eqiad by lowering DNS TTLs, verify that ATS backends honor TTLs appropriately.

Apache Traffic Server can use DNS TTLs, a hardcoded setting, or a combination of the two, depending on the value of proxy.config.hostdb.ttl_mode. See the documentation for further details; in short, we use the default value (respect the TTL from the DNS response).

This is the theory, and we have this task to verify that practice matches it.
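As a rough illustration of that theory, here is a minimal Python sketch of how the four ttl_mode values select a lifetime for a hostdb entry. It is based on my reading of the ATS documentation, not on the ATS source code. The function name effective_ttl and its parameters are hypothetical, and hostdb_timeout stands in for the hardcoded setting (proxy.config.hostdb.timeout).

```python
def effective_ttl(dns_ttl, hostdb_timeout, ttl_mode=0):
    """Return the lifetime (seconds) a hostdb entry would get.

    Assumed mode semantics, per the ATS documentation:
      0 = obey the TTL from the DNS response (the default, what we use)
      1 = ignore the DNS TTL and use the internal timeout
      2 = use the smaller of the two values
      3 = use the larger of the two values
    """
    if ttl_mode == 0:
        return dns_ttl
    if ttl_mode == 1:
        return hostdb_timeout
    if ttl_mode == 2:
        return min(dns_ttl, hostdb_timeout)
    if ttl_mode == 3:
        return max(dns_ttl, hostdb_timeout)
    raise ValueError(f"unknown ttl_mode: {ttl_mode}")


if __name__ == "__main__":
    # With the default mode, lowering the DNS TTL lowers the entry lifetime.
    print(effective_ttl(dns_ttl=300, hostdb_timeout=86400))
```

If this matches what ATS really does, then with mode 0 the only knob that matters for us is the TTL on the DNS record itself.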

Event Timeline

ema triaged this task as Medium priority. Aug 26 2020, 1:45 PM
ema created this task.
ema added a project: Traffic.

Change 622570 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: trace hostdb handling of TTLs

https://gerrit.wikimedia.org/r/622570

Change 622570 merged by Ema:
[operations/puppet@production] ATS: trace hostdb handling of TTLs

https://gerrit.wikimedia.org/r/622570

Running hostdb_ttls.stp on cp3050, I get the following output:

max age for 10.2.2.22=300, ttl=300
max age for 10.64.4.15=300, ttl=300
max age for 10.2.2.1=300, ttl=300
max age for 10.2.2.17=300, ttl=300
max age for 10.2.2.52=300, ttl=300
max age for 208.80.153.15=307, ttl=300
max age for 10.2.2.32=302, ttl=300
max age for 10.64.32.178=306, ttl=3600
max age for 10.2.2.40=731, ttl=2869
max age for 10.64.53.26=1192, ttl=1359
max age for 10.64.32.174=214, ttl=300
max age for 10.2.2.10=300, ttl=300
max age for 10.64.16.8=300, ttl=300
max age for 10.2.2.18=303, ttl=300
max age for 10.2.2.34=300, ttl=300
max age for 10.64.48.39=348, ttl=300
max age for 10.2.2.33=184, ttl=300
max age for 10.64.32.137=764, ttl=2001
max age for 10.64.32.187=1300, ttl=300
max age for 10.64.0.55=306, ttl=300
max age for 10.192.0.160=248, ttl=2856
max age for 10.64.0.142=261, ttl=300
max age for 10.64.48.26=357, ttl=300

As expected, for origin servers that are accessed often, the maximum age of a cached response is equal or close to the TTL. For those requested less than once per second, there can be discrepancies, given that we're tracing all calls and not only those for which the item is considered fresh.
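To pick out the discrepancies quickly, something like the following Python sketch can be used on the output. It assumes the exact "max age for IP=N, ttl=N" line format shown above. The helper names parse_trace and over_ttl are made up for this example, and SAMPLE holds just five of the lines from the run on cp3050.

```python
import re

# Matches lines such as: max age for 10.2.2.22=300, ttl=300
LINE_RE = re.compile(r"max age for ([\d.]+)=(\d+), ttl=(\d+)")


def parse_trace(lines):
    """Turn trace lines into (ip, max_age, ttl) tuples, skipping other lines."""
    rows = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            ip, age, ttl = match.groups()
            rows.append((ip, int(age), int(ttl)))
    return rows


def over_ttl(rows):
    """Return (ip, seconds past TTL) for entries seen older than their TTL,
    largest excess first."""
    excess = [(ip, age - ttl) for ip, age, ttl in rows if age > ttl]
    return sorted(excess, key=lambda item: item[1], reverse=True)


SAMPLE = """\
max age for 10.2.2.22=300, ttl=300
max age for 208.80.153.15=307, ttl=300
max age for 10.64.32.178=306, ttl=3600
max age for 10.64.32.187=1300, ttl=300
max age for 10.64.48.26=357, ttl=300
"""

if __name__ == "__main__":
    for ip, extra in over_ttl(parse_trace(SAMPLE.splitlines())):
        print(f"{ip}: seen {extra}s past its TTL")
```

On the sample, 10.64.32.187 stands out at 1000 seconds past its TTL, which fits the explanation above: the probe also sees lookups of entries that are no longer fresh.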

Another piece of perhaps even more convincing evidence is that the DNS record for appservers-rw.discovery.wmnet is refreshed every 5 minutes (300s):

13:53:02 ema@cp3050.esams.wmnet:~
$ sudo timeout 600 tcpdump 'udp and dst port 53' | grep appservers
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens2f0np0, link-type EN10MB (Ethernet), capture size 262144 bytes
13:53:36.273411 IP cp3050.esams.wmnet.47128 > recdns.anycast.wmnet.domain: 64398+ A? appservers-rw.discovery.wmnet. (47)
13:58:36.298725 IP cp3050.esams.wmnet.47128 > recdns.anycast.wmnet.domain: 10803+ A? appservers-rw.discovery.wmnet. (47)
207 packets captured
498 packets received by filter
0 packets dropped by kernel
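The interval between the two queries can be checked from the capture timestamps. This is a small Python sketch under the assumption that each tcpdump line starts with an HH:MM:SS.ffffff timestamp, as in the output above. The function name query_intervals is hypothetical, and a capture spanning midnight is not handled.

```python
from datetime import datetime


def query_intervals(lines, name):
    """Return the seconds elapsed between consecutive A queries for `name`."""
    times = [
        datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        for line in lines
        if " A? " in line and name in line
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


CAPTURE = [
    "13:53:36.273411 IP cp3050.esams.wmnet.47128 > recdns.anycast.wmnet.domain: "
    "64398+ A? appservers-rw.discovery.wmnet. (47)",
    "13:58:36.298725 IP cp3050.esams.wmnet.47128 > recdns.anycast.wmnet.domain: "
    "10803+ A? appservers-rw.discovery.wmnet. (47)",
]

if __name__ == "__main__":
    for seconds in query_intervals(CAPTURE, "appservers-rw.discovery.wmnet"):
        print(f"{seconds:.3f}s between queries")
```

For the two packets above this gives an interval of roughly 300.025 seconds, i.e. the 300s TTL plus a few milliseconds.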

It seems beyond reasonable doubt that our ATS backend setup does honor DNS TTLs. Leaving the task open for now to see if @RLazarus or anyone else wants to perform further tests.

@ema as one of the requesters for this test, thanks a lot for the effort. It looks like we're in good shape here.

Thank you for raising the question and confirming that the answer is satisfactory. Closing!