In time-critical DC failover scenarios, we can often respond quickly with a DNS change to re-route users, but the 10-minute TTL is the limiting factor in restoring service to them. Lowering the TTL would reduce user-facing downtime in these scenarios. I think 5 minutes is a reasonable target; going any lower might raise other challenges that we're not yet ready to face. Even getting from 10 minutes to 5 minutes raises a couple of issues to consider:
- This raises the importance of AuthDNS server reliability. TTLs below 10 minutes start to approach hardware reboot times after a hard crash, etc. Therefore, we should probably block this on making each AuthDNS site redundant via LVS (T101525).
- This could roughly double our AuthDNS traffic, so we need to take a careful look at how close we are to any capacity limits there before we hit loss or latency issues. We're already serving ~1K DNS requests/sec on each of the 3 AuthDNS servers.
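
For a rough sense of where the "roughly double" figure comes from, here is a back-of-the-envelope sketch. It assumes that a caching resolver re-queries a popular name about once per TTL, so the TTL-bound share of authoritative traffic scales inversely with the TTL. The function name and the `ttl_bound_fraction` parameter are hypothetical and only for illustration; the real share of traffic that is TTL-bound is unknown and would have to be measured.

```python
def projected_qps(current_qps, old_ttl_s, new_ttl_s, ttl_bound_fraction=1.0):
    """Estimate the per-server query rate after a TTL change.

    ttl_bound_fraction is an assumed share (0..1) of current traffic that
    comes from caches refreshing once per TTL. Only that share scales with
    old_ttl / new_ttl; the rest is treated as unaffected by the TTL.
    """
    if old_ttl_s <= 0 or new_ttl_s <= 0:
        raise ValueError("TTLs must be positive")
    if not 0.0 <= ttl_bound_fraction <= 1.0:
        raise ValueError("ttl_bound_fraction must be between 0 and 1")
    scale = old_ttl_s / new_ttl_s
    return current_qps * ((1.0 - ttl_bound_fraction) + ttl_bound_fraction * scale)


# Figures from this task: ~1K requests/sec per server, 3 servers, 10 min -> 5 min.
worst_case_per_server = projected_qps(1000, 600, 300)
worst_case_total = 3 * worst_case_per_server

print(f"worst case per server: {worst_case_per_server:.0f} req/sec")
print(f"worst case across 3 servers: {worst_case_total:.0f} req/sec")
```

With every query assumed to be TTL-bound, this gives the upper bound of 2x. Resolvers that look up our names less often than once per 10 minutes would not query any more often at 5 minutes, so the real increase is likely somewhat below that.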