
Lower geodns TTLs from 600 (10min) to 300 (5min)
Closed, ResolvedPublic

Description

In time-critical DC failover scenarios, we're often able to respond quickly with a DNS change to re-route users, but the 10-minute TTL window is the limiter on restoring service to users. Lowering the TTL would reduce the user-facing downtimes in these scenarios. I think 5 minutes is a reasonable target; going any lower might raise other challenges and issues that we're not yet ready to face. Even to get from 10 minutes to 5 minutes, there are a couple of issues to consider:

  1. This raises the importance of AuthDNS server reliability. Times lower than 10 minutes are getting closer to hardware reboot times after a hard crash, etc. Therefore, we should probably block this on making each AuthDNS site redundant via LVS (T101525).
  2. This could roughly double our AuthDNS traffic, so we need to take a careful look at how close we are to any kind of limitations there before we hit loss or latency issues. We're already serving ~1K/sec DNS requests on each of the 3x AuthDNS servers (see the rough scaling sketch below).
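As a rough illustration of point 2, here is a minimal back-of-the-envelope sketch; the resolver population is backed out of the ~1K/sec figure above purely for illustration, and real resolver behaviour is messier:

```
# Crude model: a steady population of recursive resolvers, each re-resolving
# a name roughly once per TTL, produces an authoritative query rate that
# scales inversely with the TTL.  All numbers are illustrative.

def authdns_qps(active_resolvers: int, ttl_seconds: int) -> float:
    """Approximate steady-state queries/sec hitting the authoritative side."""
    return active_resolvers / ttl_seconds

current_qps_per_server = 1000                      # ~1K/sec per server, per the description
implied_resolvers = current_qps_per_server * 600   # back out the population at TTL=600

for ttl in (600, 300):
    print(f"TTL={ttl}s -> ~{authdns_qps(implied_resolvers, ttl):.0f} req/s per server")
# TTL=600s -> ~1000 req/s per server
# TTL=300s -> ~2000 req/s per server, i.e. roughly double
```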

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.

It would also be good to know that a single server will handle the increased load when degraded. Adjusting the TTL before adding redundancy/capacity may be advantageous in that it could highlight unexpected issues that could cause cascading failure later. Ratcheting the TTL downward by 50 or 100, monitoring for a few days, lather, rinse and repeat is one approach.

We're probably fine on existing capacity to handle failover at 600, and even at 300. We've had authdns server outages before, and the stats are pretty simple to interpret in general. It is, of course, best to test those assumptions! :)

The existing stats on DNS reqs (which I believe are, confusingly, in units of requests/5min) are here: https://grafana.wikimedia.org/dashboard/db/dns . Those put us at current weekly peaks around ~4.5K/sec over the three servers, or ~1.5K/sec per server if averaged out. Those systems run at a bit under 2% avg cpu utilization (e.g. https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=radon&var-datasource=eqiad%20prometheus%2Fops ). Even if all 4.5K/sec ended up on a single server, and there were some ugly scaling issues, it's hard to imagine more than ~10% cpu load. Mostly the second point at the top is paranoia (because authdns is so critical) about the scaling issues. There are sometimes artificial limitations in real-world socket throughput, and perhaps as we cross into peaking closer to 10K/sec (in a hypothetical outage of 2/3 servers), they might start mattering and reveal that some tuning and tweaking is needed. Odds are very good it's a non-issue. I know gdnsd has tested at upwards of 50K/sec serial performance through real kernel sockets (single socket, single i/o thread, single cpu) even in a low-cpu-power virtual environment in the distant past, and these aren't the kind of numbers that generally give our network cards much trouble, either.
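For clarity on the unit conversion and worst-case arithmetic above, a small sketch; the counter value is back-calculated from the ~4.5K/sec figure quoted here, not read off the dashboard:

```
# Convert the dashboard's "requests per 5 minutes" counters into req/sec and
# estimate the load if a single surviving server had to absorb everything
# after the TTL change.  All inputs are the approximate figures quoted above.

WINDOW_SECONDS = 5 * 60

def to_qps(requests_per_window: float) -> float:
    """requests/5min -> requests/sec."""
    return requests_per_window / WINDOW_SECONDS

peak_counter = 1_350_000             # ~weekly peak across all 3 servers, requests/5min
total_qps = to_qps(peak_counter)     # ~4500 req/s
per_server_qps = total_qps / 3       # ~1500 req/s when evenly spread

# Roughly double the rate for TTL 600 -> 300, then assume one surviving server:
worst_case_single_server = total_qps * 2
print(f"total ~{total_qps:.0f}/s, per server ~{per_server_qps:.0f}/s, "
      f"worst case on one box ~{worst_case_single_server:.0f}/s")
# ~9000/s, still far below the ~50K/s serial rate gdnsd has demonstrated.
```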

The first point (about the cycle time of typical crash->recover scenarios getting more significant the lower the TTL is) is probably the one to worry about more. At this point I don't think we're really trying to aim to expand the authdns server pool via LVS (as indicated in the linked ticket), but more likely via anycasting to 2x servers per site (edges included, so that would soon give us 10 total authdns servers globally).

Krinkle renamed this task from "Lower geodns TTLs from 600 to 300" to "Lower geodns TTLs from 600 (10min) to 300 (5min)". Nov 21 2017, 1:18 AM
Krinkle subscribed.

So we've reduced query volume by ~32% in T208263. Since the last significant updates here, we've also deployed newer versions of our authdns software which perform even better, and refreshed some hardware as well. We're still in the basic scenario of having only 3x singular authdns hosts in the world, but they're running with plenty of headroom in terms of handling query rate spikes and server outages. There are really two things holding us up on experimenting with lower TTLs for faster failover:

  • Ideally, we should get past the Anycast hurdle first, giving us more servers and easy depool on server crash, etc. Without this, I don't know that I'd be comfortable going under ~300, because it takes a while for any 1/3 that crashes to reboot, and "depooling" in the current world is a router config change.
  • We're facing GeoDNS-related challenges in our transition to ATS as well: without multi-tier backend caches, we really need ways to smoothly pool in DCs whose caches have gone cold. Until that feature work is complete, any reduction in TTLs here to make failover faster also makes failbacks to cold caches (which are relatively rare) more painful and difficult. However, only cache_upload is progressing through that transition now; cache_text is still quarters away from it, and we're really hoping to fix the smooth-repooling issue before we transition cache_text.

So, I think we could step down towards 300 for now, only for cache_text, and then stall there pending the resolution of the other related issues above. It will at least get that traffic failing over faster, even if images take a few minutes more to follow.

This is still something we want to pursue, but we really need to get past the smooth repooling issue first, so I've added that as a subtask (consider it blocking this one).

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

Following some discussion this week, @BBlack and I decided to revisit this task and provide an update on some of the concerns above, in the hope of providing a path to lowering the TTL for dyna.wikimedia.org (and upload.wikimedia.org, which is just a geoip!upload-addrs A record).

Essentially, we are in favour of lowering this TTL now from 600 seconds (10 minutes) to 300 seconds (5 minutes) to reduce the time it takes for traffic to move over. This is helpful in general for planned maintenance work, but even more so for unplanned work (emergency site depools), as it reduces the user-facing impact of such incidents from 10 minutes to 5 minutes, setting aside resolvers that don't respect TTLs, which we have to live with regardless. (Note that we are only changing the TTL for dyna.wikimedia.org; the CNAME records that point to dyna.wikimedia.org keep their TTL of 1D.)
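To make the impact numbers concrete, here is a minimal sketch of how the TTL bounds the drain time after a depool; it assumes, purely for illustration, resolvers that honour the TTL and cache ages spread uniformly across the TTL window:

```
# After a geodns change is pushed, a resolver keeps sending its users to the
# old site until its cached answer expires.  With cache ages spread uniformly
# over the TTL, the fraction of resolvers still on the old answer t seconds
# after the change is approximately:

def fraction_still_on_old_site(t_seconds: float, ttl: float) -> float:
    return max(0.0, 1.0 - t_seconds / ttl)

for ttl in (600, 300):
    print(f"TTL={ttl}s: ~50% drained by {ttl // 2}s, fully drained by {ttl}s")
# TTL=600s: ~50% drained by 300s, fully drained by 600s
# TTL=300s: ~50% drained by 150s, fully drained by 300s
```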

Our reasons for doing this are:

  • Since this task was last updated in 2019, we have better capacity, both in the hardware running the DNS hosts and at the core sites; we have three DNS boxes each in eqiad and codfw instead of two (after the recursive and authoritative DNS roles were merged in T330670 and the unicast routes for ns0 and ns1 were announced via bird in T347054).

    As the static routes for ns0/ns1 have been superseded by the bird BGP advertisements, there is more resiliency in this setup than before, so we should be able to absorb the performance hit from the increased lookups.
  • The n2 IP is now anycast (since T343942), so we are already spreading the load across all DNS hosts, in both core and edge sites (for a total of 14 hosts). This setup has been in place since Aug 2023 and has served us well so far.

The concern about reducing the time window, especially in the case of cold caches, still remains. While this change benefits our general maintenance work (both planned and unplanned) by shortening the switchover window, we have to be more careful about longer maintenance windows involving cold or empty caches, or cases where we bring a new site up. The smooth-repooling work is a blocker for that part of this task, but we are still quite far from it and challenges remain there, so we are not addressing it right now.

The bulk of our work, however, involves short windows. When we do route traffic to a site with a cold (or empty) cache after this change, we may have to control the rollout manually by editing the geo-map (based on geographical region or IP subnet), and we will have to accept that as a trade-off of decreasing the TTL. This does increase the manual work, but such cases are infrequent, and in the case of bringing up a new site (Brazil), we were planning a more controlled rollout regardless of this change.

We are looking for input on this potential change. If we do decide to roll this out, we should follow it up with a test run with the reduced TTL so that we know what to expect in case of an actual maintenance window or emergency. Also note that the following sites are now single-backend: eqiad, esams, ulsfo, eqsin.

Seems reasonable. There are some good reasons not to go too much lower (to keep load down both on our side and on recursive resolvers on the internet), but 5 mins seems ok to me.

Thanks, folks, for the feedback on the task and on IRC. We plan to merge this patch next week (the week of February 12) since no concerns have been raised so far. If there are any pending concerns, please let us know.

Change 1002585 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] wikimedia.org: lower TTLs for dyna.wm.org and upload.wm.org to 300s

https://gerrit.wikimedia.org/r/1002585

Speaking with an appservers/wikikube clusters hat on, we don't see any problems with lowering the dyna.wikimedia.org TTL from 10 minutes to 5 minutes.

With an overall SRE hat on, this will be ok as long as we can reliably sustain partial failures of the DNS infrastructure (failure, for whatever reason, of one site); per the summary above, we are in a better position now and should be able to achieve that.

Change 1002585 merged by Ssingh:

[operations/dns@master] templates: lower TTLs for dyna.wm.org and upload.wm.org to 300s

https://gerrit.wikimedia.org/r/1002585

Mentioned in SAL (#wikimedia-operations) [2024-02-13T17:23:33Z] <sukhe> running authdns-update to lower dyna TTLs: T140365

We have rolled this out today. For a complete list of domains affected, see the commit above.
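For anyone who wants to spot-check the deployed values, here is a minimal verification sketch; it assumes the dnspython library is available, and queries ns0.wikimedia.org directly as an example authoritative server so the full TTL is visible rather than a recursive resolver's remaining cache time:

```
# Query the records discussed in this task against an authoritative server
# and print the TTLs actually being served (expected: 300 after this change).
import dns.resolver  # pip install dnspython

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = [dns.resolver.resolve("ns0.wikimedia.org", "A")[0].address]

for name in ("dyna.wikimedia.org", "upload.wikimedia.org"):
    answer = resolver.resolve(name, "A")
    print(f"{name}: A record TTL {answer.rrset.ttl}")
```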