
Set up DNS caching for node services
Closed, Resolved (Public)

Description

By default, node.js does not cache DNS resolution at all. This can be a performance issue, especially for UDP, where every packet can trigger a lookup. To avoid this for statsd, we configured raw IP addresses. However, this means that statsd hosts can't easily be switched by updating DNS.

Instead, we should set up generic DNS caching in node, possibly using the in-process cache described in T128015. TTLs should either respect DNS responses, or default to a conservative TTL of 5s. Since this should apply to all services, it might make sense to add this to service-runner.
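
As a rough sketch of the in-process approach (our own wrapper around dns.lookup for illustration, not the actual T128015 / service-runner code), a cache with a fixed 5s TTL could look like this:

```
// Illustrative sketch only, not the actual T128015 / service-runner code.
// Wraps dns.lookup() with a tiny in-process cache using a fixed 5s TTL.
// (Ignores options.all / multi-address results for brevity.)
const dns = require('dns');

const TTL_MS = 5000; // conservative 5s default, as suggested above
const cache = new Map(); // hostname -> { address, family, expires }
const originalLookup = dns.lookup;

dns.lookup = function cachedLookup(hostname, options, callback) {
    // Support both dns.lookup(host, cb) and dns.lookup(host, options, cb)
    if (typeof options === 'function') {
        callback = options;
        options = {};
    }
    const entry = cache.get(hostname);
    if (entry && entry.expires > Date.now()) {
        // Cache hit: answer asynchronously without touching the resolver
        return process.nextTick(callback, null, entry.address, entry.family);
    }
    originalLookup.call(dns, hostname, options, (err, address, family) => {
        if (!err) {
            cache.set(hostname, { address, family, expires: Date.now() + TTL_MS });
        }
        callback(err, address, family);
    });
};
```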

Alternatively or additionally, we could set up a local caching DNS resolver service, and configure this as the local DNS server. However, since this would still involve DNS queries on each packet, it is likely that performance would not be as good as the in-process cache.
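
For reference, the local-resolver variant is host configuration rather than node code; a minimal sketch using dnsmasq on loopback would be something like the following (every dns.lookup would still make a query over loopback, hence the expected gap versus in-process caching):

```
# /etc/dnsmasq.conf (sketch): local caching resolver on loopback
listen-address=127.0.0.1
cache-size=1000

# /etc/resolv.conf: point the system resolver at the local cache
nameserver 127.0.0.1
```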

Event Timeline

At some point I looked into the dnscache package, tried to apply it to several services, and measured performance. I didn't find any major performance improvement for smallish cache periods, but that's because I was mostly testing it on TCP traffic between services, where a DNS lookup is just a small fraction of the request latency. I can try again with all of our UDP traffic enabled and check whether the performance boost is more significant.

A couple of issues with the dnscache package (which is by far the most widely used in the community):

  1. It doesn't and won't respect record TTLs (see https://github.com/yahoo/dnscache/issues/14), and actually for a good reason: the ability to fetch a record's TTL was only added to node core in version 7.2 (https://github.com/nodejs/node/pull/9296)
  2. It's global. We can't distinguish between traffic we want to cache and traffic we don't. If we lose 5 seconds of metrics, that's OK, but if RESTBase loses connectivity to all backend services for 5 seconds, that's a real 5-second-long outage. (See the sketch after this list.)
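
To illustrate point 2: enabling dnscache patches the dns module for the whole process (the options below follow the package README; the 5s TTL is just our example value):

```
// dnscache replaces dns.lookup() and the dns.resolve*() family process-wide:
// once enabled, *every* consumer in this process gets cached answers - the
// statsd client, but also e.g. RESTBase's requests to its backend services.
require('dnscache')({
    enable: true,
    ttl: 5,          // one uniform TTL in seconds; real record TTLs are ignored (point 1)
    cachesize: 1000
});

const dns = require('dns');

// This call - and any lookup made by any library in the same process - is now
// served from the shared cache once the name has been resolved once.
dns.lookup('statsd.eqiad.wmnet', (err, address) => {
    console.log(err || address);
});
```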

I've made a simple benchmark with the htcp-purge package. Without DNS caching we're doing 2932 purges/sec; with the dnscache package we're able to do 35663 purges/sec. So for UDP traffic the performance improvement is tremendous, more than 10x.
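
The numbers above come from the htcp-purge benchmark itself; a stripped-down sketch of the same kind of measurement (raw lookup throughput with and without the cache, target host purely illustrative) looks like:

```
// Not the actual htcp-purge benchmark, just a rough sketch of the same kind of
// measurement: time N sequential dns.lookup() calls, then run it again with
// the dnscache line uncommented and compare the rates.
const dns = require('dns');
// require('dnscache')({ enable: true, ttl: 5, cachesize: 1000 });

const HOST = process.argv[2] || 'localhost'; // stand-in for the purge target
const N = 10000;
const start = process.hrtime();

function lookupLoop(i) {
    if (i === N) {
        const [s, ns] = process.hrtime(start);
        console.log(Math.round(N / (s + ns / 1e9)) + ' lookups/sec');
        return;
    }
    dns.lookup(HOST, (err) => {
        if (err) { throw err; }
        setImmediate(lookupLoop, i + 1);
    });
}

lookupLoop(0);
```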

@GWicke when I flipped over statsd.eqiad.wmnet after the CNAME change, it was sufficient to restart the services rather than change any IP address. It looks like the DNS name is what's in the config, but it is resolved only once?

@fgiunchedi, you are right - the config file currently has the name. We did resolve the IPs in puppet at some point, but it looks like we moved away from that.

In any case, we should enable general short-term DNS caching, and disable any once-only resolution code in the statsd backend.
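
To make the distinction concrete, here is an illustrative sketch (not the actual statsd backend code) of once-only resolution versus per-send resolution:

```
// Illustrative only, not the actual statsd backend code.
const dns = require('dns');
const dgram = require('dgram');

const socket = dgram.createSocket('udp4');
const STATSD_HOST = 'statsd.eqiad.wmnet';
const STATSD_PORT = 8125;

// Once-only resolution: the name is looked up at startup and the IP is kept
// forever, so a DNS/CNAME change only takes effect after a service restart.
// (Startup race ignored for brevity.)
let pinnedAddress;
dns.lookup(STATSD_HOST, (err, address) => {
    if (!err) { pinnedAddress = address; }
});
function sendMetricPinned(buf) {
    socket.send(buf, STATSD_PORT, pinnedAddress);
}

// Per-send resolution: pass the hostname and let dgram resolve it on every
// send. Without a cache this means one DNS query per metric packet; with a
// short-TTL process-wide cache it stays cheap while still picking up DNS
// changes within seconds.
function sendMetric(buf) {
    socket.send(buf, STATSD_PORT, STATSD_HOST);
}
```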

I've made a PR for service-runner to stop caching statsd IPs indefinitely and to introduce global DNS caching: https://github.com/wikimedia/service-runner/pull/152

Performance measurements show that just disabling the internal statsd client DNS cache without introducing a global cache has a really bad impact on performance and CPU usage, since we generally send a lot of metrics. On the other hand, with a 5-second global cache the performance difference is negligible compared to the infinite internal statsd cache.

Now that the PR is merged, we need to go over the services, setting up the caching configuration and updating the service-runner version.
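
For the per-service part, the change is roughly a dependency bump to the new service-runner plus a config stanza; the keys below are placeholders only, the real option names are whatever the PR above defines:

```
# Placeholder sketch only - check the service-runner PR above for the real
# option names. The idea: a short process-wide DNS cache, enabled per service
# in its config, plus a bump of the service-runner dependency in package.json.
dns_cache:
  ttl: 5       # seconds to keep a resolved address
  size: 1000   # maximum number of cached hostnames
```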

That's awesome @Pchelolo, thanks! I'll need to switch statsd back to eqiad in the next few days; let me know which services have been switched so we can test it in production too.

Mentioned in SAL (#wikimedia-operations) [2017-02-20T17:54:28Z] <ppchelko@tin> Started deploy [changeprop/deploy@30873eb]: Update change-prop to 30873ebd5: enabling DNS caching for T158338

Mentioned in SAL (#wikimedia-operations) [2017-02-20T17:56:10Z] <ppchelko@tin> Finished deploy [changeprop/deploy@30873eb]: Update change-prop to 30873ebd5: enabling DNS caching for T158338 (duration: 01m 41s)

> I'll need to switch statsd back to eqiad in the next few days; let me know which services have been switched so we can test it in production too.

@fgiunchedi We've deployed ChangeProp with the new behavior today. I've been closely monitoring it, and so far so good. The new behavior is the default in the newer service-runner version, so all services will switch to it automatically as they get deployed. I'll try to keep track of what's updated and post it here.

@Pchelolo @mobrovac after today's CNAME flip it looks like changeprop correctly switched to graphite1001!

So, it seems we need to redeploy all the services once before the switchover happens. Do you think it's worth doing that explicitly, or should we rely on the normal deployment flow?

@Pchelolo, if you mean the DC switch-over, that's scheduled for April. Most (if not all) services will be re-deployed several times before then, but let's make a note to double-check before April?

Pchelolo claimed this task.
Pchelolo edited projects, added Services (done); removed Services (doing).

Ok, that's now done. Resolving.