unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet
Closed, Resolved (Public)

Description

It turns out that internal recDNS is underprovisioned in eqiad given current load. 80% of the load on eqiad recdns is lookups for statsd.eqiad.wmnet, which seem to be made multiple times per MW appserver query, and never cached by those clients (presumably for usual PHP reasons).

https://gerrit.wikimedia.org/r/c/operations/puppet/+/554618 dropped us from ~70k packets-per-second on each recdns host to about 12k pps. But this is a kludge, and should be rolled back when we have the capacity (10G NICs coming Soon, which will likely help), or when we work around it other ways (such as with a local stub resolver on every host [with a max-ttl set to only a minute or two, so we don't create more of a mess around purging bad records]).
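As a rough sanity check, the packet-rate drop quoted above can be compared with the "80% of the load" estimate. The sketch below uses only the approximate figures from the description (~70k and ~12k pps); the function name is illustrative.

```python
def reduction_fraction(before_pps: float, after_pps: float) -> float:
    """Fraction of the original packet rate that disappeared after the change."""
    return 1.0 - after_pps / before_pps


# Approximate figures from the task description, not measured values.
BEFORE_PPS = 70_000
AFTER_PPS = 12_000

if __name__ == "__main__":
    fraction = reduction_fraction(BEFORE_PPS, AFTER_PPS)
    print(f"reduction: {fraction:.1%}")
```

The result is a little over 80%, which is in line with the estimate that statsd lookups made up roughly 80% of the recdns load.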

Event Timeline

If I recall correctly, HHVM had a dns cache. This is among the reasons that, over the years, we gradually adopted more use of hostnames in wmf-config for services instead of hardcoding IP addresses. I guess we lost that in the PHP7 transition. Does the OS not cache this at all? Does PHP7 do something to bypass it?

Change 554631 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] statsd: document Puppet /etc/hosts-ification

https://gerrit.wikimedia.org/r/554631

Change 554632 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] base: document statsd DNS kludge

https://gerrit.wikimedia.org/r/554632

In general, app-layer DNS caching is a Bad Idea unless it's done very carefully (e.g. cap it at something like 5s max, or actually use a full-featured resolver library and get the real TTLs from upstream, or both).

There are several ways we can make the OS layer do the same, but they all involve some form of stub cache, and no such thing is presently configured. There's a systemd-resolved stub cache that's fairly easy to inject at the OS layer, but it lacks any kind of config to cap TTLs down to ~5s and has no way to wipe individual records, so we'd lose our current ability to actively wipe individual cache entries from recdns in various operational problem scenarios. Plugging together a per-host real cache like powerdns recursor is also very tricky...
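For illustration, a minimal sketch of the "cap it at something like 5s" idea for an app-layer cache could look like the following. This is not what MediaWiki or any WMF service does; the class name, the `purge` helper, and the default resolver are assumptions made for the example. Note that `socket.gethostbyname` does not expose upstream TTLs, so this only enforces a fixed cap rather than honouring the real TTL.

```python
import socket
import time
from typing import Callable, Dict, Tuple

MAX_TTL_SECONDS = 5.0  # assumption: the ~5s cap suggested in the comment above


class CappedDnsCache:
    """Tiny in-process cache that never keeps an answer longer than max_ttl."""

    def __init__(
        self,
        resolver: Callable[[str], str] = socket.gethostbyname,
        max_ttl: float = MAX_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolver = resolver
        self._max_ttl = max_ttl
        self._clock = clock
        # hostname -> (expiry time, address)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        cached = self._entries.get(hostname)
        if cached is not None and cached[0] > now:
            return cached[1]  # still fresh, skip the DNS lookup
        address = self._resolver(hostname)
        self._entries[hostname] = (now + self._max_ttl, address)
        return address

    def purge(self, hostname: str) -> None:
        """Drop one entry, e.g. after a bad record was fixed upstream."""
        self._entries.pop(hostname, None)
```

With a short cap like this, a bad record lingers for at most a few seconds per process, at the cost of one lookup per hostname per cap interval instead of one per request.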

Change 554632 merged by CDanis:
[operations/puppet@production] base: document statsd DNS kludge

https://gerrit.wikimedia.org/r/554632

Change 554631 merged by CDanis:
[operations/dns@master] statsd: document Puppet /etc/hosts-ification

https://gerrit.wikimedia.org/r/554631

Given most nodejs applications don't use statsd anymore (in kubernetes we just use the prometheus-statsd exporter), and I have submitted https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/661732 to switch to using the IP address directly in MediaWiki, I think we can remove the puppetized host file entry once my patch is merged.

@Joe's patch mentioned above was merged in Feb 2021 and the hardcoded IP config has since been moved to monitoring.pp [1]. @CDanis can the entry be removed at this point?

[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/725298

Change 883151 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] P:monitoring: Absent hardcoded statsd host entry

https://gerrit.wikimedia.org/r/883151

CR uploaded, @CDanis can merge and remove it?

Is https://grafana.wikimedia.org/d/000000399/dns-recursors?orgId=1 the right dash to watch for possible issues after merging that removal?

That's right!

Change 883151 merged by Clément Goubert:

[operations/puppet@production] P:monitoring: Absent hardcoded statsd host entry

https://gerrit.wikimedia.org/r/883151

Removing this entry broke Page Previews metrics, and possibly others. Reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/887762
A possible cause is that statsd.eqiad.wmnet resolves to a CNAME for graphite1005.eqiad.wmnet, which in turn has an AAAA record. statsd-proxy can't listen on udp6, and it's possible a similar issue is happening with graphite.

More investigation needed.
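The suspected failure mode can be sketched as follows: once the /etc/hosts entry is gone, a client that resolves the name may get an IPv6 address first, and UDP datagrams sent there are silently lost if nothing listens on udp6. The helper names below are hypothetical, and the addresses in the usage example are documentation placeholders, not real WMF addresses.

```python
import socket
from typing import Iterable, List, Optional, Set, Tuple

# Same shape as the tuples returned by socket.getaddrinfo().
AddrInfo = Tuple[int, int, int, str, tuple]


def lookup_udp(hostname: str, port: int) -> List[AddrInfo]:
    """Resolve hostname for UDP; the result may mix IPv4 and IPv6 entries."""
    return socket.getaddrinfo(hostname, port, type=socket.SOCK_DGRAM)


def families_seen(addrinfos: Iterable[AddrInfo]) -> List[str]:
    """Return the sorted address-family labels present in a lookup result."""
    labels = set()
    for family, *_rest in addrinfos:
        if family == socket.AF_INET6:
            labels.add("IPv6")
        elif family == socket.AF_INET:
            labels.add("IPv4")
    return sorted(labels)


def first_reachable(
    addrinfos: Iterable[AddrInfo], listener_families: Set[int]
) -> Optional[tuple]:
    """Return the first sockaddr whose family the listener accepts, else None."""
    for family, _type, _proto, _canonname, sockaddr in addrinfos:
        if family in listener_families:
            return sockaddr
    return None


if __name__ == "__main__":
    # Placeholder lookup result: an AAAA answer listed before an A answer.
    answers = [
        (socket.AF_INET6, socket.SOCK_DGRAM, 17, "", ("2001:db8::5", 8125, 0, 0)),
        (socket.AF_INET, socket.SOCK_DGRAM, 17, "", ("192.0.2.5", 8125)),
    ]
    print(families_seen(answers))
    # A listener bound only to udp4 can only be reached via the IPv4 entry.
    print(first_reachable(answers, {socket.AF_INET}))
```

A client that simply takes the first entry would pick the IPv6 address here. Because UDP gives no delivery feedback, metrics would vanish without any error on the sending side.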

Change 904185 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: prep statsd/graphite records for easier write failover

https://gerrit.wikimedia.org/r/904185

Change 904186 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: remove hardcoded statsd.eqiad.wmnet

https://gerrit.wikimedia.org/r/904186

Change 904185 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: prep statsd/graphite records for easier write failover

https://gerrit.wikimedia.org/r/904185

Change 904186 merged by Filippo Giunchedi:

[operations/puppet@production] profile: remove hardcoded statsd.eqiad.wmnet

https://gerrit.wikimedia.org/r/904186

fgiunchedi claimed this task.
fgiunchedi subscribed.

The hardcoded statsd.eqiad.wmnet entry is gone and I can confirm we're receiving statsd traffic on v6 too, thanks to all involved!

Change 904677 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd_proxy: reuseaddr for 6to4 proxy to avoid crashloop

https://gerrit.wikimedia.org/r/904677

Change 904677 merged by Filippo Giunchedi:

[operations/puppet@production] statsd_proxy: reuseaddr for 6to4 proxy to avoid crashloop

https://gerrit.wikimedia.org/r/904677

Change 904771 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd_proxy: fix socat invocation to not crashloop

https://gerrit.wikimedia.org/r/904771

Change 904771 merged by Filippo Giunchedi:

[operations/puppet@production] statsd_proxy: fix socat invocation to not crashloop

https://gerrit.wikimedia.org/r/904771

Change 997803 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: remove absented statsd hosts entry

https://gerrit.wikimedia.org/r/997803

Change 997803 merged by Filippo Giunchedi:

[operations/puppet@production] profile: remove absented statsd hosts entry

https://gerrit.wikimedia.org/r/997803