hosts override of statsd.eqiad.wmnet
Closed, Resolved · Public

Assigned To

Authored By

	CDanis
	Dec 4 2019, 9:27 PM

Description

It turns out that internal recDNS is underprovisioned in eqiad given current load. 80% of the load on eqiad recdns is lookups for statsd.eqiad.wmnet, which seem to be made multiple times per MW appserver query, and never cached by those clients (presumably for usual PHP reasons).

https://gerrit.wikimedia.org/r/c/operations/puppet/+/554618 dropped us from ~70k packets-per-second on each recdns host to about 12k pps. But this is a kludge, and should be rolled back when we have the capacity (10G NICs coming Soon, which will likely help), or when we work around it in other ways (such as with a local stub resolver on every host [with a max-ttl set to only a minute or two, so we don't create more of a mess around purging bad records]).
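
As a rough sanity check, the drop in packet rate can be compared with the share of lookups attributed to statsd.eqiad.wmnet. This is only back-of-the-envelope arithmetic on the approximate figures quoted in this description (~70k pps before, ~12k pps after, ~80% statsd share); the variable names are illustrative and not taken from any tooling.

```python
# Approximate figures quoted in the task description.
pps_before = 70_000   # packets per second per recdns host, before the hosts override
pps_after = 12_000    # packets per second per recdns host, after the hosts override
statsd_share = 0.80   # rough share of recdns load attributed to statsd lookups

# Fraction of per-host packet load that disappeared after the change.
observed_reduction = (pps_before - pps_after) / pps_before

# Load we would expect to remain if only the statsd lookups went away.
expected_pps_after = pps_before * (1 - statsd_share)

print(f"observed reduction: {observed_reduction:.1%}")
print(f"expected remaining pps at 80% share: {expected_pps_after:,.0f}")
```

The observed reduction comes out slightly above the quoted 80% share, which is plausible given that all three inputs are rounded estimates.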

Details

Subject	Repo	Branch	Lines +/-
profile: remove absented statsd hosts entry	operations/puppet	production	+0 -8
statsd_proxy: fix socat invocation to not crashloop	operations/puppet	production	+1 -1
statsd_proxy: reuseaddr for 6to4 proxy to avoid crashloop	operations/puppet	production	+1 -1
profile: remove hardcoded statsd.eqiad.wmnet	operations/puppet	production	+2 -1
wmnet: prep statsd/graphite records for easier write failover	operations/dns	master	+4 -3
P:monitoring: Absent hardcoded statsd host entry	operations/puppet	production	+4 -1
statsd: document Puppet /etc/hosts-ification	operations/dns	master	+1 -1
base: document statsd DNS kludge	operations/puppet	production	+2 -1

Related Objects

Mentioned In: T279013: Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution"
T231025: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002
T171498: Implement machine-local forwarding DNS caches

Event Timeline

CDanis created this task. Dec 4 2019, 9:27 PM

Restricted Application added a subscriber: Aklapper. Dec 4 2019, 9:27 PM

If I recall correctly, HHVM had a dns cache. This is among the reasons that, over the years, we gradually adopted more use of hostnames in wmf-config for services instead of hardcoding IP addresses. I guess we lost that in the PHP7 transition. Does the OS not cache this at all? Does PHP7 do something to bypass it?

Krinkle added a project: Performance-Team (Radar). Dec 4 2019, 9:28 PM

Change 554631 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] statsd: document Puppet /etc/hosts-ification

https://gerrit.wikimedia.org/r/554631

gerritbot added a project: Patch-For-Review. Dec 4 2019, 9:28 PM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board. Dec 4 2019, 9:29 PM

Change 554632 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] base: document statsd DNS kludge

https://gerrit.wikimedia.org/r/554632

In general, app-layer DNS caching is a Bad Idea unless it's done very carefully (e.g. cap it at something like 5s max, or actually use a full-featured resolver library and get the real TTLs from upstream, or both).

There are several ways we can make the OS layer do the same, but they all involve some form of stub cache, and no such thing is presently configured. There's a systemd-resolved stub cache that's fairly easy to inject at the OS layer, but it lacks any kind of config to cap TTLs down to ~5s and has no way to wipe individual records, so we'd lose our current ability to actively wipe individual cache entries from recdns in various operational problem scenarios. Plugging together a per-host real cache like powerdns recursor is also very tricky...
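
To make the trade-off above concrete, here is a minimal sketch of an in-process cache with a hard lifetime cap and per-record purge. It is not what MediaWiki, HHVM, or any resolver mentioned in this task actually does; `CappedDnsCache`, its parameters, and the 5-second default are hypothetical names and values chosen for illustration. Because `socket.getaddrinfo` does not expose upstream TTLs, the sketch relies purely on the fixed cap.

```python
import socket
import time
from typing import Callable, Dict, List, Optional, Tuple


class CappedDnsCache:
    """Tiny in-process hostname cache with a hard cap on entry lifetime.

    Hypothetical helper for illustration only. It never sees real DNS TTLs,
    so every entry simply expires after max_ttl seconds.
    """

    def __init__(
        self,
        max_ttl: float = 5.0,
        resolver: Optional[Callable[[str], List[str]]] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_ttl = max_ttl
        self._resolver = resolver or self._system_resolve
        self._clock = clock
        # name -> (expiry time, list of addresses)
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    @staticmethod
    def _system_resolve(name: str) -> List[str]:
        # Ask the OS resolver; keep unique addresses in the order returned.
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_UDP)
        seen: List[str] = []
        for info in infos:
            addr = info[4][0]
            if addr not in seen:
                seen.append(addr)
        return seen

    def lookup(self, name: str) -> List[str]:
        """Return cached addresses if still fresh, otherwise re-resolve."""
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[0]:
            return cached[1]
        addresses = self._resolver(name)
        self._entries[name] = (now + self.max_ttl, addresses)
        return addresses

    def purge(self, name: str) -> bool:
        """Drop one record; returns True if something was cached."""
        return self._entries.pop(name, None) is not None
```

For example, `CappedDnsCache(max_ttl=5.0).lookup("localhost")` would go through the OS resolver at most once per five seconds for that name, and `purge("localhost")` forces the next lookup to resolve again. The injectable `resolver` and `clock` exist only so the behaviour can be exercised without network access.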

Change 554632 merged by CDanis:
[operations/puppet@production] base: document statsd DNS kludge

https://gerrit.wikimedia.org/r/554632

Change 554631 merged by CDanis:
[operations/dns@master] statsd: document Puppet /etc/hosts-ification

https://gerrit.wikimedia.org/r/554631

Maintenance_bot removed a project: Patch-For-Review. Dec 4 2019, 10:10 PM

akosiaris triaged this task as Low priority. Dec 5 2019, 9:09 AM

BBlack mentioned this in T171498: Implement machine-local forwarding DNS caches. Dec 6 2019, 1:51 PM

Joe mentioned this in T231025: LegacyHandler.php: PHP Warning: Host lookup failed [-10002]: Unknown error -10002. Feb 4 2021, 2:20 PM

Given most nodejs applications don't use statsd anymore (in kubernetes we just use the prometheus-statsd exporter), and I have submitted https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/661732 to switch to using the IP address directly in MediaWiki, I think we can remove the puppetized host file entry once my patch is merged.

CDanis mentioned this in T279013: Phabricator intermittently slow; db connection failures to m3-master.eqiad.wmnet with "Temporary failure in name resolution". Mar 31 2021, 9:25 PM

@Joe's patch mentioned above was merged in Feb 2021 and the hardcoded IP config has since been moved to monitoring.pp [1]. @CDanis can the entry be removed at this point?

[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/725298

Krinkle unsubscribed. Jan 8 2023, 5:59 PM

Change 883151 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] P:monitoring: Absent hardcoded statsd host entry

https://gerrit.wikimedia.org/r/883151

gerritbot added a project: Patch-For-Review. Jan 24 2023, 11:53 AM

In T239862#8504801, @LSobanski wrote:

@Joe's patch mentioned above was merged in Feb 2021 and the hardcoded IP config has since been moved to monitoring.pp [1]. @CDanis can the entry be removed at this point?

[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/725298

CR uploaded, @CDanis can merge and remove it?

fgiunchedi edited projects, added Traffic; removed SRE. Feb 6 2023, 1:54 PM

Is https://grafana.wikimedia.org/d/000000399/dns-recursors?orgId=1 the right dash to watch for possible issues after merging that removal?

In T239862#8589709, @Clement_Goubert wrote:

Is https://grafana.wikimedia.org/d/000000399/dns-recursors?orgId=1 the right dash to watch for possible issues after merging that removal?

That's right!

Change 883151 merged by Clément Goubert:

[operations/puppet@production] P:monitoring: Absent hardcoded statsd host entry

https://gerrit.wikimedia.org/r/883151

Maintenance_bot removed a project: Patch-For-Review. Feb 6 2023, 3:31 PM

Removing this entry broke Page Previews metrics, and possibly others. Reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/887762
A possible cause is that statsd.eqiad.wmnet resolves to a CNAME for graphite1005.eqiad.wmnet, which then resolves to an AAAA record. statsd-proxy can't listen on udp6, and it's possible a similar issue is happening with graphite.

More investigation needed.
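
The suspected failure mode can be sketched as follows: a client that takes the first address returned by the resolver may end up sending UDP datagrams to an IPv6 address while the listener is bound to IPv4 only, and UDP gives no error back. This is an illustration of that hypothesis, not a confirmed diagnosis. The helper names, the sample addresses (documentation ranges 2001:db8::/32 and 192.0.2.0/24), and the use of port 8125 are all assumptions; the tuples only mimic the shape of `socket.getaddrinfo` results.

```python
import socket
from typing import Iterable, List, Optional, Set, Tuple

AddrInfo = Tuple[int, int, int, str, tuple]


def families_for(addrinfos: Iterable[AddrInfo]) -> List[str]:
    """Return the distinct address families present, in first-seen order."""
    names: List[str] = []
    for family, *_rest in addrinfos:
        label = socket.AddressFamily(family).name
        if label not in names:
            names.append(label)
    return names


def first_reachable(
    addrinfos: Iterable[AddrInfo], listener_families: Set[int]
) -> Optional[tuple]:
    """Pick the first sockaddr whose family the listener supports, else None."""
    for family, _type, _proto, _canon, sockaddr in addrinfos:
        if family in listener_families:
            return sockaddr
    return None


# Synthetic, getaddrinfo-shaped results for a name with both AAAA and A records.
sample: List[AddrInfo] = [
    (socket.AF_INET6, socket.SOCK_DGRAM, socket.IPPROTO_UDP, "", ("2001:db8::5", 8125, 0, 0)),
    (socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP, "", ("192.0.2.5", 8125)),
]

# A naive client uses whatever comes first, here the IPv6 address.
naive_target = sample[0][4]

# A listener bound to udp4 only can be reached just via the A record.
usable_target = first_reachable(sample, {socket.AF_INET})

print(families_for(sample), naive_target, usable_target)
```

On a real host, feeding `socket.getaddrinfo(name, port, type=socket.SOCK_DGRAM)` into `families_for` shows which families a name resolves to from that machine.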

statsd-proxy udp6 support added by @herron in https://gerrit.wikimedia.org/r/c/operations/puppet/+/887804

Change 904185 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: prep statsd/graphite records for easier write failover

https://gerrit.wikimedia.org/r/904185

gerritbot added a project: Patch-For-Review. Mar 29 2023, 1:50 PM

Change 904186 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: remove hardcoded statsd.eqiad.wmnet

https://gerrit.wikimedia.org/r/904186

Change 904185 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: prep statsd/graphite records for easier write failover

https://gerrit.wikimedia.org/r/904185

Change 904186 merged by Filippo Giunchedi:

[operations/puppet@production] profile: remove hardcoded statsd.eqiad.wmnet

https://gerrit.wikimedia.org/r/904186

Maintenance_bot removed a project: Patch-For-Review. Mar 30 2023, 8:30 AM

The hardcoded statsd.eqiad.wmnet entry is gone and I can confirm we're receiving statsd traffic on v6 too, thanks to all involved!

Change 904677 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd_proxy: reuseaddr for 6to4 proxy to avoid crashloop

https://gerrit.wikimedia.org/r/904677

gerritbot added a project: Patch-For-Review. Mar 31 2023, 7:42 AM

Change 904677 merged by Filippo Giunchedi:

[operations/puppet@production] statsd_proxy: reuseaddr for 6to4 proxy to avoid crashloop

https://gerrit.wikimedia.org/r/904677

Maintenance_bot removed a project: Patch-For-Review. Mar 31 2023, 8:10 AM

Change 904771 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd_proxy: fix socat invocation to not crashloop

https://gerrit.wikimedia.org/r/904771

gerritbot added a project: Patch-For-Review. Mar 31 2023, 11:01 AM

Change 904771 merged by Filippo Giunchedi:

[operations/puppet@production] statsd_proxy: fix socat invocation to not crashloop

https://gerrit.wikimedia.org/r/904771

Maintenance_bot removed a project: Patch-For-Review. Mar 31 2023, 11:11 AM

Change 997803 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: remove absented statsd hosts entry

https://gerrit.wikimedia.org/r/997803

gerritbot added a project: Patch-For-Review. Feb 6 2024, 11:38 AM

Change 997803 merged by Filippo Giunchedi: