Description

Metrics are vital for understanding the system, and should keep working in the case of a DC fail-over or outage. This means we need to figure out how to keep Graphite working in each DC, while also allowing frontends like Grafana to integrate information from multiple DCs.

Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
codfw: add statsd service entry | operations/dns | master | +1 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | GWicke | T127976 Graphite DC fail-over / per-DC setup
Resolved | | Pchelolo | T158338 Set up DNS caching for node services
Event Timeline
Change 273199 had a related patch set uploaded (by Filippo Giunchedi):
codfw: add statsd service entry
Currently Graphite is set up the same on graphite1001 and graphite2001: each machine can receive both statsd traffic and Graphite "line protocol" traffic, and will replicate both locally and to the other datacenter.

ATM all internal clients, including clients in codfw, are pointed to statsd.eqiad.wmnet, while externally Varnish proxies graphite.wikimedia.org to graphite1001:
```
$ git grep -i statsd.eqiad.wmnet
hieradata/codfw/mediawiki/jobrunner.yaml:statsd_server: statsd.eqiad.wmnet:8125
hieradata/common.yaml:statsd: statsd.eqiad.wmnet:8125
hieradata/common/ocg.yaml:statsd_host: "statsd.eqiad.wmnet"
hieradata/eqiad/mediawiki/jobrunner.yaml:statsd_server: statsd.eqiad.wmnet:8125
hieradata/role/common/aqs.yaml:aqs::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/logstash.yaml:role::logstash::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/maps.yaml:service::configuration::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/restbase.yaml:restbase::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/sca.yaml:service::configuration::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/scb.yaml:service::configuration::statsd_host: statsd.eqiad.wmnet
manifests/role/eventlogging.pp: $statsd_host = hiera('eventlogging_statsd_host', 'statsd.eqiad.wmnet')
manifests/role/webperf.pp: $statsd_host = 'statsd.eqiad.wmnet'
modules/base/manifests/puppet.pp: statsd_host => 'statsd.eqiad.wmnet',
modules/eventlogging/manifests/service/reporter.pp:# StatsD host. Example: 'statsd.eqiad.wmnet'.
modules/eventlogging/manifests/service/reporter.pp:# host => 'statsd.eqiad.wmnet',
modules/puppet_statsd/README.md: statsd_host => 'statsd.eqiad.wmnet',
modules/puppet_statsd/manifests/init.pp:# statsd_host => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/kafka/webrequest.pp: logster_options => "-o statsd --statsd-host=statsd.eqiad.wmnet:8125 --metric-prefix=${graphite_metric_prefix}",
modules/role/manifests/cache/statsd.pp: statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/statsd.pp: statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/statsd.pp: statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/upload.pp: statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/zuul/configuration.pp: 'statsd_host' => 'statsd.eqiad.wmnet',
modules/scap/files/scap.cfg:statsd_host: statsd.eqiad.wmnet
modules/scap/files/scap.cfg:statsd_host: statsd.eqiad.wmnet
modules/scap/manifests/master.pp: $statsd_host = 'statsd.eqiad.wmnet',
modules/swift/manifests/stats/accounts.pp: $statsd_host = 'statsd.eqiad.wmnet',
modules/swift/manifests/stats/dispersion.pp: $statsd_host = 'statsd.eqiad.wmnet',
modules/swift/manifests/stats/stats_container.pp: $statsd_host = 'statsd.eqiad.wmnet',
modules/varnish/manifests/logging/media.pp:# statsd_server => 'statsd.eqiad.wmnet:8125
modules/varnish/manifests/logging/rls.pp:# statsd_server => 'statsd.eqiad.wmnet:8125
modules/varnish/manifests/logging/statsd.pp:# statsd_server => 'statsd.eqiad.wmnet:8125
modules/varnish/manifests/logging/xcps.pp:# statsd_server => 'statsd.eqiad.wmnet:8125
```
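For context, all of this statsd traffic is plain UDP datagrams in `name:value|type` form. A minimal sketch of what one of these clients puts on the wire, using Node's dgram module (the metric name is made up, and real clients usually go through a statsd library rather than raw sockets):

```typescript
// Send a single counter increment to the shared statsd endpoint.
import * as dgram from 'dgram';

const socket = dgram.createSocket('udp4');
const payload = Buffer.from('example.requests:1|c'); // "name:value|type"

// Passing the hostname here means Node resolves it (via dns.lookup) for the
// send; a client that resolves once up front keeps using the old IP instead.
socket.send(payload, 8125, 'statsd.eqiad.wmnet', (err) => {
  if (err) {
    console.error('statsd send failed:', err);
  }
  socket.close();
});
```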
Assuming for the first failover we keep eqiad reachable from codfw: once we fail clients over to statsd.codfw.wmnet, metrics aggregation will move from eqiad to codfw and the resulting metrics will still be pushed to both, IOW no backfilling is necessary on failover/failback.

However, if eqiad isn't reachable from codfw we'll have to fail over all (codfw only?) statsd clients to statsd.codfw.wmnet; when eqiad comes back we'll have to rsync all metrics updated in codfw to eqiad, and vice versa, to bring things back in sync.
While working on T157022: Suspected faulty SSD on graphite1001, and in particular its DNS switchover, it turned out that not all services observe the DNS TTL of 1h.

Below is the list of statsd prefixes still coming in to graphite1001 several hours after flipping the DNS:
Hadoop VisualEditor aqs changeprop citoid cxserver eventlogging eventstreams frontend gerrit graphoid kafka kartotherian mathoid mobileapps mw nodepool ocg ores parsoid parsoid-tests restbase restbase-dev restbase-test tilerator tileratorui trendingedits zookeeper zuul
So basically either the connection is kept open on the client side and the name is never looked up again, or the applications cache DNS indefinitely.

@godog did you consider restarting statsd on graphite1001 and seeing what reconnects there?
IIRC statsd DNS resolution for nodejs services was special-cased. Otherwise, by default nodejs (or the statsd library) would issue a DNS lookup every time a statsd UDP packet was sent out.

I'm not sure about the Java/jmxtrans case; IIRC Java used to cache DNS records indefinitely, but that was fixed in newer JVM releases.
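To make the trade-off concrete, here is a sketch of the two client behaviours being described, again with Node's dgram and dns modules (the helper names are made up and this is not the actual service code):

```typescript
import * as dgram from 'dgram';
import * as dns from 'dns';

const socket = dgram.createSocket('udp4');

// Option A: pass the hostname on every send. Node performs a dns.lookup()
// per datagram, so a DNS change is picked up, at the cost of a lookup for
// every metric.
function sendByName(payload: Buffer): void {
  socket.send(payload, 8125, 'statsd.eqiad.wmnet');
}

// Option B: resolve once at startup and keep the IP. No per-packet lookup,
// but the process keeps sending to the old address until it is restarted,
// which matches the behaviour observed after the DNS flip.
let resolvedAddress: string | undefined;
dns.lookup('statsd.eqiad.wmnet', (err, address) => {
  if (!err) resolvedAddress = address;
});

function sendByCachedIp(payload: Buffer): void {
  if (resolvedAddress) socket.send(payload, 8125, resolvedAddress);
}
```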
Also note that Python services (e.g. the webperf-related ones on hafnium) didn't seem to pick up the DNS change either.
The full list of restarts this time around is available from https://phabricator.wikimedia.org/T157022#3014109 onwards.
> @godog did you consider restarting statsd on graphite1001 and seeing what reconnects there?

All UDP traffic, so no dice :(

> IIRC statsd DNS resolution for nodejs services was special-cased. Otherwise, by default nodejs (or the statsd library) would issue a DNS lookup every time a statsd UDP packet was sent out.

Yes, IIRC we configured IPs to avoid this. We should replace this with short-term DNS caching, along the lines of T128015.
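For illustration only, a minimal sketch of what short-term DNS caching can look like on the client side; the actual change lives in service-runner, and the cache duration here is an arbitrary assumption:

```typescript
// Resolve through dns.promises.lookup() and keep the answer for a bounded
// time, so a DNS switch-over is picked up within the cache TTL without
// doing a lookup for every metric sent.
import { promises as dns } from 'dns';

const CACHE_TTL_MS = 60 * 1000; // assumed value, not the production setting
const cache = new Map<string, { address: string; expires: number }>();

async function cachedLookup(hostname: string): Promise<string> {
  const hit = cache.get(hostname);
  if (hit && hit.expires > Date.now()) {
    return hit.address;
  }
  const { address } = await dns.lookup(hostname);
  cache.set(hostname, { address, expires: Date.now() + CACHE_TTL_MS });
  return address;
}

// Usage sketch: resolve the endpoint before each send (or batch of sends):
// cachedLookup('statsd.eqiad.wmnet').then((ip) => { /* send to ip:8125 */ });
```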
All services but Parsoid are now running the latest version of service-runner in production; the Parsoid deploy is scheduled within the next week.
This means that we can call this task done.