
Graphite DC fail-over / per-DC setup
Closed, ResolvedPublic

Description

Metrics are vital to provide an understanding of the system, and should continue working in the case of a DC fail-over / outage. This means that we need to figure out how to keep Graphite working in each DC, while also allowing frontends like grafana to integrate information from multiple DCs.

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper.

Change 273199 had a related patch set uploaded (by Filippo Giunchedi):
codfw: add statsd service entry

https://gerrit.wikimedia.org/r/273199

Currently Graphite is set up the same way on graphite1001 and graphite2001: each machine can receive both statsd traffic and Graphite "line protocol" traffic, and replicates both locally and to the other datacenter.

At the moment all internal clients, including those in codfw, point to statsd.eqiad.wmnet, while externally Varnish proxies graphite.wikimedia.org to graphite1001.

$ git grep -i statsd.eqiad.wmnet
hieradata/codfw/mediawiki/jobrunner.yaml:statsd_server: statsd.eqiad.wmnet:8125
hieradata/common.yaml:statsd: statsd.eqiad.wmnet:8125
hieradata/common/ocg.yaml:statsd_host: "statsd.eqiad.wmnet"
hieradata/eqiad/mediawiki/jobrunner.yaml:statsd_server: statsd.eqiad.wmnet:8125
hieradata/role/common/aqs.yaml:aqs::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/logstash.yaml:role::logstash::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/maps.yaml:service::configuration::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/restbase.yaml:restbase::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/sca.yaml:service::configuration::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/scb.yaml:service::configuration::statsd_host: statsd.eqiad.wmnet
manifests/role/eventlogging.pp:    $statsd_host         = hiera('eventlogging_statsd_host',      'statsd.eqiad.wmnet')
manifests/role/webperf.pp:    $statsd_host = 'statsd.eqiad.wmnet'
modules/base/manifests/puppet.pp:        statsd_host   => 'statsd.eqiad.wmnet',
modules/eventlogging/manifests/service/reporter.pp:#   StatsD host. Example: 'statsd.eqiad.wmnet'.
modules/eventlogging/manifests/service/reporter.pp:#    host => 'statsd.eqiad.wmnet',
modules/puppet_statsd/README.md:  statsd_host   => 'statsd.eqiad.wmnet',
modules/puppet_statsd/manifests/init.pp:#     statsd_host => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/kafka/webrequest.pp:        logster_options => "-o statsd --statsd-host=statsd.eqiad.wmnet:8125 --metric-prefix=${graphite_metric_prefix}",
modules/role/manifests/cache/statsd.pp:        statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/statsd.pp:        statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/statsd.pp:        statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/upload.pp:        statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/zuul/configuration.pp:            'statsd_host'          => 'statsd.eqiad.wmnet',
modules/scap/files/scap.cfg:statsd_host: statsd.eqiad.wmnet
modules/scap/files/scap.cfg:statsd_host: statsd.eqiad.wmnet
modules/scap/manifests/master.pp:    $statsd_host        = 'statsd.eqiad.wmnet',
modules/swift/manifests/stats/accounts.pp:    $statsd_host   = 'statsd.eqiad.wmnet',
modules/swift/manifests/stats/dispersion.pp:    $statsd_host   = 'statsd.eqiad.wmnet',
modules/swift/manifests/stats/stats_container.pp:    $statsd_host = 'statsd.eqiad.wmnet',
modules/varnish/manifests/logging/media.pp:#    statsd_server => 'statsd.eqiad.wmnet:8125
modules/varnish/manifests/logging/rls.pp:#    statsd_server => 'statsd.eqiad.wmnet:8125
modules/varnish/manifests/logging/statsd.pp:#    statsd_server => 'statsd.eqiad.wmnet:8125
modules/varnish/manifests/logging/xcps.pp:#    statsd_server => 'statsd.eqiad.wmnet:8125
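
To make the client side of this concrete, here is a minimal sketch (not taken from any of the repositories above) of the fire-and-forget statsd UDP path these clients use; the hostname mirrors hieradata/common.yaml, while the metric name and helper function are purely illustrative.

```python
# Minimal sketch of the fire-and-forget statsd UDP path the clients above use;
# the hostname mirrors hieradata/common.yaml, the metric name is made up.
import socket

STATSD_HOST = "statsd.eqiad.wmnet"
STATSD_PORT = 8125

def send_counter(metric, value=1):
    # A plain-text statsd datagram like "example.requests:1|c", sent without
    # any acknowledgement from the server.
    payload = f"{metric}:{value}|c".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # sendto() resolves the hostname at this point; whether that resolution
        # is cached or repeated is entirely up to the client runtime (see the
        # DNS discussion further down in this task).
        sock.sendto(payload, (STATSD_HOST, STATSD_PORT))
    finally:
        sock.close()

send_counter("example.requests")
```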

Assuming that for the first failover we keep eqiad reachable from codfw, once we fail clients over to statsd.codfw.wmnet, metrics aggregation will move from eqiad to codfw and the resulting metrics will still be pushed to both datacenters; in other words, no backfilling is necessary on failover/failback.

However, if eqiad isn't reachable from codfw we'll have to fail over all (codfw only?) statsd clients to statsd.codfw.wmnet. When eqiad comes back we'll have to rsync all metrics updated in codfw over to eqiad, and vice versa, to bring things back in sync.
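
As a rough illustration of that backfill step, the sketch below would run on the host that was cut off, pulling whisper files that were updated on the surviving side during the outage; the peer hostname, whisper path, and plain rsync-over-SSH transport are assumptions here rather than the actual production procedure, and a mirror-image run on the other host covers the "vice versa" half.

```python
# Rough sketch of the backfill: run on the host that was cut off, pulling
# whisper files updated on the surviving side during the outage. Peer name,
# whisper path and rsync-over-SSH are assumptions, not the production recipe.
import subprocess

PEER = "graphite2001.codfw.wmnet"          # surviving Graphite host (assumed)
WHISPER_DIR = "/var/lib/carbon/whisper/"   # assumed whisper storage location

# --update only transfers files whose copy on the peer is newer, so metrics
# written locally before the outage are left untouched; running the
# mirror-image command on the peer covers the other direction.
subprocess.run(
    ["rsync", "-a", "--update", f"{PEER}:{WHISPER_DIR}", WHISPER_DIR],
    check=True,
)
```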

Change 273199 merged by Filippo Giunchedi:
codfw: add statsd service entry

https://gerrit.wikimedia.org/r/273199

While working on T157022: Suspected faulty SSD on graphite1001, and in particular its DNS switchover, I noticed that not all services observe the DNS TTL of 1h.
Below is the list of statsd prefixes still coming in to graphite1001 several hours after flipping the DNS:

Hadoop
VisualEditor
aqs
changeprop
citoid
cxserver
eventlogging
eventstreams
frontend
gerrit
graphoid
kafka
kartotherian
mathoid
mobileapps
mw
nodepool
ocg
ores
parsoid
parsoid-tests
restbase
restbase-dev
restbase-test
tilerator
tileratorui
trendingedits
zookeeper
zuul

So basically either the connection is kept open on the client side and the name is never looked up again, or the applications cache DNS indefinitely.
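
For the first of those two failure modes, a hedged Python illustration: a "connected" UDP socket resolves statsd.eqiad.wmnet exactly once, so a later DNS flip is invisible for the lifetime of the process (the metric name is made up).

```python
# Illustration of the first case: a "connected" UDP socket resolves the name
# exactly once, so a DNS flip is never noticed for the lifetime of the process.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# The hostname is resolved here, once; every send() below reuses that IP,
# regardless of the 1h DNS TTL.
sock.connect(("statsd.eqiad.wmnet", 8125))

for value in range(3):
    sock.send(f"example.counter:{value}|c".encode())
```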

@godog did you consider restarting statsd on graphite1001 and seeing what reconnects there?

> So basically either the connection is kept open on the client side and the name is never looked up again, or the applications cache DNS indefinitely.

IIRC statsd DNS resolution for nodejs services was special-cased; otherwise, by default, nodejs or the statsd library would issue a DNS lookup every time a statsd UDP packet was sent out.

I'm not sure about the Java/jmxtrans case; IIRC Java used to cache DNS records indefinitely, but that was fixed in newer JVM releases.

Note that Python services (e.g. the webperf-related ones on hafnium) also didn't seem to pick up the DNS change.

The full list of restarts this time around is available from https://phabricator.wikimedia.org/T157022#3014109 onwards.

> @godog did you consider restarting statsd on graphite1001 and seeing what reconnects there?

All UDP traffic, so no dice :(

> IIRC statsd DNS resolution for nodejs services was special-cased; otherwise, by default, nodejs or the statsd library would issue a DNS lookup every time a statsd UDP packet was sent out.

Yes, IIRC we configured IPs directly to avoid this. We should replace that with short-term DNS caching, along the lines of T128015.
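
A minimal sketch of what short-term DNS caching could look like on the client side, assuming a hand-rolled resolver cache rather than any particular library; the TTL value and function names are made up for illustration.

```python
# Sketch of short-term DNS caching: re-resolve at most once every CACHE_TTL
# seconds instead of pinning an IP forever or looking the name up on every
# packet. TTL value and names are made up for illustration.
import socket
import time

_CACHE = {}       # hostname -> (resolved ip, expiry timestamp)
CACHE_TTL = 300   # seconds; short enough to pick up a DC failover quickly

def resolve(hostname):
    ip, expires = _CACHE.get(hostname, (None, 0.0))
    if time.monotonic() >= expires:
        ip = socket.gethostbyname(hostname)
        _CACHE[hostname] = (ip, time.monotonic() + CACHE_TTL)
    return ip

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"example.counter:1|c", (resolve("statsd.eqiad.wmnet"), 8125))
```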

GWicke claimed this task.

All services but Parsoid are now running the latest version of service-runner in production; Parsoid has scheduled its deploy for the coming week.

This means that we can call this task done.