Description

Metrics are vital for understanding the system, and should keep working in the case of a DC fail-over or outage. This means we need to figure out how to keep Graphite working in each DC, while also allowing frontends like Grafana to integrate information from multiple DCs.

Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
codfw: add statsd service entry | operations/dns | master | +1 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | GWicke | T127976 Graphite DC fail-over / per-DC setup
Resolved | | Pchelolo | T158338 Set up DNS caching for node services
Event Timeline
Change 273199 had a related patch set uploaded (by Filippo Giunchedi):
codfw: add statsd service entry
Currently Graphite is set up the same on graphite1001 and graphite2001: each machine can receive both statsd traffic and Graphite "line protocol" traffic, and will replicate both locally and to the other datacenter.

ATM all internal clients, including clients in codfw, are pointed to statsd.eqiad.wmnet, while externally Varnish proxies graphite.wikimedia.org to graphite1001:
```
$ git grep -i statsd.eqiad.wmnet
hieradata/codfw/mediawiki/jobrunner.yaml:statsd_server: statsd.eqiad.wmnet:8125
hieradata/common.yaml:statsd: statsd.eqiad.wmnet:8125
hieradata/common/ocg.yaml:statsd_host: "statsd.eqiad.wmnet"
hieradata/eqiad/mediawiki/jobrunner.yaml:statsd_server: statsd.eqiad.wmnet:8125
hieradata/role/common/aqs.yaml:aqs::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/logstash.yaml:role::logstash::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/maps.yaml:service::configuration::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/restbase.yaml:restbase::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/sca.yaml:service::configuration::statsd_host: statsd.eqiad.wmnet
hieradata/role/common/scb.yaml:service::configuration::statsd_host: statsd.eqiad.wmnet
manifests/role/eventlogging.pp: $statsd_host = hiera('eventlogging_statsd_host', 'statsd.eqiad.wmnet')
manifests/role/webperf.pp: $statsd_host = 'statsd.eqiad.wmnet'
modules/base/manifests/puppet.pp: statsd_host => 'statsd.eqiad.wmnet',
modules/eventlogging/manifests/service/reporter.pp:# StatsD host. Example: 'statsd.eqiad.wmnet'.
modules/eventlogging/manifests/service/reporter.pp:# host => 'statsd.eqiad.wmnet',
modules/puppet_statsd/README.md: statsd_host => 'statsd.eqiad.wmnet',
modules/puppet_statsd/manifests/init.pp:# statsd_host => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/kafka/webrequest.pp: logster_options => "-o statsd --statsd-host=statsd.eqiad.wmnet:8125 --metric-prefix=${graphite_metric_prefix}",
modules/role/manifests/cache/statsd.pp: statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/statsd.pp: statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/statsd.pp: statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/cache/upload.pp: statsd_server => 'statsd.eqiad.wmnet',
modules/role/manifests/zuul/configuration.pp: 'statsd_host' => 'statsd.eqiad.wmnet',
modules/scap/files/scap.cfg:statsd_host: statsd.eqiad.wmnet
modules/scap/files/scap.cfg:statsd_host: statsd.eqiad.wmnet
modules/scap/manifests/master.pp: $statsd_host = 'statsd.eqiad.wmnet',
modules/swift/manifests/stats/accounts.pp: $statsd_host = 'statsd.eqiad.wmnet',
modules/swift/manifests/stats/dispersion.pp: $statsd_host = 'statsd.eqiad.wmnet',
modules/swift/manifests/stats/stats_container.pp: $statsd_host = 'statsd.eqiad.wmnet',
modules/varnish/manifests/logging/media.pp:# statsd_server => 'statsd.eqiad.wmnet:8125
modules/varnish/manifests/logging/rls.pp:# statsd_server => 'statsd.eqiad.wmnet:8125
modules/varnish/manifests/logging/statsd.pp:# statsd_server => 'statsd.eqiad.wmnet:8125
modules/varnish/manifests/logging/xcps.pp:# statsd_server => 'statsd.eqiad.wmnet:8125
```
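For context, all of this statsd traffic is plain UDP datagrams in `name:value|type` form. A minimal sketch of what one of these clients puts on the wire, using Node's dgram module (the metric name is made up, and real clients usually go through a statsd library rather than raw sockets):

```typescript
// Send a single counter increment to the shared statsd endpoint.
import * as dgram from 'dgram';

const socket = dgram.createSocket('udp4');
const payload = Buffer.from('example.requests:1|c'); // "name:value|type"

// Passing the hostname here means Node resolves it (via dns.lookup) for the
// send; a client that resolves once up front keeps using the old IP instead.
socket.send(payload, 8125, 'statsd.eqiad.wmnet', (err) => {
  if (err) {
    console.error('statsd send failed:', err);
  }
  socket.close();
});
```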
Assuming for the first failover we keep eqiad reachable from codfw: once we fail clients over to statsd.codfw.wmnet, metrics aggregation will move from eqiad to codfw and the resulting metrics will still be pushed to both, IOW no backfilling is necessary on failover/failback.

However, if eqiad isn't reachable from codfw we'll have to fail over all (codfw only?) statsd clients to statsd.codfw.wmnet; when eqiad comes back we'll have to rsync all metrics updated in codfw to eqiad, and vice versa, to bring things back in sync.
While working on T157022: Suspected faulty SSD on graphite1001, and in particular its DNS switchover, it turned out that not all services observe the DNS TTL of 1h.

Below is the list of statsd prefixes still coming in to graphite1001 several hours after flipping the DNS:
Hadoop VisualEditor aqs changeprop citoid cxserver eventlogging eventstreams frontend gerrit graphoid kafka kartotherian mathoid mobileapps mw nodepool ocg ores parsoid parsoid-tests restbase restbase-dev restbase-test tilerator tileratorui trendingedits zookeeper zuul
So basically either the connection is kept open on the client side and the name is never looked up again, or the applications cache DNS indefinitely.

@godog did you consider restarting statsd on graphite1001 and seeing what reconnects there?
IIRC statsd DNS resolution for nodejs services was special-cased. Otherwise, by default nodejs (or the statsd library) would issue a DNS lookup every time a statsd UDP packet was sent out.

I'm not sure about the Java/jmxtrans case; IIRC Java used to cache DNS records indefinitely, but that was fixed in newer JVM releases.
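To make the trade-off concrete, here is a sketch of the two client behaviours being described, again with Node's dgram and dns modules (the helper names are made up and this is not the actual service code):

```typescript
import * as dgram from 'dgram';
import * as dns from 'dns';

const socket = dgram.createSocket('udp4');

// Option A: pass the hostname on every send. Node performs a dns.lookup()
// per datagram, so a DNS change is picked up, at the cost of a lookup for
// every metric.
function sendByName(payload: Buffer): void {
  socket.send(payload, 8125, 'statsd.eqiad.wmnet');
}

// Option B: resolve once at startup and keep the IP. No per-packet lookup,
// but the process keeps sending to the old address until it is restarted,
// which matches the behaviour observed after the DNS flip.
let resolvedAddress: string | undefined;
dns.lookup('statsd.eqiad.wmnet', (err, address) => {
  if (!err) resolvedAddress = address;
});

function sendByCachedIp(payload: Buffer): void {
  if (resolvedAddress) socket.send(payload, 8125, resolvedAddress);
}
```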
Also note that Python services (e.g. the webperf-related ones on hafnium) didn't seem to pick up the DNS change either.
The full list of restarts this time around is available from https://phabricator.wikimedia.org/T157022#3014109 onwards.
> @godog did you consider restarting statsd on graphite1001 and seeing what reconnects there?

All UDP traffic, so no dice :(

> IIRC statsd DNS resolution for nodejs services was special-cased. Otherwise, by default nodejs (or the statsd library) would issue a DNS lookup every time a statsd UDP packet was sent out.

Yes, IIRC we configured IPs to avoid this. We should replace this with short-term DNS caching, along the lines of T128015.
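For illustration only, a minimal sketch of what short-term DNS caching can look like on the client side; the actual change lives in service-runner, and the cache duration here is an arbitrary assumption:

```typescript
// Resolve through dns.promises.lookup() and keep the answer for a bounded
// time, so a DNS switch-over is picked up within the cache TTL without
// doing a lookup for every metric sent.
import { promises as dns } from 'dns';

const CACHE_TTL_MS = 60 * 1000; // assumed value, not the production setting
const cache = new Map<string, { address: string; expires: number }>();

async function cachedLookup(hostname: string): Promise<string> {
  const hit = cache.get(hostname);
  if (hit && hit.expires > Date.now()) {
    return hit.address;
  }
  const { address } = await dns.lookup(hostname);
  cache.set(hostname, { address, expires: Date.now() + CACHE_TTL_MS });
  return address;
}

// Usage sketch: resolve the endpoint before each send (or batch of sends):
// cachedLookup('statsd.eqiad.wmnet').then((ip) => { /* send to ip:8125 */ });
```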
All services but Parsoid are now running the latest version of service-runner in production; the Parsoid deploy is scheduled within the next week.
This means that we can call this task done.