
Improve graphite failover
Closed, ResolvedPublic

Description

Failing over graphite in T85909 and T157022 was neither fun nor fast. Doing it via DNS has several disadvantages, chiefly because clients are long-running and won't pick up DNS changes by themselves. I'm excluding service-runner based services, since DNS caching there is fixed by the latest service-runner version (T158338).

I'm listing here the services that needed manual restarts due to caching DNS records forever:

Event Timeline

fgiunchedi raised the priority of this task to Medium.
fgiunchedi updated the task description. (Show Details)
fgiunchedi added projects: acl*sre-team, Grafana.
fgiunchedi subscribed.
fgiunchedi renamed this task from improve graphite failover to Improve graphite failover. Feb 21 2017, 6:24 PM
fgiunchedi updated the task description. (Show Details)
hashar subscribed.

zuul (contint1001)
nodepool (labnodepool1001)

Both use python-statsd. They create a StatsClient, which caches the socket.gethostbyname() result.

Zuul embeds python-statsd 2.1.2 and creates a StatsClient, which is initialized with:

def __init__(self, host='localhost', port=8125, prefix=None,
             maxudpsize=512):
    """Create a new client."""
    self._addr = (socket.gethostbyname(host), port)
    self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

It then sends data with:

self._sock.sendto(data.encode('ascii'), self._addr)
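
For illustration only, a minimal sketch of a client that defers DNS resolution to send time (the class name and methods here are hypothetical, not python-statsd API) shows what would avoid the restart on failover:

import socket

class RefreshingStatsClient:
    """Hypothetical UDP statsd client that resolves DNS at send time."""

    def __init__(self, host='localhost', port=8125):
        # Keep the hostname instead of a pre-resolved IP.
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, stat, count=1):
        # Passing a hostname to sendto() makes the resolver run on each call,
        # so a changed DNS record is picked up (within its TTL) without a restart.
        data = '%s:%d|c' % (stat, count)
        self._sock.sendto(data.encode('ascii'), self._addr)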

Nodepool uses the python-statsd deb package 3.0.1.

I thought statsd.eqiad.wmnet pointed to a service IP that would be moved from host to host but DNS shows it is a CNAME to the graphite hosts.

Change 455805 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] lower statsd/carbon CNAMES TTL

https://gerrit.wikimedia.org/r/455805

Change 455805 merged by Filippo Giunchedi:
[operations/dns@master] lower statsd/carbon CNAMES TTL

https://gerrit.wikimedia.org/r/455805

Mentioned in SAL (#wikimedia-operations) [2018-10-17T18:57:14Z] <awight> Restarting ORES cluster to refresh DNS, T88997

<awight> godog: for future reference, here's the command I run on ores*, in serial, not parallel:
<awight> sudo service celery-ores-worker restart

Mentioned in SAL (#wikimedia-operations) [2018-10-17T19:09:56Z] <godog> roll-restart eventbus for statsd DNS change - T88997

Mentioned in SAL (#wikimedia-operations) [2018-10-17T19:19:06Z] <godog> restart zuul for statsd DNS change - T88997

Mentioned in SAL (#wikimedia-operations) [2018-10-17T19:26:09Z] <shdubsh> restart eventlogging for statsd DNS change - T88997

Mentioned in SAL (#wikimedia-operations) [2018-10-17T22:56:56Z] <awight> Restarting ORES uwsgi service for T88997

Change 468182 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Use the newer statsd name for ORES nodes

https://gerrit.wikimedia.org/r/468182

I thought statsd.eqiad.wmnet pointed to a service IP that would be moved from host to host but DNS shows it is a CNAME to the graphite hosts.

Since zuul doesn't seem to use/need global statsd aggregation (i.e. multiple hosts sending statsd data for the same metric), I was thinking we could sidestep the problem and run statsite locally, then have zuul send to localhost:8125 instead. What do you think?


Definitely, that sounds perfect :] Thanks for noticing my edit. The statsd host is mentioned in the hieradata for role::ci::master:

hieradata/role/common/ci/master.yaml
profile::zuul::server::conf:
    # ferm defaults to ACCEPT on loopback:
    gearman_server: 127.0.0.1
    config_git_branch: master
    gearman_server_start: true
    # FIXME use a lookup?
    statsd_host: statsd.eqiad.wmnet   # <--------- [ EASY CHANGE ] ------------
    url_pattern: 'https://integration.wikimedia.org/ci/job/{job.name}/{build.number}/console'
    status_url: 'https://integration.wikimedia.org/zuul/'

So probably we just need to add the statsite profile to the role modules/role/manifests/ci/master.pp, restart Zuul and call it done?


Yes, that's it: include role::statsite and switch zuul to localhost:8125. It's unlikely I'll have time to follow up this quarter, but I'm happy to review patches!
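
Roughly, the change would look something like this (a sketch only, not the actual patch; it assumes role::statsite sets up a local listener on 127.0.0.1:8125):

# modules/role/manifests/ci/master.pp (sketch)
class role::ci::master {
    # ...existing resources...
    include ::role::statsite   # local statsd aggregator listening on 127.0.0.1:8125
}

# hieradata/role/common/ci/master.yaml (sketch)
profile::zuul::server::conf:
    statsd_host: localhost   # was statsd.eqiad.wmnet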

Change 468182 merged by Cwhite:
[operations/puppet@production] Use the newer statsd name for ORES nodes

https://gerrit.wikimedia.org/r/468182

Mentioned in SAL (#wikimedia-operations) [2018-10-18T18:19:04Z] <awight> Restarting ORES services for T88997

@fgiunchedi can we add the statsd proxy for the servers running Zuul? My previous comment above T88997#4676750 has all the relevant bits. It should be straightforward to set up.

We did a similar thing on the CI hosts for SMTP, they all talk to localhost and the actual routing is handled via the puppet global config :)

@fgiunchedi can we add the statsd proxy for the servers running Zuul? My previous comment above T88997#4676750 has all the relevant bits. It should be straightforward to set up.

Yes we can! It should be as easy as include role::statsite where appropriate (it should really be a profile, but that's not in scope here) and then pointing zuul to localhost:8125 instead. This works as expected (i.e. the statsd aggregation happens on localhost instead of on the graphite host) unless the service sends the same metric from multiple hosts, which afaics isn't the case for zuul.

Also to clarify: statsite isn't statsd-proxy; they are related but different, and we want the former in this case.

We did a similar thing on the CI hosts for SMTP, they all talk to localhost and the actual routing is handled via the puppet global config :)

Indeed that's the idea!

Change 474128 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] ci: use statsite for localhost statsd aggregation

https://gerrit.wikimedia.org/r/474128

Change 474128 merged by Filippo Giunchedi:
[operations/puppet@production] ci: use statsite for localhost statsd aggregation

https://gerrit.wikimedia.org/r/474128

Zuul now emits stats to localhost, which has statsite running :)

hashar removed a project: Zuul.
hashar unsubscribed.

Dropping Zuul. Filippo proposed a nice fixup (statsite). Nodepool is gone. I am unsubscribing. Thank you @godog!

fgiunchedi claimed this task.

Resolving since graphite failover nowadays is much better and documented at https://wikitech.wikimedia.org/wiki/Graphite#Failover