Improve graphite failover
Open, Medium, Public

Description

Failing over graphite in T85909 and T157022 was neither fun nor fast. Doing it via DNS has several disadvantages, mostly because clients are long-running and won't pick up DNS changes by themselves. I'm excluding service-runner based services since DNS caching there is fixed by the latest service-runner version (T158338).

I'm listing here the services that needed manual restarts due to caching DNS records forever:

Details

Related Gerrit Patches:
operations/puppet : production ci: use statsite for localhost statsd aggregation
operations/puppet : production Use the newer statsd name for ORES nodes
operations/dns : master lower statsd/carbon CNAMES TTL

Event Timeline

fgiunchedi raised the priority of this task from to Medium.
fgiunchedi updated the task description.
fgiunchedi added projects: acl*sre-team, Graphite.
fgiunchedi added a subscriber: fgiunchedi.
Restricted Application added a subscriber: Aklapper. Feb 9 2015, 4:20 PM
Restricted Application added a subscriber: Matanya. Jul 22 2015, 9:11 AM
fgiunchedi renamed this task from improve graphite failover to Improve graphite failover. Feb 21 2017, 6:24 PM
fgiunchedi updated the task description.
hashar added a subscriber: hashar.

zuul (contint1001)
nodepool (labnodepool1001)

Both use python-statsd. They create a StatsClient, which caches the result of socket.gethostbyname().

Zuul embeds python-statsd 2.1.2. It creates a StatsClient, which is initialized with:

def __init__(self, host='localhost', port=8125, prefix=None,
             maxudpsize=512):
    """Create a new client."""
    self._addr = (socket.gethostbyname(host), port)
    self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

Data is then sent with:

self._sock.sendto(data.encode('ascii'), self._addr)
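To illustrate why the cached lookup matters, here is a minimal sketch of a client that re-resolves the hostname after a configurable interval. This is not how python-statsd works; the class name `ReresolvingStatsClient` and the `resolve_interval`, `resolver` and `clock` parameters are made up for illustration.

```python
import socket
import time


class ReresolvingStatsClient:
    """Hypothetical UDP statsd client that refreshes its DNS lookup."""

    def __init__(self, host="localhost", port=8125, resolve_interval=60.0,
                 resolver=socket.gethostbyname, clock=time.monotonic):
        self._host = host
        self._port = port
        self._resolve_interval = resolve_interval  # seconds between lookups
        self._resolver = resolver  # injectable so it can be faked in tests
        self._clock = clock
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._addr = None
        self._resolved_at = None

    def _current_addr(self):
        """Return (ip, port), resolving again once the cached entry is stale."""
        now = self._clock()
        stale = (self._resolved_at is None
                 or now - self._resolved_at >= self._resolve_interval)
        if stale:
            try:
                self._addr = (self._resolver(self._host), self._port)
                self._resolved_at = now
            except OSError:
                # Resolution failed: keep the last known address if we have one.
                if self._addr is None:
                    raise
        return self._addr

    def send(self, data):
        """Send one statsd line to whatever the host currently resolves to."""
        self._sock.sendto(data.encode("ascii"), self._current_addr())
```

With something along these lines, a long-running process would follow a CNAME change within `resolve_interval` seconds instead of needing a restart.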

Nodepool uses the python-statsd deb package, version 3.0.1.

I thought statsd.eqiad.wmnet pointed to a service IP that would be moved from host to host but DNS shows it is a CNAME to the graphite hosts.

ema added a subscriber: ema. Jul 21 2017, 3:46 PM
fgiunchedi updated the task description. Dec 20 2017, 10:22 AM
hashar updated the task description. Dec 21 2017, 8:36 PM
fgiunchedi updated the task description. Aug 28 2018, 9:20 AM

Nodepool is legacy / will be gone.

Change 455805 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] lower statsd/carbon CNAMES TTL

https://gerrit.wikimedia.org/r/455805

Change 455805 merged by Filippo Giunchedi:
[operations/dns@master] lower statsd/carbon CNAMES TTL

https://gerrit.wikimedia.org/r/455805

Mentioned in SAL (#wikimedia-operations) [2018-10-17T18:57:14Z] <awight> Restarting ORES cluster to refresh DNS, T88997

<awight> godog: for future reference, here's the command I run on ores*, in serial, not parallel:
<awight> sudo service celery-ores-worker restart

Mentioned in SAL (#wikimedia-operations) [2018-10-17T19:09:56Z] <godog> roll-restart eventbus for statsd DNS change - T88997

Mentioned in SAL (#wikimedia-operations) [2018-10-17T19:19:06Z] <godog> restart zuul for statsd DNS change - T88997

Mentioned in SAL (#wikimedia-operations) [2018-10-17T19:26:09Z] <shdubsh> restart eventlogging for statsd DNS change - T88997

hashar updated the task description. Oct 17 2018, 7:47 PM

Mentioned in SAL (#wikimedia-operations) [2018-10-17T22:56:56Z] <awight> Restarting ORES uwsgi service for T88997

Change 468182 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Use the newer statsd name for ORES nodes

https://gerrit.wikimedia.org/r/468182

fgiunchedi updated the task description. Oct 18 2018, 7:59 AM

I thought statsd.eqiad.wmnet pointed to a service IP that would be moved from host to host but DNS shows it is a CNAME to the graphite hosts.

Since zuul doesn't seem to use/need global statsd aggregation (i.e. multiple hosts send statsd data for the same metric) I was thinking we could sidestep the problem and run statsite locally then have zuul send to localhost:8125 instead. What do you think?
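As a rough sketch of what "send to localhost:8125" means on the wire, the snippet below formats a counter in the plain statsd line protocol and sends it over UDP to a fixed loopback address, so no DNS lookup is involved. The helper names and the metric name in the usage note are made up for illustration; they are not taken from Zuul.

```python
import socket


def format_counter(name, value=1):
    """Render a counter in the statsd line protocol: <name>:<value>|c"""
    return f"{name}:{value}|c"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one statsd line over UDP to a local aggregator such as statsite."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `send_metric(format_counter("zuul.example.count"))` would emit `zuul.example.count:1|c` to the local aggregator, which then forwards the aggregated values on.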

Definitely, that sounds perfect :] Thank you for noticing my edit. The statsd host is mentioned in the hieradata for role::ci::master:

hieradata/role/common/ci/master.yaml
profile::zuul::server::conf:
    # ferm defaults to ACCEPT on loopback:
    gearman_server: 127.0.0.1
    config_git_branch: master
    gearman_server_start: true
    # FIXME use a lookup?
    statsd_host: statsd.eqiad.wmnet   # <--------- [ EASY CHANGE ] ------------
    url_pattern: 'https://integration.wikimedia.org/ci/job/{job.name}/{build.number}/console'
    status_url: 'https://integration.wikimedia.org/zuul/'

So probably we just need to add the statsite profile to the role modules/role/manifests/ci/master.pp, restart Zuul and call it done?

Yes, that's it: include role::statsite and switch zuul to localhost:8125. It's unlikely I'll have time to follow up this quarter, but I'm happy to review patches!

Change 468182 merged by Cwhite:
[operations/puppet@production] Use the newer statsd name for ORES nodes

https://gerrit.wikimedia.org/r/468182

Mentioned in SAL (#wikimedia-operations) [2018-10-18T18:19:04Z] <awight> Restarting ORES services for T88997

fgiunchedi updated the task description. Nov 13 2018, 10:51 AM
fgiunchedi updated the task description. Nov 14 2018, 9:17 AM

@fgiunchedi can we add the statsd proxy for the servers running Zuul? My previous comment above T88997#4676750 has all the relevant bits. It should be straightforward to set up.

We did a similar thing on the CI hosts for SMTP, they all talk to localhost and the actual routing is handled via the puppet global config :)

@fgiunchedi can we add the statsd proxy for the servers running Zuul? My previous comment above T88997#4676750 has all the relevant bits. It should be straightforward to set up.

Yes we can! It should be as easy as including role::statsite where appropriate (it should really be a profile, but that is not in scope here) and then pointing zuul to localhost:8125 instead. This works as expected (i.e. the statsd aggregation happens on localhost instead of the graphite host) unless the service sends the same metric from multiple hosts, which afaics isn't the case for zuul.

Also to clarify: statsite isn't statsd-proxy, they are related but different and we want the former in this case.
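The caveat about the same metric coming from multiple hosts can be shown with a toy model. The sketch below assumes the storage backend keeps a single value per metric per flush interval, so a later write replaces an earlier one; `aggregate` and `store` are illustrative helpers, not statsite or graphite code, and the metric name is made up.

```python
from collections import defaultdict


def aggregate(lines):
    """Sum statsd counter lines per metric, as one aggregator does per flush."""
    totals = defaultdict(int)
    for line in lines:
        name, rest = line.split(":", 1)
        value, kind = rest.split("|", 1)
        if kind == "c":
            totals[name] += int(value)
    return dict(totals)


def store(flushes):
    """Toy backend: one value per metric per interval, last write wins."""
    stored = {}
    for flush in flushes:
        stored.update(flush)
    return stored


host_a = ["jobs.done:3|c"]
host_b = ["jobs.done:4|c"]

# One central aggregator sees both hosts and flushes the true total.
central = store([aggregate(host_a + host_b)])  # {"jobs.done": 7}

# Two local aggregators flush the same metric name; one result is lost.
local = store([aggregate(host_a), aggregate(host_b)])  # {"jobs.done": 4}
```

When only one host emits a given metric name, the local and central results are identical, which is why local aggregation is safe in that case.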

We did a similar thing on the CI hosts for SMTP, they all talk to localhost and the actual routing is handled via the puppet global config :)

Indeed that's the idea!

Change 474128 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] ci: use statsite for localhost statsd aggregation

https://gerrit.wikimedia.org/r/474128

hashar updated the task description. Nov 16 2018, 12:49 PM

Change 474128 merged by Filippo Giunchedi:
[operations/puppet@production] ci: use statsite for localhost statsd aggregation

https://gerrit.wikimedia.org/r/474128

hashar updated the task description. Nov 16 2018, 3:27 PM

Zuul now emits stats to localhost which has statsite running :)

hashar moved this task from Backlog to Enhancements on the Zuul board. Nov 26 2018, 10:54 PM
hashar removed a project: Zuul.
hashar removed a subscriber: hashar.

Dropping Zuul. Filippo proposed a nice fixup (statsite). Nodepool is gone. I am unsubscribing. Thank you @godog!

fgiunchedi updated the task description. Sep 6 2019, 10:45 AM