
Improve graphite failover
Closed, ResolvedPublic

Description

Failing over graphite in T85909 and T157022 was neither fun nor fast. Doing it via DNS has several disadvantages, chiefly because clients are long-running and won't pick up DNS changes by themselves. I'm excluding service-runner based services, since DNS caching there is fixed by the latest service-runner version (T158338).

I'm listing here the services that needed manual restarts due to caching DNS records forever:

Event Timeline

fgiunchedi raised the priority of this task to Medium.
fgiunchedi updated the task description. (Show Details)
fgiunchedi added projects: acl*sre-team, Grafana.
fgiunchedi subscribed.
fgiunchedi renamed this task from improve graphite failover to Improve graphite failover. Feb 21 2017, 6:24 PM
fgiunchedi updated the task description. (Show Details)
hashar subscribed.

zuul (contint1001)
nodepool (labnodepool1001)

Both use python-statsd. They create a StatsClient, which caches the socket.gethostbyname() result.

Zuul embeds python-statsd 2.1.2 and creates a StatsClient, which is initialized with:

def __init__(self, host='localhost', port=8125, prefix=None,
             maxudpsize=512):
    """Create a new client."""
    self._addr = (socket.gethostbyname(host), port)
    self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

It then sends data with:

self._sock.sendto(data.encode('ascii'), self._addr)
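
For illustration only, a minimal sketch of a client that defers DNS resolution to send time (the class name and methods here are hypothetical, not python-statsd API) shows what would avoid the restart on failover:

import socket

class RefreshingStatsClient:
    """Hypothetical UDP statsd client that resolves DNS at send time."""

    def __init__(self, host='localhost', port=8125):
        # Keep the hostname instead of a pre-resolved IP.
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, stat, count=1):
        # Passing a hostname to sendto() makes the resolver run on each call,
        # so a changed DNS record is picked up (within its TTL) without a restart.
        data = '%s:%d|c' % (stat, count)
        self._sock.sendto(data.encode('ascii'), self._addr)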

Nodepool uses the python-statsd deb package 3.0.1.

I thought statsd.eqiad.wmnet pointed to a service IP that would be moved from host to host but DNS shows it is a CNAME to the graphite hosts.

Change 455805 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] lower statsd/carbon CNAMES TTL

https://gerrit.wikimedia.org/r/455805

Change 455805 merged by Filippo Giunchedi:
[operations/dns@master] lower statsd/carbon CNAMES TTL

https://gerrit.wikimedia.org/r/455805

Mentioned in SAL (#wikimedia-operations) [2018-10-17T18:57:14Z] <awight> Restarting ORES cluster to refresh DNS, T88997

<awight> godog: for future reference, here's the command I run on ores*, in serial, not parallel:
<awight> sudo service celery-ores-worker restart

Mentioned in SAL (#wikimedia-operations) [2018-10-17T19:09:56Z] <godog> roll-restart eventbus for statsd DNS change - T88997

Mentioned in SAL (#wikimedia-operations) [2018-10-17T19:19:06Z] <godog> restart zuul for statsd DNS change - T88997

Mentioned in SAL (#wikimedia-operations) [2018-10-17T19:26:09Z] <shdubsh> restart eventlogging for statsd DNS change - T88997

Mentioned in SAL (#wikimedia-operations) [2018-10-17T22:56:56Z] <awight> Restarting ORES uwsgi service for T88997

Change 468182 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] Use the newer statsd name for ORES nodes

https://gerrit.wikimedia.org/r/468182

I thought statsd.eqiad.wmnet pointed to a service IP that would be moved from host to host but DNS shows it is a CNAME to the graphite hosts.

Since zuul doesn't seem to use/need global statsd aggregation (i.e. multiple hosts sending statsd data for the same metric), I was thinking we could sidestep the problem and run statsite locally, then have zuul send to localhost:8125 instead. What do you think?


Definitely, that sounds perfect :] Thanks for noticing my edit. The statsd host is mentioned in the hieradata for role::ci::master:

hieradata/role/common/ci/master.yaml
profile::zuul::server::conf:
    # ferm defaults to ACCEPT on loopback:
    gearman_server: 127.0.0.1
    config_git_branch: master
    gearman_server_start: true
    # FIXME use a lookup?
    statsd_host: statsd.eqiad.wmnet   # <--------- [ EASY CHANGE ] ------------
    url_pattern: 'https://integration.wikimedia.org/ci/job/{job.name}/{build.number}/console'
    status_url: 'https://integration.wikimedia.org/zuul/'

So probably we just need to add the statsite profile to the role modules/role/manifests/ci/master.pp, restart Zuul and call it done?


Yes, that's it: include role::statsite and switch zuul to localhost:8125. It's unlikely I'll have time to follow up this quarter, but I'm happy to review patches!
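
Roughly, the change would look something like this (a sketch only, not the actual patch; it assumes role::statsite sets up a local listener on 127.0.0.1:8125):

# modules/role/manifests/ci/master.pp (sketch)
class role::ci::master {
    # ...existing resources...
    include ::role::statsite   # local statsd aggregator listening on 127.0.0.1:8125
}

# hieradata/role/common/ci/master.yaml (sketch)
profile::zuul::server::conf:
    statsd_host: localhost   # was statsd.eqiad.wmnet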

Change 468182 merged by Cwhite:
[operations/puppet@production] Use the newer statsd name for ORES nodes

https://gerrit.wikimedia.org/r/468182

Mentioned in SAL (#wikimedia-operations) [2018-10-18T18:19:04Z] <awight> Restarting ORES services for T88997

@fgiunchedi can we add the statsd proxy for the servers running Zuul? My previous comment above T88997#4676750 has all the relevant bits. It should be straightforward to set up.

We did a similar thing on the CI hosts for SMTP, they all talk to localhost and the actual routing is handled via the puppet global config :)

@fgiunchedi can we add the statsd proxy for the servers running Zuul? My previous comment above T88997#4676750 has all the relevant bits. It should be straightforward to set up.

Yes we can! It should be as easy as include role::statsite where appropriate (it should really be a profile, but that's not in scope here) and then pointing zuul to localhost:8125 instead. This works as expected (i.e. the statsd aggregation happens on localhost instead of on the graphite host) unless the service sends the same metric from multiple hosts, which afaics isn't the case for zuul.

Also to clarify: statsite isn't statsd-proxy; they are related but different, and we want the former in this case.

We did a similar thing on the CI hosts for SMTP, they all talk to localhost and the actual routing is handled via the puppet global config :)

Indeed that's the idea!

Change 474128 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] ci: use statsite for localhost statsd aggregation

https://gerrit.wikimedia.org/r/474128

Change 474128 merged by Filippo Giunchedi:
[operations/puppet@production] ci: use statsite for localhost statsd aggregation

https://gerrit.wikimedia.org/r/474128

Zuul now emits stats to localhost, which has statsite running :)

hashar removed a project: Zuul.
hashar unsubscribed.

Dropping Zuul. Filippo proposed a nice fixup (statsite). Nodepool is gone. I am unsubscribing. Thank you @godog!

fgiunchedi claimed this task.

Resolving since graphite failover nowadays is much better and documented at https://wikitech.wikimedia.org/wiki/Graphite#Failover