
switch diamond to use graphite line protocol
Closed, Resolved (Public)

Description

for historical reasons diamond sends its samples via statsd every 60s, which matches the statsd flush interval. furthermore we're mostly using gauges, so we use none of statsd's aggregation capabilities. switching to the graphite line protocol will relieve some pressure on statsd and possibly reduce diamond metrics lag too.

  • audit for non-gauge metrics
    • so far nginx and tcp collector send counters
  • (incrementally) switch diamond to report to graphite and not statsd
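
To make the difference between the two wire formats concrete, here is a minimal sketch in Python. It is not taken from diamond's handlers. The function names and the metric name in the usage note are made up for illustration. It assumes the standard statsd gauge payload (`name:value|g`) and the graphite plaintext format (`path value timestamp`, newline-terminated).

```python
import time
from typing import Optional


def statsd_gauge(name: str, value: float) -> str:
    """Format one sample as a statsd gauge payload (normally sent over UDP)."""
    return f"{name}:{value}|g"


def graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one sample as a graphite plaintext line: 'path value timestamp'."""
    if timestamp is None:
        # The sender supplies the timestamp itself; default to "now".
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"
```

For a hypothetical metric `servers.host1.loadavg.01` with value 0.42, the first function yields `servers.host1.loadavg.01:0.42|g` and the second yields `servers.host1.loadavg.01 0.42 <unix timestamp>`. With the line protocol the sample carries its own timestamp and is stored as sent, with no intermediate aggregation step.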

Event Timeline

fgiunchedi raised the priority of this task from to Medium.
fgiunchedi updated the task description.
fgiunchedi added projects: SRE, observability.
fgiunchedi subscribed.

on second thought, we should be fine with collectors that send counters too: graphite will store the number as is, and there shouldn't be a change in semantics, since diamond pushes once a minute and statsite also flushes once a minute

I've temporarily switched diamond to graphite on filippo-test-trusty.eqiad.wmflabs to see the net effect it has (if any) on the metrics

Change 268360 had a related patch set uploaded (by Filippo Giunchedi):
diamond: send labs instance metrics via graphite/carbon

https://gerrit.wikimedia.org/r/268360

I would like to keep the ability to send timers

Change 268360 merged by Filippo Giunchedi:
diamond: send labs instance metrics via graphite/carbon

https://gerrit.wikimedia.org/r/268360

merged this now for labs instances. metrics from the tcp collector will change type, as counters sent via graphite don't get derived metrics like they do in statsd:

309088 Mar 22 14:24 /srv/carbon/whisper/toolserver-legacy/relic/tcp/ListenOverflows.wsp
309088 Mar 22 12:27 /srv/carbon/whisper/toolserver-legacy/relic/tcp/ListenOverflows/lower.wsp
309088 Mar 22 12:29 /srv/carbon/whisper/toolserver-legacy/relic/tcp/ListenOverflows/sum.wsp
309088 Mar 22 12:26 /srv/carbon/whisper/toolserver-legacy/relic/tcp/ListenOverflows/count.wsp
309088 Mar 22 12:30 /srv/carbon/whisper/toolserver-legacy/relic/tcp/ListenOverflows/rate.wsp
309088 Mar 22 12:27 /srv/carbon/whisper/toolserver-legacy/relic/tcp/ListenOverflows/upper.wsp
309088 Mar 22 12:34 /srv/carbon/whisper/toolserver-legacy/relic/tcp/ListenOverflows/mean.wsp
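
As a rough aid for spotting which series changed type, the sketch below splits whisper paths into plain series and statsd-derived ones. It is only an illustration: the suffix set is taken from the listing above and may not be the full set statsd/statsite writes, the function name is hypothetical, and a plain metric that happens to be named like one of the suffixes would be misclassified.

```python
from pathlib import PurePosixPath

# Suffixes seen in the listing above (an assumption, not an exhaustive list).
DERIVED_SUFFIXES = {"lower", "upper", "sum", "count", "rate", "mean"}


def split_series(paths):
    """Return (plain, derived) lists of whisper paths, keeping input order."""
    plain, derived = [], []
    for raw in paths:
        # The stem is the file name without the '.wsp' extension.
        stem = PurePosixPath(raw).stem
        (derived if stem in DERIVED_SUFFIXES else plain).append(raw)
    return plain, derived
```

Applied to the seven paths above, `ListenOverflows.wsp` ends up in the plain list and the other six in the derived list.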

there is also a corresponding drop in incoming udp datagrams on labmon1001

2016-03-22-143249_688x335_scrot.png (335×688 px, 44 KB)

Change 281622 had a related patch set uploaded (by Filippo Giunchedi):
diamond: send production traffic via graphite line protocol

https://gerrit.wikimedia.org/r/281622

Change 281622 merged by Filippo Giunchedi:
diamond: send production traffic via graphite line protocol

https://gerrit.wikimedia.org/r/281622

this has been deployed. the impact on statsd seems to have been minimal in terms of udp packets/drops, although the move also meant switching diamond's traffic to TCP

reopening, as there seem to be missing ACLs from hosts with public IPs towards graphite1001:2003 tcp (i.e. carbon)
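
A quick way to check reachability from an affected host is a plain TCP connect to the carbon port. The following is a minimal sketch using only Python's standard library. The function name and the timeout default are arbitrary choices for illustration. A failed connect only shows that something on the path is blocking or refusing the connection, not which layer is responsible.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `can_reach("graphite1001", 2003)` run on a sending host would return False if its traffic is being dropped or rejected (the host name may need to be fully qualified in practice).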

Change 282706 had a related patch set uploaded (by Filippo Giunchedi):
graphite: permit line protocol traffic from ALL_NETWORKS

https://gerrit.wikimedia.org/r/282706

Change 282706 merged by Filippo Giunchedi:
graphite: permit line protocol traffic from ALL_NETWORKS

https://gerrit.wikimedia.org/r/282706

as it turns out, it wasn't ACLs but iptables; now fixed