
investigate carbon-c-relay stalls/drops towards graphite2002
Closed, Declined · Public

Description

As of last week, following T143889, we have expanded to two graphite machines per datacenter. This required a new carbon-c-relay configuration to split metrics prefixed with cassandra. off to a different set of machines.
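For illustration, a split like that is expressed in carbon-c-relay's own config syntax roughly as below; the cluster names, hostnames, ports and hashing settings are placeholders, not the production values:

```
# sketch only: hostnames, ports and hashing settings are placeholders
cluster cassandra
    carbon_ch replication 1
        graphite2002.codfw.wmnet:1903
    ;

cluster default
    carbon_ch replication 1
        graphite2001.codfw.wmnet:1903
    ;

# route cassandra.* to the dedicated cluster, everything else falls through
match ^cassandra\.
    send to cassandra
    stop
    ;

match *
    send to default
    ;
```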

Upon applying the new configuration I noticed carbon-c-relay was crashing periodically with a buffer overflow. The version we were running on trusty, v0.39 (2015-04-21), was ancient both in general and compared to the one on graphite1003 / graphite2002, which run jessie and carbon-c-relay 1.11 from jessie-backports; 1.11 has been stable.

From time to time, though, I've observed metrics backing up while sending to graphite2002, to the point of filling up the queue on carbon-c-relay and dropping some metrics; graphite1003 doesn't exhibit the same queuing problem: https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?from=1463016445905&to=1463405855052

Related Objects

Event Timeline

I've also noticed this seems to correlate with an elevated number of established tcp connections to the graphite machines.
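(For reference, a quick way to track that count over time is something along these lines; the port filter assumes the default plaintext carbon port, tcp/2003:)

```
# count established TCP connections to the relay's plaintext port (tcp/2003)
ss -tan state established '( sport = :2003 )' | wc -l
```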

I've been dumping socket state with ss every 30m to track this down further. The diff between 2016-05-16T11:57:12+0000 and 2016-05-16T13:27:12+0000 at P3108 is interesting: it shows several connections with many bytes in Recv-Q towards tcp/2003, so carbon-c-relay is possibly getting overwhelmed with metrics.
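The dumps are just periodic ss snapshots along these lines; the output path and interval here are illustrative:

```
# snapshot per-socket Recv-Q/Send-Q for tcp/2003, e.g. every 30 minutes from cron
ss -tan '( sport = :2003 )' > /var/tmp/ss-2003-$(date --utc +%Y%m%dT%H%M%S).txt
```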

Mentioned in SAL [2016-05-17T13:24:40Z] <godog> bounce carbon/frontend-relay on graphite1001 to increase queue size T135385
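(For context, the per-server queue size in carbon-c-relay is controlled by its -q flag; the value and config path below are placeholders, not what was actually applied:)

```
# illustrative invocation only; the config path and queue size are placeholders
carbon-c-relay -f /etc/carbon/frontend-relay.conf -q 500000
```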

Still looking into what might be causing this with such regular periodicity; it looks like it may track our daily traffic patterns:

2016-05-18-162858_755x317_scrot.png (317×755 px, 66 KB)

2016-05-18-162913_734x648_scrot.png (648×734 px, 86 KB)

It looks like metric sends from cassandra-metrics-collector keep piling up as they get stalled and never time out. @Eevans and I have been looking at it and a fix is in the works.
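(The general shape of such a fix is to stop the collector from blocking indefinitely on a stalled connection, e.g. by putting a bounded queue between collection and sending so backpressure turns into local drops rather than unbounded pile-up. The Java sketch below is only a hypothetical illustration of that pattern, not the actual cassandra-metrics-collector change:)

```
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the real cassandra-metrics-collector code:
// decouple collection from sending with a bounded queue so a stalled
// TCP connection to the relay causes local drops instead of piling up.
public class BoundedMetricBuffer {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    /** Called by the collection side; drops the metric if the sender is stalled. */
    public boolean enqueue(String metricLine) {
        return queue.offer(metricLine); // never blocks; false == dropped
    }

    /** Called by the sender thread; waits briefly so shutdown stays responsive. */
    public String next() throws InterruptedException {
        return queue.poll(500, TimeUnit.MILLISECONDS);
    }
}
```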

Mentioned in SAL [2016-05-20T15:03:06Z] <godog> roll-restart cassandra-metrics-collector in codfw for T135385

Mentioned in SAL [2016-05-20T15:16:07Z] <godog> roll-restart cassandra-metrics-collector in eqiad for T135385

Mentioned in SAL [2016-05-23T11:20:12Z] <godog> deploy new version of cassandra-metrics-collector T135385

Mentioned in SAL [2016-05-24T09:51:51Z] <godog> reenable puppet on graphite1001 T135385

With the cassandra-metrics-collector changes deployed I haven't yet seen a recurrence of full queues and drops/stalls, though there seem to be some very low drops towards graphite2001 even though its queue isn't full. In any case we'll know more after 24/48h.

Change 290488 had a related patch set uploaded (by Eevans):
By-pass graphite-in; Use graphite1003 directly

https://gerrit.wikimedia.org/r/290488

Change 290488 merged by Filippo Giunchedi:
By-pass graphite-in; Use graphite1003 directly

https://gerrit.wikimedia.org/r/290488

@fgiunchedi: The patch in Gerrit has been merged. Can this task be resolved (via Add Action...Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee.

Yes, resolvable; graphite is on its way out eventually.