As of last week, following T143889, we have expanded to two graphite machines per datacenter. This required a new carbon-c-relay configuration to split metrics prefixed with "cassandra." off to a different set of machines.
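For illustration, a prefix split of this kind might look roughly like the following in carbon-c-relay's configuration syntax. This is a minimal sketch, not the configuration that was actually deployed: the cluster names, hostnames, and port 2003 are placeholders I am assuming for the example.

```
# Hypothetical cluster names and hosts; not the production configuration.
cluster default_graphite
    forward
        graphite-a.example.org:2003
    ;

cluster cassandra_graphite
    forward
        graphite-b.example.org:2003
    ;

# Send cassandra.* metrics to their own cluster and stop further matching.
match ^cassandra\.
    send to cassandra_graphite
    stop
    ;

# Everything else goes to the default cluster.
match *
    send to default_graphite
    ;
```

The "stop" keyword keeps cassandra metrics from also falling through to the catch-all rule, so each metric ends up on exactly one set of machines.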
Upon applying the new configuration I noticed carbon-c-relay was crashing periodically with a buffer overflow. The version we were running on trusty, v0.39 (2015-04-21), was ancient, both in general and compared to the one on graphite1003 / graphite2002, which run jessie with carbon-c-relay 1.11 from jessie-backports. 1.11 has been stable.
That said, from time to time I've observed metrics backing up while sending to graphite2002, to the point where the queue on carbon-c-relay fills up and it starts dropping some metrics. graphite1003 doesn't exhibit the same queuing problem: https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?from=1463016445905&to=1463405855052