statsd counter `count` field aggregates as rate, does not actually count
Closed, Resolved · Public

Description

The aggregation for the count field in counter values appears to be a rate rather than the actual count:

https://graphite.wikimedia.org/render/?width=588&height=310&_salt=1439050647.036&target=restbase.requests.public.count

This particular counter is incremented by one, so sample_rate is basically the same graph, scaled to 1/s instead of 1/aggregation interval.

https://graphite.wikimedia.org/render/?width=588&height=310&_salt=1439050647.036&target=restbase.requests.public.rate

GWicke updated the task description. (Show Details)
GWicke raised the priority of this task from to Normal.
GWicke added a project: Graphite.
GWicke added subscribers: GWicke, fgiunchedi, ori.
Restricted Application added a subscriber: Aklapper. · Aug 8 2015, 4:22 PM

That's by design: counters are reset at each flush interval, see https://github.com/etsy/statsd/blob/master/stats.js#L105
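To make the reset behaviour concrete, here is a minimal Python sketch of a counter that is flushed and zeroed every interval. It is a toy model, not statsd's actual code: the class name `FlushingCounter` and the 10-second flush interval are assumptions for illustration. The `rate` is simply the per-interval count divided by the interval length in seconds.

```python
class FlushingCounter:
    """Toy model of a statsd-style counter (hypothetical, for illustration)."""

    def __init__(self, flush_interval_s=10.0):
        # Assumed flush interval; real deployments configure their own.
        self.flush_interval_s = flush_interval_s
        self.value = 0

    def increment(self, by=1):
        self.value += by

    def flush(self):
        """Return the metrics for this interval, then reset the counter."""
        count = self.value
        rate = count / self.flush_interval_s  # events per second
        self.value = 0  # reset: the next interval starts from zero
        return {"count": count, "rate": rate}


counter = FlushingCounter(flush_interval_s=10.0)
flushed = []
for events in (30, 50, 20):  # events seen in three consecutive intervals
    for _ in range(events):
        counter.increment()
    flushed.append(counter.flush())

counts = [f["count"] for f in flushed]
rates = [f["rate"] for f in flushed]
print(counts)  # [30, 50, 20]
print(rates)   # [3.0, 5.0, 2.0]
```

In this model the `count` series has the same shape as the `rate` series, scaled by the flush interval, and neither is a running total.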

@fgiunchedi, I guess the idea would be to aggregate a sum with another sum in graphite, so that we end up with a monotonically increasing counter.

Thoughts on how to do that? Also, what's the underlying problem you are trying to solve?

> what's the underlying problem you are trying to solve?

The main problem I'm trying to solve is counting events like page views. The aggregation of rates is currently fairly murky (see also T111170), and it isn't clear how we can get a reliable count out of a statsd counter. It might actually make more sense to sum the sum property in graphite for this.

I'm not even sure it'll work with sum. Looking at T111170, count gets aggregated by summing there, but the graphs don't look like an ever-increasing counter because of the flush periods.

To achieve something like an ever-increasing counter, I'd suggest keeping an internal counter that increments on each page view and pushing that as a gauge per service. On service restart it gets reset to 0, but that's easy to detect and doesn't alter summing; to get total page views you can sum across different services. That incidentally makes it slightly more reliable, since losing one push of the counter isn't as bad. Of course, if you want reliable tracking, analytics-type analysis is more suited.
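The gauge approach above could be consumed roughly as follows. This is a sketch under stated assumptions, not an existing tool: the function name `total_increase`, the sample data, and the use of `None` for a lost push are all hypothetical. It assumes each service's gauge starts at 0 and only ever increases until a restart, so any drop in value is treated as a reset.

```python
def total_increase(samples):
    """Sum the increases of a monotonically increasing gauge.

    A drop in value is interpreted as a service restart (counter back to 0).
    None marks a lost push and is skipped.
    """
    total = 0
    previous = 0  # assumption: the gauge starts counting from 0
    for value in samples:
        if value is None:
            continue  # lost push: the next sample still carries the running total
        if value < previous:
            total += value  # restart detected: everything in `value` is new
        else:
            total += value - previous
        previous = value
    return total


# Hypothetical per-service gauge samples.
service_a = [5, 12, None, 30, 3, 9]  # one lost push, one restart after 30
service_b = [2, 4, 7, 7, 10, 15]     # no restart

total_a = total_increase(service_a)
total_b = total_increase(service_b)
print(total_a, total_b, total_a + total_b)  # 39 15 54
```

Two limits of this sketch: events counted after the last push before a restart are lost, and a restart goes undetected if the new counter passes the old value before the next push.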

@Krinkle: Good point! Had missed that integration is supported. It might not be as precise as actually summing up the values, but is probably the closest we can get for now & good enough to get an order of magnitude.
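For readers unfamiliar with the integration mentioned here (presumably Graphite's `integral()` render function), the idea is a running sum over the per-interval values. The Python sketch below only mirrors that idea and is not Graphite's implementation; the sample data and the choice to pass missing datapoints through as `None` are assumptions.

```python
def integral(values):
    """Running sum of a series; a missing datapoint (None) stays None."""
    running = 0.0
    out = []
    for v in values:
        if v is None:
            out.append(None)  # nothing was recorded for this interval
        else:
            running += v
            out.append(running)
    return out


# Hypothetical per-flush counts, with one interval missing.
per_interval_counts = [30, 50, None, 20]
cumulative = integral(per_interval_counts)
print(cumulative)  # [30.0, 80.0, None, 100.0]
```

The missing interval shows one source of imprecision: events from an interval that was never recorded cannot appear in the total.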

@Krinkle nice!

I've added a FAQ section at https://wikitech.wikimedia.org/wiki/Graphite since this will come up again.

Krinkle closed this task as Resolved. · Oct 30 2015, 12:36 AM
Krinkle claimed this task.