
Counter data swallowed by Graphite?
Closed, Resolved, Public

Description

This could be operator error, but I'm seeing very strange behavior when building graphs in Grafana. I've tried to provide a minimal example case below, which shows that a metric recorded using increment, which should result in the c counter type in the statsd protocol, gets incorrectly rolled up across time buckets, as if Graphite thinks it is a gauge metric:

https://graphite.wikimedia.org/render?target=MediaWiki.FileImporter.error.byRecoverable.false.byType.duplicateFiles.sum&from=1560895200&until=1560981659&format=json&noNullPoints=1&maxDataPoints=300
->

[{"target": "MediaWiki.FileImporter.error.byRecoverable.false.byType.duplicateFiles.sum", "datapoints": [[null, 1560895380], [null, 1560895560], [null, 1560895740], [null, 1560895920], [null, 1560896100], [null, 1560896280], [null, 1560896460], [null, 1560896640], [null, 1560896820], [null, 1560897000], [null, 1560897180], [null, 1560897360], [null, 1560897540], [null, 1560897720], [null, 1560897900], [null, 1560898080], [null, 1560898260], [null, 1560898440], [null, 1560898620], [null, 1560898800], [null, 1560898980], [null, 1560899160], [null, 1560899340], [null, 1560899520], [null, 1560899700], [null, 1560899880], [null, 1560900060], [null, 1560900240], [null, 1560900420], [null, 1560900600], [null, 1560900780], [null, 1560900960], [null, 1560901140], [null, 1560901320], [null, 1560901500], [null, 1560901680], [null, 1560901860], [null, 1560902040], [null, 1560902220], [null, 1560902400], [null, 1560902580], [null, 1560902760], [null, 1560902940], [null, 1560903120], [null, 1560903300], [null, 1560903480], [null, 1560903660], [null, 1560903840], [null, 1560904020], [null, 1560904200], [null, 1560904380], [null, 1560904560], [null, 1560904740], [null, 1560904920], [null, 1560905100], [null, 1560905280], [null, 1560905460], [null, 1560905640], [null, 1560905820], [null, 1560906000], [null, 1560906180], [null, 1560906360], [null, 1560906540], [null, 1560906720], [null, 1560906900], [null, 1560907080], [null, 1560907260], [null, 1560907440], [null, 1560907620], [null, 1560907800], [null, 1560907980], [null, 1560908160], [null, 1560908340], [null, 1560908520], [null, 1560908700], [null, 1560908880], [null, 1560909060], [null, 1560909240], [null, 1560909420], [null, 1560909600], [null, 1560909780], [null, 1560909960], [null, 1560910140], [null, 1560910320], [null, 1560910500], [null, 1560910680], [null, 1560910860], [null, 1560911040], [null, 1560911220], [null, 1560911400], [null, 1560911580], [null, 1560911760], [null, 1560911940], [null, 1560912120], [null, 
1560912300], [null, 1560912480], [null, 1560912660], [null, 1560912840], [null, 1560913020], [null, 1560913200], [null, 1560913380], [null, 1560913560], [null, 1560913740], [null, 1560913920], [null, 1560914100], [null, 1560914280], [null, 1560914460], [null, 1560914640], [null, 1560914820], [null, 1560915000], [null, 1560915180], [null, 1560915360], [null, 1560915540], [null, 1560915720], [null, 1560915900], [null, 1560916080], [null, 1560916260], [null, 1560916440], [null, 1560916620], [null, 1560916800], [null, 1560916980], [null, 1560917160], [null, 1560917340], [null, 1560917520], [null, 1560917700], [null, 1560917880], [null, 1560918060], [null, 1560918240], [null, 1560918420], [null, 1560918600], [null, 1560918780], [null, 1560918960], [null, 1560919140], [null, 1560919320], [null, 1560919500], [null, 1560919680], [null, 1560919860], [null, 1560920040], [null, 1560920220], [null, 1560920400], [null, 1560920580], [null, 1560920760], [null, 1560920940], [null, 1560921120], [null, 1560921300], [null, 1560921480], [null, 1560921660], [null, 1560921840], [null, 1560922020], [null, 1560922200], [null, 1560922380], [null, 1560922560], [null, 1560922740], [null, 1560922920], [null, 1560923100], [null, 1560923280], [null, 1560923460], [null, 1560923640], [null, 1560923820], [null, 1560924000], [1.0, 1560924180], [null, 1560924360], [null, 1560924540], [null, 1560924720], [null, 1560924900], [1.0, 1560925080], [null, 1560925260], [null, 1560925440], [null, 1560925620], [null, 1560925800], [null, 1560925980], [null, 1560926160], [1.0, 1560926340], [null, 1560926520], [1.0, 1560926700], [null, 1560926880], [1.0, 1560927060], [1.0, 1560927240], [1.0, 1560927420], [null, 1560927600], [null, 1560927780], [null, 1560927960], [null, 1560928140], [null, 1560928320], [null, 1560928500], [null, 1560928680], [null, 1560928860], [null, 1560929040], [null, 1560929220], [null, 1560929400], [null, 1560929580], [null, 1560929760], [null, 1560929940], [null, 1560930120], [null, 
1560930300], [null, 1560930480], [null, 1560930660], [null, 1560930840], [null, 1560931020], [null, 1560931200], [null, 1560931380], [null, 1560931560], [null, 1560931740], [null, 1560931920], [null, 1560932100], [1.0, 1560932280], [null, 1560932460], [null, 1560932640], [null, 1560932820], [null, 1560933000], [null, 1560933180], [null, 1560933360], [null, 1560933540], [null, 1560933720], [null, 1560933900], [null, 1560934080], [null, 1560934260], [null, 1560934440], [null, 1560934620], [null, 1560934800], [null, 1560934980], [null, 1560935160], [null, 1560935340], [null, 1560935520], [null, 1560935700], [null, 1560935880], [null, 1560936060], [null, 1560936240], [null, 1560936420], [null, 1560936600], [null, 1560936780], [null, 1560936960], [null, 1560937140], [null, 1560937320], [null, 1560937500], [null, 1560937680], [null, 1560937860], [null, 1560938040], [null, 1560938220], [null, 1560938400], [null, 1560938580], [null, 1560938760], [null, 1560938940], [null, 1560939120], [null, 1560939300], [null, 1560939480], [null, 1560939660], [null, 1560939840], [null, 1560940020], [null, 1560940200], [null, 1560940380], [null, 1560940560], [null, 1560940740], [null, 1560940920], [null, 1560941100], [null, 1560941280], [null, 1560941460], [null, 1560941640], [null, 1560941820], [null, 1560942000], [null, 1560942180], [null, 1560942360], [null, 1560942540], [null, 1560942720], [null, 1560942900], [null, 1560943080], [null, 1560943260], [null, 1560943440], [null, 1560943620], [null, 1560943800], [null, 1560943980], [null, 1560944160], [null, 1560944340], [null, 1560944520], [null, 1560944700], [null, 1560944880], [null, 1560945060], [null, 1560945240], [null, 1560945420], [null, 1560945600], [null, 1560945780], [null, 1560945960], [null, 1560946140], [null, 1560946320], [null, 1560946500], [null, 1560946680], [null, 1560946860], [null, 1560947040], [null, 1560947220]]}]

https://graphite.wikimedia.org/render?target=MediaWiki.FileImporter.error.byRecoverable.false.byType.duplicateFiles.sum&from=1560895200&until=1560981659&format=json&noNullPoints&maxDataPoints=100
->

[{"target": "MediaWiki.FileImporter.error.byRecoverable.false.byType.duplicateFiles.sum", "datapoints": [[null, 1560895380], [null, 1560895920], [null, 1560896460], [null, 1560897000], [null, 1560897540], [null, 1560898080], [null, 1560898620], [null, 1560899160], [null, 1560899700], [null, 1560900240], [null, 1560900780], [null, 1560901320], [null, 1560901860], [null, 1560902400], [null, 1560902940], [null, 1560903480], [null, 1560904020], [null, 1560904560], [null, 1560905100], [null, 1560905640], [null, 1560906180], [null, 1560906720], [null, 1560907260], [null, 1560907800], [null, 1560908340], [null, 1560908880], [null, 1560909420], [null, 1560909960], [null, 1560910500], [null, 1560911040], [null, 1560911580], [null, 1560912120], [null, 1560912660], [null, 1560913200], [null, 1560913740], [null, 1560914280], [null, 1560914820], [null, 1560915360], [null, 1560915900], [null, 1560916440], [null, 1560916980], [null, 1560917520], [null, 1560918060], [null, 1560918600], [null, 1560919140], [null, 1560919680], [null, 1560920220], [null, 1560920760], [null, 1560921300], [null, 1560921840], [null, 1560922380], [null, 1560922920], [null, 1560923460], [1.0, 1560924000], [null, 1560924540], [1.0, 1560925080], [null, 1560925620], [1.0, 1560926160], [1.0, 1560926700], [1.0, 1560927240], [null, 1560927780], [null, 1560928320], [null, 1560928860], [null, 1560929400], [null, 1560929940], [null, 1560930480], [null, 1560931020], [null, 1560931560], [1.0, 1560932100], [null, 1560932640], [null, 1560933180], [null, 1560933720], [null, 1560934260], [null, 1560934800], [null, 1560935340], [null, 1560935880], [null, 1560936420], [null, 1560936960], [null, 1560937500], [null, 1560938040], [null, 1560938580], [null, 1560939120], [null, 1560939660], [null, 1560940200], [null, 1560940740], [null, 1560941280], [null, 1560941820], [null, 1560942360], [null, 1560942900], [null, 1560943440], [null, 1560943980], [null, 1560944520], [null, 1560945060], [null, 1560945600], [null, 1560946140], 
[null, 1560946680]]}]

In the first response, we see 8 increment events for the metric, but in the second response there are only 6. The same glitch happens if I change the time window from 24hr to 48hr, of course, so I can't build correct graphs on this metric. Am I using Graphite wrong? Is the metric stored using the wrong type?

Strangely, in this example I see the data rolled up correctly:
https://graphite.wikimedia.org/render?target=MediaWiki.FileImporter.error.byRecoverable.false.byType.duplicateFiles.sum&from=1560095200&until=1560981659&format=json&noNullPoints=1
->

[{"target": "MediaWiki.FileImporter.error.byRecoverable.false.byType.duplicateFiles.sum", "datapoints": [[1.0, 1560924000], [1.0, 1560924900], [1.0, 1560926400], [1.0, 1560926700], [2.0, 1560927000], [1.0, 1560927300], [1.0, 1560932100]]}]

I have a punchline. The default consolidateBy function is to average data points in each time bucket. This is only correct behavior for a gauge, but not for a counter. AFAICT, Graphite has no concept of metric type, so I'm pretty sure this is doing the wrong thing in... lots of graphs. In our case, I was able to get the expected totals by wrapping the sumSeries like so:
consolidateBy(sumSeries(MediaWiki.FileImporter.error.byRecoverable.*.byType.*.sum), 'sum')
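To make the discrepancy concrete, here is a minimal Python sketch that re-buckets the eight non-null datapoints from the first response (180-second step) into the 540-second buckets seen in the second response. It is a simplified model of consolidation, not Graphite's actual code: the `consolidate` and `average` helpers are hypothetical, and the bucket alignment is an assumption inferred from the timestamps in the two responses.

```python
from collections import defaultdict

START = 1560895380   # first timestamp in both responses above
COARSE_STEP = 540    # step of the maxDataPoints=100 response

# The eight non-null datapoints from the maxDataPoints=300 response.
events = [
    (1.0, 1560924180), (1.0, 1560925080), (1.0, 1560926340), (1.0, 1560926700),
    (1.0, 1560927060), (1.0, 1560927240), (1.0, 1560927420), (1.0, 1560932280),
]

def consolidate(points, func):
    """Group points into COARSE_STEP-wide buckets and reduce each with func."""
    buckets = defaultdict(list)
    for value, ts in points:
        bucket_start = START + ((ts - START) // COARSE_STEP) * COARSE_STEP
        buckets[bucket_start].append(value)
    return [(func(values), ts) for ts, values in sorted(buckets.items())]

def average(values):
    # Nulls are simply absent here, so this averages only the known values.
    return sum(values) / len(values)

averaged = consolidate(events, average)
summed = consolidate(events, sum)

print("average:", sum(v for v, _ in averaged), averaged)
print("sum:    ", sum(v for v, _ in summed), summed)
```

Under these assumptions, averaging yields six buckets of 1.0 at the same timestamps as the second response, while summing keeps the total at 8, because two of the buckets each contain two increments.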

Event Timeline

This entertaining blog post describes the exact glitch I ran into, plus some others which seem important. For example, historical rollups are done using a different function for each metric, according to an explicit configuration.

addshore@graphite1004:/srv/carbon/whisper$ whisper-info MediaWiki/FileImporter/error/byRecoverable/false/byType/duplicateFiles/sum.wsp
maxRetention: 157680000
xFilesFactor: 0.0
aggregationMethod: sum
fileSize: 331000

sum is being used, which is what should be used based on the metric name, and would also be correct for counters.
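For illustration, a rough Python sketch of how a storage-side rollup might combine several fine-grained points into one coarser point, given an aggregation method and an xFilesFactor like the ones reported by whisper-info above. The `rollup` function is hypothetical and only approximates the idea; it is not Whisper's implementation.

```python
def rollup(values, method="sum", x_files_factor=0.0):
    """Aggregate one coarse point from fine-grained values (None = no data).

    Returns None when there is nothing to aggregate, or when the fraction
    of known values falls below x_files_factor.
    """
    known = [v for v in values if v is not None]
    if not known:
        return None
    if len(known) / len(values) < x_files_factor:
        return None
    if method == "sum":
        return sum(known)
    if method == "average":
        return sum(known) / len(known)
    raise ValueError(f"unsupported aggregation method: {method}")

window = [None, 1.0, None, 1.0]
print(rollup(window, "sum"))       # both increments are kept
print(rollup(window, "average"))   # the two increments collapse into one
```

With method "sum" and an xFilesFactor of 0.0, a sparse counter keeps its total through the rollup, which is the behavior wanted for counters.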

I have a punchline. The default consolidateBy function is to average data points in each time bucket. This is only correct behavior for a gauge, but not for a counter.

This is correct. But the queries you show in the description do not make use of consolidateBy; however, you do add maxDataPoints, which I believe will cause consolidation (citation needed).

Grafana will automatically send a maxDataPoints parameter with the request, so you should always use the consolidateBy function when the default average consolidation is not wanted.
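As a sketch of what that looks like for a hand-built query, the following Python snippet wraps a target in consolidateBy(..., 'sum') before adding maxDataPoints to a render URL. The `build_render_url` helper is hypothetical; the host, target, and time range are taken from the queries in the description.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

GRAPHITE_RENDER = "https://graphite.wikimedia.org/render"

def build_render_url(target, start, until, max_data_points, consolidation="sum"):
    """Build a render URL whose target is consolidated with the given function."""
    wrapped = f"consolidateBy({target}, '{consolidation}')"
    query = urlencode({
        "target": wrapped,
        "from": start,
        "until": until,
        "format": "json",
        "maxDataPoints": max_data_points,
    })
    return f"{GRAPHITE_RENDER}?{query}"

url = build_render_url(
    "MediaWiki.FileImporter.error.byRecoverable.false.byType.duplicateFiles.sum",
    1560895200,
    1560981659,
    100,
)
print(url)
```

The consolidation function is a parameter so the same helper can request 'average' for gauges and 'sum' for counters.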

https://addshore.com/2018/09/grafana-graphite-and-maxdatapoints-confusion-for-totals/ is a vaguely related write-up about this.

awight claimed this task.

Hi @Addshore, thanks for sharing your insights! Like you pointed out, the historical rollup function isn't applied to current metrics, where consolidateBy is the only way to get a correct graph in many cases. The Wikitech page was helpful, and links to an article that addressed the worst of my misgivings: http://dieter.plaetinck.be/post/25-graphite-grafana-statsd-gotchas/

Our documentation is correct, so I should have just dug deeper.