Page MenuHomePhabricator

"sum" aggregation broken in Graphite
Closed, ResolvedPublic

Description

When querying data less than 7 days old, data.count and data.sum are always perfectly equal (for increment metrics from statsd).

However when querying a time period further in the past (E.g. last month; August 1 - August 2) or when querying a larger period until now (e.g. last 14 days), then these properties diverge. It even claims different values for the same time period.

For example:

graphite: from=-7days&target=mw.js.deprecate.wgNamespaceNumber

graphite: from=-14days&target=mw.js.deprecate.wgNamespaceNumber

'

Note how the last 7 days of the "last 14 days" doesn't match the "last 7 days".

Event Timeline

Krinkle created this task.Sep 2 2015, 12:03 PM
Krinkle raised the priority of this task from to High.
Krinkle updated the task description. (Show Details)
Krinkle added subscribers: Krinkle, ori, Gilles.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 2 2015, 12:03 PM

likely related to how we aggregate sum vs count in graphite configuration across retentions (1m:7d,5m:14d,15m:30d,1h:1y)

# Aggregation methods for whisper files.
storage_aggregation => {
    'min'   => {
        pattern           => '\.min$',
        xFilesFactor      => 0.1,
        aggregationMethod => 'min',
    },
    'max'   => {
        pattern           => '\.max$',
        xFilesFactor      => 0.1,
        aggregationMethod => 'max',
    },
    'sum'   => {
        pattern           => '\.count$',
        xFilesFactor      => 0,
        aggregationMethod => 'sum',
    },
    # statsite extended counters
    'lower' => {
        pattern           => '\.lower$',
        xFilesFactor      => 0.1,
        aggregationMethod => 'min',
    },
    'upper' => {
        pattern           => '\.upper$',
        xFilesFactor      => 0.1,
        aggregationMethod => 'max',
    },
},

and statsite calculates these stats for both counters and timers as

STREAM("%s%s.count|%lld|%lld\n", prefix, name, counter_count(value));
STREAM("%s%s.sum|%f|%lld\n", prefix, name, counter_sum(value));

STREAM("%s%s.count|%lld|%lld\n", prefix, name, timer_count(&t->tm));
STREAM("%s%s.sum|%f|%lld\n", prefix, name, timer_sum(&t->tm));

in any case, count is how many samples have been received and sum the sum of the sample values received, both per flush interval (60s). so sum gets aggregated as average over retention archives (i.e. the first time after 7d) while count gets summed over the same periods

this is what I think is going on, does it match expectations? I'm not sure

chasemp added a subscriber: chasemp.Sep 2 2015, 6:07 PM

I do think summing sum's is going to be more logically consistent for aggregation periods

Yes. count and sum both should be aggregated using sum(). They have different meanings (number of data points vs actual data) but in both cases aggregation logic should use addition.

The key of the count aggregation in the storage_aggregation object refers to the sum property. But it was removed in f4be20a148 – which also removed lower and upper. The latter two were then re-added in fbc91d3284, but sum was never re-introduced.

Change 235612 had a related patch set uploaded (by Krinkle):
graphite: Add aggregation logic for ".sum" properties

https://gerrit.wikimedia.org/r/235612

Krinkle claimed this task.Sep 2 2015, 8:22 PM
Krinkle set Security to None.
Krinkle moved this task from Inbox to Doing on the Performance-Team board.

Change 235612 merged by Ori.livneh:
graphite: Add aggregation logic for ".sum" properties

https://gerrit.wikimedia.org/r/235612

for existing archives we'll need to retroactively change archive aggregation, I can take care of that starting next week

Is there something we need to do to deploy https://gerrit.wikimedia.org/r/235612? I don't see any difference in data aggregated in the past 24 hours.

Is there something we need to do to deploy https://gerrit.wikimedia.org/r/235612? I don't see any difference in data aggregated in the past 24 hours.

the aggregation periods are set in whisper files and the graphite configuration applies only to newly created whisper files. I can retroactively change existing archives but data older than 7d that's already been aggregated won't change ttbomk, newly aggregated data will however so will see changes as time goes on.

to clarify the above, whisper files (one per metric) are created as needed when new metrics (not data) shows up. thus the settings in https://gerrit.wikimedia.org/r/235612 ATM will apply to all new metrics that flow in, existing metrics like mw.js.deprecate.wgNamespaceNumber.sum have still the old settings because the whisper file has been created already.
we should retroactively change the aggregation for existing whisper files and see the effects when data is aggregated again as time goes by. existing aggregated data will be left in place, it will get eventually replaced by data aggregated using sum. We can try changing only mw.js.deprecate.wgNamespaceNumber.sum and see how that looks like

Restricted Application added a subscriber: Matanya. · View Herald TranscriptSep 4 2015, 4:36 PM

@fgiunchedi Ah explains the behaviour I've been seeing. The settings are hardcoded when new whisper files are created. Whenever fresh data is aggregated, it uses the settings present in that file.

Yeah, let's change that one as example to verify the patch (which is already merged!) actually works as expected. If it works fine, perhaps we can apply this more broadly? We're driving blind in some areas because of this. The sooner we can start aggregating data correctly the better.

@fgiunchedi Ah explains the behaviour I've been seeing. The settings are hardcoded when new whisper files are created. Whenever fresh data is aggregated, it uses the settings present in that file.
Yeah, let's change that one as example to verify the patch (which is already merged!) actually works as expected. If it works fine, perhaps we can apply this more broadly? We're driving blind in some areas because of this. The sooner we can start aggregating data correctly the better.

sounds good!

_graphite@graphite1001:/tmp$ whisper-set-aggregation-method /var/lib/carbon/whisper/mw/js/deprecate/wgNamespaceNumber/sum.wsp sum
Updated aggregation method: /var/lib/carbon/whisper/mw/js/deprecate/wgNamespaceNumber/sum.wsp (average -> sum)
ori added a comment.Sep 4 2015, 6:01 PM

This should do it:

#!/usr/bin/env bash
shopt -s globstar
for wsp in /var/lib/carbon/whisper/**/sum.wsp; do
  if [[ $(/usr/bin/whisper-info "$wsp" aggregationMethod) != "sum" ]]; then
    /usr/bin/whisper-set-aggregation-method "$wsp" sum
  fi
done
fgiunchedi added a comment.EditedSep 4 2015, 6:04 PM

@ori that would, before running it we should check whisper-set-aggregation-method locks the whisper file. if it doesn't we risk clashing with carbon-cache

ori added a comment.Sep 4 2015, 6:08 PM

@ori that would, before running it we should check whisper-set-aggregation-method locks the whisper file. if it doesn't we risk clashing with carbon-cache

The code for locking the file is there:

https://github.com/graphite-project/whisper/blob/master/whisper.py#L283-L284

But it is conditional on the value of LOCK, which is unconditionally set to False:

https://github.com/graphite-project/whisper/blob/master/whisper.py#L84

indeed, IIRC carbon sets whisper.LOCK = True explicitly if gets configured with locking writes, which we do

ori added a comment.Sep 4 2015, 6:20 PM

Hacked this together, seems to work (but did not run it in prod):

update-aggr.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

import whisper

assert whisper.CAN_LOCK
whisper.LOCK = True

for root, dirs, files in os.walk('/var/lib/carbon/whisper'):
    for basename in files:
        if basename.endswith('sum.wsp'):
            filename = os.path.join(root, basename)
            try:
                info = whisper.info(filename)
                if info['aggregationMethod'] != 'sum':
                    old_method = whisper.setAggregationMethod(filename, 'sum')
                    print('%s: %s -> sum' % (filename, old_method))
            except Exception as e:
                    print('Fail to update %s: %s' % filename, e)

I'll take this one to run the script from @ori

since the above seems to be working I've ran the script to all mw.js hierarchy, will apply everywhere tomorrow if no issues arise

fgiunchedi closed this task as Resolved.Sep 8 2015, 12:51 PM

existing files have been changed to the sum aggregation, (tentatively) resolving