Investigate odd aggregation of MediaWiki.edit.failures.conflict.sum metric in graphite
Closed, Resolved (Public)

Description

Looking at the data at:

And at the timestamp "1531381500", the retention / aggregation of this metric appears to be rather odd.

In the 7-day data this data point has a value of "2", but in the 30-day data the value has been aggregated to "1.2666666666666666".

I'm not sure this makes sense, as this data should be aggregated using the sum function, so it a) should result in a larger number and b) should not result in a decimal.
Although I could be misunderstanding what Graphite is actually trying to do here.

This was noticed while looking at the top graph of the following dashboard with the following date ranges:

The number of edit conflicts seems to decrease when switching from the 7 day dashboard to the 30 day dashboard.

CCed @Krinkle as he last modified the dashboard
CCed @Lea_WMDE & @Aleksey_WMDE as they noticed the issue

Event Timeline

I believe the discrepancy comes from the fact that the whisper file for that metric uses average as its aggregation method, not sum. This is likely because the file was created a long time ago, before we fixed the issue in modules/role/manifests/graphite/base.pp so that .sum metric files aggregate with sum.
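To illustrate why an average-aggregated file produces a smaller, fractional value, here is a minimal sketch. The bucket size (fifteen one-minute points) and the individual counts are made up for illustration; the actual retention schema is not stated in this task. The counts were chosen to total 19, so that the average lands near the value reported in the description.

```python
def aggregate(points, method):
    """Collapse a list of fine-grained values into one coarser value."""
    if method == "sum":
        return sum(points)
    if method == "average":
        return sum(points) / len(points)
    raise ValueError(f"unsupported aggregation method: {method}")


# Hypothetical: fifteen one-minute conflict counts rolled up into one point.
minute_counts = [2, 1, 0, 3, 1, 1, 2, 0, 1, 2, 1, 1, 2, 1, 1]

as_sum = aggregate(minute_counts, "sum")          # total conflicts in the bucket
as_average = aggregate(minute_counts, "average")  # mean per minute, a fraction

print("sum:", as_sum)
print("average:", as_average)
```

With sum, the coarser point can only grow as more fine-grained points are folded in; with average, it stays in the range of the individual points and is usually not an integer.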

Got it, I figured that might be the case but had no way of checking.
Is there a way to alter this for future aggregations of this method?
Is there a way to get a dump of the configs that the whisper files use for all (?) metrics, so that issues like this can be seen publicly?
I wonder how many would need fixing.

See Graphite § Troubleshooting on Wikitech for how to check and (if needed) change the xFilesFactor (xff) and aggregation method of an existing metric's whisper file. This would normally be done by an SRE/root on request.
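For anyone with read access to the files, the check itself can be sketched as below. This is not the procedure from Wikitech; it assumes the Whisper on-disk header layout as I understand it (aggregation type, max retention, xFilesFactor, and archive count packed as `!2LfL`) and the usual numeric codes for the aggregation methods. The function name `read_whisper_metadata` is hypothetical.

```python
import struct

# Assumed header layout: two unsigned 32-bit ints, a 32-bit float, and one
# more unsigned 32-bit int, all big-endian.
METADATA_FORMAT = "!2LfL"
METADATA_SIZE = struct.calcsize(METADATA_FORMAT)

# Assumed mapping of on-disk codes to aggregation method names.
AGGREGATION_METHODS = {1: "average", 2: "sum", 3: "last", 4: "max", 5: "min"}


def read_whisper_metadata(path):
    """Return the header fields of a whisper file as a dict."""
    with open(path, "rb") as fh:
        raw = fh.read(METADATA_SIZE)
    if len(raw) < METADATA_SIZE:
        raise ValueError(f"{path} is too short to be a whisper file")
    agg_type, max_retention, xff, archive_count = struct.unpack(METADATA_FORMAT, raw)
    return {
        "aggregationMethod": AGGREGATION_METHODS.get(agg_type, f"unknown({agg_type})"),
        "maxRetention": max_retention,
        "xFilesFactor": xff,
        "archiveCount": archive_count,
    }
```

In practice the tooling shipped with the whisper package (for example `whisper-info.py` to inspect and `whisper-set-aggregation-method.py` to change a file) is the more usual route; the sketch only shows where the setting lives.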

> Got it, I figured that might be the case but had no way of checking.
> Is there a way to alter this for future aggregations of this method?

Yes, in the sense that newly created files/metrics will already use the fixed settings.
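The reason only new files pick up the fix is that the aggregation method is chosen by matching the metric name against pattern rules when the file is first created. A rough sketch of that lookup follows; the rule list here is hypothetical and is not copied from the Puppet manifest.

```python
import re

# Hypothetical rules in priority order; the first matching pattern wins.
AGGREGATION_RULES = [
    (re.compile(r"\.sum$"), "sum"),
    (re.compile(r".*"), "average"),  # catch-all default
]


def aggregation_for(metric_name):
    """Pick the aggregation method a newly created metric file would get."""
    for pattern, method in AGGREGATION_RULES:
        if pattern.search(metric_name):
            return method
    return "average"


print(aggregation_for("MediaWiki.edit.failures.conflict.sum"))
```

A file created before the `.sum` rule existed would have fallen through to the default and keeps that setting until someone changes it by hand.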

> Is there a way to get a dump of the configs that the whisper files use for all (?) metrics, so that issues like this can be seen publicly?

Good question. I don't think the Graphite API exposes it, so no, not out of the box.
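Someone with filesystem access could still produce such a dump offline. The sketch below walks a whisper directory and lists `.sum` metrics whose files are not aggregated with sum. It assumes metric names map to paths by replacing dots with directory separators, and that the first four bytes of each file hold the aggregation type code; the function name and the root path are placeholders.

```python
import os
import struct

# Assumed mapping of on-disk codes to aggregation method names.
AGGREGATION_METHODS = {1: "average", 2: "sum", 3: "last", 4: "max", 5: "min"}


def find_mismatched_sum_metrics(root):
    """Return sorted (metric_name, method) pairs for .sum metrics not using sum."""
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".wsp"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as fh:
                header = fh.read(4)
            if len(header) < 4:
                continue  # not a usable whisper file
            (agg_type,) = struct.unpack("!L", header)
            method = AGGREGATION_METHODS.get(agg_type, f"unknown({agg_type})")
            # Turn "MediaWiki/edit/x/sum.wsp" into "MediaWiki.edit.x.sum".
            relative = os.path.relpath(path, root)
            metric = relative[: -len(".wsp")].replace(os.sep, ".")
            if metric.endswith(".sum") and method != "sum":
                mismatches.append((metric, method))
    return sorted(mismatches)
```

Publishing the output of something like this periodically would be one way to make these mismatches visible without exposing the files themselves.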

> I wonder how many would need fixing.

I don't know the exact number, but likely any metric created before https://gerrit.wikimedia.org/r/c/operations/puppet/+/235612. I can fix the MediaWiki.edit ones if you want, though.

That would be great!

Mentioned in SAL (#wikimedia-operations) [2018-07-27T09:00:12Z] <godog> adjust aggregation to 'sum' for MediaWiki.edit sum metrics - T199968

Addshore assigned this task to fgiunchedi.

I'm going to mark this as resolved, as the investigation is complete and these specific metrics have been fixed.

Addshore moved this task from Incoming to Done on the WMDE-Analytics-Engineering board.
Addshore moved this task from Inbox to Carbon / Whisper on the Graphite board.