enable statsd extended counters
Closed, ResolvedPublic

Description

The counter at restbase.sys_parsoid_generateAndSave.unchanged_rev_render used to have .rate and .count children, which are now gone. Instead, that name now carries a rate that seems to be multiplied by 1000, which can be corrected with the graphite 'scaleToSeconds' function.

Is this expected behavior?

Event Timeline

GWicke assigned this task to fgiunchedi.
GWicke raised the priority of this task from to High.
GWicke updated the task description.
GWicke removed a project: Patch-For-Review.
GWicke set Security to None.
GWicke removed subscribers: mark, Aklapper, chasemp and 3 others.

it is: by default, counters in statsite are defined as the sum of the counter values (unless extended counters are enabled)

STREAM("%s%s|%f|%lld\n", prefix, name, counter_sum(value));

I agree though that having more metrics on counters would be useful. statsite doesn't let you select which metrics to export for extended counters, but we could filter those out when streaming to graphite.

so it is a rate in the sense that the counter (the sum of all values received for that counter during the flush period) is reset to 0 at each flush, hence a rate per flush period rather than per second

another factor is the aggregation that whisper does underneath, which does averages by default
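To illustrate the whisper point, here is a minimal sketch of what a rollup to a coarser retention does to per-flush counter sums. The `rollup` helper, the 10-second flush period and the sample values are all hypothetical; it only mimics the idea of whisper's per-metric aggregation method (average by default, with sum as one alternative), not its actual implementation.

```python
def rollup(points, factor, method="average"):
    """Collapse every `factor` consecutive points into one, as a coarser archive would."""
    out = []
    for i in range(0, len(points), factor):
        chunk = points[i:i + factor]
        if method == "average":
            out.append(sum(chunk) / len(chunk))
        elif method == "sum":
            out.append(sum(chunk))
        else:
            raise ValueError(f"unsupported method: {method}")
    return out


# Six per-flush counter sums, one per (assumed) 10-second flush.
per_flush_sums = [10, 20, 30, 10, 20, 30]

averaged = rollup(per_flush_sums, 6, "average")  # mean events per flush
summed = rollup(per_flush_sums, 6, "sum")        # total events in the minute
```

With averaging, the one-minute point still reads as "events per flush period", so older data keeps the same scale as recent data but no longer reflects the total number of events in the interval.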

Yeah, the regular counter semantics are closer to what statsite provides as counter_count(value) if extended counters are enabled. It would also provide a properly scaled rate.
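The difference between the sum, the count and a per-second rate can be sketched roughly as below. This is not statsite's code: the `extended_counter` helper and the key names are illustrative, and it assumes that the count is the number of samples received in the flush window and that the rate is the sum divided by the flush period in seconds.

```python
def extended_counter(values, flush_period_s):
    """Summarise the counter samples received during one flush window."""
    total = sum(values)
    return {
        "sum": total,                      # what the plain counter reports today
        "count": len(values),              # number of samples received (assumed meaning)
        "rate": total / flush_period_s,    # per-second rate (assumed definition)
    }


# Four samples received during an assumed 10-second flush period.
stats = extended_counter([1, 1, 5, 3], flush_period_s=10)
```

Here the sum (10) and the count (4) diverge as soon as a client increments by more than one, which is why the two are not interchangeable on dashboards.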

Should we consider enabling extended counters?

Should we do a trial run on labs? :)

Change 204695 had a related patch set uploaded (by Yuvipanda):
labmon: Enable extended statsite counters for labs

https://gerrit.wikimedia.org/r/204695

Change 204695 merged by Filippo Giunchedi:
labmon: Enable extended statsite counters for labs

https://gerrit.wikimedia.org/r/204695

extended counters have been enabled on labmon1001. production will be harder because that means a 7x increase in metrics for each counter

current counters users are:

  • MediaWiki ~40k
  • jobrunner ~1.5k
  • eventlogging ~1k
  • other minor users like ocg/restbase/webperf

so let's say 45k counters give or take, that's another ~300G which we don't have ATM.
Most of mediawiki counters are job queue related though, see also https://phabricator.wikimedia.org/T95913
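As a rough cross-check of the estimate above, the numbers can be worked backwards to an implied size per metric. This is only a sketch under stated assumptions: that "7x" means each counter ends up with 7 series in place of 1 (so 6 additional ones), and that "300G" means 300 × 10^9 bytes. The `implied_bytes_per_series` helper is hypothetical.

```python
def implied_bytes_per_series(counters, series_per_counter, extra_bytes):
    """Return (additional series, implied on-disk bytes per additional series)."""
    extra_series = counters * (series_per_counter - 1)
    return extra_series, extra_bytes / extra_series


extra_series, bytes_per_series = implied_bytes_per_series(
    counters=45_000,          # "45k counters give or take"
    series_per_counter=7,     # assumed reading of "7x increase"
    extra_bytes=300 * 10**9,  # "~300G", assumed decimal gigabytes
)
```

Under those assumptions the estimate corresponds to 270,000 additional series at a little over 1 MB each.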

so let's say 45k counters give or take, that's another ~300G which we don't have ATM.

that's not strictly true, but we'd be left with just 70G free, I'll take a closer look at mediawiki counters

Also see T85451 about scaling graphite storage.

mw counters will be reduced a lot by T95913 (ETA 2-3d) so we'll be able to enable extended counters

@yuvipanda tried extended counters in labs and they are working as expected AFAICT (?)

enabling extended counters will require some renaming, namely moving the current counters to their sum extended counterparts.
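The renaming could be planned with something like the sketch below. The `plan_renames` helper is hypothetical, and the `.sum` suffix is an assumption inferred from the "sum extended counterpart" wording and the .rate/.count children mentioned in the description; the actual move on the graphite side is not shown.

```python
def plan_renames(metrics, suffix=".sum"):
    """Map each existing counter name to its assumed extended-counter name."""
    return {name: name + suffix for name in metrics}


plan = plan_renames([
    "restbase.sys_parsoid_generateAndSave.unchanged_rev_render",
])
```

Building the full old-to-new mapping first makes it easy to review the list before any existing data is moved.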

Change 206781 had a related patch set uploaded (by Filippo Giunchedi):
statsite: enable extended counters by default

https://gerrit.wikimedia.org/r/206781

I'll be enabling extended counters tomorrow with https://gerrit.wikimedia.org/r/206781 and will be renaming the metrics on the graphite/puppet side, this should end the renamings for a while!

Change 206797 had a related patch set uploaded (by Filippo Giunchedi):
Revert "eventlogging: adjust counters thresholds"

https://gerrit.wikimedia.org/r/206797

correction, holding this while statsite restart is improved in https://gerrit.wikimedia.org/r/#/c/206819/

Change 206781 merged by Filippo Giunchedi:
statsite: enable extended counters by default

https://gerrit.wikimedia.org/r/206781

Change 206797 merged by Filippo Giunchedi:
Revert "eventlogging: adjust counters thresholds"

https://gerrit.wikimedia.org/r/206797

extended counters have been enabled in prod and the additional metrics are being created now; this stays pending until all new metrics have been created

fgiunchedi renamed this task from Counters now only provide rates (multiplied by 1000?) to enable statsd extended counters. May 1 2015, 10:44 AM