Page MenuHomePhabricator

Urgent: Statsite changes semantics of timer rate metrics, need metric rename
Closed, ResolvedPublic

Description

Since the move to statsite, rate metrics derived from timers all show values that seem to be 1000 times the actual rate / s. For general sanity it would be nice if this was fixed in the statsite configuration, if possible. Doing so will will let us compare rates with historical data & avoid the need to scale rates manually.

Event Timeline

GWicke assigned this task to fgiunchedi.
GWicke raised the priority of this task from to High.
GWicke updated the task description. (Show Details)
GWicke added a project: Graphite.
GWicke added a subscriber: GWicke.
GWicke renamed this task from Sitestat timer rates not stored as req/s, multiplied by 1000 to Statsite timer rates not stored as req/s, multiplied by 1000.Apr 9 2015, 6:19 PM
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)

It turns out that statsite [changes the semantics of the .rate metric to be about the value and not the number of samples per time unit](https://github.com/armon/statsite/blob/4ebd58f4e334515dcf52d0deea3d0350d749faa9/src/conn_handler.c#L143-L144):

STREAM("%s%s.rate|%f|%lld\n", prefix, name, timer_sum(&t->tm) / GLOBAL_CONFIG->flush_interval);
STREAM("%s%s.sample_rate|%f|%lld\n", prefix, name, (double)timer_count(&t->tm) / GLOBAL_CONFIG->flush_interval);

The factor of 1000 turns out to be roughly the average value of each timing in the example I was looking at.

This means that our existing rate metrics are now being broken. The historical data will become worthless unless we fix this quickly. We could patch statsite, but I think the better fix is to

  • delete the new .sample_rate metrics, and
  • rename current .rate metrics to .sample_rate.
GWicke renamed this task from Statsite timer rates not stored as req/s, multiplied by 1000 to Urgent: Statsite timer changes semantics of timer rate metrics, need metric rename.Apr 9 2015, 8:22 PM
GWicke renamed this task from Urgent: Statsite timer changes semantics of timer rate metrics, need metric rename to Urgent: Statsite changes semantics of timer rate metrics, need metric rename.

https://phabricator.wikimedia.org/T90111#1153830 has the rename script for the original migration.

Maybe something like this could work?

# test:
find /path/to/metrics -type f -name 'rate.wsp' | while read i; do echo "$i" `dirname "$i"`/sample_rate.wsp;done

# actually move:
find /path/to/metrics -type f -name 'rate.wsp' | while read i; do mv "$i" `dirname "$i"`/sample_rate.wsp;done

Discussed in IRC; ran this at approximately UTC 22:32:

sudo stop carbon/relay
sudo stop carbon/cache NAME=a
sudo stop carbon/cache NAME=b
sudo stop carbon/cache NAME=c
sudo stop carbon/cache NAME=d
sudo stop carbon/cache NAME=e
sudo stop carbon/cache NAME=f
sudo stop carbon/cache NAME=g
sudo stop carbon/cache NAME=h
find /var/lib/carbon/whisper/{cassandra,restbase} -type f -name 'sample_rate.wsp' -printf "%h\n" | while read i; do sudo mv "$i/rate.wsp" "$i"/sample_rate.wsp ; done
sudo start carbon/cache NAME=a
sudo start carbon/cache NAME=b
sudo start carbon/cache NAME=c
sudo start carbon/cache NAME=d
sudo start carbon/cache NAME=e
sudo start carbon/cache NAME=f
sudo start carbon/cache NAME=g
sudo start carbon/cache NAME=h
sudo start carbon/relay

Not sure why, but it looks like the old data didn't make it across from rate.wsp. The new sample_rate metrics all look like they started from scratch. Good thing we didn't try this on all rates.

odd, that should have worked, also there's currently ~1700 metrics under restbase/ does that seem ok? we can restore from backups and rename too

@fgiunchedi, it might be worth trying to restore those rates from the backup (and moving them directly to sample_rate). It would bring back the historical data.

If confirmed working, we might want to do the same for all timers.

I think we're now back on track with sample_rate being used in place of rate ?

I think this can be closed, reopen if appropriate