Set up grafana dashboard for ORES extension
Closed, Resolved · Public

Description

We have https://grafana.wikimedia.org/dashboard/db/ores-extension which contains some stats about jobs.

Event Timeline

A few tips:

  • Use the "Dashboards" link feature instead of hardcoding hyperlinks. That way users will stay on the current Grafana domain (e.g. public or admin), and it'll automatically add additional ones when you tag them with ores. You can also enable the dropdown option if it becomes too crowded. No visible change.
  • Use .rate instead of .count. count is the number of statsd messages received, not the total of the values (typically 1, but sometimes 0 or a number higher than 1 due to aggregation proxies). This change did actually change the shape of the graph:

Before:

Screen Shot 2016-07-28 at 00.24.25.png (790×2 px, 226 KB)

After:

Screen Shot 2016-07-28 at 00.24.32.png (796×2 px, 230 KB)

It seems the spikes were being hidden before, presumably because ORES sends statsd messages with an increment higher than 1.
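To illustrate why count can hide spikes, here is a minimal sketch of how the three properties differ for a single flush interval. The increments are made up, and the 60-second interval and the `summarize` helper are assumptions for illustration, not the actual statsd or ORES code.

```python
# Assumption: a 60-second flush interval; the real interval may differ.
FLUSH_INTERVAL_SECONDS = 60

def summarize(increments, interval_seconds=FLUSH_INTERVAL_SECONDS):
    """Return the count, sum, and rate for one flush interval."""
    total = sum(increments)
    return {
        "count": len(increments),          # number of messages received
        "sum": total,                      # total of the values sent
        "rate": total / interval_seconds,  # total per second
    }

# Hypothetical increments: both intervals contain four messages,
# but the second one carries much larger values.
quiet = summarize([1, 1, 1, 1])
spike = summarize([1, 50, 1, 200])

print(quiet)
print(spike)
```

Both intervals report the same count, so a graph of count stays flat, while sum and rate reflect the larger increments.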

Not shown above: previously the counts were off when the Grafana time range was wider than 7 days. Beyond that point Graphite no longer stores data per minute but in larger intervals, so Grafana renders points further apart and count represents the total number of messages in each larger interval (much higher), whereas rate keeps representing the aggregate rate per second.

See https://wikitech.wikimedia.org/wiki/Graphite#Extended_properties

Hey @Krinkle,
Thanks for the tips. One question: why not use sum instead of using rate and scaling it up?

Because sum being the total per minute is not stable. If you look at the last 5 days, it'll be per minute. But if you look at a larger range (e.g. 30 days), or at the same range further in the past (e.g. a range of 5 days last month), it'll be per 15 minutes (or per hour, or whatever), with values that will likely confuse you.

Example from the current configuration of https://grafana.wikimedia.org/dashboard/db/ores. It appears that the rate per minute was 500 one week, and 2000 the week before:

Screen Shot 2016-07-28 at 21.19.09.png (528×1 px, 134 KB)

Screen Shot 2016-07-28 at 21.20.24.png (232×1 px, 43 KB)

Screen Shot 2016-07-28 at 21.19.13.png (530×1 px, 129 KB)

In this last graph, if you hover over a few of the points and look at the timestamps, you'll see that they snap to 15-minute intervals, because that's what Graphite is configured to do to save space in historical data (it aggregates to larger intervals the further back you go).

sum: The total sum of all values in this interval. This is not subject to averaging in later stages of aggregation. As such, recent data will reflect the rate per minute, but queries further back report higher numbers over longer intervals. Use this in conjunction with integral() to produce a running total. When used in a graph directly, it will give something like a rate per minute (or per 5min, or per hour, depending on how far back your query goes).

Use rate instead if you need a stable rate per interval (e.g. always per second or always per minute). This will remain accurate beyond the last 7 days.
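The difference can be sketched with a toy rollup. This is only an illustration under stated assumptions: the traffic numbers are made up, the `rollup` and `average` helpers are hypothetical, and it assumes older data is stored per 15 minutes with sums added together and rates averaged.

```python
# Assumption: per-minute points for recent data are rolled up into
# 15-minute points for older data; sums are added, rates are averaged.
ROLLUP_FACTOR = 15  # minutes per rolled-up point (illustrative)

def rollup(points, factor, aggregate):
    """Combine every `factor` consecutive points into one using `aggregate`."""
    return [aggregate(points[i:i + factor]) for i in range(0, len(points), factor)]

def average(values):
    return sum(values) / len(values)

# Hypothetical steady traffic: 120 events every minute for one hour.
per_minute_sum = [120] * 60
per_minute_rate = [s / 60 for s in per_minute_sum]  # events per second

old_sum = rollup(per_minute_sum, ROLLUP_FACTOR, sum)
old_rate = rollup(per_minute_rate, ROLLUP_FACTOR, average)

print(per_minute_sum[0], old_sum[0])    # same traffic, different sum values
print(per_minute_rate[0], old_rate[0])  # rate per second is unchanged
print(old_rate[0] * 60)                 # rate scaled up to a per-minute value
```

With identical traffic, the rolled-up sum is 15 times larger than the per-minute sum, while the rate stays the same and can be multiplied by 60 if a per-minute figure is preferred.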

Okay, thank you! It was super helpful.