Investigate gnmic metric gaps and counters going to zero
Closed, Resolved (Public)

Description

Following up from https://phabricator.wikimedia.org/T372457#10408775

In Netops we make little use of the LibreNMS stats exported to Graphite, so it's fine if they go.

WMCS do use them, however, so we need to check on that side. I think everything that we need should be there, and if not we can add those paths to the gnmic collection. My only worry is that right now the gnmic stats have some problems, namely that we observe gaps in the graphs like this from time to time:

image.png (573×963 px, 59 KB)

I'm not 100% sure what the issue is here. I'm fairly certain it is not related to the way those graphs are set up, or anything like rollovers in counter max values etc. But I've not had time to dig into the issue fully. When we first rolled out the gnmic stats we had similar gaps, but much bigger and more frequent. Increasing scraper timeouts and worker threads solved it for the most part, but we still see it sometimes. That makes me suspect the issue is still some sort of occasional performance bottleneck. The netflow VMs don't seem to be overly taxed, however (CPU hits max on them when scraping at certain points, but it's not constant so there should be cycles for it to do whatever it needs).

Thank you for the explanation, that makes sense to me. What is the dashboard and the underlying expression in the graph above?

We also see another type of discrepancy, which I'm not so sure about (perhaps it is related to counter rollovers?). Here we do appear to have measurements but the counter goes to zero, even though it's pretty much impossible that was actually the case:

{F58025300 width=600}

The image doesn't show up for me; however, what's the dashboard and expression I can take a look at?

Event Timeline

What is the dashboard and the underlying expression in the graph above?

That one came from here I think:

https://grafana.wikimedia.org/goto/MDzHFTINR

The expression is designed to always reflect peak usage over time so it's like this:

max_over_time(irate(gnmi_interfaces_interface_state_counters_in_octets{instance=~"$device",interface_name="$interface"}[2m])[$__rate_interval:]) * 8

But you'll see it even with a simple rate() or irate() (if the gap is big enough).

I looked at the panel for the second type of problem and spotted a problem with the prom query. With that fixed I don't see the "jumps to 0" problem. So the issue seems to be confined to the "gaps", which we've had since day one and have been much less severe as we've given more resources to gnmic.

@fgiunchedi yeah I'm pretty sure it's only gaps in the data we are seeing, for instance here:

https://grafana.wikimedia.org/goto/_GSV1TIHR

On the "stacked" graphs the gaps get rendered strangely and look like traffic dips, but when showing just a single interface it's clear that no measurements are being stored with a zero increment of the counter; we simply have gaps in the data.

There definitely seems to be a pattern to some of them also:

https://grafana.wikimedia.org/goto/7ryixoSHg

You can paper over it in Grafana with "connect null values" in a timeseries graph, but still best we get to the bottom of it.

Indeed the underlying data/samples are there as expected: I tested this theory by removing all functions and looking at the raw data, which indeed has no gaps. I noticed the interval used for the (i)rate calculation is 2m, which I believe is the culprit: we're scraping data every 1m, and the raw data points that irate() needs (i.e. two) might not both fall inside the window looked at by irate(...[2m]) for a given interval, in which case no point is returned for that interval. Switching the rate calculation from 2m to 5m widens the window to look for data and eliminates the gaps. Note the values are unchanged since rate/irate always return per-second values. Let me know what you think! cc @CDanis
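To make the window effect concrete, here is a minimal Python sketch. It is not Prometheus code: the `irate` helper below is a simplified stand-in that assumes a left-open, right-closed lookback window and uses only the last two samples in it. The sample timestamps and counter values are hypothetical, with one sample arriving 5 seconds late.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def irate(samples: Sequence[Sample], eval_time: float, window: float) -> Optional[float]:
    """Simplified irate(): per-second rate from the last two samples
    inside (eval_time - window, eval_time]. Returns None when fewer than
    two samples fall inside the window, which renders as a gap."""
    in_window = [s for s in samples if eval_time - window < s[0] <= eval_time]
    if len(in_window) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = in_window[-2], in_window[-1]
    # A decreasing counter is treated as a reset: count from zero.
    delta = v_last - v_prev if v_last >= v_prev else v_last
    return delta / (t_last - t_prev)


# Hypothetical counter growing at 1000 octets/s, nominally scraped every
# 60s, but the fourth sample lands 5s late (t=185 instead of t=180).
samples = [(0, 0), (60, 60_000), (120, 120_000), (185, 185_000), (240, 240_000)]

for window in (120, 300):
    print(window, [irate(samples, t, window) for t in (120, 180, 240)])
# 120 [1000.0, None, 1000.0]
# 300 [1000.0, 1000.0, 1000.0]
```

With the 2m window, the evaluation at t=180 finds only the t=120 sample and returns nothing, while the 5m window still finds two samples and returns the same per-second value as its neighbours.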

Indeed the underlying data/samples are there as expected: I tested this theory by removing all functions and looking at the raw data, which indeed has no gaps.

Ok cool thanks for digging into this!

I noticed the interval used for the (i)rate calculation is 2m, which I believe is the culprit: we're scraping data every 1m, and the raw data points that irate() needs (i.e. two) might not both fall inside the window looked at by irate(...[2m]) for a given interval,

In my mind if we record a value every 1 minute, then in any 2-minute sliding window we should have two samples recorded right?

But we can deal with that if that is the cause. The goal of the "irate" is that we want as much granularity here as possible. Ideally we want to base the rate on the increase between two successive samples. A big gain for us in having the gnmi stats over LibreNMS is the increased resolution, smoothing the peaks over time has been a big issue for us there.

So - to try and retain as much of that as possible - I think maybe the best way forward is to enable the "connect null values" on the individual graphs (with threshold set to 3-4 mins). For the stacked graphs we can increase the window on the irate - I'll start with 3 minutes and see, we should definitely always have two samples in 3 minutes.

But we can deal with that if that is the cause. The goal of the "irate" is that we want as much granularity here as possible. Ideally we want to base the rate on the increase between two successive samples. A big gain for us in having the gnmi stats over LibreNMS is the increased resolution, smoothing the peaks over time has been a big issue for us there.

It's fine to make the time window longer with irate() -- it will always pick the two most-recent samples available within that range window.
(Note that this can cause different aliasing issues when running historical queries over long timespans!)
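A small Python sketch of both points, under stated assumptions: `irate_last_two` is a simplified stand-in for irate() (left-open, right-closed window, no counter-reset handling), and the counter values are hypothetical, with a burst between t=120 and t=180. It shows that widening the window leaves the value unchanged, that a single coarse evaluation can miss the burst (the aliasing concern), and that taking the maximum over per-minute evaluations, roughly what the max_over_time(...) subquery in the dashboard expression does, recovers it.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def irate_last_two(samples: List[Sample], eval_time: float, window: float) -> Optional[float]:
    """Simplified irate(): rate between the two most recent samples in
    (eval_time - window, eval_time]; None if fewer than two are present."""
    in_window = [s for s in samples if eval_time - window < s[0] <= eval_time]
    if len(in_window) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = in_window[-2], in_window[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# Hypothetical 1-minute samples: 1000 octets/s, except a burst of
# 8000 octets/s between t=120 and t=180.
values = [0, 60_000, 120_000, 600_000, 660_000, 720_000]
samples = [(60.0 * i, float(v)) for i, v in enumerate(values)]

# Widening the window does not smooth the burst: the last two samples
# at t=180 are the same whichever window is used.
burst = [irate_last_two(samples, 180, w) for w in (120, 180, 300)]
print(burst)  # [8000.0, 8000.0, 8000.0]

# Aliasing: a single evaluation at t=300 only sees samples 240 and 300.
coarse = irate_last_two(samples, 300, 300)
print(coarse)  # 1000.0

# Evaluating every minute and keeping the maximum recovers the peak.
per_minute = [irate_last_two(samples, t, 180) for t in (60, 120, 180, 240, 300)]
peak = max(r for r in per_minute if r is not None)
print(peak)  # 8000.0
```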

It's fine to make the time window longer with irate() -- it will always pick the two most-recent samples available within that range window.

I'd responded to this but not grokked it properly. If we extend the time we'll always get the two most recent, that's fine. I was thinking that we might miss the diff between samples 1 and 2 if we compare 2 and 3, but in such a scenario the previously graphed pixel will be comparing 1 and 2 (and ignoring 0). So yep that does make sense as being ok.

(Note that this can cause different aliasing issues when running historical queries over long timespans!)

Indeed we spoke about that before and thankfully I think the current way those graphs are working is ok for us.

FWIW I changed those dashboards as per my previous comment so let's see how it goes over the next while, hopefully we can now put it to bed. Thanks all!

Thank you all for looking into this -- let's indeed see how 3m (or larger) goes and if that is satisfactory!

In my mind if we record a value every 1 minute, then in any 2-minute sliding window we should have two samples recorded right?

Yes, and that's almost always the case. My understanding, though, is that the samples may not always be exactly aligned to two-minute boundaries (hence the gaps we were seeing).

Yes, and that's almost always the case. My understanding, though, is that the samples may not always be exactly aligned to two-minute boundaries (hence the gaps we were seeing).

Yep that makes sense, esp. here when you consider the pipeline. So far looks good with those changes. Let's keep this task open for now but if we don't notice any discrepancies over the coming days we can close it. Awesome to get to the bottom of it!

cmooney claimed this task.

I think we can close this one, all the graphs have been solid since changing the parameters on the query. Thanks for the help @fgiunchedi!