
Gaps in gNMI network statistics in eqiad
Closed, ResolvedPublic

Description

In recent weeks we have started to see gaps in the network statistics in Prometheus that we gather using gNMI. This started after the recent configuration changes (including additional BGP subscriptions), but strangely we didn't have any problems in the first few days following those changes.

It may be a performance issue. However, it's worth noting that we do not have the problem in codfw, where we have more subscriptions and metrics (because top-of-rack switches there are independent L3 devices that support gNMI, unlike in eqiad).

The gaps are not consistent between different measurements or devices. In other words, we might have stats for a particular interface on a device at a given time but be missing the data for another interface on the same device at that moment. The gaps are frequent and occur on all devices, however, effectively making the stats unusable right now.


The gaps can be seen in the raw stats returned when querying Thanos (see P73487), so it's not simply a matter of the PromQL queries we are using in Grafana.

It may just be a performance thing, in which case perhaps moving to K8s is a way forward. However, it's a mystery why codfw is ok, so I'd rather get to the bottom of it before assuming that. The first step will be to run gnmic in debug mode on netflow1002.eqiad.wmnet and see what that shows.

Event Timeline

cmooney triaged this task as Low priority.

I ran gnmic in debug mode on netflow1002 but nothing is jumping out at me as a problem, at least on a basic review of the logs.

One thing I do notice, and I suspected this before when we had gaps but wanted to confirm, is that each Prometheus instance in eqiad is missing different samples. I did a manual curl of one sample metric from both prometheus1005 and prometheus1006, and using a quick script we can see that only slightly over 50% of the samples are in both (P73492).
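For illustration, a comparison of this kind can be sketched roughly as below. This is not the script in P73492. It is a minimal example that assumes each instance was queried through the Prometheus HTTP API (`/api/v1/query_range`) with the same step, so that timestamps are directly comparable, and that the JSON response for a single series was saved. The function names and the sample payloads are made up for the example.

```python
import json


def sample_timestamps(response_text):
    """Return the set of sample timestamps in a query_range JSON response.

    Assumes the standard matrix layout, where data.result[*].values is a
    list of [unix_timestamp, "value"] pairs.
    """
    payload = json.loads(response_text)
    timestamps = set()
    for series in payload["data"]["result"]:
        for ts, _value in series["values"]:
            timestamps.add(int(ts))
    return timestamps


def overlap_report(ts_a, ts_b):
    """Summarise how many timestamps two instances share."""
    union = ts_a | ts_b
    both = ts_a & ts_b
    return {
        "only_a": len(ts_a - ts_b),
        "only_b": len(ts_b - ts_a),
        "both": len(both),
        "union": len(union),
        "shared_fraction": len(both) / len(union) if union else 0.0,
    }


def fake_response(values):
    """Build a hypothetical query_range payload for the example."""
    return json.dumps({
        "status": "success",
        "data": {"resultType": "matrix",
                 "result": [{"metric": {}, "values": values}]},
    })


# Made-up payloads standing in for the two curl outputs.
resp_a = fake_response([[60, "1"], [120, "2"], [240, "4"]])
resp_b = fake_response([[60, "1"], [180, "3"], [240, "4"]])

report = overlap_report(sample_timestamps(resp_a), sample_timestamps(resp_b))
print(report)
```

Working on sets of timestamps keeps the check independent of the metric values, which is enough to show that the two hosts are dropping different samples.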

This somewhat explains what can be observed in the Grafana graphs when using the Prometheus data source directly. The gaps in the graph can change suddenly when the dashboard refreshes. This is dependent on which Prometheus back-end our load-balancers send the Grafana request to: if it lands on a different host than the previous query, we are missing different samples and the gaps move.

I'll see if I can determine why this might be, and perhaps even ask on the gnmic GitHub if they have any ideas what might cause it.

Also fwiw I grabbed the same stats for 24 hours from both prometheus servers, and compared the total stats.

In total there are 115 gaps in the data. 68 of these are just a single sample missing (120 seconds between the ones we have). In another 27 we are missing two samples, in 11 we are missing 3 and in 18 we are missing 4. The rest are outliers caused when I restarted the gnmic service in testing (P73493).
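As a rough sketch of how gaps like these can be bucketed by size (this is not the script in P73493), the snippet below walks over sorted timestamps and counts how many samples are missing between neighbours. The 60-second interval is inferred from the "120 seconds between the ones we have" remark, and the function name and example timestamps are hypothetical.

```python
from collections import Counter


def gap_histogram(timestamps, interval=60):
    """Count gaps by the number of samples missing in each one.

    A spacing of 2 * interval between neighbours means one sample is
    missing, 3 * interval means two are missing, and so on.
    """
    ordered = sorted(set(timestamps))
    histogram = Counter()
    for earlier, later in zip(ordered, ordered[1:]):
        missing = round((later - earlier) / interval) - 1
        if missing > 0:
            histogram[missing] += 1
    return dict(histogram)


# Made-up timestamps: one sample missing after 60, two missing after 240.
example = [0, 60, 180, 240, 420, 480]
histogram = gap_histogram(example)
print(histogram)
```

Unusually large keys in the resulting dictionary would correspond to the outliers, such as the service restarts mentioned above.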

The general learning there is that between both prometheus servers we have *most* of the data, albeit with some gaps. It does explain why the Thanos-based graphs are more complete, though the gaps are still fairly pronounced in the comparison graphs below:

https://grafana.wikimedia.org/goto/6yLToBcHR

Change #1121590 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] gNMIc: Increase prometheus worker threads and cache time

https://gerrit.wikimedia.org/r/1121590

I increased the "cache timeout" for stats received from routers in eqiad, and upped the number of threads for the prometheus output from 4 to 8.

Since the change all the graphs are looking good :)

Also, comparing the stats from a manual curl pull, this is what I got yesterday for 24h of stats. Note the discrepancy between the two prometheus hosts:

1037 samples in prom1005
667 samples in prom1006
1197 total samples in the Thanos data
1239 unique samples across the two prometheus servers

Now, doing the same for the past 12 hours (half the time range above):

1040 samples in prom1005
999 samples in prom1006
1058 total samples in the Thanos data
1086 unique samples across the two prometheus servers

So not an exact match but much closer.
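To put the two sets of counts on the same footing, each source can be expressed as a share of the unique samples seen across both servers. The snippet below is only a worked calculation on the numbers quoted above; the `coverage` helper is a name made up for this example.

```python
def coverage(counts, unique_total):
    """Percentage of the unique samples that each source holds."""
    return {name: round(100 * n / unique_total, 1) for name, n in counts.items()}


# 24h pull before the change, and 12h pull after it (numbers from above).
before = coverage({"prom1005": 1037, "prom1006": 667, "thanos": 1197}, 1239)
after = coverage({"prom1005": 1040, "prom1006": 999, "thanos": 1058}, 1086)

print("before:", before)
print("after: ", after)
```

On these figures prometheus1006 goes from holding roughly 54% of the unique samples to roughly 92%, and prometheus1005 from roughly 84% to roughly 96%.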

Change #1121590 merged by Cathal Mooney:

[operations/puppet@production] gNMIc: Increase prometheus worker threads and cache time

https://gerrit.wikimedia.org/r/1121590

Gonna close this one at this point. All has been ok in eqiad and codfw since the increase in thread count last week - gaps are no longer present.
