
Productionize gnmic network telemetry pipeline
Open, Medium, Public

Description

Network Telemetry

We now have some basic interface metrics being streamed over gNMI RPC by compatible devices at most of our POPs. This is done by gnmic running on our Netflow VMs, which exposes a Prometheus endpoint to export the data. The work was mostly done under T326322, with a high-level overview on Wikitech here:

https://wikitech.wikimedia.org/wiki/Network_telemetry

The tool and workflow are working reasonably well, giving us more granular network usage metrics and per-queue stats we do not have in LibreNMS. So it can be considered a successful proof-of-concept, on which we now want to build to make this approach a fully integrated part of our network monitoring and alerting.

Next Steps

Over the next while we should work out how to add collection of the following:

  • LAG / AE interface stats
  • Sub-interface stats
  • Firewall filter stats
  • BGP groups / neighbor states
  • OSPF interface states
  • RPKI sessions
  • More tbc

For each we need to identify the correct telemetry paths to use; in fact some of this data is available from multiple different paths, so it's worth assessing the differences between them and the pros/cons of each. We also need to consider how we configure gnmic processors to manipulate and group the data as it is exposed as prometheus metrics.
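
As a rough starting point, a gnmic subscription sketch with some candidate OpenConfig paths is below. This is purely illustrative: the paths are taken from the standard OpenConfig models as I understand them and need to be verified against our platforms and JunOS versions (Juniper also exposes vendor-native sensors for some of this, e.g. firewall filter counters, and I'm not sure yet what the right source for RPKI session state is). The subscription and processor names are made up, and the processor condition / tag name is hypothetical.

# Illustrative sketch only -- candidate OpenConfig paths to evaluate, not a working config.
subscriptions:
  lag-interfaces:
    paths:
      - /interfaces/interface/aggregation/state       # LAG/AE state
      - /interfaces/interface/state/counters          # per-interface counters (incl. ae*)
    mode: stream
    stream-mode: sample
    sample-interval: 30s
  sub-interfaces:
    paths:
      - /interfaces/interface/subinterfaces/subinterface/state/counters
    mode: stream
    stream-mode: sample
    sample-interval: 30s
  bgp-neighbors:
    paths:
      - /network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state/session-state
    mode: stream
    stream-mode: sample
    sample-interval: 60s
  firewall-filters:
    paths:
      - /acl/acl-sets/acl-set/acl-entries/acl-entry/state/matched-packets   # OpenConfig ACL model; JunOS support TBC
    mode: stream
    stream-mode: sample
    sample-interval: 60s

processors:
  drop-disabled-ifaces:            # illustrative event processor
    event-drop:
      condition: '.tags.admin_status == "DOWN"'   # hypothetical tag name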

In terms of the overall setup we should probably also assess:

  • Where to run gnmic from long term (i.e. Netflow VMs or dedicated hosts)
  • What tuning might be a good idea for our workflow, for example (see the sketch after this list):
    • Timeouts
    • Caching
    • Subscription intervals
  • Improvements to our current TLS/PKI for network devices
    • More recent JunOS might help here
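
For reference, here is roughly where those knobs live in the gnmic config, as far as I can tell from the docs. All values below are placeholders, and the device name and TLS file paths are hypothetical, so this is only a sketch of the shape; subscription intervals are the sample-interval fields shown in the sketch above.

targets:
  cr1-example.wikimedia.org:       # hypothetical device
    timeout: 10s                   # per-target gNMI RPC timeout
    tls-ca: /etc/gnmic/ca.pem      # hypothetical paths; depends on our PKI setup
    tls-cert: /etc/gnmic/cert.pem
    tls-key: /etc/gnmic/key.pem

outputs:
  prom:
    type: prometheus
    listen: :9804
    timeout: 10s                   # time allowed to build the /metrics response
    expiration: 60s                # how long entries live in the local cache
    cache: {}                      # caching on/off (needed for our processors to label stats)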

Finally, when we have things on a stable footing, we can consider adding some alertmanager alerting based on the prometheus stats, to fill gaps not covered by LibreNMS or indeed to replace existing checks. Creating this task to track progress.

Event Timeline

cmooney triaged this task as Medium priority. Jul 5 2024, 5:11 PM
cmooney created this task.
cmooney renamed this task from Poructionize gnmic network telemetry pipeline to Productionize gnmic network telemetry pipeline. Jul 5 2024, 5:12 PM

Icinga downtime and Alertmanager silence (ID=23e26d8b-bf98-4528-9f4f-f796eb123261) set by cmooney@cumin1002 for 0:15:00 on 1 host(s) and their services with reason: reboot netflow2003

netflow2003.codfw.wmnet

VM netflow2003.codfw.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

So, we hit a bit of a speed-bump in codfw with the gnmic stats once the new switches were made live there. We now have 36 active gnmic subscriptions from netflow2003 (and 3 more configured which fail: asw/fasw). Unfortunately, once puppet updated the gnmic config in codfw with the new switches, we ceased to get any stats in prometheus 😦

Problems

Looking at the VM I noticed it was using swap heavily, and the CPU was regularly pegged at 100%. So I updated the resources on the VM, giving it 4GB of RAM and 4 vCPUs. This definitely seems like the correct decision: RAM usage is now over 2GB and there is no swapping. But despite that, the stats were still not working.

Digging into what was happening I could see from a prometheus host that getting the stats over HTTP was taking longer than our configured timeout:

cmooney@prometheus2005:~$ time curl http://netflow2003.codfw.wmnet:9804/metrics | tee /tmp/gnmic_data1  

<--- output cut --->

real	0m34.581s
user	0m0.040s
sys	0m0.139s

In some of the tests it was taking over 90 seconds! As can be seen there, the actual time the prometheus host spends working is very low, but the request takes a long time to complete. Watching it in real time, you can see that after the GET request is made nothing happens for quite a while, and then the stats all arrive at once.

Troubleshooting

Reading the gnmic docs, it seems that our enabling "caching" for the prometheus output means the raw gNMI stats are only processed into prometheus metrics after gnmic receives an HTTP request for them. This is the cause of the big delay between the prometheus server making the GET request and the stats being served. Disabling the caching function reduced the curl time to well under a second, but unfortunately it prevents some of the "processors" / data mangling we have configured from working. Specifically, it means most stats don't have the associated interface labels added, which makes them pretty much useless. So we need caching enabled.

I did find that increasing the "expiration" value for the cache helped a lot. This completely resolved the cases where gathering the stats took 90 seconds or more. I suspect what is happening without it is that some stats "expire" from the local gnmic cache, and when the prometheus server requests /metrics, gnmic has to wait for fresh data from some of the routers before responding. Increasing the 'expiration' means it still has the last results and can serve them instead. Timestamps are set to what the router reported, so while this may mean less "fresh" data in prometheus, it shouldn't be inaccurate.

Given that the event processors are the largest contributor to the delayed response to the HTTP request, I also tried removing the processors I had added to drop stats for 'disabled' interfaces, which reduced the time further. This is no big deal, as actual counters aren't returned for those interfaces; we just get some static data like their MTU. That's somewhat wasteful to store, but not worth dropping if running the processor is impacting performance. Prometheus should store it sensibly afaik.

Actions

So overall, to get things working, I think we should do the following (a config sketch with these settings follows the list):

  • Increase the timeouts to 50 seconds
    • Gives the maximum time for gnmic to mangle the data on receipt of a HTTP request for /metrics
    • Although this is a long time, for 99.9% of it the prometheus server is simply idle waiting on the response
    • The actual time spent processing by the prometheus server is still a fraction of a second
  • Set the 'expiration' for the gnmic prometheus cache to 120s
    • Prevents gnmic waiting for fresh data from routers before it responds to prometheus
  • Remove the processors to delete stats for disabled interfaces
    • As mentioned it's not a huge saving on storage, no counters are returned for these
    • It does seem like being cautious with the number of processors is wise
  • Set the "num-workers" for the gnmic prometheus output to 2
    • This adds some parallelism to the processing of stats that takes place when they are requested
    • The effect of this is not dramatic but it seems sensible to do
    • Especially as we scrape from two prometheus hosts
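
Applied to the gnmic prometheus output, the above roughly translates into something like this. It's a sketch against the option names as I understand them from the gnmic docs; the exact form of the cache and export-timestamps settings, and the processor names (hypothetical here), should be checked against the version we ship and our puppetised template.

outputs:
  prom:
    type: prometheus
    listen: :9804
    path: /metrics
    timeout: 50s               # max time gnmic gets to mangle data when /metrics is scraped
    expiration: 120s           # serve last-known values rather than waiting on routers
    num-workers: 2             # some parallelism; we are scraped by two prometheus hosts
    cache: {}                  # keep caching enabled so interface labels are still added
    export-timestamps: true    # assumption: keep router-reported timestamps on samples
    event-processors:          # keep the label-adding processors...
      - add-interface-labels   # (hypothetical name)
      # - drop-disabled-ifaces # ...but remove the disabled-interface drop processors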

Testing in anger with these settings, having disabled puppet on netflow2003 and modified the conf file manually, we see a big improvement:

cmooney@prometheus2005:~$ time curl http://netflow2003.codfw.wmnet:9804/metrics | tee /tmp/gnmic_data1  

<--- output cut --->

real	0m14.955s
user	0m0.020s
sys	0m0.162s

All of this does suggest we should probably look at running distributed collectors as we move to productionize this, potentially on Kubernetes. To be discussed; gnmic does have extensive support for running in a distributed way in any case.
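
For the record, gnmic's clustering mode shares targets across instances using a lock service (Consul is one of the supported lockers per the gnmic docs). A very rough sketch of the shape, with placeholder names and a hypothetical locker address:

clustering:
  cluster-name: netflow-gnmic            # placeholder cluster name
  instance-name: netflow2003             # each collector needs a unique instance name
  locker:
    type: consul
    address: consul.example.wmnet:8500   # hypothetical; whatever locker we would deploy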

For now I'll prep a patch and we can discuss.

Change #1056136 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Tweak gnmic parameters to improve performance

https://gerrit.wikimedia.org/r/1056136

Change #1056136 merged by Cathal Mooney:

[operations/puppet@production] Tweak gnmic parameters to improve performance

https://gerrit.wikimedia.org/r/1056136

VM netflow1002.eqiad.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

VM netflow3003.esams.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

VM netflow4002.ulsfo.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

VM netflow5002.eqsin.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

VM netflow6001.drmrs.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

In Eqiad our netflow VM was also running a little hot, and swapping to disk.

I've now increased the resources for it and also the other netflow VMs in the estate. Setup as of now is:

  • eqiad, codfw: 4 vCPU, 4GB RAM
  • POPs: 2 vCPU, 3GB RAM

At the POPs I only increased the RAM, as the VMs there weren't showing signs of any real constraint (although RAM usage was fairly high). I also left the vCPUs at 2 for those, as the number of devices gnmic handles there won't ever be very high, so there's not as much processing for them to do.

VM netflow7001.magru.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

Change #1075548 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Prometheus gNMI check: use TCP check instead

https://gerrit.wikimedia.org/r/1075548

Change #1075548 merged by Ayounsi:

[operations/puppet@production] Prometheus gNMI check: use TCP check instead

https://gerrit.wikimedia.org/r/1075548

Change #1100488 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Increase the number of gnmic worker and writer threads

https://gerrit.wikimedia.org/r/1100488

Change #1100488 merged by Cathal Mooney:

[operations/puppet@production] Increase the number of gnmic worker and writer threads

https://gerrit.wikimedia.org/r/1100488

Just a few notes on this. Firstly, we are now getting the AE/LAG interface stats for our core routers, since they were all upgraded to a more recent JunOS.

Secondly, we have some minor imperfections with the collection/graphing right now; I documented them (probably on the wrong task) here (cc @fgiunchedi):

https://phabricator.wikimedia.org/T372457#10408775

No worries at all @cmooney, I've opened T382396: Investigate gnmic metric gaps and counters going to zero to investigate/followup on the two issues you mentioned