
Productionize gnmic network telemetry pipeline
Closed, Resolved (Public)

Description

Network Telemetry

We now have some basic interface metrics being streamed by compatible devices at most of our POPs over gNMI RPC. This is handled by gnmic running on our Netflow VMs, which exposes a Prometheus endpoint to export the data. The work was mostly done under T326322, with a high-level overview on Wikitech here:

https://wikitech.wikimedia.org/wiki/Network_telemetry

The tool and workflow are working reasonably well, giving us more granular network usage metrics and per-queue stats we do not have in LibreNMS. So it can be considered a successful proof of concept, on which we want to build to make this approach a fully integrated part of our network monitoring and alerting.

Next Steps

Over the next while we should work out how to add collection of the following:

  • LAG / AE interface stats
  • Sub-interface stats
  • Firewall filter stats
  • BGP groups / neighbor states
  • OSPF interface states
  • RPKI sessions
  • More tbc

For each we need to identify the correct telemetry paths to use; in fact some of this data is available from multiple different paths, so it's worth assessing what the differences between them are and the pros/cons of each. We also need to consider how we configure gnmic processors to manipulate and group the data as it is exposed as Prometheus metrics.
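
For illustration, a minimal sketch of what adding one of these might look like in the gnmic YAML config. The path and processor name below are placeholders rather than final choices, and the exact option names should be checked against the gnmic docs:

subscriptions:
  # Placeholder: LAG/AE interface state. The same data may be reachable via
  # more than one path (e.g. the OpenConfig /interfaces tree vs a vendor
  # native tree), so part of the work is picking which to subscribe to.
  lag-interfaces:
    paths:
      - /interfaces/interface/aggregation/state
    mode: stream
    stream-mode: sample
    sample-interval: 60s

outputs:
  prom:
    type: prometheus
    listen: :9804
    event-processors:
      # Output-side processors control how events are grouped/labelled
      # before being exposed as Prometheus metrics.
      - some-grouping-processor   # placeholder name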

In terms of the overall setup we should probably also assess:

  • Where to run gnmic from long term (i.e. Netflow VMs or dedicated hosts)
  • What tuning might be a good idea for our workflow, for example:
    • Timeouts
    • Caching
    • Subscription intervals
  • Improvements to our current TLS/PKI for network devices
    • More recent JunOS might help here

Finally, when we have things on a stable footing, we can consider adding some Alertmanager alerting based on the Prometheus stats, to fill gaps not covered by LibreNMS or indeed to replace existing checks. Creating this task to track progress.

Event Timeline

There are a very large number of changes, so older changes are hidden.
cmooney triaged this task as Medium priority. Jul 5 2024, 5:11 PM
cmooney renamed this task from Poructionize gnmic network telemetry pipeline to Productionize gnmic network telemetry pipeline. Jul 5 2024, 5:12 PM

Icinga downtime and Alertmanager silence (ID=23e26d8b-bf98-4528-9f4f-f796eb123261) set by cmooney@cumin1002 for 0:15:00 on 1 host(s) and their services with reason: reboot netflow2003

netflow2003.codfw.wmnet

VM netflow2003.codfw.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

So, we hit a bit of a speed-bump in codfw with the gnmic stats once the new switches were made live there. We now have 36 active gnmic subscriptions from netflow2003 (and 3 more configured which fail, asw/fasw). Unfortunately once puppet updated the gnmic config in codfw with the new switches we ceased to get any stats in prometheus 😦

Problems

Looking at the VM I noticed it was using swap heavily, and the CPU was regularly pegged at 100%. So I updated the resources on the VM, giving it 4GB of RAM and 4 vCPUs. This definitely seems like the correct decision: RAM usage is now over 2GB and there is no swapping. But despite that, the stats were still not working.

Digging into what was happening I could see from a prometheus host that getting the stats over HTTP was taking longer than our configured timeout:

cmooney@prometheus2005:~$ time curl http://netflow2003.codfw.wmnet:9804/metrics | tee /tmp/gnmic_data1  

<--- output cut --->

real	0m34.581s
user	0m0.040s
sys	0m0.139s

In some of the tests it was taking over 90 seconds! As can be seen there, the actual time the prometheus host spends working is very low, but the request takes a long time to complete. Watching in real time, you can see that after the GET request is made nothing happens for a long time, and then the stats arrive all at once.

Troubleshooting

Reading the gnmic docs, it seems that our enabling "caching" for the prometheus output means that the raw gNMI stats are only processed into Prometheus metrics after gnmic receives an HTTP request for them. This is the cause of the big delay between the prometheus server making the GET request and the stats being served. Disabling the caching function reduced the curl time to well under a second, but unfortunately it prevents some of the "processors" / data mangling we have configured from working. Specifically, it means most stats don't have the associated interface labels added, which makes them pretty much useless. So we need caching enabled.

I did find that increasing the "expiration" value for the cache helped a lot. This completely resolved the cases where gathering the stats took 90 seconds or more. I suspect what is happening - without it - is that some stats "expire" from the local gnmic cache, and when the prometheus server requests /metrics gnmic has to wait for fresh data from some of the routers before responding. Increasing the 'expiration' means it still has the last results and can serve them instead. Timestamps are set to what the router reported, so while this may mean less "fresh" data in prometheus, it shouldn't be inaccurate.

Given that the event processors are the largest contributor to the delayed response to the HTTP request, I also tried removing the processors I added to drop stats for 'disabled' interfaces, which reduced the time further. This is no big deal, as actual counters aren't returned for those interfaces; we just get some static data like their MTU. Somewhat wasteful to store, but it doesn't seem worth dropping them if doing so impacts performance. Prometheus should store it sensibly afaik.

Actions

So overall, to get things working, I think we should do the following (rough config sketch after the list):

  • Increase the timeouts to 50 seconds
    • Gives the maximum time for gnmic to mangle the data on receipt of a HTTP request for /metrics
    • Although this is a long time, for 99.9% of it the prometheus server is idle waiting on the response
    • The actual time spent processing by the prometheus server is still a fraction of a second
  • Set the 'expiration' for the gnmic prometheus cache to 120s
    • Prevents gnmic waiting for fresh data from routers before it responds to prometheus
  • Remove the processors to delete stats for disabled interfaces
    • As mentioned it's not a huge saving on storage, as no counters are returned for these
    • It does seem like being cautious with the number of processors is wise
  • Set the "num-workers" for the gnmic prometheus output to 2
    • This adds some parallelism to the processing of stats that takes place when they are requested
    • The effect of this is not dramatic but it seems sensible to do
    • Especially as we scrape from two prometheus hosts
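
Roughly, the settings above map onto the gnmic prometheus output config like this (option names as per my reading of the gnmic docs, so treat as approximate and double-check before merging), and, assuming the timeout in question is the Prometheus scrape_timeout for the gnmi job, the scraper side looks like the second snippet:

# gnmic side: prometheus output (sketch, not the exact puppet-rendered config)
outputs:
  prom:
    type: prometheus
    listen: :9804
    expiration: 120s   # keep last values so gnmic doesn't block waiting on routers
    num-workers: 2     # some parallelism; one per scraping prometheus host
    cache: {}          # caching stays enabled so output-level processors keep working
    event-processors:
      - add-interface-labels   # placeholder; the drop-disabled-interfaces one is removed

# prometheus side: allow up to 50s for gnmic to build the response
scrape_configs:
  - job_name: gnmi
    scrape_timeout: 50s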

Testing in anger with these settings, having disabled puppet on netflow2003 and modified the conf file manually, we see a big improvement:

cmooney@prometheus2005:~$ time curl http://netflow2003.codfw.wmnet:9804/metrics | tee /tmp/gnmic_data1  

<--- output cut --->

real	0m14.955s
user	0m0.020s
sys	0m0.162s

All of this does suggest we should probably look at running distributed collectors as we move to productionize this, potentially on Kubernetes. To be discussed; gnmic does have extensive support for running in a distributed way anyway.
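
For reference, gnmic's clustering support works roughly along these lines as I read their docs; this is an untested sketch with key names from memory (the cluster/instance names and Consul address are made up), so verify against the docs before relying on it:

clustering:
  cluster-name: wmf-netmon            # placeholder cluster name
  instance-name: gnmic-collector-1    # unique per gnmic instance
  locker:
    type: consul                      # targets are shared between instances via a lock service
    address: consul.example.org:8500  # placeholder address
api-server:
  address: ":7890"                    # instances coordinate over the REST API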

For now I'll prep a patch and we can discuss.

Change #1056136 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Tweak gnmic parameters to improve performance

https://gerrit.wikimedia.org/r/1056136

Change #1056136 merged by Cathal Mooney:

[operations/puppet@production] Tweak gnmic parameters to improve performance

https://gerrit.wikimedia.org/r/1056136

VM netflow1002.eqiad.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

VM netflow3003.esams.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

VM netflow4002.ulsfo.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

VM netflow5002.eqsin.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

VM netflow6001.drmrs.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

In Eqiad our netflow VM was also running a little hot, and swapping to disk.

I've now increased the resources for it and also the other netflow VMs in the estate. Setup as of now is:

eqiad, codfw: 4 vCPU, 4GB RAM
POPs: 2 vCPU, 3GB RAM

I only increased the RAM at the POPs, as the VMs there weren't showing signs of being constrained (although RAM usage was fairly high). I also left the CPUs at 2 for those, as the number of devices gnmic talks to there won't ever be very high, so there is not as much processing for them.

VM netflow7001.magru.wmnet rebooted by cmooney@cumin1002 with reason: increase VM RAM

Change #1075548 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Prometheus gNMI check: use TCP check instead

https://gerrit.wikimedia.org/r/1075548

Change #1075548 merged by Ayounsi:

[operations/puppet@production] Prometheus gNMI check: use TCP check instead

https://gerrit.wikimedia.org/r/1075548

Change #1100488 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Increase the number of gnmic worker and writer threads

https://gerrit.wikimedia.org/r/1100488

Change #1100488 merged by Cathal Mooney:

[operations/puppet@production] Increase the number of gnmic worker and writer threads

https://gerrit.wikimedia.org/r/1100488

Just a few notes on this. Firstly we are now getting the AE/LAG interface stats for our core routers since they were all upgraded to a more recent JunOS.

Secondly, we have some minor imperfections with the collection/graphing right now; I documented them, probably on the wrong task, here (cc @fgiunchedi ):

https://phabricator.wikimedia.org/T372457#10408775

No worries at all @cmooney, I've opened T382396: Investigate gnmic metric gaps and counters going to zero to investigate/followup on the two issues you mentioned


Thanks for the help on that one, happy to say the imperfections in the graphs were due to the way we were building them in Grafana; the data itself and the gnmi pipeline seem to be working fine.

Icinga downtime and Alertmanager silence (ID=d0f01fc7-5a29-49c5-8292-aebad021ff73) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow7001.magru.wmnet

Icinga downtime and Alertmanager silence (ID=26b7dbb9-1906-4b10-a433-cc2ffb6bdb61) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow7001.magru.wmnet

Icinga downtime and Alertmanager silence (ID=fe2806ef-4f5c-4485-981c-52b89f9e3154) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow7001.magru.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-01-22T10:38:52Z] <topranks> disable-puppet on netflow7001 to run gnmic in foreground for debug/development T369384

Change #1113449 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add BGP data collection from network devices over GNMI

https://gerrit.wikimedia.org/r/1113449

The above patch adds BGP stats collection to our current setup. Tested in Magru and working well, albeit with a few quirks discovered on the way. It will give us stats like these:

gnmi_bgp_neighbor_prefixes_accepted{afi_safi_afi_safi_name="IPV4_UNICAST",neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 4162 1737554440063

gnmi_bgp_neighbor_prefixes_installed{afi_safi_afi_safi_name="IPV4_UNICAST",neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 1 1737554440063

gnmi_bgp_neighbor_prefixes_received{afi_safi_afi_safi_name="IPV4_UNICAST",neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 4164 1737554440063

gnmi_bgp_neighbor_prefixes_received_pre_policy{afi_safi_afi_safi_name="IPV4_UNICAST",neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 4164 1737554440063

gnmi_bgp_neighbor_prefixes_rejected{afi_safi_afi_safi_name="IPV4_UNICAST",neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 2 1737554440063

gnmi_bgp_neighbor_prefixes_sent{afi_safi_afi_safi_name="IPV4_UNICAST",neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 3 1737554440063

gnmi_bgp_neighbor_state_enabled{neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 1 1737554439786

gnmi_bgp_neighbor_state_established_transitions{neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 94 1737554439576

gnmi_bgp_neighbor_state_interface_error{neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 0 1737554440032

gnmi_bgp_neighbor_state_last_established{neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 1.737535503e+18 1737554439992

gnmi_bgp_neighbor_state_session_state{neighbor_neighbor_address="187.16.222.90",network_instance_name="master",peer_as="15169",peer_descr="Google LLC",peer_group="IX4",peer_type="EXTERNAL",target="cr2-magru.wikimedia.org"} 6 1737554439827

gnmi_bgp_total_paths{network_instance_name="master",target="cr2-magru.wikimedia.org"} 2.676663e+06 1737554439704

gnmi_bgp_total_prefixes{network_instance_name="master",target="cr2-magru.wikimedia.org"} 2.676663e+06 1737554439713

A few notes on things I discovered while getting the config right:

OpenConfig Paths

JunOS has used OpenConfig paths to export BGP information since version 18.4. The model seems to support everything we'd need for basic alerting, and hopefully means the same gnmic config will work for other vendors.

Starting with Junos OS Release 18.4R1, BGP operational states are aligned and compliant with OpenConfig data model openconfig-bgp-operational.yang. To stream BGP operational states, use the resource path /network-instances/network-instance/protocols/protocol/bgp/. Previously, the path was /bgp/.

I used fairly specific paths for the subscription, as the base bgp path returns a lot of metrics that never change or we don't need to keep in monitoring. A bit annoying to set up / get them all but better than keeping a load of junk.
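
As an example of what "fairly specific paths" means in practice, the subscription ends up looking something like this (an illustrative subset, not the exact list from the patch):

subscriptions:
  bgp:
    # Specific state leaves/containers rather than the whole .../bgp/ tree,
    # to avoid exporting config leaves and counters we don't need.
    paths:
      - /network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state/session-state
      - /network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/afi-safis/afi-safi/state/prefixes
      - /network-instances/network-instance/protocols/protocol/bgp/global/state/total-paths
      - /network-instances/network-instance/protocols/protocol/bgp/global/state/total-prefixes
    mode: stream
    stream-mode: sample
    sample-interval: 60s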

Text Values

There is a bit of strange behaviour with text values returned to gnmic. For instance the bgp-neighbor-state endpoint returns text values like "IDLE", "ESTABLISHED" etc. The SNMP MIBs had something very similar (bgpPeerState); however, being SNMP, it returned an integer value and the MIB was required to interpret it as a string.

The YANG model/gnmi output instead sends the string itself. That's a problem with Prometheus as it does not allow metric values to be strings. So for instance gnmi_bgp_neighbor_state_session_state cannot be stored in Prometheus with a value of 'IDLE'. What we can do is use the gnmic "event-strings" processor on the Prometheus output, and replace the various text strings with numbers instead. This is cumbersome in the config but worth it here so I've done it, using the numeric values from the SNMP MIB.
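
A sketch of the kind of mapping meant here, using the integer values from the SNMP bgpPeerState object (1=idle ... 6=established). Only two of the states are shown and the processor option names are approximate, so check them against the gnmic event-strings docs:

processors:
  bgp-state-to-int:
    event-strings:
      value-names:
        - ".*session-state$"
      transforms:
        # Prometheus can't store string values, so rewrite them to the
        # numbers the old SNMP MIB used.
        - replace:
            apply-on: "value"
            old: "IDLE"
            new: "1"
        - replace:
            apply-on: "value"
            old: "ESTABLISHED"
            new: "6"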

Sample Mode

I also tried to make the subscription "on-change" instead of "sample". This mode causes the router to only send values that have changed to gnmic, so for lots of things (like prefixes received from our own hosts, or session state up/down) this will save bandwidth/cpu as they change very rarely. However, this doesn't play nicely with the Prometheus output: it means only the stuff that has changed in the most recent 60 seconds is exported in a scrape. We can adjust the 'expiry' for the prometheus stats, so gnmic will keep exporting the last value it received, but I think the timestamp will always be the last one received, and then on the Prometheus side if that's more than 5 mins old it'll be discarded.

For now I've moved back to using 'sample', so routers export every metric every 60 seconds. I think this will be ok at the volume we are at, but in future we should maybe look into this more deeply and work out whether we can conserve resources by not exporting metrics that don't change.

https://github.com/karimra/gnmic/discussions/264
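
For reference, the difference is just the stream mode on the subscription; a rough sketch of the two variants (path shortened to a single entry for readability):

subscriptions:
  bgp-sample:
    paths:
      - /network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state
    mode: stream
    stream-mode: sample      # router re-sends every value each interval
    sample-interval: 60s

  bgp-on-change:
    paths:
      - /network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state
    mode: stream
    stream-mode: on-change   # router only sends values that changed; doesn't
                             # play nicely with the Prometheus output expiry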

BGP Peer Admin State

The export of the BGP session "admin state" is not working as I expected. With this dummy peer configured on cr2-magru as follows:

set protocols bgp group IX4 neighbor 187.16.216.187 shutdown

The "enabled" value returned to gnmic is still set to true:

elem:{name:"protocol"} elem:{name:"bgp"} elem:{name:"neighbors"} elem:{name:"neighbor" key:{key:"neighbor-address" value:"187.16.216.187"}}} update:{path:{elem:{name:"state"} elem:{name:"enabled"}} val:{bool_val:true}}}

In practice this may not cause us many problems (at least with Juniper). We rarely admin-down sessions, and when we do we tend to use the "deactivate" config option to do so. With that option the gnmi output is as if the peer didn't exist at all, and thus no stats are returned. So if for instance we want to alert on sessions that are not "ESTABLISHED" we won't see an issue as long as we are using deactivate. I've left the collection and processing for this metric in place, however, as it seems like a bug on the Juniper side which may be fixed, or perhaps it'll work better with other vendors (where we may not have a 'deactivate' option).

Icinga downtime and Alertmanager silence (ID=ba072b6c-6957-428b-a932-dfcf0b3f8103) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow7001.magru.wmnet

All of this does suggest we should probably look at running distributed collectors as we move to productionize this, potentially on Kubernetes. To be discussed; gnmic does have extensive support for running in a distributed way anyway.

The aux clusters are waiting for us :D and we do have one in codfw as well now.

For now I've moved back to using 'sample', so routers export every metric every 60 seconds. I think this will be ok at the volume we are at, but in future we should maybe look into this more deeply and work out whether we can conserve resources by not exporting metrics that don't change.

If the values stay the same, it doesn't actually take up much space in Prom, AIUI.

Icinga downtime and Alertmanager silence (ID=fe40d399-fce9-41c4-b12a-4bcb36770f4b) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow7001.magru.wmnet

The aux clusters are waiting for us :D and we do have one in codfw as well now.

Yep it's worth investigating. We run it at the POPs as well, but with fewer devices there we might be able to keep it on a VM, and use K8s at the core sites with an order of magnitude more devices.

If the values stay the same, it doesn't actually take up much space in Prom, AIUI.

Yeah I believe Prometheus is fairly efficient on that score. The potential benefits are probably to CPU cycles on the routers, and especially the gnmic processing elements. But right now it should not be an issue.

Thanks for the review btw, appreciate it :)

Change #1113449 merged by Cathal Mooney:

[operations/puppet@production] Add BGP data collection from network devices over GNMI

https://gerrit.wikimedia.org/r/1113449

Icinga downtime and Alertmanager silence (ID=a6b392ba-8b36-4fa0-8d3d-10c8b2d2eb48) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow1002.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=f0f61f83-b1f7-48c8-9e4a-2e436917a7d3) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow1002.eqiad.wmnet

So I rolled back the patch to collect the BGP metrics. The config puppet produced worked fine in magru and esams, but for some reason in eqiad stats stopped. I stopped the systemd service and ran gnmic in debug mode but couldn't quite see what the issue was. There are a few devices it can't connect to (but that would not have changed), yet it didn't seem to report any issue and was getting stats.

Prometheus scrapes were initially taking way over 60 seconds though. I manually edited the config and reduced the number of devices it was connecting to, and things started to work. So it _seems_ like just a resourcing issue, but I will need to do more checks. What we need to bear in mind here is that the Prometheus output stats are only created when a client scrapes them. So the raw stats are converted - including all our formatting stuff - at that point. I suspect with the additional stats that is simply taking too long now. Perhaps there are some options there, or we are better off going the K8s way with multiple instances each mangling a smaller set of metrics.

WMCS were doing some ceph rebalancing and we needed the stats back working, so I reverted the patch. I'll do some more testing to see if I can work out what's up.

Icinga downtime and Alertmanager silence (ID=7b39f587-684b-42ab-a96c-cf552c03a29d) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow7001.magru.wmnet

Fwiw I thought I saw a potential optimisation to allow us to go back to the "on change" style subscription. gNMIc has a parameter that can be configured for each subscription called "heartbeat_interval":

# duration, Golang duration format, e.g: 1s, 1m30s, 1h.
# The heartbeat interval value can be specified along with `ON_CHANGE` or `SAMPLE` 
# stream subscriptions modes and has the following meanings in each case:
# - `ON_CHANGE`: The value of the data item(s) MUST be re-sent once per heartbeat 
#                interval regardless of whether the value has changed or not.

So I thought we could set this to something like 5 minutes or more, to ensure we get stats at least that frequently, and still have timely data as any changes will get pushed before then. This would reduce overhead on the routers and processing for gnmic. Unfortunately it does not seem to be supported (at least by Juniper); in a test with it configured I observed this:

2025/01/23 21:28:08.954855 /home/runner/work/gnmic/gnmic/pkg/app/collector.go:123: [gnmic] target "cr1-magru.wikimedia.org:32767": subscription bgp rcv error: rpc error: code = Unimplemented desc = heartbeat_interval set for /network-instances/network-instance/protocols/protocol/bgp/global/state/total-paths, heartbeat_interval is not supported by the system
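
For the record, the subscription I tested was along these lines (sketch; I believe gnmic takes the option as heartbeat-interval in the subscription config, but check the docs):

subscriptions:
  bgp:
    paths:
      - /network-instances/network-instance/protocols/protocol/bgp/global/state/total-paths
    mode: stream
    stream-mode: on-change
    # Ask the router to re-send unchanged values at least this often;
    # JunOS rejected this with "heartbeat_interval is not supported by the system".
    heartbeat-interval: 5m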

Icinga downtime and Alertmanager silence (ID=3f0feb1a-6c73-4906-bb5a-2df62eb7e156) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow1002.eqiad.wmnet

The current configuration we have requires us to enable gnmic caching, as we group certain metrics together so that the same additional tags get added:

If processors are defined under the output config section, they are applied to the whole list of events at once. This allows for augmentation of messages with values from other messages even if they were received in separate updates or collected from a different target/subscription.

For instance we add the tags for BGP peer group to all metrics with the same neighbor IP address. At least as currently configured, I don't think that would be possible without caching on. But the way caching works in gnmic seems like it might not play nicely when multiple Prometheus servers are scraping at once:

When caching is enabled for a certain output, the received gNMI updates are not written directly to the output remote server (for e.g: InfluxDB server), but rather cached locally until the cache-flush-timer is reached (in the case of an influxdb output) or when the output receives a Prometheus scrape request

At the POPs, where we had tested the performance and it worked, we only have one prometheus server scraping stats, so this "cache until scraped" methodology ought to work. At eqiad/codfw, where we have multiple, it is unclear how the system will behave; it occurs to me that multiple requests arriving while another is being served could conflict.

Certainly things perform relatively well in magru in terms of overall time:

cmooney@netflow7001:~$ time curl localhost:9804/metrics > metrics
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  725k    0  725k    0     0  1887k      0 --:--:-- --:--:-- --:--:-- 1888k

real	0m0.404s
user	0m0.004s
sys	0m0.021s
cmooney@netflow7001:~$ grep -c bgp metrics
1286
cmooney@netflow7001:~$ grep -c interface metrics 
2367

Whereas with the same config in eqiad the curl never completes; after 3-4 minutes it's still waiting to get the first metric back.

We do have many more devices, interfaces, and active bgp sessions in eqiad. But I wondered if the multiple scrapes - potentially arriving while another was in progress - were playing a role. So with netflow1002 downtimed I blocked the prometheus connections with some sneaky iptables rules:

cmooney@netflow1002:~$ sudo ip6tables -L INPUT -v --line 
Chain INPUT (policy DROP 0 packets, 0 bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1        0     0 ACCEPT     all  --  any    any     localhost            anywhere            
2       82  6560 DROP       tcp  --  any    any     anywhere             anywhere             tcp dpt:9804

After which I ran the collection again. It still took an insanely long time, but this time it completed:

cmooney@netflow1002:~$ time curl localhost:9804/metrics > /tmp/metrics 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9774k    0 9774k    0     0   133k      0 --:--:--  0:01:13 --:--:-- 2076k

real	1m13.043s
user	0m0.034s
sys	0m0.116s
cmooney@netflow1002:~$ grep -c interface /tmp/metrics
23295
cmooney@netflow1002:~$ grep -c bgp /tmp/metrics 
18302

It's not clear to me what the best way to proceed is. On the face of it, given our need to add other tags (bgp neighbor address, group, asn etc), we can't disable caching. But perhaps we should have multiple instances of gnmic running on a system, each subscribing to different groups of metrics? We could possibly run the prometheus endpoint on different ports, and even have different scrape times configured in prometheus for each (it's not as critical that we have up-to-the-minute bgp stats).

Or maybe we could investigate using the Prometheus push gateway, and disable caching. But I'm not sure if the resulting metrics will have all the metadata we need for graphing/alerts. It would almost make you love SNMP and RRDs tbh :P

FWIW I used the config from P72314 in the most recent tests. I'd tried to use some of the advice from this issue to improve performance, and it did seem to help a bit, but our issues are more fundamental.

Icinga downtime and Alertmanager silence (ID=43ff15dd-e256-46b3-aea6-882240b9fe64) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow1002.eqiad.wmnet

After more testing it appears the event-value-tag processing is really what is killing us here. Without any of those in the config a scrape in eqiad takes 2-3 seconds, even with all the BGP metrics. Adding back the single one for interface descriptions pushes the scrape time to 10 seconds or so. The various grouping and renaming processors don't seem to have a significant impact at all.

I opened an issue on their github to ask if this was normal:

https://github.com/openconfig/gnmic/issues/588

The metadata event-value-tag adds - like interface descriptions and BGP group names - is really useful to us. For instance we can graph all interfaces that are 'Transit', or all our internal 'Anycast' BGP peers. Without it, graphing these things would be extremely cumbersome; we'd probably need to manually maintain groups of interface names or peer IPs to do it.

Icinga downtime and Alertmanager silence (ID=892c37cf-859a-4da6-8f59-c75b5d153219) set by cmooney@cumin1002 for 3:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow2003.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=7b04d5bf-ab80-4626-96ba-3c376dfc52c2) set by cmooney@cumin1002 for 3:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow3003.esams.wmnet

Change #1114770 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] gnmic: use event-value-tag-v2 to improve performance

https://gerrit.wikimedia.org/r/1114770

I'm very happy to say Karim Radhouani, one of the gnmic devs, has been extremely helpful in response to the github issue I posted.

He's made several patches to improve the performance of event-value-tag, and added a new processor event-value-tag-v2, that can be used with caching disabled for the Prometheus output. These have been included in version 0.40 that was released today, which I have rolled out to our netflow VMs now.

In tests the new processor works much better for our use-case, reducing the Prometheus scrape time considerably (from over 20 seconds in codfw to 0.4 seconds!). CPU usage was also noticeably better on systems using the new processor in testing. The above patch adjusts our config to use it, which should improve things.

One side-effect of disabling caching is we lose the "target" tag, but we can just use "source" instead. I'll update our dashboards to reflect this.
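
A rough sketch of the shape of the change, with processor option names approximate (the real config is in the patch above):

processors:
  # With gnmic >= 0.40: copy the interface description value onto related
  # events as a tag, without needing the Prometheus output cache.
  add-iface-descr:
    event-value-tag-v2:
      value-name: "description"
      tag-name: "interface_description"

outputs:
  prom:
    type: prometheus
    listen: :9804
    # output-level cache removed; gnmic then sets "source" rather than "target"
    event-processors:
      - add-iface-descr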

Icinga downtime and Alertmanager silence (ID=e5ab529a-1fb4-461d-b85a-a2d5a66a020a) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow3003.esams.wmnet

Change #1114770 merged by Cathal Mooney:

[operations/puppet@production] gnmic: use event-value-tag-v2 to improve performance

https://gerrit.wikimedia.org/r/1114770

Change #1114967 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Prometheus: change gnmi label rewrite from 'target' to 'source'

https://gerrit.wikimedia.org/r/1114967

Change #1114967 merged by Cathal Mooney:

[operations/puppet@production] Prometheus: change gnmi label rewrite from 'target' to 'source'

https://gerrit.wikimedia.org/r/1114967

Icinga downtime and Alertmanager silence (ID=36d26c8a-4d30-4345-8682-54b6b4882e38) set by cmooney@cumin1002 for 3:00:00 on 1 host(s) and their services with reason: disabling alerts as I'm running gnmic manually rather than with systemd

netflow2003.codfw.wmnet

Change #1115002 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] gNMIc: Add BGP stats collection for network devices

https://gerrit.wikimedia.org/r/1115002

The move to event-value-tag-v2 has been pushed out to all our Netflow VMs and we've seen a nice reduction in CPU usage, plus a massive drop in Prometheus scrape time 😃

One side-effect of disabling caching is we lose the "target" tag, but we can just use "source" instead. I'll update our dashboards to reflect this.

We were actually re-writing "target" to "instance" with a Prometheus label re-write, so I've adjusted that to re-write 'source' to 'instance' instead. This means the metrics will be exactly the same as before and there is no need to change anything.

The latest patch will add the BGP metrics, again using event-value-tag-v2. It does increase CPU, but it's well within norms and spread across cores. It's not particularly urgent given we fixed the LibreNMS BGP polling issue that made me look at this in the first place, but still good to have it in there.
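
For reference, the Prometheus-side change is just the label rewrite, roughly like this (sketch of a standard metric relabel rule, not the exact puppet-rendered job):

scrape_configs:
  - job_name: gnmi
    metric_relabel_configs:
      # Previously we rewrote the gnmic "target" label to "instance"; with
      # caching disabled gnmic sets "source" instead, so rewrite that.
      - source_labels: [source]
        target_label: instance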

Change #1115002 merged by Cathal Mooney:

[operations/puppet@production] gNMIc: Add BGP stats collection for network devices

https://gerrit.wikimedia.org/r/1115002

Looks like the "last_established" stat is being exported with different granularity by different JunOS platforms:

gnmi_bgp_neighbor_last_established{address="10.192.11.8", instance="lsw1-b2-codfw", job="gnmi", network_instance_name="PRODUCTION", peer_as="64600", peer_descr="lvs2012", peer_group="PyBal", peer_type="EXTERNAL", prometheus="ops", site="codfw"}   1738671870
gnmi_bgp_neighbor_last_established{address="10.64.0.80", instance="cr1-eqiad", job="gnmi", network_instance_name="DEFAULT", peer_as="64600", peer_descr="lvs1017", peer_group="PyBal", peer_type="EXTERNAL", prometheus="ops", protocol_identifier="BGP", protocol_name="DEFAULT", site="eqiad"} 1738672085000000000

I built some dashboards in Grafana based on the latter, and had to divide the number by 1,000,000 (nanoseconds to milliseconds) to fit the Grafana "datetime" format. But when I look at a stat from a switch (like the first one) it says it's been up for 55 years :)

My robot friend suggested this which works to adjust the result of the promql to the right units in both cases:

gnmi_bgp_neighbor_last_established{instance="$device", peer_group="$bgp_group", peer_descr="$bgp_neighbor"} 
* 10^(12 - on () group_left() floor(log10(gnmi_bgp_neighbor_last_established)))
ayounsi claimed this task.

Closing this never-ending tracking task to focus on more specific sub-tasks now that all the groundwork is done.