Network Telemetry
Compatible devices at most of our POPs now stream basic interface metrics over gNMI. Collection is done by gnmic running on our Netflow VMs, which exposes a Prometheus endpoint to export the data. The work was mostly done under T326322, and there is a high-level overview on Wikitech:
https://wikitech.wikimedia.org/wiki/Network_telemetry
The tool and workflow are working reasonably well, and give us more granular network usage metrics, plus per-queue stats that we do not have in LibreNMS. We can therefore consider it a successful proof of concept, which we want to build on so that this approach becomes a fully integrated part of our network monitoring and alerting.
Next Steps
Over the next while we should work out how to add collection of the following:
- LAG / AE interface stats
- Sub-interface stats
- Firewall filter stats
- BGP groups / neighbor states
- OSPF interface states
- RPKI sessions
- More tbc
For each we need to identify the correct telemetry paths to use. Some of them are available from multiple paths, so it is worth assessing the differences between them and the pros and cons of each. We also need to consider how we configure gnmic processors to manipulate and group the data exposed as Prometheus metrics.
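To make the path-to-metric question concrete, here is a minimal Python sketch of the kind of transformation involved: a gNMI-style path is flattened into a metric name, and its list keys become labels. This is an illustration only, not gnmic's implementation. The function names (`split_path`, `path_to_metric`) and the naming scheme are assumptions made for the example, and the names gnmic actually emits depend on its output settings and processors.

```python
import re

# One path element: a name optionally followed by [key=value] selectors.
_ELEM_RE = re.compile(r"^(?P<name>[^\[\]]+)(?P<keys>(\[[^\[\]=]+=[^\[\]]*\])*)$")
_KEY_RE = re.compile(r"\[([^\[\]=]+)=([^\[\]]*)\]")


def split_path(path):
    """Split on '/' outside brackets, so keys like et-0/0/1 stay intact."""
    elems, current, depth = [], [], 0
    for ch in path:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "/" and depth == 0:
            if current:
                elems.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        elems.append("".join(current))
    return elems


def path_to_metric(path):
    """Return (metric_name, labels) for a gNMI-style path (hypothetical scheme)."""
    names, labels = [], {}
    for elem in split_path(path):
        match = _ELEM_RE.match(elem)
        if match is None:
            raise ValueError(f"unsupported path element: {elem!r}")
        name = match.group("name")
        names.append(name)
        # Each list key becomes a label named <element>_<key>.
        for key, value in _KEY_RE.findall(match.group("keys")):
            labels[f"{name}_{key}".replace("-", "_")] = value
    return "_".join(names).replace("-", "_"), labels


if __name__ == "__main__":
    print(path_to_metric(
        "/interfaces/interface[name=et-0/0/1]/state/counters/in-octets"
    ))
```

Where the same data is available from more than one path, each path would produce a different metric name and label set under a scheme like this, which is one reason to compare the candidate paths before settling on processors.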
In terms of the overall setup we should probably also assess:
- Where to run gnmic from long term (i.e. Netflow VMs or dedicated hosts)
- What tuning might be a good idea for our workflow, for example:
  - Timeouts
  - Caching
  - Subscription intervals
- Improvements to our current TLS/PKI for network devices
  - More recent JunOS might help here
Finally, once things are on a stable footing, we can consider adding Alertmanager alerts based on the Prometheus metrics, either to fill gaps not covered by LibreNMS or to replace existing LibreNMS alerts. Creating this task to track progress.