Remove librenms -> graphite integration, replace with gnmi
Closed, Resolved (Public)

Assigned To
Authored By
fgiunchedi
Aug 14 2024, 8:14 AM

Description

As part of the parent task (sunset graphite) we are removing all graphite protocol producers, including librenms. This task tracks the deprecation of librenms -> graphite metrics.

I did a quick audit of dashboards using librenms metrics and the following came up:

My understanding is that once T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ is done (i.e. we have all switches upgraded) we can fully use gnmi to collect all interesting switch metrics in Prometheus (cf. T369384 too). Therefore it'll be possible to port the dashboards above to use Prometheus and stop using graphite.

Action plan:

  • Confirm switch/router metrics we're after are indeed in Prometheus
  • Port (or delete, as appropriate) the dashboards above to use Prometheus/Thanos
  • Remove librenms -> graphite integration via librenms config

I'm adding WMCS for awareness, heads up, feedback, etc. In terms of timeline, T316544 is definitely blocking this for now, so it won't happen in the short term.

Event Timeline

I took another look at the dashboards and it looks to me like we now have all interesting switch port metrics in Prometheus via gnmi for cloudsw devices. Specifically, port in/out bytes and discards seem to be the most (or only) graphed metrics. Does that track @cmooney @dcaro? If so I'll start a test conversion of https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance to swap out graphite for prometheus for said metrics.

@fgiunchedi thanks for bringing this one up.

In Netops we make little use of the LibreNMS stats exported to Graphite, so it's fine if they go.

WMCS do use them, however, so we need to check on that side. I think everything that we need should be there, and if not we can add those paths to the gnmic collection. My only worry is that right now the gnmic stats have some problems, namely that we observe gaps in the graphs like this from time to time:

image.png (573×963 px, 59 KB)

I'm not 100% sure what the issue is here. I'm fairly certain it is not related to the way those graphs are set up, or to anything like rollovers in counter max values. But I've not had time to dig into the issue fully. When we first rolled out the gnmic stats we had similar gaps, but much bigger and more frequent. Increasing scraper timeouts and worker threads solved it for the most part, but we still see it sometimes. That makes me suspect the issue is still some sort of occasional performance bottleneck. The netflow VMs don't seem to be overly taxed, however (CPU maxes out on them at certain points while scraping, but it's not constant, so there should be cycles for it to do whatever it needs).

We also see another type of discrepancy, which I'm not so sure about (perhaps it is related to counter rollovers?). Here we do appear to have measurements but the counter goes to zero, even though it's pretty much impossible that was actually the case:

{F58025300 width=600}

I've been meaning to look at the raw counter values in Prometheus/Thanos to see if that reveals anything on those.

Anyway probably should be dealt with on the overall task for gnmi stats, but I want to mention it here to make sure @dcaro is aware and can factor the current state into his thinking. Personally I use the gnmi stats daily and love them, but in my mind they are still somewhat "beta" for this reason. At the end of the day we have LibreNMS/SNMP graphs to fall back on if there is confusion, so maybe we can move forward with this even before we fully get it sorted out?

Also btw if you want to look at those I created a folder in Grafana for Netops while you were on leave, hope that wasn't being cheeky!

Thank you for the extensive explanation @cmooney ! Yes definitely let's go over the issues you outlined in the gnmi task and I'm happy to assist! Also thank you for the dashboards, I'll take a look and no cheekiness has been detected

Hi! Late to the task :), just coming back from a long PTO. I had started playing with https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance, though I had not yet figured out the exact metric names that replace the counters we are using from grafana (and of course, my session expired a long time ago xd, and I did not hit save anywhere).

Having some small gaps currently is not an issue, as the graphs are more "hints" than something we rely on (we don't alert or react on it, yet), so I'm ok with using "still beta" metrics for the time being.

Thank you for the feedback @dcaro, appreciate it! We resolved the gaps issue by extending the rate() window in T382396.
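
For context on why a wider window can close gaps like these: PromQL's rate() needs at least two samples inside the range to return a value, so a window close to the scrape interval returns nothing whenever a scrape is late or missed. Below is a minimal Python sketch of that behaviour. The sample data and the 60s scrape interval are made up for illustration, and `simple_rate` is a simplified stand-in that ignores the extrapolation and counter-reset handling the real rate() does.

```python
from typing import Optional

Sample = tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: list[Sample], now: float, window: float) -> Optional[float]:
    """Per-second increase over (now - window, now], or None with < 2 samples."""
    in_window = [(t, v) for t, v in samples if now - window < t <= now]
    if len(in_window) < 2:
        return None  # nothing to compute: this shows up as a gap in the graph
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped every 60s; the scrape at t=120 was missed.
samples = [(0.0, 0.0), (60.0, 6000.0), (180.0, 18000.0), (240.0, 24000.0)]

short = simple_rate(samples, now=180.0, window=90.0)   # only one sample in range
wide = simple_rate(samples, now=180.0, window=300.0)   # three samples in range
print(short, wide)
```

With the 90s window the evaluation at t=180 sees a single sample and yields no value, while the 300s window still produces a rate across the missed scrape.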

re: equivalent metrics in prometheus, if I'm reading this panel correctly for example (https://grafana-rw.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance?forceLogin=true&orgId=1&viewPanel=107) we're after interface in/out bits, with the need to match on port description in the gnmi case.

In other words for the panel above the following expressions would work I think?

rate(gnmi_interfaces_interface_state_counters_in_octets{instance=~"cloudsw.*",interface_description=~"(Core|Transport): cloudsw.*"}[5m])

and

rate(gnmi_interfaces_interface_state_counters_out_octets{instance=~"cloudsw.*",interface_description=~"(Core|Transport): cloudsw.*"}[5m])

On thanos: https://w.wiki/CePo

Just an example of course, let me know what you think!
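
One detail worth spelling out: rate() over the octets counters gives octets (bytes) per second, so a panel in bits needs a factor of 8. Here is a small Python sketch that builds both expressions with that conversion. The helper name `bits_per_second_query` is made up for illustration; the metric names and label matchers are the ones from the expressions above.

```python
def bits_per_second_query(direction: str, window: str = "5m") -> str:
    """Build a PromQL expression for interface traffic in bits per second."""
    if direction not in ("in", "out"):
        raise ValueError("direction must be 'in' or 'out'")
    metric = f"gnmi_interfaces_interface_state_counters_{direction}_octets"
    selector = '{instance=~"cloudsw.*",interface_description=~"(Core|Transport): cloudsw.*"}'
    # rate() yields octets per second; multiply by 8 to get bits per second.
    return f"rate({metric}{selector}[{window}]) * 8"


for direction in ("in", "out"):
    print(bits_per_second_query(direction))
```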

Thanks for the work on this guys. @dcaro my concerns above were misplaced, the gaps were due to the queries I'd used in Grafana, all the data was in prometheus.

If there is any particular metric/info you're looking for let me know I can potentially help, and indeed add new metrics if they are missing. You might find looking at some of the panels in this folder a help (throughput, queue and totals in particular):

https://grafana.wikimedia.org/dashboards/f/a080404a-368d-4619-8124-ee04d175d5a8/sre-netops

@fgiunchedi just created a version of that graph with the new stats (gnmi), but I'm seeing some discrepancies.

  • The traffic in the gnmi stats does not seem to differentiate between in/out
  • If the traffic in the new stats is both in+out, then the values are quite off (as the traffic in the previous ones was either in or out, with ~1-2Gb/s each)

Note that it might be graphite stats/old graph that's wrong xd

Using graphite:

image.png (887×2 px, 202 KB)

Using gnmi:

image.png (887×2 px, 163 KB)

I'm the worst xd, _in_octets means inbound traffic, measured in octets xd, not just "in octets"

haha - easy mistake to make!

FWIW you need to bear in mind that the Graphite values are the 5-min average rate calculated by LibreNMS from the raw device counters.

The Prometheus/gnmi stats are the raw counter values themselves, at 1 minute rather than 5 minute granularity. That means they are more accurate but also the evaluation of the rate at a given moment needs to be done in the Prometheus query. So the exact resulting figure may be different than what LibreNMS found due to differences in timestamp of the samples, their frequency and how the rate calculation is done.
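
To make the difference concrete, here is a small Python sketch comparing per-minute rates with a single 5-minute average computed from the same raw counter. The counter values are made up, and the arithmetic is a simplification of what LibreNMS and Prometheus actually do.

```python
# Hypothetical raw octet counter, sampled once a minute (values are made up).
counter = [0, 600, 1200, 7200, 7800, 8400]  # octets at t = 0, 60, ..., 300 s
step = 60  # seconds between samples

# Per-minute rates, as a 1-minute-granularity query could show them.
per_minute = [(b - a) / step for a, b in zip(counter, counter[1:])]

# Single 5-minute average, comparable to what LibreNMS sent to Graphite.
five_min_avg = (counter[-1] - counter[0]) / (step * (len(counter) - 1))

print(per_minute)    # the burst in the third minute is visible here
print(five_min_avg)  # the same burst is smoothed into the average
```

The short burst is clearly visible at 1-minute granularity but is flattened in the 5-minute average, which is one reason the two graphs will not match exactly.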

I had a stab at making that graph myself (and a few others) if you want to see how it compares. FWIW if you want to show all the 40G core links you can simply graph all interfaces with a name starting with "et-" across all cloudsw (server ports at 1G or 10G start with 'ge' or 'xe'):

https://grafana.wikimedia.org/goto/8WLzDHvNR
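
As a rough illustration of that naming rule, here is a Python sketch that picks the core links out of a list of interface names by prefix. The interface names other than xe-0/0/45 are made up; in a PromQL selector the equivalent would be a matcher along the lines of interface_name=~"et-.*".

```python
def is_core_link(interface_name: str) -> bool:
    """True for 'et-' interfaces, i.e. the 40G core links on the cloudsw."""
    return interface_name.startswith("et-")


# Example names; 'ge-' and 'xe-' are 1G and 10G server-facing ports.
names = ["et-0/0/52", "xe-0/0/45", "ge-0/0/1"]
core = [name for name in names if is_core_link(name)]
print(core)
```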

That's nice!
I'd like to retain the info about which rack each link goes from and to, as that's what we are more interested in than the bare interface (and it avoids a lookup on netbox). Maybe that info can be added to each interface's stats somehow as a label?

I was able to move another graph to gnmi too. The one that's left now is the one about drops, but it has a bunch of data queries :/
I might try editing the json directly for easy copy-paste, I'll let you know

You can use {{interface_description}} in the legend to add the interface description to what is shown. That will show the far-side in all cases. Unfortunately right now we have nothing shorter than that though, and picking out the device/interface from the full description is not trivial in gnmic. If you put the legend at the bottom rather than the side it's probably workable. I made that change now if you want to take another look.

Cool. For the drops graph, I think the interesting metrics are:

| Metric | Description |
| --- | --- |
| gnmi_interfaces_interface_state_counters_out_queue_tail_drop_pkts | Dropped packets due to full buffer for a specific queue |
| gnmi_interfaces_interface_state_counters_out_queue_tail_drop_bytes | Dropped bytes due to full buffer for a specific queue |
| gnmi_interfaces_interface_state_counters_out_queue_red_drop_pkts | Random early detection dropped packets per queue |
| gnmi_interfaces_interface_state_counters_out_queue_red_drop_bytes | Random early detection dropped bytes per queue |
| gnmi_interfaces_interface_state_counters_out_discards | Total dropped packets outbound on an interface |
| gnmi_interfaces_interface_state_counters_in_errors | Input errors on the link, usually due to hw issue with SFP module or cabling |
| gnmi_interfaces_interface_state_counters_out_errors | Output errors, quite rare but the number of packets it tried to send that failed locally. Usually a hw fault or similar but quite rare - usually with those problems link goes down and it doesn't attempt to send |

Given we have the QoS stuff added I'd probably look at the tail drops in each queue, and the input errors. The rest are not so important or common.

@cmooney Just noticed that all the drop and discards metrics there only show for cr* switches, do we have the equivalent for cloudsw? (or maybe I'm using the wrong data source? thanos/default?)

Huh I'd actually not noticed that. Collection is the same so presumably they are just not returned on the QFX platform.

The output queue counters show the same thing. You can add the RED and tail-drop number together to get the total if needed, or just look at tail drops.
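
A tiny Python sketch of adding the two per-queue counters together, in case it helps. The queue numbers and drop rates are made up, and in practice the sum would be done in the PromQL query rather than in Python.

```python
# Hypothetical per-queue drop rates (packets/s) for one interface.
tail_drops = {"0": 1.5, "3": 0.0, "4": 0.25}
red_drops = {"0": 0.5, "3": 0.0, "4": 0.0}

# Total drops per queue = tail drops + RED drops for that queue.
total_per_queue = {q: tail_drops[q] + red_drops.get(q, 0.0) for q in tail_drops}

# Total across all queues on the interface.
total = sum(total_per_queue.values())
print(total_per_queue, total)
```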

FWIW I re-built the current discards graph with the gnmi stats. There are some small diffs, mostly due to the new stats being sampled at a higher frequency. I set the rate interval to 5m to try to match librenms/graphite stats, but inevitably samples don't line up exactly.

image.png (749×962 px, 164 KB)

I think given we are mapping the ceph traffic to specific QoS queues it might make more sense to graph those specifically rather than using 'sum' to get the total across all queues. Likewise for RED vs tail-drop packets, which we now have separate counters for.

Did a round with @cmooney on the current dashboards to make sure we are not missing any other metrics (and we are not :) ). Now waiting for gnmi to stabilize (T369384: Productionize gnmic network telemetry pipeline) before replacing the graphs. I might do some new graphs in the meantime just to make sure they work and to get ahead on the work.

Change #1122543 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Rename text interface state values returned by GNMI to ints

https://gerrit.wikimedia.org/r/1122543

Change #1122543 merged by Cathal Mooney:

[operations/puppet@production] Rename text interface state values returned by GNMI to ints

https://gerrit.wikimedia.org/r/1122543

@dcaro the interface status should now be there for you.

In terms of monitoring: if gnmi_interfaces_interface_state_enabled=1 (interface is enabled in the config) and gnmi_interfaces_interface_state_oper_status!=1, there is a problem. For example, this is what a healthy interface looks like:

gnmi_interfaces_interface_state_enabled{interface_description="cloudvirt1039",interface_name="xe-0/0/45",instance="cloudsw1-d5-eqiad"} 1 1740493547517
gnmi_interfaces_interface_state_oper_status{interface_description="cloudvirt1039",interface_name="xe-0/0/45",instance="cloudsw1-d5-eqiad"} 1 1740493547517
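
The check described above can be sketched in Python. This is only an illustration of the logic, not how the alerting is actually implemented. The helper names (`parse`, `interfaces_with_problem`) are made up, the two sample lines are the ones quoted above, and the `2` used for a non-up oper_status is an assumed stand-in for any value other than 1.

```python
import re

LINE_RE = re.compile(r'^(?P<metric>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(line: str) -> tuple[str, tuple[str, str], float]:
    """Return (metric name, (instance, interface_name), value) for one sample line."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unparseable sample: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels")))
    key = (labels["instance"], labels["interface_name"])
    return m.group("metric"), key, float(m.group("value"))


def interfaces_with_problem(lines: list[str]) -> list[tuple[str, str]]:
    """Interfaces enabled in config (enabled == 1) whose oper_status is not 1."""
    enabled: dict[tuple[str, str], float] = {}
    oper: dict[tuple[str, str], float] = {}
    for line in lines:
        metric, key, value = parse(line)
        if metric.endswith("_state_enabled"):
            enabled[key] = value
        elif metric.endswith("_state_oper_status"):
            oper[key] = value
    # A missing oper_status sample is also treated as a problem here.
    return [k for k, v in enabled.items() if v == 1 and oper.get(k) != 1]


healthy = [
    'gnmi_interfaces_interface_state_enabled{interface_description="cloudvirt1039",interface_name="xe-0/0/45",instance="cloudsw1-d5-eqiad"} 1 1740493547517',
    'gnmi_interfaces_interface_state_oper_status{interface_description="cloudvirt1039",interface_name="xe-0/0/45",instance="cloudsw1-d5-eqiad"} 1 1740493547517',
]
# Same interface with an assumed non-up status value of 2.
down = [healthy[0], healthy[1].replace("} 1 ", "} 2 ")]

print(interfaces_with_problem(healthy))
print(interfaces_with_problem(down))
```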

Thanks! Will start playing with it.

FWIW the gnmic export has been stable now for a while and we're beginning to use it for alerts in netops.

Change #1136603 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] librenms: stop sending data to graphite

https://gerrit.wikimedia.org/r/1136603

Change #1136603 merged by Filippo Giunchedi:

[operations/puppet@production] librenms: stop sending data to graphite

https://gerrit.wikimedia.org/r/1136603

fgiunchedi claimed this task.

I'm boldly resolving this since AFAICT we're done, feel free to reopen otherwise