
Remove librenms -> graphite integration, replace with gnmi
Open, Medium, Public

Description

As part of the parent task (sunset graphite) we are removing all graphite protocol producers, including librenms. This task tracks the deprecation of the librenms -> graphite metrics.

I did a quick audit of dashboards using librenms metrics and the following came up:

My understanding is that once T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ is done (i.e. all switches are upgraded) we can fully use gnmi to collect all interesting switch metrics in Prometheus (cf. T369384 too). It'll then be possible to port the dashboards above to Prometheus and stop using graphite.
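For the record, a repeatable way to redo that dashboard audit before we actually pull the integration could look roughly like this (a minimal sketch only: the "librenms." series prefix and the GRAFANA_TOKEN variable are assumptions on my part):

```
import json
import os

import requests

GRAFANA = "https://grafana.wikimedia.org"
# GRAFANA_TOKEN env var is an assumption; any viewer-level API token would do.
HEADERS = {"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"}

# Enumerate dashboards, then grep each dashboard's JSON model for the
# (assumed) "librenms." graphite series prefix.
search = requests.get(f"{GRAFANA}/api/search",
                      params={"type": "dash-db", "limit": 5000},
                      headers=HEADERS, timeout=30)
search.raise_for_status()
for hit in search.json():
    dash = requests.get(f"{GRAFANA}/api/dashboards/uid/{hit['uid']}",
                        headers=HEADERS, timeout=30)
    dash.raise_for_status()
    model = json.dumps(dash.json()["dashboard"])
    if "librenms." in model:
        print(hit["title"], GRAFANA + hit["url"])
```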

Action plan:

  • Confirm the switch/router metrics we're after are indeed in Prometheus (see the sketch after this list)
  • Port (or delete, as appropriate) the dashboards above to use Prometheus/Thanos
  • Remove librenms -> graphite integration via librenms config
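For the first item, something like the following could be used to grep the metric names available on the query endpoint (a sketch only; the Thanos URL and the substrings are assumptions, to be adjusted to whatever gnmic actually exports):

```
import requests

THANOS = "https://thanos.wikimedia.org"  # assumed query endpoint
WANTED = ("octets", "discard")           # assumed substrings for the metrics we graph

# List every metric name Thanos knows about and keep the ones that look like
# gnmi-derived interface counters.
resp = requests.get(f"{THANOS}/api/v1/label/__name__/values", timeout=30)
resp.raise_for_status()
for name in sorted(resp.json()["data"]):
    if any(w in name for w in WANTED):
        print(name)
```

If the names check out, a follow-up call to /api/v1/series with a matcher on the cloudsw hostnames can confirm those devices are actually covered.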

I'm adding WMCS for awareness, heads up, feedback, etc. In terms of timeline, T316544 is definitely blocking this for now, so it won't happen in the short term.

Event Timeline

I took another look at the dashboards and it looks to me like we now have all interesting switch port metrics in Prometheus via gnmi for cloudsw devices. Specifically, port in/out bytes and discards seem to be the most (if not the only) graphed metrics. Does that track, @cmooney @dcaro? If so I'll start a test conversion of https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance to swap out graphite for Prometheus for those metrics.
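As a sanity check for the conversion I'm thinking of pulling the same port from both backends and comparing the averages, roughly along these lines (both series names below are hypothetical placeholders, not the real graphite paths or Prometheus metrics):

```
import time

import requests

GRAPHITE = "https://graphite.wikimedia.org"
THANOS = "https://thanos.wikimedia.org"  # assumed query endpoint

# Hypothetical LibreNMS graphite path for one cloudsw port, rated to octets/s.
G_TARGET = ("scaleToSeconds(nonNegativeDerivative("
            "librenms.cloudsw1-c8-eqiad.port.xe-0_0_0.ifInOctets),1)")
# Hypothetical gnmi-derived counter in Prometheus, rated over 5m.
P_QUERY = ('rate(gnmi_interface_in_octets{device="cloudsw1-c8-eqiad",'
           'interface="xe-0/0/0"}[5m])')

end = int(time.time())
start = end - 3600

g = requests.get(f"{GRAPHITE}/render",
                 params={"target": G_TARGET, "from": start, "until": end,
                         "format": "json"}, timeout=30).json()
p = requests.get(f"{THANOS}/api/v1/query_range",
                 params={"query": P_QUERY, "start": start, "end": end,
                         "step": 60}, timeout=30).json()

g_vals = [v for v, _ in g[0]["datapoints"] if v is not None] if g else []
p_res = p["data"]["result"]
p_vals = [float(v) for _, v in p_res[0]["values"]] if p_res else []
for label, vals in (("graphite", g_vals), ("prometheus", p_vals)):
    avg = sum(vals) / len(vals) if vals else float("nan")
    print(f"{label}: {len(vals)} samples, avg {avg:.1f} octets/s")
```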

@fgiunchedi thanks for bringing this one up.

In Netops we make little use of the LibreNMS stats exported to Graphite, so it's fine if they go.

WMCS do use them, however, so we need to check on that side. I think everything that we need should be there, and if not we can add those paths to the gnmic collection. My only worry is that right now the gnmic stats have some problems, namely that we observe gaps in the graphs like this from time to time:

image.png (573×963 px, 59 KB)

I'm not 100% sure what the issue is here. I'm fairly certain it is not related to the way those graphs are set up, or to anything like rollovers in counter max values. But I've not had time to dig into the issue fully. When we first rolled out the gnmic stats we had similar gaps, but much bigger and more frequent. Increasing scraper timeouts and worker threads solved it for the most part, but we still see it sometimes. That makes me suspect the issue is still some sort of occasional performance bottleneck. The netflow VMs don't seem to be overly taxed, however (CPU on them hits max during scraping at certain points, but it's not constant, so there should be cycles for gnmic to do whatever it needs).
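To put some numbers on how often those gaps actually happen, something like this could work against the raw samples (metric name, labels and the 60s scrape interval are assumptions on my side):

```
import requests

THANOS = "https://thanos.wikimedia.org"  # assumed query endpoint
# Hypothetical metric/labels; selecting [24h] returns raw samples, not a rate.
QUERY = 'gnmi_interface_in_octets{device="cloudsw1-c8-eqiad"}[24h]'
SCRAPE_INTERVAL = 60  # seconds, assumed

resp = requests.get(f"{THANOS}/api/v1/query", params={"query": QUERY}, timeout=60)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    ts = [t for t, _ in series["values"]]
    # Any spacing well beyond the scrape interval counts as a gap.
    gaps = [t2 - t1 for t1, t2 in zip(ts, ts[1:]) if t2 - t1 > 2 * SCRAPE_INTERVAL]
    if gaps:
        print(series["metric"].get("interface", "?"),
              f"{len(gaps)} gaps, longest {max(gaps):.0f}s")
```

Reading the raw samples via an instant query of a range vector avoids the lookback that query_range applies, which would otherwise paper over short gaps.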

We also see another type of discrepancy, which I'm not so sure about (perhaps it is related to counter rollovers?). Here we do appear to have measurements, but the counter goes to zero, even though it's pretty much impossible that this was actually the case:

{F58025300 width=600}

I've been meaning to look at the raw counter values in Prometheus/Thanos to see if that reveals anything about those.
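Something along these lines is what I have in mind for that check, flagging samples where the counter goes backwards or reads zero (again, the metric name and labels are hypothetical placeholders):

```
from datetime import datetime

import requests

THANOS = "https://thanos.wikimedia.org"  # assumed query endpoint
# Hypothetical metric/labels for one port; [6h] of raw, unrated samples.
QUERY = ('gnmi_interface_in_octets{device="cloudsw1-c8-eqiad",'
         'interface="xe-0/0/0"}[6h]')

resp = requests.get(f"{THANOS}/api/v1/query", params={"query": QUERY}, timeout=60)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    samples = [(t, float(v)) for t, v in series["values"]]
    for (t1, v1), (t2, v2) in zip(samples, samples[1:]):
        # A counter going backwards or reading zero is a candidate reset or bogus sample.
        if v2 < v1 or v2 == 0.0:
            print(f"{datetime.fromtimestamp(t2).isoformat()}: {v1:.0f} -> {v2:.0f}")
```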

Anyway, this should probably be dealt with on the overall task for the gnmi stats, but I want to mention it here to make sure @dcaro is aware and can factor the current state into his thinking. Personally I use the gnmi stats daily and love them, but in my mind they are still somewhat "beta" for this reason. At the end of the day we have the LibreNMS/SNMP graphs to fall back on if there is confusion, so maybe we can move forward with this even before we get it fully sorted out?

Also, btw, if you want to look at those: I created a folder in Grafana for Netops while you were on leave, hope that wasn't being cheeky!

Thank you for the extensive explanation @cmooney! Yes, definitely, let's go over the issues you outlined in the gnmi task; I'm happy to assist! Also thank you for the dashboards, I'll take a look, and no cheekiness has been detected.