Remove librenms -> graphite integration, replace with gnmi
Open, Medium, Public

Assigned To

None

Authored By

	fgiunchedi
	Aug 14 2024, 8:14 AM

Description

As part of the parent task (sunset graphite) we are removing all graphite protocol producers, including librenms. This task tracks the deprecation of librenms -> graphite metrics.

I did a quick audit of dashboards using librenms metrics and the following came up:

My understanding is that once T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ is done (i.e. we have all switches upgraded) we can fully use gnmi to collect all interesting switch metrics in Prometheus (cf. T369384 too). It will then be possible to port the dashboards above to Prometheus and stop using graphite.

Action plan:

Confirm switch/router metrics we're after are indeed in Prometheus
Port (or delete, as appropriate) the dashboards above to use Prometheus/Thanos
Remove librenms -> graphite integration via librenms config
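
The first step of the plan (confirming the metrics exist in Prometheus) could be scripted roughly as below. This is a minimal sketch, not the team's tooling: it assumes a reachable Prometheus-compatible HTTP API at a base URL you supply, and the metric names in the usage note are hypothetical placeholders, since the task does not name the actual gnmi series.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def http_fetch(base_url):
    """Return a fetch(promql) callable that performs the HTTP request."""
    def fetch(promql):
        with urlopen(query_url(base_url, promql)) as resp:
            return json.load(resp)
    return fetch


def missing_metrics(wanted, fetch):
    """Return the names in `wanted` for which the query yields no series."""
    missing = []
    for name in wanted:
        # count() over a metric with no series returns an empty result vector.
        reply = fetch(f"count({name})")
        if reply.get("status") != "success" or not reply["data"]["result"]:
            missing.append(name)
    return missing
```

Usage would look like `missing_metrics(["hypothetical_in_octets", "hypothetical_out_discards"], http_fetch("http://prometheus.example"))`, where both the metric names and the URL are placeholders. Passing `fetch` in as a callable keeps the check testable without network access.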

I'm adding WMCS for awareness, heads up, feedback, etc. In terms of timeline, T316544 is definitely blocking this for now, so it won't happen in the short term.

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T228380 Tech debt: sunsetting of Graphite
		Open		None	T372457 Remove librenms -> graphite integration, replace with gnmi

Event Timeline

fgiunchedi created this task. Aug 14 2024, 8:14 AM

Restricted Application added a subscriber: Aklapper. Aug 14 2024, 8:14 AM

fgiunchedi edited projects, added SRE Observability; removed SRE Observability (FY2024/2025-Q3), Observability-Metrics. Aug 14 2024, 8:14 AM

fgiunchedi mentioned this in T229542: Export LibreNMS data to Prometheus. Aug 14 2024, 8:17 AM

fgiunchedi mentioned this in T228380: Tech debt: sunsetting of Graphite. Aug 14 2024, 8:40 AM

lmata edited projects, added SRE Observability (FY2024/2025-Q2); removed SRE Observability. Aug 14 2024, 2:16 PM

Andrew added a subscriber: dcaro. Aug 21 2024, 2:05 PM

joanna_borun triaged this task as Medium priority. Sep 18 2024, 2:18 PM

taavi added a project: Cloud-VPS. Sep 28 2024, 12:40 PM

taavi moved this task from Unsorted to Network on the Cloud-VPS board. Nov 1 2024, 7:04 PM

lmata moved this task from Inbox to Up Next on the SRE Observability (FY2024/2025-Q2) board. Nov 5 2024, 5:07 PM

I took another look at the dashboards and it looks to me like we now have all interesting switch port metrics in Prometheus via gnmi for cloudsw devices. Specifically, port in/out bytes and discards seem to be the most (if not the only) graphed metrics. Does that track, @cmooney @dcaro? If so, I'll start a test conversion of https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance to swap out graphite for prometheus for said metrics.

@fgiunchedi thanks for bringing this one up.

In Netops we make little use of the LibreNMS stats exported to Graphite, so it's fine if they go.

WMCS do use them, however, so we need to check on that side. I think everything that we need should be there, and if not we can add those paths to the gnmic collection. My only worry is that right now the gnmic stats have some problems, namely that we observe gaps in the graphs like this from time to time:

I'm not 100% sure what the issue is here. I'm fairly certain it is not related to the way those graphs are set up, or anything like rollovers in counter max values etc. But I've not had time to dig into the issue fully. When we first rolled out the gnmic stats we had similar gaps, but much bigger and more frequent. Increasing scraper timeouts and worker threads solved it for the most part, but we still see it sometimes. That makes me suspect the issue is still some sort of occasional performance bottleneck. The netflow VMs don't seem to be overly taxed, however (CPU hits max on them at certain points while scraping, but it's not constant, so there should be cycles for it to do whatever it needs).
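
One way to quantify the gaps described above, rather than eyeballing graphs, is to scan the raw sample timestamps for intervals much longer than the scrape interval. The following is only a sketch: the 60-second interval and the 1.5x tolerance are assumptions for illustration, not values taken from the actual gnmic or Prometheus configuration.

```python
def find_gaps(timestamps, interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are further
    apart than `interval * tolerance`.

    `timestamps` must be sorted ascending, in the same unit as `interval`.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > interval * tolerance:
            gaps.append((prev, cur))
    return gaps


# Samples every 60 s with two scrapes missing between 120 and 300.
print(find_gaps([0, 60, 120, 300, 360], interval=60))
```

Counting the gaps per device or per scrape target over a day might help tell whether they line up with the CPU peaks mentioned above.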

We also see another type of discrepancy, which I'm not so sure about (perhaps it is related to counter rollovers?). Here we do appear to have measurements but the counter goes to zero, even though it's pretty much impossible that was actually the case:

{F58025300 width=600}

I've been meaning to look at the raw counter values in Prometheus/Thanos to see if that reveals anything on those.
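
For that kind of inspection, a small helper that flags every step where a raw counter fails to increase can separate two cases: a counter that stalls (repeated value, which graphs as a zero rate) and a counter that decreases (which Prometheus `rate()` treats as a counter reset). This is a hedged sketch over made-up sample data; which pattern, if either, shows up in the real series is not known from this thread.

```python
def suspicious_steps(samples):
    """Flag non-increasing steps in a raw counter series.

    `samples` is a list of (timestamp, counter_value) sorted by timestamp.
    Returns (timestamp, kind) tuples, where kind is "flat" when the value
    repeats and "decrease" when it drops (reset or rollover candidate).
    """
    findings = []
    for (_, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 < v0:
            findings.append((t1, "decrease"))
        elif v1 == v0:
            findings.append((t1, "flat"))
    return findings


# Illustrative data only: one stalled sample, then a drop.
raw = [(0, 100), (60, 200), (120, 200), (180, 50), (240, 150)]
print(suspicious_steps(raw))
```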

Anyway, this should probably be dealt with on the overall task for gnmi stats, but I want to mention it here to make sure @dcaro is aware and can factor the current state into his thinking. Personally I use the gnmi stats daily and love them, but in my mind they are still somewhat "beta" for this reason. At the end of the day we have LibreNMS/SNMP graphs to fall back on if there is confusion, so maybe we can move forward with this even before we fully get it sorted out?

Also btw if you want to look at those I created a folder in Grafana for Netops while you were on leave, hope that wasn't being cheeky!

Thank you for the extensive explanation @cmooney! Yes, definitely let's go over the issues you outlined in the gnmi task, and I'm happy to assist! Also thank you for the dashboards, I'll take a look, and no cheekiness has been detected.

cmooney mentioned this in T369384: Productionize gnmic network telemetry pipeline. Tue, Dec 17, 4:13 PM

fgiunchedi mentioned this in T382396: Investigate gnmic metric gaps and counters going to zero. Wed, Dec 18, 9:12 AM

	Restricted File
	Tue, Dec 17, 11:36 AM

	F58025286: image.png
	Tue, Dec 17, 11:31 AM
