As part of the parent task (sunset graphite) we are removing all graphite protocol producers, including librenms. This task tracks the deprecation of librenms -> graphite metrics.
I did a quick audit of dashboards using librenms metrics and the following came up:
- https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance
- https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health
- https://grafana.wikimedia.org/d/DpbFWWCGk/wmcs-ceph-eqiad-capacity
- https://grafana.wikimedia.org/d/aell5G4nk/dcaro-dumps-playground
My understanding is that once T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ is done (i.e. all switches are upgraded) we can rely fully on gnmi to collect all interesting switch metrics in Prometheus (see also T369384). It will then be possible to port the dashboards above to Prometheus and stop using graphite.
Action plan:
- Confirm that the switch/router metrics we're after are indeed in Prometheus
- Port the dashboards above to Prometheus/Thanos (or delete them, as appropriate)
- Remove librenms -> graphite integration via librenms config
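For the first step, a rough sketch of how the check could be scripted against the standard Prometheus HTTP API (`/api/v1/query`). The endpoint URL and the metric names below are placeholders I made up for illustration, not the real gnmi metric names; they would need to be replaced with whatever the dashboards actually need.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: neither the endpoint nor the metric names come from this task.
PROMETHEUS_URL = "http://prometheus.example.org"
WANTED_METRICS = ["gnmi_interfaces_in_octets", "gnmi_interfaces_out_octets"]


def build_query_url(base_url, metric):
    """Return an instant-query URL that counts the series for `metric`."""
    params = urllib.parse.urlencode({"query": f"count({metric})"})
    return f"{base_url}/api/v1/query?{params}"


def has_series(response_body):
    """True if an instant-query response contains at least one sample."""
    payload = json.loads(response_body)
    return payload.get("status") == "success" and bool(payload["data"]["result"])


def missing_metrics(base_url, metrics, fetch=None):
    """Return the metrics for which Prometheus reports no series.

    `fetch` takes a URL and returns the response body as a string; it
    defaults to a plain HTTP GET and can be swapped out for testing.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode()
    return [m for m in metrics
            if not has_series(fetch(build_query_url(base_url, m)))]


if __name__ == "__main__":
    for name in missing_metrics(PROMETHEUS_URL, WANTED_METRICS):
        print(f"no series found for {name}")
```

An empty output would suggest every listed metric has at least one series; it says nothing about whether the labels or resolution match what the graphite-based panels show, so the dashboards still need a manual look.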
I'm adding WMCS for awareness, heads up, feedback, etc. In terms of timeline, T316544 is definitely blocking this for now, so it won't happen in the short term.