
Evaluate LibreNMS' Graphite backend
Closed, Resolved, Public

Description

LibreNMS seems to have acquired a Graphite backend/extension. It doesn't have any frontend code for it (it recommends Grafana instead), but it works in parallel with the existing RRD capabilities, essentially mirroring the metrics.

Having all of the LibreNMS data in Graphite could allow us to conveniently present and aggregate data that we otherwise can't with LibreNMS, such as the power usage per datacenter (something we used to use Torrus for). The setup looks fairly easy, but it may be a little heavy on the Graphite/Whisper side of things, considering the number of ports LibreNMS monitors. We should test and evaluate that; it shouldn't take long!

Details

Related Gerrit Patches:
[operations/puppet@production] librenms: explicit graphite port
[operations/puppet@production] librenms: enable graphite extension

Event Timeline

faidon created this task. Jul 20 2017, 1:40 PM
Restricted Application added a subscriber: Aklapper. Jul 20 2017, 1:40 PM
faidon moved this task from Inbox to Up next on the observability board. Jul 20 2017, 1:41 PM
faidon moved this task from Backlog to Monitoring on the netops board.

Change 366836 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] librenms: enable graphite extension

https://gerrit.wikimedia.org/r/366836

Space-wise, librenms has ~15k rrds now. Assuming it'll create the same number of files on the graphite side, at ~350k per whisper file that's ~5GB used on graphite.

ema added a subscriber: ema. Jul 21 2017, 12:02 PM
faidon moved this task from Up next to In progress on the observability board. Jul 24 2017, 3:09 PM
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Jul 25 2017, 8:50 AM

Change 366836 merged by Filippo Giunchedi:
[operations/puppet@production] librenms: enable graphite extension

https://gerrit.wikimedia.org/r/366836

Change 367875 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] librenms: explicit graphite port

https://gerrit.wikimedia.org/r/367875

Change 367875 merged by Filippo Giunchedi:
[operations/puppet@production] librenms: explicit graphite port

https://gerrit.wikimedia.org/r/367875

fgiunchedi closed this task as Resolved. Jul 26 2017, 10:35 AM

This is resolved. Note that the port in https://gerrit.wikimedia.org/r/367875 is required since $config isn't merged in librenms and we overwrite it with a brand new array, thus blanking out the port.
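To illustrate why the explicit port was needed, here is a minimal sketch of the difference between overwriting a settings array and merging into it. It is written in Python rather than LibreNMS' own PHP, and the key names, the example host and the default values are assumptions made for illustration (2003 is the usual Carbon plaintext port), not a copy of the actual LibreNMS defaults.

```python
# Hypothetical defaults, loosely modelled on a graphite settings array.
# The key names and values are assumptions for this illustration.
DEFAULTS = {"enable": False, "host": "localhost", "port": 2003, "prefix": ""}


def overwrite(defaults, site):
    """Replace the whole mapping: keys absent from `site` are lost."""
    return dict(site)


def merge(defaults, site):
    """Layer the site settings over the defaults: unspecified keys survive."""
    return {**defaults, **site}


# Site settings that (like the first patch) do not mention the port.
site = {"enable": True, "host": "graphite.example.org", "prefix": "librenms"}

print(overwrite(DEFAULTS, site).get("port"))  # None: the port is blanked out
print(merge(DEFAULTS, site).get("port"))      # 2003: the default is kept
```

With overwrite semantics every key has to be spelled out, which is what the explicit port patch does.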

Turns out this is more data than I expected (it is only growing slowly by now):

$ du -hcs /var/lib/carbon/whisper/librenms/
44G	/var/lib/carbon/whisper/librenms/
$ find /var/lib/carbon/whisper/librenms/ -type f | wc -l
140129
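As a rough cross-check of the numbers above against the earlier estimate (~15k files at ~350k each), the arithmetic can be sketched as follows. It assumes that "350k" means 350 KiB and that the 44G reported by du is 44 GiB; du -h rounds, so the results are approximate.

```python
KIB = 1024
GIB = 1024 ** 3

# Earlier estimate: ~15k whisper files at ~350k each.
estimated_files = 15_000
estimated_bytes_per_file = 350 * KIB

# Observed on disk (du -h rounds, so this is approximate).
observed_files = 140_129
observed_total_bytes = 44 * GIB

estimate_total = estimated_files * estimated_bytes_per_file
observed_bytes_per_file = observed_total_bytes / observed_files
file_ratio = observed_files / estimated_files

print(f"estimated total:    {estimate_total / GIB:.1f} GiB")
print(f"observed per file:  {observed_bytes_per_file / KIB:.0f} KiB")
print(f"files vs. estimate: {file_ratio:.1f}x")
```

Under these assumptions the per-file size is close to the estimate (roughly 330 KiB), so the extra space comes from the number of files being about nine times higher than the number of rrds.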
elukey added a subscriber: elukey. Aug 14 2017, 4:34 PM

Just removed some old log files from /var/log/carbon/* on graphite2001; they were full of things like invalid line (librenms.x.y.z) and they were filling the small root partition.

Indeed it looks like librenms sends both metrics with whitespace in the name and metrics without values:

librenms.asw-b-codfw_mgmt_codfw_wmnet.ports.esi.INUCASTPKTS  1502879264
librenms.asw-b-codfw_mgmt_codfw_wmnet.ports.esi.OUTUCASTPKTS  1502879264
librenms.asw-b-codfw_mgmt_codfw_wmnet.ports.esi.INNUCASTPKTS  1502879264
librenms.cr2-knams_wikimedia_org.storage.storage.hrstorage./dev/da1s1f, mounted on: /var.used 819449856 1502879405
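A small sketch of why lines like the ones above get rejected: the Carbon plaintext protocol expects exactly three whitespace-separated fields per line (metric path, value, timestamp). The `check_line` helper below is hypothetical and only approximates that check; it is not Carbon's actual parser.

```python
def check_line(line):
    """Return None if the line looks like '<path> <value> <timestamp>',
    otherwise a short reason why it does not."""
    fields = line.split()
    if len(fields) < 3:
        return "missing field (expected path, value, timestamp)"
    if len(fields) > 3:
        return "whitespace inside the metric path"
    _path, value, timestamp = fields
    try:
        float(value)
        float(timestamp)
    except ValueError:
        return "value or timestamp is not a number"
    return None


samples = [
    # No value between the path and the timestamp.
    "librenms.asw-b-codfw_mgmt_codfw_wmnet.ports.esi.INUCASTPKTS  1502879264",
    # Spaces inside the metric path.
    "librenms.cr2-knams_wikimedia_org.storage.storage.hrstorage."
    "/dev/da1s1f, mounted on: /var.used 819449856 1502879405",
    # Made-up well-formed line for comparison.
    "librenms.example_host.ports.eth0.INOCTETS 42 1502879264",
]

for sample in samples:
    print(check_line(sample) or "ok")
```

The first two samples are taken from the lines above; the third is a made-up well-formed line for comparison.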

I'll report the bugs upstream

Both issues have been fixed upstream! Pending deployment of latest version of librenms to production.

@ayounsi Want to have the honours? Or should I?

Those changes should land in the August release of LibreNMS (https://github.com/librenms/librenms/releases), most likely due next week.
I think it's worth waiting; if it's urgent, though, we could backport the changes.

akosiaris changed the task status from Open to Stalled. Aug 21 2017, 8:05 AM

Agreed.

fgiunchedi removed fgiunchedi as the assignee of this task. Aug 21 2017, 9:59 AM
fgiunchedi added a subscriber: fgiunchedi.

Unassigned from me since the deployment part is pending

fgiunchedi moved this task from Doing to Radar on the User-fgiunchedi board. Aug 21 2017, 10:00 AM

I was looking for PDU power usage metrics. Since we don't have a Grafana dashboard yet, I tried to query Graphite manually with e.g. this query: librenms.ps*eqiad*.sensor.sensor.current.*.*.sensor. (actually, what we really need is the sum() of that, but it's less obvious to see what's happening in that one).

The output seems to be very spotty, like a lot of data points are missing. Any ideas of why this may be happening?
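For reference, a minimal sketch of building the summed version of that query against Graphite's render API. The base URL is a placeholder, the `render_url` helper is hypothetical, and the trailing dot of the pattern quoted above is treated here as sentence punctuation, which is an assumption.

```python
from urllib.parse import urlencode


def render_url(base_url, target, window="-24h"):
    """Build a Graphite render API URL that returns JSON for one target."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url}/render?{query}"


# Pattern from the comment above, without the final dot (assumption).
pattern = "librenms.ps*eqiad*.sensor.sensor.current.*.*.sensor"

# sumSeries() adds up all series matching the wildcard pattern.
url = render_url("https://graphite.example.org", f"sumSeries({pattern})")
print(url)
```

Note that summing series that each have gaps will only add up whichever points are present at a given timestamp, so a spotty input tends to give a misleading sum.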

Volans added a subscriber: Volans. Aug 22 2017, 10:43 AM

At first sight, it might just be that the update frequency of the data and the smallest retention period set in graphite do not match, with the retention period being much shorter than the update interval.

Looks like librenms polls every 5 minutes, so the gaps are there because no data has actually been sent.

@Volans yeah, that used to be a problem when aggregating into other periods, though now xFilesFactor is basically 0.01 across the board.
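The effect discussed in the last few comments can be sketched with a small simulation. It assumes a finest retention of one point per minute, which is a hypothetical value chosen for illustration; the 5-minute polling interval and the 0.01 xFilesFactor come from the comments above. The helpers only imitate whisper's behaviour, they are not whisper code.

```python
def fill_slots(step, poll_interval, duration):
    """Return one entry per `step`-second slot; None where no sample arrived."""
    slots = [None] * (duration // step)
    for t in range(0, duration, poll_interval):
        slots[t // step] = 1.0  # placeholder reading
    return slots


def aggregate(slots, factor, xfiles_factor):
    """Average each group of `factor` slots. As in whisper, the result is
    None unless at least `xfiles_factor` of the group is non-null."""
    out = []
    for i in range(0, len(slots), factor):
        group = slots[i:i + factor]
        known = [v for v in group if v is not None]
        if known and len(known) / len(group) >= xfiles_factor:
            out.append(sum(known) / len(known))
        else:
            out.append(None)
    return out


# One hour of 60-second slots, with a sample arriving every 300 seconds.
fine = fill_slots(step=60, poll_interval=300, duration=3600)
print(fine.count(None), "of", len(fine), "one-minute slots are empty")

# Rolling up into 5-minute points: 1 of 5 slots known is a ratio of 0.2.
print(aggregate(fine, 5, 0.01).count(None), "empty points with xFilesFactor 0.01")
print(aggregate(fine, 5, 0.5).count(None), "empty points with xFilesFactor 0.5")
```

Under these assumptions the raw archive looks spotty (four out of five slots are empty), while the aggregated archives stay populated as long as the xFilesFactor is below 0.2.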

akosiaris changed the task status from Stalled to Open. Aug 29 2017, 8:06 AM
akosiaris assigned this task to fgiunchedi.

Since the upgrade is done, I am reverting the actions taken in T171167#3536747 and T171167#3536898, namely status and assignment.

fgiunchedi closed this task as Resolved. Aug 29 2017, 8:21 AM

Thanks @akosiaris @ayounsi ! No more invalid metrics in graphite logs AFAICS, resolving!