
Setup graphs for power usage readings in Grafana
Open, High · Public

Description

Forking from T148541

Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage; proper per-rack power usage data is still entirely missing. This currently makes it very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.
I see LibreNMS seems to be exporting its PDU data (from SNMP) into Graphite, so these graphs can probably be created in Grafana as well. Please set up graphs in Grafana that present power usage readings in a way that is useful and meaningful for managing data center operations.
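
To illustrate the kind of aggregation meant here, a minimal sketch in Python follows. It is not taken from LibreNMS or any existing dashboard: the function names and the sample readings are hypothetical, and the imbalance definition (largest deviation from the mean phase current, as a percentage of the mean) is an assumption, since this task does not specify one.

```python
def rack_phase_totals(pdu_phase_currents):
    """Sum per-phase current (amps) across all PDUs in one rack.

    pdu_phase_currents: one [phase1, phase2, phase3] list per PDU
    (hypothetical input shape, not a LibreNMS export format).
    """
    return [sum(phase) for phase in zip(*pdu_phase_currents)]


def phase_imbalance_percent(phase_totals):
    """Largest deviation from the mean phase current, as % of the mean.

    This definition is an assumption made for illustration.
    """
    mean = sum(phase_totals) / len(phase_totals)
    if mean == 0:
        return 0.0
    return max(abs(p - mean) for p in phase_totals) / mean * 100.0


# Hypothetical readings for a rack with two PDUs (amps per phase).
rack = [[8.0, 6.0, 4.0], [4.0, 6.0, 8.0]]
totals = rack_phase_totals(rack)
print(totals, phase_imbalance_percent(totals))
```

In this made-up example each PDU is imbalanced on its own while the rack as a whole is balanced, which is why a per-rack view across all phases is useful in addition to per-PDU graphs.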

At least these graphs/aggregations would be needed AFAIUI:

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 18 2019, 5:13 PM
faidon triaged this task as High priority. · Mar 7 2019, 1:20 PM
faidon added a subscriber: faidon. · Mar 7 2019, 2:53 PM

For per-site usage, LibreNMS, besides being clunky, is non-public and not accessible to everyone.

It would be useful to have a public Grafana dashboard for per-site figures as well. Please work on this with priority.

fgiunchedi moved this task from Backlog to Up next on the observability board. · Mar 18 2019, 2:00 PM
fgiunchedi moved this task from Up next to In progress on the observability board. · Apr 29 2019, 3:27 PM
fgiunchedi updated the task description. · Apr 30 2019, 8:49 AM

I've started a new dashboard with per-PDU data here: https://grafana.wikimedia.org/d/cq0ZowkZz/pdus, whereas per-site data is at https://grafana.wikimedia.org/d/000000397/site-power-usage. Still missing is per-row and per-rack data.

RobH added a subscriber: RobH. · Jun 4 2019, 7:38 PM

@fgiunchedi: https://grafana.wikimedia.org/d/cq0ZowkZz/pdus?orgId=1 lists:

The data is collected from eqiad and codfw sites PDUs via SNMP by LibreNMS, exported to Graphite and calculated as:
sum(current) * max(voltage) / sqrt(3)
PDUs in ulsfo expose pre-calculated watts readings, thus data is taken as-is.
In all cases data is cumulative for all PDUs inlet cords.
See also T171823

Since ulsfo is now using the same Servertech PDUs with SNMP data in LibreNMS, this should be updated, right?
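
For reference, the calculation quoted above can be sketched in Python as follows. This is only an illustration of the formula as written in the dashboard description, not the actual Graphite expression; the function name and sample readings are hypothetical.

```python
import math


def pdu_power_watts(phase_currents_amps, phase_voltages_volts):
    """Estimate PDU power as sum(current) * max(voltage) / sqrt(3).

    Mirrors the formula quoted from the dashboard description; the
    argument names are made up for this sketch.
    """
    return sum(phase_currents_amps) * max(phase_voltages_volts) / math.sqrt(3)


# Hypothetical readings: three phases at 10 A each, about 208 V.
currents = [10.0, 10.0, 10.0]
voltages = [208.0, 207.5, 208.0]
print(round(pdu_power_watts(currents, voltages), 1))
```

For a balanced load this matches the usual three-phase expression sqrt(3) * V * I, because sum(current) is 3 * I and 3 / sqrt(3) equals sqrt(3).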

At the time I built the dashboard, ulsfo was already using Servertech PDUs, though with the servertech4 MIB over SNMP rather than servertech3 like the rest of codfw/eqiad at the time (see https://www.servertech.com/support/sentry-mib-oid-tree-downloads). I see now that the b5-eqiad PDU, being newer, also exposes servertech4 instead of servertech3, so I've updated https://grafana.wikimedia.org/d/000000397/site-power-usage?orgId=1 to cater for this. The individual PDU dashboard you linked should already account for this difference. Hope that answers your question!

Volans assigned this task to fgiunchedi. · Jun 24 2019, 3:05 PM