Setup graphs for power usage readings in Grafana
Closed, Resolved, Public

Description

Forking from T148541

Right now I can only find a single graph with eqiad/codfw total (aggregated) power usage, but proper per-rack power usage data is still entirely missing. This makes it currently very hard to determine the total amount of power used per rack (across all phases) and to monitor things like phase imbalance.

LibreNMS seems to be exporting its PDU data (from SNMP) into Graphite, so these graphs can probably be created in Grafana as well. Please set up graphs that present power usage readings in a way that is useful and meaningful for managing data center operations.

At least these graphs/aggregations would be needed AFAIUI:

Event Timeline

faidon triaged this task as High priority. Mar 7 2019, 1:20 PM

For the per-site usage, LibreNMS, besides being clunky, is non-public and not accessible to all.

It would be useful to have a public Grafana dashboard for per-site figures as well. Please work on this with priority.

I've started a new dashboard with per-PDU data here: https://grafana.wikimedia.org/d/cq0ZowkZz/pdus. Per-site data is at https://grafana.wikimedia.org/d/000000397/site-power-usage. Still missing is per-row and per-rack data.

@fgiunchedi: https://grafana.wikimedia.org/d/cq0ZowkZz/pdus?orgId=1 lists:

The data is collected from eqiad and codfw sites PDUs via SNMP by LibreNMS, exported to Graphite and calculated as:

sum(current) * max(voltage) / sqrt(3)

PDUs in ulsfo expose pre-calculated watts readings, thus data is taken as-is.

In all cases data is cumulative for all PDUs inlet cords.

See also T171823
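As an illustration of the formula quoted above, here is a minimal Python sketch. It assumes the voltage readings are line-to-line values on a three-phase feed, so dividing by sqrt(3) gives the line-to-neutral voltage, and that the result is treated as apparent power (power factor is ignored). The function name and the sample readings are hypothetical and are not taken from the dashboard.

```python
import math


def estimate_pdu_power(phase_currents_amps, phase_voltages_volts):
    """Estimate PDU power as sum(current) * max(voltage) / sqrt(3).

    Assumption: voltages are line-to-line readings, so dividing by
    sqrt(3) converts them to line-to-neutral. Power factor is ignored.
    """
    if not phase_currents_amps or not phase_voltages_volts:
        raise ValueError("need at least one current and one voltage reading")
    total_current = sum(phase_currents_amps)
    highest_voltage = max(phase_voltages_volts)
    return total_current * highest_voltage / math.sqrt(3)


# Hypothetical readings for one three-phase PDU.
currents = [10.0, 12.0, 8.0]      # amps per phase
voltages = [208.0, 207.0, 209.0]  # volts per phase
print(round(estimate_pdu_power(currents, voltages), 1))
```

Using the maximum voltage rather than a per-phase voltage mirrors the quoted formula; a per-phase product would be more precise if per-phase voltages are reliable.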

Since ulsfo is now using the same ServerTech PDUs, with SNMP data in LibreNMS, this should be updated, right?

At the time I built the dashboard, ulsfo was already using ServerTech PDUs, though with the servertech4 MIB over SNMP rather than servertech3 like the rest of codfw/eqiad at the time (see https://www.servertech.com/support/sentry-mib-oid-tree-downloads). I now see that the b5-eqiad PDU, being newer, also exposes servertech4 instead of servertech3, so I've updated https://grafana.wikimedia.org/d/000000397/site-power-usage?orgId=1 to cater for this. The individual PDU dashboard you linked should already account for this difference. Hope that answers your question!

Status update: I've been working on a dashboard with wattage from sentry3 + sentry4. It has got a global stacked graph + drilldown per-site: https://grafana.wikimedia.org/d/OBD1jy1Zk/filippo-pdu

This should be complete now: the new dashboard at https://grafana.wikimedia.org/d/f64mmDzMz/power-usage offers a global overview, the top power-hungry racks, and a full drilldown per site.

Boldly resolving; please reopen if something is amiss.