
Record per-server power usage
Open, Needs Triage · Public

Description

The 2019 Wikimedia Foundation Sustainability Assessment states that server electricity usage accounts for 54.6% of Wikimedia's carbon footprint. It would be useful to break that figure down further, for example by Prometheus cluster. This would improve our ability to identify potential efficiency projects.

For Dell servers there is dellhw_exporter, although this requires OMSA to be installed. This is no easy task since there are no recent Debian packages available. However, the same information is apparently available with IPMI:

$ ipmitool -c -I lanplus -H mw1333.mgmt.eqiad.wmnet -U root -E delloem powermonitor powerconsumptionhistory
Power Consumption History

Statistic                   Last Minute     Last Hour     Last Day     Last Week

Average Power Consumption   155 W           155 W         160 W        165 W   
Max Power Consumption       199 W           199 W         220 W        244 W   
Min Power Consumption       117 W           117 W         105 W         97 W   

Max Power Time
Last Minute     : Thu Nov  7 02:05:33 2019
Last Hour       : Thu Nov  7 02:05:33 2019
Last Day        : Wed Nov  6 12:27:57 2019
Last Week       : Tue Nov  5 22:25:05 2019
Min Power Time
Last Minute     : Thu Nov  7 01:36:15 2019
Last Hour       : Thu Nov  7 01:36:15 2019
Last Day        : Wed Nov  6 19:55:55 2019
Last Week       : Sun Nov  3 02:05:35 2019

The idea would be to write a Prometheus exporter which runs this command and parses the response to extract the one-minute average power consumption. Resolution is only 1 W, but the same resolution is shown in the iDRAC web UI, so it is probably the best that is physically available.
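As a rough illustration of the parsing step, here is a minimal Python sketch that pulls the "Last Minute" value out of the "Average Power Consumption" row. It assumes the output layout matches the sample above; the function name `last_minute_average_watts` and the `SAMPLE` constant are hypothetical, and the sketch does not run `ipmitool` itself.

```python
import re

# Matches a row such as
# "Average Power Consumption   155 W   155 W   160 W   165 W"
# and captures the first column (Last Minute). Assumes the layout shown above.
_AVERAGE_ROW = re.compile(
    r"^\s*Average Power Consumption\s+(\d+)\s*W\b", re.MULTILINE
)


def last_minute_average_watts(output: str) -> int:
    """Return the 'Last Minute' average power in watts (hypothetical helper)."""
    match = _AVERAGE_ROW.search(output)
    if match is None:
        raise ValueError("no 'Average Power Consumption' row found")
    return int(match.group(1))


# Excerpt of the command output quoted in the task description.
SAMPLE = """\
Power Consumption History

Statistic                   Last Minute     Last Hour     Last Day     Last Week

Average Power Consumption   155 W           155 W         160 W        165 W
Max Power Consumption       199 W           199 W         220 W        244 W
Min Power Consumption       117 W           117 W         105 W         97 W
"""

if __name__ == "__main__":
    print(last_minute_average_watts(SAMPLE))  # prints 155
```

Raising on a missing row, rather than returning zero, would keep a failed BMC query from being recorded as a real 0 W reading.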

For HP ProLiant, there is ilo-exporter, which consumes the iLO RESTful API.

Event Timeline

tstarling created this task. (Thu, Nov 7, 4:52 AM)
Restricted Application added a subscriber: Aklapper. (Thu, Nov 7, 4:52 AM)
tstarling updated the task description. (Thu, Nov 7, 4:58 AM)

I have some concerns about proceeding with this. In our experience the BMCs are not that stable, and excessive interaction with them seems to aggravate the situation, statistically causing more BMCs to become unresponsive and require a reset.
For this reason we've kept our checks of the BMCs to a minimum, and I'd rather not add something that queries the BMC so often.

I think that, for what you're looking for, a one-shot gathering of data repeated maybe once a month might be enough. Also bear in mind that power consumption depends heavily on how busy the host is overall, which makes it harder to draw conclusions from power consumption, and perhaps traffic data, alone (e.g. a change in globally installed daemons or a different kernel might lead to different data).

FWIW you don't need remote IPMI for the Dells; you can gather the data directly on the host with ipmi-oem. The relevant commands are:

get-power-consumption-data
get-instantaneous-power-consumption-data [power_supply_instance]
get-power-head-room
get-power-consumption-statistics <average|max|min>

To be used like:

ipmi-oem Dell get-power-consumption-statistics average

In my experience the get-power-consumption-statistics average is not reliable, as the one-minute average doesn't change if I stress the host for a minute, while the instantaneous reading seems accurate.

AFAIK ipmi-oem doesn't support HP according to ipmi-oem -L, but I didn't look deeper.

There is also T214183: Setup graphs for power usage readings in Grafana for per rack, row, and pdu power stats.

I like the overall idea. As for balancing data-gathering frequency against accuracy: since power usage follows at least a daily cycle (i.e. it tracks traffic), I think sampling four times a day to start with should give us representative figures without being a problem for iLO/iDRAC. Thoughts?
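One possible way to sample only a few times a day, without touching the BMC on every Prometheus scrape, would be a timer-driven script that writes the latest reading to a file read by node_exporter's textfile collector (which picks up `*.prom` files from the directory given by `--collector.textfile.directory`). The sketch below covers only the file-writing step. The metric name `node_ipmi_power_watts`, the file name `ipmi_power.prom` and both function names are assumptions for illustration, not existing conventions.

```python
import os
import tempfile

# Hypothetical metric name, chosen for this sketch only.
METRIC = "node_ipmi_power_watts"


def render_metric(watts: float) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    return (
        f"# HELP {METRIC} Power draw reported by the BMC, in watts.\n"
        f"# TYPE {METRIC} gauge\n"
        f"{METRIC} {watts}\n"
    )


def write_textfile(directory: str, watts: float) -> str:
    """Write the sample to <directory>/ipmi_power.prom (hypothetical name).

    The content goes to a temporary file first and is then renamed, so a
    concurrent scrape never sees a half-written file.
    """
    final_path = os.path.join(directory, "ipmi_power.prom")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metric(watts))
    # mkstemp creates the file as 0600; relax it so another user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, final_path)
    return final_path
```

The script would be called with whichever reading is chosen (for example the instantaneous value), on whatever schedule is agreed, such as four times a day.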