
Record per-server power usage
Closed, Resolved, Public

Description

The 2019 Wikimedia Foundation Sustainability Assessment states that server electricity usage accounts for 54.6% of Wikimedia's carbon footprint. It would be nice to be able to break that figure down further, for example, by Prometheus cluster. This would improve our ability to identify potential efficiency projects.

For Dell servers there is dellhw_exporter, although this requires OMSA to be installed. This is no easy task since there are no recent Debian packages available. However, the same information is apparently available with IPMI:

$ ipmitool -c -I lanplus -H mw1333.mgmt.eqiad.wmnet -U root -E delloem powermonitor powerconsumptionhistory
Power Consumption History

Statistic                   Last Minute     Last Hour     Last Day     Last Week

Average Power Consumption   155 W           155 W         160 W        165 W   
Max Power Consumption       199 W           199 W         220 W        244 W   
Min Power Consumption       117 W           117 W         105 W         97 W   

Max Power Time
Last Minute     : Thu Nov  7 02:05:33 2019
Last Hour       : Thu Nov  7 02:05:33 2019
Last Day        : Wed Nov  6 12:27:57 2019
Last Week       : Tue Nov  5 22:25:05 2019
Min Power Time
Last Minute     : Thu Nov  7 01:36:15 2019
Last Hour       : Thu Nov  7 01:36:15 2019
Last Day        : Wed Nov  6 19:55:55 2019
Last Week       : Sun Nov  3 02:05:35 2019

The idea would be to write a Prometheus plugin which runs this command and parses the response to extract the one minute average power consumption. Resolution is only 1W, but the same resolution is shown in the iDRAC web UI so it is probably the best that is physically available.
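As a rough sketch of the parsing step described above, the following Python reads the table portion of the output and emits it in Prometheus text format. The metric name `ipmi_dell_power_watts`, the label names, and the function names are hypothetical, and the sample text is copied from the output shown earlier; actually running ipmitool is left out.

```python
import re

# Table portion of the ipmitool output quoted in this task.
SAMPLE = """\
Power Consumption History

Statistic                   Last Minute     Last Hour     Last Day     Last Week

Average Power Consumption   155 W           155 W         160 W        165 W
Max Power Consumption       199 W           199 W         220 W        244 W
Min Power Consumption       117 W           117 W         105 W         97 W
"""

# Column order as printed in the "Statistic" header row.
PERIODS = ("last_minute", "last_hour", "last_day", "last_week")


def parse_power_history(text):
    """Return {statistic: {period: watts}} for the Average/Max/Min rows."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*(Average|Max|Min) Power Consumption\s+(.*)", line)
        if not m:
            continue  # skips headers and the "Max/Min Power Time" sections
        watts = [int(w) for w in re.findall(r"(\d+)\s*W\b", m.group(2))]
        if len(watts) == len(PERIODS):
            stats[m.group(1).lower()] = dict(zip(PERIODS, watts))
    return stats


def to_prometheus(stats, metric="ipmi_dell_power_watts"):
    """Render the parsed table as Prometheus text-format lines.

    The metric and label names are assumptions made for this sketch.
    """
    lines = []
    for stat, by_period in stats.items():
        for period, watts in by_period.items():
            lines.append(f'{metric}{{statistic="{stat}",period="{period}"}} {watts}')
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_prometheus(parse_power_history(SAMPLE)))
```

The one-minute average would then be `parse_power_history(text)["average"]["last_minute"]`.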

For HP ProLiant, there is ilo-exporter, which consumes the iLO RESTful API.

Event Timeline

I have some concerns about proceeding with this. In our experience the BMCs are not that stable, and excessive interaction with them seems to aggravate the situation, statistically causing more BMCs to become unresponsive and require a reset.
For this reason we've kept our checks of BMCs to a minimum, and I'd rather not add something that queries the BMC so often.

I think that for what you're looking for, a one-shot gathering of data, repeated maybe once a month or so, might be enough. Also take into account that any power consumption data is closely tied to how heavily the host is used overall, which makes it harder to draw conclusions based only on power consumption and maybe traffic data (e.g. a change in globally installed daemons or different kernels might lead to different data).

FWIW, you don't need remote IPMI for the Dells; you can gather the data directly on the host with ipmi-oem. The relevant commands are:

get-power-consumption-data
get-instantaneous-power-consumption-data [power_supply_instance]
get-power-head-room
get-power-consumption-statistics <average|max|min>

To be used like:

ipmi-oem Dell get-power-consumption-statistics average

In my experience the get-power-consumption-statistics average is not reliable as the one minute average doesn't change if I stress the host for a minute, while the instantaneous one seems accurate.

AFAIK ipmi-oem doesn't support HP according to ipmi-oem -L, but I didn't look deeper.

I like the overall idea. As for balancing data-gathering frequency against accuracy: we have at least a daily cycle in power usage (i.e. matching traffic), so I think starting with sampling four times a day should get us representative figures while not being a problem for iLO/iDRAC. Thoughts?
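To illustrate why a handful of evenly spaced samples can be representative of a daily cycle, here is a minimal sketch. It assumes an idealised sinusoidal load curve; the mean, swing, and peak hour are invented for the example and are not measurements from any host.

```python
import math


def daily_power(hour, mean_w=160.0, swing_w=40.0, peak_hour=15.0):
    """Hypothetical power draw in watts following a pure 24-hour sinusoid."""
    return mean_w + swing_w * math.cos(2 * math.pi * (hour - peak_hour) / 24)


def sampled_mean(samples_per_day, offset_hour=0.0):
    """Mean of evenly spaced samples taken over one day, starting at offset_hour."""
    step = 24 / samples_per_day
    values = [daily_power(offset_hour + i * step) for i in range(samples_per_day)]
    return sum(values) / len(values)


if __name__ == "__main__":
    # A single sample depends entirely on when it is taken.
    print("1 sample at peak:", round(sampled_mean(1, offset_hour=15.0), 1))
    # Four evenly spaced samples recover the daily mean for any start time.
    print("4 samples/day:   ", round(sampled_mean(4, offset_hour=3.0), 1))
```

Under this assumption the four-sample mean equals the true daily mean regardless of the starting hour. A real traffic curve is not a pure sinusoid, so some error would remain, which is one reason to compare the result against the BMC's own daily average.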

I have some concerns about proceeding with this. In our experience the BMCs are not that stable, and excessive interaction with them seems to aggravate the situation, statistically causing more BMCs to become unresponsive and require a reset.
For this reason we've kept our checks of BMCs to a minimum, and I'd rather not add something that queries the BMC so often.

Is there any bug report about this? Are you sure it affects the components we would be using? I understand ipmi-oem does not use the network stack.

In my experience the get-power-consumption-statistics average is not reliable as the one minute average doesn't change if I stress the host for a minute, while the instantaneous one seems accurate.

I tested this on scandium and found that the one minute power consumption reported by this method is always the same as the one hour power consumption. So the machine is evidently not collecting a one-minute average and is mislabelling the one-hour average as a one-minute average. I loaded it for 22 minutes. Here is the data collected from the minute/hour averages (blue) versus a model (pink) assuming that it is an hourly average with a step from 70W to 143W at t=0:

[Figure: ipmi-power.png, reported minute/hour averages (blue) versus the hourly-average model (pink)]

It's a bit weird and glitchy, but maybe it is converging on the right answer.
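The model described above can be sketched as a trailing 60-minute average of a step in power. The 70 W and 143 W levels and the 22-minute load come from the comment; the assumption that power drops straight back to 70 W when the load ends, and the function name, are mine.

```python
def modeled_hour_average(t_min, low_w=70.0, high_w=143.0,
                         load_minutes=22.0, window_min=60.0):
    """Trailing-window average (watts) at t_min minutes after the load starts.

    Power is assumed to be high_w during [0, load_minutes] and low_w
    otherwise; the return to low_w after the load is an assumption.
    """
    window_start = t_min - window_min
    # Minutes of the loaded interval that fall inside the trailing window.
    overlap = max(0.0, min(t_min, load_minutes) - max(window_start, 0.0))
    return low_w + (high_w - low_w) * overlap / window_min


if __name__ == "__main__":
    for t in (0, 5, 10, 22, 60, 70, 82):
        print(f"t={t:3d} min  modeled average = {modeled_hour_average(t):6.1f} W")
```

If the reported value really is an hourly average, it should climb roughly linearly from 70 W while the load runs and stop well short of 143 W, since 22 minutes is only part of the window.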

Instantaneous power consumption is noisy, and sampling it once a month would not give you much ability to average over that noise. I would say it's better to collect the daily or weekly average than to use instantaneous power. We could collect both and verify that they converge to the same thing.

Is there any bug report about this? Are you sure it affects the components we would be using? I understand ipmi-oem does not use the network stack.

@tstarling I don't have any specific URL at hand, sorry, but it's empirical team knowledge from various past and present experiences. I agree that most of the time the remote IPMI stack was involved, so maybe with the local ipmi-oem we're safer.

I propose that, whatever we end up deciding, we set it up only on the canary hosts of some cluster first, and then expand it after a while if there have been no issues. For ballpark numbers we could even use the data from one host per cluster for those clusters where the load is evenly distributed.

Nice to see the progress on this task. We now have the Prometheus IPMI exporter on 585 servers.

FYI, I added prometheus-ipmi-exporter to buster hosts as well: https://gerrit.wikimedia.org/r/c/operations/puppet/+/824193. I don't think it would be too hard to add to stretch as well, but it will require a bit more than a straight copy, as I get the following. However, for now I didn't think it was worth it, as we are phasing them out.

tstarling claimed this task.

I'm going to call this done, since about 90% of PDU power usage now appears in server power usage, in both eqiad and codfw.