Investigate if a Prometheus exporter for the AMD GPU(s) can be easily created
Closed, Resolved · Public · 8 Story Points

Description

The idea was mentioned in the parent task by @EBernhardson: parse the output of /opt/rocm/bin/rocm-smi to get GPU info such as temperature, usage, power consumption, etc.

The generic output to stdout is very cumbersome to parse:

elukey@stat1005:~$ /opt/rocm/bin/rocm-smi


========================        ROCm System Management Interface        ========================
================================================================================================
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
1     23.0c  7.0W     852Mhz  167Mhz  8.0GT/s, x16   14.9%   auto    170.0W   0%        0%       0%
================================================================================================
========================               End of ROCm SMI Log              ========================
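To illustrate why this table is awkward to parse: some cells (for example "8.0GT/s, x16") and some headers ("SCLK OD") contain single spaces, so a plain whitespace split misaligns the columns. A minimal sketch, assuming columns are always separated by at least two spaces (an assumption based only on the sample above, not on any documented rocm-smi guarantee); `parse_table` is a hypothetical helper name:

```python
import re

# Sample copied from the rocm-smi output above (banner lines omitted).
SAMPLE = """\
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
1     23.0c  7.0W     852Mhz  167Mhz  8.0GT/s, x16   14.9%   auto    170.0W   0%        0%       0%
"""


def parse_table(text):
    """Return one dict per GPU row, keyed by the header names.

    Assumes columns are separated by two or more spaces, which holds for
    the sample above but is not guaranteed by rocm-smi.
    """
    # Drop empty lines and the "====" banner lines.
    lines = [l for l in text.splitlines() if l.strip() and not l.startswith("=")]
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) != len(header):
            # The layout changed or a cell was empty: fail loudly.
            raise ValueError("unexpected column count: %r" % line)
        rows.append(dict(zip(header, cells)))
    return rows


rows = parse_table(SAMPLE)
```

Any change in column spacing or an empty cell breaks this approach, which is the main argument for a machine-readable output format.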

rocm-smi also has an option (--save) to store this information in a (temporary) JSON file, but for some reason it doesn't log the temperature (probably something to fix). I opened a GitHub issue with upstream to ask whether they would accept something like --print-json to get the JSON content directly via stdout.

https://github.com/RadeonOpenCompute/ROC-smi/issues/61

Event Timeline

elukey triaged this task as Normal priority.Apr 12 2019, 6:09 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper.Apr 12 2019, 6:09 AM

+1, something that parses the JSON and writes metrics in text format for node-exporter to pick up sounds good to me
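For reference, a rough sketch of what "writing metrics in text format" could look like. It renders samples in the Prometheus text exposition format and writes them atomically to a .prom file, as read by node-exporter's textfile collector (the directory passed via --collector.textfile.directory). The metric names, the `card` label, and the function names are illustrative assumptions, not the ones used by any actual patch:

```python
import os
import tempfile


def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text format lines."""
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(
            '{}="{}"'.format(k, v) for k, v in sorted(labels.items())
        )
        lines.append("{}{{{}}} {}".format(name, label_str, value))
    return "\n".join(lines) + "\n"


def write_textfile(path, content):
    """Write content to path atomically so a scrape never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(content)
    # mkstemp creates the file with mode 0600; relax it so a node-exporter
    # running as another user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


# Hypothetical metric names, for illustration only.
example = render_metrics([
    ("amd_rocm_gpu_usage_percent", {"card": "1"}, 0.0),
    ("amd_rocm_gpu_power_watts", {"card": "1"}, 7.0),
])
```

Writing to a temporary file in the same directory and renaming it avoids the collector reading a half-written file.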

elukey moved this task from Backlog to In Progress on the User-Elukey board.Apr 15 2019, 12:54 PM
elukey changed the task status from Open to Stalled.Apr 18 2019, 11:50 AM

Upstream told me that they are already working on a more generic version of my pull request (see the GitHub issue), and that they'll probably release it with ROCm 2.5 (we are currently running 2.3; I guess 2.4 is upcoming). Setting the task to stalled until the new feature is ready.

Miriam added a subscriber: Miriam.Jul 8 2019, 2:46 PM

Change 521826 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add prometheus node exporter for AMD ROCm's GPU stats

https://gerrit.wikimedia.org/r/521826

elukey changed the task status from Stalled to Open.Jul 10 2019, 7:18 AM

I filed a code review to create the initial version of the node exporter, with the following metrics:

  • usage percent
  • power consumption (in watts)
  • fan usage percent
  • temperature (in celsius)

This comes from the following:

elukey@stat1005:~$ sudo /opt/rocm/bin/rocm-smi --json --showuse --showfan --showtemp --showpower


========================ROCm System Management Interface========================
{"card1": {"Temperature (Sensor #1)": "28.0 c", "Current GPU use": "0%", "Fan Level": "38 (14%)", "Average Graphics Package Power": "7.0W"}}
==============================End of ROCm SMI Log ==============================

As you can see, the output of the command is not pure JSON (it is wrapped in banner lines), and the keys/values are a bit overloaded with info and could be formatted in a simpler way. I am trying to follow up with upstream in https://github.com/RadeonOpenCompute/ROC-smi/issues/61, but any upgrade/fix will need to wait for the next ROCm release.
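A minimal sketch of how the wrapped output could be handled in the meantime: pick out the single line that starts with "{", parse it as JSON, and strip the units from the values. The key names come from the sample output above; the helper names and the normalized field names are assumptions made for this example and are not taken from the merged patch:

```python
import json
import re

# Sample copied from the rocm-smi --json output above.
SAMPLE_OUTPUT = """

========================ROCm System Management Interface========================
{"card1": {"Temperature (Sensor #1)": "28.0 c", "Current GPU use": "0%", "Fan Level": "38 (14%)", "Average Graphics Package Power": "7.0W"}}
==============================End of ROCm SMI Log ==============================
"""


def extract_json(output):
    """Return the parsed JSON object found between the banner lines."""
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("{"):
            return json.loads(line)
    raise ValueError("no JSON object found in rocm-smi output")


def first_number(text):
    """'28.0 c' -> 28.0, '7.0W' -> 7.0, '0%' -> 0.0"""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError("no number in %r" % text)
    return float(match.group())


def percent_in_parens(text):
    """'38 (14%)' -> 14.0 (the percentage, not the raw fan level)."""
    match = re.search(r"\((\d+(?:\.\d+)?)%\)", text)
    if match is None:
        raise ValueError("no percentage in %r" % text)
    return float(match.group(1))


def normalize(raw):
    """Map the overloaded rocm-smi keys/values to plain floats per card."""
    return {
        card: {
            "temperature_celsius": first_number(v["Temperature (Sensor #1)"]),
            "usage_percent": first_number(v["Current GPU use"]),
            "fan_percent": percent_in_parens(v["Fan Level"]),
            "power_watts": first_number(v["Average Graphics Package Power"]),
        }
        for card, v in raw.items()
    }


stats = normalize(extract_json(SAMPLE_OUTPUT))
```

This depends on the exact key strings and on the JSON being printed on a single line, so it would need adjusting if upstream changes the format in a later ROCm release.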

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

Parsing "radeontop -d" might also be an interesting data source.

Change 521826 merged by Elukey:
[operations/puppet@production] Add prometheus node exporter for AMD ROCm's GPU stats

https://gerrit.wikimedia.org/r/521826

elukey set the point value for this task to 8.Jul 11 2019, 10:43 AM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.
Nuria closed this task as Resolved.Jul 18 2019, 8:35 PM