
Add more metrics to prometheus-amd-rocm-stats Python script
Closed, Resolved · Public

Description

With the new driver, we should be able to export more metrics, e.g. memory usage.

Check what numbers we can shake out of rocm-smi and include the useful ones in the Python script for export.

If there are many additional metrics, consider refactoring the parsing/metric-assignment part of the code.
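One possible shape for such a refactoring is a table-driven mapping, so that adding a metric means adding one table entry rather than new parsing code. The sketch below is only an illustration, not the actual exporter: it assumes rocm-smi output is available as JSON keyed by card, and the field names, metric names, and sample data are all hypothetical and would need to be checked against the installed rocm-smi version.

```python
import json

# Hypothetical mapping: rocm-smi JSON field name -> exported metric name.
# Both sides are illustrative assumptions, not taken from the real script.
METRIC_MAP = {
    "GPU use (%)": "amd_rocm_gpu_usage_percent",
    "VRAM Total Used Memory (B)": "amd_rocm_gpu_memory_used_bytes",
    "VRAM Total Memory (B)": "amd_rocm_gpu_memory_total_bytes",
}

# Made-up sample input in the assumed per-card JSON layout.
SAMPLE = json.dumps({
    "card0": {
        "GPU use (%)": "42",
        "VRAM Total Used Memory (B)": "1048576",
        "Temperature": "n/a",  # not in METRIC_MAP, so it is ignored
    }
})


def parse_stats(raw_json):
    """Return (metric, card, value) tuples for every mapped numeric field."""
    samples = []
    for card, fields in sorted(json.loads(raw_json).items()):
        if not isinstance(fields, dict):
            continue  # skip any non per-card entries
        for field, metric in METRIC_MAP.items():
            if field not in fields:
                continue  # older drivers may not report every field
            try:
                value = float(fields[field])
            except (TypeError, ValueError):
                continue  # skip non-numeric values such as "n/a"
            samples.append((metric, card, value))
    return samples


def to_exposition(samples):
    """Render samples as lines in the Prometheus text exposition format."""
    return "".join(f'{m}{{card="{c}"}} {v}\n' for m, c, v in samples)


if __name__ == "__main__":
    print(to_exposition(parse_stats(SAMPLE)), end="")
```

Fields that are missing or non-numeric are skipped rather than exported as zero, so a driver that lacks a given statistic does not produce misleading samples.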

Event Timeline

klausman created this task. Sep 9 2020, 3:11 PM
Restricted Application added a subscriber: Aklapper. Sep 9 2020, 3:11 PM

Change 626386 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] prometheus: Add more stats to AMD ROCm GPU exporter

https://gerrit.wikimedia.org/r/626386

Change 626386 merged by Klausman:
[operations/puppet@production] prometheus: Add more stats to AMD ROCm GPU exporter

https://gerrit.wikimedia.org/r/626386

elukey triaged this task as Medium priority. Sep 21 2020, 2:19 PM
elukey added a project: Analytics-Kanban.
elukey set Final Story Points to 5.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.
elukey closed this task as Resolved. Oct 26 2020, 10:08 AM