Page MenuHomePhabricator

Update prometheus-amd-rocm-stats Python script to work with new JSON output
Closed, ResolvedPublic

Description

Updating the rocm kernel drivers in the context of T260442 revealed that the JSON output fro rocm-smi has changed.

Before:

$ /opt/rocm-3.3.0/bin/rocm_smi.py "--showuse" "--showpower" "--showtemp" "--showfan" "--json"|jq .
{
  "card1": {
    "Fan Speed (level)": "35",
    "Fan Speed (%)": "13",
    "GPU use (%)": "40",
    "Average Graphics Package Power (W)": "26.0"
    "Temperature (Sensor #1) (C)": "34.0",
  }
}

After:

$ /opt/rocm-3.3.0/bin/rocm_smi.py "--showuse" "--showpower" "--showtemp" "--showfan" "--json"|jq .
{
  "card1": {
    "Fan Speed (level)": "38",
    "Fan Speed (%)": "14",
    "GPU use (%)": "0",
    "Average Graphics Package Power (W)": "17.0",
    "Temperature (Sensor mem) (C)": "28.0",
    "Temperature (Sensor junction) (C)": "28.0",
    "Temperature (Sensor edge) (C)": "28.0"
  }
}

(note absence of unlabeled temp, and temp measurements with location labels)

I'll update the script to support both old and new outputs, keeping the metric name the same, i.e.:

on old driver wit will show:

amd_rocm_gpu_temperature_celsius{card="card1"} 33

on new driver:

amd_rocm_gpu_temperature_celsius{card="card1", location="mem"} 33
amd_rocm_gpu_temperature_celsius{card="card1", location="junction"} 34
amd_rocm_gpu_temperature_celsius{card="card1", location="edge"} 35

I briefly considered changing the old metric to now include a location="sensor1" or similar label, but I think continuity in time series name is more important than comaparability. I don't expect us to run old/new drivers in parallel for a longer time period.

Event Timeline

Change 626150 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] Update prometheus-amd-rocm-stats.py to handle new driver readings

https://gerrit.wikimedia.org/r/626150

Change 626150 merged by Klausman:
[operations/puppet@production] Update prometheus-amd-rocm-stats.py to handle new driver readings

https://gerrit.wikimedia.org/r/626150

Turns out, the Prometheus lcient libs do not allow for not specifying a label. Thus, I will use the "sensor1" location as describe at the end of the task description.
We could special case the two different driver versions entirely, but that seems like more trouble than it's worth.

Submitted and live on stat1005 and stat1008, confirmed working as intended.