Updating the rocm kernel drivers in the context of T260442 revealed that the JSON output fro rocm-smi has changed.
Before:
$ /opt/rocm-3.3.0/bin/rocm_smi.py "--showuse" "--showpower" "--showtemp" "--showfan" "--json"|jq . { "card1": { "Fan Speed (level)": "35", "Fan Speed (%)": "13", "GPU use (%)": "40", "Average Graphics Package Power (W)": "26.0" "Temperature (Sensor #1) (C)": "34.0", } }
After:
$ /opt/rocm-3.3.0/bin/rocm_smi.py "--showuse" "--showpower" "--showtemp" "--showfan" "--json"|jq . { "card1": { "Fan Speed (level)": "38", "Fan Speed (%)": "14", "GPU use (%)": "0", "Average Graphics Package Power (W)": "17.0", "Temperature (Sensor mem) (C)": "28.0", "Temperature (Sensor junction) (C)": "28.0", "Temperature (Sensor edge) (C)": "28.0" } }
(note absence of unlabeled temp, and temp measurements with location labels)
I'll update the script to support both old and new outputs, keeping the metric name the same, i.e.:
on old driver wit will show:
amd_rocm_gpu_temperature_celsius{card="card1"} 33
on new driver:
amd_rocm_gpu_temperature_celsius{card="card1", location="mem"} 33 amd_rocm_gpu_temperature_celsius{card="card1", location="junction"} 34 amd_rocm_gpu_temperature_celsius{card="card1", location="edge"} 35
I briefly considered changing the old metric to now include a location="sensor1" or similar label, but I think continuity in time series name is more important than comaparability. I don't expect us to run old/new drivers in parallel for a longer time period.