Hi,
It would be great to have 'rocm-smi' installed on stat1005 and stat1008 to monitor the GPU usage. Please help in getting the same installed.
Best,
Akhil
Aroraakhil, May 5 2020, 6:32 PM
F31806726: Screenshot 2020-05-07 19.52.04.png, May 7 2020, 5:54 PM
@Aroraakhil there are two ways of checking metrics:
rocm-smi is unfortunately a Python script that requires full root to execute, so we decided to use radeontop instead.
Documented on wikitech https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU
@elukey thanks much for your response. However, none of these monitoring tools give information about the pids of the processes or the number of processes currently using the GPU. 'nvidia-smi' provides that, and thus, I am assuming its equivalent 'rocm-smi' should also provide that.
Is there no reasonable workaround to install 'rocm-smi'?
Best,
Akhil
@Aroraakhil this is the default output of rocm-smi (I ran it manually via sudo):
elukey@stat1005:~$ sudo /opt/rocm/bin/rocm-smi
========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr  SCLK     MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
1    48.0c  87.0W   1500Mhz  945Mhz  14.9%  auto  170.0W  N/A    100%
================================================================================
==============================End of ROCm SMI Log ==============================
Is there anything in particular that you were used to with nvidia-smi that I should check? (Namely, can you give me an example?)
We can think about allowing rocm-smi, but it is a script (not a binary) and we'd need to give it sudo/root permissions, which is not great security-wise. Let's figure out if it is really needed; if so, I'll try to do something :)
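For anyone who wants to feed the default rocm-smi table above into a script, here is a minimal parsing sketch. It assumes the layout is exactly what the session above shows (banner lines starting with `=`, one whitespace-separated header row beginning with `GPU`, then one row per GPU); other rocm-smi versions may print a different layout. The function name `parse_rocm_smi_table` is made up for this example and is not part of rocm-smi.

```python
# Sample text copied from the rocm-smi session shown in this thread.
SAMPLE = """\
========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr  SCLK     MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
1    48.0c  87.0W   1500Mhz  945Mhz  14.9%  auto  170.0W  N/A    100%
================================================================================
==============================End of ROCm SMI Log ==============================
"""


def parse_rocm_smi_table(text):
    """Return one dict per GPU row, keyed by the column names in the header.

    Hypothetical helper: it only understands the table layout shown above.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '=====' banner/separator lines.
        if not line or line.startswith("="):
            continue
        fields = line.split()
        if header is None:
            # The first non-banner line starting with 'GPU' is the header row.
            if fields[0] == "GPU":
                header = fields
            continue
        # Keep only rows that line up with the header columns.
        if len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


gpus = parse_rocm_smi_table(SAMPLE)
```

With the sample above, `gpus[0]["GPU%"]` is the string `"100%"`; values are kept as raw strings so that entries like `N/A` need no special handling.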
@elukey thanks much for your prompt response. This is what I get from 'nvidia-smi' on our EPFL machine. As you can see, it displays the number of processes currently running and their pids. However, I am not sure whether rocm-smi has a flag that displays similar information.
I did some testing just now, and it looks like the current version of rocm_smi.py does not try to re-execute itself through sudo when the --showpidgpus or --showpids flags are used. Luca tells me that this used to be the case, but it looks like it has changed since. @Aroraakhil can you test whether the tool now works for you?
@Aroraakhil yep, it should be fine: in later releases they changed the script to require fewer privileges, so all good! Thanks for your patience :)
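To check from a script, a small wrapper along these lines could call rocm-smi as a regular user with the --showpids or --showpidgpus flags mentioned above. This is only a sketch: the path /opt/rocm/bin/rocm-smi is taken from the session earlier in this thread, the helper names `build_command` and `show_gpu_processes` are made up for illustration, and the output is returned as raw text because its format is not shown in this thread.

```python
import shutil
import subprocess

# Path as used in the session earlier in this thread; adjust if your install differs.
ROCM_SMI_PATH = "/opt/rocm/bin/rocm-smi"

# The two process-related flags discussed in this thread.
ALLOWED_FLAGS = ("--showpids", "--showpidgpus")


def build_command(flag="--showpids", binary=ROCM_SMI_PATH):
    """Build the argument list for an unprivileged rocm-smi call (hypothetical helper)."""
    if flag not in ALLOWED_FLAGS:
        raise ValueError(f"unsupported flag: {flag}")
    return [binary, flag]


def show_gpu_processes(flag="--showpids", binary=ROCM_SMI_PATH):
    """Run rocm-smi without sudo and return its raw stdout.

    Returns None if the executable cannot be found, so callers can fall
    back to another tool such as radeontop.
    """
    if shutil.which(binary) is None:
        return None
    result = subprocess.run(
        build_command(flag, binary),
        capture_output=True,
        text=True,
        check=False,  # leave it to the caller to inspect the output
    )
    return result.stdout
```

Calling `print(show_gpu_processes())` on stat1005 or stat1008 should then print whatever rocm-smi reports for running GPU processes, with no sudo involved.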