
Monitoring GPU Usage on stat Machines
Closed, Resolved · Public

Description

Hi,

It would be great to have 'rocm-smi' installed on stat1005 and stat1008 to monitor the GPU usage. Please help in getting the same installed.

Best,
Akhil

Event Timeline

Restricted Application added a subscriber: Aklapper. May 5 2020, 6:32 PM
elukey added a subscriber: elukey. May 7 2020, 3:31 PM

@Aroraakhil there are two ways of checking metrics:

  1. sudo radeontop
  2. https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu?orgId=1

rocm-smi is unfortunately a python script that requires full root to execute, so we decided to use radeontop.

Milimetric closed this task as Resolved. May 7 2020, 3:34 PM
Milimetric claimed this task.
Milimetric moved this task from Incoming to Radar on the Analytics board.

@elukey thanks much for your response. However, neither of these monitoring tools gives information about the PIDs of the processes or the number of processes currently using the GPU. 'nvidia-smi' provides that, so I assume its equivalent, 'rocm-smi', should provide it as well.
Is there no reasonable workaround to install 'rocm-smi'?

Best,
Akhil

elukey added a comment. May 7 2020, 5:43 PM

@Aroraakhil this is the default output of rocm-smi (I executed it manually via sudo):

elukey@stat1005:~$ sudo /opt/rocm/bin/rocm-smi


 ========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr  SCLK     MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
1    48.0c  87.0W   1500Mhz  945Mhz  14.9%  auto  170.0W  N/A    100%
================================================================================
==============================End of ROCm SMI Log ==============================

Is there anything in particular that you were used to with nvidia-smi that I should check? (Namely, can you give me an example?)

We can think about allowing rocm-smi, but it is a script (not a binary) and we'd need to give it sudo/root permissions, which is not really great security-wise. Let's figure out if it is really needed; if so, I'll try to do something :)
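For scripted monitoring, the default table pasted above could be parsed roughly as follows. This is a minimal sketch that assumes the exact layout shown in this comment (banner lines starting with `=`, one header row, one row per GPU); the function name `parse_rocm_smi_table` is hypothetical, and other ROCm versions may print a different layout.

```python
# Sample copied from the rocm-smi output pasted in this task.
SAMPLE = """
 ========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr  SCLK     MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
1    48.0c  87.0W   1500Mhz  945Mhz  14.9%  auto  170.0W  N/A    100%
================================================================================
==============================End of ROCm SMI Log ==============================
"""


def parse_rocm_smi_table(text):
    """Parse the default rocm-smi table into a list of dicts, one per GPU.

    Assumes the whitespace-separated layout shown in SAMPLE; this is not
    an official output format guarantee.
    """
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '====' banner/separator lines.
        if not line or line.startswith("="):
            continue
        fields = line.split()
        if header is None:
            # The first non-banner line holds the column names.
            header = fields
        elif len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


if __name__ == "__main__":
    for gpu in parse_rocm_smi_table(SAMPLE):
        print(gpu["GPU"], gpu["Temp"], gpu["GPU%"])
```

Values are kept as strings (for example `48.0c`), since the units are part of what the tool prints.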

@elukey thanks much for your prompt response. This is what I get from 'nvidia-smi' on our EPFL machine. As you can see, it displays the number of processes currently running and their PIDs. However, I am not sure whether there is a flag in rocm-smi that displays similar information.

elukey reopened this task as Open. May 14 2020, 6:00 AM

I see /opt/rocm-3.3.0/bin/rocm-smi --showpids that could help, will investigate!

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:33 AM
Restricted Application edited projects, added Analytics; removed Analytics-Radar. Jun 10 2020, 6:33 AM
Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:36 AM
Restricted Application edited projects, added Analytics; removed Analytics-Radar. Jun 10 2020, 6:36 AM
Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:41 AM
elukey removed Milimetric as the assignee of this task. Jun 16 2020, 7:20 AM
elukey added a subscriber: Milimetric.

@elukey just curious if there are any updates on this?

@Aroraakhil sorry still no progress, I hope to get something done this Quarter :(

Ottomata assigned this task to klausman. Sep 16 2020, 4:57 PM

I did some testing just now, and it looks like the current version of rocm_smi.py does not try to re-execute itself through sudo when the --showpidgpus or --showpids flags are used. Luca tells me that this used to be the case, but it looks like it has changed since. @Aroraakhil can you test whether the tool now works for you?

@elukey and @klausman thanks!
It works fine for me.
Just to clarify, I run /opt/rocm/bin/rocm-smi --showpids without sudo, and it works just fine. I am assuming this is how it is supposed to be, right?

elukey closed this task as Resolved. Sep 21 2020, 2:20 PM

@Aroraakhil yep, it should be fine. In later releases they changed the script to require fewer privileges, so all good! Thanks for your patience :)
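For anyone who wants to poll this from a script, the unprivileged invocation confirmed in this thread can be wrapped as in the sketch below. It assumes rocm-smi lives at /opt/rocm/bin/rocm-smi (the path used above) and that --showpids works without sudo on the installed ROCm version. The helper name `show_gpu_pids` is hypothetical, and the output is returned unparsed because its format is not shown in this task.

```python
import shutil
import subprocess

# Path used in this task; adjust if ROCm is installed elsewhere.
ROCM_SMI = "/opt/rocm/bin/rocm-smi"


def show_gpu_pids(binary=ROCM_SMI, timeout=30):
    """Run `rocm-smi --showpids` without sudo and return its raw stdout.

    Returns None if the binary is missing, not executable, times out,
    or exits with a non-zero status.
    """
    # shutil.which also accepts a full path and checks that it is executable.
    if shutil.which(binary) is None:
        return None
    try:
        result = subprocess.run(
            [binary, "--showpids"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None
```

Calling `show_gpu_pids()` on a stat machine should return the same text as running the command by hand; on a host without ROCm it returns None instead of raising.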