I have been working with Miriam to figure out why the rocm-smi tool (provided by ROCm) is not displaying the memory usage of our GPUs:
elukey@stat1005:~$ sudo /opt/rocm/bin/rocm-smi
========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr  SCLK    MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
1    25.0c  10.0W   852Mhz  167Mhz  14.9%  auto  170.0W  N/A    0%
================================================================================
==============================End of ROCm SMI Log ==============================

elukey@stat1005:~$ sudo /opt/rocm/bin/rocm-smi --showmeminfo vram --log=DEBUG
========================ROCm System Management Interface========================
================================================================================
DEBUG: Unable to get vram_used
ERROR: GPU[1] : Unable to get vram memory usage information
================================================================================
==============================End of ROCm SMI Log ==============================
The rocm-smi tool (that we also use to export metrics to Prometheus) looks for a specific location in sysfs but doesn't find anything. I opened an issue upstream, https://github.com/RadeonOpenCompute/ROC-smi/issues/90, and I think the problem is that the kernel we are running (4.19.132) doesn't contain https://github.com/torvalds/linux/commit/55c374e9eb72be0de5d4fe2ef4d7803cd4ea6329, which is only present in 5.x if I read the tags correctly.
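For reference, if my reading of the rocm-smi source is right, the values it fails to fetch come from the mem_info_vram_* files that the amdgpu driver exposes in sysfs. A quick way to confirm they are missing on our kernel (the exact paths are my assumption, and the card index may differ):

# Hypothetical check: on a kernel with the above commit these files should
# exist; on 4.19.132 I'd expect "No such file or directory".
ls /sys/class/drm/card*/device/mem_info_vram_used \
   /sys/class/drm/card*/device/mem_info_vram_total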
Why do we need the memory usage metric? Is it critical?
In theory we can live without it, but in practice it would have been useful recently to debug some issues. For example, the last time Miriam contacted me it was because tensorflow wasn't able to launch any kernel on the GPU, and after some digging we noticed via the radeontop tool that all the RAM of the GPU was in use even though the overall usage % was zero. We are still not sure why this happens; my speculation is that an unclean shutdown of GPU tools may leave the device in an inconsistent state, but I'll need to research more. Having an alarm that compares GPU memory usage with overall % usage could help (see the sketch below).
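To sketch the kind of check such an alarm could run (assuming a kernel that exposes the amdgpu mem_info_vram_* and gpu_busy_percent sysfs files; the 90% threshold is a placeholder, not a tuned value):

#!/bin/bash
# Rough sketch of an alarm condition: VRAM almost full while the GPU engine
# is idle, i.e. the state that blocked tensorflow. Paths and thresholds are
# assumptions, not a tested implementation.
for dev in /sys/class/drm/card*/device; do
    [ -r "$dev/mem_info_vram_used" ] || continue
    used=$(cat "$dev/mem_info_vram_used")    # bytes
    total=$(cat "$dev/mem_info_vram_total")  # bytes
    busy=$(cat "$dev/gpu_busy_percent")      # 0-100
    vram_pct=$(( used * 100 / total ))
    if [ "$vram_pct" -ge 90 ] && [ "$busy" -eq 0 ]; then
        echo "WARNING: $dev VRAM at ${vram_pct}% but GPU usage at ${busy}%"
    fi
done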
What can we do to fix this?
I see on stat1005 that a 5.7 kernel is available via buster-backports:
elukey@stat1005:~$ apt-cache policy linux-image-amd64
linux-image-amd64:
  Installed: 4.19+105+deb10u5
  Candidate: 4.19+105+deb10u5
  Version table:
     5.7.10-1~bpo10+1 100
        100 http://mirrors.wikimedia.org/debian buster-backports/main amd64 Packages
 *** 4.19+105+deb10u5 500
        500 http://mirrors.wikimedia.org/debian buster/main amd64 Packages
        100 /var/lib/dpkg/status
But I am not sure if that would take us too far into the future or not :D We could also consider using the ROCm kernel drivers provided by AMD (rather than relying on the ones shipped with the Linux kernel), but that doesn't seem like a great idea to me.
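For completeness, if we did decide to go with the backports kernel, the install should be the standard backports invocation (plus a reboot):

sudo apt install -t buster-backports linux-image-amd64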