
AMD ROCm kernel drivers on stat1005/stat1008 don't support some features
Closed, Resolved · Public

Description

I have been working with Miriam to figure out why the rocm-smi tool (provided by ROCm) is not displaying the memory usage of our GPUs:

elukey@stat1005:~$ sudo /opt/rocm/bin/rocm-smi


========================ROCm System Management Interface========================
================================================================================
GPU  Temp   AvgPwr  SCLK    MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
1    25.0c  10.0W   852Mhz  167Mhz  14.9%  auto  170.0W  N/A    0%
================================================================================
==============================End of ROCm SMI Log ==============================

elukey@stat1005:~$ sudo /opt/rocm/bin/rocm-smi --showmeminfo vram --log=DEBUG


========================ROCm System Management Interface========================
================================================================================
DEBUG: Unable to get vram_used
ERROR: GPU[1] 		: Unable to get vram memory usage information
================================================================================
==============================End of ROCm SMI Log ==============================

The rocm-smi tool (which we also use to export metrics to Prometheus) looks for a certain location in sysfs, but doesn't find anything. I opened an issue upstream, https://github.com/RadeonOpenCompute/ROC-smi/issues/90, and I think the problem is that the kernel we are running (4.19.132) doesn't contain https://github.com/torvalds/linux/commit/55c374e9eb72be0de5d4fe2ef4d7803cd4ea6329, which is only present in 5.x if I read the tags correctly.

Why do we need the memory usage metric? Is it critical?
In theory we can live without it, but in practice it would have been useful recently to debug some issues. For example, the last time Miriam contacted me, TensorFlow wasn't able to launch any kernel on the GPU; after some attempts we noticed via the radeontop tool that all the RAM of the GPU was used, while the overall usage % was zero. We are still not sure why this happens; my speculation is that an unclean shutdown of GPU tools may lead to inconsistencies, but I'll need to research more. Having an alarm that monitors GPU memory usage vs overall usage % could help.
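The alarm idea above can be sketched against the amdgpu sysfs nodes that the missing kernel commit provides (mem_info_vram_used, mem_info_vram_total) together with gpu_busy_percent. This is only a minimal sketch: the card path, the helper names, and the 90% threshold are illustrative assumptions, not an agreed-on design:

```python
from pathlib import Path

# Sysfs nodes exposed by the amdgpu driver; the mem_info_* files are the ones
# added by the kernel commit referenced above. The card index is illustrative.
DEVICE = Path("/sys/class/drm/card0/device")

def read_int(path: Path) -> int:
    return int(path.read_text().strip())

def vram_wedged(vram_used: int, vram_total: int, gpu_busy_percent: int,
                mem_threshold: float = 0.9) -> bool:
    """Heuristic for the failure mode described above: VRAM nearly full
    while the GPU itself reports ~0% utilisation."""
    if vram_total == 0:
        return False
    return vram_used / vram_total >= mem_threshold and gpu_busy_percent == 0

def check_card(device: Path = DEVICE) -> bool:
    # Would raise FileNotFoundError on kernels lacking the mem_info_* nodes,
    # which is exactly the situation on 4.19 described in this task.
    return vram_wedged(
        read_int(device / "mem_info_vram_used"),
        read_int(device / "mem_info_vram_total"),
        read_int(device / "gpu_busy_percent"),
    )
```

For instance, the healthy values from the log further below (17817600 B used of 17163091968 B total) would not trigger the heuristic.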

What can we do to fix this?
I see on stat1005 that a 5.7 kernel is available via buster-backports:

elukey@stat1005:~$ apt-cache policy linux-image-amd64
linux-image-amd64:
  Installed: 4.19+105+deb10u5
  Candidate: 4.19+105+deb10u5
  Version table:
     5.7.10-1~bpo10+1 100
        100 http://mirrors.wikimedia.org/debian buster-backports/main amd64 Packages
 *** 4.19+105+deb10u5 500
        500 http://mirrors.wikimedia.org/debian buster/main amd64 Packages
        100 /var/lib/dpkg/status

But I am not sure whether that would take us too far into the future or not :D We could also consider using the ROCm kernel drivers provided by AMD (rather than relying on the ones shipped with the Linux kernel), but that doesn't seem like a great idea to me.

Event Timeline

There is 5.7.10-1~bpo10+1 in buster-backports, but it comes with many downsides, and I don't think the sysfs addition warrants them: there can be random incompatibilities between the buster userland and a 5.7 kernel, and in case of security issues we'd need to continuously upgrade kernels (as no changes get backported). https://github.com/torvalds/linux/commit/55c374e9eb72be0de5d4fe2ef4d7803cd4ea6329 also doesn't match the criteria to get accepted into the 4.19 LTS kernel Buster is based on. But it sounds like a good idea to test the DKMS drivers and use these until Bullseye.

Background info for @klausman: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU

We have two hosts with one AMD GPU each (stat1005.eqiad.wmnet and stat1008.eqiad.wmnet), both client nodes running Debian Buster. AMD is doing a great job keeping their whole stack and drivers (except microcode) open source, to the extent that the GPU drivers are in Linux mainline. When we started we chose to use the in-kernel drivers rather than the ones packaged by AMD, but as described above that is not enough for our use case. So this task should be about:

  1. add the http://repo.radeon.com/rocm/apt/3.3/pool/main/r/rock-dkms/ package to our internal apt repository in puppet, and force a sync on the apt host (apt1001.eqiad.wmnet) to make it available.
  2. deploy it on stat100[5,8] via puppet.
  3. make sure that everything works as expected, and that the bug mentioned in the description is gone.

I'll of course be available to help; the above is more a high-level overview :)
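For step 1, the apt sync on the apt host is driven by reprepro; a hypothetical "updates" stanza for the rocm33 component could look like the following. The Suite name, component mapping, and filter-list file name are all assumptions for illustration, not the actual operations/puppet configuration:

```text
# Hypothetical reprepro updates stanza (field values are assumptions)
Name: rocm33
Method: http://repo.radeon.com/rocm/apt/3.3/
Suite: xenial
Components: main>thirdparty/rocm33
Architectures: amd64
FilterList: deinstall rocm33-packages
```

The FilterList file would then whitelist rock-dkms (and the other ROCm packages we already mirror), so a `reprepro update` on the apt host pulls in only what we want.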

  1. add the http://repo.radeon.com/rocm/apt/3.3/pool/main/r/rock-dkms/ package to our internal apt repository in puppet, and force a sync on the apt host (apt1001.eqiad.wmnet) to make it available.

When we've confirmed that the updated drivers shipped by rock-dkms fix the issue, we also need to figure out a way to distribute them. By default DKMS compiles the module on each host, but there's also a mode to create debs and then deploy these via apt.wikimedia.org (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=554843)

So we should also explore this mode and wire it up on our build host (deneb.codfw.wmnet).
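As a sketch of that mode (an untested assumption, not a verified workflow): dkms can register, build, and then package the prebuilt modules as a binaries-only .deb via its mkbmdeb subcommand, which could then be imported into apt.wikimedia.org. Module name and version below are taken from the rock-dkms 3.3-19 install log later in this task; the kernel version matches the stat100x hosts:

```shell
#!/bin/sh
# Sketch: build a binary .deb from the rock-dkms sources on the build host,
# instead of compiling on every target node. Untested assumption; the dkms
# subcommands used here exist in Debian's dkms, but the exact flow (mkdeb vs
# mkbmdeb) would need verifying on deneb.
set -eu
MODULE=amdgpu
VERSION=3.3-19
KERNEL=4.19.0-10-amd64

run() {
    # With DRY_RUN=1 (the default here) just print the commands.
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi
}

build_bmdeb() {
    run dkms add -m "$MODULE" -v "$VERSION"                   # register sources
    run dkms build -m "$MODULE" -v "$VERSION" -k "$KERNEL"    # compile modules
    run dkms mkbmdeb -m "$MODULE" -v "$VERSION" -k "$KERNEL"  # binaries-only deb
    # The resulting package should land under /var/lib/dkms/$MODULE/$VERSION/bmdeb/
}

build_bmdeb
```

Set DRY_RUN=0 to actually execute the commands on a host that has the rock-dkms sources and kernel headers installed.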

Change 626112 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aptrepo: add rock-dkms in the list of packages for the rocm33 component

https://gerrit.wikimedia.org/r/626112

Mentioned in SAL (#wikimedia-analytics) [2020-09-09T10:11:28Z] <klausman> Rebooting stat1005 for clearing GPU status and testing new DKMS driver (T260442)

Mentioned in SAL (#wikimedia-operations) [2020-09-09T10:11:32Z] <klausman> Rebooting stat1005 for clearing GPU status and testing new DKMS driver (T260442)

Notes from the install:

  • rmmod amdgpu segfaulted. Not very encouraging. rock-dkms comes with a module blacklist, so another reboot will likely clear that wedged state
  • Without kernel headers, rock-dkms simply doesn't compile the kernel module, and the warning message is easy to miss

Install log:

Setting up rock-dkms (3.3-19) ...
Removing old amdgpu-3.3-19 DKMS files...

-------- Uninstall Beginning --------
Module:  amdgpu
Version: 3.3-19
Kernel:  4.19.0-10-amd64 (amd64)
-------------------------------------

Status: Before uninstall, this module version was ACTIVE on this kernel.

amdgpu.ko:
 - Uninstallation
   - Deleting from: /lib/modules/4.19.0-10-amd64/updates/dkms/
rmdir: failed to remove 'updates/dkms': Directory not empty
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.


amdttm.ko:
 - Uninstallation
   - Deleting from: /lib/modules/4.19.0-10-amd64/updates/dkms/
rmdir: failed to remove 'updates/dkms': Directory not empty
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.


amdkcl.ko:
 - Uninstallation
   - Deleting from: /lib/modules/4.19.0-10-amd64/updates/dkms/
rmdir: failed to remove 'updates/dkms': Directory not empty
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.


amd-sched.ko:
 - Uninstallation
   - Deleting from: /lib/modules/4.19.0-10-amd64/updates/dkms/
 - Original module
   - No original module was found for this module on this kernel.
   - Use the dkms install command to reinstall any previous module version.


Running the post_remove script:
depmod.....

update-initramfs........

DKMS: uninstall completed.

------------------------------
Deleting module version: 3.3-19
completely from the DKMS tree.
------------------------------
Done.
Loading new amdgpu-3.3-19 DKMS files...
Building for 4.19.0-10-amd64
Building for architecture amd64
Building initial module for 4.19.0-10-amd64
Done.
Forcing installation of amdgpu

amdgpu.ko:
Running module version sanity check.
 - Original module
 - Installation
   - Installing to /lib/modules/4.19.0-10-amd64/updates/dkms/

amdttm.ko:
Running module version sanity check.
 - Original module
 - Installation
   - Installing to /lib/modules/4.19.0-10-amd64/updates/dkms/

amdkcl.ko:
Running module version sanity check.
 - Original module
 - Installation
   - Installing to /lib/modules/4.19.0-10-amd64/updates/dkms/

amd-sched.ko:
Running module version sanity check.
 - Original module
 - Installation
   - Installing to /lib/modules/4.19.0-10-amd64/updates/dkms/

Running the post_install script:
update-initramfs: Generating /boot/initrd.img-4.19.0-10-amd64
W: Possible missing firmware /lib/firmware/tigon/tg3_tso5.bin for module tg3
W: Possible missing firmware /lib/firmware/tigon/tg3_tso.bin for module tg3
W: Possible missing firmware /lib/firmware/tigon/tg3.bin for module tg3

depmod...

Backing up initrd.img-4.19.0-10-amd64 to /boot/initrd.img-4.19.0-10-amd64.old-dkms
Making new initrd.img-4.19.0-10-amd64
(If next boot fails, revert to initrd.img-4.19.0-10-amd64.old-dkms image)
update-initramfs.......

DKMS: install completed.

rocm-smi now logs the memory usage:

root@stat1005:~# /opt/rocm/bin/rocm-smi --showmeminfo vram --log=DEBUG


========================ROCm System Management Interface========================
================================================================================
DEBUG: GPU[1] 		: vram Total Memory (B): 17163091968
GPU[1] 		: vram Total Memory (B): 17163091968
DEBUG: GPU[1] 		: vram Total Used Memory (B): 17817600
GPU[1] 		: vram Total Used Memory (B): 17817600
================================================================================
==============================End of ROCm SMI Log ==============================

I just did a few tests on stat1005 (image classification using 2 different classifiers) with the new settings.
The GPU is detected and parallel tasks are running smoothly without conflict.

elukey and I discussed a bit how we will proceed from here. Open things:

  • Update performance stuff so we get more insight in Grafana (T262427)
  • Let the new driver soak in 3.3 for ten days or so, to see if there are any issues.
  • After the soak time, update stat1008 to the same level.
  • In parallel to all this, consider bumping ROCm to 3.7 (the latest version).

Looking at the install procedure for the upstream ROCm drivers, we considered turning the DKMS package (compiling the driver(s) ad hoc during install) into a more static one (no compilation, just installing binary files).

This would have the upside that we know for sure which binaries are installed (reproducible builds), as well as decreasing the time and complexity involved (fewer things can break).

It does, however, also have considerable downsides: we would have to put in both the up-front work to make a static package out of the DKMS one, as well as maintain that package as time goes on. On top of this, if it breaks, we likely are on our own, since roughly nobody else would run such a setup.

I think with our current way of running things (less than 100 machines, and not much churn in updates), the downsides outweigh the upsides by quite a margin. Thus I believe we should keep using the upstream DKMS packages unless the outside parameters change.

Change 626112 merged by Elukey:
[operations/puppet@production] aptrepo: add rock-dkms in the list of packages for the rocm33 component

https://gerrit.wikimedia.org/r/626112

klausman reopened this task as Open.
klausman moved this task from In Progress to Done on the Analytics-Kanban board.