Today I reimaged an-worker1096, one of the Hadoop workers with a GPU, and deployed ROCm 3.8 on top (the other Stretch nodes run 3.3 with the DKMS package, while on Buster we prefer to rely on the 5.x kernel's drivers).
Puppet complained that rocm-dev could not be installed because rocm-gdb requires libpython3.8. We have ROCm 3.8 on stat100[5,8] too, both on Buster, so why doesn't it work on an-worker1096?
elukey@stat1005:~$ sudo apt-cache policy rocm-gdb
rocm-gdb:
  Installed: 9.2-rocm-rel-3.7-20
  Candidate: 9.2-rocm-rel-3.8-30
  Version table:
     9.2-rocm-rel-3.8-30 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/thirdparty/amd-rocm38 amd64 Packages
 *** 9.2-rocm-rel-3.7-20 100
        100 /var/lib/dpkg/status
The package version embeds the ROCm release, 3.7 or 3.8: the former depends on libpython3.7, the latter on libpython3.8. On stat1005 the 3.7 version was probably never cleaned up, so everything kept working. To unblock an-worker1096 I just copied the 3.7 deb from stat1005 via transfer.py (I know it is horrible). We have libpython3.8 from the pyall component, but it is a virtual package, and apt complains that it cannot be installed.
This is only one example of the problems that we'll have with ROCm in the future. We should decide what to do, and also establish a better procedure to wipe/upgrade a host (to avoid repeating the stat1005 case).
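As a starting point for such a procedure, here is a minimal sketch of a check that flags a host where the installed ROCm package comes from an older release than the candidate, which is the stat1005 situation. It only parses text in the format printed by apt-cache policy; the function names are hypothetical, and it assumes that ROCm package versions always embed the release as "rocm-rel-<major>.<minor>", as the rocm-gdb versions above do.

```python
import re

# Assumption: ROCm package versions embed the release as "rocm-rel-X.Y",
# e.g. "9.2-rocm-rel-3.8-30". Other packages may not follow this scheme.
ROCM_REL = re.compile(r"rocm-rel-(\d+\.\d+)")


def parse_policy(text):
    """Return (installed, candidate) version strings from `apt-cache policy` output."""
    installed = re.search(r"Installed:\s*(\S+)", text)
    candidate = re.search(r"Candidate:\s*(\S+)", text)
    return (
        installed.group(1) if installed else None,
        candidate.group(1) if candidate else None,
    )


def rocm_release(version):
    """Extract the ROCm release (e.g. "3.8") from a version string, or None."""
    match = ROCM_REL.search(version or "")
    return match.group(1) if match else None


def is_stale(text):
    """True when installed and candidate versions belong to different ROCm releases."""
    installed, candidate = parse_policy(text)
    inst_rel = rocm_release(installed)
    cand_rel = rocm_release(candidate)
    return inst_rel is not None and cand_rel is not None and inst_rel != cand_rel


# Sample taken from the stat1005 output above.
SAMPLE = """rocm-gdb:
  Installed: 9.2-rocm-rel-3.7-20
  Candidate: 9.2-rocm-rel-3.8-30
"""

if __name__ == "__main__":
    print("stale ROCm package:", is_stale(SAMPLE))
```

On a real host the text would come from running apt-cache policy for each ROCm package; a package that is not installed at all shows "(none)" and is not reported as stale by this sketch.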
More info in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU