Review ROCm deployment procedures and current packages
Closed, Declined (Public)

Description

Today I reimaged an-worker1096, one of the Hadoop workers with a GPU, and deployed ROCm 3.8 on top (the other Stretch nodes run 3.3 with the DKMS package, whereas on Buster we prefer to rely on the 5.x kernel's drivers).

Puppet complained that rocm-dev could not be installed because rocm-gdb requires libpython3.8. We have ROCm 3.8 on stat100[5,8] too, both on Buster, so why doesn't it work on an-worker1096?

elukey@stat1005:~$ sudo apt-cache policy rocm-gdb
rocm-gdb:
  Installed: 9.2-rocm-rel-3.7-20
  Candidate: 9.2-rocm-rel-3.8-30
  Version table:
     9.2-rocm-rel-3.8-30 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/thirdparty/amd-rocm38 amd64 Packages
 *** 9.2-rocm-rel-3.7-20 100
        100 /var/lib/dpkg/status

The package version embeds the ROCm release (3.7 and 3.8 here): the former depends on libpython3.7, the latter on libpython3.8. On stat1005 we probably never cleaned up the 3.7 version, so everything kept working. To unblock an-worker1096 I just copied the 3.7 deb from stat1005 via transfer.py (I know it is horrible). We have libpython3.8 from the pyall component, but it is a virtual package, and apt complains that it cannot be installed.

This is only one example of the problems that we'll have with ROCm in the future. We should decide what to do, and also establish a better procedure to wipe/upgrade a host (to avoid repeating the stat1005 case).
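
A leftover package like the one on stat1005 can be spotted mechanically. The following is a minimal sketch, assuming the `apt-cache policy` output format shown above and assuming the ROCm release always appears as `rocm-rel-X.Y` in the version string (true for the two versions seen here, not verified for others). The function names are hypothetical.

```python
import re

# Sample taken from the stat1005 output above.
SAMPLE = """\
rocm-gdb:
  Installed: 9.2-rocm-rel-3.7-20
  Candidate: 9.2-rocm-rel-3.8-30
"""


def parse_policy(text):
    """Return (installed, candidate) version strings from `apt-cache policy` output."""
    installed = re.search(r"^\s*Installed:\s*(\S+)", text, re.MULTILINE)
    candidate = re.search(r"^\s*Candidate:\s*(\S+)", text, re.MULTILINE)
    return (
        installed.group(1) if installed else None,
        candidate.group(1) if candidate else None,
    )


def rocm_release(version):
    """Extract the ROCm release (e.g. '3.8') embedded in a package version, if any."""
    match = re.search(r"rocm-rel-(\d+\.\d+)", version or "")
    return match.group(1) if match else None


def is_stale(text):
    """True when an installed package belongs to a different ROCm release than the candidate."""
    installed, candidate = parse_policy(text)
    installed_release = rocm_release(installed)
    # A package that is not installed at all (or has no release tag) is not stale.
    return installed_release is not None and installed_release != rocm_release(candidate)


if __name__ == "__main__":
    installed, candidate = parse_policy(SAMPLE)
    print(rocm_release(installed), rocm_release(candidate), is_stale(SAMPLE))
```

Running a check of this kind on every ROCm package before and after an upgrade would flag hosts where an old release is still installed.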

More info in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU

Event Timeline

Two things: The pyall component isn't enabled on an-worker1096, so that's expected? And this seems to be a bug in the package, which should be reported? We're importing these unmodified and they are supposed to be built for xenial, which only includes Python 3.5, so having a dep on libpython3.8 is a bug on their end?

@MoritzMuehlenhoff you are completely right, we indeed don't include pyall on Buster nodes (since we have profile::python37). On Friday I tried to add the pyall component but probably used the wrong one (likely the one for Stretch); now I retried with a fresher mind and indeed libpython3.8 can be installed.

I'll follow up with upstream, but in the meantime what should we do? ROCm released 4.0 and I think this problem is still there (will need to check). Maybe for the time being we could include python38 on those nodes? Otherwise we can use the package from the 3.7 release as I did with an-worker1096, but it seems a little hacky (since we'll also deploy ROCm on ml-serve etc.).

https://github.com/RadeonOpenCompute/ROCm/tree/roc-3.7.x
https://github.com/RadeonOpenCompute/ROCm/tree/roc-3.8.x

Even in 4.x they support Ubuntu 18.x and 20.x, so in theory Python 3.6+ afaics. There shouldn't be a hard requirement on libpython3.8 for the rocm-gdb package (which is also a mandatory dependency of rocm-dev).

Found https://github.com/RadeonOpenCompute/ROCm/issues/1236, which is exactly our issue. It seems that they are not really going to do anything about it.

> I'll follow up with upstream, but in the meantime what should we do? ROCm released 4.0 and I think this problem is still there (will need to check). Maybe for the time being we could include python38 on those nodes? Otherwise we can use the package from the 3.7 release as I did with an-worker1096, but it seems a little hacky (since we'll also deploy ROCm on ml-serve etc.).

Let's add a profile similar to profile::python37, but for 3.8, which pulls in pyall. Given that Py 3.8 is only used for gdb, that seems harmless enough.

> Found https://github.com/RadeonOpenCompute/ROCm/issues/1236, which is exactly our issue. It seems that they are not really going to do anything about it.

That task is wrong in so many aspects...
"As we do not support Ubuntu 16.04 officially, we recommend to try with Ubuntu 18.04.x or Ubuntu 20.04.x and comeback with your observations."

  1. the Dists file they are releasing the debs in is for xenial, which is 16.04...
  2. 18.04 doesn't have 3.8 either

Ideally they would simply publish their deb-src lines, so that people can simply rebuild in a sane manner...
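
The mismatch in the two points above can be made concrete with a small lookup. This is a minimal sketch, assuming only each distribution's default python3 version (extra components such as pyall are deliberately ignored) and assuming that a dependency on a versioned libpython package requires that exact minor version. The table and the function name are illustrative, not part of any packaging tool.

```python
# Default python3 version shipped by each release (distribution defaults only;
# add-on components such as pyall are not considered here).
DEFAULT_PYTHON = {
    "ubuntu-16.04 (xenial)": (3, 5),
    "ubuntu-18.04 (bionic)": (3, 6),
    "ubuntu-20.04 (focal)": (3, 8),
    "debian-10 (buster)": (3, 7),
}


def releases_with(required):
    """Return the releases whose default python3 matches the required (major, minor)."""
    return sorted(
        release for release, version in DEFAULT_PYTHON.items() if version == required
    )


if __name__ == "__main__":
    # rocm-gdb from the 3.8 release depends on libpython3.8.
    print(releases_with((3, 8)))
```

Under these assumptions only one release in the table can satisfy the libpython3.8 dependency out of the box, and it is neither the xenial target of the repository nor 18.04.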

Change 667575 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::amd_gpu: add python 3.8 on Buster

https://gerrit.wikimedia.org/r/667575

elukey triaged this task as Medium priority. Mar 1 2021, 11:56 AM

Change 667575 merged by Elukey:
[operations/puppet@production] profile::amd_gpu: add python 3.8 on Buster

https://gerrit.wikimedia.org/r/667575

The patch seems to work: I was able to install rocm-gdb v3.8 on an-worker1096 :)

Tobias opened https://github.com/RadeonOpenCompute/ROCm/issues/1396

New procedure found and documented in T295661