
Upgrade ROCm to 5.4
Closed, Resolved, Public

Description

With https://github.com/tensorflow/io/pull/1551 we will be able to use tensorflow-io and tensorflow-rocm (the io package contains functionalities like an HDFS client and it was created for the release of tensorflow 2.6).

We were not able to upgrade to ROCm 4.3.1 due to this problem, but we should now be able to upgrade to something like 4.5 once upstream releases the new tensorflow-io package containing the fix (likely version 0.23).

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Upgrade_the_Debian_packages

Event Timeline

Change 738615 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Import new ROCm version 4.5.1

https://gerrit.wikimedia.org/r/738615

From https://github.com/RadeonOpenCompute/ROCm/issues/761 it seems that hsa-ext-rocr-dev is not a concern anymore, so we can simplify the deployment procedure even further.

Change 738615 merged by Elukey:

[operations/puppet@production] Import new ROCm version 4.5

https://gerrit.wikimedia.org/r/738615

Mentioned in SAL (#wikimedia-operations) [2021-11-15T15:15:16Z] <elukey> reprepro --delete clearvanished on apt1001 to clean-up thirdparty/amd-rocm38 (buster and stretch) - T295661

Change 738947 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] aptrepo: update amd-rocm45 component's suite

https://gerrit.wikimedia.org/r/738947

Change 738947 merged by Elukey:

[operations/puppet@production] aptrepo: update amd-rocm45 component's suite

https://gerrit.wikimedia.org/r/738947

Mentioned in SAL (#wikimedia-operations) [2021-11-15T15:24:29Z] <elukey> import AMD ROCm 4.5 in thirdparty/amd-rocm45 for buster-wikimedia - T295661

ROCm 4.5 imported in apt. Next steps:

  • Wait for the release of the pypi package tensorflow-io
  • Test the new suite on one node (will need the help of @Miriam)

Time flies and both ROCm and tensorflow-io got several releases.

https://github.com/tensorflow/io/releases/tag/v0.23.0 is out and contains the pull request that I made for tensorflow-io (to allow tensorflow-rocm) so in theory we could test ROCm 4.5 and see if we can proceed (even if they have already released 5.x).

@Miriam do you have any preference? Nothing really urgent :)

Upstream has already reached 5.x, so we should probably upgrade to a more recent version as well, both to keep up and to get better support (especially if we want to support more up-to-date GPUs).

Upgrading the ROCm version, and ideally having all hosts with GPUs use the same version, would be great.

I tried to use the recently released pytorch 2.0 with the AMD GPUs on the stat machines, and to my surprise it seems to work out of the box despite the ROCm version mismatch (4.x currently installed on the hosts vs. the 5.4.2 that the wheels target).

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2

I have done limited testing, but huggingface models (e.g. stable diffusion, sentence transformers for text embeddings) work with the GPUs. This is promising: my previous GPU experiences were finicky and tensorflow oriented. Upgrading ROCm to the version required by pytorch would further increase my optimism.
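A quick way to see which ROCm/HIP version an installed wheel was built against, and whether it can see a GPU at all, is sketched below. It is only an illustration: the helper name `rocm_build_info` is made up for this example, and the function returns None when torch is not installed in the current environment.

```python
import importlib


def rocm_build_info():
    """Return basic facts about the installed PyTorch build, or None if torch is missing."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    return {
        "torch_version": torch.__version__,
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        # ROCm builds expose AMD GPUs through the torch.cuda API.
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(rocm_build_info())
```

Comparing `hip_version` with the ROCm packages installed on the host makes the kind of mismatch described above visible before running a real workload.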

This is the complete snippet to use an AMD GPU for stable diffusion:

from diffusers import StableDiffusionPipeline
# ROCm builds of PyTorch expose AMD GPUs through the 'cuda' device name.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to('cuda')
prompt = 'drawing of a wikipedia moderator arguing with an editor guilty of vandalism. colorful, playful, 4K.'
# The pipeline returns a list of PIL images; take the first one.
pipe(prompt, height=720, width=720).images[0]

image.png (720×720 px, 1 MB)

Hi @fkaelin, we'll definitely try to upgrade during the next quarter to the latest ROCm release :)

Change 908208 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] aptrepo: import AMD ROCm 5.4 to bullseye-wikimedia

https://gerrit.wikimedia.org/r/908208

Change 908208 merged by Elukey:

[operations/puppet@production] aptrepo: import AMD ROCm 5.4 to bullseye-wikimedia

https://gerrit.wikimedia.org/r/908208

Mentioned in SAL (#wikimedia-operations) [2023-04-12T13:26:11Z] <elukey> upload AMD ROCm 5.4 debian packages to wikimedia-bullseye:thirdparty/amd-rocm54 - T295661

The first attempt on dse-k8s-worker1001 ended with some errors, among them:

The following packages have unmet dependencies:
 rocm-llvm : Depends: libstdc++-5-dev but it is not installable or
                      libstdc++-7-dev but it is not installable or
                      libstdc++-11-dev but it is not installable
             Depends: libgcc-5-dev but it is not installable or
                      libgcc-7-dev but it is not installable or
                      libgcc-11-dev but it is not installable

On Debian bullseye I see libstdc++-(9|10)-dev and libgcc-(9|10)-dev, so I am not sure why there is this difference.
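To make the mismatch explicit, the apt error can be parsed into groups of alternatives and compared with what bullseye ships. This is a rough sketch: `parse_unmet` and `installable_alternatives` are hypothetical helpers written for this comment, and the `BULLSEYE` set only contains the four packages named above, not the full archive.

```python
import re

APT_OUTPUT = """\
The following packages have unmet dependencies:
 rocm-llvm : Depends: libstdc++-5-dev but it is not installable or
                      libstdc++-7-dev but it is not installable or
                      libstdc++-11-dev but it is not installable
             Depends: libgcc-5-dev but it is not installable or
                      libgcc-7-dev but it is not installable or
                      libgcc-11-dev but it is not installable
"""

# Only the packages mentioned in this task, not a full package index.
BULLSEYE = {"libstdc++-9-dev", "libstdc++-10-dev", "libgcc-9-dev", "libgcc-10-dev"}


def parse_unmet(text):
    """Map each package to a list of 'Depends' groups; each group lists its alternatives."""
    result = {}
    package = None
    for line in text.splitlines():
        match = re.match(r"\s*(\S+) : (.*)", line)
        if match:
            # A "<package> : ..." line starts the section for a new package.
            package, rest = match.group(1), match.group(2)
            result[package] = []
        else:
            rest = line.strip()
        if package is None or not rest:
            continue
        if rest.startswith("Depends: "):
            # Each "Depends:" starts a new group of "or" alternatives.
            result[package].append([])
            rest = rest[len("Depends: "):]
        if result[package]:
            result[package][-1].append(rest.split()[0])
    return result


def installable_alternatives(groups, available):
    """For each dependency group, keep only the alternatives present in 'available'."""
    return [[name for name in group if name in available] for group in groups]


if __name__ == "__main__":
    groups = parse_unmet(APT_OUTPUT)["rocm-llvm"]
    print(installable_alternatives(groups, BULLSEYE))
```

An empty list for a group means that none of its alternatives can be satisfied, which is the case for both the libstdc++ and the libgcc groups here.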

Had a chat with Moritz; here are some relevant readings:

rocm-llvm is among the dependencies of rocm-dev and others, so we are not able to drop it:

elukey@dse-k8s-worker1001:~$ apt-cache rdepends rocm-llvm
rocm-llvm
Reverse Depends:
  rocm-dev
  hip-runtime-amd
  rocm-clang-ocl
  openmp-extras-dev

We need to find a workaround for gcc and stdc++ :)

Change 908474 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] aptrepo: add more packages in the update list of rocm54

https://gerrit.wikimedia.org/r/908474

I followed what is outlined in https://github.com/RadeonOpenCompute/ROCm/issues/1125#issuecomment-925362329 and created the two fake packages; it worked on dse-k8s-worker1001. I will add those packages to the rocm54 component.
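For reference, a minimal sketch of what such a placeholder package can look like, assuming it is built with the Debian equivs tool. The package names used below (libstdc++-7-dev, libgcc-7-dev) are just picked from the alternatives in the apt error for illustration and may differ from what was actually built; `equivs_control` is a hypothetical helper.

```python
def equivs_control(package, version="1.0", description=None):
    """Render an equivs control file for an empty placeholder package."""
    if description is None:
        description = f"Placeholder for {package} to satisfy rocm-llvm"
    source_fields = [
        ("Section", "misc"),
        ("Priority", "optional"),
        ("Standards-Version", "3.9.2"),
    ]
    package_fields = [
        ("Package", package),
        ("Version", version),
        ("Description", description),
    ]

    def render(fields):
        return "".join(f"{key}: {value}\n" for key, value in fields)

    # The equivs template separates the two paragraphs with a blank line.
    return render(source_fields) + "\n" + render(package_fields)


if __name__ == "__main__":
    # Illustrative names only, see the note above.
    for name in ("libstdc++-7-dev", "libgcc-7-dev"):
        print(equivs_control(name))
```

Each rendered file can then be saved and passed to `equivs-build` to produce a .deb that satisfies the dependency without installing any real files.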

Change 908476 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] adm_rocm: add support for Debian Bullseye

https://gerrit.wikimedia.org/r/908476

Change 908474 merged by Elukey:

[operations/puppet@production] aptrepo: add more packages in the update list of rocm54

https://gerrit.wikimedia.org/r/908474

Change 908476 merged by Elukey:

[operations/puppet@production] amd_rocm: add support for Debian Bullseye

https://gerrit.wikimedia.org/r/908476

Change 908485 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] prometheus: use python3 for /usr/local/bin/prometheus-amd-rocm-stats

https://gerrit.wikimedia.org/r/908485

Change 908485 merged by Elukey:

[operations/puppet@production] prometheus: use python3 for /usr/local/bin/prometheus-amd-rocm-stats

https://gerrit.wikimedia.org/r/908485

I updated the docs and was able to run tensorflow on dse-k8s-worker1001 successfully. The remaining issue is adding the proper users to the render group, so that the GPU is accessible. We need to figure out the best way to do that in the k8s context.

The last issue has been fixed in T333009: for k8s nodes we just allow others to read the devices.
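A small sketch to check what access a device node currently grants, useful to verify either approach (render group membership or access for others). The device paths in the example are the usual ROCm ones (/dev/kfd and a /dev/dri/renderD* node) and are an assumption, as the render node number can differ per host; `device_access` is a hypothetical helper.

```python
import os
import stat


def device_access(path):
    """Summarise the permissions of a device node; None if the path does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    mode = st.st_mode
    return {
        "mode": stat.filemode(mode),
        "gid": st.st_gid,
        # True when the owning group (e.g. render) can both read and write.
        "group_rw": bool(mode & stat.S_IRGRP) and bool(mode & stat.S_IWGRP),
        "others_read": bool(mode & stat.S_IROTH),
        "others_write": bool(mode & stat.S_IWOTH),
    }


if __name__ == "__main__":
    # Assumed device paths, adjust to the host.
    for dev in ("/dev/kfd", "/dev/dri/renderD128"):
        print(dev, device_access(dev))
```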

The new ROCm suite has been imported for Bullseye only; the stat100x nodes will get it once they are upgraded (soon, since DE is migrating hosts to Bullseye).

elukey renamed this task from Upgrade ROCm to 4.5 to Upgrade ROCm to 5.4. Apr 27 2023, 8:36 AM