Upgrade ROCm to 4.5
Open, Needs Triage, Public

Description

With https://github.com/tensorflow/io/pull/1551 we will be able to use tensorflow-io together with tensorflow-rocm. (The io package provides functionality such as an HDFS client, and it was created for the release of tensorflow 2.6.)

We were not able to upgrade to ROCm 4.3.1 because of this incompatibility, but we should now be able to upgrade to something like 4.5 once upstream releases the new tensorflow-io package containing the fix (likely version 0.23).

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Upgrade_the_Debian_packages

Event Timeline

Change 738615 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Import new ROCm version 4.5.1

https://gerrit.wikimedia.org/r/738615

From https://github.com/RadeonOpenCompute/ROCm/issues/761 it seems that hsa-ext-rocr-dev is not a concern anymore, so we can simplify the deployment procedure even further.

Change 738615 merged by Elukey:

[operations/puppet@production] Import new ROCm version 4.5

https://gerrit.wikimedia.org/r/738615

Mentioned in SAL (#wikimedia-operations) [2021-11-15T15:15:16Z] <elukey> reprepro --delete clearvanished on apt1001 to clean-up thirdparty/amd-rocm38 (buster and stretch) - T295661

Change 738947 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] aptrepo: update amd-rocm45 component's suite

https://gerrit.wikimedia.org/r/738947

Change 738947 merged by Elukey:

[operations/puppet@production] aptrepo: update amd-rocm45 component's suite

https://gerrit.wikimedia.org/r/738947

Mentioned in SAL (#wikimedia-operations) [2021-11-15T15:24:29Z] <elukey> import AMD ROCm 4.5 in thirdparty/amd-rocm45 for buster-wikimedia - T295661

ROCm 4.5 has been imported into apt. Next steps:

  • Wait for the release of the tensorflow-io PyPI package that contains the fix
  • Test the new suite on one node (will need the help of @Miriam)
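
Before testing on a node, it may help to confirm that the installed tensorflow-io is recent enough to contain the fix. The following is only a minimal sketch: the 0.23 minimum is an assumption taken from the "likely version 0.23" estimate in the description, and the helper names (parse_version, meets_minimum, installed_version) are hypothetical, not part of any existing tooling.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.23.1' into (0, 23, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum` (numeric parts only)."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '0.23' compares equal to '0.23.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: 0.23 is the first release with the fix ("likely" per the task).
    MINIMUM = "0.23"
    found = installed_version("tensorflow-io")
    if found is None:
        print("tensorflow-io is not installed")
    else:
        status = "ok" if meets_minimum(found, MINIMUM) else "too old"
        print(f"tensorflow-io {found}: {status}")
```

The comparison looks only at the leading numeric components, so pre-release suffixes are ignored. If the fix ends up in a different release, only MINIMUM needs to change.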