Update ROCm version on GPU instances.
Closed, ResolvedPublic

Description

While evaluating pre-built ML models for a portion of T250436, one of the most promising we evaluated is provided through onnxruntime. It runs acceptably on CPUs, but ideally it would run on our GPU instance as well. ROCm support in onnxruntime is provided through MIGraphX, AMD's inference acceleration engine. According to the ROCm repositories, MIGraphX was added in 3.9, but we are currently on 3.8. It's not clear exactly which version of MIGraphX is required, but updating to at least ROCm 3.9 would allow us to evaluate it.
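
For anyone repeating the evaluation, here is a minimal sketch of how one might check whether an onnxruntime build exposes MIGraphX and fall back to CPU otherwise. The helper names (`pick_providers`, `available_providers`), the preference order and the model path are assumptions made for illustration, not something taken from this task. `onnxruntime.get_available_providers()` and the provider names are the ones onnxruntime registers.

```python
from typing import List, Sequence

# Preference order is an assumption: try MIGraphX first, then plain CPU.
PREFERRED_PROVIDERS = (
    "MIGraphXExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(
    available: Sequence[str],
    preferred: Sequence[str] = PREFERRED_PROVIDERS,
) -> List[str]:
    """Return the preferred providers that are available, in preference order."""
    chosen = [p for p in preferred if p in available]
    # Fall back to CPU so session creation never receives an empty list.
    return chosen or ["CPUExecutionProvider"]


def available_providers() -> List[str]:
    """Ask onnxruntime which providers this build exposes (CPU only if not installed)."""
    try:
        import onnxruntime as ort
    except ImportError:
        return ["CPUExecutionProvider"]
    return list(ort.get_available_providers())


if __name__ == "__main__":
    providers = pick_providers(available_providers())
    print("Would create the session with providers:", providers)
    # Hypothetical usage; "model.onnx" is a placeholder path:
    # session = ort.InferenceSession("model.onnx", providers=providers)
```

On a host where the MIGraphX provider is missing (for example because the installed ROCm release predates it), the sketch simply reports the CPU provider, which matches the current behaviour described above.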

Event Timeline

+1 for upgrading ROCm to support ONNX runtime. It's certainly worth evaluating imo, as it seems that ONNX would let us use an AMD GPU with any ML framework.

odimitrijevic added a subscriber: odimitrijevic.

@elukey is this work that the ML team plans on implementing/owning?

@odimitrijevic yes definitely we can work on it (either by ourselves or working with Ben/Razzi if they are not super busy with other projects). Lemme know :)

odimitrijevic edited projects, added Analytics-Radar; removed Analytics.
odimitrijevic added subscribers: BTullis, razzi.

That's great! @BTullis and @razzi are busy but do reach out if you have any questions.

Change 725887 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] amd_rocm: import ROCm suite 4.3.1

https://gerrit.wikimedia.org/r/725887

Change 725887 merged by Elukey:

[operations/puppet@production] amd_rocm: import ROCm suite 4.3.1

https://gerrit.wikimedia.org/r/725887

Change 725904 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] aptrepo: add missing amd-rocm431 settings

https://gerrit.wikimedia.org/r/725904

Change 725904 merged by Elukey:

[operations/puppet@production] aptrepo: add missing amd-rocm431 settings

https://gerrit.wikimedia.org/r/725904

Mentioned in SAL (#wikimedia-operations) [2021-10-04T14:19:57Z] <elukey> import AMD ROCm 4.3.1 packages in buster-wikimedia's thirdparty/amd-rocm431 - T287267

Change 726389 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set new AMD ROCm version for an-worker1096

https://gerrit.wikimedia.org/r/726389

Change 726507 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] amd_rocm: update settings/packages for ROCm 4.3.1

https://gerrit.wikimedia.org/r/726507

Change 726507 merged by Elukey:

[operations/puppet@production] amd_rocm: update settings/packages for ROCm 4.3.1

https://gerrit.wikimedia.org/r/726507

Change 726389 merged by Elukey:

[operations/puppet@production] Set new AMD ROCm version for an-worker1096

https://gerrit.wikimedia.org/r/726389

Change 726539 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Upgrade all an-workers with GPUs to ROCm 4.3.1

https://gerrit.wikimedia.org/r/726539

Change 726539 merged by Elukey:

[operations/puppet@production] Upgrade all an-workers with GPUs to ROCm 4.3.1

https://gerrit.wikimedia.org/r/726539

Change 726578 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move stat100[5,8] to AMD ROCm 4.3.1

https://gerrit.wikimedia.org/r/726578

Change 726606 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] amd_rocm: add support for ROCm 4.2

https://gerrit.wikimedia.org/r/726606

Change 726606 merged by Elukey:

[operations/puppet@production] amd_rocm: add support for ROCm 4.2

https://gerrit.wikimedia.org/r/726606

Change 726611 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Downgrade AMD ROCm to 4.2 (from 4.3.1) on an-worker1096

https://gerrit.wikimedia.org/r/726611

Mentioned in SAL (#wikimedia-operations) [2021-10-05T12:43:25Z] <elukey> import AMD ROCm 4.2 to buster-wikimedia's thirdparty/amd-rocm42 - T287267

Change 726611 merged by Elukey:

[operations/puppet@production] Downgrade AMD ROCm to 4.2 (from 4.3.1) on an-worker1096

https://gerrit.wikimedia.org/r/726611

Change 726619 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Downgrade AMD ROCm to 4.2 on all GPU-based Hadoop workers

https://gerrit.wikimedia.org/r/726619

Change 726619 merged by Elukey:

[operations/puppet@production] Downgrade AMD ROCm to 4.2 on all GPU-based Hadoop workers

https://gerrit.wikimedia.org/r/726619

Change 726759 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] prometheus-amd-rocm-stats.py: support ROCm 4.2.0's smi output

https://gerrit.wikimedia.org/r/726759

Change 726759 merged by Elukey:

[operations/puppet@production] prometheus-amd-rocm-stats.py: support ROCm 4.2.0's smi output

https://gerrit.wikimedia.org/r/726759

To keep archives happy:

  • We decided to target ROCm 4.3.1 (current latest upstream) and tensorflow-rocm 2.6.
  • Instead of rolling out the packages on stat100[5,8], we started from the Hadoop workers.
  • Basic checks after install went fine, but then we realized that tensorflow-rocm 2.6 (following its tensorflow counterpart) no longer supports reading from HDFS natively. It needs a new package, tensorflow-io, which in turn requires tensorflow (the non-ROCm version). We'll need to follow up with upstream (AMD) to ask what their plans are.
  • We therefore decided to target ROCm 4.2 and tensorflow 2.5.0 instead.

Change 726864 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Upgrade stat100[5,8] to ROCm 4.2

https://gerrit.wikimedia.org/r/726864

Change 726864 merged by Elukey:

[operations/puppet@production] Upgrade stat100[5,8] to ROCm 4.2

https://gerrit.wikimedia.org/r/726864

Change 726578 abandoned by Elukey:

[operations/puppet@production] Move stat100[5,8] to AMD ROCm 4.3.1

Reason:

https://gerrit.wikimedia.org/r/726578

elukey claimed this task.

This is done! It seems that ROCm 4.2 is the only viable option for the moment. I'll keep working on https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/1461 to see what upstream suggests.

Please re-open the task if anything is missing!