While evaluating pre-built ML models for a portion of T250436, one of the most promising candidates we found is provided through onnxruntime. It runs acceptably on CPUs, but ideally it would run on our GPU instance as well. ROCm support in onnxruntime is provided through MIGraphX, AMD's inference acceleration engine. MIGraphX was added to the ROCm repositories in 3.9, but we are currently on 3.8. It's not clear exactly which version of MIGraphX is required, but updating to at least ROCm 3.9 would allow us to evaluate it.
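For reference, a minimal sketch of how the GPU path could be checked once a newer ROCm is installed. `MIGraphXExecutionProvider` and `CPUExecutionProvider` are onnxruntime's provider names; the helper `pick_providers`, the preference order and the `model.onnx` path are illustrative assumptions, not something from this task. Whether the MIGraphX provider shows up at all depends on how the installed onnxruntime was built.

```python
# Assumed preference order: try MIGraphX (AMD GPU) first, fall back to CPU.
PREFERRED = ["MIGraphXExecutionProvider", "CPUExecutionProvider"]


def pick_providers(available, preferred=PREFERRED):
    """Return the preferred providers that are actually available, in preference order.

    Hypothetical helper for illustration; not part of onnxruntime.
    """
    return [p for p in preferred if p in available]


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed")
    else:
        providers = pick_providers(ort.get_available_providers())
        print("providers to request:", providers)
        # "model.onnx" is a placeholder path, not a model referenced in this task:
        # session = ort.InferenceSession("model.onnx", providers=providers)
```

If only `CPUExecutionProvider` is printed, the installed build has no MIGraphX support and inference will stay on CPU.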
Related Objects
- Mentioned In
- T292306: [DSE Hackathon] Sounds of the Commons: Neural Audio Mashups
- Mentioned Here
- T250436: [Epic] Query Completion
Event Timeline
+1 for upgrading ROCm to support ONNX Runtime. It's certainly worth evaluating imo, as it seems that ONNX would let us use an AMD GPU with any ML framework.
@odimitrijevic yes definitely we can work on it (either by ourselves or working with Ben/Razzi if they are not super busy with other projects). Lemme know :)
Change 725887 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] amd_rocm: import ROCm suite 4.3.1
Things to review: 3.8 -> 4.3.1
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/ROCm-Version-History.html
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Current-Release-Notes.html#amd-rocm-release-notes-v4-3
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/ROCm-Version-History.html#new-features-and-enhancements-in-rocm-v4-2
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/ROCm-Version-History.html#new-features-and-enhancements-in-rocm-v4-1
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/ROCm-Version-History.html#new-features-and-enhancements-in-rocm-v4-0
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/ROCm-Version-History.html#new-features-and-enhancements-in-rocm-v3-10
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/ROCm-Version-History.html#new-features-and-enhancements-in-rocm-v3-9
Change 725887 merged by Elukey:
[operations/puppet@production] amd_rocm: import ROCm suite 4.3.1
Change 725904 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] aptrepo: add missing amd-rocm431 settings
Change 725904 merged by Elukey:
[operations/puppet@production] aptrepo: add missing amd-rocm431 settings
Mentioned in SAL (#wikimedia-operations) [2021-10-04T14:19:57Z] <elukey> import AMD ROCm 4.3.1 packages in buster-wikimedia's thirdparty/amd-rocm431 - T287267
Change 726389 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] Set new AMD ROCm version for an-worker1096
Change 726507 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] amd_rocm: update settings/packages for ROCm 4.3.1
Change 726507 merged by Elukey:
[operations/puppet@production] amd_rocm: update settings/packages for ROCm 4.3.1
Change 726389 merged by Elukey:
[operations/puppet@production] Set new AMD ROCm version for an-worker1096
Change 726539 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] Upgrade all an-workers with GPUs to ROCm 4.3.1
Change 726539 merged by Elukey:
[operations/puppet@production] Upgrade all an-workers with GPUs to ROCm 4.3.1
Change 726578 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] Move stat100[5,8] to AMD ROCm 4.3.1
Change 726606 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] amd_rocm: add support for ROCm 4.2
Change 726606 merged by Elukey:
[operations/puppet@production] amd_rocm: add support for ROCm 4.2
Change 726611 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] Downgrade AMD ROCm to 4.2 (from 4.3.1) on an-worker1096
Mentioned in SAL (#wikimedia-operations) [2021-10-05T12:43:25Z] <elukey> import AMD ROCm 4.2 to buster-wikimedia's thirdparty/amd-rocm42 - T287267
Change 726611 merged by Elukey:
[operations/puppet@production] Downgrade AMD ROCm to 4.2 (from 4.3.1) on an-worker1096
Change 726619 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] Downgrade AMD ROCm to 4.2 on all GPU-based Hadoop workers
Change 726619 merged by Elukey:
[operations/puppet@production] Downgrade AMD ROCm to 4.2 on all GPU-based Hadoop workers
Change 726759 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] prometheus-amd-rocm-stats.py: support ROCm 4.2.0's smi output
Change 726759 merged by Elukey:
[operations/puppet@production] prometheus-amd-rocm-stats.py: support ROCm 4.2.0's smi output
To keep archives happy:
- We decided to target ROCm 4.3.1 (current latest upstream) and tensorflow-rocm 2.6.
- Instead of rolling out the packages on stat100[5,8], we started from the Hadoop workers.
- Basic checks after install went fine, but then we realized that tensorflow-rocm 2.6 (following its tensorflow counterpart) no longer supports reading from HDFS natively. It needs a new package, tensorflow-io, which in turn requires tensorflow (the non-rocm version). We'll need to follow up with upstream (AMD) to ask what their plans are.
- We then decided to target ROCm 4.2 and tensorflow 2.5.0 instead.
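For the archives, a rough sketch of the kind of post-install check that can be run on a GPU host. This is an assumption about what a basic check could look like, not a record of the checks actually performed here; the helpers `visible_gpus` and `summarize` are hypothetical names. It only asks TensorFlow which GPU devices it can see.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


def summarize(gpus):
    """Turn the result of visible_gpus() into a one-line status message."""
    if gpus is None:
        return "tensorflow not installed"
    if not gpus:
        return "no GPU visible"
    return "GPUs visible: " + ", ".join(gpus)


if __name__ == "__main__":
    print(summarize(visible_gpus()))
```

A check like this says nothing about HDFS access, which is the part that broke with tensorflow-rocm 2.6.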
Change 726864 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/puppet@production] Upgrade stat100[5,8] to ROCm 4.2
Change 726864 merged by Elukey:
[operations/puppet@production] Upgrade stat100[5,8] to ROCm 4.2
Change 726578 abandoned by Elukey:
[operations/puppet@production] Move stat100[5,8] to AMD ROCm 4.3.1
Reason:
This is done! It seems that ROCm 4.2 is the only viable option for the moment. I'll keep working on https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/1461 to see what upstream suggests.
Please re-open the task if anything is missing!