Upgrade AMD ROCm to latest upstream
Closed, ResolvedPublic

Description

We are currently running ROCm version 2.7.1, and the latest upstream version is 3.10 (https://rocm-documentation.readthedocs.io/en/latest/Current_Release_Notes/Current-Release-Notes.html).

While testing with @Miriam we found some bugs in the tensorflow-rocm / ROCm combination that we hope will be fixed by the upgrade (in the worst case we'll file a bug report upstream against their latest version of the code).

Before starting, it is worth pointing out that the new version will probably only support tensorflow-rocm 2.x. Waiting for @Miriam's green light before proceeding :)

Event Timeline

elukey created this task. Mar 6 2020, 2:09 PM
Restricted Application added a subscriber: Aklapper. Mar 6 2020, 2:09 PM
Milimetric triaged this task as High priority. Mar 9 2020, 4:01 PM
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

Change 586310 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aptrepo: add configuration for AMD ROCm 3.3

https://gerrit.wikimedia.org/r/586310

Change 586310 merged by Elukey:
[operations/puppet@production] aptrepo: add configuration for AMD ROCm 3.3

https://gerrit.wikimedia.org/r/586310

Mentioned in SAL (#wikimedia-operations) [2020-04-06T11:18:28Z] <elukey> import AMD ROCm 3.3 packages in buster-wikimedia (component thirdparty/rocm33) - T247082

Change 586334 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::gpu: upgrade stat1008 to ROCm 3.3

https://gerrit.wikimedia.org/r/586334

Change 586334 merged by Elukey:
[operations/puppet@production] profile::statistics::gpu: upgrade stat1008 to ROCm 3.3

https://gerrit.wikimedia.org/r/586334

Change 586352 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] amd: update package list after 3.3 upgrade

https://gerrit.wikimedia.org/r/586352

Change 586352 merged by Elukey:
[operations/puppet@production] amd: update package list after 3.3 upgrade

https://gerrit.wikimedia.org/r/586352

Change 586362 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix AMD GPU prometheus exporter for ROCm 3.3

https://gerrit.wikimedia.org/r/586362

Change 586362 merged by Elukey:
[operations/puppet@production] Fix AMD GPU prometheus exporter for ROCm 3.3

https://gerrit.wikimedia.org/r/586362

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

From Miriam's tests it seems that stat1008 (with ROCm 3.3) works better than before, with some positive effects also on T248574. Since some tests are ongoing on stat1005 I'll update it in a couple of weeks.

elukey moved this task from In Progress to Paused on the Analytics-Kanban board. Apr 9 2020, 12:50 PM

Change 596457 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add rocm-smi path as parameter to the Prometheus AMD GPU exporter

https://gerrit.wikimedia.org/r/596457

Change 596457 merged by Elukey:
[operations/puppet@production] Add rocm-smi path as parameter to the Prometheus AMD GPU exporter

https://gerrit.wikimedia.org/r/596457
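
For readers unfamiliar with this kind of change, here is a minimal sketch of making the `rocm-smi` location configurable in a Python-based exporter. It is not the code from the patch above: the `--rocm-smi-path` flag, the default path, and the function names are all assumptions made for illustration.

```python
import argparse
import shutil
from typing import List, Optional

# Assumed default location; the real path depends on how ROCm is packaged on the host.
DEFAULT_ROCM_SMI_PATH = "/opt/rocm/bin/rocm-smi"


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse command-line options for a hypothetical GPU exporter."""
    parser = argparse.ArgumentParser(description="Hypothetical AMD GPU exporter")
    parser.add_argument(
        "--rocm-smi-path",  # hypothetical flag name
        default=DEFAULT_ROCM_SMI_PATH,
        help="Path to the rocm-smi executable",
    )
    return parser.parse_args(argv)


def resolve_rocm_smi(path: str) -> Optional[str]:
    """Return the path if it points to an executable file, otherwise None."""
    return shutil.which(path)


if __name__ == "__main__":
    args = parse_args()
    resolved = resolve_rocm_smi(args.rocm_smi_path)
    if resolved is None:
        print(f"rocm-smi not found at {args.rocm_smi_path}")
    else:
        print(f"using rocm-smi at {resolved}")
```

Passing the path as a parameter means a ROCm upgrade that relocates the binary only needs a configuration change, not a change to the exporter itself.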

Change 603814 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::gpu: upgrade stat1005 to rocm 3.3

https://gerrit.wikimedia.org/r/603814

Change 603814 merged by Elukey:
[operations/puppet@production] profile::statistics::gpu: upgrade stat1005 to rocm 3.3

https://gerrit.wikimedia.org/r/603814

elukey set Final Story Points to 13. Jun 9 2020, 7:41 AM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.
Nuria added a subscriber: Nuria. Jun 19 2020, 4:57 PM

Making sure @Miriam is not waiting for any work here before proceeding.

I didn't add a message here, but the upgrade is complete. I synced with Miriam and Martin (who uses the GPU) before proceeding; all good.

Nuria closed this task as Resolved. Jun 19 2020, 5:03 PM
Aklapper removed a project: Analytics. Jul 4 2020, 7:59 AM