
Upgrade AMD ROCm drivers/tools to latest upstream
Open, High, Public

Description

We are currently running ROCm 3.3 and upstream has already reached 3.8. Since we keep having problems with GPUs getting stuck in weird states (requiring a host reboot), I'd keep upgrading as an attempt to solve them.

Event Timeline

elukey created this task. Oct 2 2020, 9:55 AM
Restricted Application added a subscriber: Aklapper. Oct 2 2020, 9:55 AM

Change 631725 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] aptrepo: Add rocm 3.8 packages to reprepro

https://gerrit.wikimedia.org/r/631725

Change 631725 merged by Klausman:
[operations/puppet@production] aptrepo: Add rocm 3.8 packages to reprepro

https://gerrit.wikimedia.org/r/631725

Change 632219 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] Drop mivisionx from the RE of wanted packages

https://gerrit.wikimedia.org/r/632219

Change 632219 merged by Klausman:
[operations/puppet@production] Drop mivisionx from the RE of wanted packages

https://gerrit.wikimedia.org/r/632219

Change 632248 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] modules: Add functionality to allow use of 3.8 rocm packages

https://gerrit.wikimedia.org/r/632248

Change 632248 merged by Klausman:
[operations/puppet@production] modules: Add functionality to allow use of 3.8 rocm packages

https://gerrit.wikimedia.org/r/632248

Milimetric triaged this task as High priority. Oct 19 2020, 3:49 PM
Milimetric edited projects, added Analytics-Clusters; removed Analytics.

Change 635260 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics_cluster: Add rocm38 group for driver update testing

https://gerrit.wikimedia.org/r/635260

Change 635260 merged by Klausman:
[operations/puppet@production] analytics_cluster: set one machine to receive rocm38

https://gerrit.wikimedia.org/r/635260

Change 635279 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] aptrepo: Add missing rocm38 deps

https://gerrit.wikimedia.org/r/635279

Change 635279 merged by Klausman:
[operations/puppet@production] aptrepo: Add missing rocm38 deps

https://gerrit.wikimedia.org/r/635279

Change 635282 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics_cluster: Revert rocm38 change for an-worker1101

https://gerrit.wikimedia.org/r/635282

Change 635282 merged by Klausman:
[operations/puppet@production] analytics_cluster: Revert rocm38 change for an-worker1101

https://gerrit.wikimedia.org/r/635282

As the experiment today has shown, rocm38 ultimately depends on a few packages that are not available on Stretch. I think we have the following options:

  1. Wait with rocm upgrades until everything is on Buster
  2. Backport the needed packages (this is very hard, we're talking stuff like libstdc++)
  3. Try intermediate rocm versions (like 3.7)

Other proposals?

I would upgrade the two stat hosts to 3.8 and leave the hadoop workers on 3.3, so we can keep testing the drivers while keeping the stretch stack stable. What do you think?

Sounds good to me. I will make an announcement that we'll have some disruption on stat1005 and stat1008 soonish and then do the upgrade via puppet host overrides. For two machines that seems simple enough. If and when we get more of them, we can make a separate role like the original change I made.

Change 635755 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics: Switch stat1005 to use rocm 3.8

https://gerrit.wikimedia.org/r/635755

Change 635755 merged by Klausman:
[operations/puppet@production] analytics: Switch stat1005 to use rocm 3.8

https://gerrit.wikimedia.org/r/635755

Unfortunately, the rocm38 kernel module does not compile against our current Buster kernel (4.19.0-12):

  CC [M]  /var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_dma_buf.o
  CC [M]  /var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_vm.o
/var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_bios.c: In function ‘amdgpu_read_platform_bios’:
/var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_bios.c:200:9: error: implicit declaration of function ‘pci_platform_rom’; did you mean ‘pci_map_rom’? [-Werror=implicit-function-declaration]
  bios = pci_platform_rom(adev->pdev, &size);
         ^~~~~~~~~~~~~~~~
         pci_map_rom
/var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_bios.c:200:7: warning: assignment to ‘uint8_t *’ {aka ‘unsigned char *’} from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
  bios = pci_platform_rom(adev->pdev, &size);
       ^
  CC [M]  /var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_ib.o
  CC [M]  /var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_pll.o
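
One way to check ahead of a DKMS build whether a given kernel's headers still declare the symbol the module calls is to scan the header tree for it. The following is only a rough sketch, not something that was run for this task: the function names are made up for illustration, the match is a plain text search rather than a real C parse, and the header directory in the usage note is an assumption about where Debian installs the headers.

```python
import re
from pathlib import Path


def header_declares(header_text, symbol):
    """Return True if `symbol` occurs as a whole identifier followed by '('.

    This is a textual heuristic only: it does not run the preprocessor,
    so a declaration hidden behind an #ifdef still counts as a hit.
    """
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\s*\(")
    return bool(pattern.search(header_text))


def find_declaring_headers(include_dir, symbol):
    """List the .h files under `include_dir` that appear to declare `symbol`."""
    hits = []
    for path in sorted(Path(include_dir).rglob("*.h")):
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable file, skip it
        if header_declares(text, symbol):
            hits.append(path)
    return hits
```

Calling `find_declaring_headers("/usr/src/linux-headers-4.19.0-12-common/include", "pci_platform_rom")` (the path is an assumed location) would list the matching headers; an empty list would be consistent with the implicit-declaration error above.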

Change 635764 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] amd-rocm: Add 3.7 to reprepro

https://gerrit.wikimedia.org/r/635764

Change 635764 merged by Klausman:
[operations/puppet@production] amd-rocm: Add 3.7 to reprepro

https://gerrit.wikimedia.org/r/635764

Change 635787 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics: switch stat1005 to use rocm 3.7

https://gerrit.wikimedia.org/r/635787

Change 635787 merged by Klausman:
[operations/puppet@production] analytics: switch stat1005 to use rocm 3.7

https://gerrit.wikimedia.org/r/635787

And 3.7 has the same problem:

  LD [M]  /var/lib/dkms/amdgpu/3.7-20/build/amd/amdkcl/amdkcl.o
  CC [M]  /var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_ctx.o
  CC [M]  /var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_sync.o
  CC [M]  /var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_gtt_mgr.o
/var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_bios.c: In function ‘amdgpu_read_platform_bios’:
/var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_bios.c:200:9: error: implicit declaration of function ‘pci_platform_rom’; did you mean ‘pci_map_rom’? [-Werror=implicit-function-declaration]
  bios = pci_platform_rom(adev->pdev, &size);
         ^~~~~~~~~~~~~~~~
         pci_map_rom
/var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_bios.c:200:7: warning: assignment to ‘uint8_t *’ {aka ‘unsigned char *’} from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
  bios = pci_platform_rom(adev->pdev, &size);
       ^
/var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_cs.c:1167:12: warning: ‘amdgpu_cs_process_syncobj_timeline_out_dep’ defined but not used [-Wunused-function]
 static int amdgpu_cs_process_syncobj_timeline_out_dep(struct amdgpu_cs_parser *p,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_cs.c:1107:12: warning: ‘amdgpu_cs_process_syncobj_timeline_in_dep’ defined but not used [-Wunused-function]
 static int amdgpu_cs_process_syncobj_timeline_in_dep(struct amdgpu_cs_parser *p,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  CC [M]  /var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_vram_mgr.o

Change 635789 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics: Remove rocm 3.7 override for stat1005

https://gerrit.wikimedia.org/r/635789

Change 635789 merged by Klausman:
[operations/puppet@production] analytics: Remove rocm 3.7 override for stat1005

https://gerrit.wikimedia.org/r/635789

After some more experimenting, I have found that at least rocm33 compiles fine against 4.19.0-11, but fails with 4.19.0-12, with the above errors referring to the pci_platform_rom/pci_map_rom symbols.

https://wiki.debian.org/AMDGPUDriverOnStretchAndBuster2 indicates that some people are experimenting with rocm on Debian. The page mentions two patches, but neither of them addresses the missing PCI symbols that cause the compile to fail with 4.19.0-12.

I had a quick chat with Moritz about the kernel version/rocm situation, and we agreed to test kernel 5.8.0 (a backport to Buster) on stat1005 and see if it works better with either vanilla amdgpu or rock-dkms. I will update here as soon as I have results. The test is likely going to happen this Friday (2020-10-30).

Change 637671 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics: Switch stat1005 to use rocm 3.8

https://gerrit.wikimedia.org/r/637671

Change 637671 merged by Klausman:
[operations/puppet@production] analytics: Switch stat1005 to use rocm 3.8

https://gerrit.wikimedia.org/r/637671

Change 637676 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] aptrepo: add mor rocm 3.8 dependencies

https://gerrit.wikimedia.org/r/637676

Change 637676 merged by Klausman:
[operations/puppet@production] aptrepo: add mor rocm 3.8 dependencies

https://gerrit.wikimedia.org/r/637676

Change 637682 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] amd_rocm: Only add DKMS+firmware for rocm33 installs

https://gerrit.wikimedia.org/r/637682

Change 637682 merged by Klausman:
[operations/puppet@production] amd_rocm: Only add DKMS+firmware for rocm33 installs

https://gerrit.wikimedia.org/r/637682

Change 641147 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics: Switch stat1008 to use rocm 3.8

https://gerrit.wikimedia.org/r/641147

Change 641147 merged by Klausman:
[operations/puppet@production] analytics: Switch stat1008 to use rocm 3.8

https://gerrit.wikimedia.org/r/641147