We are currently running ROCm 3.3, and upstream has already reached 3.8. Since we keep having problems with GPUs getting stuck in weird states (requiring a host reboot), I'd like to keep upgrading as an attempt to solve them.
Event Timeline
Change 631725 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] aptrepo: Add rocm 3.8 packages to reprepro
Change 631725 merged by Klausman:
[operations/puppet@production] aptrepo: Add rocm 3.8 packages to reprepro
Change 632219 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] Drop mivisionx from the RE of wanted packages
Change 632219 merged by Klausman:
[operations/puppet@production] Drop mivisionx from the RE of wanted packages
Change 632248 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] modules: Add functionality to allow use of 3.8 rocm packages
Change 632248 merged by Klausman:
[operations/puppet@production] modules: Add functionality to allow use of 3.8 rocm packages
Change 635260 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics_cluster: Add rocm38 group for driver update testing
Change 635260 merged by Klausman:
[operations/puppet@production] analytics_cluster: set one machine to receive rocm38
Change 635279 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] aptrepo: Add missing rocm38 deps
Change 635279 merged by Klausman:
[operations/puppet@production] aptrepo: Add missing rocm38 deps
Change 635282 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics_cluster: Revert rocm38 change for an-worker1101
Change 635282 merged by Klausman:
[operations/puppet@production] analytics_cluster: Revert rocm38 change for an-worker1101
As the experiment today has shown, rocm38 ultimately depends on a few packages that are not available on Stretch. I think we have the following options:
- Wait with rocm upgrades until everything is on Buster
- Backport the needed packages (this is very hard, we're talking stuff like libstdc++)
- Try intermediate rocm versions (like 3.7)
Other proposals?
I would upgrade the two stat hosts to 3.8 and leave the Hadoop workers on 3.3, so we can keep testing the drivers while keeping the Stretch stack stable. What do you think?
Sounds good to me. I will make an announcement that we'll have some disruption on stat1005 and stat1008 soonish and then do the upgrade via puppet host overrides. For two machines that seems simple enough. If and when we get more of them, we can make a separate role like the original change I made.
Change 635755 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics: Switch stat1005 to use rocm 3.8
Change 635755 merged by Klausman:
[operations/puppet@production] analytics: Switch stat1005 to use rocm 3.8
Unfortunately, the rocm38 kernel module does not compile against our current Buster kernel (4.19.0-12):
  CC [M] /var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_dma_buf.o
  CC [M] /var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_vm.o
/var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_bios.c: In function ‘amdgpu_read_platform_bios’:
/var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_bios.c:200:9: error: implicit declaration of function ‘pci_platform_rom’; did you mean ‘pci_map_rom’? [-Werror=implicit-function-declaration]
  bios = pci_platform_rom(adev->pdev, &size);
         ^~~~~~~~~~~~~~~~
         pci_map_rom
/var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_bios.c:200:7: warning: assignment to ‘uint8_t *’ {aka ‘unsigned char *’} from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
  bios = pci_platform_rom(adev->pdev, &size);
       ^
  CC [M] /var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_ib.o
  CC [M] /var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_pll.o
Change 635764 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] amd-rocm: Add 3.7 to reprepro
Change 635764 merged by Klausman:
[operations/puppet@production] amd-rocm: Add 3.7 to reprepro
Change 635787 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics: switch stat1005 to use rocm 3.7
Change 635787 merged by Klausman:
[operations/puppet@production] analytics: switch stat1005 to use rocm 3.7
And 3.7 has the same problem:
  LD [M] /var/lib/dkms/amdgpu/3.7-20/build/amd/amdkcl/amdkcl.o
  CC [M] /var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_ctx.o
  CC [M] /var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_sync.o
  CC [M] /var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_gtt_mgr.o
/var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_bios.c: In function ‘amdgpu_read_platform_bios’:
/var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_bios.c:200:9: error: implicit declaration of function ‘pci_platform_rom’; did you mean ‘pci_map_rom’? [-Werror=implicit-function-declaration]
  bios = pci_platform_rom(adev->pdev, &size);
         ^~~~~~~~~~~~~~~~
         pci_map_rom
/var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_bios.c:200:7: warning: assignment to ‘uint8_t *’ {aka ‘unsigned char *’} from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
  bios = pci_platform_rom(adev->pdev, &size);
       ^
/var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_cs.c:1167:12: warning: ‘amdgpu_cs_process_syncobj_timeline_out_dep’ defined but not used [-Wunused-function]
 static int amdgpu_cs_process_syncobj_timeline_out_dep(struct amdgpu_cs_parser *p,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_cs.c:1107:12: warning: ‘amdgpu_cs_process_syncobj_timeline_in_dep’ defined but not used [-Wunused-function]
 static int amdgpu_cs_process_syncobj_timeline_in_dep(struct amdgpu_cs_parser *p,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  CC [M] /var/lib/dkms/amdgpu/3.7-20/build/amd/amdgpu/amdgpu_vram_mgr.o
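Both build logs fail on the same implicit-declaration error, so when comparing further rocm/kernel combinations it may help to pull just the missing symbols out of a DKMS build log. The following is a minimal Python sketch, not something that was run as part of this task; the names MissingSymbol and find_missing_symbols are hypothetical, and the regular expression assumes GCC's usual "file:line:col: error: implicit declaration of function" message with either typographic or ASCII quotes around the identifier.

```python
import re
from typing import List, NamedTuple


class MissingSymbol(NamedTuple):
    """One 'implicit declaration of function' error from a GCC build log."""
    source_file: str
    line: int
    column: int
    symbol: str


# GCC quotes identifiers with ‘…’ in UTF-8 locales and with '…' in the C
# locale, so both quote styles are accepted here.
_IMPLICIT_DECL = re.compile(
    r"(?P<file>\S+):(?P<line>\d+):(?P<col>\d+): error: "
    r"implicit declaration of function [‘'](?P<symbol>[^’']+)[’']"
)


def find_missing_symbols(build_log: str) -> List[MissingSymbol]:
    """Return every function the compiler reported as undeclared."""
    return [
        MissingSymbol(m["file"], int(m["line"]), int(m["col"]), m["symbol"])
        for m in _IMPLICIT_DECL.finditer(build_log)
    ]


if __name__ == "__main__":
    sample = (
        "/var/lib/dkms/amdgpu/3.8-30/build/amd/amdgpu/amdgpu_bios.c:200:9: "
        "error: implicit declaration of function ‘pci_platform_rom’; "
        "did you mean ‘pci_map_rom’? [-Werror=implicit-function-declaration]"
    )
    for hit in find_missing_symbols(sample):
        print(hit.symbol, hit.source_file, hit.line)
```

Warnings such as the -Wunused-function ones in the 3.7 log are deliberately ignored, since they do not stop the module from building.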
Change 635789 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics: Remove rocm 3.7 overrid for stat1005
Change 635789 merged by Klausman:
[operations/puppet@production] analytics: Remove rocm 3.7 override for stat1005
After some more experimenting, I have found that at least rocm33 compiles fine against 4.19.0-11, but fails with 4.19.0-12, with the above errors referring to the pci_platform_rom/pci_map_rom symbols.
https://wiki.debian.org/AMDGPUDriverOnStretchAndBuster2 indicates that some people are experimenting with rocm on Debian. The page mentions two patches, but neither of them mentions the missing PCI symbols that cause the compile to fail with 4.19.0-12.
I had a quick chat with Moritz about the kernel version/rocm situation, and we agreed that we'd test 5.8.0 (a backport to Buster) on stat1005 and see if it works better with either vanilla amdgpu or rock-dkms. Will update here as soon as I have results. The test is likely going to happen this Friday (2020-10-30).
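The results so far can be written down as a small lookup table, which makes it explicit which rocm/kernel combinations have actually been tried. This is only a sketch in Python: the table holds just the outcomes reported in this task, the names OBSERVED_DKMS_BUILDS and dkms_build_observed are hypothetical, and it assumes the kernel release string has the Debian form "4.19.0-12-amd64" (as printed by uname -r). Untested combinations return None rather than a guess.

```python
import re
from typing import Dict, Optional, Tuple

# (rocm version, kernel ABI version) -> did the amdgpu DKMS module build?
# Only combinations reported in this task are listed.
OBSERVED_DKMS_BUILDS: Dict[Tuple[str, str], bool] = {
    ("3.3", "4.19.0-11"): True,
    ("3.3", "4.19.0-12"): False,
    ("3.7", "4.19.0-12"): False,
    ("3.8", "4.19.0-12"): False,
}

# Matches the leading "4.19.0-12" of a release string like "4.19.0-12-amd64".
_KERNEL_ABI = re.compile(r"^(\d+\.\d+\.\d+-\d+)")


def dkms_build_observed(rocm_version: str, kernel_release: str) -> Optional[bool]:
    """Return the observed build result, or None if the pair was never tested."""
    match = _KERNEL_ABI.match(kernel_release)
    if match is None:
        return None
    return OBSERVED_DKMS_BUILDS.get((rocm_version, match.group(1)))


if __name__ == "__main__":
    import platform

    # platform.release() returns the running kernel's release string.
    print(dkms_build_observed("3.3", platform.release()))
```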
Change 637671 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics: Switch stat1005 to use rocm 3.8
Change 637671 merged by Klausman:
[operations/puppet@production] analytics: Switch stat1005 to use rocm 3.8
Change 637676 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] aptrepo: add mor rocm 3.8 dependencies
Change 637676 merged by Klausman:
[operations/puppet@production] aptrepo: add mor rocm 3.8 dependencies
Change 637682 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] amd_rocm: Only add DKMS+firmware for rocm33 installs
Change 637682 merged by Klausman:
[operations/puppet@production] amd_rocm: Only add DKMS+firmware for rocm33 installs
Change 641147 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] analytics: Switch stat1008 to use rocm 3.8
Change 641147 merged by Klausman:
[operations/puppet@production] analytics: Switch stat1008 to use rocm 3.8