
Import AMD rocm packages in wikimedia-buster
Closed, Resolved | Public | 13 Estimated Story Points

Description

In the parent task we manually installed the rocm packages (http://repo.radeon.com/rocm/apt/debian/pool/main/) on stat1005 and tested them with Tensorflow and the GPU that we bought. To do things properly, the next step is importing those packages into wikimedia-buster.

There are some caveats:

  • AMD provides packages only for Ubuntu 16.x and 18.x (https://rocm.github.io/ROCmInstall.html#supported-operating-systems---new-operating-systems-available). We tested them on Debian Buster and they seem to work fine.
  • Most of the provided packages also have their source available, but we'd need to verify that the software is properly licensed. The point of entry for all the repos is https://github.com/RadeonOpenCompute
  • One package contains binary libs that are not open source (yet). It is a dependency of other packages that we need, and upstream does not seem willing to make it optional (https://github.com/RadeonOpenCompute/ROCm/issues/761). The package, IIUC, should only be used for image processing via OpenCL, which is needed only in rare cases (none of the ones that we currently want to support, like Tensorflow). I manually removed the binary libs provided by the package and Tensorflow works as expected (as does basic usage of OpenCL).

Some solutions that we could consider:

  • Import all the packages as they are (including the non open source binary libs), and follow up with/help upstream to get the non open source dependency removed as soon as possible.
  • Import all the packages on boron, manually remove the non open source dependency from the related control files, rebuild everything and import the result to wikimedia-buster.

Event Timeline

If Tensorflow works fine without hsa-ext-rocr-dev, we also have a third option, which seems cleaner and easier:

  • Import the existing repository (sans hsa-ext-rocr-dev) to a new thirdparty/rocm component
  • Create a dummy hsa-ext-rocr-dev deb using https://packages.debian.org/stable/equivs and import that to component/rocm
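As a rough illustration of the dummy-package idea, the sketch below renders a control file of the kind that equivs-build consumes. The version string is taken from the checkupdate output later in this task; the maintainer and description are placeholders, not what was actually used.

```python
def equivs_control(package: str, version: str, maintainer: str) -> str:
    """Render an equivs control file describing an empty placeholder package."""
    lines = [
        "Section: misc",
        "Priority: optional",
        "Standards-Version: 3.9.2",
        "",
        f"Package: {package}",
        f"Version: {version}",
        f"Maintainer: {maintainer}",
        "Architecture: amd64",
        f"Description: dummy replacement for {package}",
        " Empty placeholder that only satisfies dependencies on the non-free package.",
    ]
    return "\n".join(lines) + "\n"


# Maintainer is a hypothetical placeholder value.
control = equivs_control(
    "hsa-ext-rocr-dev",
    "1.1.9-87-g1566fdd",
    "Example Maintainer <ops@example.org>",
)
print(control)
```

Writing the result to a file and running equivs-build on it (on a host with the equivs package installed) should produce a .deb that can then be imported like any other package.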

@MoritzMuehlenhoff you are completely right, I forgot about that option. At this point I am +1 on proceeding with the dummy package without waiting for upstream.

Going to create the new component and then try to import the packages :)

fdans moved this task from Incoming to Operational Excellence on the Analytics board.

I checked /var/log/apt/history.log for the packages that were installed to get the Tensorflow and Thumbor (uses OpenCL) use cases working:

cxlactivitylogger
hcc
hsa-rocr-dev
hsakmt-roct
miopen-hip
miopen-opencl
mivisionx
radeontop
rocblas
rocfft
rocm-cmake
rocm-dev
rocm-device-libs
rocm-opencl
rocm-opencl-dev
rocm-utils
rocrand

hsa-ext-rocr-dev does not seem to be needed so far. It is the only non-free package, so we don't need to import it.
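A list like the one above can be pulled out of the apt history log with a few lines of Python. This is only a sketch: the sample text below is made up for illustration (the dates and command line are invented, the versions are borrowed from the apt-cache output on stat1005), and it assumes the usual `Install: name:arch (version[, automatic]), ...` layout of those log lines.

```python
import re

# Illustrative excerpt in the layout apt uses for its history log.
SAMPLE_HISTORY = """\
Start-Date: 2019-06-01  10:00:00
Commandline: apt-get install rocm-dev
Install: hcc:amd64 (1.3.19174, automatic), rocm-dev:amd64 (2.4.25), hsa-rocr-dev:amd64 (1.1.9-68-gc862c1c, automatic)
End-Date: 2019-06-01  10:05:00
"""


def installed_packages(history_text):
    """Return the sorted package names found on 'Install:' lines."""
    names = set()
    for line in history_text.splitlines():
        if not line.startswith("Install: "):
            continue
        body = line[len("Install: "):]
        # Each entry looks like "name:arch (version[, automatic])".
        for match in re.finditer(r"([^\s,:()]+):[a-z0-9]+ \(", body):
            names.add(match.group(1))
    return sorted(names)


packages = installed_packages(SAMPLE_HISTORY)
print("\n".join(packages))
```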

Change 520848 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aptrepo: add component/amd-rocm

https://gerrit.wikimedia.org/r/520848

https://wikitech.wikimedia.org/wiki/Reprepro#Adding_a_new_external_repository

I think that the key used (http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key) is not in any public keyserver, so probably we'll need to import it manually first (via wget on https). @MoritzMuehlenhoff asking for your permission before proceeding since I don't want to make a mess :)

Mentioned in SAL (#wikimedia-operations) [2019-07-08T13:52:28Z] <elukey> import AMD ROCm's Debian repo key (9386B48A1A693C5C) manually on install1002 - T224723

Change 520848 merged by Elukey:
[operations/puppet@production] aptrepo: add thirdparty/amd-rocm

https://gerrit.wikimedia.org/r/520848

root@install1002:~# reprepro --noskipold --component thirdparty/amd-rocm checkupdate buster-wikimedia
Calculating packages to get...

I am probably missing a step, because for the moment no packages are fetched.

More verbose run:

root@install1002:/srv/wikimedia# reprepro -V --noskipold --component thirdparty/amd-rocm checkupdate buster-wikimedia
aptmethod got 'http://downloads.linux.hpe.com/SDR/repo/mcp/dists/stretch/current/Release'
aptmethod got 'http://downloads.linux.hpe.com/SDR/repo/mcp/dists/stretch/current/Release.gpg'
aptmethod got 'http://hwraid.le-vert.net/debian/dists/stretch/Release'
aptmethod got 'http://hwraid.le-vert.net/debian/dists/stretch/Release.gpg'
Shutting down aptmethods...
Calculating packages to get...
  nothing to do for 'buster-wikimedia|thirdparty/amd-rocm|source'
  nothing to do for 'buster-wikimedia|thirdparty/amd-rocm|i386'
  nothing to do for 'buster-wikimedia|thirdparty/amd-rocm|amd64'

I don't see any attempt to fetch packages for ROCm, so I guess reprepro is not configured correctly?

Change 521319 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aptrepo: add missing update for amd-rocm

https://gerrit.wikimedia.org/r/521319

Change 521319 merged by Elukey:
[operations/puppet@production] aptrepo: add missing update for amd-rocm

https://gerrit.wikimedia.org/r/521319

Change 521417 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aptrepo: remove source from amd-rocm's update config

https://gerrit.wikimedia.org/r/521417

Change 521417 merged by Elukey:
[operations/puppet@production] aptrepo: remove source from amd-rocm's update config

https://gerrit.wikimedia.org/r/521417
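For readers unfamiliar with reprepro: packages are only pulled from upstream when an update rule exists for them and the distribution references that rule. The sketch below renders a rule of that general shape. It is not the content of the Puppet patches above; the suite name `xenial` and the repository base URL are assumptions, while the key ID is the one mentioned in the SAL entry. Leaving `source` out of the architectures mirrors the "remove source" change.

```python
def reprepro_update_rule(name, method, suite, component, key_id,
                         architectures=("amd64",)):
    """Render one stanza for reprepro's conf/updates file."""
    lines = [
        f"Name: {name}",
        f"Method: {method}",
        f"Suite: {suite}",
        # Map the upstream 'main' component into our own component.
        f"Components: main>{component}",
        "UDebComponents:",
        f"Architectures: {' '.join(architectures)}",
        f"VerifyRelease: {key_id}",
    ]
    return "\n".join(lines) + "\n"


rule = reprepro_update_rule(
    name="amd-rocm",
    method="http://repo.radeon.com/rocm/apt/debian",  # assumed base URL
    suite="xenial",                                    # assumed upstream suite
    component="thirdparty/amd-rocm",
    key_id="9386B48A1A693C5C",
)
print(rule)
```

The rule name also has to be listed in the `Update:` field of the target distribution in conf/distributions, otherwise checkupdate has nothing to do for the component.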

Better now!

root@install1002:/srv/wikimedia# reprepro --noskipold --component thirdparty/amd-rocm checkupdate buster-wikimedia
Calculating packages to get...
Updates needed for 'buster-wikimedia|thirdparty/amd-rocm|amd64':
'cxlactivitylogger': newly installed as '5.6.7259' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/c/cxlactivitylogger/cxlactivitylogger_5.6.7259_amd64.deb
'hcc': newly installed as '1.3.19242' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hcc/hcc_1.3.19242_amd64.deb
'hsa-ext-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsa-ext-rocr-dev/hsa-ext-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
'hsa-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsa-rocr-dev/hsa-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
'hsakmt-roct': newly installed as '1.0.9-171-g4be439e' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsakmt-roct/hsakmt-roct_1.0.9-171-g4be439e_amd64.deb
'miopen-hip': newly installed as '2.0.0-7a8f787' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/m/miopen-hip/miopen-hip_2.0.0-7a8f787_amd64.deb
'miopen-opencl': newly installed as '2.0.0-7a8f787' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/m/miopen-opencl/miopen-opencl_2.0.0-7a8f787_amd64.deb
'mivisionx': newly installed as '1.3.0' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/m/mivisionx/mivisionx_1.3.0_amd64.deb
'rocblas': newly installed as '2.2.11.0' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocblas/rocblas_2.2.11.0_amd64.deb
'rocfft': newly installed as '0.9.4.0' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocfft/rocfft_0.9.4.0_amd64.deb
'rocm-cmake': newly installed as '0.2.0-91316f9' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-cmake/rocm-cmake_0.2.0-91316f9_amd64.deb
'rocm-dev': newly installed as '2.6.22' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-dev/rocm-dev_2.6.22_amd64.deb
'rocm-device-libs': newly installed as '0.0.1' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-device-libs/rocm-device-libs_0.0.1_amd64.deb
'rocm-opencl': newly installed as '1.2.0-2019070446' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-opencl/rocm-opencl_1.2.0-2019070446_amd64.deb
'rocm-opencl-dev': newly installed as '1.2.0-2019070446' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-opencl-dev/rocm-opencl-dev_1.2.0-2019070446_amd64.deb
'rocm-utils': newly installed as '2.6.22' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-utils/rocm-utils_2.6.22_amd64.deb
'rocrand': newly installed as '2.6.0' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocrand/rocrand_2.6.0_amd64.deb

Change 521419 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aptrepo: remove packages from amd-rocm's whitelist

https://gerrit.wikimedia.org/r/521419

Change 521419 merged by Elukey:
[operations/puppet@production] aptrepo: remove packages from amd-rocm's whitelist

https://gerrit.wikimedia.org/r/521419

New list:

root@install1002:/srv/wikimedia# reprepro --noskipold --component thirdparty/amd-rocm checkupdate buster-wikimedia
Calculating packages to get...
Updates needed for 'buster-wikimedia|thirdparty/amd-rocm|amd64':
'cxlactivitylogger': newly installed as '5.6.7259' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/c/cxlactivitylogger/cxlactivitylogger_5.6.7259_amd64.deb
'hcc': newly installed as '1.3.19242' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hcc/hcc_1.3.19242_amd64.deb
'hsa-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsa-rocr-dev/hsa-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
'hsakmt-roct': newly installed as '1.0.9-171-g4be439e' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsakmt-roct/hsakmt-roct_1.0.9-171-g4be439e_amd64.deb
'miopen-hip': newly installed as '2.0.0-7a8f787' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/m/miopen-hip/miopen-hip_2.0.0-7a8f787_amd64.deb
'miopen-opencl': newly installed as '2.0.0-7a8f787' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/m/miopen-opencl/miopen-opencl_2.0.0-7a8f787_amd64.deb
'mivisionx': newly installed as '1.3.0' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/m/mivisionx/mivisionx_1.3.0_amd64.deb
'rocblas': newly installed as '2.2.11.0' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocblas/rocblas_2.2.11.0_amd64.deb
'rocfft': newly installed as '0.9.4.0' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocfft/rocfft_0.9.4.0_amd64.deb
'rocm-cmake': newly installed as '0.2.0-91316f9' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-cmake/rocm-cmake_0.2.0-91316f9_amd64.deb
'rocm-dev': newly installed as '2.6.22' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-dev/rocm-dev_2.6.22_amd64.deb
'rocm-device-libs': newly installed as '0.0.1' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-device-libs/rocm-device-libs_0.0.1_amd64.deb
'rocm-opencl': newly installed as '1.2.0-2019070446' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-opencl/rocm-opencl_1.2.0-2019070446_amd64.deb
'rocm-opencl-dev': newly installed as '1.2.0-2019070446' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-opencl-dev/rocm-opencl-dev_1.2.0-2019070446_amd64.deb
'rocm-utils': newly installed as '2.6.22' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocm-utils/rocm-utils_2.6.22_amd64.deb
'rocrand': newly installed as '2.6.0' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/r/rocrand/rocrand_2.6.0_amd64.deb
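To double check what changed between the two checkupdate runs, the output can be parsed and compared. A minimal sketch, using a few lines copied from the two listings above as input:

```python
import re

# Matches lines such as: 'hcc': newly installed as '1.3.19242' (from 'amd-rocm'):
ENTRY = re.compile(r"^'(?P<name>[^']+)': newly installed as '(?P<version>[^']+)'")


def parse_checkupdate(output):
    """Map package name to the version reprepro would import."""
    packages = {}
    for line in output.splitlines():
        match = ENTRY.match(line)
        if match:
            packages[match.group("name")] = match.group("version")
    return packages


BEFORE = """\
'hcc': newly installed as '1.3.19242' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hcc/hcc_1.3.19242_amd64.deb
'hsa-ext-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsa-ext-rocr-dev/hsa-ext-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
'hsa-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsa-rocr-dev/hsa-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
"""

AFTER = """\
'hcc': newly installed as '1.3.19242' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hcc/hcc_1.3.19242_amd64.deb
'hsa-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsa-rocr-dev/hsa-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
"""

before = parse_checkupdate(BEFORE)
after = parse_checkupdate(AFTER)
dropped = sorted(set(before) - set(after))
print(dropped)
```

On the full listings the only difference is the same one: hsa-ext-rocr-dev is no longer scheduled for import.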

Now the annoying part:

elukey@stat1005:~$ apt-cache show hsa-rocr-dev
Package: hsa-rocr-dev
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 3341
Maintainer: James Edwards (JamesAdrian.Edwards@amd.com)
Architecture: amd64
Version: 1.1.9-68-gc862c1c
Depends: hsa-ext-rocr-dev (= 1.1.9-68-gc862c1c), hsakmt-roct-dev
Description: AMD Heterogeneous System Architecture HSA - Linux HSA Runtime for ROCm platforms
Description-md5: 1e143bde6017450a262b1c6aae032265
Homepage: https://github.com/RadeonOpenCompute/ROCR-Runtime

elukey@stat1005:~$ apt-cache show rocm-dev
Package: rocm-dev
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 13
Maintainer: Advanced Micro Devices Inc.
Architecture: amd64
Version: 2.4.25
Depends: hsa-rocr-dev, hsa-ext-rocr-dev, rocm-device-libs, rocm-utils, hcc, hip_base, hip_doc, hip_hcc, hip_samples, rocm-smi, hsakmt-roct, hsakmt-roct-dev, hsa-amd-aqlprofile, comgr, rocr_debug_agent
Description: Radeon Open Compute (ROCm) Runtime software stack
Description-md5: 6a6a8f854ad10a9802b980cf99b587d2
Homepage: https://github.com/RadeonOpenCompute/ROCm

elukey@stat1005:~$ apt-cache show hcc
Package: hcc
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 1803755
Maintainer: Siu Chi Chan <siuchi.chan@amd.com>
Architecture: amd64
Version: 1.3.19174
Depends: coreutils, g++-multilib, gcc-multilib, findutils, libelf1, libpci3, file, hsa-rocr-dev, hsa-ext-rocr-dev, rocm-utils
Description: HCC: An Open Source, Optimizing C++ Compiler for Heterogeneous Compute
Description-md5: 21e1eb128cebd5f6fed1f460126737b5

The above are the three packages that rdepends hsa-ext-rocr-dev shows. Sadly one of them (hsa-rocr-dev) wants a specific version in its Depends, so the equivs package for hsa-ext-rocr-dev will need to be re-created/rebuilt every time we upgrade the rocm version.
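The version constraint can be spotted mechanically. The sketch below scans a Depends field for exact-version (`=`) requirements on a given package; it is deliberately simple and does not handle alternatives (`|`) or architecture qualifiers. The two input strings are the Depends lines of hsa-rocr-dev and hcc shown above.

```python
import re

# One dependency clause: a name, optionally followed by "(operator version)".
CLAUSE = re.compile(r"\s*([^\s(]+)\s*(?:\(\s*(<<|<=|=|>=|>>)\s*([^)\s]+)\s*\))?\s*$")


def exact_versions_required(depends_field, target):
    """Return the versions that `target` is pinned to with '=' in a Depends field."""
    versions = []
    for clause in depends_field.split(","):
        match = CLAUSE.match(clause)
        if match and match.group(1) == target and match.group(2) == "=":
            versions.append(match.group(3))
    return versions


hsa_rocr_dev_depends = "hsa-ext-rocr-dev (= 1.1.9-68-gc862c1c), hsakmt-roct-dev"
hcc_depends = ("coreutils, g++-multilib, gcc-multilib, findutils, libelf1, "
               "libpci3, file, hsa-rocr-dev, hsa-ext-rocr-dev, rocm-utils")

pinned = exact_versions_required(hsa_rocr_dev_depends, "hsa-ext-rocr-dev")
unpinned = exact_versions_required(hcc_depends, "hsa-ext-rocr-dev")
print(pinned, unpinned)
```

A non-empty result means the dummy package has to carry exactly that version, which is why it needs rebuilding on every ROCm upgrade.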

Mentioned in SAL (#wikimedia-operations) [2019-07-09T10:39:26Z] <elukey> update wikimedia-buster thirparty/amd-rocm component with upstream packages - T224723

Change 521463 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::gpu: add packages from thirdparty/rocm

https://gerrit.wikimedia.org/r/521463

Change 521463 merged by Elukey:
[operations/puppet@production] profile::statistics::gpu: add packages from thirdparty/rocm

https://gerrit.wikimedia.org/r/521463

Change 521475 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aptrepo: add more packages to the amd-rocm's whitelist

https://gerrit.wikimedia.org/r/521475

Change 521476 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::gpu: fix packages installed

https://gerrit.wikimedia.org/r/521476

Change 521475 merged by Elukey:
[operations/puppet@production] aptrepo: add more packages to the amd-rocm's whitelist

https://gerrit.wikimedia.org/r/521475

Change 521476 merged by Elukey:
[operations/puppet@production] profile::statistics::gpu: fix packages installed

https://gerrit.wikimedia.org/r/521476

Change 521478 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::gpu: add rocm-smi to the packages required

https://gerrit.wikimedia.org/r/521478

Change 521478 merged by Elukey:
[operations/puppet@production] profile::statistics::gpu: add rocm-smi to the packages required

https://gerrit.wikimedia.org/r/521478

elukey set the point value for this task to 13. (Jul 10 2019, 6:27 AM)
elukey added a project: Analytics-Kanban.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.

Change 522029 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::gpu: add rccl package

https://gerrit.wikimedia.org/r/522029

Change 522029 merged by Elukey:
[operations/puppet@production] profile::statistics::gpu: add rccl package

https://gerrit.wikimedia.org/r/522029

Change 522031 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aptrepo: pin a specific version of the AMD ROCm suite

https://gerrit.wikimedia.org/r/522031

Change 522031 merged by Elukey:
[operations/puppet@production] aptrepo: pin a specific version of the AMD ROCm suite

https://gerrit.wikimedia.org/r/522031

Change 522036 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::gpu: add rocm-libs

https://gerrit.wikimedia.org/r/522036

Change 522036 merged by Elukey:
[operations/puppet@production] profile::statistics::gpu: add rocm-libs

https://gerrit.wikimedia.org/r/522036

Change 522039 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aptrepo: add thirdparty/amd-rocm25

https://gerrit.wikimedia.org/r/522039

Change 522039 merged by Elukey:
[operations/puppet@production] aptrepo: add thirdparty/amd-rocm25

https://gerrit.wikimedia.org/r/522039

Change 522042 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::gpu: switch to thirdparty/amd-rocm25

https://gerrit.wikimedia.org/r/522042

Change 522042 merged by Elukey:
[operations/puppet@production] profile::statistics::gpu: switch to thirdparty/amd-rocm25

https://gerrit.wikimedia.org/r/522042

All right, so ROCm 2.5 and tensorflow-rocm 1.13.3 seem to work. Other versions of TF (1.13.4 and 1.14.0) lead to the following error:

Traceback (most recent call last):
  File "/srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/srv/home/elukey/test/lib/python3.7/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/srv/home/elukey/test/lib/python3.7/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: /srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN8hip_impl7kernarg6resizeEm

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_tf.py", line 1, in <module>
    import tensorflow as tf
  File "/srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/srv/home/elukey/test/lib/python3.7/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/srv/home/elukey/test/lib/python3.7/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: /srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN8hip_impl7kernarg6resizeEm


Failed to load the native TensorFlow runtime.

Opened https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/559 to upstream.
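For anyone trying to read the missing symbol: it is an Itanium-ABI mangled C++ name. The sketch below extracts it from the ImportError text and decodes the simple `_ZN<len><name>...E` nested-name form. It is not a general demangler (templates, substitutions and parameter types are not handled); a tool such as c++filt is the better choice when available.

```python
import re


def missing_symbol(error_text):
    """Extract the symbol name from an 'undefined symbol: ...' message."""
    match = re.search(r"undefined symbol: (\S+)", error_text)
    return match.group(1) if match else None


def nested_name(mangled):
    """Decode `_ZN<len><id>...E` into a '::'-joined name (simple form only)."""
    if not mangled.startswith("_ZN"):
        raise ValueError("not a simple nested name")
    pos = 3
    parts = []
    while pos < len(mangled) and mangled[pos] != "E":
        digits = re.match(r"\d+", mangled[pos:])
        if digits is None:
            raise ValueError("unsupported mangling construct")
        length = int(digits.group())
        pos += len(digits.group())
        parts.append(mangled[pos:pos + length])
        pos += length
    return "::".join(parts)


error_line = (
    "ImportError: /srv/home/elukey/test/lib/python3.7/site-packages/tensorflow/"
    "python/_pywrap_tensorflow_internal.so: undefined symbol: "
    "_ZN8hip_impl7kernarg6resizeEm"
)
symbol = missing_symbol(error_line)
decoded = nested_name(symbol)
print(decoded)
```

The trailing `m` in the mangled name encodes a single `unsigned long` parameter, so the missing function is `hip_impl::kernarg::resize(unsigned long)`. That would be consistent with the newer TF wheels expecting a HIP runtime newer than the one shipped with ROCm 2.5, though only upstream can confirm this.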

Change 523677 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Introduce module amd_rocm

https://gerrit.wikimedia.org/r/523677

Change 523677 merged by Elukey:
[operations/puppet@production] Introduce module amd_rocm

https://gerrit.wikimedia.org/r/523677

Refactored the puppet code into a separate module called amd_rocm and updated the documentation. We'll need to follow up with upstream to fix https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/559, but at the moment ROCm 2.5 seems to work so I'd close this task.

Change 523942 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] aptrepo: replace the amd-rocm component with amd-rocm26

https://gerrit.wikimedia.org/r/523942

Change 523942 merged by Elukey:
[operations/puppet@production] aptrepo: replace the amd-rocm component with amd-rocm26

https://gerrit.wikimedia.org/r/523942