
ROCm can't find clang on stat1005
Closed, Resolved, Public

Description

Attempting to run tensorflow on stat1005 with ROCm doesn't seem to be working. The error can be reproduced without tensorflow by executing /opt/rocm/bin/hipconfig. This looks to be localized to stat1005; in a cursory review, stat1008 appears to work.

ebernhardson@stat1005:~$ /opt/rocm/bin/hipconfig
Can't exec "/opt/rocm/llvm/bin/clang++": No such file or directory at /opt/rocm/bin/hipconfig line 141.
Use of uninitialized value $HIP_CLANG_VERSION in pattern match (m//) at /opt/rocm/bin/hipconfig line 142.
Use of uninitialized value $HIP_CLANG_VERSION in concatenation (.) or string at /opt/rocm/bin/hipconfig line 145.
HIP version  : 3.8.20371-d1886b0b

== hipconfig
HIP_PATH     : /opt/rocm-3.8.0/hip
ROCM_PATH    : /opt/rocm
HIP_COMPILER : clang
HIP_PLATFORM : hcc
HIP_RUNTIME  : ROCclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__=  -I/opt/rocm-3.8.0/hip/include -I/opt/rocm/llvm/bin/../lib/clang/ -I/opt/rocm/hsa/include -D__HIP_ROCclr__

== hip-clang
HSA_PATH         : /opt/rocm/hsa
HIP_CLANG_PATH   : /opt/rocm/llvm/bin
Can't exec "/opt/rocm/llvm/bin/clang++": No such file or directory at /opt/rocm/bin/hipconfig line 236.
Can't exec "/opt/rocm/llvm/bin/llc": No such file or directory at /opt/rocm/bin/hipconfig line 237.
hip-clang-cxxflags : Can't exec "/opt/rocm/llvm/bin/clang++": No such file or directory at /opt/rocm-3.8.0/hip/bin/hipconfig line 141.
...
ebernhardson@stat1008:~$ /opt/rocm/bin/hipconfig
HIP version  : 3.8.20371-d1886b0b

== hipconfig
HIP_PATH     : /opt/rocm-3.8.0/hip
ROCM_PATH    : /opt/rocm
HIP_COMPILER : clang
HIP_PLATFORM : hcc
HIP_RUNTIME  : ROCclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__=  -I/opt/rocm-3.8.0/hip/include -I/opt/rocm/llvm/bin/../lib/clang/11.0.0 -I/opt/rocm/hsa/include -D__HIP_ROCclr__

== hip-clang
HSA_PATH         : /opt/rocm/hsa
HIP_CLANG_PATH   : /opt/rocm/llvm/bin
clang version 11.0.0 (/src/external/llvm-project/clang b98349b12ffa706d0e863a3f1176b20d2a6c438b)
...
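
The failure above comes down to two binaries that hipconfig tries to exec and cannot find. A minimal sketch of a check for that condition, which does not need hipconfig itself, might look like the following. The directory and the binary names (clang++, llc) are taken from the error messages; the function name check_hip_toolchain is hypothetical and not part of ROCm.

```shell
# Sketch only: report which hip-clang toolchain binaries are missing.
# The default directory and the binary names come from the hipconfig
# errors in this task; check_hip_toolchain is a hypothetical helper.
check_hip_toolchain() {
    dir="${1:-/opt/rocm/llvm/bin}"
    missing=0
    for bin in clang++ llc; do
        # -x is true only if the file exists and is executable.
        if [ ! -x "$dir/$bin" ]; then
            echo "missing: $dir/$bin"
            missing=1
        fi
    done
    # Return 0 when both binaries are present, 1 otherwise.
    return "$missing"
}
```

Calling check_hip_toolchain with no arguments on each stat host gives a quick pass/fail; an alternative directory can be passed as the first argument.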

Event Timeline

Mentioned in SAL (#wikimedia-analytics) [2021-06-28T17:00:25Z] <elukey> apt-get reinstall llvm-gpu on stat100[5-8] - T285495

Wow, this is a great catch. It is really weird, since people used tensorflow before and never reported it. After checking the llvm-amdgpu package for https://github.com/RadeonOpenCompute/ROCm/issues/1140 I noticed (via dpkg -L) that it contained ROCm 3.7 files, not 3.8. After the reinstall, everything seems better now:

elukey@stat1008:~$ /opt/rocm/bin/hipconfig
HIP version  : 3.8.20371-d1886b0b

== hipconfig
HIP_PATH     : /opt/rocm-3.8.0/hip
ROCM_PATH    : /opt/rocm
HIP_COMPILER : clang
HIP_PLATFORM : hcc
HIP_RUNTIME  : ROCclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__=  -I/opt/rocm-3.8.0/hip/include -I/opt/rocm/llvm/bin/../lib/clang/11.0.0 -I/opt/rocm/hsa/include -D__HIP_ROCclr__

== hip-clang
HSA_PATH         : /opt/rocm/hsa
HIP_CLANG_PATH   : /opt/rocm/llvm/bin
clang version 11.0.0 (/src/external/llvm-project/clang b98349b12ffa706d0e863a3f1176b20d2a6c438b)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin
LLVM (http://llvm.org/):
  LLVM version 11.0.0git
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: cascadelake

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -D__HIP_ROCclr__ -std=c++11 -isystem /opt/rocm-3.8.0/llvm/lib/clang/11.0.0/include/.. -isystem /opt/rocm/hsa/include -D__HIP_ROCclr__ -isystem /opt/rocm-3.8.0/hip/include -D__HIP_ARCH_GFX900__=1  -O3
hip-clang-ldflags  :  -L/opt/rocm-3.8.0/hip/lib -O3 -lgcc_s -lgcc -lpthread -lm

=== Environment Variables
PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games

== Linux Kernel
Hostname     : stat1008
Linux stat1008 5.8.0-0.bpo.2-amd64 #1 SMP Debian 5.8.10-1~bpo10+1 (2020-09-26) x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 10 (buster)
Release:	10
Codename:	buster

Updated https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Upgrade_the_Debian_packages to reflect the importance of the llvm-amdgpu package while upgrading.
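
The dpkg -L inspection mentioned earlier could be turned into a repeatable post-upgrade check. The sketch below is one possible way, not the procedure from the wiki page: it assumes stale files show up under a versioned /opt/rocm-<version> prefix (the layout seen in the hipconfig output), and rocm_version_mismatch is a hypothetical helper name. It reads a file listing on stdin so it can be fed from dpkg -L.

```shell
# Sketch only: print paths from a package file listing that sit under a
# /opt/rocm-<version> prefix other than the expected one.
# Assumption: stale files live under a versioned prefix such as
# /opt/rocm-3.7.0; rocm_version_mismatch is a hypothetical helper.
# Intended use:  dpkg -L llvm-amdgpu | rocm_version_mismatch 3.8.0
rocm_version_mismatch() {
    awk -v prefix="/opt/rocm-$1" '
        # Only look at versioned ROCm paths; skip ones under the expected prefix.
        /^\/opt\/rocm-[0-9]/ && index($0, prefix "/") != 1 && $0 != prefix {
            print
            bad = 1
        }
        # Like grep: exit 0 if something was printed, 1 otherwise.
        END { exit bad ? 0 : 1 }
    '
}
```

The exit status follows grep's convention, so the helper can be used directly in an if statement: it succeeds only when at least one mismatching path was found.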

@EBernhardson can you re-test and let me know if it is better now?

elukey triaged this task as Medium priority.

Looks to be all good, thanks!