
[LLM] Use Flash attention 2 for GPU inference
Closed, Resolved · Public

Description

I would like to use Flash Attention 2 instead of PyTorch's default attention implementation, as it offers a significant inference speed-up (2-10x faster).
According to the transformers docs, Flash Attention 2 can be used with AMD Instinct MI210, MI250, and MI300 GPUs.
The documentation strongly suggests using this Dockerfile, which installs the rocm/flash-attention package. We can migrate the required Dockerfile instructions into our Blubber file in order to use them in our huggingface image.
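For context, this is roughly how Flash Attention 2 is requested through transformers, per the docs (a hedged sketch, not our serving code; the model id is a placeholder, and actually running load_model() requires transformers, flash-attn, and a supported GPU):

```python
# Sketch only: the kwargs transformers uses to select Flash Attention 2.
# FA2 supports only fp16/bf16, hence the explicit dtype.
FA2_KWARGS = {
    "torch_dtype": "float16",
    "attn_implementation": "flash_attention_2",
}

def load_model(model_id: str):
    # Lazy import so the sketch itself doesn't require transformers installed.
    from transformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained(model_id, **FA2_KWARGS)
```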

The official documentation lists two options for Flash Attention 2 on ROCm:

  • CK flash attention
  • triton flash attention

We will first build Flash Attention for ROCm targeting the MI210 (gfx90a architecture) on ml-lab; once we have a working example, we will transfer it to a Docker image built on a Debian base image.

flash attention version     platform    Status
CK flash attention 2        ml-lab
CK flash attention 2        docker
triton flash attention 2    ml-lab
triton flash attention 2    docker

Event Timeline

isarantopoulos renamed this task from [LLM] Use Flash attention 2 for inference to [LLM] Use Flash attention 2 for GPU inference.Jul 30 2024, 7:01 AM
isarantopoulos updated the task description. (Show Details)
isarantopoulos moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

Change #1059119 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] (WIP) huggingface: add flash attention 2

https://gerrit.wikimedia.org/r/1059119

I took a first swing at this, copying over the Dockerfile instructions to the hf Blubber image.
At the moment this is failing during install:

18.87   × python setup.py egg_info did not run successfully.
18.87   │ exit code: 1
18.87   ╰─> [9 lines of output]
18.87       Traceback (most recent call last):
18.87         File "<string>", line 2, in <module>
18.87         File "<pip-setuptools-caller>", line 34, in <module>
18.87         File "/srv/app/flash-attention-v2/setup.py", line 21, in <module>
18.87           import torch
18.87         File "/opt/lib/python/site-packages/torch/__init__.py", line 237, in <module>
18.87           from torch._C import *  # noqa: F403
18.87           ^^^^^^^^^^^^^^^^^^^^^^
18.87       ImportError: libamdhip64.so: cannot enable executable stack as shared object requires: Invalid argument
18.87       [end of output]
18.87   
18.87   note: This error originates from a subprocess, and is likely not a problem with pip.
19.10 error: metadata-generation-failed
19.10 
19.10 × Encountered error while generating package metadata.
19.10 ╰─> See above for output.
19.10 
19.10 note: This is an issue with the package mentioned above, not pip.
19.10 hint: See above for details.
------
ERROR: failed to solve: process "/bin/sh -c python3 \"-m\" \"pip\" \"install\" \"-r\" \"huggingface_modelserver/requirements.txt\"" did not complete successfully: exit code: 1

I need to recheck if this is a permissions issue and, if so, whether it would make sense to install flash attention in the pytorch base image in production-images instead of the inference-services repository.

isarantopoulos lowered the priority of this task from High to Medium.Oct 15 2024, 2:42 PM
isarantopoulos raised the priority of this task from Medium to High.Nov 26 2024, 1:56 PM

I tried building FA2 from source on ml-lab1001 but ran into:

fatal error: cannot open file '/opt/rocm/amdgcn/bitcode/ocml.bc': Unknown attribute kind (86) (Producer: 'LLVM17.0.0git' Reader: 'LLVM 15.0.6')

To reproduce:

git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
python3 -m venv .venv && source .venv/bin/activate
export PYTHONPATH="$PYTHONPATH:/srv/pytorch-rocm/venv/lib/python3.11/site-packages"
pip install -U ninja packaging wheel

# Need to export these, otherwise I get: 
# clang: error: cannot find ROCm device library; provide its path via '--rocm-path' 
# or '--rocm-device-lib-path', or pass '-nogpulib' to build without ROCm device library
export DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode
export HIP_DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode

GPU_ARCHS=gfx90a PYTORCH_ROCM_ARCH=gfx90a python setup.py install

This might have to do with the fact that hipconfig shows Debian clang version 15.0.6 but /opt/rocm-6.1.0/lib/llvm/bin/clang -v shows AMD clang version 17.0.0.

It's also a little strange that hipconfig points to /usr for most paths when it should probably be /opt/rocm-6.1.0?

mnz@ml-lab1001:~$ hipconfig
HIP version  : 5.2.21153-0

== hipconfig
HIP_PATH     : /usr
ROCM_PATH    : /usr
HIP_COMPILER : clang
...

== hip-clang
HSA_PATH         : /usr/hsa
HIP_CLANG_PATH   : /usr/bin

For contrast, this is what the output is on one of the stat clients:

mnz@stat1010:~$ hipconfig
HIP version  : 5.4.22801-aaa1e3d8

== hipconfig
HIP_PATH     : /opt/rocm-5.4.0
ROCM_PATH    : /opt/rocm-5.4.0
HIP_COMPILER : clang
...

== hip-clang
HSA_PATH         : /opt/rocm-5.4.0/hsa
HIP_CLANG_PATH   : /opt/rocm-5.4.0/llvm/bin

But maybe they're supposed to be different?

Edit: also note that the output says HIP version : 5.2.21153-0, but I would've expected it to be something like 6.1.x?

[ using torch==2.5.1+rocm6.1 under /home/isaranto/.venv/]
I also tried to build flash attention on ml-lab and came to a similar conclusion: hipcc -v fails with the following error:

Can't exec "/opt/rocm/llvm/bin/clang-15": No such file or directory at /usr/bin//hipcc.pl line 164.
Use of uninitialized value $HIP_CLANG_VERSION in pattern match (m//) at /usr/bin//hipcc.pl line 165.
Can't exec "/opt/rocm/llvm/bin/clang-15": No such file or directory at /usr/bin//hipcc.pl line 169.
Use of uninitialized value $HIP_CLANG_TARGET in scalar chomp at /usr/bin//hipcc.pl line 170.
Use of uninitialized value $HIP_CLANG_VERSION in concatenation (.) or string at /usr/bin//hipcc.pl line 173.
Use of uninitialized value $HIP_CLANG_VERSION in concatenation (.) or string at /usr/bin//hipcc.pl line 735.
Use of uninitialized value $HIP_CLANG_VERSION in concatenation (.) or string at /usr/bin//hipcc.pl line 740.
sh: 1: /opt/rocm/llvm/bin/clang-15: not found

I wonder if a simple symlink to /opt/rocm/llvm/bin/clang-17 would solve this, but I'd need sudo access to do that.

These are the steps I followed to install flash attention:

# Install from source
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention/
export ROCM_PATH=/opt/rocm

GPU_ARCHS=gfx90a python setup.py install #MI210 series

I get this error:

_CXX11_ABI=0 -fno-gpu-rdc
/bin/sh: 1: /opt/rocm/bin/hipcc: not found
ninja: build stopped: subcommand failed.

Then, after exporting the CXX env var (export CXX=/usr/bin/hipcc),
I get this:

  File "/usr/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/hipcc', '-v']' returned non-zero exit status 127.

which led me to explore hipcc -v and the errors reported at the beginning of this message.

Also note that the output says HIP version : 5.2.21153-0 but I would've expected it to be something like 6.1.x ?

Looks like this is coming from /usr/bin/hipvars.pm, where it tries to read the version from HIP_PATH/bin/.hipVersion (L159) and on failure defaults to 5.2.21153-0. In newer versions, this information seems to be written to HIP_PATH/share/hip/version instead, which might be why it can't find the file:

mnz@ml-lab1001:~$ cat /opt/rocm-6.1.0/share/hip/version
# Auto-generated by cmake
HIP_PACKAGING_VERSION_PATCH=40091.60100
CPACK_DEBIAN_PACKAGE_RELEASE=82~22.04
CPACK_RPM_PACKAGE_RELEASE=82
HIP_VERSION_MAJOR=6
HIP_VERSION_MINOR=1
HIP_VERSION_PATCH=40091
HIP_VERSION_GITHASH=a8dbc0c19

In the version tagged rocm-6.1.0, this script has been updated to point to the correct location, and clang-15 is also no longer hardcoded in the hipcc.pl script. So it seems like we need to upgrade hipcc? Also, since the default HIP_PATH is the absolute path of the parent directory of hipcc (unless the HIP_PATH environment variable is set) and all other paths are derived from that, we might also want to place these scripts under /opt/rocm-6.1.0/hip/bin?
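For illustration, the version lookup described above can be sketched in Python (a hypothetical re-implementation of the hipvars.pm behavior, not the actual Perl; the real script also appends packaging/git-hash suffixes):

```python
import os

FALLBACK_VERSION = "5.2.21153-0"  # what hipvars.pm reports when no version file is found

def read_hip_version(hip_path: str) -> str:
    """Mimic hipvars.pm: try the old bin/.hipVersion location first,
    then the newer share/hip/version key=value file."""
    for rel in ("bin/.hipVersion", "share/hip/version"):
        path = os.path.join(hip_path, rel)
        if not os.path.exists(path):
            continue
        fields = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    fields[key] = value
        if "HIP_VERSION_MAJOR" in fields:
            return "{}.{}.{}".format(
                fields["HIP_VERSION_MAJOR"],
                fields["HIP_VERSION_MINOR"],
                fields["HIP_VERSION_PATCH"],
            )
    return FALLBACK_VERSION
```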

Tobias has added symlinks pointing to clang-17 (@klausman I see that you have added symlinks for clang-14, not clang-15) and we get a different error now:

hipcc -v
Debian clang version 15.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/11
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/12
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-linux-gnu/12
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found HIP installation: /usr, version 5.2.21153
 "/usr/bin/ld" -pie --hash-style=both --build-id --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a.out /lib/x86_64-linux-gnu/Scrt1.o /lib/x86_64-linux-gnu/crti.o /usr/bin/../lib/gcc/x86_64-linux-gnu/12/crtbeginS.o -L/usr/lib -L/usr/bin/../lib/clang/15.0.6/lib/linux -L/usr/bin/../lib/gcc/x86_64-linux-gnu/12 -L/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib -lgcc_s -lgcc -lpthread -lm -lrt -lamdhip64 -lclang_rt.builtins-x86_64 -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/bin/../lib/gcc/x86_64-linux-gnu/12/crtendS.o /lib/x86_64-linux-gnu/crtn.o
/usr/bin/ld: /lib/x86_64-linux-gnu/Scrt1.o: in function `_start':
(.text+0x17): undefined reference to `main'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

I've fixed the 14 vs 15 problem.


That is odd. The compilation error is what usually happens when you try to compile something that doesn't have a main() function (i.e. a library or just general object file) into a standalone program.

It looks like you can override just the invocations of nvcc or hipcc, without overriding invocations of g++ or clang++ when building extensions (which is what CXX would do), by setting PYTORCH_NVCC (see https://github.com/pytorch/pytorch/blob/main/torch/utils/cpp_extension.py#L2363). I did export PYTORCH_NVCC=/usr/bin/hipcc (you would have to unset CXX if it is set); the build isn't done yet, but it has started successfully.

it isn't done building yet but the build has started successfully.

This finally finished running.

(flash-env) mnz@ml-lab1001:~/scratch/flash-attn-2$ pip show flash-attn
DEPRECATION: Loading egg at /srv/home/mnz/miniconda3/envs/flash-env/lib/python3.11/site-packages/flash_attn-2.7.0.post2-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
Name: flash_attn
Version: 2.7.0.post2
Summary: Flash Attention: Fast and Memory-Efficient Exact Attention
Home-page: https://github.com/Dao-AILab/flash-attention
Author: Tri Dao
Author-email: tri@tridao.me
License: 
Location: /home/mnz/miniconda3/envs/flash-env/lib/python3.11/site-packages/flash_attn-2.7.0.post2-py3.11-linux-x86_64.egg
Requires: einops, torch
Required-by:

Currently running pytest -v test_flash_attn_ck.py, about 20% of the way through.


Great find Muniza!
I tried this and it seems to progress, but I get some failures further down the build.
Also, hipcc -v still fails with a different error.
I have all the info in this paste (both for the flash_attn build and hipcc) and will try to figure it out in a bit.

Looking at your paste, it seems like it's loading HIP from /usr:

In file included from /usr/include/hip/hip_fp16.h:29:

Can you run hipconfig to check if the HIP_PATH is /usr and, if so, try setting it to /opt/rocm?

Thanks Muniza! I have no idea how this was set (I don't see anything in my bash history). Unsetting HIP_CLANG_PATH and setting export HIP_PATH=/opt/rocm seems to work (it is building now) 🎉

I think we need to configure rocm properly first:

after setting HIP_PATH=/opt/rocm I get this in hipconfig:

== hip-clang
HSA_PATH         : /opt/rocm/hsa
HIP_CLANG_PATH   : /opt/rocm/llvm/bin
Can't exec "/opt/rocm/llvm/bin/clang++-15": No such file or directory at /usr/bin//hipconfig.pl line 179.
Can't exec "/opt/rocm/llvm/bin/llc-15": No such file or directory at /usr/bin//hipconfig.pl line 180.
hip-clang-cxxflags : Can't exec "/opt/rocm/bin/hipcc": No such file or directory at /usr/bin//hipconfig.pl line 182.

hip-clang-ldflags  : Can't exec "/opt/rocm/bin/hipcc": No such file or directory at /usr/bin//hipconfig.pl line 185.

Alright, I tried so many things in a disorganized way, so now I'm trying to do things from the start.

This is what I am trying, and it seems to be building:

git clone https://github.com/ROCm/flash-attention.git
cd flash-attention/

python3 -m venv .venv && source .venv/bin/activate
pip install -U ninja packaging wheel

export PYTHONPATH="$PYTHONPATH:/srv/pytorch-rocm/venv/lib/python3.11/site-packages"
export PYTORCH_NVCC=/usr/bin/hipcc
export DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode
export HIP_DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode
export ROCM_PATH=/opt/rocm
export HIP_PATH=/opt/rocm

GPU_ARCHS=gfx90a PYTORCH_ROCM_ARCH=gfx90a python setup.py install

The above sequence of actions failed again. The logs are available in this paste

We have observed some inconsistencies for CK flash attention that need to be addressed.

  1. Specific commits: The recommended vLLM docker image uses a specific commit from the ROCm flash-attention repo:
FROM base AS build_flash_attn
ARG FA_BRANCH="3cea2fb"
ARG FA_REPO="https://github.com/ROCm/flash-attention.git"
  2. Git submodules: The ROCm/flash-attention repo has git submodules which are not pinned to a specific commit (they just fetch the latest). The important bit is the composable-kernel (CK) repo, which provides this flash attention implementation. Perhaps we need to pin it to a specific commit.
  3. hipcc and clang versions:
    • ROCm's hipcc is looking for clang-15. We created symlinks to use clang-17, but this doesn't seem to solve the issues.
    • On Debian bookworm, clang-17 is available (/opt/rocm/llvm/bin/clang-17).
  4. OS version and packages [[ this is about the hip packages available in debian - will fill this later ]]
  5. Python versions: All the docker images published on DockerHub by pytorch ROCm use Python 3.9 or 3.10. We are using Python 3.11, which is available in Debian bookworm.

The above sequence of actions failed again. The logs are available in this paste

I couldn't find the actual error in your paste, so I ran the sequence of commands from your comment above and it looks like this is the actual error:

In file included from /srv/pytorch-rocm/venv/lib/python3.11/site-packages/torch/include/torch/csrc/api/include/torch/python.h:8:
In file included from /srv/pytorch-rocm/venv/lib/python3.11/site-packages/torch/include/torch/csrc/Device.h:4:
/srv/pytorch-rocm/venv/lib/python3.11/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: 'Python.h' file not found
   12 | #include <Python.h>
      |          ^~~~~~~~~~
1 error generated when compiling for gfx90a.

There's more here, but it seems that this file should either be under /usr/include/python3.11 or under .venv/include/python3.11 (but it's not). I realize that I didn't run into this because I was using a conda env with a Python 3.11 installation, where Python.h can be found at /home/mnz/miniconda3/envs/flash-env/include/python3.11/Python.h.
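A quick way to check where the interpreter expects its C headers (on Debian, python3.11-dev is the package that should provide Python.h at this location):

```python
import os
import sysconfig

# Directory where this interpreter expects its C API headers.
include_dir = sysconfig.get_paths()["include"]
has_header = os.path.exists(os.path.join(include_dir, "Python.h"))
print(f"{include_dir}: Python.h {'found' if has_header else 'MISSING'}")
```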

If you have miniconda installed, maybe you could try running the following? I just tried this again and was able to build CK FA2 from scratch:

git clone https://github.com/ROCm/flash-attention.git
cd flash-attention/

conda create -n flash-env python=3.11
conda activate flash-env
pip install -U ninja packaging wheel

export PYTHONPATH="$(conda info --base)/envs/flash-env/lib/python3.11/site-packages:/srv/pytorch-rocm/venv/lib/python3.11/site-packages"
export PYTORCH_NVCC=/usr/bin/hipcc
export DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode
export HIP_DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode
export ROCM_PATH=/opt/rocm
export HIP_PATH=/opt/rocm

GPU_ARCHS=gfx90a PYTORCH_ROCM_ARCH=gfx90a python setup.py install

Thanks a lot for all the help @MunizaA!! I will try it to check whether it works.
We'll need to figure out the proper setup afterwards anyway, because virtualenv is used in Blubber/Docker images.
Mysteriously, I never got the Python.h error...

The issues we are having seem to be related to hipcc, so I will download the original image to see what the HIP configuration looks like in there.

The issues we are having seem to be related to hipcc so I will download the original image to see what the hipconfiguration looks like in there

Agreed, I came to the same conclusion in a comment above that a lot of these environment variables and workarounds would go away if we figure out hipcc:

In the version tagged rocm-6.1.0, this script has been updated to point to the correct location and clang-15 is also no longer hardcoded in the hipcc.pl script. So it seems like we need to upgrade hipcc? Also, since it seems like the default HIP_PATH is the absolute path of the parent directory of hipcc ( unless the HIP_PATH environment variable is set) and all other paths are derived from that, we might also want to place these scripts under /opt/rocm-6.1.0/hip/bin?

The Python header file would still be a problem, and the conda env kind of works around that for now. Unless I'm the only one who's run into it? But then I wonder why I can't find it under /usr/include and whether it's supposed to be elsewhere.

I managed to get a successful build with what you suggested (using conda) 🎉

Here is the result of the build.

Aiko ran into the same Python header issue while building the triton variant. @achou, you could use a miniconda env to do the testing for now, and we can figure out a production environment later.

I took a look at one of the official rocm/pytorch images to see what hipconfig looks like over there. I used the image rocm/pytorch:rocm6.1.3_ubuntu22.04_py3.10_pytorch_release-2.1.2 (the pytorch version is older in this one, so I should have used rocm6.1_ubuntu22.04_py3.10_pytorch_2.4 instead).
It is important to note that Python 3.10 is the latest Python available in any of these images!

When I attach to the image, this is the hipconfig output I get:

root@4d9bd37657f2:/var/lib/jenkins# hipconfig
HIP version  : 6.1.40093-bd86f1708

== hipconfig
HIP_PATH     : /opt/rocm-6.1.3
ROCM_PATH    : /opt/rocm
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME  : rocclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-6.1.3/include -I/opt/rocm-6.1.3/lib/llvm/lib/clang/17


== hip-clang
HIP_CLANG_PATH   : /opt/rocm/llvm/bin
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.3 24193 669db884972e769450470020c06a6f132a8a065b)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin
Configuration file: /opt/rocm-6.1.3/lib/llvm/bin/clang++.cfg
AMD LLVM version 17.0.0git
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: westmere

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags :  -isystem "/opt/rocm-6.1.3/include" -O3
hip-clang-ldflags  : --driver-mode=g++ -O3 --hip-link --rtlib=compiler-rt -unwindlib=libgcc

=== Environment Variables
PATH=/opt/ompi/bin:/opt/ucx/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LD_LIBRARY_PATH=/opt/ompi/lib:/opt/rocm/lib:/usr/local/lib:

== Linux Kernel
Hostname     : 4d9bd37657f2
Linux 4d9bd37657f2 6.6.26-linuxkit #1 SMP Sat Apr 27 04:13:19 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy

Interestingly enough, hipcc -v fails with the same message that we were getting, so that might not be a problem for us in the end.

root@4d9bd37657f2:/var/lib/jenkins# hipcc -v
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.3 24193 669db884972e769450470020c06a6f132a8a065b)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/llvm/bin
Configuration file: /opt/rocm-6.1.3/lib/llvm/bin/clang++.cfg
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/11
Selected GCC installation: /usr/lib/gcc/x86_64-linux-gnu/11
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Candidate multilib: x32;@mx32
Selected multilib: .;@m64
Found HIP installation: /opt/rocm-6.1.3/lib/llvm/bin/../../.., version 6.1.40093
 "/opt/rocm/llvm/bin/ld.lld" -z relro --hash-style=gnu --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a.out /lib/x86_64-linux-gnu/crt1.o /lib/x86_64-linux-gnu/crti.o /opt/rocm-6.1.3/lib/llvm/lib/clang/17/lib/linux/clang_rt.crtbegin-x86_64.o -L/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib --enable-new-dtags -L/opt/rocm-6.1.3/lib/llvm/bin/../../../lib -rpath /opt/rocm-6.1.3/lib/llvm/bin/../../../lib -lamdhip64 -lstdc++ -lm /opt/rocm-6.1.3/lib/llvm/lib/clang/17/lib/linux/libclang_rt.builtins-x86_64.a -lgcc_s -lc /opt/rocm-6.1.3/lib/llvm/lib/clang/17/lib/linux/libclang_rt.builtins-x86_64.a -lgcc_s /opt/rocm-6.1.3/lib/llvm/lib/clang/17/lib/linux/clang_rt.crtend-x86_64.o /lib/x86_64-linux-gnu/crtn.o
ld.lld: error: undefined symbol: main
>>> referenced by /lib/x86_64-linux-gnu/crt1.o:(_start)
clang: error: linker command failed with exit code 1 (use -v to see invocation)

If we want to make things work for Debian, we need to fix the hipconfig. What Muniza wrote in T371344#10362517 is the way I would try it as well.

Change #1100056 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc

https://gerrit.wikimedia.org/r/1100056

Change #1100056 merged by Klausman:

[operations/puppet@production] ml-lab/gpu: Add environment file that sets correct paths for ROCm/hipcc

https://gerrit.wikimedia.org/r/1100056

Change #1100995 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: try prebuilt flash attn package

https://gerrit.wikimedia.org/r/1100995

Change #1059119 abandoned by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] (WIP) huggingface: add flash attention 2

https://gerrit.wikimedia.org/r/1059119

While running quantization experiments on the aya-expanse-8b model in T377848#10382809, the vanilla model had an inference time of about 6 seconds. Today, I used the same prompt from the previous experiments and accelerated the model with Flash Attention 2. The inference time was reduced by half, to around 3 seconds.

To avoid relying on a single prompt, I ran the huggingface optimum benchmark to compare the mean latency at each inference stage for the CohereForAI/aya-expanse-8b model, both in its vanilla form and when accelerated with Flash Attention 2. Below are the results:
(NB: lower latency means faster inference)

aya-expanse-8b benchmark - vanilla vs flash attention 2 accelerated model - Screenshot from 2024-12-09 17-43-40.png (788×1 px, 105 KB)

Besides the loading stage, the Flash Attention 2 accelerated model is consistently faster than its vanilla counterpart.
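As a sanity check on the single-prompt numbers above (illustrative arithmetic only, using the rough 6 s vs 3 s figures):

```python
vanilla_latency_s = 6.0  # single-prompt inference time, vanilla aya-expanse-8b
fa2_latency_s = 3.0      # same prompt with Flash Attention 2 enabled

speedup = vanilla_latency_s / fa2_latency_s
print(f"{speedup:.1f}x faster")  # → 2.0x faster
```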

It seems that the flash-attention 2 package we built on ml-lab can't easily be installed in a different environment (tested by both me and Muniza). We have tested this on ml-lab and Lift Wing. In order to test it on LW, I published the wheels built on ml-lab in a release on GH.

The build process produces an .egg, which we convert to a .whl with the following command; the resulting wheel should be easy to pip install:

wheel convert flash_attn-2.7.0.post2-py3.11-linux-x86_64.egg

Instead, when we try to install it, we get this error:

ERROR: flash_attn-2.7.0.post2-py311-cp311-linux_x86_64.whl is not a supported wheel on this platform.
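The error makes sense once the filename is split into its tags: pip only installs wheels whose (python tag, abi tag, platform tag) triple is in the interpreter's supported set, and for CPython 3.11 that set includes combinations like cp311-cp311-linux_x86_64 but, as far as I can tell, not the py311-cp311 pairing that wheel convert produced. A small stdlib sketch of the split (build-tagged wheels would have six fields; this handles the common five-field case):

```python
def parse_wheel_tags(filename: str):
    """Split a five-field wheel filename into
    (name, version, python tag, abi tag, platform tag)."""
    stem = filename.removesuffix(".whl")
    name, version, py_tag, abi_tag, plat_tag = stem.split("-")
    return name, version, py_tag, abi_tag, plat_tag

print(parse_wheel_tags("flash_attn-2.7.0.post2-py311-cp311-linux_x86_64.whl"))
# → ('flash_attn', '2.7.0.post2', 'py311', 'cp311', 'linux_x86_64')
```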

@MunizaA was able to build an installable wheel as follows:

pip install build
GPU_ARCHS=gfx90a PYTORCH_ROCM_ARCH=gfx90a python3 -m build --no-isolation

pip and build use an isolated environment and torch isn't declared as a build dependency so it won't be found.

I tried to install the wheel from this in a new env, and although it installs, it can't be used:

ImportError: /home/isaranto/miniconda3/envs/flash251/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _Z17fmha_fwd_appendkv24fmha_fwd_appendkv_traits22fmha_fwd_appendkv_argsRKN7ck_tile13stream_configE

Change #1100995 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: try prebuilt flash attn package

https://gerrit.wikimedia.org/r/1100995

I pip installed the wheel that Kevin built in a new environment that had torch 2.5.1/rocm 6.1 and was able to load it and run inference with it. I will update the llm deployment to use that one so we can run a test.

@kevinbazira could you mention the process you followed as well as the environment on which you built it? (python+pytorch version)

I tried to install the wheel from this in a new env and although it installs it cant be used

ImportError: /home/isaranto/miniconda3/envs/flash251/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _Z17fmha_fwd_appendkv24fmha_fwd_appendkv_traits22fmha_fwd_appendkv_argsRKN7ck_tile13stream_configE

I can reproduce this too. It turns out that running python3 -m build --no-isolation first builds an sdist from the source and then builds a wheel from the sdist, which results in the extensions not being built correctly. I did the following to get it to build the wheel from the source instead:

pip install -U ninja build
GPU_ARCHS=gfx90a PYTORCH_ROCM_ARCH=gfx90a python3 -m build --no-isolation --wheel .

This gives me a wheel that I can then install and use. We could also run python setup.py bdist_wheel, but direct invocations of setup.py have been deprecated for some time now, and this lets us avoid that.

@kevinbazira could you mention the process you followed as well as the environment on which you built it? (python+pytorch version)

I documented the process I followed here: P71677. The build environment had Python 3.11.10, pytorch-triton-rocm 3.0.0, and torch 2.4.1+rocm6.1.

Change #1104648 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: update flashattention2

https://gerrit.wikimedia.org/r/1104648

Change #1104648 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: update flashattention2

https://gerrit.wikimedia.org/r/1104648

I rebuilt a wheel with pytorch 2.5.1 (rocm) using python setup.py bdist_wheel and was able to successfully deploy it on Lift Wing.

I rebuilt a wheel with pytorch 2.5.1 (rocm) using python setup.py bdist_wheel and was able to successfully deploy it on Lift Wing.

I'm not 100% sure that flash attention was actually used when I reported this. I should have been more thorough with it.

Lift Wing production is using ROCm 5.4, as I see in the puppet repo. I assume this is the case for ml-staging-codfw as well.
This would then be the reason why we are having trouble on Lift Wing when we try to use the flash attention package we built from source on ml-lab.

isarantopoulos lowered the priority of this task from High to Medium.Jan 28 2025, 2:38 PM