
[LLM] quantization: allow loading model weights as int8/int4 with HF
Closed, Resolved · Public · 5 Estimated Story Points

Description

As an engineer,

I want to load models using int8 weights in the transformers library so that I can use bigger LLMs that don't fit in the GPU's VRAM (e.g. aya23-35B).

We want to explore the post-training quantization options available for ROCm using PyTorch models. All experimentation should be done using the aya-expanse-8B and aya-expanse-32B models for now, as these are the ones we want to validate. Below is a screenshot from the Hugging Face docs which depicts the current status of the libraries and what is available for ROCm.

hf_quant.png (1×1 px, 252 KB)

The AMD ROCm docs are also a great resource for model quantization on ROCm.

Event Timeline

isarantopoulos renamed this task from [LLM] Allow loading model weights as int8 models with HF to [LLM] Allow loading model weights as int8 with HF.Oct 22 2024, 2:29 PM
isarantopoulos updated the task description. (Show Details)
isarantopoulos set the point value for this task to 3.
isarantopoulos moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.
isarantopoulos changed the point value for this task from 3 to 5.Nov 25 2024, 9:03 AM
isarantopoulos renamed this task from [LLM] Allow loading model weights as int8 with HF to [LLM] quantization: allow loading model weights as int8/int4 with HF.Dec 2 2024, 9:44 AM

bitsandbytes

There are two pages with similar installation instructions. The HF docs and the AMD docs

Building from source fails due to a ROCm misconfiguration.
I am setting the following env vars in order to be able to launch the build process:

export PYTORCH_NVCC=/usr/bin/hipcc
export DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode
export HIP_DEVICE_LIB_PATH=/opt/rocm/amdgcn/bitcode
export ROCM_PATH=/opt/rocm
export HIP_PATH=/opt/rocm

and then follow the AMD docs instructions

# Clone the github repo
git clone --recurse https://github.com/ROCm/bitsandbytes.git
cd bitsandbytes
git checkout rocm_enabled_multi_backend

# Install dependencies
pip install -r requirements-dev.txt

# Use -DBNB_ROCM_ARCH to specify target GPU arch
cmake -DBNB_ROCM_ARCH="gfx90a" -DCOMPUTE_BACKEND=hip -S .

# Compile the project
make

# Install
python setup.py install

I get a cmake error that the hipcc executable can't be found. The full error log is available in paste

Taking the easy route and installing the pre-built wheel seems to work

# Note, if you don't want to reinstall BNBs dependencies, append the `--no-deps` flag!
pip install --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-manylinux_2_24_x86_64.whl'
pip show bitsandbytes

Name: bitsandbytes
Version: 0.44.1.dev0+9315692
Summary: k-bit optimizers and matrix multiplication routines.
Home-page: https://github.com/TimDettmers/bitsandbytes
Author: Tim Dettmers
Author-email: dettmers@cs.washington.edu
License: MIT
Location: /home/isaranto/miniconda3/envs/flash-env/lib/python3.11/site-packages
Requires: numpy, torch
Required-by:

I am able to load the aya-expanse-8b model in 4-bit and run inference with it, occupying only 5.6 GB of VRAM.
I successfully tried the same with aya-expanse-32b in 4-bit: the memory footprint for the model is 17.97 GB, with an inference latency of 10 s for a sample that previously took 30 s.
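As a sanity check, the measured footprints line up with a back-of-the-envelope estimate (a rough sketch; it counts weights only and ignores quantization constants, embeddings kept in higher precision, and the KV cache):

```python
def quantized_weight_gb(n_params_billions: float, bits_per_weight: int) -> float:
    """Rough weight-only memory estimate: parameters * bits / 8 bytes, in GB."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1e9

# aya-expanse-32b at 4-bit: ~16 GB for the weights alone, in the same
# ballpark as the 17.97 GB measured above (the rest is quantization state,
# non-quantized layers, and buffers).
print(quantized_weight_gb(32, 4))  # 16.0
print(quantized_weight_gb(8, 4))   # 4.0, vs the 5.6 GB of VRAM observed
```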

However, when I attempt to load in 8-bit, the model loads but running inference fails (full error log):

Error: Matmul Algo Heuristic didn't return algorithms
error detected
........
 Exception: cublasLt ran into an error!

AWQ

Building from source according to the repo was unsuccessful.

I get a Python.h not found error. The complete error log is in paste.

                 from awq_ext/exllama/exllama_ext_hip.cpp:4:
/srv/pytorch-rocm/venv/lib/python3.11/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory
   12 | #include <Python.h>
      |          ^~~~~~~~~~
compilation terminated.
error: command '/usr/bin/g++' failed with exit code 1
[end of output]

I'll try to build this using a miniconda env.

AWQ

I was able to build and install AutoAWQ using a miniconda env with these steps. However, after loading aya-expanse-8b and attempting to quantize the model, I encountered an out-of-memory error. The complete error log is in this paste.

(awq-env) aikochou@ml-lab1001:~$ pip show autoawq-kernels
Name: autoawq_kernels
Version: 0.0.9+rocm614
Summary: AutoAWQ Kernels implements the AWQ kernels.
Home-page: https://github.com/casper-hansen/AutoAWQ_kernels
Author: Casper Hansen
Author-email: 
License: MIT
Location: /home/aikochou/miniconda3/envs/awq-env/lib/python3.11/site-packages
Requires: torch
Required-by: 

(awq-env) aikochou@ml-lab1001:~$ pip show autoawq
Name: autoawq
Version: 0.2.7.post2
Summary: AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
Home-page: https://github.com/casper-hansen/AutoAWQ
Author: Casper Hansen
Author-email: 
License: MIT
Location: /home/aikochou/miniconda3/envs/awq-env/lib/python3.11/site-packages
Requires: accelerate, datasets, tokenizers, torch, transformers, triton, typing-extensions, zstandard
Required-by:

AWQ

I was able to quantize aya-expanse-8b using AWQ. The quantized model is saved in my home directory at /home/aikochou/aya-expanse-8b-AWQ. While I could load the quantized model, the Jupyter kernel crashed when attempting to run inference.

To determine if this was specific to my quantized model, I tested a prebuilt quantized model Orion-zhen/aya-expanse-32b-AWQ. The result was the same—the Jupyter kernel crashed during inference.

The code and output logs for quantizing aya-expanse-8b, loading the quantized model, and running inference can be found in P71499. The logs for loading and running inference on the prebuilt quantized aya-expanse-32b can be found in P71500.

To determine if this issue was specific to the Aya model, I tested a different prebuilt model, TheBloke/zephyr-7B-beta-AWQ, which is referenced in their documentation. Running it in a Python script produced this error message:

Memory access fault by GPU node-1 (Agent handle: 0x7e472c0) on address 0x6fe53000. Reason: Unknown.
Aborted

The full logs can be found in P71505.

AWQ

I was able to run inference using quantized models, but I’m still figuring out the right settings for the Aya model on ROCm GPU.

Here are the issues I encountered and my findings:

  • Due to improper installation of the awq_ext module, the system skipped fusing modules—a large part of the speedup from AutoAWQ. Consequently, my latency tests showed that the non-quantized model performed faster than the quantized version. The inference code is available in P71579. After finding a similar issue, I reinstalled the AutoAWQ kernels, which successfully installed the AWQ extension.
  • When the fused modules were enabled, I encountered another error (full logs in P71581). Upon investigation, I discovered that fused modules aren't supported for AMD GPUs. Instead, AMD GPUs use ExLlamaV2 kernels. Following the authors' suggestion, I used AutoAWQForCausalLM.from_quantized(..., use_exllama_v2=True) for ROCm GPUs. While this worked, the model produced really poor quality responses (See the issue, P71582).

I'm now testing an alternative way to load the quantized model: using AutoModelForCausalLM.from_pretrained with quantization_config = AwqConfig(version="exllama") as referenced in Hugging Face's AWQ docs. This way appears to work correctly, but I have an issue related to bitsandbytes. (See P71594)
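For reference, the loading path described in this comment looks roughly like this (a sketch only: the local model path is an assumption from an earlier comment, and the snippet hasn't been retested):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

# Assumed local path to the AWQ-quantized model from the earlier comment.
model_path = "/home/aikochou/aya-expanse-8b-AWQ"

# version="exllama" selects the ExLlama kernels, which is what worked on ROCm
# instead of the fused modules (unsupported on AMD GPUs).
quantization_config = AwqConfig(version="exllama")

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
```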


@achou regarding the above error, try installing bitsandbytes from the following source. This is what should work with AMD GPUs according to the docs:

pip install --force-reinstall 'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-manylinux_2_24_x86_64.whl'

GPTQ

NOTE: Previously, I followed the HF GPTQ instructions and used pip install auto-gptq, which led to the surprising result of the quantized model being slower than its non-quantized counterpart. After following the AMD ROCm GPTQ instructions, I built auto-gptq from source, and this time around the quantized model is faster than its non-quantized counterpart.

I tested the inference performance of both the non-quantized and GPTQ-quantized versions of the aya-expanse-8b model and found that CohereForAI/aya-expanse-8b inference took around 6 seconds, while kevinbazira/aya-expanse-8b-gptq-4bit ran in about 3 seconds.

Since I was using a single prompt to compare inference speeds, I wanted to avoid drawing conclusions based solely on that. I decided to run the HF optimum benchmark to compare mean latency across each inference stage for both models. Below are the results:
(NB: Lower latency is better for inference speed)

aya-expanse-8b non-quantized vs quantized benchmark - Screenshot from 2024-12-12 10-39-10.png (720×1 px, 96 KB)

The GPTQ-4bit quantized version of aya-expanse-8b is slower during the loading stage. However, this can be managed by how we pre-load models in KServe isvcs. In contrast, it performs faster during the decoding stage, where its non-quantized counterpart spends much more time.
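For the single-prompt comparisons above, a tiny timing harness helps smooth out run-to-run noise (a minimal sketch; `generate_fn` is a placeholder for a call such as `lambda: model.generate(**inputs, max_new_tokens=256)`):

```python
import time

def mean_latency(generate_fn, n_runs: int = 5, warmup: int = 1) -> float:
    """Average wall-clock latency of generate_fn over n_runs, after warmup calls."""
    for _ in range(warmup):  # warmup hides one-off compilation/caching cost
        generate_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn()
    return (time.perf_counter() - start) / n_runs
```

This is still cruder than the optimum benchmark used above, which breaks latency down per inference stage, but it is a quick way to compare two model variants on the same prompt.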

bitsandbytes

I took a shot at deploying aya8b with bitsandbytes on Lift Wing using the LLM model server and got the following failures, which I would narrow down to 2 causes:

  • no access to /opt/rocm: the pod doesn't have access to this path on the host, which it needs in order to find the ROCm architecture using rocminfo. We should be able to bypass that by setting the env var ROCM_TARGET to gfx90a, but I tried once with no luck.
  • The second error we get could be due to the different versions used plus the inability to use rocminfo. It seems it is trying to use libbitsandbytes_cpu.so instead of a ROCm one. We are using the torch base image with PyTorch 2.3.0 and ROCm 6.0, while our test on ml-lab was done with torch 2.4.0 and ROCm 6.1. Looking at the contents of the site-packages, I don't see a binary for ROCm 6.0:
-rw-r--r-- 1 somebody somebody   32848 Dec  4 12:16 libbitsandbytes_cpu.so
-rw-r--r-- 1 somebody somebody 3561832 Dec  4 12:16 libbitsandbytes_rocm61.so
-rw-r--r-- 1 somebody somebody 3566032 Dec  4 12:16 libbitsandbytes_rocm62.so
kubectl logs -f aya-llm-predictor-00002-deployment-596854fc4d-f7q26 kserve-container
+ source common_settings.sh
+++ /usr/bin/python3 -c 'from python.resource_utils import get_cpu_count; print(get_cpu_count())'
/opt/lib/venv/lib/python3.11/site-packages/pytools/persistent_dict.py:52: RecommendedHashNotFoundWarning: Unable to import recommended hash 'siphash24.siphash13', falling back to 'hashlib.sha256'. Run 'python3 -m pip install siphash24' to install the recommended hash.
  warn("Unable to import recommended hash 'siphash24.siphash13', "
++ CPU_COUNT=8
++ echo 'CPU count detected from get_cpu_count: 8'
CPU count detected from get_cpu_count: 8
++ export OMP_NUM_THREADS=8
++ OMP_NUM_THREADS=8
+ MODEL_SERVER_PATH=src/models/llm/model.py
+ exec /usr/bin/python3 src/models/llm/model.py
/opt/lib/venv/lib/python3.11/site-packages/pytools/persistent_dict.py:52: RecommendedHashNotFoundWarning: Unable to import recommended hash 'siphash24.siphash13', falling back to 'hashlib.sha256'. Run 'python3 -m pip install siphash24' to install the recommended hash.
  warn("Unable to import recommended hash 'siphash24.siphash13', "
ERROR:bitsandbytes.cuda_specs:Could not detect ROCm GPU architecture: [Errno 2] No such file or directory: 'rocminfo'
WARNING:bitsandbytes.cuda_specs:
ROCm GPU architecture detection failed despite ROCm being available.

WARNING:bitsandbytes.cextension:Could not find the bitsandbytes ROCm binary at PosixPath('/opt/lib/venv/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_rocm60_nohipblaslt.so')
/opt/lib/venv/lib/python3.11/site-packages/bitsandbytes/backends/cpu_xpu_common.py:29: UserWarning: g++ not found, torch.compile disabled for CPU/XPU.
  warnings.warn("g++ not found, torch.compile disabled for CPU/XPU.")
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/srv/app/src/models/llm/model.py", line 138, in <module>
    model = llm_class(model_name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/srv/app/src/models/llm/aya/aya.py", line 14, in __init__
    super().__init__(model_name)
  File "/srv/app/src/models/llm/model.py", line 28, in __init__
    self.model, self.tokenizer = self.load()
                                 ^^^^^^^^^^^
  File "/srv/app/src/models/llm/aya/aya.py", line 23, in load
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4225, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 4728, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/modeling_utils.py", line 995, in _load_state_dict_into_meta_model
    hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)
  File "/opt/lib/venv/lib/python3.11/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 238, in create_quantized_param
    new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(target_device)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 338, in to
    return self._quantize(device)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 297, in _quantize
    w_4bit, quant_state = bnb.functional.quantize_4bit(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/bitsandbytes/functional.py", line 991, in quantize_4bit
    return backends[A.device.type].quantize_4bit(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/bitsandbytes/backends/cuda.py", line 508, in quantize_4bit
    lib.cquantize_blockwise_bf16_fp4(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/venv/lib/python3.11/site-packages/bitsandbytes/cextension.py", line 78, in __getattr__
    return getattr(self._lib, item)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ctypes/__init__.py", line 389, in __getattr__
    func = self.__getitem__(name)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ctypes/__init__.py", line 394, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: /opt/lib/venv/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cquantize_blockwise_bf16_fp4

For next steps I would suggest to:

  1. rerun bitsandbytes with an env that has torch 2.5.1 + ROCm 6.1
  2. if the above succeeds, add a new production image for the combination from 1; otherwise add a new image with torch 2.4.0 + ROCm 6.1, which has already been successfully tested.

bitsandbytes

It seems that rocminfo is required at runtime and there is no option for the user to provide the GPU architecture info otherwise. A PR to upstream would be a good idea so that this information can be provided via an environment variable.
More info can be found in the code here.
It seems that the ROCm fork provides the release I was using for the multi-backend refactor. The only evidence I found was the result of my code search on GitHub (no results in the original bnb repo, which hosts the release).

I opened an issue to ROCm/bitsandbytes about this.
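The environment-variable fallback proposed above could look roughly like this (a hypothetical sketch of the upstream change, not actual bitsandbytes code; `BNB_ROCM_ARCH` is an assumed variable name):

```python
import os
import re
import subprocess

def get_rocm_gpu_arch() -> "str | None":
    """Return the ROCm GPU architecture (e.g. 'gfx90a'), preferring an env var
    over shelling out to rocminfo, so pods without access to /opt/rocm still work."""
    arch = os.environ.get("BNB_ROCM_ARCH")
    if arch:
        return arch
    try:
        out = subprocess.check_output(["rocminfo"], text=True)
        match = re.search(r"Name:\s+(gfx[0-9a-f]+)", out)
        return match.group(1) if match else None
    except (FileNotFoundError, subprocess.CalledProcessError):
        # rocminfo unavailable (the Lift Wing failure mode above)
        return None
```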

AWQ

I finally managed to run AWQ quantized models properly! (Thanks to @MunizaA for pointing out that we need both use_exllama_v2=True and fuse_layers=False when loading the model <3).

The speedup from the prebuilt aya-expanse-32b-AWQ is mind-blowing! I made the following table so we can easily compare the results. The full inference code and outputs are available in P71635 and P71636.

  • Memory usage

| model | memory usage |
| aya-expanse-8b | 17G + 13G |
| aya-expanse-8b-AWQ | 3.3G + 3.5G |
| aya-expanse-32b | 52G + 62G |
| aya-expanse-32b-AWQ | 12G + 13G |
  • Inference latency

| generate | 256 tokens | 256 tokens (S) |
| aya-expanse-8b | 21 s | 9.07 s |
| aya-expanse-8b-AWQ | 41.8 s | 11.4 s |

| generate | 32 tokens | 256 tokens | 32 tokens (S) | 64 tokens (S) | 128 tokens (S) | 256 tokens (S) |
| aya-expanse-32b | 1min 6s | x | 1min 7s | 2min 14s | x | x |
| aya-expanse-32b-AWQ | 2.13 s | 15.2 s | 2.72 s | 4.02 s | 7.8 s | 8.81 s |

*(S) = using a Streamer

Using 1 GPU

| model | memory usage | 32 tokens | 256 tokens | 32 tokens (S) | 64 tokens (S) | 128 tokens (S) | 256 tokens (S) |
| aya-expanse-32b | 52.8G | 2min 21s | x | 2min 22s | 4min 40s | x | x |
| aya-expanse-32b-AWQ | 22.4G | 2.09 s | 12.4 s | 1.81 s | 3.3 s | 6.36 s | 12.6 s |
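Putting the working flags from the earlier comment into code (a sketch assuming the AutoAWQ `from_quantized` API as described above; not retested here):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "Orion-zhen/aya-expanse-32b-AWQ"

# fuse_layers=False because fused modules aren't supported on AMD GPUs;
# use_exllama_v2=True selects the ExLlamaV2 kernels that ROCm falls back to.
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    fuse_layers=False,
    use_exllama_v2=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```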

Change #1101491 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: add rocminfo executable for gfx90a

https://gerrit.wikimedia.org/r/1101491

Change #1101491 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: add dummy rocminfo executable for gfx90a

https://gerrit.wikimedia.org/r/1101491

Change #1101804 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: update torch image to 2.5.1

https://gerrit.wikimedia.org/r/1101804

Change #1101804 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: update torch image to 2.5.1

https://gerrit.wikimedia.org/r/1101804

bitsandbytes

Deployed bitsandbytes with aya-expanse-8B in the experimental namespace on ml-staging-codfw.
Since this uses the LLM model server, the API is different from the previous one that was using huggingfaceserver:

time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/aya-expanse-8B:predict" -X POST -d '{"prompt":"What are some popular machine learning frameworks in python?", "result_length": 100}'  -H "Content-Type: application/json" -H "Host: aya-llm.experimental.wikimedia.org"
{"model_name":"aya-expanse-8B","response":"<BOS_TOKEN>What are some popular machine learning frameworks in python?\nHere are 10 most popular machine learning frameworks in Python.\n- TensorFlow.\n- Keras.\n- PyTorch.\n- Scikit-learn.\n- XGBoost.\n- PyTorch Lightning.\n- H2O.\n- MXNet.\n- Fastai.\n- LightFM.\nHow do I get started with machine learning?\nGetting Started with Machine Learning – Step by Step Guide\n- Understand the Problem.\n- Collect and Prepare the Data"}
real	0m4.462s
user	0m0.016s
sys	0m0.007s

GPTQ

I tried the kevinbazira/aya-expanse-8b-gptq-4bit model and it performed fast. The full inference code and outputs are available in P71700, along with the steps I used to build and install AutoGPTQ.

  • Inference latency

| generate | 256 tokens | 256 tokens (S) |
| aya-expanse-8b | 21 s | 9.07 s |
| aya-expanse-8b-GPTQ | 6.75 s | 2.81 s |

For the next step, I'll use llmperf to generate heatmaps showing how input/output token variations affect latency in aya models, similar to this notebook.

Change #1105002 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: add inference_mode and g++

https://gerrit.wikimedia.org/r/1105002

Change #1105002 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: add inference_mode and g++

https://gerrit.wikimedia.org/r/1105002

Change #1108072 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: bump bitsandbytes version

https://gerrit.wikimedia.org/r/1108072

Change #1108072 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: bump bitsandbytes version

https://gerrit.wikimedia.org/r/1108072

Change #1108087 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: add inference_mode and g++

https://gerrit.wikimedia.org/r/1108087

Change #1108087 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: add inference_mode and g++

https://gerrit.wikimedia.org/r/1108087

GPTQ
Official ROCm docs mention the installation of AutoGPTQ as the first step. However, as noted in the readme of the AutoGPTQ repo:

AutoGPTQ development has stopped. Please switch to GPTQModel as drop-in replacement.

Also, when you try to build AutoGPTQ you get the following warning:

WARNING: AutoGPTQ has stopped development. Please transition to GPTQModel: https://github.com/ModelCoud/GPTQModel
GPTQModel has been merged into Transformers/Optimum and full deprecation of AutoGPTQ within HF frameworks is planned in the near-future.

I tried to set up a new env and test it:

  • Installed torch-rocm 2.5.1
  • tried both installing GPTQModel from pip and building it from source, and I'm getting the following error:
Trying to compile GPTQModel for CUDA, but Pytorch 2.5.1+rocm6.1 is installed without CUDA support.

Looking at the code in [[ SKIP_ROCM_VERSION_CHECK | setup.py ]] it seems that you need to define a cuda version. Will look more into that.

This is what I ran:

pip install --no-cache torch==2.5.1 --extra-index-url https://download.pytorch.org/whl/rocm6.1/
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
ROCM_VERSION=6.1 SKIP_ROCM_VERSION_CHECK=True pip install -v . --no-build-isolation

SKIP_ROCM_VERSION_CHECK is required, otherwise we get an error because only ROCm versions 6.2+ are tested. I'm going to switch to version 6.2 to test it properly.

The information from this task has been summarized and documented on a Wikitech page.