The aim of this task is to test AMD GPUs on our stack and identify the challenges/blockers (if any) of using them.
We want to test the following:
- run some open-source LLM models (see the inference sketch after this list)
- deploy/serve an LLM that was trained on an NVIDIA GPU (see the checkpoint-loading sketch below)
- KServe:
  - deploy/serve a model using KServe (see the InferenceService sketch below)
  - investigate how to share a GPU among multiple models and what the community is doing on this topic. There are two approaches we should explore (see the multi-model sketch below):
    - share a GPU among two or more pods
    - share a GPU among multiple models in one pod
- Kubeflow: run an example of a training pipeline where the training step/pod uses a GPU. To test this, we can install Kubeflow on minikube/kind and make the AMD GPU available. Test common frameworks (PyTorch, TensorFlow) with the GPU (see the pipeline sketch below).
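
For the first item, a minimal inference sketch. It assumes a ROCm build of PyTorch, which exposes AMD GPUs through the regular `torch.cuda` API, and uses `facebook/opt-125m` as a placeholder for whichever open-source model we pick:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ROCm builds of PyTorch report AMD GPUs via the CUDA-compatible API.
assert torch.cuda.is_available(), "no AMD GPU visible to PyTorch"

model_id = "facebook/opt-125m"  # placeholder; any open-source LLM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("AMD GPUs can run", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```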
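
For the NVIDIA-trained model item, PyTorch checkpoints are device-portable, so a state dict saved during training on an NVIDIA GPU should load onto an AMD GPU via `map_location`. A sketch, where `model_nvidia.pt` is a hypothetical checkpoint matching the placeholder architecture:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical state-dict checkpoint produced by a training run on an NVIDIA GPU.
state_dict = torch.load("model_nvidia.pt", map_location="cuda:0")

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # matching architecture
model.load_state_dict(state_dict)
model.to("cuda:0")  # under ROCm PyTorch, "cuda:0" is the AMD GPU
model.eval()
```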
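
For the KServe item, a sketch using the `kserve` Python SDK to create a v1beta1 InferenceService whose predictor requests an AMD GPU; `amd.com/gpu` is the resource name registered by the AMD Kubernetes device plugin. The service name and storage URI are placeholders:

```python
from kubernetes import client
from kserve import (
    KServeClient,
    constants,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1TorchServeSpec,
)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(name="llm-amd-demo", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            pytorch=V1beta1TorchServeSpec(
                storage_uri="gs://my-bucket/llm-model",  # placeholder
                resources=client.V1ResourceRequirements(
                    limits={"amd.com/gpu": "1"},  # AMD device plugin resource
                ),
            )
        )
    ),
)

KServeClient().create(isvc)
```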
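
For GPU sharing, the "multiple models in one pod" approach can be prototyped by loading several models into a single process on the same device; sharing one GPU across pods instead needs support at the device-plugin/scheduler level (e.g. time-slicing or partitioning), which is the part to research in the community. A single-process multi-model sketch with two placeholder models:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the single shared AMD GPU

# Two placeholder models resident on the same GPU, served from one process/pod.
model_ids = ["distilgpt2", "facebook/opt-125m"]
models = {
    mid: AutoModelForCausalLM.from_pretrained(mid, torch_dtype=torch.float16).to(device)
    for mid in model_ids
}
tokenizers = {mid: AutoTokenizer.from_pretrained(mid) for mid in model_ids}

def generate(mid: str, prompt: str) -> str:
    """Route a request to one of the co-located models."""
    inputs = tokenizers[mid](prompt, return_tensors="pt").to(device)
    out = models[mid].generate(**inputs, max_new_tokens=16)
    return tokenizers[mid].decode(out[0], skip_special_tokens=True)

print(generate("distilgpt2", "Sharing one GPU"))
```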
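
For the Kubeflow item, a minimal pipeline sketch using the KFP v1 SDK, where the training step requests an AMD GPU: `set_gpu_limit` with `vendor="amd"` puts an `amd.com/gpu` limit on the step's pod. The image and script path are placeholders:

```python
import kfp
from kfp import dsl

def train_op() -> dsl.ContainerOp:
    # Placeholder image/script; any ROCm-enabled training image would do.
    op = dsl.ContainerOp(
        name="train",
        image="rocm/pytorch:latest",
        command=["python", "/workspace/train.py"],
    )
    # Request one AMD GPU for the training pod (sets the amd.com/gpu limit).
    op.set_gpu_limit("1", vendor="amd")
    return op

@dsl.pipeline(name="amd-gpu-training", description="Training step on an AMD GPU")
def training_pipeline():
    train_op()

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```

The same check covers the "common frameworks" point: if the compiled pipeline's training container runs PyTorch or TensorFlow and sees the GPU (e.g. `torch.cuda.is_available()` returns True), the AMD GPU is being scheduled and exposed correctly.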