
Test the feasibility of deployment of Aya-expanse model in LiftWing
Closed, ResolvedPublic

Description

We are using the Aya23 model to generate simple summaries of (sections of) Wikipedia articles. In the initial experiments, we have used Cohere's API endpoint (see this PAWS-notebook for an example).

In this task, we want to figure out whether we could host the model in LiftWing. There are two versions:

  • Aya-23-8B
  • Aya-23-35B

Alternatively, we also consider the next generation of this model, Aya-expanse, because i) it is supposedly strictly better than Aya-23, ii) the larger version has a slightly smaller memory footprint, and iii) it supports the same 23 languages as Aya-23.

Additional notes:

  • Scope: The aim of this task is to test whether we can host one of these models as a proof-of-concept and not as a production-ready service. Most likely, if the model can be hosted, it will require additional work around optimization which will be captured in follow-up work.
  • Context: This work supports hypothesis WE.3.1.3: If we develop models for remixing content such as a content simplification or summarization that can be hosted and served via our infrastructure (e.g. LiftWing), we will establish the technical direction for work focused on increasing reader retention through new content discovery features.

Event Timeline

Update: the Aya-23-8B model runs successfully in LiftWing. Thanks @isarantopoulos!

Example query (works only internally, e.g. from stat-machines)

curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -H "Host: aya23.experimental.wikimedia.org" -H "Content-Type: application/json" -X POST -d '{"model": "aya23", "prompt": ".", "max_tokens": 100}'
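The same query can be made from Python; a minimal sketch using only the standard library (equivalent to the curl command above, and likewise only reachable internally):

```python
import json
import urllib.request

# Staging endpoint for the aya23 test deployment (internal-only).
ENDPOINT = "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions"


def build_request(prompt: str, max_tokens: int = 100) -> tuple[dict, bytes]:
    """Build the headers and JSON body for the aya23 completions endpoint."""
    headers = {
        "Host": "aya23.experimental.wikimedia.org",
        "Content-Type": "application/json",
    }
    body = json.dumps(
        {"model": "aya23", "prompt": prompt, "max_tokens": max_tokens}
    ).encode()
    return headers, body


if __name__ == "__main__":
    headers, body = build_request(".")
    req = urllib.request.Request(ENDPOINT, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```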

Update: @isarantopoulos did first experiments with the Aya-23-35B model. It does not work out of the box. The raw version of the model is 65GB on disk and does not fit into memory. We will explore some potential workarounds using a quantized model, e.g. using the int8 datatype, to reduce the footprint so that it's compatible with our infrastructure.
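For reference, a back-of-envelope estimate of the weight footprint (assuming ~2 bytes per parameter in bf16 and ~1 byte in int8, ignoring activations and the KV cache):

```python
def approx_weights_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Rough memory needed just to hold the model weights, in GiB."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3


bf16_gb = approx_weights_gb(35, 2)  # ~65 GB, matching the on-disk size above
int8_gb = approx_weights_gb(35, 1)  # ~33 GB, roughly half
```

This is why int8 quantization is attractive: it brings the 35B model within the memory budget of a single GPU.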

@MGerlach shall we shift the focus to the new aya-expanse models? Specifically the aya-expanse-8B and aya-expanse-32B.

The 32B variant is 61GB on disk and a little bit more when loaded on GPU, making it a tight fit for one GPU, but it may work.
I tested it on ml-lab and was able to load it and run inference. It took 33 seconds 😵 but that's a start.

Screenshot 2024-11-08 at 4.02.57 PM.png (337×939 px, 35 KB)

@isarantopoulos Thanks for the updates

shall we shift the focus to the new aya-expanse models? Specifically the aya-expanse-8B and aya-expanse-32B.

yes, I think this makes sense. From what I understand, the new Aya-expanse should be strictly better than the older aya-23. Thus, test-deploying the aya-expanse is useful for this hypothesis as we would likely switch to that model in future experiments (especially if the model is slightly smaller so it can fit into memory, even if it is a close call).

The 32B variant is 61GB on disk and a little bit more when loaded on GPU, making it a tight fit for one GPU, but it may work. I tested it on ml-lab and was able to load it and run inference. It took 33 seconds 😵 but that's a start.

  • Do you have a sense of how this compares to the 8B version? As a comparison, the test-deployment of aya-23-8b (T379052#10291972) is much faster.
  • Do you have a sense of the options to reduce this time (substantially) for a test-deployment?

I made a first attempt to deploy the 32B model on LiftWing and I'm dumping some notes for future reference:

It seems that the model couldn't fit on the GPU, so I got the following error:

kubectl logs aya23-predictor-00007-deployment-5c7dc5c886-r64dl
2024-11-08 17:22:58.133 7 kserve INFO [storage.py:download():66] Copying contents of /mnt/models to local
2024-11-08 17:22:58.134 7 kserve INFO [storage.py:download():110] Successfully copied /mnt/models to None
2024-11-08 17:22:58.134 7 kserve INFO [storage.py:download():111] Model downloaded in 0.00038403994403779507 seconds.
2024-11-08 17:22:58.135 7 kserve INFO [__main__.py:load_model():204] Loading generative model for task 'text_generation' in torch.bfloat16
2024-11-08 17:22:58.500 7 kserve INFO [generative_model.py:load():206] Decoder-only model detected. Setting padding side to left.
2024-11-08 17:22:58.915 7 kserve INFO [generative_model.py:load():223] Successfully loaded tokenizer
Loading checkpoint shards: 100%|██████████| 14/14 [00:49<00:00,  3.56s/it]
WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
WARNING:accelerate.big_modeling:You shouldn't move a model that is dispatched using accelerate hooks.
2024-11-08 17:23:49.317 7 kserve ERROR [__main__.py:<module>():259] Failed to start model server: You can't move a model that has some modules offloaded to cpu or disk.

After an attempt to redeploy with the model's default dtype (the one used on ml-lab), the pod was evicted due to low ephemeral storage. Leaving it for now; will circle back to it.

Warning  Evicted    90s    kubelet            The node was low on resource: ephemeral-storage.

Do you have a sense of how this compares to the 8B version? As a comparison, the test-deployment of aya-23-8b (T379052#10291972) is much faster.

Default latency was approx. 1-6s depending on the size of the requested output (by "default" I mean without any optimizations).

Do you have a sense of the options to reduce this time (substantially) for a test-deployment?

There are various options, and I think we should use as many as we can. The first things to try are lower precision, quantization, and inference optimization frameworks (vLLM). We (the ML team) need to provide the necessary tooling for the AMD GPUs: basically, build a Docker image to use vLLM. I see CohereForCausalLM listed in vLLM's supported architectures, although there is no reference to Aya-23 or Aya-expanse, just Command-R.
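A minimal sketch of what offline inference with vLLM could look like once a ROCm-compatible image exists (the engine arguments and max_model_len value are assumptions, and this relies on vLLM's CohereForCausalLM support covering the aya checkpoints):

```python
def make_engine_args(model: str, dtype: str = "float16") -> dict:
    """Engine arguments we might pass to vllm.LLM (values are assumptions)."""
    return {"model": model, "dtype": dtype, "max_model_len": 4096}


if __name__ == "__main__":
    # Requires a GPU-enabled vllm installation (not available on stat-machines).
    from vllm import LLM, SamplingParams

    llm = LLM(**make_engine_args("CohereForAI/aya-expanse-8b"))
    outputs = llm.generate(["Summarize: ..."], SamplingParams(max_tokens=100))
    print(outputs[0].outputs[0].text)
```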

Change #1088609 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update aya model deployment to aya-expanse-8b

https://gerrit.wikimedia.org/r/1088609

Change #1088609 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update aya model deployment to aya-expanse-8b

https://gerrit.wikimedia.org/r/1088609

The aya-expanse-8B model has been deployed.
Example request:

curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -H "Host: aya.experimental.wikimedia.org" -H "Content-Type: application/json" -X POST -d '{"model": "aya-expanse-8B", "prompt": ".", "max_tokens": 100}'

Just pasting an update.
I've loaded the 32B model on ml-lab using accelerate, and it used ~54GB of GPU VRAM and only 5-7GB of CPU memory. Previous attempts to "just load the model" with transformers used 75GB of CPU memory.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

model_name = "CohereForAI/aya-expanse-32b"
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

tokenizer = AutoTokenizer.from_pretrained(model_name)

weights_location = "/srv/hf-cache/hub/models--CohereForAI--aya-expanse-32b/snapshots/c1df2547e1f5fe22e1f4897f980f231dc74cfc27"
model = load_checkpoint_and_dispatch(
    model, checkpoint=weights_location, device_map="auto", dtype=torch.float16
)

However, this way the model weights are not properly loaded and the model can't be used. Will do more work on this to figure it out. Ideally I'd like to use a custom device_map that works best for each model.
For this work we will probably use custom model servers, as the huggingfaceserver from KServe is not that easy to customize.
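One possible route to a custom device_map is accelerate's infer_auto_device_map with an explicit max_memory budget; a sketch (the GiB caps are assumptions loosely based on the ~54GB VRAM observed above):

```python
def max_memory_budget(gpu_gib: int = 54, cpu_gib: int = 8) -> dict:
    """Per-device memory caps in the format accelerate expects."""
    return {0: f"{gpu_gib}GiB", "cpu": f"{cpu_gib}GiB"}


if __name__ == "__main__":
    # Requires transformers/accelerate and access to the model config.
    from accelerate import infer_auto_device_map, init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM

    config = AutoConfig.from_pretrained("CohereForAI/aya-expanse-32b")
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_config(config)
    device_map = infer_auto_device_map(empty_model, max_memory=max_memory_budget())
    # The resulting map can be passed as device_map= to load_checkpoint_and_dispatch.
    print(device_map)
```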

Change #1100441 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] (WIP) llm: add aya with bitsandbytes

https://gerrit.wikimedia.org/r/1100441

Change #1100441 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: add aya with bitsandbytes

https://gerrit.wikimedia.org/r/1100441

Change #1100997 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: add aya to __init__

https://gerrit.wikimedia.org/r/1100997

Change #1101000 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: revamp llm model server with aya-8B

https://gerrit.wikimedia.org/r/1101000

Change #1100997 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: add aya to __init__

https://gerrit.wikimedia.org/r/1100997

Made an attempt to load the 32B model in a deployment on Lift Wing with bitsandbytes and got the following error on model load:

/opt/lib/venv/lib/python3.11/site-packages/bitsandbytes/backends/cpu_xpu_common.py:29: UserWarning: g++ not found, torch.compile disabled for CPU/XPU.
  warnings.warn("g++ not found, torch.compile disabled for CPU/XPU.")
Loading checkpoint shards:   0%|          | 0/14 [00:00<?, ?it/s]Error invalid device function at line 82 in file /src/csrc/ops.hip

Change #1101000 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: revamp llm model server with aya-8B

https://gerrit.wikimedia.org/r/1101000

isarantopoulos renamed this task from "Test the feasibility of deployment of Aya-23 model in LiftWing" to "Test the feasibility of deployment of Aya-expanse model in LiftWing". Jan 14 2025, 3:48 PM

We have concluded that the aya-expanse-32B model can be hosted on LiftWing, but to serve it efficiently we'll need to use a vLLM image. The path to do this will be through porting the vLLM image from Ubuntu to a Debian-based one, instead of building and maintaining custom wheels.
We have run the model on the existing MI210 GPUs as well as on the MI300X (on a test machine that the vendor provided us).

I'm resolving this task; the related work will be continued and documented in T391941: Q4 24-25 Goal: Simple article summaries: Set up the software stack for efficiently serving production LLMs, as well as its subtasks.