Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0)
Closed, Resolved | Public | 1 Estimated Story Points

Description

The Huggingface server has been bumped to pytorch 2.3.0. This also allows us to use one of the latest ROCm versions.

We will need to use a new pytorch base image from production images.
The procedure to follow is the one mentioned in the README of inference services.

Event Timeline

Change #1032777 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] huggingface: upgrade kserve to 0.13-rc0

https://gerrit.wikimedia.org/r/1032777

calbon set the point value for this task to 1. (May 21 2024, 2:33 PM)
calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

Change #1032777 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] huggingface: upgrade kserve to 0.13-rc0

https://gerrit.wikimedia.org/r/1032777

Change #1035476 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update hf image and remove nllb

https://gerrit.wikimedia.org/r/1035476

Currently getting a CrashLoopBackOff in the pod with the updated image. However, there is something I missed during the update: when it comes to ROCm support, the latest vllm doesn't support MI 100.

Requirements
OS: Linux

Python: 3.8 – 3.11

GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100)

ROCm 6.0 and ROCm 5.7

A nice thing is that there is more informative logging in the new version:

kubectl logs mistral-7b-instruct-gpu-predictor-00011-deployment-88d7bb4mrlbh
INFO:root:Copying contents of /mnt/models to local
WARNING 05-23 15:41:24 config.py:1086] Casting torch.bfloat16 to torch.float16.
INFO 05-23 15:41:24 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models)
INFO 05-23 15:41:24 pynccl_utils.py:17] Failed to import NCCL library: Cannot find librccl.so.1 in the system.
INFO 05-23 15:41:24 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
INFO 05-23 15:41:24 selector.py:37] Using ROCmFlashAttention backend.
WARNING 05-23 15:41:25 __init__.py:93] Model architecture MistralForCausalLM is partially supported by ROCm: Sliding window attention is not yet supported in ROCm's flash attention
INFO 05-23 15:41:37 model_runner.py:175] Loading model weights took 13.4966 GB

Currently investigating whether MI 100 (gfx908) is supported by vllm after all. Although the documentation quoted above says it isn't, there are mentions and PRs that suggest it is.
If it doesn't work, we'll have to go with the huggingface backend instead of vllm, but we would lose a lot of improvements, mostly in speed.
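
As a rough illustration of the check being discussed, here is a minimal Python sketch that compares a GPU architecture string against the list in the requirements quoted above. The helper name `is_documented_as_supported` and the dictionary are hypothetical. The sketch only reflects what the documentation states, not whether vllm actually runs on a given card.

```python
# Architectures copied from the vllm ROCm requirements quoted above.
# This mapping is illustrative and is not read from vllm itself.
DOCUMENTED_GFX = {
    "gfx90a": "MI200s",
    "gfx942": "MI300",
    "gfx1100": "Radeon RX 7900 series",
}


def is_documented_as_supported(gfx_arch: str) -> bool:
    """Return True if the architecture appears in the documented list.

    Architecture strings are sometimes reported with feature suffixes
    (for example "gfx90a:sramecc+:xnack-"), so only the part before the
    first colon is compared.
    """
    base = gfx_arch.strip().lower().split(":")[0]
    return base in DOCUMENTED_GFX


if __name__ == "__main__":
    for arch in ("gfx908", "gfx90a:sramecc+:xnack-"):
        print(arch, is_documented_as_supported(arch))
```

A negative result for gfx908 here only restates the documentation. Whether MI 100 works in practice still has to be verified on the hardware.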

Change #1036297 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: set command for hf image and remove nllb

https://gerrit.wikimedia.org/r/1036297

Change #1035476 abandoned by Ilias Sarantopoulos:

[operations/deployment-charts@master] ml-services: update hf image and remove nllb

Reason:

Covered by https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1036297

https://gerrit.wikimedia.org/r/1035476

After defining --backend=huggingface in the entrypoint command, the server starts properly, but I'm getting an error when I make a request:

time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/mistral-7b-instruct:predict" -X POST -d '{"instances": ["Who is president of the united states?"] }' -H  "Host: mistral-7b-instruct-gpu.experimental.wikimedia.org" -H "Content-Type: application/json"
{"error":"TypeError : 'HuggingfaceGenerativeModel' object is not callable"}

and the logs from the pod

Traceback (most recent call last):
  File "/opt/lib/python/site-packages/uvicorn/protocols/http/httptools_impl.py", line 436, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/opt/lib/python/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/opt/lib/python/site-packages/timing_asgi/middleware.py", line 70, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/lib/python/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/lib/python/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/opt/lib/python/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/opt/lib/python/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/opt/lib/python/site-packages/starlette/routing.py", line 74, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/kserve/protocol/rest/v1_endpoints.py", line 81, in predict
    response, response_headers = await self.dataplane.infer(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/kserve/protocol/dataplane.py", line 339, in infer
    response = await model(request, headers=headers)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'HuggingfaceGenerativeModel' object is not callable

Although the mistral model seems to be affected after the upgrade, I think these errors have nothing to do with the upgrade itself.
I'll investigate whether it can be fixed; otherwise, it may be more suitable to continue this work in another task.
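
For context on the error message itself: the last frame of the traceback shows the dataplane invoking the model object directly (`await model(request, headers=headers)`), which in Python requires the object's class to define `__call__`. The sketch below reproduces that generic mechanism with hypothetical stand-in classes (`ModelWithoutCall`, `ModelWithCall`) and a simplified `infer` function. It is not kserve code and does not explain why `HuggingfaceGenerativeModel` is not callable.

```python
import asyncio


class ModelWithoutCall:
    """Hypothetical model class that defines no __call__ method."""

    def __init__(self, name: str) -> None:
        self.name = name


class ModelWithCall(ModelWithoutCall):
    """Hypothetical model class that can be invoked like a function."""

    async def __call__(self, request: dict, headers=None) -> dict:
        return {"predictions": [f"echo: {item}" for item in request["instances"]]}


async def infer(model, request: dict, headers=None) -> dict:
    # Simplified stand-in for the call site shown in the traceback.
    try:
        return await model(request, headers=headers)
    except TypeError as exc:
        # Formatted like the JSON error body returned to curl above.
        return {"error": f"TypeError : {exc}"}


if __name__ == "__main__":
    payload = {"instances": ["Who is president of the united states?"]}
    print(asyncio.run(infer(ModelWithoutCall("demo"), payload)))
    print(asyncio.run(infer(ModelWithCall("demo"), payload)))
```

The first call fails before anything is awaited, because evaluating `model(...)` on an object without `__call__` raises the `TypeError` immediately.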

Change #1036297 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: set command for hf image and remove nllb

https://gerrit.wikimedia.org/r/1036297

The model server has been successfully upgraded to kserve v0.13.0 and uses the pytorch 2.3.0 - rocm 6.0 base image.