
Use Huggingface model server image for HF LLMs
Closed, ResolvedPublic5 Estimated Story Points

Description

There is a plan to include a prebuilt model server for LLMs, very close to what we were discussing, which is also based on the vLLM runtime (kserve already has an experimental vLLM runtime).
More specifically, the Huggingface model server is the implementation that provides out-of-the-box support for HF models.
Pasting from the README.md:

The Huggingface serving runtime implements a runtime that can serve huggingface transformer based model out of the box. The preprocess and post-process handlers are implemented based on different ML tasks, for example text classification, token-classification, text-generation, text2text generation. Based on the performance requirement, you can choose to perform the inference on a more optimized inference engine like triton inference server and vLLM for text generation.
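For context, the task-specific handling described above can be reproduced with a plain transformers pipeline. A minimal, hedged illustration (assuming access to the HF hub; fill-mask with bert-base-uncased, the same model used for testing later in this task):

# Illustration only: not the kserve runtime itself, just the kind of transformers
# task (fill-mask) that the model server wraps with pre/post-processing.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("The capital of France is [MASK].")
print(predictions[0]["token_str"])  # expected: "paris"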

This involves a custom runtime server, which means that we'll need to mirror/upload the image to our docker registry in order to use it from kserve.
So far this seems like the most promising solution, as we won't have to maintain the dependencies ourselves and we can engage more with the community and contribute if we need something that isn't supported yet.
The full HF model server with vLLM integration is expected in kserve 0.12, along with the new generate endpoint.
As part of this task we'll:

  • Add the upstream huggingface model server docker image to WMF's docker registry so that we can use it in Lift Wing.
  • Test its support for ROCm and AMD GPUs: if it works out of the box (as recent HF versions suggest) we are good to go; otherwise we'll build another image based on it that includes the ROCm version of PyTorch, using the kserve HF image as the base and building a new one with Blubber, in the same way we do for the rest of the inference services. A quick check like the sketch after this list can tell us whether the ROCm build of torch is actually in use.
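A minimal smoke test for the ROCm point above could look like the following; this is an assumption of how we'd verify support, not something shipped with the upstream server (the ROCm build of torch exposes AMD GPUs through the CUDA API):

# Hedged ROCm/GPU smoke test, run inside the image on a host with an AMD GPU.
import torch

print("torch:", torch.__version__)        # e.g. "2.1.2+rocm5.5" for a ROCm wheel
print("hip runtime:", torch.version.hip)  # None means a CPU/CUDA-only build
print("gpu visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))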

Event Timeline

isarantopoulos set the point value for this task to 5.
isarantopoulos triaged this task as Medium priority.
isarantopoulos moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

We'll need to adapt the original image to use Debian, so that we align with WMF's recommendations for production images on k8s.

I'm proceeding to adapt the upstream image to use Debian with ROCm, as the original one can't be built without a CUDA runtime.

I have started adding an image in a fork of the kserve repo. After it is finished and tested I'll create a patch to add it to the production-images repo.

I'm proceeding with creating a Blubber image instead of going through production-images and handling the dependencies there. After running what upstream provides locally, I saw that we don't really need anything special.

Some of the links in the PyTorch package index seem to be broken today (at least that is when I noticed it). The links under https://download.pytorch.org/whl/rocm5.5 should resolve to https://download.pytorch.org/whl/rocm5.5/{package_name}, e.g. clicking on torch should redirect to https://download.pytorch.org/whl/rocm5.5/torch.
This results in a flaky image build: when the links misbehave as described above we end up with the default PyTorch wheel, with no ROCm support and all the nvidia packages pulled in.

I opened a GH issue on the pytorch repo about it.

Change 1009783 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] huggingface: add huggingface image

https://gerrit.wikimedia.org/r/1009783

I managed to build the Huggingface image with Blubber, downloading the specified model from HF (tested with bert-base-uncased). I'm currently looking at what the best strategy for creating the image would be.
At the moment I'm cloning the kserve repo and using the huggingfaceserver module directly, but I'm exploring whether committing these files to our repository would work better in the long run.
Next steps:

  • Create a README and open up the attached patch for review.
  • Try to run it with a model that exists locally, since this will be the standard way we'll use it. Our pods won't have access to the HF repository, so we want to load the model from disk (the same way we do with the models in the LLM image). At the moment I'm running into some errors, so I'm checking whether the cause is on my side or in the kserve code (example of the errors; not really important right now, and I don't know yet if I'm using the arguments properly).

I've managed to make it work with a model available on disk (which means no connection to the HF repo). The issues I faced were specific to the example: it was also trying to load the coreml directory as a model, so it was failing (see model directory structure here). However, I successfully loaded the bloom-560m and nllb models.
I see there are some open issues on GH, so I expect this will become more stable as we move forward.
For now we will have to test each model we want to deploy, without taking for granted that all models will work. Also, when using the GPU with vLLM it is important to remember that not all models are supported (see the list of models supported by vLLM).

Change 1011303 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[integration/config@master] ml-services: add huggingface pipelines

https://gerrit.wikimedia.org/r/1011303

Change 1011303 merged by jenkins-bot:

[integration/config@master] ml-services: add huggingface pipelines

https://gerrit.wikimedia.org/r/1011303

Change 1013161 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[integration/config@master] ml-services: fix huggingface pipeline triggers

https://gerrit.wikimedia.org/r/1013161

Change 1013161 merged by jenkins-bot:

[integration/config@master] ml-services: fix huggingface pipeline triggers

https://gerrit.wikimedia.org/r/1013161

Mentioned in SAL (#wikimedia-releng) [2024-03-21T09:51:28Z] <hashar> Reloaded Zuul for "ml-services: fix huggingface pipeline triggers" https://gerrit.wikimedia.org/r/1013161 | T357986

We'll be using the PyTorch ROCm image based on Debian Bookworm for this image (see #T360638).
We also need to either copy the code from the directory https://github.com/kserve/kserve/tree/master/python/huggingfaceserver/huggingfaceserver or pin to a specific commit on the master branch: new changes are being introduced all the time, so if we use the release version we don't get the fixes, while if we just track the master branch we may end up with a failed deployment (which actually just happened, due to a commit on kserve from last week).
I'm exploring which of the two options (copying the code or pinning to a specific commit) would be best for now.

After thinking about this and trying various things out (copying code or using a specific commit) I found the following 2 issues we need to resolve:

  • Upstream keeps making changes, as huggingfaceserver is still in active development even though it has been released in v0.12, so we may end up with an unstable build if we just use the master branch (as also mentioned in the previous comment).
  • huggingfaceserver has its own pyproject.toml where some dependencies are declared. One of them is PyTorch, and this results in the CPU version being installed. This is problematic when we use the pytorch-rocm base docker image, as it overwrites the system torch version. With a standard Debian image we could just flip the order of installation (huggingfaceserver first, then pytorch-rocm), but that leaves things quite unstable and random.

To mitigate both of the above issues I have created a fork of the kserve repo under the wikimedia organization on GitHub, with a branch named liftwing. There we can change the pyproject.toml to include the torch dependency from the index https://download.pytorch.org/whl/rocm5.7, and it also allows us to upgrade the version used in this module so that we can use the base image.
This way we can sync the fork on GitHub, merge the master branch into liftwing, and deal with any merge conflicts manually.

I'm facing issues trying to update the huggingfaceserver dependencies to use torch 2.2.1. I've reached a point where I'm blocked because vLLM requires torch 2.1.2, so my suggestion is to create another base image with torch-rocm 2.1.2 and ROCm 5.5, which is available for that version.

Change #1015297 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/docker-images/production-images@master] Add new version for amd-pytorch image

https://gerrit.wikimedia.org/r/1015297

There is an open pull request in the vLLM repo to upgrade PyTorch support to 2.2.1; just leaving this here as a reference.

It seems that the most promising solution at the moment, one that would not require so many hacks and forks, is to use a pytorch 2.1.2 and rocm5.5 base image.
I was able to build the image, which ended up with the following dependencies:

aiohttp==3.9.3
aiohttp-cors==0.7.0
aiosignal==1.3.1
annotated-types==0.6.0
anyio==4.3.0
async-timeout==4.0.3
attrs==23.2.0
azure-core==1.30.1
azure-identity==1.15.0
azure-storage-blob==12.19.1
azure-storage-file-share==12.15.0
boto3==1.34.73
botocore==1.34.73
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloudevents==1.10.1
cmake==3.29.0.1
colorful==0.5.6
cryptography==42.0.5
deprecation==2.1.0
distlib==0.3.8
fastapi==0.108.0
filelock==3.13.3
frozenlist==1.4.1
fsspec==2024.3.1
google-api-core==2.18.0
google-auth==2.29.0
google-cloud-core==2.4.1
google-cloud-storage==2.16.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.63.0
grpcio==1.62.1
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.26.0
huggingface-hub==0.22.1
# Editable install with no version control (huggingfaceserver==0.12.0)
-e /srv/app/kserve_repo/python/huggingfaceserver
idna==3.6
isodate==0.6.1
Jinja2==3.1.3
jmespath==1.0.1
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kserve @ file:///srv/app/kserve_repo/python/kserve
kubernetes==29.0.0
lit==18.1.2
MarkupSafe==2.1.5
mpmath==1.3.0
msal==1.28.0
msal-extensions==1.1.0
msgpack==1.0.8
multidict==6.0.5
networkx==3.2.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
opencensus==0.11.4
opencensus-context==0.1.3
orjson==3.10.0
packaging==24.0
pandas==2.2.1
platformdirs==4.2.0
portalocker==2.8.2
prometheus-client==0.13.1
proto-plus==1.23.0
protobuf==3.20.3
psutil==5.9.8
py-spy==0.3.14
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.21
pydantic==2.6.4
pydantic_core==2.16.3
PyJWT==2.8.0
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytorch-triton-rocm==2.0.1
pytz==2024.1
PyYAML==6.0.1
ray==2.10.0
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
requests-oauthlib==2.0.0
rpds-py==0.18.0
rsa==4.9
s3transfer==0.10.1
safetensors==0.4.2
six==1.16.0
smart-open==7.0.4
sniffio==1.3.1
starlette==0.32.0.post1
sympy==1.12
tabulate==0.9.0
timing-asgi==0.3.1
tokenizers==0.15.2
torch==2.1.2
tqdm==4.66.2
transformers==4.37.2
triton==2.1.0
typing_extensions==4.10.0
tzdata==2024.1
urllib3==1.26.18
uvicorn==0.21.1
uvloop==0.19.0
virtualenv==20.25.1
watchfiles==0.21.0
websocket-client==1.7.0
websockets==12.0
wrapt==1.16.0
yarl==1.9.4

At the moment I am getting the following error with ray serve

Traceback (most recent call last):
  File "<frozen runpy>", line 189, in _run_module_as_main
  File "<frozen runpy>", line 148, in _get_module_details
  File "<frozen runpy>", line 112, in _get_module_details
  File "/srv/app/huggingfaceserver/__init__.py", line 15, in <module>
    from .model import HuggingfaceModel  # noqa # pylint: disable=unused-import
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/app/huggingfaceserver/model.py", line 20, in <module>
    from kserve.model import PredictorConfig
  File "/opt/lib/python/site-packages/kserve/__init__.py", line 18, in <module>
    from .model_server import ModelServer
  File "/opt/lib/python/site-packages/kserve/model_server.py", line 27, in <module>
    from ray.serve.handle import RayServeHandle
ImportError: cannot import name 'RayServeHandle' from 'ray.serve.handle' (/opt/lib/python/site-packages/ray/serve/handle.py)

There is an open issue already for this on kserve GH. I am trying to see if I can find a workaround when using the wikimedia kserve fork.
I created the fork for 2 reasons:

  • Pin to a specific commit so that we can get a reproducible build.
  • Change the huggingfaceserver project requirements to either include torch from rocm or remove it completely.

I thought I would avoid the Ray Serve error, but it seems that we still end up with ray 2.10.0, which causes this issue. I'll try to see where this dependency is coming from (a rough debugging sketch is below) and perhaps change the requirement and pin it to 2.9.2.
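A rough debugging sketch for tracking down which installed package declares the ray dependency (run inside the built image; the matching is deliberately loose):

# List installed distributions that require ray, plus the resolved ray version.
from importlib.metadata import distributions, version

print("installed ray ==", version("ray"))
for dist in distributions():
    for req in dist.requires or []:
        if req.replace(" ", "").lower().startswith("ray"):
            print(dist.metadata["Name"], "requires", req)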

The HF image now works! The above issue seems to have been fixed in one of the latest commits. However, to avoid any future issues (breaking changes etc.) we are still using a wikimedia fork of the kserve repo for the huggingfaceserver package (and kserve).
Using a fork makes updating a somewhat tedious task, but it will allow us to have deterministic builds until things are more stable (i.e. a new kserve release).
For now we can:

  • Sync the fork via GH
  • Rebase the liftwing branch on top of master and test it.

I'll make sure to add this to the README.md file as well.

I've built an image by explicitly defining all the requirements (instead of letting pip resolve the dependencies on its own). To do this we set no-dep: true in the Blubber configuration, which is the equivalent of running pip install --no-deps.
This resulted in a reduced image size of 10.9GB (down from 13.9GB) and a much smaller docker layer.
In the output of docker history we see that the biggest layer size is 622MB:

IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
f357017f3949   42 minutes ago   LABEL blubber.variant=production blubber.ver…   0B        buildkit.dockerfile.v0
<missing>      42 minutes ago   ENTRYPOINT ["./entrypoint.sh"]                  0B        buildkit.dockerfile.v0
<missing>      42 minutes ago   COPY huggingface_modelserver/entrypoint.sh .…   321B      buildkit.dockerfile.v0
<missing>      42 minutes ago   COPY /srv/app/kserve_repo/python/huggingface…   418kB     buildkit.dockerfile.v0
<missing>      42 minutes ago   COPY /opt/lib/python/site-packages /opt/lib/…   622MB     buildkit.dockerfile.v0
<missing>      4 hours ago      ENV PATH=/opt/lib/python/site-packages/bin:/…   0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ENV MODEL_DIR=/mnt/models PYTHONPATH=/srv/app   0B        buildkit.dockerfile.v0
<missing>      4 hours ago      WORKDIR /srv/app                                0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ENV HOME=/home/somebody                         0B        buildkit.dockerfile.v0
<missing>      4 hours ago      USER 65533                                      0B        buildkit.dockerfile.v0
<missing>      4 hours ago      RUN |6 LIVES_AS=somebody LIVES_UID=65533 LIV…   9.07kB    buildkit.dockerfile.v0
<missing>      4 hours ago      ARG RUNS_GID=900                                0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ARG RUNS_UID=900                                0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ARG RUNS_AS=runuser                             0B        buildkit.dockerfile.v0
<missing>      4 hours ago      RUN |3 LIVES_AS=somebody LIVES_UID=65533 LIV…   0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ARG LIVES_GID=65533                             0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ARG LIVES_UID=65533                             0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ARG LIVES_AS=somebody                           0B        buildkit.dockerfile.v0
<missing>      4 hours ago      RUN /bin/sh -c apt-get update && apt-get ins…   1.53MB    buildkit.dockerfile.v0
<missing>      4 hours ago      ENV DEBIAN_FRONTEND=noninteractive              0B        buildkit.dockerfile.v0
<missing>      4 hours ago      ENV HOME=/root                                  0B        buildkit.dockerfile.v0
<missing>      4 hours ago      USER 0                                          0B        buildkit.dockerfile.v0

If we compare to the previous build, where the same layer is 3.63GB:

IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
c74affa96f99   2 hours ago   LABEL blubber.variant=production blubber.ver…   0B        buildkit.dockerfile.v0
<missing>      2 hours ago   ENTRYPOINT ["./entrypoint.sh"]                  0B        buildkit.dockerfile.v0
<missing>      2 hours ago   COPY huggingface_modelserver/entrypoint.sh .…   321B      buildkit.dockerfile.v0
<missing>      2 hours ago   COPY /srv/app/kserve_repo/python/huggingface…   418kB     buildkit.dockerfile.v0
<missing>      2 hours ago   COPY /opt/lib/python/site-packages /opt/lib/…   3.63GB    buildkit.dockerfile.v0
<missing>      5 hours ago   ENV PATH=/opt/lib/python/site-packages/bin:/…   0B        buildkit.dockerfile.v0
<missing>      5 hours ago   ENV MODEL_DIR=/mnt/models PYTHONPATH=/srv/app   0B        buildkit.dockerfile.v0
<missing>      5 hours ago   WORKDIR /srv/app                                0B        buildkit.dockerfile.v0
<missing>      5 hours ago   ENV HOME=/home/somebody                         0B        buildkit.dockerfile.v0
<missing>      5 hours ago   USER 65533                                      0B        buildkit.dockerfile.v0
<missing>      5 hours ago   RUN |6 LIVES_AS=somebody LIVES_UID=65533 LIV…   9.07kB    buildkit.dockerfile.v0
<missing>      5 hours ago   ARG RUNS_GID=900                                0B        buildkit.dockerfile.v0
<missing>      5 hours ago   ARG RUNS_UID=900                                0B        buildkit.dockerfile.v0
<missing>      5 hours ago   ARG RUNS_AS=runuser                             0B        buildkit.dockerfile.v0
<missing>      5 hours ago   RUN |3 LIVES_AS=somebody LIVES_UID=65533 LIV…   0B        buildkit.dockerfile.v0
<missing>      5 hours ago   ARG LIVES_GID=65533                             0B        buildkit.dockerfile.v0
<missing>      5 hours ago   ARG LIVES_UID=65533                             0B        buildkit.dockerfile.v0
<missing>      5 hours ago   ARG LIVES_AS=somebody                           0B        buildkit.dockerfile.v0
<missing>      5 hours ago   RUN /bin/sh -c apt-get update && apt-get ins…   1.53MB    buildkit.dockerfile.v0
<missing>      5 hours ago   ENV DEBIAN_FRONTEND=noninteractive              0B        buildkit.dockerfile.v0
<missing>      5 hours ago   ENV HOME=/root                                  0B        buildkit.dockerfile.v0
<missing>      5 hours ago   USER 0                                          0B        buildkit.dockerfile.v0

At this point I believe/assume that the issue lies with the torch package. Although we install torch in the base image, its metadata is different, as it is named torch==2.1.2+rocm5.5 (as seen after running pip freeze). The result is that even if I define the extra index URL as a source for torch in the huggingfaceserver pyproject.toml, we still end up with the CPU version of torch being downloaded and ~3GB of nvidia packages installed, because the metadata mismatch confuses dependency resolution: some of the packages (e.g. accelerate) require torch and can't find it.
The above is just an assumption, but my suggestion is to explicitly specify all the requirements in the requirements.txt file so that we have control over what is installed. The upside is that we get a 100% deterministic build; the downside is that we have to update versions manually after installing the packages, running pip freeze, and updating what is needed. A sanity check like the sketch below could catch the wrong torch build early.
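A check along these lines could be run at image build/test time; it is a sketch under the assumption that we want the ROCm wheel and no CUDA wheels at all:

# Fail fast if the explicit requirements still pulled in the CPU/CUDA torch build.
import torch
from importlib.metadata import distributions

assert "+rocm" in torch.__version__, f"unexpected torch build: {torch.__version__}"
assert torch.version.hip is not None, "torch was built without ROCm/HIP support"

nvidia_pkgs = [d.metadata["Name"] for d in distributions()
               if (d.metadata["Name"] or "").lower().startswith("nvidia-")]
assert not nvidia_pkgs, f"unexpected CUDA wheels present: {nvidia_pkgs}"
print("ok:", torch.__version__)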

I have seen the same behavior, namely pip trying to download torch's CPU version and ending up installing the nvidia-related packages. I like the explicit-dependency solution; it is less flexible than letting pip manage dependencies, but I think it is the only viable way to get a good result.

The alternative could be to explore Poetry and replace our pip usage with it, but I'm not sure it would change much.

I believe we would have the same issue even with poetry in the following scenario:

  1. We have torch-rocm installed in the base image
  2. the package accelerate looks for torch==2.1.2, finds torch==2.1.2+rocm5.5 instead, and so goes ahead and downloads the CPU version of torch

I think this may be the one case where pip with a requirements.txt gives us easier maintenance than managing all these requirements via a pyproject.toml file.

I have built the "final" image that also includes vllm==0.2.7, which will be used for GPU inference optimization. The final size is 11.4GB and contains the following layers (the largest one is 1.17GB):

IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
b992697a5b07   11 minutes ago   LABEL blubber.variant=production blubber.ver…   0B        buildkit.dockerfile.v0
<missing>      11 minutes ago   ENTRYPOINT ["./entrypoint.sh"]                  0B        buildkit.dockerfile.v0
<missing>      11 minutes ago   COPY huggingface_modelserver/entrypoint.sh .…   321B      buildkit.dockerfile.v0
<missing>      11 minutes ago   COPY /srv/app/kserve_repo/python/huggingface…   428kB     buildkit.dockerfile.v0
<missing>      11 minutes ago   COPY /opt/lib/python/site-packages /opt/lib/…   1.17GB    buildkit.dockerfile.v0
<missing>      34 minutes ago   ENV PATH=/opt/lib/python/site-packages/bin:/…   0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ENV MODEL_DIR=/mnt/models PYTHONPATH=/srv/app   0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   WORKDIR /srv/app                                0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ENV HOME=/home/somebody                         0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   USER 65533                                      0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   RUN |6 LIVES_AS=somebody LIVES_UID=65533 LIV…   9.07kB    buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG RUNS_GID=900                                0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG RUNS_UID=900                                0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG RUNS_AS=runuser                             0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   RUN |3 LIVES_AS=somebody LIVES_UID=65533 LIV…   0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG LIVES_GID=65533                             0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG LIVES_UID=65533                             0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ARG LIVES_AS=somebody                           0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   RUN /bin/sh -c apt-get update && apt-get ins…   0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ENV DEBIAN_FRONTEND=noninteractive              0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   ENV HOME=/root                                  0B        buildkit.dockerfile.v0
<missing>      34 minutes ago   USER 0                                          0B        buildkit.dockerfile.v0

Change #1015297 merged by Elukey:

[operations/docker-images/production-images@master] Add new version for amd-pytorch image

https://gerrit.wikimedia.org/r/1015297

After the new pytorch image was released I re-verified the docker image size (11.4GB as described above), with the layers being the following (same as above):

docker history --format "Layer Size: {{.Size}} |  {{.CreatedBy}}" hf:kserve
Layer Size: 0B |  LABEL blubber.variant=production blubber.ver…
Layer Size: 0B |  ENTRYPOINT ["./entrypoint.sh"]
Layer Size: 321B |  COPY huggingface_modelserver/entrypoint.sh .…
Layer Size: 428kB |  COPY /srv/app/kserve_repo/python/huggingface…
Layer Size: 1.17GB |  COPY /opt/lib/python/site-packages /opt/lib/…
Layer Size: 0B |  ENV PATH=/opt/lib/python/site-packages/bin:/…
Layer Size: 0B |  ENV MODEL_DIR=/mnt/models PYTHONPATH=/srv/app
Layer Size: 0B |  WORKDIR /srv/app
Layer Size: 0B |  ENV HOME=/home/somebody
Layer Size: 0B |  USER 65533
Layer Size: 9.07kB |  RUN |6 LIVES_AS=somebody LIVES_UID=65533 LIV…
Layer Size: 0B |  ARG RUNS_GID=900
Layer Size: 0B |  ARG RUNS_UID=900
Layer Size: 0B |  ARG RUNS_AS=runuser
Layer Size: 0B |  RUN |3 LIVES_AS=somebody LIVES_UID=65533 LIV…
Layer Size: 0B |  ARG LIVES_GID=65533
Layer Size: 0B |  ARG LIVES_UID=65533
Layer Size: 0B |  ARG LIVES_AS=somebody
Layer Size: 0B |  ENV HOME=/root
Layer Size: 0B |  USER 0
Layer Size: 10GB |  |0 /bin/sh -c /usr/bin/pip3 install --target…
Layer Size: 0B |  /bin/sh -c #(nop)  USER 65533
Layer Size: 1.54MB |  |0 /bin/sh -c echo 'Acquire::http::Proxy "ht…
Layer Size: 69.8MB |  |0 /bin/sh -c echo 'Acquire::http::Proxy "ht…
Layer Size: 0B |  /bin/sh -c #(nop)  CMD ["/bin/bash"]
Layer Size: 0B |  /bin/sh -c #(nop)  ENV LC_ALL=C.UTF-8
Layer Size: 122MB |  /bin/sh -c #(nop) ADD file:4d8f8923252d099a4…

There is a 10GB layer but it is the one coming from the base pytorch image.

Change #1009783 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] huggingface: add huggingface image

https://gerrit.wikimedia.org/r/1009783

Change #1017858 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy falcon7b-instruct

https://gerrit.wikimedia.org/r/1017858

Change #1017858 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy falcon7b-instruct

https://gerrit.wikimedia.org/r/1017858

I tried to deploy falcon-7b-instruct using the hf image but got the following error in the kserve-container:

kubectl logs falcon-7b-instruct-gpu-predictor-00001-deployment-6776fccd6llf7 kserve-container
INFO:root:Copying contents of /mnt/models to local
The repository for /mnt/models contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//mnt/models.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Traceback (most recent call last):
  File "/opt/lib/python/site-packages/transformers/dynamic_module_utils.py", line 598, in resolve_trust_remote_code
    answer = input(
             ^^^^^^
EOFError: EOF when reading a line

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/srv/app/huggingfaceserver/__main__.py", line 69, in <module>
    model.load()
  File "/srv/app/huggingfaceserver/model.py", line 114, in load
    model_config = AutoConfig.from_pretrained(model_id_or_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/transformers/models/auto/configuration_auto.py", line 1103, in from_pretrained
    trust_remote_code = resolve_trust_remote_code(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/transformers/dynamic_module_utils.py", line 611, in resolve_trust_remote_code
    raise ValueError(
ValueError: The repository for /mnt/models contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co//mnt/models.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.

As it seems, by default huggingfaceserver won't allow the custom code execution that ships with some models. To enable it we would have to modify the kserve code and set trust_remote_code=True when calling model_config = AutoConfig.from_pretrained(model_id_or_path), either directly as model_config = AutoConfig.from_pretrained(model_id_or_path, trust_remote_code=True) or as model_config = AutoConfig.from_pretrained(model_id_or_path, **kwargs) with the flag passed in some other way. This is the same thing we had done in our custom llm image. A hedged sketch of such a change is below.
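A sketch of the change, with a hypothetical TRUST_REMOTE_CODE flag that would have to be an explicit, opt-in server argument (this is not upstream behaviour):

# Sketch for huggingfaceserver/model.py in the wmf fork; names are illustrative.
from transformers import AutoConfig

TRUST_REMOTE_CODE = True  # hypothetical opt-in flag, not an existing upstream setting
model_id_or_path = "/mnt/models"  # local model directory mounted by kserve

model_config = AutoConfig.from_pretrained(
    model_id_or_path,
    trust_remote_code=TRUST_REMOTE_CODE,  # allow models that ship custom code
)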

For now we'll proceed with testing models that don't include custom code, and I'll update after discussing upstream's intentions about this (I created a relevant issue on GH).

Proceeding with google-bert-uncased as an example model (the one we used during debugging) and Mistral-7B-Instruct-v0.2 as a bigger model.

Change #1018274 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy bert model on ml-staging

https://gerrit.wikimedia.org/r/1018274

Change #1018274 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy bert model on ml-staging

https://gerrit.wikimedia.org/r/1018274

Deployed bert-base-uncased model on ml-staging and it works!

time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/bert:predict" -X POST -d '{"instances": ["The capital of france is [MASK]."] }' -H  "Host: bert.experimental.wikimedia.org" -H "Content-Type: application/json" 

{"predictions":["paris"]}
real	0m10.552s
user	0m0.025s
sys	0m0.001s

Change #1018633 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy mistral-7b-instruct

https://gerrit.wikimedia.org/r/1018633

Change #1018633 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy mistral-7b-instruct

https://gerrit.wikimedia.org/r/1018633

Change #1018646 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: fix indentation in mistral model resources

https://gerrit.wikimedia.org/r/1018646

I managed to deploy Mistral-7B-Instruct-v0.2 on ml-staging using the GPU and 35GB of memory.
At the moment I'm testing the modifications we need to make to the huggingfaceserver so that:

  • we can use less memory when loading the model by setting low_cpu_mem_usage=True (as we did in the llm image and article-descriptions)
  • we can set the max_length of the returned sequence from the request.

These changes will either be requests to change the upstream kserve code or things we add in our wmf fork of kserve. A rough sketch of the intended load/generate settings follows.
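A rough sketch of what those two changes amount to at the transformers level; the path and values are illustrative, not the actual huggingfaceserver code:

# Assumption: the model has already been placed under /mnt/models by the storage initializer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/mnt/models"
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    low_cpu_mem_usage=True,   # avoid materialising a second full copy of the weights in RAM
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

inputs = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)  # max_length would come from the request
print(tokenizer.decode(outputs[0], skip_special_tokens=True))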

At the moment we have a 7B model deployed on ml-staging that uses the CPU and gets a response in ~30s.

I am experimenting with loading various model sizes to see if we get lower memory utilization by setting low_cpu_mem_usage=True while loading the model, via a commit in the wmf fork of kserve.
I'm not seeing significant improvements at the moment, so I'm probably still missing something.

I'm facing some issues with GPU utilization after fixing the indentation errors in the previous deployment that prevented the GPU from being used. We see the following:

  • GPU is visible in the pod resources:
Limits:
      amd.com/gpu:  1
      cpu:          8
      memory:       35Gi
    Requests:
      amd.com/gpu:  1
      cpu:          8
      memory:       35Gi
  • pod logs show an AMD driver failure:
kubectl logs -f mistral-7b-instruct-gpu-predictor-00005-deployment-5d6676dg7s6q
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
INFO:root:Copying contents of /mnt/models to local
INFO:kserve:successfully loaded tokenizer for task: 6
  • As a result, server response times are unchanged, as the CPU seems to be used.

Also setting the max_length of the returned sequence as reported above is still pending.

I tried to check the GPU by attaching to the running container and executing the following in a Python console; I'm getting the same result:

>>> import torch
>>> torch.cuda.is_available()
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
False

It seems that there is either an issue with the container's access to the drivers or an issue with the drivers themselves.

Change #1018646 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: fix indentation in mistral model resources and increase memory

https://gerrit.wikimedia.org/r/1018646

Change #1023414 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/docker-images/production-images@master] amdpytorch21: use bullseye as pytorch base image

https://gerrit.wikimedia.org/r/1023414

Change #1023414 abandoned by Ilias Sarantopoulos:

[operations/docker-images/production-images@master] amdpytorch21: use bullseye as pytorch base image

Reason:

https://gerrit.wikimedia.org/r/1023414

Change #1035476 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update hf image and remove nllb

https://gerrit.wikimedia.org/r/1035476

Task T365253: Allow Kubernetes workers to be deployed on Bookworm fixed the issue mentioned above in ml-staging-codfw. After that, the bert model works perfectly, while we're still having issues with Mistral (more info in T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0)), probably related to the lack of full support in vLLM for the MI100.

 time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/bert:predict" -X POST -d '{"instances": ["The capital of france is [MASK]."] }' -H  "Host: bert.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"predictions":["paris"]}
real	0m1.113s
user	0m0.019s
sys	0m0.008s

Previous requests using CPU were taking ~10s.

Change #1036297 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: set command for hf image and remove nllb

https://gerrit.wikimedia.org/r/1036297

Change #1035476 abandoned by Ilias Sarantopoulos:

[operations/deployment-charts@master] ml-services: update hf image and remove nllb

Reason:

Covered by https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1036297

https://gerrit.wikimedia.org/r/1035476

Change #1036297 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: set command for hf image and remove nllb

https://gerrit.wikimedia.org/r/1036297

Change #1047106 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy llama3

https://gerrit.wikimedia.org/r/1047106

Change #1047106 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy llama3

https://gerrit.wikimedia.org/r/1047106

The Huggingface image is now shipped with kserve v0.13.0 and this is the version we are using. This task is considered done; here is the summary:

  • The upstream huggingface image comes with Ubuntu and CUDA, so we have a huggingface Blubber image in the inference-services repo based on the pytorch-rocm base image (Debian).
  • The image has kserve installed as well as the code of the kserve repo, so that the huggingfaceserver module can be used as expected with python -m huggingfaceserver.
  • For stable versions we clone the upstream repo. However, since the project is still early in its development phase, if we need to ship from a specific branch or apply anything custom we have a wmf fork of kserve to do so. Ideally the fork will be deleted in the future.
  • There are two modes of running a model: either fetching it from Huggingface or loading it from local disk. The latter is the one we use in staging/production, where models are downloaded from Swift by the storage-initializer container.
  • When we utilize the GPU there are two backends for it: huggingface and vllm. The latter is considered to be much faster, but at the moment we are only using huggingface, as we are getting an error with pytorch 2.3.0 and vllm 0.4.2 which seems [[ https://github.com/vllm-project/vllm/issues/4229 | to have been fixed ]] in future versions that are not yet supported by kserve/huggingfaceserver. We could try to use our fork to add the latest version and see if it fixes the issue.
  • There are new endpoints available in later kserve versions. The generate endpoint has been added to support text generation and follows the open inference protocol (the v2 kserve endpoints). Support has also been added for the OpenAI API schema, more specifically for openai/v1/completions and openai/v1/chat/completions, along with their v2 counterparts.

It is a good time to look into the v2 kserve REST endpoints so that we can support more functionality out of the box (metadata etc.). A hedged example of calling the OpenAI-compatible completions endpoint is below.
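A Python equivalent of the curl calls used elsewhere in this task, hitting the OpenAI-compatible completions endpoint; the staging URL, Host header and model name are taken from the llama3 deployment above and may change:

# Sketch: call the openai/v1/completions endpoint exposed by huggingfaceserver.
import requests

resp = requests.post(
    "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions",
    headers={
        "Host": "llama3.experimental.wikimedia.org",
        "Content-Type": "application/json",
    },
    json={
        "model": "llama3",
        "prompt": "Write me a poem about Machine Learning.",
        "stream": False,
        "max_tokens": 50,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])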

Change #1050947 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] huggingface: bump trasnsformers to support Gemma

https://gerrit.wikimedia.org/r/1050947

Change #1050947 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] huggingface: bump transformers to support Gemma

https://gerrit.wikimedia.org/r/1050947

I've tested the gemma2 27b model after bumping the transformers package to the latest version.

  • The good news is that it deploys successfully (no resource issues or anything GPU-related). This is the first time we have deployed such a big model (27 billion params).
  • The bad news is that the response comes back empty, and most likely more changes are needed on the kserve side. Looking at some community discussions, there are multiple things that need to be changed when loading the model (enable eager attention and special tokens).
time curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -H "Host: llama3.experimental.wikimedia.org" -H "Content-Type: application/json" -X POST -d '{"model": "llama3", "prompt": "Write me a poem about Machine Learning.", "stream":false, "max_tokens": 50}'
{"id":"494684fa-b2f2-40e2-a837-a2d6644925fa","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":""}],"created":1719838007,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":50,"prompt_tokens":9,"total_tokens":59}}

(In the above example it says llama3 because I was just modifying the current llama3 service to test it; it is indeed the gemma model.)
I will try the model locally (the 9B version) to see what changes are required for kserve/huggingfaceserver; a hedged sketch of the load settings mentioned above follows.
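A sketch of the local experiment with the settings mentioned in the community discussions; the model id, dtype and decoding are assumptions, not verified fixes for the empty responses:

# Load Gemma-2 (9B) with eager attention and decode only the newly generated tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"  # assumed hub/local id for the 9B variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="eager",  # Gemma-2 is reported to need eager attention
    torch_dtype=torch.bfloat16,
)

inputs = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))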

Change #1052051 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update hf image

https://gerrit.wikimedia.org/r/1052051

The current work can be marked as done: we can now deploy models using the huggingfaceserver image in a stable way, after completing https://phabricator.wikimedia.org/T369359.