After the successful deployment of a large transformer model from huggingface using the huggingface server available from kserve, we want to solve the issue of using an inference optimization engine with our MI210 GPU.
Since the MI210 seems to be supported by vllm (according to the official docs), we are now in a position to test the vllm backend with huggingface.
Following the official docs we can explore 2 alternatives:
- vllm docs: the recommended way is to build it from source and use the rocm docker image variant provided in the repo
- ROCm docs: suggest a simpler way, which is to clone the ROCm fork of vllm and run:
PYTORCH_ROCM_ARCH=gfx90a python setup.py install
where gfx90a is the architecture for the MI200 series. The ROCm fork also offers a different docker image.
I'm exploring which of the two is simpler and easier to maintain.
If we decide to use the vllm engine as an inference framework we can move its installation to a base pytorch image. However, for the time being I would avoid adding it there, as versions are changing quite often.
One other thing to figure out is the version discrepancy between vllm (latest version v0.4.3) and the ROCm-vllm fork (latest version 0.4.0). The fork's version is inconsistent with the huggingfaceserver python module requirements - vllm = { version = "^0.4.2", optional = true } - which can be tackled if we use our fork of kserve, but that ends up being one more custom step in the build/update process. vllm releases have progressed to 0.5.2, which may be needed by huggingfaceserver, which still requires v0.4.3.
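To make the discrepancy concrete, here is a minimal sketch of how a caret constraint such as ^0.4.2 is interpreted (at least 0.4.2 and below the next bump of the left-most non-zero component, so below 0.5.0). The helper names are hypothetical and only plain X.Y.Z versions are handled; this is not how kserve or poetry implement it.

```python
def parse_version(version):
    """Turn a plain 'X.Y.Z' string into a tuple of ints (no pre-release tags)."""
    return tuple(int(part) for part in version.split("."))


def caret_upper_bound(base):
    """Exclusive upper bound for a caret constraint on `base`.

    The left-most non-zero component is bumped and everything after it is
    zeroed, e.g. 0.4.2 -> 0.5.0 and 1.2.3 -> 2.0.0.
    """
    parts = list(parse_version(base))
    for index, value in enumerate(parts):
        if value != 0:
            return tuple(parts[:index] + [value + 1] + [0] * (len(parts) - index - 1))
    # All components are zero: bump the last one (simplification for this sketch).
    return tuple(parts[:-1] + [parts[-1] + 1])


def satisfies_caret(version, base):
    """True if `version` falls inside the range described by ^base."""
    return parse_version(base) <= parse_version(version) < caret_upper_bound(base)


if __name__ == "__main__":
    # Versions mentioned in this task, checked against ^0.4.2.
    for candidate in ("0.4.0", "0.4.3", "0.5.2"):
        print(candidate, satisfies_caret(candidate, "0.4.2"))
```

Under these assumptions the ROCm fork at 0.4.0 falls below the range and 0.5.2 falls above it, so only the 0.4.x releases from 0.4.2 onwards would satisfy the requirement without patching it.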
In a previous task we stumbled into issues while trying to install vllm: https://phabricator.wikimedia.org/T354870#9935109
We followed the process in the vllm docs to build from source based on our pytorch-rocm-base-image. All the examples and instructions we have found use python 3.9 while we are using a bookworm base image with python 3.11.
The instructions in the aforementioned link mention the following (after installing torch+rocm):
cd vllm
pip install -U -r requirements-rocm.txt
python setup.py install  # This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation
I've modified the above to a more blubber-friendly version for testing. After cloning the repo in a separate command, I use the following requirements.txt file:
-r /srv/app/vllm/requirements-rocm.txt
-e /srv/app/vllm/
However, the build seems to expect CUDA_HOME to be set and fails:
× Getting requirements to build wheel did not run successfully.
832.6 │ exit code: 1
832.6 ╰─> [20 lines of output]
832.6 Traceback (most recent call last):
832.6   File "/usr/local/lib/python3.11/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
832.6     main()
832.6   File "/usr/local/lib/python3.11/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
832.6     json_out['return_val'] = hook(**hook_input['kwargs'])
832.6                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
832.6   File "/usr/local/lib/python3.11/dist-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
832.6     return hook(config_settings)
832.6            ^^^^^^^^^^^^^^^^^^^^^
832.6   File "/tmp/pip-build-env-o4me9kcq/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 327, in get_requires_for_build_wheel
832.6     return self._get_build_requires(config_settings, requirements=[])
832.6            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
832.6   File "/tmp/pip-build-env-o4me9kcq/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 297, in _get_build_requires
832.6     self.run_setup()
832.6   File "/tmp/pip-build-env-o4me9kcq/overlay/local/lib/python3.11/dist-packages/setuptools/build_meta.py", line 313, in run_setup
832.6     exec(code, locals())
832.6   File "<string>", line 406, in <module>
832.6   File "<string>", line 312, in get_vllm_version
832.6   File "<string>", line 282, in get_nvcc_cuda_version
832.6 AssertionError: CUDA_HOME is not set
832.6 [end of output]
832.6
832.6 note: This error originates from a subprocess, and is likely not a problem with pip.
832.6 error: subprocess-exited-with-error
The above process does seem too hacky, so even if it works we'll have to make sure it is stable enough for us to use without being a real pain to maintain/update.
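The traceback ends in get_nvcc_cuda_version, which suggests the build took the CUDA path instead of the ROCm one. One possible explanation (an assumption, not something confirmed here) is that the torch visible during the build is not the ROCm variant, for example inside pip's isolated build environment under /tmp/pip-build-env-*. A small diagnostic we could run in the image to see what torch reports might look like this; the function name is hypothetical, while torch.version.hip and torch.version.cuda are the attributes PyTorch exposes for its build backend.

```python
import importlib
import os


def describe_torch_backend():
    """Collect the values a ROCm-vs-CUDA build decision would likely depend on."""
    info = {
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "ROCM_HOME": os.environ.get("ROCM_HOME"),
        "torch": None,
        "hip": None,
        "cuda": None,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not importable in this environment; report env vars only.
        return info
    info["torch"] = torch.__version__
    # On ROCm builds torch.version.hip is a version string; on CUDA builds it is None.
    info["hip"] = getattr(torch.version, "hip", None)
    info["cuda"] = getattr(torch.version, "cuda", None)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_backend().items():
        print(f"{key}: {value}")
```

If hip comes back as None in the environment where the build runs, that would be consistent with the CUDA_HOME assertion above; if it is set, the cause is probably elsewhere.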
Expected outcome:
An expected outcome of this task would be to have a model (e.g. gemma2) served from an image with the vllm engine, which would be much faster than the current one, which uses the huggingface backend.
