Fri, Jun 21
Following up on the vllm support issue from https://phabricator.wikimedia.org/T365246#9826503
With the installation of the new MI210, which seems to be supported by vllm (according to the official docs), we are now in a position to test the vllm backend with huggingface.
Following the official docs I am exploring two alternatives:
- vllm docs: the recommended way is to build it from source and use the ROCm docker image variant provided in the repo
Preprocessing now works. For this POC we used the following endpoint https://en.wikipedia.org/w/rest.php/v1/revision/12345/html:
curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345, "lang": "en"}' -H "Host: articlequality.experimental.wikimedia.org" {"rev_id":12345,"lang":"en","normalized_features":[0.25854384507523304,0.0,0.27142566329646467,0.0,0.0,0.0,0.0,false,false]}
Before we move this to production we should also figure out how to use it with the REST Gateway.
Now the only thing that is left is to load the model and run the above features through the predict function.
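As a sketch of that remaining step, assuming the standard kserve Model interface (the load() body below is a placeholder, since the model format isn't specified in this task):

import kserve

class ArticleQualityModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None

    def load(self):
        # Placeholder: load the trained articlequality model from /mnt/models.
        self.model = ...
        self.ready = True

    def predict(self, payload: dict, headers: dict = None) -> dict:
        # Run the normalized features produced by preprocessing (see the
        # curl example above) through the model.
        features = payload["normalized_features"]
        return {**payload, "score": self.model.predict([features])[0]}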
Thu, Jun 20
Got the same error with vllm==0.4.3, so I'll follow the documentation and see if anyone else is experiencing this issue.
WARNING 06-20 15:46:14 config.py:1155] Casting torch.bfloat16 to torch.float16.
INFO 06-20 15:46:14 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:35 model_runner.py:146] Loading model weights took 14.9595 GB
2024-06-20 15:46:35.322 1 kserve ERROR [__main__.py:<module>():254] Failed to start model server: name 'vllm_ops' is not defined
I bumped into ROCm's fork of the vllm project, which has its own releases as well as a flash-attention implementation for ROCm.
I'm trying vllm 0.4.3, and if it fails I'll go with the official instructions for [[ https://docs.vllm.ai/en/latest/getting_started/amd-installation.html | vllm and ROCm ]]. They recommend building vllm after installing torch-rocm, which we already do since we're using the base image; the only difference in requirements-rocm.txt is that it requires ray==2.10.0, which we already have (pytest-asyncio as well, but I doubt that is needed to run anything). I'll try both ways and provide an update.
Wed, Jun 19
I have deployed llama3-8B-instruct on ml-staging.
Making a request using the OpenAI API completions endpoint:
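Something along these lines, as a sketch: it assumes kserve v0.13's OpenAI-compatible /openai/v1/completions route, and the Host header and served model name below are hypothetical, deployment-specific values:

import requests

response = requests.post(
    "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions",
    headers={
        "Host": "llama3.experimental.wikimedia.org",  # hypothetical hostname
        "Content-Type": "application/json",
    },
    json={
        "model": "llama3",  # served model name, deployment-specific
        "prompt": "The capital of France is",
        "max_tokens": 16,
    },
)
print(response.json())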
The Huggingface image is now shipped with v0.13.0 of kserve, and this is the one we are using. This task is considered done; this is the summary:
The model server has been successfully upgraded to kserve v0.13.0 and uses the pytorch 2.3.0 - rocm 6.0 base image.
Tue, Jun 18
liftwing package version 0.1.0 has been released on PyPI - https://pypi.org/project/liftwing/
Mon, Jun 17
I'm trying to migrate the following example request to be used by the service:
A dummy version has been deployed on ml-staging-codfw in the experimental namespace. It is just a dummy service that returns the JSON input passed in the POST request.
curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345}' -H "Host: articlequality.experimental.wikimedia.org"
Fri, Jun 14
Thu, Jun 13
All revscoring models have been added in the attached Pull Request and will be included in v0.1 of the package.
I have manually tested the new kserve version with the huggingface image and the bert model that is deployed in the experimental namespace.
Wed, Jun 12
We are focusing on adding all the models available through the API Gateway. The package also works with the internal endpoints, but first we will include the models that are publicly available.
Tue, Jun 11
Covered by other tasks (https://phabricator.wikimedia.org/T363336)
We made request validation optional, and it is now really simple to add support for a new model to the package.
We have also added (again optional) metadata for each model. The user can get the list of available models by running python -m liftwing (relevant PR).
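To illustrate the shape of this design, a hypothetical model entry might pair an endpoint with optional validation and optional metadata; all names below are illustrative, not the package's actual API (see the linked PR for the real code):

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class ModelEntry:
    name: str          # e.g. "enwiki-articlequality"
    endpoint: str      # Lift Wing inference URL behind the API Gateway
    validate: Optional[Callable[[dict], None]] = None  # optional request validation
    metadata: dict = field(default_factory=dict)       # optional model metadata

# A "python -m liftwing" entrypoint would then only need to iterate over
# such a registry to print the available models.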
Adding a todo list of tasks:
Mon, Jun 10
Fri, Jun 7
Applied multiprocessing to eswiki-damaging and viwiki-reverted, but only for large revisions (to avoid CPU throttling).
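Roughly this shape, as a sketch only; the threshold and extract_features() are illustrative stand-ins for the real inference-services code:

from concurrent.futures import ProcessPoolExecutor

LARGE_REVISION_CHARS = 100_000  # assumed cutoff, not the deployed value

def extract_features(text: str) -> list:
    # Stand-in for revscoring's preprocessing, the CPU-heavy step
    # identified in T363336#9850901.
    ...

def preprocess(text: str) -> list:
    if len(text) > LARGE_REVISION_CHARS:
        # Offload large revisions to a separate process so one heavy request
        # doesn't exhaust the pod's CPU quota and get throttled.
        with ProcessPoolExecutor(max_workers=1) as pool:
            return pool.submit(extract_features, text).result()
    return extract_features(text)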
Thu, Jun 6
Wed, Jun 5
- Identified that the latency issues are caused by revscoring preprocessing code when scoring large revisions {T363336#9850901}. The team is focused on tackling the issue by enabling multiprocessing for problematic model servers and/or limiting the content passed to revscoring.
Tue, Jun 4
We have added request payload validation with pydantic and are currently adding more models to the package.
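A minimal sketch of what such pydantic validation looks like; the field names mirror the payloads used elsewhere in this task ({"rev_id": ..., "lang": ...}), while the class name is illustrative:

from pydantic import BaseModel, ValidationError

class PredictRequest(BaseModel):
    rev_id: int
    lang: str = "en"

try:
    PredictRequest(rev_id="not-a-number")
except ValidationError as err:
    print(err)  # reports that rev_id must be an integer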
Mon, May 27
After setting --backend=huggingface in the entrypoint command the server starts properly, but I'm getting an error when I make a request.
This is the relevant Pull Request: https://github.com/wikimedia/liftwing-python/pull/5
Fri, May 24
Task T365253: Allow Kubernetes workers to be deployed on Bookworm fixed the issue mentioned above in ml-staging-codfw. After that the bert model works perfectly, while we're still having issues with Mistral (more info in {9826605}), probably related to the lack of full support in vllm for the MI100.
time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/bert:predict" -X POST -d '{"instances": ["The capital of france is [MASK]."] }' -H "Host: bert.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"predictions":["paris"]}

real	0m1.113s
user	0m0.019s
sys	0m0.008s
Previous requests using CPU were taking ~10s.
May 23 2024
Currently investigating whether the MI 100 (gfx908) is supported by vllm after all. Although the documentation mentioned above says that it isn't, there are mentions and PRs that seem to suggest it is.
If it doesn't work we'll have to go with the huggingface backend instead of vllm, but we'd lose a lot of improvements, mostly in speed.
Currently getting a CrashLoopBackOff in the pod with the updated image. However, there is something I missed during the update: when it comes to ROCm support, the latest vllm doesn't support the MI 100.
From the requirements in the vllm docs: OS: Linux
May 22 2024
We had forgotten the .pip cache dir inside the docker image, which increased its size by more than 2GB (the size of the downloaded packages, since torch alone is really big even compressed).
The new image is now 13.5GB, and 2.5GB when compressed, which allows us to publish it in our docker registry.
May 21 2024
@AgnesAbah have you managed to resolve the issue?
As Kosta mentioned, this isn't related to Lift Wing but to the MediaWiki Action API.
As it turns out, the above approach won't cut it: even without the dependencies, the compressed image with pytorch 2.3.0 and rocm 6.0 is 4.36GB.
This is the list of packages under /opt/lib/site-packages
functorch torch torch-2.3.0+rocm6.0.dist-info torchgen
It also seems that torch-ROCm by itself is ~12GB, so it is indeed getting bigger and bigger.
Images seem to be getting more bloated, so I am exploring the option of installing pytorch-rocm with pip's --no-deps option and handling dependencies manually, either in the production-images repo or on the inference-services side. It is a long shot, but I think it is worth trying from our side, at least to cross it out if it can't be done.
Whether this approach is feasible or not will depend on: