Tue, Jun 25
Hello, yes, Kosta is right: it seems that you hit the rate limit. Copying from the Wikitech page, these are the limits:
The Lift Wing endpoints have the following rate limit tiers:
After Aiko uploaded the model, we can now use the model server deployed in the experimental namespace in ml-staging.
Fri, Jun 21
Following up on the vllm support issue from https://phabricator.wikimedia.org/T365246#9826503
With the installation of the new MI210, which seems to be supported by vllm (according to the official docs), we are now in a position to test the vllm backend with huggingface.
Following the official docs I am exploring 2 alternatives:
- vllm docs: the recommended way is to build it from source and use the ROCm docker image variant provided in the repo
Preprocessing now works. For this POC we used the following endpoint https://en.wikipedia.org/w/rest.php/v1/revision/12345/html:
curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345, "lang": "en"}' -H "Host: articlequality.experimental.wikimedia.org"

{"rev_id":12345,"lang":"en","normalized_features":[0.25854384507523304,0.0,0.27142566329646467,0.0,0.0,0.0,0.0,false,false]}
Before we move this to production we should also figure out the way to use it with the Rest Gateway.
Now the only thing that is left is to load the model and run the above features through the predict function.
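The remaining step can be sketched roughly as follows; the sklearn-style `predict_proba` interface, the model object, and the boolean-to-float cast are assumptions about the eventual implementation, not the actual code:

```python
# Hypothetical sketch: run the normalized feature vector (as returned by
# the preprocessing step above) through a loaded model. The sklearn-style
# predict_proba interface and the boolean handling are assumptions.
def predict(features: list, model) -> dict:
    # The feature vector above ends with booleans (false, false); cast
    # everything to float before scoring.
    numeric = [float(f) for f in features]
    probabilities = model.predict_proba([numeric])[0]
    return {"prediction": dict(zip(model.classes_, probabilities))}
```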
Thu, Jun 20
Got the same error with vllm==0.4.3, so I'll follow the documentation and see if anyone else is experiencing this issue.
WARNING 06-20 15:46:14 config.py:1155] Casting torch.bfloat16 to torch.float16.
INFO 06-20 15:46:14 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:35 model_runner.py:146] Loading model weights took 14.9595 GB
2024-06-20 15:46:35.322 1 kserve ERROR [__main__.py:<module>():254] Failed to start model server: name 'vllm_ops' is not defined
I bumped into a fork of the vllm project by ROCm which also has different releases as well as a flash-attention implementation for ROCm.
I'm trying vllm 0.4.3 and if it fails I'll go with the official instructions for [[ https://docs.vllm.ai/en/latest/getting_started/amd-installation.html | vllm and ROCm ]]. They recommend building vllm after installing torch-rocm, which we already do since we're using the base image; the only difference in requirements-rocm.txt is that it requires ray==2.10.0, which we already have (pytest-asyncio as well, but I doubt that is needed to run anything). I'll try both ways and provide an update.
Wed, Jun 19
I have deployed llama3-8B-instruct on ml-staging.
Making a request using the OpenAI API completions endpoint:
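For reference, such a request against the huggingface runtime's OpenAI-compatible completions route might look like the sketch below; the URL path, served model name, and Host header are assumptions based on the staging endpoints used elsewhere in this task:

```python
# Sketch of a completion request against the huggingface runtime's
# OpenAI-compatible endpoint. The base URL, model name, and Host header
# follow the staging conventions used in this task and are assumptions.
import json
import urllib.request

BASE = "https://inference-staging.svc.codfw.wmnet:30443"  # assumed staging gateway

payload = {
    "model": "llama3",          # served model name (assumption)
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.0,
}
req = urllib.request.Request(
    BASE + "/openai/v1/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Host": "llama3.experimental.wikimedia.org",  # assumed routing header
        "Content-Type": "application/json",
    },
)
# response = json.load(urllib.request.urlopen(req))  # requires internal network access
```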
The Huggingface image now ships with kserve v0.13.0, which is the one we are using. This task is considered done and this is the summary:
The model server has been successfully upgraded to kserve v0.13.0 and uses the pytorch 2.3.0 - rocm 6.0 base image.
Tue, Jun 18
liftwing package version 0.1.0 has been released on PyPI - https://pypi.org/project/liftwing/
Mon, Jun 17
I'm trying to migrate the following example request to be used by the service:
A dummy version has been deployed on ml-staging-codfw experimental. It is just a dummy service that returns the JSON input passed in the POST request.
curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345}' -H "Host: articlequality.experimental.wikimedia.org"
Fri, Jun 14
Thu, Jun 13
All revscoring models have been added in the attached Pull Request and will be included in v0.1 of the package.
I have manually tested the new kserve version with the huggingface image and the bert model that is deployed in the experimental namespace.
Wed, Jun 12
We are focusing on adding all the models available through the API Gateway. The package can also work with the internal endpoints, but first we will include the models that are publicly available.
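As a sketch of what the public path looks like, models exposed through the API Gateway are reachable with a plain HTTP request; the model name and payload below are illustrative:

```python
# Sketch of calling a publicly available model through the API Gateway.
# The model name (enwiki-damaging) and payload are illustrative examples.
import json
import urllib.request

url = "https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-damaging:predict"
req = urllib.request.Request(
    url,
    data=json.dumps({"rev_id": 12345}).encode(),
    headers={"Content-Type": "application/json"},
)
# response = json.load(urllib.request.urlopen(req))  # needs network access
```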
Tue, Jun 11
Covered by other tasks (https://phabricator.wikimedia.org/T363336)
We made request validation optional, and it is now really simple to add support for a new model in the package.
I have also added metadata (again optional) for each model. The user can get the list of available models by running python -m liftwing - relevant PR
Adding a todo list of tasks:
Mon, Jun 10
Fri, Jun 7
Applied multiprocessing to eswiki-damaging and viwiki-reverted, but only for large revisions (to avoid CPU throttling).
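The idea can be sketched like this; the size threshold, function names, and pool size are assumptions, and the cheap word count stands in for the real revscoring preprocessing:

```python
# Hypothetical sketch of the approach: offload preprocessing to a separate
# process only when the revision text is large, so small requests keep the
# cheap in-process path and large ones don't exhaust the CPU quota.
from concurrent.futures import ProcessPoolExecutor

LARGE_REVISION_BYTES = 100_000  # threshold is an assumption

_pool = ProcessPoolExecutor(max_workers=1)

def extract_features(text: str) -> int:
    # Stand-in for the expensive revscoring preprocessing.
    return len(text.split())

def score(text: str) -> int:
    if len(text) > LARGE_REVISION_BYTES:
        # Large revision: run preprocessing in a worker process.
        return _pool.submit(extract_features, text).result()
    # Small revision: stay in-process.
    return extract_features(text)
```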
Thu, Jun 6
Wed, Jun 5
- Identified that the latency issues are caused by revscoring preprocessing code when scoring large revisions. {T363336#9850901}. The team is focused on tackling the issue by enabling multiprocessing for problematic model servers and/or limiting the content passed to revscoring.
Tue, Jun 4
We have added request payload validation with pydantic and are currently adding more models to the package.
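A minimal sketch of what that pydantic validation looks like, assuming the payload fields used elsewhere in this task (the class name and the default lang are illustrative):

```python
# Sketch of pydantic request validation for a payload like
# {"rev_id": 12345, "lang": "en"}. Class name and default are assumptions.
from pydantic import BaseModel, ValidationError


class ArticleQualityRequest(BaseModel):
    rev_id: int
    lang: str = "en"  # default is an assumption


def validate(payload: dict):
    """Return a validated request, or None if the payload is invalid."""
    try:
        return ArticleQualityRequest(**payload)
    except ValidationError:
        return None
```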
May 27 2024
After defining --backend=huggingface in the entrypoint command, the server starts properly but I'm getting an error when I make a request.
This is the relevant Pull Request : https://github.com/wikimedia/liftwing-python/pull/5
May 24 2024
Task T365253: Allow Kubernetes workers to be deployed on Bookworm fixed the issue mentioned above in ml-staging-codfw. After that, the bert model works perfectly, while we're still having issues with Mistral (more info in T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0)), probably related to the lack of full support in vllm for the MI100.
time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/bert:predict" -X POST -d '{"instances": ["The capital of france is [MASK]."] }' -H "Host: bert.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1

{"predictions":["paris"]}

real 0m1.113s
user 0m0.019s
sys 0m0.008s
Previous requests using CPU were taking ~10s.
May 23 2024
Currently investigating whether the MI100 (gfx908) is supported by vllm after all. Although the documentation mentioned above says that it isn't, there are mentions and PRs that seem to support it.
If it doesn't work we'll have to go with the huggingface backend instead of vllm, but we'd lose a ton of improvements, mostly in speed.
Currently getting a CrashLoopBackoff in the pod with the updated image. However, there is something I missed during the update: when it comes to ROCm support, the latest vllm doesn't support the MI100.
Requirements: OS: Linux
May 22 2024
We had forgotten the .pip cache dir inside the docker image, which increased its size by more than 2GB (the size of the packages, since torch alone is really big even compressed).
The new image is now 13.5GB, and 2.5GB compressed, which allows us to publish it in our docker registry.