We will deploy an embeddings inference service for Qwen3-Embedding-0.6B.
The service backs the semantic search MVP: it embeds the queries that users type into the search bar.
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
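For context, the core operation the MVP builds on is ranking documents by similarity between the query embedding and document embeddings. A minimal sketch in pure Python, using tiny synthetic vectors in place of real Qwen3 embeddings (all names here are illustrative, not part of the service design):

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank(query_vec, doc_vecs):
    # return document indices sorted by similarity, best match first
    scores = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, reverse=True)]

# synthetic 3-d vectors standing in for model embeddings
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(rank([1.0, 0.1, 0.0], docs))  # → [0, 2, 1]
```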
[ ] Non-functional requirements: clarify with David Causse and Peter Fischer.
- number of requests per second
- query context lengths (average, max, min)
- SLOs
[x] API input/output parameters: same as the OpenAI embeddings API standard.
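For reference, the OpenAI-style contract means the service accepts and returns JSON shaped roughly as below. This is a sketch of the shapes only; the endpoint path and the sample values are assumptions, and the model id is taken from the ticket:

```python
import json

# Hypothetical request body for POST /v1/embeddings (OpenAI-style).
request = {
    "model": "Qwen/Qwen3-Embedding-0.6B",
    "input": ["example search query"],
}

# Matching response shape: one embedding object per input, plus usage.
# The embedding values here are placeholders, not real model output.
response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.01, -0.02]},
    ],
    "model": "Qwen/Qwen3-Embedding-0.6B",
    "usage": {"prompt_tokens": 4, "total_tokens": 4},
}

print(json.dumps(request))
```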
[ ] Implementation with sentence embeddings. See [implementation](https://phabricator.wikimedia.org/P86608) from Kevin. Choose one of the following:
- vLLM (clarify with Kevin and Dawid).
- [Kserve embeddings](https://kserve.github.io/website/docs/model-serving/generative-inference/tasks/embedding?_highlight=embedding-vllm.yaml). Known blocker: it does not support our ROCm version. We can investigate further whether a compatible version exists.
- Investigate more options.
[ ] Which GPUs to occupy (clarify with the team).
[ ] Locust tests based on the requirements.
[ ] Iterate if the Locust tests are not successful.
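Once the SLOs above are clarified, the Locust results can be checked against them mechanically. A small helper for that check; the percentile method (nearest-rank) and the latency budget are placeholder assumptions, not agreed numbers:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of latencies (ms)."""
    xs = sorted(samples)
    k = math.ceil(p / 100 * len(xs))  # 1-indexed nearest rank
    return xs[max(0, k - 1)]

def meets_slo(latencies_ms, p95_budget_ms=200.0):
    # placeholder budget; the real SLO is still to be clarified with the team
    return percentile(latencies_ms, 95) <= p95_budget_ms

samples = [50, 60, 70, 80, 90, 100, 110, 120, 300, 400]
print(percentile(samples, 95), meets_slo(samples))  # → 400 False
```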