We deploy an embeddings inference service for Qwen3.
This service will back the semantic search MVP, serving users who type queries into the search bar.
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
[x] Non-functional requirements: clarify with David Causse and Peter Fischer.
- number of requests per second: ~5 RPS
- query context lengths: average, max, min (assuming an average word is 5 letters plus a trailing space, i.e. ~6 characters)
- average: ~8–12 words: 12 * 6 = 72 characters
- max: 300 characters; queries longer than 300 characters are truncated to their first 300 characters
- latency: <300 ms.
- SLO: We don’t yet have a hard uptime SLO defined for the MVP.
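The truncation rule and character budget above can be sketched as follows (the function name is illustrative, not part of the service):

```python
def truncate_query(query: str, max_chars: int = 300) -> str:
    """Apply the MVP rule: keep only the first 300 characters of long queries."""
    return query[:max_chars]

# Rough character budget for an average query:
# ~12 words * 6 chars each (5 letters + 1 space) = 72 characters.
avg_words, chars_per_word = 12, 6
avg_chars = avg_words * chars_per_word
```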
[x] API input/output parameters: same as the OpenAI embeddings standard.
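For reference, the OpenAI embeddings format uses a `{"model", "input"}` request and a `{"data": [{"embedding": [...]}]}` response; the model name and example values below are illustrative:

```python
import json

# Request body in the OpenAI embeddings format (model name illustrative):
request_body = {
    "model": "Qwen/Qwen3-Embedding-0.6B",
    "input": ["how to edit an article"],
}

# Shape of the response the service is expected to return (values are dummies):
response_body = json.loads("""
{
  "object": "list",
  "data": [{"object": "embedding", "index": 0, "embedding": [0.01, -0.02]}],
  "model": "Qwen/Qwen3-Embedding-0.6B",
  "usage": {"prompt_tokens": 8, "total_tokens": 8}
}
""")

# Clients read the vector from data[<index>].embedding.
vector = response_body["data"][0]["embedding"]
```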
[x] Implementation with sentence embeddings. See [implementation](https://phabricator.wikimedia.org/P86608) from Kevin.
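As context for how sentence embeddings are used downstream in semantic search, a pure-Python cosine similarity sketch (not part of Kevin's implementation, purely illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Search results are ranked by this score between the query embedding and document embeddings.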
[x] Which GPUs to occupy (Clarify with the team.)
- ML team agreed to use:
- 1 MI210 GPU in staging
- 1 MI300x GPU partition in production
[x] Locust tests based on the requirements.
- Scenario 1: min=20, median=74, max=171
- Scenario 2: min=101, median=110, max=171
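Assuming the scenario figures above are response times in milliseconds, a quick check of each scenario against the <300 ms latency target (numbers copied from the results above):

```python
LATENCY_BUDGET_MS = 300

# Locust results per scenario, assumed to be latencies in milliseconds.
scenarios = {
    "scenario1": {"min": 20, "median": 74, "max": 171},
    "scenario2": {"min": 101, "median": 110, "max": 171},
}

# A scenario passes if even its worst observed latency stays under the budget.
within_budget = {
    name: stats["max"] < LATENCY_BUDGET_MS for name, stats in scenarios.items()
}
```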
[ ] Iterate if the Locust tests are not successful.
- vLLM (clarify with Kevin and Dawid)
- [KServe embeddings](https://kserve.github.io/website/docs/model-serving/generative-inference/tasks/embedding?_highlight=embedding-vllm.yaml). Known blocker: it does not support our ROCm version. We can investigate further to see if a compatible match exists.
- Investigate more options.