We deploy an embeddings inference service for Qwen3.
This service will back the semantic search MVP, serving users' search-bar queries.
https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
- Non-functional requirements: clarify with David Causse and Peter Fischer.
- Number of requests per second: ~5 RPS
- Query context lengths: average, max, min (assuming an average word length of 5 letters, i.e. ~6 characters including the trailing space).
- average: ~8–12 words, i.e. up to 12 × 6 = 72 characters
- max: 300 characters; if a query is longer, we use only its first 300 characters.
- latency: <300 ms.
- SLO: We don’t yet have a hard uptime SLO defined for the MVP.
- API input/output parameters: same as the OpenAI embeddings API.
- Implementation with sentence embeddings; see Kevin's implementation.
- Which GPUs to occupy (Clarify with the team.)
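As a sketch of two of the requirements above — OpenAI-style input/output parameters and the 300-character cut-off — assuming the Hugging Face model id is used as the `model` field (the deployed service may name it differently):

```python
MAX_QUERY_CHARS = 300  # queries longer than this are truncated, per the requirements


def build_embeddings_request(query: str,
                             model: str = "Qwen/Qwen3-Embedding-0.6B") -> dict:
    """Build an OpenAI-style embeddings request body.

    The field names ("model", "input") follow the OpenAI embeddings API
    shape the notes reference; the model identifier here is the Hugging
    Face one and is an assumption, not the confirmed service config.
    """
    return {
        "model": model,
        "input": query[:MAX_QUERY_CHARS],  # keep only the first 300 characters
    }


payload = build_embeddings_request("x" * 400)
print(len(payload["input"]))  # 300
```

The response body would likewise mirror the OpenAI embeddings shape (a `data` list of objects with an `embedding` vector), but the exact fields should be confirmed against the final service.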
- ML team agreed to use:
- 1 MI210 GPU in staging
- 1 MI300x GPU partition in production
- Locust tests based on the requirements.
- Scenario 1: min=20, median=74, max=171
- Scenario 2: min=101, median=110, max=171
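The scenario summaries above can be computed from raw Locust latency samples with a small helper; the sample list below is illustrative only, not real measurement data:

```python
import statistics


def summarize_latencies(samples_ms: list) -> dict:
    """Summarize latency samples (ms) as min/median/max,
    the same shape as the scenario results in these notes."""
    return {
        "min": min(samples_ms),
        "median": statistics.median(samples_ms),
        "max": max(samples_ms),
    }


# Illustrative samples only; real numbers come from the Locust runs.
stats = summarize_latencies([20, 60, 74, 120, 171])
print(stats)  # → {'min': 20, 'median': 74, 'max': 171}

# Both recorded scenarios stay under the <300 ms latency target.
assert stats["max"] < 300
```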
Out of scope (not needed)
- Iterate if the Locust tests are not successful.
- vLLM (clarify with Kevin and Dawid)
- KServe embeddings. Known blocker: it does not support our ROCm version. We can investigate further if we find a compatible version match.
- Investigate more options.