
Q2 FY2025-26 Goal: Semantic Search - Embeddings Service for MVP
Open, Needs Triage · Public

Description

We deploy an embeddings inference service for Qwen3.
This service will be used in the semantic search MVP by users who query via the search bar.

https://huggingface.co/Qwen/Qwen3-Embedding-0.6B

  • Non-functional requirements: clarify with David Causse and Peter Fischer.
    • Number of requests per second: ~5 RPS
    • Query context lengths (average, max, min), assuming an average word length of 5 letters plus a trailing space (~6 characters per word):
      • average: ~8–12 words, i.e. 12 * 6 = 72 characters
      • max: 300 characters; if a query is longer than 300 characters, only its first 300 characters are used.
    • Latency: <300 ms
    • SLO: we don't yet have a hard uptime SLO defined for the MVP.
  • API input/output parameters: same as the OpenAI standard.
  • Implementation with sentence embeddings (see Kevin's implementation).
  • Which GPUs to use (clarify with the team.)
    • ML team agreed to use:
      • 1 MI210 GPU in staging
      • 1 MI300x GPU partition in production
  • Locust tests based on the requirements.
    • Scenario1: min=20, median=74, max=171
    • Scenario2: min=101, median=110, max=171
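The truncation and payload rules above can be sketched in a few lines of Python. This is a hedged illustration, not the model-server's actual code; the `MAX_QUERY_CHARS` constant and helper names are ours, only the 300-character limit and the `{"instances": [...]}` request shape come from this task:

```python
MAX_QUERY_CHARS = 300  # per the requirement above


def prepare_query(text: str) -> str:
    # Queries longer than 300 characters are cut to their first 300 characters.
    return text[:MAX_QUERY_CHARS]


def build_payload(queries: list[str]) -> dict:
    # Current KServe-style request body: {"instances": ["text1", "text2"]}.
    return {"instances": [prepare_query(q) for q in queries]}
```

With this rule, a 72-character average query passes through untouched, and only outliers above 300 characters are clipped.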

Out of scope (was not needed)

  • Iterate if the Locust tests are not successful.
    • vLLM (clarify with Kevin and Dawid)
    • KServe embeddings. Known blocker: it does not support our ROCm version. We can investigate further whether a compatible match exists.
    • Investigate more options.

Event Timeline


@OKarakaya-WMF is it okay if I turn this into a Goal ticket on our board and move it to the Goals column? You can still use it for all the same updates that you would use it for otherwise, but then it can also be a home for our Friday weekly updates. Otherwise, I can create a separate parent ticket to be the Goal ticket and I'll make this a child ticket under that. Please lmk whatever you prefer! TYSM

Change #1219128 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] embeddings: integrate prototype into model-server

https://gerrit.wikimedia.org/r/1219128

@Sucheta-Salgaonkar-WMF ,

Great idea! Let's turn this into a goal. I think it's fine not to create child tickets for now.
I have added checkboxes to the description indicating each step/task.
I'll update them as we progress and I can add weekly updates here.
Thank you!

Change #1219128 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] embeddings: integrate prototype into model-server

https://gerrit.wikimedia.org/r/1219128

Change #1219271 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[integration/config@master] inference-services: add CI pipeline jobs for embeddings model-server

https://gerrit.wikimedia.org/r/1219271

Change #1219537 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] embeddings: containerize model-server

https://gerrit.wikimedia.org/r/1219537

Change #1219537 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] embeddings: containerize model-server

https://gerrit.wikimedia.org/r/1219537

Change #1219271 merged by jenkins-bot:

[integration/config@master] inference-services: add CI pipeline jobs for embeddings model-server

https://gerrit.wikimedia.org/r/1219271

Change #1219602 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] embeddings: trigger CI to publish model-server

https://gerrit.wikimedia.org/r/1219602

Change #1219602 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] embeddings: trigger CI to publish model-server

https://gerrit.wikimedia.org/r/1219602

Change #1220050 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] docker-compose: add embeddings config

https://gerrit.wikimedia.org/r/1220050

Change #1220050 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] docker-compose: add embeddings config

https://gerrit.wikimedia.org/r/1220050

Change #1220313 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: add embeddings isvc to the experimental namespace

https://gerrit.wikimedia.org/r/1220313

Change #1220313 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add embeddings isvc to the experimental namespace

https://gerrit.wikimedia.org/r/1220313

Change #1220321 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: remove revise-tone-task-generator from experimental ns

https://gerrit.wikimedia.org/r/1220321

Performance test results, run locally on CPU:

Scenario1:

question_length
count 4510.000000
mean 74.141463
std 22.811325
min 20.000000
25% 57.000000
50% 73.000000
75% 88.000000
max 171.000000

[2025-12-22 12:31:50,422] wmf3658/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     http://localhost:8080/v1/models/qwen3-embedding:predict                         1564     0(0.00%) |     98      44     218     94 |   13.06        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      1564     0(0.00%) |     98      44     218     94 |   13.06        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     http://localhost:8080/v1/models/qwen3-embedding:predict                                94    110    120    130    140    160    180    190    210    220    220   1564
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                             94    110    120    130    140    160    180    190    210    220    220   1564

Scenario2:
question_length
count 632.000000
mean 113.484177
std 12.441294
min 100.000000
25% 104.000000
50% 109.500000
75% 120.000000
max 171.000000

Load test results are within the threshold
[2025-12-22 12:36:15,421] wmf3658/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     http://localhost:8080/v1/models/qwen3-embedding:predict                         1243     0(0.00%) |    137      74     255    140 |   10.39        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      1243     0(0.00%) |    137      74     255    140 |   10.39        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     http://localhost:8080/v1/models/qwen3-embedding:predict                               140    150    160    170    190    200    220    230    250    260    260   1243
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated

Change #1220321 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: remove revise-tone-task-generator from experimental ns

https://gerrit.wikimedia.org/r/1220321

Results with a new setup, run locally.

(myenv3112) ozge@wmf3658 locust % MODEL=embeddings locust
Min length: 10, Max length: 350
       question_length
count      4610.000000
mean         78.499349
std          37.519812
min          20.000000
25%          58.000000
50%          73.000000
75%          90.000000
max         350.000000
[2025-12-22 13:25:32,271] wmf3658/INFO/locust.main: Run time limit set to 120 seconds
[2025-12-22 13:25:32,271] wmf3658/INFO/locust.main: Starting Locust 2.31.5
[2025-12-22 13:25:32,272] wmf3658/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-12-22 13:25:32,272] wmf3658/INFO/locust.runners: All users spawned: {"Embeddings": 2} (2 total users)
[2025-12-22 13:27:31,996] wmf3658/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-12-22 13:27:32,032] wmf3658/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     http://localhost:8080/v1/models/qwen3-embedding:predict                         1457     0(0.00%) |    110      45     595     99 |   12.17        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      1457     0(0.00%) |    110      45     595     99 |   12.17        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     http://localhost:8080/v1/models/qwen3-embedding:predict                                99    120    130    130    160    190    330    390    520    600    600   1457
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                             99    120    130    130    160    190    330    390    520    600    600   1457

The embeddings model-server has been deployed in the LiftWing experimental namespace. It is currently available through an internal endpoint that can only be accessed by tools running within the WMF infrastructure (e.g. deploy2002, stat1008):

$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/qwen3-embedding:predict" -X POST -d '{"instances": ["text1", "text2"]}' -H  "Host: embeddings.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{
    "model_name": "qwen3-embedding",
    "model_version": "",
    "predictions": [
      [
        -0.03631591796875,
        -0.0428466796875,
        -0.0142669677734375,
        ...,
        0.0076751708984375,
        0.01092529296875,
        0.018829345703125
      ],
      [
        -0.0250396728515625,
        -0.061767578125,
        -0.0142364501953125,
        ...,
        0.0157318115234375,
        0.00920867919921875,
        0.027984619140625
      ]
    ]
  }
real	0m0.072s
user	0m0.015s
sys	0m0.004s

We can test it and fix the edge cases we may come across.
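For scripted testing, the curl call above can be reproduced from Python. This is a hedged sketch using only the URL, Host header, and request body shown in the example; the helper names are ours, and the request only succeeds from inside the WMF network:

```python
import json
import urllib.request

# URL and Host header taken from the curl example above.
STAGING_URL = (
    "https://inference-staging.svc.codfw.wmnet:30443"
    "/v1/models/qwen3-embedding:predict"
)


def build_request(texts: list[str]) -> urllib.request.Request:
    # The Host header routes the request to the embeddings isvc
    # behind the shared staging gateway.
    return urllib.request.Request(
        STAGING_URL,
        data=json.dumps({"instances": texts}).encode("utf-8"),
        headers={
            "Host": "embeddings.experimental.wikimedia.org",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def embed(texts: list[str]) -> list[list[float]]:
    # Only reachable from hosts inside the WMF infra (e.g. stat1008).
    with urllib.request.urlopen(build_request(texts)) as resp:
        return json.loads(resp.read())["predictions"]
```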

Change #1220335 had a related patch set uploaded (by Ozge; author: ozge):

[machinelearning/liftwing/inference-services@main] ml-services: embeddings locust tests

https://gerrit.wikimedia.org/r/1220335

Staging results.

(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ MODEL=embeddings locust
Min length: 10, Max length: 350
       question_length
count      4610.000000
mean         78.490456
std          37.470874
min          20.000000
25%          58.000000
50%          73.000000
75%          90.000000
max         348.000000
[2025-12-22 13:22:37,710] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-12-22 13:22:37,711] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-12-22 13:22:37,711] stat1010/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-12-22 13:22:37,711] stat1010/INFO/locust.runners: All users spawned: {"Embeddings": 2} (2 total users)
[2025-12-22 13:24:37,247] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-12-22 13:24:37,324] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     https://inference-staging.svc.codfw.wmnet:30443/v1/models/qwen3-embedding:predict    1895     0(0.00%) |     75      66     289     70 |   15.85        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      1895     0(0.00%) |     75      66     289     70 |   15.85        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     https://inference-staging.svc.codfw.wmnet:30443/v1/models/qwen3-embedding:predict       70     71     75     81     94    100    100    110    270    290    290   1895
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                             70     71     75     81     94    100    100    110    270    290    290   1895

Change #1220335 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] ml-services: embeddings locust tests

https://gerrit.wikimedia.org/r/1220335

(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ export MAX_LENGTH=350
(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ export MIN_LENGTH=100
(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ MODEL=embeddings locust
Min length: 100, Max length: 350
       question_length
count       732.000000
mean        135.498634
std          58.858208
min         100.000000
25%         105.000000
50%         112.000000
75%         128.000000
max         348.000000
[2025-12-22 13:44:47,723] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-12-22 13:44:47,723] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-12-22 13:44:47,724] stat1010/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-12-22 13:44:47,724] stat1010/INFO/locust.runners: All users spawned: {"Embeddings": 2} (2 total users)
[2025-12-22 13:46:47,260] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-12-22 13:46:47,339] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     https://inference-staging.svc.codfw.wmnet:30443/v1/models/qwen3-embedding:predict    1890     0(0.00%) |     74      66     304     70 |   15.80        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      1890     0(0.00%) |     74      66     304     70 |   15.80        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     https://inference-staging.svc.codfw.wmnet:30443/v1/models/qwen3-embedding:predict       70     71     74     80     93    100    100    110    250    300    300   1890
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                             70     71     74     80     93    100    100    110    250    300    300   1890

(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$
(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ MODEL=embeddings locust
Min length: 250, Max length: 350
       question_length
count        65.000000
mean        301.353846
std          28.316532
min         250.000000
25%         283.000000
50%         303.000000
75%         324.000000
max         348.000000
[2025-12-22 13:49:40,609] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-12-22 13:49:40,609] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-12-22 13:49:40,610] stat1010/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-12-22 13:49:40,610] stat1010/INFO/locust.runners: All users spawned: {"Embeddings": 2} (2 total users)
[2025-12-22 13:51:40,147] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-12-22 13:51:40,224] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     https://inference-staging.svc.codfw.wmnet:30443/v1/models/qwen3-embedding:predict    1902     0(0.00%) |     74      67     292     70 |   15.90        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      1902     0(0.00%) |     74      67     292     70 |   15.90        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     https://inference-staging.svc.codfw.wmnet:30443/v1/models/qwen3-embedding:predict       70     72     74     78     91     98    100    110    270    290    290   1902
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                             70     72     74     78     91     98    100    110    270    290    290   1902

@kevinbazira @OKarakaya-WMF thanks! Is there a way to call this API in a way that is compatible with the OpenAI embedding format?
Regarding query size, qwen3 suggests a prompt that looks like this:

Instruct: Given a web search query, retrieve relevant passages that answer the query
Query:$user_query_here

Is there a chance that this prompt gets cached after multiple requests?

@kevinbazira @OKarakaya-WMF thanks! Is there a way to call this API in a way that is compatible with the OpenAI embedding format?

Hi David, we have reviewed the OpenAI embeddings API format and below are the changes we would have to make to support this requirement:

1. Request Format

Our current embeddings API expects a POST request with a JSON body like this:

{
    "instances": ["text1", "text2"]
}

The OpenAI embeddings API expects this format instead:

{
    "input": ["text1", "text2"],
    "model": "qwen3-embedding"
}

The changes we would have to make on the request format are:

  1. Replace the instances key with input.
  2. Add a model field to specify which model to use. However, this may be redundant since our API is already model-specific based on the endpoint being called. Is this field a hard requirement for your use case?
2. Response Format

Our current embeddings API returns a response like this:

{
    "model_name": "qwen3-embedding",
    "model_version": "",
    "predictions": [
        [-0.0363, ...],
        [-0.0250, ...]
    ]
}

The OpenAI embeddings API returns:

{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "embedding": [-0.0363, ...],
            "index": 0
        },
        {
            "object": "embedding",
            "embedding": [-0.0250, ...],
            "index": 1
        }
    ],
    "model": "qwen3-embedding",
    "usage": {
        "prompt_tokens": 4,
        "total_tokens": 4
    }
}

The changes we would have to make on the response format are:

  1. Replace the predictions array with a data array, where each element is an object containing the embedding vector and its original index.
  2. Add a usage object that reports token counts. Is this field a hard requirement for your use case?

Please confirm whether the changes above to the request and response formats will meet your requirements.
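The request/response changes described above amount to a thin adapter. The following is a hedged sketch of that mapping, not the actual model-server patch; it drops the `model` request field and omits `usage`, both of which are still open questions above:

```python
def openai_to_kserve_request(body: dict) -> dict:
    # "input" (OpenAI) maps to "instances" (current API); "model" is
    # dropped because the endpoint is already model-specific.
    return {"instances": body["input"]}


def kserve_to_openai_response(resp: dict) -> dict:
    # Each prediction vector becomes an embedding object carrying its
    # original index, per the OpenAI response shape shown above.
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "embedding": vec, "index": i}
            for i, vec in enumerate(resp["predictions"])
        ],
        "model": resp["model_name"],
        # "usage" omitted pending confirmation that token counts are needed.
    }
```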

@kevinbazira thanks, yes, this seems like a format that OpenSearch would be able to work with (P86755 is what is working at the moment; we don't pass the model attribute because we currently use llama.cpp, which does not support multi-model serving, but we can pass it if required).

Later I stumbled on https://kserve.github.io/website/docs/model-serving/generative-inference/overview#api-endpoints which mentions OpenAI compatibility but could not make it work with the current endpoint:
curl -H "Content-Type: application/json" -XPOST -HHost:embeddings.experimental.wikimedia.org -d'{"input": ["text1", "text2"]}' https://inference-staging.svc.codfw.wmnet:30443/openai/v1/embeddings
-> {"detail":"Not Found"}

  1. Add a usage object that reports token counts. Is this field a hard requirement for your use case?

I don't think so. I believe OpenSearch will simply ignore that part (I don't see anything in the codebase that suggests otherwise, but I haven't tested to confirm). Please feel free to ignore this requirement and we'll do some testing to confirm. Thanks! :)

Change #1223000 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] embeddings: support OpenAI-compatible API format

https://gerrit.wikimedia.org/r/1223000

Change #1223000 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] embeddings: support OpenAI-compatible API format

https://gerrit.wikimedia.org/r/1223000

Change #1223172 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update embeddings model-server image

https://gerrit.wikimedia.org/r/1223172

Change #1223172 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update embeddings model-server image

https://gerrit.wikimedia.org/r/1223172

Later I stumbled on https://kserve.github.io/website/docs/model-serving/generative-inference/overview#api-endpoints which mentions OpenAI compatibility but could not make it work with the current endpoint.

This URL didn't work because the current embeddings inference service was built using a custom KServe model-server rather than the KServe HuggingFace runtime.

The HuggingFace runtime supports embeddings and OpenAI-compatible API endpoints out of the box, but it requires vLLM, which is not currently supported by our ROCm version. This limitation is being addressed in T385173 and T394778.

In the meantime, the current embeddings inference service has been updated to support OpenAI-compatible API request and response formats as discussed earlier.

$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/qwen3-embedding:predict" -X POST -d '{"input": ["text1", "text2"]}' -H  "Host: embeddings.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [ -0.03631591796875, ... ],
      "index": 0
    },
    {
      "object": "embedding",
      "embedding": [ -0.0250396728515625, ... ],
      "index": 1
    }
  ],
  "model": "/mnt/models/"  # going to fix this
}
real	0m0.079s
user	0m0.016s
sys	0m0.004s

Please test this and let us know whether it's compatible with opensearch.

Change #1223626 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] embeddings: add model_version config

https://gerrit.wikimedia.org/r/1223626

Change #1223626 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] embeddings: add model_version config

https://gerrit.wikimedia.org/r/1223626

Please test this and let us know whether it's compatible with opensearch.

Thanks Kevin!
I tested and this works as expected. A minor annoyance is that I can't seem to propagate the Host header from the OpenSearch connector config, and I had to hack the /etc/hosts file of the host running OpenSearch to make it work. I suspect that won't be necessary once we move out of staging?
Regarding the end state for the MVP, we are likely going to use an OpenSearch cluster in eqiad (relforge), but the host you mentioned, embeddings.experimental.wikimedia.org, made me wonder if you had in mind exposing this outside the WMF infra? I was expecting an internal endpoint, possibly with a discovery DNS record (embeddings.inference.discovery.wmnet) pointing either to eqiad or codfw.

Change #1223629 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update embeddings model-server image

https://gerrit.wikimedia.org/r/1223629

A minor annoyance is that I can't seem to propagate the Host header from the OpenSearch connector config and had to hack the /etc/hosts file of the host running OpenSearch to make it work. I suspect that it won't be necessary once we move out of staging?

Please ignore this, sorry. I was able in the end to propagate the Host header and made the system work without any /etc/hosts hacks.

Change #1223629 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update embeddings model-server image

https://gerrit.wikimedia.org/r/1223629

Change #1223961 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: deploy embeddings isvc to llm ns prod

https://gerrit.wikimedia.org/r/1223961

Change #1223961 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy embeddings isvc to llm ns prod

https://gerrit.wikimedia.org/r/1223961

Thank you for testing and confirming that this service works as expected, David! We have now deployed it to LiftWing production (eqiad). It can be accessed using the embeddings.llm.wikimedia.org host header, as shown below:

$ curl "https://inference.svc.eqiad.wmnet:30443/v1/models/qwen3-embedding:predict" -X POST -d '{"input": ["text1", "text2"]}' -H "Host: embeddings.llm.wikimedia.org" -H "Content-Type: application/json" --http1.1

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [ -0.036407470703125, ...],
      "index": 0
    },
    {
      "object": "embedding",
      "embedding": [ -0.0249786376953125, ...],
      "index": 1
    }
  ],
  "model": "Qwen3-Embedding-0.6B"
}    

Regarding your question: you are correct, we do not plan to expose the embeddings service outside the WMF infra. Please use this prod endpoint for the MVP and let us know if you experience any routing issues so that we can engage an SRE.

Hello @dcausse ,

Do we plan to query the API on prod with the following prompt?
We set the max length to 300 chars, so if the query text is longer than 300 chars, only the first 300 chars will be used.
We can increase it if we expect longer text.
The following prompt is ~90 chars.

Instruct: Given a web search query, retrieve relevant passages that answer the query
Query:$user_query_here

We get better results on prod.

  • Median text length: 303. Requirement: 72 (+ prompt, if present).
  • Median latency: 32 ms; max latency: 290 ms. Requirement: <300 ms.
  • Requests per second: 22. Requirement: 5.
(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ MODEL=embeddings locust
Min length: 250, Max length: 350
       question_length
count        65.000000
mean        301.353846
std          28.316532
min         250.000000
25%         283.000000
50%         303.000000
75%         324.000000
max         348.000000
[2026-01-08 09:09:52,173] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2026-01-08 09:09:52,173] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2026-01-08 09:09:52,174] stat1010/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2026-01-08 09:09:52,174] stat1010/INFO/locust.runners: All users spawned: {"Embeddings": 2} (2 total users)
[2026-01-08 09:11:51,684] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2026-01-08 09:11:51,765] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/qwen3-embedding:predict                                              2635     0(0.00%) |     38      28     287     32 |   22.04        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      2635     0(0.00%) |     38      28     287     32 |   22.04        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/qwen3-embedding:predict                                                     32     36     42     45     52     56     68    240    270    290    290   2635
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                             32     36     42     45     52     56     68    240    270    290    290   2635

Hello @dcausse ,

Do we plan to query the API on prod with the following prompt?

Yes, this is what is suggested in the doc, and not using this prompt does indeed give worse results at query time.
CirrusSearch by default rejects any query longer than 300 chars.
So if the Lift Wing limit is 300 chars (prompt included), Cirrus might send queries that are too long.
I think it would make sense to increase the Lift Wing limit to len(prompt)+300 if possible.

Relatedly, do we have a KV cache in place so that the prompt tokens have less impact on latencies? Would it be possible to run your benchmark with the given prompt in front of your generated questions?
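The limit adjustment suggested here could be sketched as follows (a minimal illustration; the function names are hypothetical, not the actual Lift Wing code):

```python
# Minimal sketch of the suggested limit: allow len(prompt) + 300 chars so
# the user query itself can still be up to CirrusSearch's 300-char cap.
# Names are illustrative only.

CIRRUS_MAX_QUERY_CHARS = 300  # CirrusSearch errors on longer queries

def max_input_chars(prompt: str) -> int:
    """Upper bound for the full input text (prompt + query)."""
    return len(prompt) + CIRRUS_MAX_QUERY_CHARS

def within_limit(full_input: str, prompt: str) -> bool:
    return len(full_input) <= max_input_chars(prompt)
```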

@dcausse , cool.
I'll update the service and the performance tests accordingly.

When the median query length is 77 chars + the prompt (108 chars):

  • max latency: 290ms.
  • 99.9 percentile latency: 280ms.
  • median latency: 34ms

When the median query length is 303 chars + the prompt (108 chars):

  • max latency: 570ms.
  • 99.9 percentile latency: 370ms.
  • median latency: 33ms

The max query length is now 500 chars.
@dcausse you will still need to send the query together with the prompt to the API. Please let me know if this is possible in OpenSearch. Otherwise, we can discuss adding the prompt on the API side.
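For illustration, a request with the prompt included might look like this. The `input` field name follows the OpenAI-style schema mentioned in the task description; the exact payload shape accepted by the `:predict` endpoint is an assumption, not a confirmed schema.

```python
# Illustrative payload with the prompt prepended to the user query.
# The "input" field is assumed from the OpenAI-style API mentioned in the
# task description; the ":predict" endpoint's real schema may differ.
import json

PROMPT = ("Instruct: Given a web search query, retrieve relevant passages "
          "that answer the query\nQuery:")
payload = {"input": PROMPT + "banana bread recipe"}
body = json.dumps(payload)
```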

@dcausse @pfischer please let me know if the performance looks good for MVP.

(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ MODEL=embeddings locust
Min length: 250, Max length: 350
       question_length
count        65.000000
mean        301.353846
std          28.316532
min         250.000000
25%         283.000000
50%         303.000000
75%         324.000000
max         348.000000
Prompt length:  108
[2026-01-08 12:47:18,192] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2026-01-08 12:47:18,192] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2026-01-08 12:47:18,193] stat1010/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2026-01-08 12:47:18,193] stat1010/INFO/locust.runners: All users spawned: {"Embeddings": 2} (2 total users)
[2026-01-08 12:49:17,685] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2026-01-08 12:49:17,768] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/qwen3-embedding:predict                                              2577     0(0.00%) |     41      29     571     33 |   21.56        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      2577     0(0.00%) |     41      29     571     33 |   21.56        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/qwen3-embedding:predict                                                     33     37     44     48     55     60     82    260    370    570    570   2577
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated
(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ MODEL=embeddings locust
Min length: 50, Max length: 350
       question_length
count      3952.000000
mean         84.550354
std          37.088196
min          50.000000
25%          64.000000
50%          77.000000
75%          93.000000
max         348.000000
Prompt length:  108
[2026-01-08 12:50:39,141] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2026-01-08 12:50:39,142] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2026-01-08 12:50:39,142] stat1010/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2026-01-08 12:50:39,143] stat1010/INFO/locust.runners: All users spawned: {"Embeddings": 2} (2 total users)
[2026-01-08 12:52:38,649] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2026-01-08 12:52:38,727] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/qwen3-embedding:predict                                              2586     0(0.00%) |     41      28     292     34 |   21.64        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      2586     0(0.00%) |     41      28     292     34 |   21.64        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/qwen3-embedding:predict                                                     34     38     45     48     56     61     81    260    280    290    290   2586
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated

@OKarakaya-WMF awesome, thanks, p50 at 34ms is nice! If I'm reading the numbers right, it does seem like the prompt is not adding much overhead.
Yes, the prompt will be sent on every request for now; if deemed necessary we could think about some named prompt templates, but it's probably too early for that at this point.

Thank you again @dcausse ,

We are closing this task.
We can keep improving the service based on your findings.
Looking forward to the next steps of the MVP.
😍

Sucheta-Salgaonkar-WMF renamed this task from Semantic Search - Embeddings Service for MVP to Q2 FY2025-26 Goal: Semantic Search - Embeddings Service for MVP.Mon, Jan 12, 5:34 AM
Sucheta-Salgaonkar-WMF added a project: Goal.