
Host an OpenVINO model in LiftWing
Closed, Declined · Public

Description

Background

OpenVINO is an open-source toolkit developed by Intel for optimizing and deploying deep learning models across various hardware platforms, including Intel CPUs, GPUs, and other accelerators. In my recent experiments, I found that Int8 or Int4 quantized OpenVINO models are functional with reasonable speed on Intel CPUs. The models I tried in my experiments include Gemma 3, Phi4, Phi3, Deepseek Distill R1, Qwen 3.5 and Qwen 4. I have been trying this on my development laptop (ThinkPad X1 Carbon) and on stat1010.eqiad.wmnet. I posted screencasts of these experiments at https://asciinema.org/a/Kp9WyRrXajzoNdLgNdZOTCFdI

Proposal

I propose we host an OpenVINO model on Intel CPUs using the LiftWing infrastructure. This will tell us:

  1. The baseline performance and usability of models with the OpenVINO and Intel CPU setup.
  2. The various parameters, such as CPU cores and RAM, that influence the performance.
  3. The latency and throughput characteristics.
  4. The use cases we can address with this setup if everything goes well.

The model I would like to use for this initial experiment is https://huggingface.co/OpenVINO/Phi-4-mini-instruct-int8-ov
Original model: Phi-4-mini-instruct
License: MIT
Model Creator: Microsoft
Quantization: Int8
Announcement Blogpost: https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/
Capabilities: Instruction following, Chat, Tool Calling, Tokenization
Supported languages: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian

Technical plan

The model can be used directly with the OpenVINO and Optimum Python libraries, but that won't expose its capabilities in a generic way (APIs). The OpenVINO Model Server is the recommended way to host these models. Once a model repository is configured with the model server, it exposes a REST API (compatible with the OpenAI APIs). This API can be integrated directly into applications, or KServe can act as a proxy. There is also the https://github.com/vllm-project/vllm-openvino project, but it seems to be quite a new project at this point in time.

Steps

  • Prepare a container image with the OpenVINO Model Server, the model repository and configuration
  • Test it and get it uploaded to https://docker-registry.wikimedia.org
  • Deploy to LiftWing
    • Prepare the initial k8s configuration. Roughly 8 CPUs and 16GB RAM are expected
  • Measure the performance (see the sketch below)
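
As a rough illustration of the measurement step, here is a minimal latency sketch against an OpenAI-compatible chat completions endpoint of the kind described later in this task. The URL, port and model name are placeholders/assumptions for illustration, not the final deployment:

# Minimal latency probe against an OpenAI-compatible chat completions endpoint.
# The endpoint URL and model name below are illustrative assumptions only.
import time
import statistics
import requests

ENDPOINT = "http://localhost:8080/v3/chat/completions"  # assumed local OVMS endpoint
MODEL = "Phi-4-mini-instruct-int8-ov"                    # assumed model name

payload = {
    "model": MODEL,
    "max_tokens": 100,
    "temperature": 0,
    "stream": False,
    "messages": [{"role": "user", "content": "Summarize what Wikipedia is in one sentence."}],
}

latencies = []
for _ in range(10):
    start = time.monotonic()
    r = requests.post(ENDPOINT, json=payload, timeout=120)
    r.raise_for_status()
    latencies.append(time.monotonic() - start)

usage = r.json().get("usage", {})
print(f"median latency: {statistics.median(latencies):.2f}s")
print(f"completion tokens (last request): {usage.get('completion_tokens')}")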

Event Timeline

Change #1151645 had a related patch set uploaded (by Santhosh; author: Santhosh):

[machinelearning/liftwing/inference-services@main] WIP: Openvino model server integration

https://gerrit.wikimedia.org/r/1151645

I tried to integrate the OpenVINO Model Server into LiftWing. Learnings from the first iteration (see the above WIP patch):

The OpenVINO Model Server's latest version, 2025.1, requires Python 3.12+ and GLIBC 2.38+, so we cannot use our bookworm images to prepare the base image. Because of this, I used the OpenVINO Model Server docker image (the official image provided by Intel) for this iteration.

@isarantopoulos I am not sure about the best practice for the base images in LiftWing. Please advise.

Using older versions of OpenVINO is not an option: I tried them in the past and they were not performant enough, which defeats the purpose.

I am working on bridging the KServe API with the OpenVINO server.

Regarding the KServe API and Openvino model server:

The KServe-compatible OVMS REST API is documented at https://docs.openvino.ai/2025/model-server/ovms_docs_rest_api_kfs.html#inference-api
It provides APIs compatible with V2 of the KServe API standard: https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#httprest

We are currently using V1 of KServe. V2 introduces the /infer API, but this API accepts token ids, shapes, etc.: basically the raw inputs to the model. To call that API, you need to compute the model inputs yourself, i.e. do the tokenization and shaping, which is not at all trivial and cannot be expected from API consumers. (Or I am completely misunderstanding the spec.)
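
To illustrate the point, here is a minimal, hypothetical sketch of the client-side work a V2 /infer call would need for an LLM: the consumer would first have to tokenize the prompt and then ship raw token ids with shapes and datatypes. The endpoint path, model name and tensor name are assumptions for illustration only:

# Hypothetical V2 (Open Inference Protocol) request for an LLM: the client has to
# tokenize the prompt itself and send raw tensors. Names and paths are illustrative.
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct")  # assumed tokenizer
input_ids = tokenizer("Hello, how are you?", return_tensors="np")["input_ids"]

payload = {
    "inputs": [
        {
            "name": "input_ids",                 # assumed input tensor name
            "shape": list(input_ids.shape),      # e.g. [1, seq_len]
            "datatype": "INT64",
            "data": input_ids.flatten().tolist(),
        }
    ]
}
# assumed endpoint path following the V2 spec
r = requests.post("http://localhost:8080/v2/models/phi-4-mini-instruct/infer", json=payload)
print(r.json())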

There is a problem with both the v1 and v2 KServe APIs for LLM inference though: both are incompatible with the OpenAI API spec, which has become the de facto standard for LLM inference (thanks to OpenRouter and the many SDKs that adopted it). OVMS recommends that API and readily provides it too, under the /v3 path. I got this working. It is generic: you can give roles (system, assistant, user, tool) and more options for inference. I would like to move in that direction rather than KServe's limited infer API.

At present, the following works based on my patch.

POST http://localhost:8080/v3/chat/completions
Content-Type: application/json

{
  "model": "Phi-4-mini-instruct-int4-ov",
  "max_tokens": 300,
  "temperature": 0,
  "stream": false,
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello, How are you!" }
  ]
}

Output (from my dev laptop: X1 Carbon 9th gen, Intel i7 CPU):

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Hello! I'm just a computer program, so I don't have feelings, but I'm here and ready to assist you. How can I help you today?",
        "role": "assistant"
      }
    }
  ],
  "created": 1748516549,
  "model": "Phi-4-mini-instruct-int4-ov",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 17,
    "completion_tokens": 31,
    "total_tokens": 48
  }
}

Since you have tried vLLM inference, I am sure you will have thoughts on this as well. Please let me know. @isarantopoulos , @kevinbazira

Docker images: the images that we can use as base images in WMF's production clusters have to be in the production-images repo. Looking at the Dockerfile instructions available in the openvino repo, if we wanted to use this image in prod we would have to port one of the Dockerfiles created for Ubuntu to start from a Debian distribution. Regarding Python 3.12, we could install this Python version in bookworm.

v1/v2 protocol
At the moment on LiftWing we don't support v2, but we can plan the work required to do it for new services. One can however test the v2 protocol by building an image locally at the moment.
kserve v2 follows the Open Inference Protocol (OIP), which aims to become a standard across platforms (kserve, seldon, triton, openvino etc.). Looking at this kserve github page it says that openvino has also adopted this, although the link is broken so I don't know what to assume here.
Regarding the input required for v2, you can either pass the tensor directly or any other input you want. Here is an example of how a v1 request is translated to a v2 request:

v1:

curl -s localhost:8080/v1/models/articlequality:predict -X POST -d '{"rev_id": 12345, "lang": "en"}'

v2:

curl -s localhost:8080/v2/models/articlequality/infer -X POST \
  -d '{
        "inputs": [
          {
            "name": "lang",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["en"]
          },
          {
            "name": "rev_id",
            "shape": [1],
            "datatype": "INT64",
            "data": [12345]
          }
        ]
      }'

Have you tried to test if v1 would work? We have been using the huggingfaceserver in kserve which supports the OpenAI protocol out of the box.

I'm not against going a different route than kserve, but we would have to scope the work required to do so before taking a final decision. Apart from the docker image work there is the kubernetes side of things (helm charts, prometheus metrics).

> Regarding the KServe API and Openvino model server:
> ...
> Since you have tried vLLM inference, I am sure you will have thoughts on this as well. Please let me know. @isarantopoulos , @kevinbazira

As per Ilias' comment in T395012#10874407, I ran a simple comparison between vllm and ovms using KServe's v1 protocol on ml-lab1002. The goal was to get an understanding of how ovms' setup and performance compare with vllm using a simple KServe inference service. Here are the results:

vLLM

Using the docker image built in 1146891, which supports vllm and kserve, I tested a simple isvc as shown below:

1. Set up simple vllm model-server

from vllm import LLM, SamplingParams
import kserve
import os

MODEL_ID = "/srv/app/models/Phi-4-mini-instruct"  # path inside container
GPU_MEMORY_UTILIZATION = 1.0
MAX_MODEL_LEN = 4096


class SimpleVLLMModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        self.llm = None

    def load(self) -> bool:
        """
        Called once at startup. Instantiate the vLLM engine.
        """
        self.llm = LLM(
            model=MODEL_ID,
            gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
            max_model_len=MAX_MODEL_LEN
        )
        self.ready = True
        return True

    def preprocess(self, payload: dict, headers: dict = None) -> str:
        """
        Extract a plain prompt string from the incoming payload.
        Expecting JSON like:
            { "instances": [ { "text": "Your input prompt here" } ] }
        """
        instances = payload.get("instances", [])
        if len(instances) == 0 or "text" not in instances[0]:
            raise kserve.errors.InvalidInput("Missing 'text' field in instances")
        return instances[0]["text"]

    def predict(self, processed_input: str, headers: dict = None) -> dict:
        """
        Call vLLM's generate() on the single prompt and return JSON.
        """
        sampling_params = SamplingParams(
            n=1,
            temperature=0.7,
            top_p=0.9,
            top_k=50,
            max_tokens=100
        )
        # vLLM expects a list of prompts + list of SamplingParams
        outputs = self.llm.generate([processed_input], [sampling_params])
        generated = outputs[0].outputs[0].text.strip()

        return {"predictions": [{"generated_text": generated}]}


if __name__ == "__main__":
    model_name = os.environ.get("MODEL_NAME", "simple-vllm")
    model = SimpleVLLMModel(model_name)
    model.load()
    kserve.ModelServer().start([model])

2. Start the vllm isvc to serve phi-4-mini-instruct on GPU

$ python3 simple_vllm_model.py
INFO 06-02 13:24:22 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 06-02 13:24:23 [__init__.py:239] Automatically detected platform rocm.
INFO 06-02 13:24:24 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 06-02 13:24:40 [config.py:716] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 06-02 13:24:48 [arg_utils.py:1699] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
INFO 06-02 13:24:48 [config.py:1770] Disabled the custom all-reduce kernel because it is not supported on current platform.
INFO 06-02 13:24:48 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.5.dev243+gc53e0730c) with config: model='/srv/app/models/Phi-4-mini-instruct', speculative_config=None, tokenizer='/srv/app/models/Phi-4-mini-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/srv/app/models/Phi-4-mini-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 06-02 13:24:48 [rocm.py:186] None is not supported in AMD GPUs.
INFO 06-02 13:24:48 [rocm.py:187] Using ROCmFlashAttention backend.
[W602 13:24:48.833832012 ProcessGroupNCCL.cpp:1028] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 06-02 13:24:48 [parallel_state.py:946] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-02 13:24:48 [model_runner.py:1120] Starting to load model /srv/app/models/Phi-4-mini-instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:01<00:01, 1.60s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00, 2.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:04<00:00, 2.30s/it]

INFO 06-02 13:24:54 [loader.py:458] Loading weights took 4.91 seconds
INFO 06-02 13:24:54 [model_runner.py:1152] Model loading took 8.5215 GiB and 5.464753 seconds
INFO 06-02 13:24:55 [worker.py:287] Memory profiling takes 0.85 seconds
INFO 06-02 13:24:55 [worker.py:287] the current vLLM instance can use total_gpu_memory (63.98GiB) x gpu_memory_utilization (1.00) = 63.98GiB
INFO 06-02 13:24:55 [worker.py:287] model weights take 8.52GiB; non_torch_memory takes 0.29GiB; PyTorch activation peak memory takes 1.84GiB; the rest of the memory reserved for KV Cache is 53.33GiB.
INFO 06-02 13:24:55 [executor_base.py:112] # rocm blocks: 27305, # CPU blocks: 2048
INFO 06-02 13:24:55 [executor_base.py:117] Maximum concurrency for 4096 tokens per request: 106.66x
INFO 06-02 13:24:56 [model_runner.py:1462] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:16<00:00, 2.10it/s]
INFO 06-02 13:25:13 [model_runner.py:1604] Graph capturing finished in 17 secs, took 0.20 GiB
INFO 06-02 13:25:13 [llm_engine.py:443] init engine (profile, create kv cache, warmup model) took 18.56 seconds
2025-06-02 13:25:13.410 1996 kserve INFO [model_server.py:register_model():398] Registering model: simple-vllm
2025-06-02 13:25:13.411 1996 kserve INFO [model_server.py:setup_event_loop():278] Setting max asyncio worker threads as 32
2025-06-02 13:25:13.429 1996 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
2025-06-02 13:25:13.430 1996 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
2025-06-02 13:25:13.441 1996 uvicorn.error INFO: Started server process [1996]
2025-06-02 13:25:13.441 1996 uvicorn.error INFO: Waiting for application startup.
2025-06-02 13:25:13.444 1996 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
2025-06-02 13:25:13.444 1996 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
2025-06-02 13:25:13.445 1996 uvicorn.error INFO: Application startup complete.
2025-06-02 13:25:13.445 1996 uvicorn.error INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

3. Query the vllm isvc

$ time curl -s localhost:8080/v1/models/simple-vllm:predict -X POST -d '{ "instances": [ { "text": "Who are you?" } ] }' -i -H "Content-type: application/json"
HTTP/1.1 200 OK
date: Mon, 02 Jun 2025 13:28:27 GMT
server: uvicorn
content-length: 152
content-type: application/json

{"predictions":[{"generated_text":"I am Phi, your assistant. I'm here to help you with any questions or tasks you have. What can I do for you today?"}]}
real	0m0.471s
user	0m0.014s
sys	0m0.000s

OVMS

Using the docker image built in 1151645, which supports ovms and kserve, I tested a simple isvc as shown below:

1. Set up simple ovms model-server

import numpy as np
import os
import json
import kserve
from transformers import AutoTokenizer
from openvino.runtime import Core

MODEL_BASE = "/mnt/models/phi-4-mini-instruct-int8-ov/1"  # version folder
IR_XML = os.path.join(MODEL_BASE, "openvino_model.xml")
IR_BIN = os.path.join(MODEL_BASE, "openvino_model.bin")


class SimpleOVMSModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.name = name
        self.tokenizer = None
        self.core = None
        self.compiled = None
        self.input_ids_name = None
        self.attn_name = None
        self.pos_name = None
        self.beam_name = None
        self.logits_name = None

    def load(self) -> bool:
        """
        1. Load HuggingFace tokenizer from MODEL_BASE
        2. Read and compile the IR with OpenVINO
        3. Cache input/output layer names
        """
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_BASE, trust_remote_code=False)

        self.core = Core()
        model = self.core.read_model(model=IR_XML, weights=IR_BIN)
        self.compiled = self.core.compile_model(model, device_name="CPU")

        # Cache layer names (assumes standard naming: input_ids, attention_mask, position_ids, beam_idx)
        inputs = list(self.compiled.inputs)
        self.input_ids_name = inputs[0]
        self.attn_name = inputs[1]
        self.pos_name = inputs[2]
        self.beam_name = inputs[3]
        self.logits_name = list(self.compiled.outputs)[0]

        self.ready = True
        return True

    def preprocess(self, payload: dict, headers: dict = None) -> dict:
        """
        Expect:
            { "instances": [ { "text": "some sentence" } ] }
        Tokenize to NumPy arrays for input_ids, attention_mask, position_ids, beam_idx.
        Return a dict of Python lists so that JSON is serializable.
        """
        instances = payload.get("instances", [])
        if not instances or "text" not in instances[0]:
            raise kserve.errors.InvalidInput("Missing 'text' field in instances")

        text = instances[0]["text"]
        enc = self.tokenizer(text, return_tensors="np", padding=True, truncation=True)

        input_ids = enc["input_ids"].astype(np.int64)            # shape (1, seq_len)
        attention_mask = enc["attention_mask"].astype(np.int64)  # shape (1, seq_len)

        batch_size, seq_len = input_ids.shape
        position_ids = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
        beam_idx = np.zeros((batch_size,), dtype=np.int32)

        return {
            "input_ids": input_ids.tolist(),
            "attention_mask": attention_mask.tolist(),
            "position_ids": position_ids.tolist(),
            "beam_idx": beam_idx.tolist()
        }

    def predict(self, processed_inputs: dict, headers: dict = None) -> dict:
        """
        1. Convert the lists back to NumPy arrays
        2. Call OpenVINO .infer_new_request(...)
        3. Greedy argmax over logits → token IDs
        4. Detokenize IDs → text
        5. Return {"predictions":[{"text": generated_text}]}
        """
        # Reconstruct NumPy arrays
        input_ids = np.array(processed_inputs["input_ids"], dtype=np.int64)
        attention_mask = np.array(processed_inputs["attention_mask"], dtype=np.int64)
        position_ids = np.array(processed_inputs["position_ids"], dtype=np.int64)
        beam_idx = np.array(processed_inputs["beam_idx"], dtype=np.int32)

        infer_inputs = {
            self.input_ids_name: input_ids,
            self.attn_name: attention_mask,
            self.pos_name: position_ids,
            self.beam_name: beam_idx
        }
        results = self.compiled.infer_new_request(infer_inputs)
        logits = results[self.logits_name]  # shape: (1, seq_len, vocab_size)

        # Greedy decode: pick argmax for each new token step
        # (assumes this IR outputs full sequence of new tokens)
        token_ids = np.argmax(logits, axis=-1).flatten().tolist()

        # Detokenize back to text
        generated_text = self.tokenizer.decode(token_ids, skip_special_tokens=True)

        return {"predictions": [{"text": generated_text}]}


if __name__ == "__main__":
    model_name = os.environ.get("MODEL_NAME", "simple-ovms")
    model = SimpleOVMSModel(model_name)
    model.load()
    kserve.ModelServer().start([model])

2. Start the ovms isvc to serve phi-4-mini-instruct-int8-ov on CPU

$ python3 simple_ovms_model.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
/ovms/lib/python/openvino/runtime/__init__.py:10: DeprecationWarning: The `openvino.runtime` module is deprecated and will be removed in the 2026.0 release. Please replace `openvino.runtime` with `openvino`.
 warnings.warn(
2025-06-02 13:39:57.609 3256 kserve INFO [model_server.py:register_model():398] Registering model: simple-ovms
2025-06-02 13:39:57.610 3256 kserve INFO [model_server.py:setup_event_loop():278] Setting max asyncio worker threads as 32
2025-06-02 13:39:57.627 3256 kserve INFO [server.py:_register_endpoints():110] OpenAI endpoints not registered
2025-06-02 13:39:57.627 3256 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
2025-06-02 13:39:57.639 3256 uvicorn.error INFO: Started server process [3256]
2025-06-02 13:39:57.639 3256 uvicorn.error INFO: Waiting for application startup.
2025-06-02 13:39:57.642 3256 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
2025-06-02 13:39:57.642 3256 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
2025-06-02 13:39:57.642 3256 uvicorn.error INFO: Application startup complete.
2025-06-02 13:39:57.642 3256 uvicorn.error INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

3. Query the ovms isvc

$ time curl -s localhost:8080/v1/models/simple-ovms:predict -X POST -d '{ "instances": [ { "text": "Who are you?" } ] }' -i -H "Content-type: application/json"
HTTP/1.1 200 OK
date: Mon, 02 Jun 2025 13:42:03 GMT
server: uvicorn
content-length: 44
content-type: application/json

{"predictions":[{"text":" sold the? What"}]}
real	0m1.174s
user	0m0.000s
sys	0m0.011s

My takeaways:

1. OVMS on CPU (~1.17s) is slower than vLLM on GPU (~0.47s) in this simple test. If we plan to host LLMs on CPU with ovms, we should expect a performance drop.
2. Whereas vLLM can serve original models from HuggingFace, ovms requires an extra step to convert the model to the OpenVINO IR format (xml/bin) and place it under a version folder (see the conversion sketch below). This will likely require another evaluation step to ensure the converted model matches the original model's accuracy.
3. On a brighter note, we can use ovms with the KServe v1 protocol.
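
For reference, here is a minimal sketch of that conversion step, assuming the optimum-intel package is available and the original model can be downloaded from HuggingFace; the model id, output path and settings are illustrative placeholders, not the exact commands used here:

# Sketch: export a HuggingFace causal LM to OpenVINO IR (openvino_model.xml/.bin)
# using optimum-intel, then save it under a version folder for serving.
# Model id and paths are illustrative assumptions.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"
out_dir = "models/phi-4-mini-instruct-ov/1"  # "1" is the version folder

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
ov_model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

ov_model.save_pretrained(out_dir)   # writes openvino_model.xml / openvino_model.bin
tokenizer.save_pretrained(out_dir)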

Thanks @kevinbazira and @isarantopoulos for these details. Very useful information.

Kevin, what you tried with openvino is the low-level openvino API, and that code seems to have some issues. The output you got does not make sense (" sold the? What"). It could be because the input does not match the expected input of Phi-4, which expects messages with roles. You mentioned it as OVMS, but it is not OVMS.

OVMS is the Openvino Model Server (https://github.com/openvinotoolkit/model_server), very similar to huggingfaceserver. I had tried the low-level openvino APIs; they work, but are quite hard to work with since you need to configure each model with its internal architecture. I am not proposing that for this exploration. I am proposing the Openvino Model Server setup. I imagine it would be similar to our huggingface_server setup.

There won't be any Python code we add, other than plugging OVMS in with an entrypoint.sh and the corresponding Blubber pipeline. See the WIP patch https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1151645

@kevinbazira I learned that ml-lab-1002 has AMD CPUs and not Intel CPUs. This is interesting information. OpenVINO targets Intel CPUs, NPUs and GPUs. I had heard that it also works on AMD CPUs, but this is quite undocumented. I think your experiment shows it works on AMD too (AMD EPYC 7643P is what ml-lab-1002 uses).

The speed difference between an AMD GPU and a CPU is expected. As I was discussing with Sucheta and Ilias, I foresee this as an inference capability for use cases where realtime computation is not required. However, trying to expose smaller LMs for realtime usage is not out of scope in my exploration: I found BERT-based models running in OVMS on my Intel CPUs in the ~10-20 millisecond range.
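
To make that concrete, here is a minimal sketch of how such a measurement can be done against an OVMS embeddings endpoint, assuming a BERT-style embedding model (for example the bge model listed later in this task) is already loaded; the URL, port and model name are assumptions for illustration:

# Sketch: time an embeddings request against OVMS's OpenAI-compatible endpoint.
# Endpoint URL and model name are illustrative assumptions.
import time
import requests

ENDPOINT = "http://localhost:8000/v3/embeddings"   # assumed OVMS REST port/path
MODEL = "OpenVINO/bge-base-en-v1.5-fp16-ov"        # assumed embedding model name

payload = {"model": MODEL, "input": "OpenVINO runs BERT-style models quickly on CPU."}

# Warm up once, then measure a handful of requests.
requests.post(ENDPOINT, json=payload, timeout=30)
for _ in range(5):
    start = time.monotonic()
    r = requests.post(ENDPOINT, json=payload, timeout=30)
    r.raise_for_status()
    print(f"{(time.monotonic() - start) * 1000:.1f} ms")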

@isarantopoulos, after reading your reply, I read how huggingfaceserver works and is configured (T357986), and I think we can get OVMS working in the same way.

The huggingface server exposes the OpenAI completion API like this:

curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" -H "Host: llama3.experimental.wikimedia.org" -H "Content-Type: application/json" -X POST -d '{"model": "llama3", "prompt": "Write me a poem about Machine Learning.", "stream":false, "max_tokens": 50}'

And OVMS exposes it like this:

curl http://localhost:8080/v3/chat/completions -H "Content-Type: application/json" -X POST -d '{
  "model": "Phi-4-mini-instruct-int4-ov",
  "max_tokens": 300,
  "temperature": 0,
  "stream": false,
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Who are you?" }
  ]
}'

The only difference is openai/v1 vs v3 in the API URL path. I guess this is manageable with our proxies and optional rewrites? The output is the same for both.
(ref https://docs.openvino.ai/2025/model-server/ovms_docs_genai.html). @isarantopoulos please let me know.
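
As an illustration of why the path difference is the only client-visible change, here is a minimal sketch using the OpenAI Python SDK: switching between the two deployments is just a matter of changing base_url. The hostnames and model names are placeholders:

# Sketch: the same OpenAI-SDK client code works against either server,
# only base_url differs. Hostnames and model names are illustrative placeholders.
from openai import OpenAI

# huggingfaceserver-style deployment (OpenAI API under /openai/v1)
hf_client = OpenAI(base_url="https://inference.example.org/openai/v1", api_key="unused")

# OVMS deployment (OpenAI API under /v3)
ovms_client = OpenAI(base_url="http://localhost:8080/v3", api_key="unused")

for client, model in [(hf_client, "llama3"), (ovms_client, "Phi-4-mini-instruct-int4-ov")]:
    resp = client.chat.completions.create(
        model=model,
        max_tokens=50,
        messages=[{"role": "user", "content": "Who are you?"}],
    )
    print(resp.choices[0].message.content)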

I have added chat, completion and embedding examples that use the OpenAI API in my WIP patch https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1151645 - please have a look.

Now, regarding the KServe v1 and v2 APIs, OVMS documents these as the TensorFlow Serving API and the KServe API respectively. I could not get the :predict (V1) or /infer APIs running so far. The /models API path lists my models, but both the predict and infer APIs say the model name could not be found. This is something I will try again. Sending the input as BYTES is documented, but that is also failing in my experiments (it says it is not a valid type). This is another item to figure out.

However, @isarantopoulos, do you think it is important to expose APIs other than the OpenAI API endpoint? My understanding is that exposing an OpenAI-compatible API endpoint makes the models in OVMS immediately usable, and the v1 predict or v2 infer APIs are nice to have. Please correct me if I am wrong.

One interesting feature in OVMS is that the model repository and configuration use MediaPipe graphs. So unlike huggingface, you cannot just download a model and run it. You need to convert it to the OpenVINO IR format (this is easy and takes a few minutes) and then the graphs need to be prepared (they are text files with node definitions).

@isarantopoulos, regarding the docker image preparation with a production image as the base image: I will take a look into it, but if I get stuck I will report here. I am not really good with Docker image building :-)

I am trying to build a production docker image with WMF Debian bookworm + Python 3.11 + OpenVINO 2025.1.0 + OVMS 2025.1.0. I am referring to the Ubuntu Dockerfile.

There are prebuilt binaries of OpenVINO for Ubuntu, Redhat, and Debian 10 (buster), but not for Debian stable (bookworm). So I am trying to compile it from source on Debian. Compiling the C++ code is quite time consuming (2 hours and still going; so many submodules and dependencies).

Change #1153988 had a related patch set uploaded (by Santhosh; author: Santhosh):

[operations/docker-images/production-images@master] Add Openvino modelserver

https://gerrit.wikimedia.org/r/1153988

Updates: I successfully created a production docker image on top of the WMF production bookworm image.

Upstream provides Dockerfiles for Redhat 8 and Ubuntu 24.04, but not for Debian stable.

For the Dockerfile, I referred to the Ubuntu Dockerfile, but had to make several changes to get this working. I am considering requesting an official Dockerfile for Debian (maybe I can contribute it).

OpenVINO and the OpenVINO Model Server are written in C++ and we compile them from source, including dependencies. They are big projects and on my development laptop it took more than 5 hours! Upstream provides OpenVINO prebuilt binaries for Ubuntu 24.04 and Debian 10, but those won't work on Debian 12 (bookworm) because of glibc and other shared library version conflicts.

It took a lot of time for me to prepare the final version. docker-pkg build with --use-cache helped a lot for incremental preparation of the Dockerfile. It is 238 lines and perhaps the largest one in the production-images repo :-)

The final image is 734MB.

I tested it with the model repository I had prepared. Everything is working as expected so far.

I will try to create a screencast or video recording showing the models in action.

Here is the screencast of everything working together: https://drive.google.com/file/d/1YDSvTm3ePv585-ittck2tYWH2XZ-AHLx/view
(MP4 video, 15mins, 88MB)

Screencast of inference alone (30 seconds), this time not hindered by the CPU usage of screen recording.

https://asciinema.org/a/722578

Update: A new version of OpenVINO and the OpenVINO Model Server was released a few days ago. I updated my production bookworm-based docker image patch https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1153988 to that version. This new version adds support for agentic AI (tool calling and MCP).

Since preparing the production image for bookworm is the most time-consuming effort, I filed a ticket upstream asking them to consider providing a docker image for stable Debian releases: https://github.com/openvinotoolkit/model_server/issues/3452

Now that we have a proof of concept, the ML team needs to take a decision on this so that they can include this work in their planned activities. That will help streamline code review and the next steps.

The model repository setup is easy with the new versions. OVMS can pull models from huggingface repos and set up all the configuration. I wrote a shell script (below) that fetches all the listed models and starts the server. This script is used with the official docker images of OVMS. https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1151645 will require updates based on this.

#!/bin/bash
# Get all the models and prepare configuration
# Refer https://docs.openvino.ai/nightly/model-server/ovms_demos_continuous_batching_rag.html

# Configuration
MODELS_PATH="/data/scratch/santhosh/models"
USER_GROUP="$(id -u):$(id -g)"
IMAGE="openvino/model_server:latest"
BASE_CMD="docker run --user $USER_GROUP --rm -v $MODELS_PATH:/models"

# Model definitions: "model_name:task"
MODELS=(
        "OpenVINO/gemma-2b-it-int4-ov:text_generation"
        "OpenVINO/bge-base-en-v1.5-fp16-ov:embeddings"
        "OpenVINO/bge-reranker-base-fp16-ov:rerank"
        "OpenVINO/FLUX.1-schnell-int4-ov:image_generation"
)

echo "Setting up OpenVINO models..."

# Download models if they don't exist
for model_task in "${MODELS[@]}"; do
        model="${model_task%:*}"
        task="${model_task#*:}"
        model_dir="$MODELS_PATH/${model##*/}"

        if [[ ! -d "$model_dir" ]]; then
                echo "Downloading $model..."
                $BASE_CMD:rw $IMAGE --pull --model_repository_path /models --source_model "$model" --task "$task"
        else
                echo "Model $model already exists, skipping download"
        fi
done

# Add models to config
echo "Configuring models..."
for model_task in "${MODELS[@]}"; do
        model="${model_task%:*}"
        $BASE_CMD:rw $IMAGE --add_to_config /models --model_name "$model" --model_path "$model"
done

# Check if config exists before running server
if [[ -f "$MODELS_PATH/config.json" ]]; then
        echo "Starting OpenVINO model server..."
        $BASE_CMD:ro -p 9000:9000 -p 8000:8000 $IMAGE \
                --config_path /models/config.json --port 8000 --rest_port 9000
else
        echo "Error: Configuration file not found at $MODELS_PATH/config.json"
        exit 1
fi

https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1153988

This patch to prepare a WMF production image is outdated again, as a new version of Debian has been released.

Since the LiftWing hosting plan is pending and under consideration by the Machine-Learning-Team, I set up https://ovms.wmcloud.org/ with the upstream docker image and the above shell script. I wrote an nginx proxy configuration for the landing page, API key authentication and mounting under /api/v1. OpenAI API compatible APIs are available. This instance uses 8 CPU cores, 32GB RAM and 20GB disk (models are on a scratch volume). This wmcloud instance is for experimentation purposes and does not replace this ticket.
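
For anyone who wants to poke at it, here is a minimal sketch of calling that instance with the OpenAI Python SDK, assuming the OpenAI-compatible endpoints are mounted under /api/v1 and the API key is passed as the bearer token; the model name and key handling are assumptions for illustration:

# Sketch: query the experimental wmcloud OVMS instance via the OpenAI SDK.
# Assumes an OpenAI-compatible API under /api/v1 and bearer-token API key auth;
# the model name is an illustrative placeholder.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://ovms.wmcloud.org/api/v1",
    api_key=os.environ["OVMS_API_KEY"],  # the key issued for this instance
)

resp = client.chat.completions.create(
    model="OpenVINO/gemma-2b-it-int4-ov",  # one of the models listed in the script above
    max_tokens=100,
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(resp.choices[0].message.content)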

Change #1153988 abandoned by Santhosh:

[operations/docker-images/production-images@master] Add Openvino modelserver

Reason:

Abandoning as this is old and outdated and there is no plan from ML team on this front.

https://gerrit.wikimedia.org/r/1153988

I am abandoning the patches as they are outdated and I do not plan to keep them updated with the changes mentioned above.

Change #1151645 abandoned by Santhosh:

[machinelearning/liftwing/inference-services@main] WIP: Openvino model server integration

Reason:

Abandoning as this is old and outdated and there is no plan from ML team on this front.

https://gerrit.wikimedia.org/r/1151645

santhosh changed the task status from Open to Stalled. Oct 6 2025, 6:42 AM

Marking as stalled for now.

I am closing this ticket as it is not moving forward and I am not planning to work on this stream in the immediate future.