Background
OpenVINO is an open-source toolkit developed by Intel for optimizing and deploying deep learning models across various hardware platforms, including Intel CPUs, GPUs, and other accelerators. In my recent experiments, I found that Int8- or Int4-quantized OpenVINO models are functional with reasonable speed on Intel CPUs. The models I tried in these experiments include Gemma 3, Phi-4, Phi-3, DeepSeek R1 Distill, Qwen 3.5 and Qwen 4. I ran the experiments on my development laptop (ThinkPad X1 Carbon) and on stat1010.eqiad.wmnet. I posted screencasts of these experiments at https://asciinema.org/a/Kp9WyRrXajzoNdLgNdZOTCFdI
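For context, a minimal sketch of what such an experiment looks like in code, loading an int8-quantized model on CPU via the optimum-intel library (the model ID matches the one proposed below; the prompt and generation length are illustrative):

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

# The repo already contains int8 OpenVINO IR weights, so no conversion step is needed.
model_id = "OpenVINO/Phi-4-mini-instruct-int8-ov"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)  # runs on CPU by default

# Illustrative prompt; the generation length is arbitrary.
inputs = tokenizer("What is Wikipedia?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```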
Proposal
I propose we host an OpenVINO model on Intel CPUs using the Lift Wing infrastructure. This will tell us:
- The baseline performance and usability of models with the OpenVINO and Intel CPU setup
- The parameters, such as CPU cores and RAM, that influence performance
- The latency and throughput characteristics
- The use cases we can address with this setup if everything goes well
The model I would like to use for this initial experiment is https://huggingface.co/OpenVINO/Phi-4-mini-instruct-int8-ov:
- Original model: Phi-4-mini-instruct
- License: MIT
- Model creator: Microsoft
- Quantization: Int8
- Announcement blog post: https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/
- Capabilities: Instruction following, chat, tool calling, tokenization
- Supported languages: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
Technical plan
The model can be used directly with the OpenVINO and Optimum Python libraries, but that alone won't expose its capabilities as a service; we need to expose them in a generic way (APIs). OpenVINO Model Server is the recommended way to host these models. Once a model repository is configured with the model server, it exposes a REST API compatible with the OpenAI APIs. This API can be integrated directly into applications, or KServe can act as a proxy. There is also the https://github.com/vllm-project/vllm-openvino project, but it seems quite new at this point.
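As a sketch of that integration surface, a client call against the OpenAI-compatible endpoint might look like the following (the host, port, endpoint path, and served model name are assumptions about a local test deployment, not the final Lift Wing setup):

```python
import requests

# Assumed local test deployment of OpenVINO Model Server; the host, port,
# endpoint path, and served model name below are illustrative assumptions.
url = "http://localhost:8000/v3/chat/completions"
payload = {
    "model": "Phi-4-mini-instruct-int8-ov",
    "messages": [
        {"role": "user", "content": "Summarise what OpenVINO Model Server does."}
    ],
    "max_tokens": 128,
}
response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```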
Steps
- Prepare a container image with OpenVINO Model Server, the model repository, and its configuration
- Test it and get it uploaded to https://docker-registry.wikimedia.org
- Deploy to Lift Wing
- Prepare the initial k8s configuration; roughly 8 CPUs and 16 GB of RAM are expected
- Measure the performance (see the benchmarking sketch after this list)
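For the last step, a rough benchmarking sketch against the same assumed endpoint as above; the request count, prompt, and token budget are placeholders to be tuned for the real deployment:

```python
import statistics
import time

import requests

# Hypothetical benchmark; sequential requests measure single-client behaviour only.
URL = "http://localhost:8000/v3/chat/completions"
PAYLOAD = {
    "model": "Phi-4-mini-instruct-int8-ov",
    "messages": [{"role": "user", "content": "Explain quantization in one paragraph."}],
    "max_tokens": 128,
}

latencies = []
completion_tokens = 0
for _ in range(10):
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)
    # Assumes the server reports OpenAI-style usage accounting in the response.
    completion_tokens += resp.json()["usage"]["completion_tokens"]

print(f"median latency: {statistics.median(latencies):.2f} s")
print(f"throughput: {completion_tokens / sum(latencies):.1f} tokens/s")
```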