If we can take advantage of model quantization (CPU-optimized), we can improve efficiency and reduce the resources and cost of hosting large language models (LLMs). This will also allow us to move toward batch inference, which model train/test workflows will greatly benefit from. Quantized models will additionally give the community non-GPU access to our large models.
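As a rough illustration, here is a minimal sketch of CPU-oriented int8 quantization using PyTorch's dynamic quantization API; the toy model, layer sizes, and batch size are placeholders, not our actual checkpoints or serving setup:

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the large model's checkpoint.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Quantize Linear layers to int8 for CPU inference; weights are stored as
# int8 and dequantized on the fly, cutting weight memory roughly 4x vs fp32.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Batched CPU inference: one forward pass over a batch of requests,
# which is the batch-inference pattern mentioned above.
batch = torch.randn(32, 4096)
with torch.inference_mode():
    out = quantized(batch)
print(out.shape)  # torch.Size([32, 4096])
```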
- A quantized model will still utilize GPUs; however, we will be able to host multiple models per GPU
- From ML: in production we will still host one model per GPU to ensure request-response serving
- A quantized model will allow inference on CPU, and we will be able to optimize inference for a specific architecture (e.g., memory-mapped weight files; see the sketch after this list). This can be slower than GPU inference, but we gain better scalability
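To make the memory-mapping point concrete, below is a minimal sketch using NumPy's memmap; the file name, weight shape, and raw-fp32 layout are assumptions for illustration, not our actual on-disk format:

```python
import numpy as np

WEIGHTS_PATH = "model_weights.bin"  # hypothetical dump of one weight matrix
SHAPE = (4096, 4096)

# Write a dummy weight file once so the example is self-contained.
np.random.rand(*SHAPE).astype(np.float32).tofile(WEIGHTS_PATH)

# mmap the file: pages are faulted in on demand, so startup cost and resident
# memory stay low, and multiple worker processes can share the same pages.
weights = np.memmap(WEIGHTS_PATH, dtype=np.float32, mode="r", shape=SHAPE)

# Inference-style access touches only the pages it actually needs.
x = np.random.rand(SHAPE[0]).astype(np.float32)
y = x @ weights  # matrix-vector product against memory-mapped weights
print(y.shape)  # (4096,)
```

The design point is that the OS page cache, not the process, owns the weights: many CPU workers can serve the same quantized model while paying for one copy in memory, which is where the scalability win over per-process GPU residency comes from.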