
Model quantization (research infra)
Open, Needs Triage, Public

Description

If we can take advantage of model quantization (CPU-optimized), we will be able to improve efficiency and reduce the resources and cost needed to host large models (LLMs). This will also let us move towards batch inference, which model training and testing will greatly benefit from. Quantized models would additionally give the community non-GPU access to our large models.
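As a minimal sketch of what this could look like (assuming a PyTorch + Hugging Face Transformers stack, which is an illustration rather than our actual serving setup), dynamic int8 quantization converts a model's linear-layer weights so inference can run on CPU with a smaller memory footprint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "distilbert-base-uncased"  # small placeholder, stands in for a larger model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# Weights of nn.Linear modules become int8; activations stay fp32 and are
# quantized on the fly at runtime ("dynamic" quantization, CPU-oriented).
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Model quantization test input", return_tensors="pt")
with torch.no_grad():
    hidden = quantized(**inputs).last_hidden_state
print(hidden.shape)  # (batch, sequence length, hidden size)
```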

  1. A quantized model will still utilize GPUs; however, we will be able to host multiple models per GPU.
    1. From ML: in production, we will still host one model per GPU to ensure request-response performance.
  2. A quantized model will allow inference on CPU, and we will be able to optimize inference for a specific architecture (e.g. memory-mapped files); see the sketch below. This can be slower than GPU inference, but we gain better scalability.
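To illustrate point 2, here is a hedged sketch of CPU inference over memory-mapped quantized weights, assuming the llama-cpp-python bindings and a quantized GGUF file; the path, model, and parameters are placeholders, not artifacts we actually host:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/example-7b-q4_k_m.gguf",  # hypothetical quantized file
    n_ctx=2048,     # context window
    n_threads=8,    # CPU threads; tune to the host architecture
    use_mmap=True,  # weights are memory-mapped: loaded lazily and shareable
                    # read-only across worker processes on the same host
)

out = llm("Summarize why quantization helps CPU inference.", max_tokens=64)
print(out["choices"][0]["text"])
```

Because the weights are memory-mapped, multiple inference workers on one machine can share a single read-only copy, which is what makes the CPU path scale horizontally even though a single request is slower than on GPU.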