If we can take advantage of model quantization (CPU-optimized), we can improve efficiency and reduce the resources and cost of hosting large language models (LLMs). This will also allow us to move toward batch inference, which model train/test workflows will greatly benefit from. Quantized models will additionally give the community non-GPU access to our large models.
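As a rough illustration, here is a minimal sketch of CPU-oriented int8 quantization using PyTorch's dynamic quantization API; the toy model, layer sizes, and batch size are placeholders, not our actual checkpoints or serving setup:

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the large model's checkpoint.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Quantize Linear layers to int8 for CPU inference; weights are stored as
# int8 and dequantized on the fly, cutting weight memory roughly 4x vs fp32.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Batched CPU inference: one forward pass over a batch of requests,
# which is the batch-inference pattern mentioned above.
batch = torch.randn(32, 4096)
with torch.inference_mode():
    out = quantized(batch)
print(out.shape)  # torch.Size([32, 4096])
```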
- A quantized model will still utilize GPUs; however, we will be able to host multiple models per GPU
- From ML: in production we will still host one model per GPU to ensure request-response serving
- A quantized model will allow inference on CPU, and we will be able to optimize inference for a specific architecture (e.g., memory-mapped weight files; see the sketch after this list). This can be slower than GPU inference, but we gain better scalability
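To make the memory-mapping point concrete, below is a minimal sketch using NumPy's memmap; the file name, weight shape, and raw-fp32 layout are assumptions for illustration, not our actual on-disk format:

```python
import numpy as np

WEIGHTS_PATH = "model_weights.bin"  # hypothetical dump of one weight matrix
SHAPE = (4096, 4096)

# Write a dummy weight file once so the example is self-contained.
np.random.rand(*SHAPE).astype(np.float32).tofile(WEIGHTS_PATH)

# mmap the file: pages are faulted in on demand, so startup cost and resident
# memory stay low, and multiple worker processes can share the same pages.
weights = np.memmap(WEIGHTS_PATH, dtype=np.float32, mode="r", shape=SHAPE)

# Inference-style access touches only the pages it actually needs.
x = np.random.rand(SHAPE[0]).astype(np.float32)
y = x @ weights  # matrix-vector product against memory-mapped weights
print(y.shape)  # (4096,)
```

The design point is that the OS page cache, not the process, owns the weights: many CPU workers can serve the same quantized model while paying for one copy in memory, which is where the scalability win over per-process GPU residency comes from.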