If we can take advantage of model quantization (optimized for CPU), we will be able to improve efficiency and reduce the resources and cost of hosting large models (LLMs). This will also let us move towards batch inference, which model train/test workflows will greatly benefit from. Quantized models will give the community non-GPU access to our large models.
Description
Details
- Due Date
- Sep 30 2024, 4:00 AM
Related Objects
Event Timeline
Update: Building on the work we did in this direction last quarter, I've been experimenting with applying post-training quantization techniques to the reference-need model (version 0), which is a fine-tuned bert-base-multilingual-cased. These experiments focus on optimizing inference on CPU while making sure that the model's accuracy, precision, recall and F1 score stay the same.
Two interesting results from these experiments are:
- Latency of the model optimized and quantized using ONNX Runtime: this involved converting the original PyTorch model to ONNX, applying graph optimizations (redundant-node elimination and operator fusion), and applying 8-bit linear dynamic quantization, where the quantization parameters for weights are computed ahead of time and those for activations are computed during inference. Quantization is also done per-channel, i.e., with separate quantization parameters for each slice along one dimension of a tensor (higher accuracy at the cost of more memory), and it leverages the AVX512-VNNI instruction set. The resulting model has a p95 latency of 0.142 secs, ~2.4x faster than the full-precision model, without any drop in performance.
- Performance of the model quantized using Intel Neural Compressor accuracy-aware tuning: this involved using Intel NC to produce an 8-bit dynamically quantized model optimized for Intel CPUs. An evaluation function and dataset, along with an acceptable accuracy loss, were also passed in so that only a quantized model meeting these criteria would be returned (within 20 trials). What's unintuitive here is that this model has considerably better recall and F1 score than the original model on the same evaluation dataset, so we'd likely need to test with additional data and/or perform cross-validation to know for sure what's going on here.
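For intuition, the accuracy-aware tuning described above boils down to a search loop: try candidate quantization configs and keep the first one whose evaluated accuracy stays within the allowed loss, giving up after a trial budget. Below is a simplified, purely illustrative pure-Python sketch of that idea; all names are hypothetical, and Intel Neural Compressor's actual tuner is far more sophisticated.

```python
# Simplified sketch of accuracy-aware tuning: try candidate quantization
# configs and return the first quantized model whose evaluation metric stays
# within the allowed relative loss, giving up after max_trials.
# All names here are hypothetical illustrations, not the INC API.

def accuracy_aware_tune(model, quantize_fn, candidate_configs,
                        eval_fn, relative_loss=0.01, max_trials=20):
    baseline = eval_fn(model)
    for trial, config in enumerate(candidate_configs):
        if trial >= max_trials:
            break
        q_model = quantize_fn(model, config)
        acc = eval_fn(q_model)
        # Accept if the accuracy drop is within the allowed relative loss.
        if (baseline - acc) / baseline <= relative_loss:
            return q_model
    return None  # no candidate met the criterion

# Toy usage: "models" are dicts and quantization rounds weights to a grid.
def toy_quantize(model, step):
    return {"w": [round(v / step) * step for v in model["w"]], "step": step}

def toy_eval(model):
    target = [0.11, -0.27, 0.52]
    err = sum(abs(a - b) for a, b in zip(model["w"], target)) / len(target)
    return 1.0 - err  # higher is better

fp32 = {"w": [0.11, -0.27, 0.52]}
tuned = accuracy_aware_tune(fp32, toy_quantize, [0.5, 0.1, 0.01],
                            toy_eval, relative_loss=0.01)
print("accepted step size:", tuned["step"])
```

In the toy run, the coarse step sizes (0.5, 0.1) are rejected for losing too much accuracy and the fine one (0.01) is accepted, mirroring how the tuner only returns a model that fits the criterion.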
Code and comparisons for these experiments can be found here.
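To make the per-channel point above concrete, here is a minimal pure-Python sketch of the 8-bit linear (affine) quantization arithmetic. The real work happens inside ONNX Runtime's quantizer; the helper names and toy weights below are invented for illustration.

```python
def quant_params(values, qmin=-128, qmax=127):
    """Scale and zero-point for 8-bit linear (affine) quantization."""
    lo = min(min(values), 0.0)   # the range must contain zero so that
    hi = max(max(values), 0.0)   # zero is exactly representable
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    return [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]

def dequantize(qvalues, scale, zero_point):
    return [(q - zero_point) * scale for q in qvalues]

# Toy weight matrix with two output channels of very different magnitude.
weights = [[0.01, -0.02, 0.015],   # small-magnitude channel
           [1.5, -2.0, 1.8]]       # large-magnitude channel

# Per-tensor: one (scale, zero_point) pair shared by the whole matrix.
flat = [w for row in weights for w in row]
s, z = quant_params(flat)
per_tensor = [dequantize(quantize(row, s, z), s, z) for row in weights]

# Per-channel: a separate pair per output channel -- more parameters to
# store, but much finer resolution for the small-magnitude channel.
per_channel = []
for row in weights:
    cs, cz = quant_params(row)
    per_channel.append(dequantize(quantize(row, cs, cz), cs, cz))

def mean_abs_err(orig, recon):
    diffs = [abs(a - b) for ro, rr in zip(orig, recon) for a, b in zip(ro, rr)]
    return sum(diffs) / len(diffs)

print("per-tensor reconstruction error: ", mean_abs_err(weights, per_tensor))
print("per-channel reconstruction error:", mean_abs_err(weights, per_channel))
```

The per-channel variant reconstructs the small-magnitude channel far more accurately, which is the accuracy-for-memory trade-off mentioned above.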
Note: The latest version of reference-need is based on a model different from bert-base-multilingual-cased, and @Aitolkyn has graciously agreed to re-run these experiments with this new version to see if we get similar results.
I re-ran our latest reference-need model on a test set of 15K sentences. Our currently deployed model uses distilbert-base-multilingual-cased with torch dynamic quantization (column 2, "torch", in the plots below).
Some of the observations here:
- Quantization using ONNX Runtime still shows an almost 2x improvement in latency for min, median, and p95, while max latency is better with torch quantization.
- Evaluation metrics are comparable across all models, i.e., the f1-score is in the range 0.703-0.705 for all models.
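For reference, the min/median/p95/max latency columns in the plots can be reproduced from raw per-sentence timings along these lines. This is only a sketch: the timing values below are invented for illustration, not the measured numbers.

```python
import statistics

def latency_summary(samples_secs):
    """Summarise per-request latencies: min, median, p95, max."""
    xs = sorted(samples_secs)
    def pct(p):
        # nearest-rank percentile over the sorted samples
        k = round(p / 100 * (len(xs) - 1))
        return xs[max(0, min(len(xs) - 1, k))]
    return {"min": xs[0], "median": statistics.median(xs),
            "p95": pct(95), "max": xs[-1]}

# Invented per-sentence timings (seconds) for two model variants.
quantized = [0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14]
fp32      = [0.14, 0.16, 0.19, 0.21, 0.24, 0.26, 0.28, 0.30, 0.32, 0.34]

for name, xs in (("onnx-int8", quantized), ("fp32", fp32)):
    print(name, latency_summary(xs))

speedup = latency_summary(fp32)["p95"] / latency_summary(quantized)["p95"]
print("p95 speedup:", round(speedup, 2))
```

Comparing the summaries per model variant is enough to reproduce the min/median/p95 speedups discussed above, and the max column shows why a single outlier request can flip which variant "wins" on max latency.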
@MunizaA I'm moving this task to in-progress b/c I'm closing the first quarter's lane. Please take the appropriate next steps with it.
Mark as resolved for Q1 deliverables.
Deliverables:
- various techniques for quantization
- sample notebooks
- performance metrics
Next step:
- apply learnings from this quarter to LLM for Q2
