- Goal: Measurements for running LLM inference using the WMF GPU infrastructure.
- Deliverables:
- Report in the form of a jupyter notebook with tables / plots that defines and quantifies the limitations of the GPUs available for LLM inference at WMF.
- Summary document with the main takeaways and recommendations for next steps
- Infrastructure: the ml-lab instances that each have two AMD Instinct MI210 GPUs
- The LLM model families evaluated: Llama / Aya / Mistral.
- Determine the largest model that can be run and use it for the benchmark, along with one smaller model from the same family.
- Llama 8B → 70B might be too large a gap (un-quantized)
- The task: a simulation of a classification task
- The measurements to collect:
- Inference latency - time to generate a full response (depends on use case)
- Throughput - output tokens generated per second
- Concurrent requests - how many users can be served simultaneously, and at what cost to inference speed
- Configuration options used for experiments, aka knobs to turn
- Input length passed to the model (depends on context size of model)
- Output length (e.g. shorter classification output like article categories vs longer text generation like article outlines)
- Batch size - number of input sequences processed in parallel (e.g. for concurrent requests, or for offline inference)
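The relationship between these measurements can be pinned down with a toy sketch (`generate_fn` here is a hypothetical stand-in for one autoregressive decode step of a model, not part of the harness):

```python
import time

def measure(generate_fn, max_new_tokens):
    """Toy latency/throughput measurement. generate_fn is a hypothetical
    stand-in for one autoregressive decode step of a model."""
    start = time.perf_counter()
    for _ in range(max_new_tokens):
        generate_fn()  # one decode step -> one output token
    latency = time.perf_counter() - start  # time to generate the full response
    throughput = max_new_tokens / latency  # output tokens generated per second
    return latency, throughput

# Dummy "model" step that takes ~1 ms per token.
latency, throughput = measure(lambda: time.sleep(0.001), max_new_tokens=10)
```

Batching multiplies the numerator (tokens produced per step) at some cost to per-step time, which is exactly the trade-off the concurrent-requests measurement is after.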
Details
- Due Date: Nov 8 2024, 12:00 AM
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | diego | T377159 [SDS 1.2.1 B] Test existing AI models for internal use-cases |
| Resolved | | MunizaA | T377498 Phase 2: Article categorization metrics, fine-tuning metrics, optimization tooling |
| Resolved | | MunizaA | T377496 Phase 1: LLM inference - base metrics |
Event Timeline
We discussed this task in today's backlog refinement meeting and set the deadline (based on the original time estimate given by the team). Moving to In progress as work is in progress.
Updates:
- Added a basic benchmark harness for measuring latency and throughput of causal decoder-only models:
- Works only with HuggingFace transformers CausalLM models at the moment, but is written to be extensible to new backends.
- Allows fiddling with input size, batch size, output size, and model-specific parameters like the pretrained model name or path, whether to load the model on GPU or CPU, etc. Measures latency and throughput in tokens/second.
- It currently works by passing an input of batch_size * sequence_length tokens to the model and initially generating a single token to measure the prefill-phase latency (i.e. the time it takes to initialize the key-value cache with the input tokens and use it to generate the first token). Throughput is then measured from the time taken to autoregressively generate max_new_tokens afterwards, i.e. the decode phase.
- Example:
```python
from llmperf.benchmark import Benchmark
from llmperf.llms.huggingface import HFLM, HFConfig

llm = HFLM(
    config=HFConfig(
        model_name="mistralai/Mistral-7B-v0.1",
        device_map="cuda",
    )
)
metrics = Benchmark(
    llm=llm,
    batch_size=1,
    sequence_length=512,
    max_new_tokens=256,
    warmup_iterations=3,
).run()
print(metrics)
```
Next steps:
- Add options for optimizing latency or memory usage for HF transformers to the harness: quantization, cpu offloading, limiting memory usage per gpu, distributed inference.
- Add some metrics related to memory usage.
- Add support for more backends. Possible options: llama-cpp python bindings, Intel neural compressor, vllm.
- Add a CLI and make it easy to pass the multitude of model parameters.
- Thoroughly test the harness.
- Experiment with other interesting stuff like this Accelerate utility for finding the right batch size.
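That Accelerate utility (`accelerate.utils.find_executable_batch_size`) retries a function with a halved batch size whenever it hits an out-of-memory error. The pattern behind it, sketched in pure Python with a hypothetical `run` function standing in for a model call:

```python
def find_executable_batch_size(run_fn, starting_batch_size=64):
    """Halve the batch size until run_fn succeeds, mimicking the pattern
    behind Accelerate's find_executable_batch_size decorator."""
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            return batch_size, run_fn(batch_size)
        except MemoryError:  # real code catches the framework's OOM error
            batch_size //= 2
    raise RuntimeError("No executable batch size found")

# Hypothetical run function that "fits" only at batch sizes <= 16.
def run(bs):
    if bs > 16:
        raise MemoryError
    return f"ran with batch size {bs}"

bs, result = find_executable_batch_size(run)
```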
Updates:
- Added support for VRAM and RAM monitoring to the benchmark harness. A monitor polls either rocm-smi (for VRAM) or the process running the benchmarks (for RAM) at configurable intervals, writes the memory usage to a CSV file with timestamps, and returns min and peak usage at the end of the monitoring window. You can also configure which blocks of code are monitored and which GPU(s) to consider.
- Extended the HuggingFace implementation to support offloading model layers to the CPU and/or limiting memory usage per GPU. This means you can load a model using, for example, half the VRAM it would normally take. However, inference is slower since the model layers are then sequentially onloaded to the GPU as needed.
- The harness now also supports testing quantized versions of a model. Supported methods include AWQ, GPTQ, bitsandbytes, and Quanto. GPTQ yielded promising results during tests (up to 3-4x higher throughput) but requires pre-quantized weights. bitsandbytes is not tested yet as it only supports ROCm 6.x (the stat machines are on ROCm 5.6).
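The polling monitor described above can be sketched with the stdlib alone; `sample_fn` here is a hypothetical stand-in for parsing `rocm-smi` output (VRAM) or reading the benchmark process's RSS (RAM):

```python
import csv
import threading
import time

class MemoryMonitor:
    """Poll a sampler at a fixed interval in a background thread, log
    timestamped samples to CSV, and report min/peak usage at the end of
    the monitoring window."""

    def __init__(self, sample_fn, interval=0.01, csv_path="memory.csv"):
        self.sample_fn = sample_fn
        self.interval = interval
        self.csv_path = csv_path
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll)

    def _poll(self):
        while not self._stop.is_set():
            self.samples.append((time.time(), self.sample_fn()))
            self._stop.wait(self.interval)

    def __enter__(self):  # the with-block is the monitored code region
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        with open(self.csv_path, "w", newline="") as f:
            csv.writer(f).writerows(self.samples)

    def min_peak(self):
        usage = [u for _, u in self.samples]
        return min(usage), max(usage)

# Dummy sampler standing in for real VRAM/RAM readings (MiB).
readings = [100, 250, 180]
state = {"i": 0}

def fake_sampler():
    value = readings[min(state["i"], len(readings) - 1)]
    state["i"] += 1
    return value

with MemoryMonitor(fake_sampler) as monitor:
    time.sleep(0.05)  # the monitored block of code goes here

low, peak = monitor.min_peak()
```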
I'm preparing a report that exercises all the currently available knobs on the stat machine infra and compares metrics. Will share it here shortly.
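For reference, the per-GPU memory capping and CPU offloading described above correspond roughly to the `device_map`/`max_memory` mechanism in transformers (backed by Accelerate); the model name and memory caps below are illustrative, not the harness's actual configuration:

```python
from transformers import AutoModelForCausalLM

# Illustrative fragment (not from the harness): with device_map="auto" and
# a max_memory cap, layers that don't fit under the GPU budget are placed
# on the CPU and onloaded to the GPU as needed during the forward pass.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",
    device_map="auto",
    max_memory={0: "32GiB", "cpu": "200GiB"},  # hypothetical budgets
    torch_dtype="auto",
)
```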
Updates:
- Experiments can now be configured using yaml files. The location of these files can be passed to the cli and experiments can be filtered using shell-style wildcards.
- Added an isolated runner so that each experiment is launched in a separate process (sequentially) and resources such as vram are cleaned up between subsequent runs, when multiple experiments are passed to the cli.
- Started running experiments on ml-labs. These experiments use a single GPU, fix the sequence length at 8192 tokens, and generate exactly 512 new tokens. The variables are the model family (Llama 3.1, and Mixtral 8x7B and 8x22B), number of parameters, batch size, torch dtype, and attention implementation. For models that don't fit on a single GPU, like Llama 3.1 70B, additional variables are how much memory (VRAM and RAM) to allocate to model weights when dispatching the model across GPU and CPU. So far I've run experiments for Llama 3.1 8B and Llama 3.1 70B (half precision, single GPU with CPU offloading). The bottleneck is downloading the model weights from the HF hub (tops out at 10 Mbps), and inference with CPU offloading is very slow (0.17 tokens/s for the experiment mentioned above), so experiments can take a long time to finish.
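The experiment selection and isolation described above can be sketched with the stdlib: shell-style wildcard filtering via `fnmatch`, and each experiment launched sequentially in a fresh interpreter so VRAM and other state cannot leak between runs. The experiment names and bodies here are hypothetical stand-ins for the real YAML configs:

```python
import fnmatch
import subprocess
import sys

# Hypothetical experiments; in the harness these come from YAML files.
experiments = {
    "llama-3.1-8b-bf16-bs1": "print('ran llama-3.1-8b-bf16-bs1')",
    "llama-3.1-70b-bf16-bs1": "print('ran llama-3.1-70b-bf16-bs1')",
    "mixtral-8x7b-bf16-bs1": "print('ran mixtral-8x7b-bf16-bs1')",
}

def run_matching(pattern):
    """Filter experiments with a shell-style wildcard and run each match
    in a separate process so resources are reclaimed between runs."""
    results = []
    for name, body in experiments.items():
        if fnmatch.fnmatch(name, pattern):
            proc = subprocess.run(
                [sys.executable, "-c", body],
                capture_output=True, text=True,
            )
            results.append((name, proc.stdout.strip()))
    return results

results = run_matching("llama-*")  # selects the two Llama experiments
```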
Next steps:
- Run experiments for the Mixtral models.
- Analyze reported metrics across experiments and summarize results.
Wonderful, thank you @MunizaA! Just a note that the Mixtral models @Trokhymovych is experimenting with are 8x7B and 7B (not 8x22B). See: https://phabricator.wikimedia.org/T377425#10302406
Also question: inference with models that use cpu offloading are mainly the larger ones (eg. LLama 3.1 70B), correct?
Thanks so much!
Thanks Miriam! I've added Mistral-7B-Instruct-v0.3 to the experiments.
> Also question: inference with models that use cpu offloading are mainly the larger ones (eg. LLama 3.1 70B), correct?
That's correct. We're only experimenting with CPU offloading for models that don't fit on a single GPU (i.e. 64 GiB VRAM). On our list, that's Meta-Llama-3.1-70B-Instruct and Mixtral-8x7B-Instruct-v0.1.
Updates:
- Finished running basic experiments for all models and added a notebook analyzing the results. Some observations from the analysis (see the notebook for more detailed metrics, experiment setup and comparison charts):
- The fastest model is (not unexpectedly) the smallest model on the list: mistralai/Mistral-7B-Instruct-v0.3 with eager attention at 16.414 tokens/sec and 35.454 secs overall latency. The slowest model is the meta-llama/Llama-3.1-70B-Instruct at 0.127 tokens/sec and 4103.839 secs overall latency.
- Eager vs SDPA attention:
- Throughput: For smaller models, decoding throughput is slightly lower with SDPA than with eager attention. For larger models, where we use CPU offloading, throughput with SDPA is almost 2x higher.
- VRAM usage: For smaller models, SDPA can result in as much as 2.7x lower peak VRAM usage. For larger models, things are a little more complicated: the way we load these models is by allocating a portion of the VRAM to model weights; then, during inference, layers are onloaded from RAM as needed. Eager attention tends to cause large spikes in VRAM usage, so we can only allocate a very small portion of the VRAM upfront to account for these. On the flip side, this means that apart from these spikes, a large portion of the memory can go unused. VRAM usage with SDPA is much more stable, so we can allocate almost all of the 64 GiB upfront, resulting in less onloading and offloading and thus faster decoding. SDPA is supposed to be more memory efficient, but these spikes are still strange and I need to investigate some more here.
- For CPU usage, when running model.generate, the method used for autoregressive decoding, only a single core is used, pegged at almost 100%. I found this comment from a transformers maintainer that suggests that this is the python code that orchestrates instructions on the GPU, "not optimized in many segments of the model forward pass".
- Tried building flash-attention-2 from source for rocm on ML-Labs and ran into a bunch of issues. I reported these to ML during the hands-on session held at the offsite.
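For reference, the eager-vs-SDPA comparison above toggles the attention backend that transformers selects at load time; the model name below is one from the experiments, but the fragment is illustrative rather than the harness's actual code:

```python
from transformers import AutoModelForCausalLM

# Illustrative fragment: pick the attention backend at load time.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    attn_implementation="sdpa",  # or "eager"
    torch_dtype="auto",
)
```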
Updates:
- Added support for text classification (in addition to text generation) to the harness, to be able to benchmark BERT and Longformer.
- Ran experiments to observe the effects of input vs output tokens and model size on latency.
- Ran experiments that simulate inference requests for the npov and peacock experiments with fixed input and output lengths and variable batch sizes as a rough proxy for concurrent requests.
- Added a quickstart guide to the llmperf readme.
- For phase 2:
- Built Flash Attention 2, GPTQ, AWQ, bitsandbytes, and DeepSpeed for ROCm on ML Labs.
- Ran experiments comparing latency for original and quantized models.
- Ran experiments using tensor parallelism to study the effects of better hardware on latency.
- Next steps include: finishing up remaining work for phase 2 (i.e. fine tuning) and adding and summarizing results from all experiments so far to the report appendix.