We want a simple way to produce plots that show how input and output token counts affect latency. For example, the heatmap below shows this for Llama 3.1 8B, comparing latency across a range of input and output token sizes for the base model and its AWQ- and GPTQ-quantized versions.
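As a rough illustration of the kind of plot described above, here is a minimal sketch that averages raw timing samples into an input-size by output-size grid and renders it as a heatmap. It is not taken from any of the tools under investigation. The function names `build_latency_grid` and `plot_latency_heatmap`, the `(input_tokens, output_tokens, latency_seconds)` sample layout, and the choice of matplotlib are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def build_latency_grid(samples):
    """Average latency per (input_tokens, output_tokens) cell.

    `samples` is an iterable of (input_tokens, output_tokens, latency_seconds)
    tuples; this layout is a hypothetical one chosen for the sketch.
    Returns (input_sizes, output_sizes, grid), where grid[i][j] is the mean
    latency for input_sizes[i] and output_sizes[j], or None if that
    combination was never measured.
    """
    cells = defaultdict(list)
    for n_in, n_out, latency in samples:
        cells[(n_in, n_out)].append(latency)

    input_sizes = sorted({key[0] for key in cells})
    output_sizes = sorted({key[1] for key in cells})
    grid = [
        [mean(cells[(i, o)]) if (i, o) in cells else None for o in output_sizes]
        for i in input_sizes
    ]
    return input_sizes, output_sizes, grid


def plot_latency_heatmap(input_sizes, output_sizes, grid, title, path):
    """Save the grid as a heatmap image (requires matplotlib)."""
    # Imported here so the aggregation step works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    # NaN marks unmeasured cells so they are left blank in the image.
    data = [[float("nan") if v is None else v for v in row] for row in grid]

    fig, ax = plt.subplots()
    image = ax.imshow(data, origin="lower", aspect="auto")
    ax.set_xticks(list(range(len(output_sizes))))
    ax.set_xticklabels([str(n) for n in output_sizes])
    ax.set_yticks(list(range(len(input_sizes))))
    ax.set_yticklabels([str(n) for n in input_sizes])
    ax.set_xlabel("Output tokens")
    ax.set_ylabel("Input tokens")
    ax.set_title(title)
    fig.colorbar(image, ax=ax, label="Latency (s)")
    fig.savefig(path)
    plt.close(fig)
```

To compare a base model against its quantized versions, one option would be to call `build_latency_grid` once per model on that model's own measurements and save one heatmap per model, keeping the same token sizes across runs so the cells line up.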
Tools we're investigating are:











