Training Quantization vs Serving Quantization
In the fine-tuning track we covered QLoRA — quantizing the base model so we could train on it with LoRA adapters attached. The frozen weights lived in 4-bit NF4 format to save memory, but the whole point was to enable fine-tuning: gradients flowed through the dequantized weights, LoRA matrices updated in FP16, and at the end we got a trained adapter. The quantization was a means to an end — the end being training.
Serving quantization is a different game entirely. We start with a finished model — already pre-trained, already fine-tuned, already evaluated — and we compress its weights so that inference is faster and cheaper. No training happens. No adapters. No gradients. We take the final checkpoint, quantize it once, and deploy the quantized version for serving. The question is no longer "can we fit training into this GPU?" but "can we serve this model faster and to more users?"
Why does compressing weights make inference faster? Recall from article 1 that the decode phase of autoregressive generation is memory-bandwidth-bound: generating each token requires reading the entire model's weights from GPU memory, but the arithmetic intensity is low (we're multiplying those weights by a single token's activations). The GPU's compute units spend most of their time waiting for data to arrive from HBM. If we halve the number of bytes per weight — say, from 16-bit floats to 8-bit integers — we halve the amount of data the GPU needs to read per token, and decode throughput roughly doubles. Going from FP16 to INT4 means a $4\times$ reduction in bytes, which translates to up to $4\times$ faster decode.
The word "up to" matters. The actual speedup depends on dequantization overhead (the GPU must convert INT4 weights back to FP16 before multiplying), kernel efficiency, and whether other bottlenecks (like KV-cache memory or CPU overhead) become dominant once the weight-loading bottleneck is relaxed. But the principle is clear: smaller weights mean less memory traffic, and less memory traffic means faster token generation.
Weight-Only Quantization
The simplest serving-quantization strategy is weight-only quantization: compress the model's weight matrices to low precision (INT8, INT4, or even INT3), but keep activations in their original FP16 or BF16 format. During each matrix multiplication, the quantized weights are dequantized on-the-fly back to FP16, multiplied by the FP16 activations, and the result is accumulated in FP16. The quantized weights are never used directly in arithmetic — they're a compact storage format that gets unpacked just before use.
Why quantize weights but not activations? Weights are fixed: they don't change from one input to the next. This means the quantization error we introduce is constant and predictable — every user's request hits the same approximation, and we can measure that error offline before deployment. Activations, on the other hand, vary with every input and are prone to outliers: certain hidden dimensions can spike to values 10–100$\times$ larger than the median, making uniform quantization of activations far more lossy. The outlier channels that Dettmers et al. documented in LLM.int8() show that even a few extreme activation values can destroy model quality if naively quantized. Weights are better-behaved — their distributions are roughly Gaussian and stable, making them easier to compress.
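A small numerical sketch (synthetic data, symmetric per-tensor INT8) of why a single activation outlier is so damaging:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int8(x):
    """Symmetric per-tensor INT8: one scale shared by the whole tensor."""
    scale = np.abs(x).max() / 127
    return np.clip(np.round(x / scale), -127, 127) * scale

acts = rng.normal(0.0, 1.0, 4096)               # well-behaved activations
err_clean = np.abs(acts - quantize_int8(acts)).mean()

acts_outlier = acts.copy()
acts_outlier[7] = 60.0                           # one outlier channel
err_outlier = np.abs(acts_outlier - quantize_int8(acts_outlier)).mean()

# the outlier inflates the shared scale, so every other value is
# rounded on a much coarser grid
print(f"mean error without outlier: {err_clean:.4f}, with outlier: {err_outlier:.4f}")
```

This is the failure mode LLM.int8() works around by handling the outlier dimensions separately in FP16.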
How much does weight-only quantization help? The effective memory bandwidth improvement is simply the ratio of original to quantized bit-widths:

$$\text{speedup}_{\text{ideal}} = \frac{b_{\text{original}}}{b_{\text{quantized}}}$$

where $b_{\text{original}}$ is the number of bits per weight in the original model and $b_{\text{quantized}}$ is the number of bits after quantization. For INT4 weights replacing FP16:

$$\text{speedup}_{\text{ideal}} = \frac{16}{4} = 4$$
That's $4\times$ less memory to read per token during decode. In the ideal case — where decode is purely memory-bandwidth-bound and the dequantization compute is negligible — this translates directly to $4\times$ higher decode throughput. In practice, the dequantization kernels add some computational overhead, and the prefill phase (which is compute-bound, not memory-bound) sees less benefit. But for the autoregressive decode phase that dominates wall-clock time in long-generation workloads, the improvement is substantial.
Let's look at the boundary cases. At $b_{\text{quantized}} = 16$ (no quantization), the ratio is 1 — no improvement. At $b_{\text{quantized}} = 8$ (INT8), we get $2\times$. At $b_{\text{quantized}} = 4$ (INT4), we get $4\times$. At $b_{\text{quantized}} = 2$ (INT2), we'd get $8\times$ — but 2-bit quantization typically destroys model quality for all but the smallest tasks. The sweet spot for large language models today is 4-bit, which provides a large bandwidth improvement with minimal quality degradation when done carefully.
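The boundary cases can be made concrete in a couple of lines:

```python
def bandwidth_speedup(b_original: float, b_quantized: float) -> float:
    """Ideal decode speedup: ratio of bytes read per token before vs after."""
    return b_original / b_quantized

# sweep the bit-widths discussed above, starting from FP16
for name, b_q in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {bandwidth_speedup(16, b_q):.0f}x")  # 1x, 2x, 4x, 8x
```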
GPTQ: Post-Training Quantization via Second-Order Information
Simply rounding each weight to the nearest INT4 value ("round-to-nearest" or RTN) works at 8 bits, but at 4 bits the accumulated rounding errors across billions of parameters noticeably degrade model quality. Can we do better? What if, when we quantize one weight, we adjust the remaining weights to compensate for the error we just introduced?
That's the insight behind GPTQ (Frantar et al., 2022), a post-training quantization method that uses second-order information (the Hessian matrix) to quantize weights one column at a time while adjusting the not-yet-quantized columns to minimise the overall output error. GPTQ builds on the Optimal Brain Quantization (OBQ) framework, but re-engineers it to scale to billion-parameter models by processing columns instead of individual weights and using a fixed column ordering rather than a greedy one.
The core idea works layer by layer. For a given linear layer with weight matrix $\mathbf{W}$ and a small set of calibration inputs $\mathbf{X}$ (typically 128 examples from a representative dataset), we want to find a quantized weight matrix $\hat{\mathbf{W}}$ that minimises the squared error in the layer's output:

$$\hat{\mathbf{W}} = \arg\min_{\hat{\mathbf{W}}} \left\lVert \mathbf{W}\mathbf{X} - \hat{\mathbf{W}}\mathbf{X} \right\rVert_2^2$$
If we quantize column $j$ of $\mathbf{W}$ and introduce an error $\delta_j = w_j - \hat{w}_j$ (the difference between the original and quantized weight for that column), we can compensate the remaining columns $j+1, \ldots, d$ by shifting them proportionally. The optimal compensation is given by the Hessian $\mathbf{H} = 2\mathbf{X}\mathbf{X}^\top$, which captures how changes in each weight affect the layer's output. Specifically, when we quantize column $j$, the update to the remaining un-quantized weights in the same row is:

$$\Delta w_{j+1:d} = -\frac{\delta_j}{[\mathbf{H}^{-1}]_{jj}} \, [\mathbf{H}^{-1}]_{j,\, j+1:d}$$
This update distributes the quantization error of column $j$ across the remaining columns in a way that minimises the total output error, weighted by how sensitive the output is to each weight (captured by $\mathbf{H}^{-1}$). The denominator $[\mathbf{H}^{-1}]_{jj}$ normalises by the self-sensitivity of the quantized weight. At one extreme, if the quantization error $\delta_j$ is zero (the weight happened to already be at a quantization grid point), no compensation is needed. At the other extreme, if $[\mathbf{H}^{-1}]_{jj}$ is very large (the output is not very sensitive to this weight), the compensation factors are small — the error doesn't matter much, so there's little to fix.
In practice, GPTQ processes 128 columns at a time in blocks (for better GPU utilisation), uses a dampening factor to stabilise the Hessian inverse, and runs a Cholesky decomposition for numerical stability. The typical configuration is INT4 with group size 128: the weights in each row are divided into groups of 128, and each group gets its own scale factor and zero-point. This means the quantization grid adapts to the local weight distribution within each group, rather than using a single scale for an entire row of potentially thousands of values. The overhead is storing one FP16 scale and one FP16 zero-point per 128 weights, which adds roughly 0.25 bits per weight — so "INT4 g128" is effectively about 4.25 bits per weight.
The result: GPTQ produces near-lossless 4-bit models for most architectures (Llama, Mistral, Phi, etc.), with perplexity increases of typically less than 0.1 on standard benchmarks. At 3-bit, quality degradation becomes noticeable — the 8 representable levels ($2^3 = 8$) are too few for many weight distributions, and even Hessian-based compensation can't fully recover the lost information. Quantization itself takes minutes to hours depending on model size (a one-time offline cost), but the resulting model serves at INT4 speed indefinitely.
AWQ: Activation-Aware Weight Quantization
GPTQ compensates for quantization error using second-order information, which works well but requires computing and inverting a Hessian matrix for every layer. Is there a simpler approach? What if we could figure out which weights matter most and give them more quantization precision, without the full Hessian machinery?
That's the approach taken by AWQ (Activation-Aware Weight Quantization) (Lin et al., 2023). The key observation is that not all weights are equally important for model quality. Some weight channels correspond to salient activation channels — hidden dimensions that consistently carry large activation magnitudes across diverse inputs. Roughly 1% of weight columns fall into this category, and quantizing them carelessly causes disproportionate quality loss. The other 99% of weights can be quantized aggressively with minimal impact.
How does AWQ identify these salient channels? By running a small calibration set through the model and observing which activation channels have the largest average magnitudes. If channel $j$ consistently produces large activations $\bar{a}_j = \mathbb{E}[|a_j|]$, the weights in column $j$ of the weight matrix are salient — errors in those weights get amplified by the large activations they multiply.
A naive fix would be to keep salient weights in FP16 while quantizing the rest to INT4. But mixed-precision storage breaks the uniform memory layout that GPU kernels need for efficient computation — you'd lose most of the speed benefit. Instead, AWQ applies a per-channel scaling trick. Before quantization, it multiplies the salient weight columns by a scale factor $s_j > 1$, effectively "zooming in" on their value range:

$$w_j' = w_j \cdot s_j, \qquad a_j' = a_j / s_j$$
The product $w_j \cdot a_j = w_j' \cdot a_j'$ is unchanged, so the model's output is mathematically identical before quantization. But now $w_j'$ occupies a wider numerical range, which means the INT4 quantization grid (which is spaced evenly across the range) allocates more of its 16 levels to the values that matter most. The scale factor is absorbed into the preceding layer's weights or normalisation parameters, so the activation division happens implicitly — no runtime overhead.
The optimal scale factor for each channel is found by a simple grid search that minimises the quantization error on the calibration set:

$$\mathbf{s}^* = \arg\min_{\mathbf{s}} \left\lVert Q\big(\mathbf{W} \cdot \mathrm{diag}(\mathbf{s})\big) \, \big(\mathrm{diag}(\mathbf{s})^{-1} \mathbf{X}\big) - \mathbf{W}\mathbf{X} \right\rVert$$
where $Q(\cdot)$ denotes the quantization operation (round to nearest INT4 grid point). At $s_j = 1$, we get standard quantization with no scaling — the baseline. As $s_j$ increases, the weight values spread across a wider range and get finer quantization granularity relative to the activation magnitude, reducing quantization error for that channel. But there's a trade-off: scaling up the weights also amplifies their absolute rounding error (the INT4 grid steps get bigger in absolute terms), so there's a sweet spot where the reduction in relative error outweighs the increase in absolute error. The grid search parameterises the scales as $s_j = \bar{a}_j^{\alpha}$ and sweeps a single exponent $\alpha \in [0, 1]$, so each channel's scale grows with its average activation magnitude $\bar{a}_j$.
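A minimal sketch of this recipe on synthetic data; the dimensions, the per-row symmetric INT4 quantizer, and the artificially inflated activation channel are illustrative choices, not AWQ's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 128, 64, 512
X = rng.normal(size=(d_in, n))          # calibration activations
X[5] *= 30.0                            # one salient activation channel
W = rng.normal(size=(d_out, d_in))

def quant_int4(W):
    """Per-row symmetric INT4, round-to-nearest."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7
    return np.clip(np.round(W / scale), -8, 7) * scale

a_bar = np.abs(X).mean(axis=1)          # mean |activation| per input channel

def awq_error(alpha):
    s = a_bar ** alpha                  # per-channel scale, s_j = a_bar_j ** alpha
    Wq = quant_int4(W * s)              # quantize the scaled weights...
    return np.linalg.norm(W @ X - (Wq / s) @ X)  # ...then fold the scale back out

# grid search over the exponent alpha in [0, 1]
alphas = np.linspace(0.0, 1.0, 21)
errors = [awq_error(a) for a in alphas]
best = alphas[int(np.argmin(errors))]
print(f"best alpha = {best:.2f}; error {min(errors):.3f} vs alpha=0 baseline {errors[0]:.3f}")
```

At $\alpha = 0$ every scale is 1 and we recover plain round-to-nearest; the grid search finds an intermediate exponent that protects the salient channel and lowers the output error.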
Compared to GPTQ, AWQ is simpler and faster to run (no Hessian computation or matrix inversions), and empirically matches or exceeds GPTQ quality on most benchmarks. Both methods are widely supported in serving frameworks: vLLM, TGI (Text Generation Inference), and TensorRT-LLM all have optimised kernels for both GPTQ and AWQ INT4 models.
GGUF and CPU/Hybrid Inference
GPTQ and AWQ are designed for GPU serving: they produce quantized weight matrices that are loaded into GPU memory and dequantized by custom CUDA kernels during inference. But what if you don't have a GPU — or don't have enough GPU memory to hold the full model? What if you want to run a 70B model on a MacBook with 64 GB of unified memory, or on a server with a small GPU supplemented by system RAM?
That's the niche filled by the GGUF format and the llama.cpp ecosystem (Gerganov et al.). GGUF (GGML Universal Format) is a file format purpose-built for quantized model storage and efficient CPU inference. It packages the model's architecture metadata, tokenizer, and quantized weights into a single file that llama.cpp (and tools built on it, like Ollama, LM Studio, and koboldcpp) can load and run with no external dependencies — no Python, no PyTorch, no CUDA toolkit required.
GGUF supports a rich variety of quantization levels, each identified by a short code:
- Q2_K: 2-bit quantization with k-quant structure. Very aggressive compression, noticeable quality loss. Useful only when memory is extremely tight.
- Q3_K_S / Q3_K_M / Q3_K_L: 3-bit with small/medium/large group sizes. Marginal quality, but fits very large models into limited RAM.
- Q4_K_S / Q4_K_M: 4-bit small/medium. The most popular choice — good balance of quality and compression. Q4_K_M uses slightly more bits for important layers.
- Q5_K_S / Q5_K_M: 5-bit. Near-FP16 quality for most tasks, at roughly 3$\times$ compression.
- Q6_K: 6-bit. Very close to FP16 quality, useful when you have enough memory and want minimal degradation.
- Q8_0: 8-bit. Essentially lossless for all practical purposes, at 2$\times$ compression.
The "K" in these names stands for k-quants — a quantization scheme that assigns different precisions to different layers within the same model. The insight is that not all layers are equally sensitive to quantization. Attention layers and the first/last layers of the network tend to be more sensitive than the feed-forward layers in the middle. K-quants assign more bits (higher precision) to sensitive layers and fewer bits to robust ones, achieving better quality at the same average bit-width than uniform quantization. For example, Q4_K_M might use 6-bit quantization for the attention Q/K projections and first/last layers, while using 4-bit for the bulk of the FFN layers.
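As a back-of-the-envelope illustration of how a mixed allocation averages out (the fractions and bit-widths below are made up for the example, not llama.cpp's actual allocation tables):

```python
# Hypothetical per-tensor bit allocation in the spirit of Q4_K_M.
# The fractions and bit-widths are illustrative, NOT llama.cpp's real tables.
alloc = {
    "attention + first/last layers": (0.25, 6.0),  # sensitive -> more bits
    "FFN bulk":                      (0.70, 4.0),  # robust -> fewer bits
    "embeddings / output head":      (0.05, 6.0),
}
avg_bits = sum(frac * bits for frac, bits in alloc.values())
size_gb = 70e9 * avg_bits / 8 / 1e9   # weight bytes for a 70B-parameter model
print(f"average {avg_bits:.2f} bits/weight -> ~{size_gb:.0f} GB for 70B weights")
```

With this made-up mix the average lands near the ~4.5 bits/weight and ~40 GB figures quoted for 70B Q4_K_M models.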
The most important difference between GGUF and GPTQ/AWQ is the target hardware. GPTQ and AWQ models are served by GPU-optimised runtimes (vLLM, TGI, TensorRT-LLM) that need the entire model in GPU memory. GGUF models are designed for CPU inference and CPU+GPU hybrid inference. In hybrid mode, llama.cpp places as many layers as will fit on the GPU, and the remaining layers run on the CPU using system RAM. This makes it possible to run models that exceed GPU memory — a 70B Q4_K_M model at roughly 40 GB can run on a system with 24 GB of VRAM plus 64 GB of system RAM, with the GPU handling the layers it can fit and the CPU handling the rest.
The trade-off is speed. CPU inference is much slower than GPU inference for large models — system RAM bandwidth (DDR5 at ~50-80 GB/s) is far lower than HBM bandwidth (A100 at ~2 TB/s). But for many use cases — local development, privacy-sensitive deployments, edge devices, or simply experimenting with models you can't afford to serve on GPUs — GGUF makes large language models accessible on hardware that would otherwise be unable to run them at all.
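Those bandwidth figures give a quick roofline-style ceiling on decode speed (a sketch that assumes batch size 1 and ignores KV-cache traffic and compute):

```python
def decode_tok_per_s(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Roofline upper bound: each decoded token reads every weight once,
    so token rate <= memory bandwidth / model size."""
    return bandwidth_bytes_per_s / model_bytes

model = 40e9  # ~70B model at Q4_K_M
print(f"DDR5 @ 64 GB/s : {decode_tok_per_s(model, 64e9):.1f} tok/s ceiling")
print(f"HBM  @ 2 TB/s  : {decode_tok_per_s(model, 2e12):.1f} tok/s ceiling")
```

The roughly 30x gap between the two ceilings is the bandwidth gap itself, which is why hybrid offloading slows down in proportion to how many layers land on the CPU.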
Choosing the Right Quantization
With multiple quantization methods and formats available, how do you choose? The decision comes down to three factors: where you're serving (GPU, CPU, or hybrid), how much memory you have, and how much quality degradation you can tolerate. The table below summarises the trade-offs.
# Build the comparison table and hand it to the page's table renderer.
# (Runs in Pyodide: the `js` module bridges Python to the browser's JavaScript.)
import json, js
rows = [
["GPTQ INT4 (g128)", "4.25", "GPU", "Near-lossless", "Fast (vLLM, TGI)", "GPU serving at scale"],
["AWQ INT4", "4.25", "GPU", "Near-lossless", "Fast (vLLM, TGI)", "GPU serving at scale"],
["GPTQ INT3 (g128)", "3.25", "GPU", "Noticeable loss", "Fast", "Max compression (GPU)"],
["GGUF Q4_K_M", "~4.5", "CPU / hybrid", "Good", "Moderate", "Local / laptop / Ollama"],
["GGUF Q5_K_M", "~5.5", "CPU / hybrid", "Very good", "Moderate", "Local, quality-sensitive"],
["GGUF Q8_0", "8.0", "CPU / hybrid", "Lossless*", "Slower (more bytes)", "Max quality, enough RAM"],
["FP16 (no quant)", "16.0", "GPU", "Baseline", "Memory-bound", "When memory allows"],
["INT8 (LLM.int8)", "8.0", "GPU", "Lossless*", "~1.5x vs FP16", "Simple, safe compression"],
]
js.window.py_table_data = json.dumps({
"headers": ["Method", "Bits/weight", "Hardware", "Quality", "Speed", "Use case"],
"rows": rows
})
print("* 'Lossless' means perplexity increase < 0.01 on standard benchmarks.")
print(" Actual quality depends on the model and task.")
Some rules of thumb for common scenarios:
- GPU serving at scale (vLLM, TGI, TensorRT-LLM): use AWQ or GPTQ INT4 with group size 128. Both are well-supported by production serving frameworks, give near-lossless quality, and provide close to the full $4\times$ memory bandwidth improvement (the per-group scales add a small overhead). AWQ is often slightly easier to produce and marginally better on quality; GPTQ has a longer track record and wider tool support. Either works.
- Local / laptop / edge (llama.cpp, Ollama, LM Studio): use GGUF Q4_K_M for the best quality-per-byte, or Q5_K_M if you have the RAM and want higher quality. Q4_K_M is the community's default recommendation for a reason — it hits the sweet spot where quality is still strong and the model fits in reasonable memory.
- Maximum quality, memory is not a constraint: serve in FP16 or BF16. If you want some compression without measurable quality loss, INT8 (via LLM.int8() or FP8 on Hopper GPUs) is effectively free.
- Maximum compression, quality is secondary: GPTQ 3-bit or GGUF Q3_K_M. Expect noticeable degradation on reasoning and knowledge-intensive tasks, but acceptable for simpler generation tasks or when the model is very large relative to available memory.
One final point: quantization composes with every other optimisation in this track. You can serve an AWQ INT4 model with grouped-query attention (GQA), PagedAttention for KV-cache management, continuous batching to maximise GPU utilisation, and speculative decoding to reduce latency — all at the same time. These techniques are orthogonal: quantization reduces the bytes per weight, GQA reduces the bytes per KV-cache entry, continuous batching amortises fixed overhead across requests, and speculative decoding trades extra compute for fewer sequential decode steps. Stacking them is how production serving systems achieve the throughput and latency numbers that make large-model inference economically viable.
Quiz
Test your understanding of quantization for serving.
Why does weight-only INT4 quantization speed up the decode phase of autoregressive generation?
What does GPTQ use the Hessian matrix for during quantization?
How does AWQ handle the ~1% of salient weight channels instead of keeping them in higher precision?
What is the key advantage of GGUF k-quants over uniform quantization?