Notes on Post-Training Quantization
I spent a few days trying to understand the methods in PTQ. The landscape was harder to navigate than I expected, so here is what I learned.
Post-training quantization (PTQ) reduces model size and speeds up inference by converting high-precision weights and/or activations to low-bit integers — with minimal or no retraining.
1. What is Quantization?
Quantization maps high-precision floating-point values to integer representations in N bits using:
- Scale \(s\): the step size between quantization levels
- Zero-point \(z\): used in asymmetric quantization to shift the range; set to 0 for symmetric quantization
The standard formula:
\[ x_q = \left\lfloor \frac{x}{s} \right\rceil + z \]
To recover (dequantize):
\[ \hat{x} = s \cdot (x_q - z) \]
Dequantization is required before operations like softmax, LayerNorm, or nonlinear activations.
Note: Typically, quantization uses uniform bins, but NF4 (NormalFloat-4), introduced by QLoRA, uses non-uniform bins whose boundaries are chosen so that normally distributed weights fall into each bin with equal probability.
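To make the round trip concrete, here is a minimal NumPy sketch of the quantize/dequantize formulas above (the helper names are mine, not from any library):

```python
import numpy as np

def quantize(x, n_bits=8, symmetric=True):
    """Map float values to n-bit integers using the scale/zero-point formula."""
    if symmetric:
        # Symmetric (absmax): zero-point is 0, scale covers [-absmax, absmax].
        qmax = 2 ** (n_bits - 1) - 1
        scale = np.abs(x).max() / qmax
        zero_point = 0
        x_q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    else:
        # Asymmetric: scale covers [min, max], zero-point shifts the grid.
        qmin, qmax = 0, 2 ** n_bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = int(np.round(-x.min() / scale))
        x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return x_q.astype(np.int32), scale, zero_point

def dequantize(x_q, scale, zero_point):
    return scale * (x_q - zero_point)

w = np.random.randn(4, 8).astype(np.float32)
w_q, s, z = quantize(w, n_bits=8, symmetric=False)
w_hat = dequantize(w_q, s, z)
print("max reconstruction error:", np.abs(w - w_hat).max())
```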
2. What Can Be Quantized?
Component | Description |
---|---|
Weights | Common in all PTQ setups |
Weights + Activations | Supports full INT inference but requires offline calibration |
Granularity:
Because of outliers, fine-grained quantization uses the limited integer range much more efficiently: a single large value only inflates the scale of its own group instead of the whole tensor (see the sketch after this list).
- Tensor-wise: one scale/zero-point per tensor
- Row-wise / Channel-wise: one scale/zero-point per row or channel; better precision
- Block-wise: fixed-size blocks of weights (e.g., 32 weights per block); common in GGUF and GPTQ
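A quick sketch (my own toy helper, not a library API) of why granularity matters when an outlier is present: with tensor-wise scaling, one large value inflates the scale for every weight, while row-wise scaling confines the damage to a single row.

```python
import numpy as np

def quantize_absmax(x, n_bits, axis=None):
    """Symmetric absmax quantization; axis=None -> one scale per tensor,
    axis=1 -> one scale per row (keepdims so broadcasting works)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    x_q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return scale * x_q  # return the dequantized tensor so we can measure error

w = np.random.randn(64, 64).astype(np.float32)
w[0, 0] = 50.0  # a single outlier

for name, axis in [("tensor-wise", None), ("row-wise", 1)]:
    err = np.abs(w - quantize_absmax(w, n_bits=4, axis=axis)).mean()
    print(f"{name:12s} mean abs error: {err:.4f}")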
3. Methods
A. Weights-Only Quantization
Used when:
- Storage/memory reduction is primary
- Full-INT computation is unnecessary; inference kernels dequantize the weights on the fly and fuse this with the matmul (a conceptual sketch follows this list).
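Here is a rough NumPy sketch of that pattern, assuming symmetric 4-bit block quantization along the input dimension. For readability this version materializes the dequantized matrix; a real fused kernel dequantizes tiles in registers instead.

```python
import numpy as np

def weight_only_matmul(x, w_q, scales, block=32):
    """Conceptual W4A16-style product: activations stay in float, the int
    weights are dequantized block-by-block as part of the multiply."""
    w = np.empty(w_q.shape, dtype=np.float32)
    for start in range(0, w_q.shape[0], block):
        # one scale per block of input rows
        w[start:start + block] = w_q[start:start + block] * scales[start // block]
    return x @ w

in_dim, out_dim, block = 64, 16, 32
w = np.random.randn(in_dim, out_dim).astype(np.float32)
scales = np.array([np.abs(w[i:i + block]).max() / 7
                   for i in range(0, in_dim, block)], dtype=np.float32)
w_q = np.round(w / np.repeat(scales, block)[:, None]).clip(-8, 7).astype(np.int8)
x = np.random.randn(2, in_dim).astype(np.float32)
y = weight_only_matmul(x, w_q, scales, block)
print("output shape:", y.shape, "max error vs fp32:", np.abs(y - x @ w).max())
```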
Examples:
Method | Key Idea | Notes
---|---|---
GPTQ | Second-order (Hessian-based) minimization of quantization error | Uses block-wise quantization; highly accurate
GGML/GGUF | Block quantization with integer lookup tables | Used in Q4_0, Q5_1, etc.
NF4 + double quant | Non-uniform binning, two-level quantization | Matches the weight distribution better
GGUF formats
If the target bits are N:
- Q_N_0: uses the absmax (symmetric) method
- Q_N_1: uses the zero-point (asymmetric) method
- Q_N_K: "K-quants"; employs double quantization, i.e., the per-block scales are themselves quantized (sketched below)
- Q_N_K_S / Q_N_K_M: size variants in which critical layers are given higher precision, leading to mixed-precision inference
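As an illustration of the double-quantization idea (a sketch of the concept, not the actual GGUF or QLoRA storage layout), the per-block fp32 scales can themselves be quantized to 8 bits with a single second-level scale:

```python
import numpy as np

def double_quantize_scales(scales, n_bits=8):
    """Second-level quantization: the per-block fp32 scales are quantized to
    n-bit integers with a single fp32 'scale of scales'."""
    qmax = 2 ** n_bits - 1
    s2 = scales.max() / qmax                      # one float for the whole group of scales
    scales_q = np.round(scales / s2).clip(0, qmax).astype(np.uint8)
    return scales_q, s2

# e.g. 1024 blocks of 32 weights -> 1024 fp32 scales to compress
scales = np.abs(np.random.randn(1024)).astype(np.float32)
scales_q, s2 = double_quantize_scales(scales)
recovered = scales_q.astype(np.float32) * s2
print("fp32 bytes:", scales.nbytes, "-> uint8 bytes:", scales_q.nbytes + 4)
print("relative error:", np.abs(recovered - scales).mean() / scales.mean())
```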
B. Quantizing Activations
Needed for full INT arithmetic (e.g., INT8 matmuls). Requires:
- Calibration on representative data (offline)
- Handling dynamic ranges
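A minimal calibration sketch, assuming activations have already been captured from a few representative batches (the function name is mine; using a percentile instead of the raw max is one common way to handle outlier-driven dynamic ranges):

```python
import numpy as np

def calibrate_scale(activation_batches, n_bits=8, percentile=99.9):
    """Offline calibration: pool activation statistics from representative
    data and derive one symmetric scale per tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    observed = np.concatenate([np.abs(a).ravel() for a in activation_batches])
    clip_val = np.percentile(observed, percentile)  # robust to rare outliers
    return clip_val / qmax

# Pretend these are activations captured from a few calibration batches.
batches = [np.random.randn(8, 128).astype(np.float32) for _ in range(10)]
scale = calibrate_scale(batches)
print("per-tensor activation scale:", scale)
```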
Examples:
Method | Key Idea
---|---
SmoothQuant | Migrate activation outliers into the weights via per-channel scaling
ZeroQuant | Group-wise weight quant + layer-by-layer distillation
LLM.int8() | Outlier-aware activation quant (outlier channels kept in higher precision)
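For a feel of how SmoothQuant's scale migration works, here is my own simplified NumPy sketch of the idea (not the official implementation): dividing the activations by a per-channel factor and multiplying the corresponding weight rows by the same factor leaves the product unchanged, but tames the activation outliers that make quantization hard.

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    """SmoothQuant-style smoothing: per input channel j,
        s_j = max|X_j|^alpha / max|W_j|^(1 - alpha),
    then X/s and W*s compute the same product, with milder activation outliers."""
    act_max = np.abs(x).max(axis=0)            # per-input-channel activation range
    w_max = np.abs(w).max(axis=1)              # per-input-channel weight range
    s = act_max ** alpha / (w_max ** (1 - alpha) + 1e-8)
    return x / s, w * s[:, None], s

x = np.random.randn(16, 64).astype(np.float32)
x[:, 3] *= 30.0                                # an outlier channel
w = np.random.randn(64, 32).astype(np.float32)
x_s, w_s, s = smooth(x, w)
print("same output:", np.allclose(x @ w, x_s @ w_s, atol=1e-3))
print("activation absmax before/after:", np.abs(x).max(), np.abs(x_s).max())
```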