Notes on Post-Training Quantization
I spent a few days trying to understand the methods in PTQ. The landscape was harder to navigate than I expected, so here is what I learned.
Post-training quantization (PTQ) reduces model size and speeds up inference by converting high-precision weights and/or activations to low-bit integers — with minimal or no retraining.
1. What is Quantization?
Quantization maps high-precision floating-point values to integer representations in N bits using:
- Scale \(s\): the step size between quantization levels
- Zero-point \(z\): used in asymmetric quantization to shift the range; set to 0 for symmetric quantization
The standard formula:
\[ x_q = \left\lfloor \frac{x}{s} \right\rceil + z \]
To recover (dequantize):
\[ \hat{x} = s \cdot (x_q - z) \]
Dequantization is required before operations like softmax, LayerNorm, or nonlinear activations.
Note: Typically, quantization uses uniform bins, but NF4 (NormalFloat-4), introduced by QLoRA, uses non-uniform bins whose boundaries are chosen so that normally distributed weights fall into each bin with equal probability.
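To make the round trip concrete, here is a minimal NumPy sketch of the quantize/dequantize formulas above (the helper names are mine, not from any library):

```python
import numpy as np

def quantize(x, n_bits=8, symmetric=True):
    """Map float values to n-bit integers using the scale/zero-point formula."""
    if symmetric:
        # Symmetric (absmax): zero-point is 0, scale covers [-absmax, absmax].
        qmax = 2 ** (n_bits - 1) - 1
        scale = np.abs(x).max() / qmax
        zero_point = 0
        x_q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    else:
        # Asymmetric: scale covers [min, max], zero-point shifts the grid.
        qmin, qmax = 0, 2 ** n_bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = int(np.round(-x.min() / scale))
        x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return x_q.astype(np.int32), scale, zero_point

def dequantize(x_q, scale, zero_point):
    return scale * (x_q - zero_point)

w = np.random.randn(4, 8).astype(np.float32)
w_q, s, z = quantize(w, n_bits=8, symmetric=False)
w_hat = dequantize(w_q, s, z)
print("max reconstruction error:", np.abs(w - w_hat).max())
```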
2. What Can Be Quantized?
Component | Description |
---|---|
Weights | Common in all PTQ setups |
Weights + Activations | Supports full INT inference but requires offline calibration |
Granularity:
Because of outliers, fine-grained quantization uses the limited integer range much more efficiently: a single large value only inflates the scale of its own group instead of the whole tensor (see the sketch after this list).
- Tensor-wise: one scale/zero-point per tensor
- Row-wise / Channel-wise: one scale/zero-point per row or channel; better precision
- Block-wise: fixed-size blocks of weights (e.g., 32 weights per block); common in GGUF and GPTQ
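A quick sketch (my own toy helper, not a library API) of why granularity matters when an outlier is present: with tensor-wise scaling, one large value inflates the scale for every weight, while row-wise scaling confines the damage to a single row.

```python
import numpy as np

def quantize_absmax(x, n_bits, axis=None):
    """Symmetric absmax quantization; axis=None -> one scale per tensor,
    axis=1 -> one scale per row (keepdims so broadcasting works)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    x_q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return scale * x_q  # return the dequantized tensor so we can measure error

w = np.random.randn(64, 64).astype(np.float32)
w[0, 0] = 50.0  # a single outlier

for name, axis in [("tensor-wise", None), ("row-wise", 1)]:
    err = np.abs(w - quantize_absmax(w, n_bits=4, axis=axis)).mean()
    print(f"{name:12s} mean abs error: {err:.4f}")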
3. Methods
A. Weights-Only Quantization
Used when:
- Storage/memory reduction is primary
- Full-INT computation is unnecessary; inference kernels dequantize the weights on the fly and fuse this with the matmul (a conceptual sketch follows this list).
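Here is a rough NumPy sketch of that pattern, assuming symmetric 4-bit block quantization along the input dimension. For readability this version materializes the dequantized matrix; a real fused kernel dequantizes tiles in registers instead.

```python
import numpy as np

def weight_only_matmul(x, w_q, scales, block=32):
    """Conceptual W4A16-style product: activations stay in float, the int
    weights are dequantized block-by-block as part of the multiply."""
    w = np.empty(w_q.shape, dtype=np.float32)
    for start in range(0, w_q.shape[0], block):
        # one scale per block of input rows
        w[start:start + block] = w_q[start:start + block] * scales[start // block]
    return x @ w

in_dim, out_dim, block = 64, 16, 32
w = np.random.randn(in_dim, out_dim).astype(np.float32)
scales = np.array([np.abs(w[i:i + block]).max() / 7
                   for i in range(0, in_dim, block)], dtype=np.float32)
w_q = np.round(w / np.repeat(scales, block)[:, None]).clip(-8, 7).astype(np.int8)
x = np.random.randn(2, in_dim).astype(np.float32)
y = weight_only_matmul(x, w_q, scales, block)
print("output shape:", y.shape, "max error vs fp32:", np.abs(y - x @ w).max())
```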
Examples:
Method | Key Idea | Notes
---|---|---
GPTQ | Second-order (Hessian-based) minimization of quantization error | Uses block-wise quantization; highly accurate
GGML/GGUF | Block quantization with integer lookup tables | Used in Q4_0, Q5_1, etc.
NF4 + double quant | Non-uniform binning, two-level quantization | Matches the weight distribution better
GGUF formats
If the target bits are N:
- Q_N_0: uses the absmax (symmetric) method
- Q_N_1: uses the zero-point (asymmetric) method
- Q_N_K: "K-quants"; employs double quantization, i.e., the per-block scales are themselves quantized (sketched below)
- Q_N_K_S / Q_N_K_M: size variants in which critical layers are given higher precision, leading to mixed-precision inference
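As an illustration of the double-quantization idea (a sketch of the concept, not the actual GGUF or QLoRA storage layout), the per-block fp32 scales can themselves be quantized to 8 bits with a single second-level scale:

```python
import numpy as np

def double_quantize_scales(scales, n_bits=8):
    """Second-level quantization: the per-block fp32 scales are quantized to
    n-bit integers with a single fp32 'scale of scales'."""
    qmax = 2 ** n_bits - 1
    s2 = scales.max() / qmax                      # one float for the whole group of scales
    scales_q = np.round(scales / s2).clip(0, qmax).astype(np.uint8)
    return scales_q, s2

# e.g. 1024 blocks of 32 weights -> 1024 fp32 scales to compress
scales = np.abs(np.random.randn(1024)).astype(np.float32)
scales_q, s2 = double_quantize_scales(scales)
recovered = scales_q.astype(np.float32) * s2
print("fp32 bytes:", scales.nbytes, "-> uint8 bytes:", scales_q.nbytes + 4)
print("relative error:", np.abs(recovered - scales).mean() / scales.mean())
```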
B. Quantizing Activations
Needed for full INT arithmetic (e.g., INT8 matmuls). Requires:
- Calibration on representative data (offline)
- Handling dynamic ranges
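A minimal calibration sketch, assuming activations have already been captured from a few representative batches (the function name is mine; using a percentile instead of the raw max is one common way to handle outlier-driven dynamic ranges):

```python
import numpy as np

def calibrate_scale(activation_batches, n_bits=8, percentile=99.9):
    """Offline calibration: pool activation statistics from representative
    data and derive one symmetric scale per tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    observed = np.concatenate([np.abs(a).ravel() for a in activation_batches])
    clip_val = np.percentile(observed, percentile)  # robust to rare outliers
    return clip_val / qmax

# Pretend these are activations captured from a few calibration batches.
batches = [np.random.randn(8, 128).astype(np.float32) for _ in range(10)]
scale = calibrate_scale(batches)
print("per-tensor activation scale:", scale)
```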
Examples:
Method | Key Idea
---|---
SmoothQuant | Migrate activation outliers into the weights via per-channel scaling
ZeroQuant | Group-wise weight quant + layer-by-layer distillation
LLM.int8() | Outlier-aware activation quant (outlier channels kept in higher precision)
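For a feel of how SmoothQuant's scale migration works, here is my own simplified NumPy sketch of the idea (not the official implementation): dividing the activations by a per-channel factor and multiplying the corresponding weight rows by the same factor leaves the product unchanged, but tames the activation outliers that make quantization hard.

```python
import numpy as np

def smooth(x, w, alpha=0.5):
    """SmoothQuant-style smoothing: per input channel j,
        s_j = max|X_j|^alpha / max|W_j|^(1 - alpha),
    then X/s and W*s compute the same product, with milder activation outliers."""
    act_max = np.abs(x).max(axis=0)            # per-input-channel activation range
    w_max = np.abs(w).max(axis=1)              # per-input-channel weight range
    s = act_max ** alpha / (w_max ** (1 - alpha) + 1e-8)
    return x / s, w * s[:, None], s

x = np.random.randn(16, 64).astype(np.float32)
x[:, 3] *= 30.0                                # an outlier channel
w = np.random.randn(64, 32).astype(np.float32)
x_s, w_s, s = smooth(x, w)
print("same output:", np.allclose(x @ w, x_s @ w_s, atol=1e-3))
print("activation absmax before/after:", np.abs(x).max(), np.abs(x_s).max())
```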