Notes on Post-Training Quantization

I really wanted to understand the methods behind PTQ, so I spent a few days digging in. The landscape was harder to map out than I expected. Here is what I learned.

June 21, 2025

Post-training quantization (PTQ) reduces model size and speeds up inference by converting high-precision weights and/or activations to low-bit integers — with minimal or no retraining.

1. What is Quantization?

Quantization maps high-precision floating-point values to N-bit integer representations using two parameters:

  • Scale \(s\): the step size that maps the floating-point range onto the integer grid
  • Zero-point \(z\): the integer that represents 0.0; used in asymmetric quantization to shift the range, and fixed to 0 for symmetric quantization

The standard formula:

\[ x_q = \left\lfloor \frac{x}{s} \right\rceil + z \]

To recover:

\[ \hat{x} = s \cdot (x_q - z) \]

Dequantization is required before operations like softmax, LayerNorm, or nonlinear activations.
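To make the two formulas concrete, here is a minimal sketch of asymmetric 8-bit quantization and dequantization in NumPy. The function names and the round-to-nearest choice are mine for illustration, not from any particular library.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Asymmetric quantization: map floats onto the integer grid [0, 2^n - 1]."""
    qmin, qmax = 0, 2**n_bits - 1
    s = (x.max() - x.min()) / (qmax - qmin)             # scale
    z = int(np.round(qmin - x.min() / s))               # zero-point
    x_q = np.clip(np.round(x / s) + z, qmin, qmax).astype(np.int32)
    return x_q, s, z

def dequantize(x_q, s, z):
    """Recover an approximation of the original floats."""
    return s * (x_q.astype(np.float32) - z)

x = np.random.randn(16).astype(np.float32)
x_q, s, z = quantize(x)
x_hat = dequantize(x_q, s, z)
print(np.abs(x - x_hat).max())   # quantization error, roughly bounded by s / 2
```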

Note: Quantization typically uses uniform bins, but NF-4 (NormalFloat-4), introduced by QLoRA, uses non-uniform bins: each 4-bit integer maps to a specific floating-point value chosen to match a normal weight distribution. You can explore the implementation details here and review the derivation in this note.
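To illustrate non-uniform binning without reproducing the actual NF-4 table (whose 16 levels are derived from quantiles of a standard normal distribution), here is a sketch of a generic codebook quantizer. The `codebook` below is a made-up placeholder that is merely denser near zero, not the real NF-4 levels.

```python
import numpy as np

# Placeholder 16-entry codebook, denser near zero; the real NF-4 levels
# come from normal-distribution quantiles (see the QLoRA paper).
lin = np.linspace(-1.0, 1.0, 16, dtype=np.float32)
codebook = np.sign(lin) * lin**2

def quantize_nonuniform(x):
    """Map each value to the index of its nearest codebook entry."""
    scale = np.abs(x).max()                   # normalize the block to [-1, 1]
    idx = np.abs(x[:, None] / scale - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale        # 4-bit codes plus one fp scale

def dequantize_nonuniform(idx, scale):
    """Look the codes back up and rescale."""
    return codebook[idx] * scale

x = np.random.randn(64).astype(np.float32)
idx, scale = quantize_nonuniform(x)
x_hat = dequantize_nonuniform(idx, scale)
```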

2. What Can Be Quantized?

| Component | Description |
| --- | --- |
| Weights | Common in all PTQ setups |
| Weights + Activations | Supports full INT inference but requires offline calibration |

Granularity:

Because of outliers, finer-grained quantization makes better use of the available integer bits: a single large value only inflates the scale of its own group rather than of the entire tensor.

  • Tensor-wise: one scale/zero-point per tensor
  • Row-wise / Channel-wise: one set of parameters per row or channel; better precision than tensor-wise
  • Block-wise: one scale per block of \(N\) consecutive weights; common in GGUF and GPTQ (e.g., 32 weights per block), as sketched below
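The following sketch contrasts a single tensor-wise absmax scale with block-wise scales (block size 32, as in the GGUF example above); the helper names are illustrative.

```python
import numpy as np

def absmax_quantize(x, n_bits=4):
    """Symmetric (absmax) quantization with a single scale for the whole tensor."""
    qmax = 2**(n_bits - 1) - 1
    s = np.abs(x).max() / qmax
    return np.clip(np.round(x / s), -qmax, qmax).astype(np.int8), s

def blockwise_quantize(x, block=32, n_bits=4):
    """Symmetric quantization with one scale per block of `block` consecutive weights."""
    qmax = 2**(n_bits - 1) - 1
    blocks = x.reshape(-1, block)
    s = np.abs(blocks).max(axis=1, keepdims=True) / qmax   # one scale per block
    q = np.clip(np.round(blocks / s), -qmax, qmax).astype(np.int8)
    return q, s.squeeze(1)

w = np.random.randn(1024).astype(np.float32)
w[7] = 40.0                        # a single outlier
q_t, s_t = absmax_quantize(w)      # the outlier inflates the scale for every weight
q_b, s_b = blockwise_quantize(w)   # only the outlier's own block is affected
```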

3. Methods

A. Weights-Only Quantization

Used when:

  • Storage/memory reduction is the primary goal
  • Full integer computation is unnecessary; inference kernels fuse weight dequantization into the matrix multiplication (see the sketch below)
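Conceptually, a weights-only kernel keeps the weights in integer form and dequantizes each block just before multiplying, so only the low-bit weights and their scales sit in memory. The NumPy loop below is a sketch of that data flow under assumed shapes, not of any real fused kernel.

```python
import numpy as np

def weights_only_matvec(q_w, scales, x, block=32):
    """y = W @ x with W stored as int8 plus per-block fp32 scales.

    q_w:    (rows, cols) int8 quantized weights
    scales: (rows, cols // block) per-block scales
    x:      (cols,) fp32 activations, kept in full precision
    """
    rows, cols = q_w.shape
    y = np.zeros(rows, dtype=np.float32)
    for b in range(cols // block):
        cols_b = slice(b * block, (b + 1) * block)
        # dequantize one block of weights on the fly, then accumulate
        w_block = q_w[:, cols_b].astype(np.float32) * scales[:, b:b + 1]
        y += w_block @ x[cols_b]
    return y
```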

Examples:

| Method | Key Idea | Notes |
| --- | --- | --- |
| GPTQ | Second-order optimization of the quantization error | Uses block-wise quantization; highly accurate |
| GGML/GGUF | Block quantization with int lookup tables | In Q4_0, Q5_1, etc. |
| NF4 + double quant | Non-uniform binning, two-level quantization | Matches the weight distribution better |

GGUF formats

If the target bit-width is N:

  • Q_N_0: uses the absmax (symmetric) method
  • Q_N_1: uses the zero-point (asymmetric) method
  • Q_N_K: employs double quantization, i.e., the block scales themselves are quantized (sketched below); if N=4, uses NF-4
  • Q_N_M: critical layers are given higher precision, leading to mixed-precision inference
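A rough sketch of the double-quant idea behind the K formats: the per-block fp32 scales themselves get quantized in groups (super-blocks), with one second-level scale per group. The block and super-block sizes below are illustrative, not the exact GGUF layout.

```python
import numpy as np

def double_quantize_scales(scales, super_block=8, n_bits=6):
    """Quantize per-block scales in super-blocks to shrink the metadata."""
    qmax = 2**n_bits - 1
    groups = scales.reshape(-1, super_block)
    d = groups.max(axis=1, keepdims=True) / qmax        # one fp32 scale per super-block
    q_scales = np.round(groups / d).astype(np.uint8)    # n_bits-wide codes for the scales
    return q_scales, d.squeeze(1)

# per-block absmax scales from a first-level 4-bit quantization (illustrative values)
scales = np.abs(np.random.randn(64)).astype(np.float32)
q_scales, d = double_quantize_scales(scales)
recovered_scales = q_scales * d[:, None]                # dequantized scales
```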

B. Quantizing Activations

Needed for full INT arithmetic (e.g., INT8 matmuls). Requires:

  • Calibration on representative data (offline) to estimate activation ranges (see the sketch after this list)
  • Handling the wide dynamic ranges and outliers of activations
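A minimal sketch of offline calibration: run a few representative batches through the model, record each activation tensor's observed range, and turn that range into a fixed scale/zero-point for inference. The observer class and names are mine for illustration.

```python
import numpy as np

class ActRangeObserver:
    """Track the min/max of one layer's activations over a calibration set."""
    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf

    def update(self, act):
        self.lo = min(self.lo, float(act.min()))
        self.hi = max(self.hi, float(act.max()))

    def qparams(self, n_bits=8):
        """Turn the observed range into an asymmetric scale and zero-point."""
        qmax = 2**n_bits - 1
        s = (self.hi - self.lo) / qmax
        z = int(round(-self.lo / s))
        return s, z

obs = ActRangeObserver()
for batch in np.random.randn(10, 4, 64).astype(np.float32):   # stand-in calibration data
    obs.update(batch)     # in practice: the layer's real activations on calibration batches
s, z = obs.qparams()
```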

Examples:

| Method | Key Idea |
| --- | --- |
| SmoothQuant | Shifts quantization difficulty from activations to weights via per-channel scaling (sketched below) |
| ZeroQuant | Layer-wise group quantization + distillation |
| LLM.int8() | Outlier-aware activation quantization |
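The SmoothQuant idea can be sketched as a per-channel rescaling computed offline: divide the activations by a smoothing factor and multiply the corresponding weight rows by the same factor, which leaves the product unchanged but flattens the activations' outlier channels so both sides quantize with less error. The toy statistics and the default \( \alpha = 0.5 \) below are illustrative.

```python
import numpy as np

def smoothing_factors(x_absmax, w_absmax, alpha=0.5):
    """Per-channel factors s_j = |x_j|^alpha / |w_j|^(1 - alpha)."""
    return x_absmax**alpha / w_absmax**(1 - alpha)

# toy per-channel statistics for a linear layer y = x @ W
x_absmax = np.array([50.0, 0.5, 3.0])     # activations with one large outlier channel
w_absmax = np.array([0.2, 0.3, 0.25])     # weights are well behaved
s = smoothing_factors(x_absmax, w_absmax)

x = np.random.randn(4, 3) * x_absmax
W = np.random.randn(3, 2) * 0.3
y_ref = x @ W
# x / s and W * s (rows scaled) compute the same product, but both factors
# now have flatter ranges and quantize more accurately
y_smoothed = (x / s) @ (W * s[:, None])
assert np.allclose(y_ref, y_smoothed)
```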