Dynamic Quantization

Store the weights as 8-bit integers and decide the activation scale on the fly.

Key Insight

Quantization stores a model's weights as low-precision integers such as int8 instead of 32-bit floats, together with a scale factor that maps the integers back to real values. Dynamic quantization quantizes the weights ahead of time but computes the scale for each layer's activations at runtime, from the values it actually sees, just before the matmul.
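To make the split between "ahead of time" and "at runtime" concrete, here is a minimal pure-Python sketch of one linear layer. It assumes symmetric per-tensor quantization (a single scale per tensor, no zero point), which is only one of several possible schemes. The names `symmetric_scale`, `quantize`, and `DynamicQuantLinear` are hypothetical and chosen for illustration; they are not taken from any library.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest absolute value onto qmax (assumed scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round each value to the nearest integer step and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


class DynamicQuantLinear:
    """Hypothetical linear layer: int8 weights fixed up front, activations quantized per call."""

    def __init__(self, weight):
        # Ahead of time: one scale for the whole weight matrix, then store integers.
        self.w_scale = symmetric_scale([w for row in weight for w in row])
        self.q_weight = [quantize(row, self.w_scale) for row in weight]

    def forward(self, x):
        # At runtime: the activation scale depends on this particular input.
        x_scale = symmetric_scale(x)
        q_x = quantize(x, x_scale)
        out = []
        for q_row in self.q_weight:
            # Integer multiply-accumulate, then one float rescale per output.
            acc = sum(qw * qx for qw, qx in zip(q_row, q_x))
            out.append(acc * self.w_scale * x_scale)
        return out


weight = [[0.5, -1.0], [0.25, 0.75]]
layer = DynamicQuantLinear(weight)
x = [2.0, -4.0]
approx = layer.forward(x)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
print("quantized:", approx)
print("float:    ", exact)
```

The two printed vectors should be close but not identical; the gap is the rounding error introduced by quantization. Real implementations typically use optimized integer kernels and may choose per-channel scales or a zero point instead.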

Why This Matters

int8 weights take a quarter of the memory of 32-bit floats, and integer matmuls run faster on many CPUs, which helps most with the large linear layers in an LLM. Measuring the quality drop tells you whether the memory and speed gains are worth it.
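Both halves of that trade-off can be put into numbers. The sketch below computes the weight storage for a hypothetical 4096 x 4096 linear layer (the size is an assumption for illustration, and the few bytes needed for scale factors are ignored) and defines a relative-error metric as one simple proxy for quality drop. The helper names `weight_bytes` and `relative_error` are hypothetical; in practice a task metric such as perplexity or accuracy is the more meaningful check.

```python
import math


def weight_bytes(num_params, bits_per_weight):
    """Storage for the weights alone, ignoring scale factors and other metadata."""
    return num_params * bits_per_weight // 8


def relative_error(reference, candidate):
    """L2 norm of the difference divided by the L2 norm of the reference."""
    num = math.sqrt(sum((r - c) ** 2 for r, c in zip(reference, candidate)))
    den = math.sqrt(sum(r ** 2 for r in reference))
    return num / den if den > 0 else 0.0


# Hypothetical layer size, chosen only to illustrate the arithmetic.
num_params = 4096 * 4096
fp32_bytes = weight_bytes(num_params, 32)
int8_bytes = weight_bytes(num_params, 8)
print("fp32 MiB:", fp32_bytes / 2**20)
print("int8 MiB:", int8_bytes / 2**20)
print("ratio:", fp32_bytes / int8_bytes)
```

To compare outputs, pass the float layer's result as `reference` and the quantized layer's result as `candidate`; a value near zero means the quantized layer tracks the float one closely on that input.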
