Quantize a 7B Model

Smaller numbers, the same model, a fraction of the memory.

Key Insight

This project takes a 7B model and applies two quantization recipes — GPTQ to compress weights to INT8 and AWQ to compress them to 4-bit — then measures how much GPU memory is saved and how a few benchmark scores change.

Why This Matters

Shrinking weights from 16-bit down to 8- or 4-bit can fit a 7B model on a consumer GPU and read it through memory faster, which is the simplest way to cut both the cost and the latency of serving while usually paying only a small quality cost.

Key Insight​

Why This Matters​

Key Insight

Why This Matters