Triton Matmul


Matrix multiply is where the GPU lives or dies — tile it well and you can rival the vendor.


Key Insight

A fast matmul kernel works by tiling: it loads small blocks of each matrix into fast on-chip memory (shared memory and registers), multiplies them there, and reuses each loaded value many times before going back to slow global memory. Writing this in Triton and aiming for more than 50% of cuBLAS throughput shows why memory movement, not arithmetic, is the real cost.
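To make the reuse argument concrete, here is a minimal CPU-side sketch of tiling in plain Python. It is not a Triton kernel and says nothing about real GPU timings. The function name `tiled_matmul` and the `slow_loads` counter are hypothetical, introduced only for illustration: nested lists stand in for slow global memory, the small tile lists stand in for on-chip memory, and the counter records how many elements are copied from one to the other.

```python
def tiled_matmul(a, b, block):
    """Multiply a (M x K) by b (K x N) one block x block tile at a time.

    Returns (c, slow_loads). slow_loads counts elements copied from the
    full matrices (a stand-in for slow global memory) into tiles
    (a stand-in for fast on-chip memory). Illustrative model only.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    slow_loads = 0
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            i1, j1 = min(i0 + block, m), min(j0 + block, n)
            # The accumulator for one output tile stays "on chip" for the whole K loop.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, k, block):
                k1 = min(k0 + block, k)
                # Load one tile of each input from "slow" memory.
                a_tile = [row[k0:k1] for row in a[i0:i1]]
                b_tile = [row[j0:j1] for row in b[k0:k1]]
                slow_loads += (i1 - i0) * (k1 - k0) + (k1 - k0) * (j1 - j0)
                # Each loaded element is reused across a full row or column of the tile.
                for i in range(i1 - i0):
                    for j in range(j1 - j0):
                        s = acc[i][j]
                        for kk in range(k1 - k0):
                            s += a_tile[i][kk] * b_tile[kk][j]
                        acc[i][j] = s
            # Write the finished output tile back once.
            for i in range(i1 - i0):
                c[i0 + i][j0:j1] = acc[i]
    return c, slow_loads


size = 8
a = [[float(i + j) for j in range(size)] for i in range(size)]
b = [[float(i - j) for j in range(size)] for i in range(size)]

_, loads_untiled = tiled_matmul(a, b, block=1)  # no reuse: 2 * 8**3 = 1024 loads
_, loads_tiled = tiled_matmul(a, b, block=4)    # 4x4 tiles: 256 loads
print(loads_untiled, loads_tiled)               # prints "1024 256"
```

Both calls perform the same 8³ multiply-adds, but the 4×4 tiling cuts input traffic by a factor of 4. In this model the load count is 2·M·N·K / B for block size B (when B divides the dimensions), which is why larger tiles help until they no longer fit in on-chip memory.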

Why This Matters

Matrix multiplication dominates the runtime of almost every neural network, so understanding how a good matmul kernel is structured is the key to understanding GPU performance in general.