Mini FlashAttention
Never write the giant score matrix to memory — stream it through in tiles instead.
Key Insight
Standard attention builds a large T×T softmax score matrix in slow GPU memory (HBM). FlashAttention avoids this with tiling and an online softmax that updates the running result block by block, so the full matrix is never stored. Building a small version and checking it matches eager attention shows how the same math can cost far less memory.
Why This Matters
The single idea — do the same FLOPs but touch memory far less — is why FlashAttention made long-context transformers practical, and it is the canonical example of memory-aware kernel design.