CUDA Graphs for Decode
Record the dozens of tiny per-token kernels once, then replay them with almost no launch cost.
Key Insight
This project captures the sequence of tiny kernels that one decode step launches as a CUDA Graph, then replays that graph each step instead of launching the kernels one at a time.
Why This Matters
Each decode step fires dozens of tiny kernels, and on small models the cost of launching them rivals the actual work. Replaying a captured graph removes most of that overhead for a 5–20% speedup.