Skinny-M Kernel Study
A tall, thin matmul leaves the Tensor Cores half-asleep — different kernels wake them up differently.
Key Insight
This project takes a decode-shaped GEMM (a matrix multiply with a very small batch dimension) and compares how cuBLAS, a Triton version, and Marlin perform on it — reporting TFLOPs and memory bandwidth.
Why This Matters
Decode matmuls are "skinny" (tiny M), so they barely use the Tensor Cores and become a memory-bandwidth problem instead. Seeing how different kernels handle the same skinny shape explains why production engines refuse to call one generic matmul for everything.