Skip to main content

Profile a Single Decode Step


Open one decode step under a profiler and watch where the microseconds actually go.


Key Insight

This project uses a profiler to record a single decode step and pick apart where the time goes — the longest kernel, the HBM read pattern, and the overhead of launching many tiny kernels.

Why This Matters

Decode is memory-bound, so its bottleneck is rarely "the math is slow"; only a profiler trace shows the real culprit. Reading one is the skill that lets you fix a latency regression instead of guessing at it.