FSDP from Scratch (Toy)
Split the model across two GPUs by hand, and the magic of sharded training disappears.
Key Insight
FSDP (Fully Sharded Data Parallel) splits a model's weights, gradients, and optimizer state across GPUs so no single GPU has to hold the whole thing. Building a toy version by hand on 2 GPUs — and checking it reaches the same result as plain data parallelism — shows exactly what the library does for you.
Why This Matters
AdamW's optimizer state alone is several times the size of the model, so large models do not fit on one GPU. Sharding with FSDP (and its cousin ZeRO) is what makes training beyond a few billion parameters possible at all.