FSDP a Transformer
Don't copy the whole model to every GPU — give each one a slice and borrow the rest just in time.
Key Insight
FSDP splits a model's parameters, gradients, and optimizer state into shards, one per GPU, and gathers each full layer only for the moment it is needed. This lets you train a transformer that is far too large to fit on a single GPU under DDP.
Why This Matters
FSDP is the modern default for training large models on ordinary clusters. Seeing a model run under FSDP that crashes under DDP makes the memory savings concrete.