Tensor Parallel Attention

When one layer is too big for one GPU, cut the layer itself in half.

Key Insight

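Tensor parallelism splits the weights of a single layer across GPUs, instead of replicating the whole model on each one. In the Megatron style, a multi-head attention layer is split across two GPUs by partitioning the query, key, and value projections column-wise, so each GPU holds the weights for half of the heads and computes attention for those heads locally. The output projection is partitioned row-wise to match, and the GPUs' partial outputs are summed with an all-reduce to produce the layer's result.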
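The split can be checked numerically. The sketch below is a minimal single-process simulation, not a distributed implementation: it assumes NumPy, uses two column slices of the weights to stand in for two GPUs, and uses a plain sum to stand in for the all-reduce. The function names (attention_heads, full_attention, tensor_parallel_attention) and the toy sizes are hypothetical, and masking, biases, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_heads(x, w_q, w_k, w_v, n_heads):
    """Unmasked multi-head self-attention, up to the concatenated head outputs."""
    seq = x.shape[0]
    d_head = w_q.shape[1] // n_heads

    def split_heads(t):
        # (seq, n_heads * d_head) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    out = weights @ v  # (n_heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(seq, n_heads * d_head)

def full_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Reference: the whole layer computed in one place."""
    return attention_heads(x, w_q, w_k, w_v, n_heads) @ w_o

def tensor_parallel_attention(x, w_q, w_k, w_v, w_o, n_heads, n_shards=2):
    """Each shard plays the role of one GPU holding n_heads / n_shards heads."""
    heads_per_shard = n_heads // n_shards
    cols = w_q.shape[1] // n_shards
    partials = []
    for rank in range(n_shards):
        sl = slice(rank * cols, (rank + 1) * cols)
        # Column slices of Q/K/V: only this shard's heads are computed.
        local = attention_heads(x, w_q[:, sl], w_k[:, sl], w_v[:, sl], heads_per_shard)
        # Matching row slice of the output projection gives a partial result.
        partials.append(local @ w_o[sl, :])
    # Summing the partial results stands in for the all-reduce across GPUs.
    return sum(partials)

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 4
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))

reference = full_attention(x, w_q, w_k, w_v, w_o, n_heads)
sharded = tensor_parallel_attention(x, w_q, w_k, w_v, w_o, n_heads, n_shards=2)
print(np.allclose(reference, sharded))
```

The slicing works because each head occupies a contiguous block of columns in the query, key, and value projections, so a column slice is a whole number of heads. The only step that needs data from both shards is the final sum, which is where the cross-GPU communication would occur in a real setup.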

Why This Matters

Some layers are too large to fit or run on one GPU. Tensor parallelism is the standard way to spread that single layer's work across several GPUs, and it is a core building block for training the very largest models.
