MMDiT Block

Key Insight

In a standard DiT, text guides the image through cross-attention — the image tokens look at the text tokens, but not the other way around. MMDiT (Multi-Modal Diffusion Transformer), the block behind Stable Diffusion 3 and Flux, instead concatenates text and image tokens and runs them through one shared self-attention ("joint attention"), while giving each modality its own normalization and MLP weights. Letting text and image attend to each other freely in both directions improves compositional prompts — getting "a red cube on a blue sphere" right instead of swapping the colors — which is why most strong 2024+ open models adopted it.

Key Insight​

Key Insight