2D RoPE for DiT

Key Insight

A transformer has no built-in sense of position: without position information, a sequence of image patches is just an unordered bag, so you must tell it where each patch sits. 2D RoPE (rotary position embedding) does this by rotating each token's query and key vectors by angles computed from its row and column. As a result, position enters the attention dot product between two patches only through their relative row and column offsets, not their absolute coordinates. Swapping a DiT's learned position vectors for 2D RoPE usually improves quality, especially when generating at resolutions larger than the model was trained on, because rotations extrapolate to unseen positions far more gracefully than a fixed lookup table of learned vectors.
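The relative-offset property can be checked numerically. The sketch below is a minimal NumPy illustration, not a reference implementation. It assumes one common layout (the first half of the channels is rotated by the row index, the second half by the column index) and a frequency base of 10000; the names `rope_1d` and `rope_2d` are made up for this example.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by angles pos * freq_i."""
    d = x.shape[-1]
    assert d % 2 == 0, "channel count must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per channel pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, row, col, base=10000.0):
    """Assumed layout: first half of channels encodes row, second half col."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :half], row, base), rope_1d(x[..., half:], col, base)],
        axis=-1,
    )

rng = np.random.default_rng(0)
q = rng.standard_normal(16)  # toy query vector for one head
k = rng.standard_normal(16)  # toy key vector for one head

# Same relative offset (2 rows, 3 columns) at two very different absolute positions.
score_a = rope_2d(q, 3, 5) @ rope_2d(k, 1, 2)
score_b = rope_2d(q, 103, 205) @ rope_2d(k, 101, 202)
print(np.isclose(score_a, score_b))
```

Both scores use the same unrotated `q` and `k`, so they can only differ through position. They agree because the two patch pairs share the same row and column offsets.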

To unpack that last point: learned position vectors are a lookup table. During training the model memorizes one vector per position, up to the edge of the largest grid it ever saw (say 32×32 patches). Ask it to place a patch in column 40 and there is simply no entry in the table for it. The model never learned one, so it improvises badly and the extra-large image comes out distorted. It is like a printed seating chart that lists seats 1–32: show up holding ticket 40 and the chart is blank, leaving you to guess where to stand. RoPE has no table at all. It turns a position into an angle with a fixed formula (position 1 rotates a little, position 2 a little more), so any position, even one never seen in training, just gets a proportionally larger rotation and the formula still applies. That is the difference between a memorized chart and a rule like "each seat is half a metre further along the wall": the rule still tells you exactly where seat 40, or seat 400, sits, because it computes the answer instead of looking it up.
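The chart-versus-rule contrast can be sketched in a few lines. This is a toy illustration under stated assumptions: the table holds random stand-in vectors rather than trained ones, the grid size 32 comes from the example above, and `learned_lookup` and `rope_angles` are hypothetical names. Real pipelines usually resize or interpolate a learned table instead of failing outright, but the table still has no trained entry for the new index.

```python
import numpy as np

TRAIN_GRID = 32  # positions per axis seen in training (the "say 32x32" example)
DIM = 8          # toy channel count

rng = np.random.default_rng(0)
# Stand-in for a learned table: one vector per index 0..31 along one axis.
learned_table = rng.standard_normal((TRAIN_GRID, DIM))

def learned_lookup(index):
    """Memorized chart: only indices seen in training have an entry."""
    return learned_table[index]

def rope_angles(index, dim=DIM, base=10000.0):
    """Rule: rotation angles are computed from the index, so any index works."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return index * freqs

try:
    learned_lookup(40)
    lookup_ok = True
except IndexError:
    lookup_ok = False  # index 40 is past the end of the table

angles_40 = rope_angles(40)    # defined even though 40 was never "seen"
angles_400 = rope_angles(400)  # ten times the index gives ten times the angles
print(lookup_ok, angles_40, angles_400)
```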
