Implement DiT-S/2

Key Insight

This project replaces the U-Net at the heart of a diffusion model with a Diffusion Transformer (DiT): you cut the noisy latent into 2×2 patches (that is the "/2"), feed the resulting sequence of tokens through 12 transformer blocks of width 384 (the "S", for small), and predict the added noise just as before. The conditioning (the timestep plus the class label) enters through AdaLN-Zero, which predicts per-channel shift, scale, and gate values from the conditioning vector. Because the layer that produces them is initialized to zero, every gate starts at zero, so each block begins as a do-nothing identity and only gradually learns to modulate the activations. Training it class-conditionally on CIFAR-10 and comparing FID against your earlier U-Net baseline puts the transformer's real selling point in context: it is not magically better at this small scale, but it scales far more predictably as you add parameters and compute.
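The two mechanisms described above, patchifying and the zero-initialized modulation, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the 3×32×32 input shape, the random stand-in for the conditioning vector, the `tanh` stand-in for the attention/MLP sublayer, and all function names (`patchify`, `layer_norm`, `adaln_zero_branch`) are hypothetical choices made for the example.

```python
import numpy as np

def patchify(x, p=2):
    """Split a (C, H, W) array into (H/p * W/p) tokens of length p*p*C."""
    c, h, w = x.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by the patch size"
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)            # (H/p, W/p, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)

def layer_norm(x, eps=1e-6):
    """LayerNorm over the channel axis, without learned affine parameters."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero_branch(x, cond, w_mod, b_mod, sublayer):
    """One residual branch: x + gate * sublayer(LN(x) * (1 + scale) + shift)."""
    shift, scale, gate = np.split(cond @ w_mod + b_mod, 3)
    h = layer_norm(x) * (1.0 + scale) + shift
    return x + gate * sublayer(h)

rng = np.random.default_rng(0)
D = 384                                        # DiT-S width

# Assumed input: a 3x32x32 array (CIFAR-10-sized); 2x2 patches give 16*16 tokens.
tokens = patchify(rng.normal(size=(3, 32, 32)), p=2)
x = tokens @ (rng.normal(size=(tokens.shape[1], D)) * 0.02)   # linear patch embedding

# Stand-in for the combined timestep + class embedding.
cond = rng.normal(size=D)

# Zero-initialized modulation layer: shift, scale, and gate all start at zero.
w_mod = np.zeros((D, 3 * D))
b_mod = np.zeros(3 * D)

# Stand-in for the attention or MLP sublayer.
w_sub = rng.normal(size=(D, D)) * 0.02
out = adaln_zero_branch(x, cond, w_mod, b_mod, lambda h: np.tanh(h @ w_sub))

print(tokens.shape, x.shape, np.allclose(out, x))
```

With the modulation weights at zero, the gate multiplies the sublayer output by zero, so `out` equals `x` whatever the sublayer computes. A full DiT block would apply this pattern twice, once around attention and once around the MLP, so its modulation layer would emit six vectors rather than three.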
