Gated Cross-Attention

Key Insight

This project rebuilds Flamingo's key mechanism: new cross-attention layers inserted between the frozen LLM's blocks, each one gated by a learned multiplier that starts at exactly zero. The verification step is the whole lesson: at initialization the gate contributes nothing, so the model's output must be bit-for-bit identical to the original text-only LLM, and image information begins to flow in only as training opens the gate. That "start as the unmodified model, then blend the new capability in gradually" design is why you can add a modality to a strong pretrained network without breaking the behavior it already has. Confirming the identity at init is also the cheapest way to catch a wiring bug before a long training run.
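The identity check described above can be sketched in a few lines. This is a minimal NumPy illustration, not the project's implementation: the class GatedCrossAttention, the helpers frozen_block and run, the single-head attention, and all shapes are hypothetical stand-ins, and the gate is assumed to take the form tanh(alpha) with alpha initialized to 0.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class GatedCrossAttention:
    """Hypothetical single-head cross-attention scaled by tanh(alpha)."""
    def __init__(self, d_model, rng):
        scale = d_model ** -0.5
        self.w_q = rng.normal(0.0, scale, (d_model, d_model))
        self.w_k = rng.normal(0.0, scale, (d_model, d_model))
        self.w_v = rng.normal(0.0, scale, (d_model, d_model))
        self.alpha = 0.0  # assumed gate parameter; tanh(0) == 0 closes the gate

    def __call__(self, text, image):
        q = text @ self.w_q            # queries come from the text stream
        k = image @ self.w_k           # keys and values come from image features
        v = image @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        # Residual connection: with the gate at zero this returns `text` unchanged.
        return text + np.tanh(self.alpha) * (attn @ v)

def frozen_block(x, w):
    # Toy stand-in for one frozen LLM block (not a real transformer block).
    return x + np.tanh(x @ w)

rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(5, d))    # 5 text tokens
image = rng.normal(size=(7, d))   # 7 image tokens
block_weights = [rng.normal(0.0, d ** -0.5, (d, d)) for _ in range(3)]
gates = [GatedCrossAttention(d, rng) for _ in block_weights]

def run(text, image, use_gates):
    x = text
    for w, gate in zip(block_weights, gates):
        if use_gates:
            x = gate(x, image)    # new layer inserted before each frozen block
        x = frozen_block(x, w)
    return x

baseline = run(text, image, use_gates=False)
at_init = run(text, image, use_gates=True)
# Exact equality, not np.allclose: a closed gate must add exactly zero.
identical_at_init = np.array_equal(baseline, at_init)

gates[0].alpha = 0.5              # simulate training opening one gate
opened = run(text, image, use_gates=True)
changed_after_opening = not np.array_equal(baseline, opened)
```

The comparison deliberately uses exact equality rather than a tolerance. A tolerance-based check could hide a small leak, for example a gate initialized near zero instead of at zero, or a bias term added outside the gated branch.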
