Multimodal Learning: From Beginner to Advanced

A comprehensive guide to understanding and building systems that learn from and reason across multiple modalities — text, images, audio, video, and beyond — from contrastive pretraining to modern vision-language models and unified any-to-any architectures.

"Multimodal learning" is the slice of machine learning that operates on more than one input/output type. A model that reads an image and writes a caption is multimodal. A model that listens to audio and produces text is multimodal. A model that takes a text prompt and a reference image and produces a video is very multimodal. This guide is about how those systems learn a shared representation across modalities, how to train them, and where the field is going.

Scope and boundaries

This guide owns the problem of getting two or more modalities to share one representation — aligning them, fusing them, and reasoning jointly over them. To keep the AI Learning Guides mutually exclusive and collectively exhaustive (MECE), it deliberately stops at a few borders and links forward to the guide that owns each one.

In scope — this guide owns these topics:

Contrastive / aligned pretraining (CLIP, SigLIP, ImageBind) — making separate modality spaces comparable
Fusion architectures — cross-attention, Q-Former, projectors, gated attention, interleaved/early fusion
Vision-language models (VLMs) — image(+text)→text understanding, VQA, grounding, OCR-heavy models
Audio and speech as modalities — encoders, ASR/TTS, audio LMs, neural audio codecs (no dedicated audio guide exists, so multimodal is the home for them)
Video, audio, and image understanding — encoding non-text modalities for joint reasoning
Unified / any-to-any models — the "one transformer for every modality" framing (Chameleon, GPT-4o, Gemini, Janus, Transfusion)
Multimodal-specific training, data, alignment, and evaluation — the parts that differ from the single-modality recipes

Out of scope — deferred to the owning guide:

Transformer/LLM architecture, tokenization, pretraining, post-training, and text-only agents → LLM. This guide uses a pretrained LLM as a frozen (or fine-tuned) backbone but does not re-teach how to build one.
Generative modeling of images (VAEs, GANs, diffusion, latent diffusion, DiT, flow matching) and image tokenizers (VQ-VAE, VQ-GAN, FSQ) → Image Generation. When a multimodal model needs to produce pixels, the generation machinery lives there; here we care about the cross-modal modeling.
Generative modeling of video (video diffusion, 3D VAEs, world models) → Video Generation. This guide owns video understanding; video synthesis is theirs.
Vision-Language-Action models as a robot policy class, imitation learning, sim-to-real → Robotics Phase 8. We cover the multimodal-modeling side of VLAs and link out for the control side.
RLHF / DPO / GRPO algorithm internals → RL Phase 9 and LLM Phase 5. We cover what is different about preference data with image/audio inputs.
Serving, batching, KV-cache, and inference latency for deployed multimodal models → Inference Systems. Kernel-level performance and quantization → AI Hardware.
Tensor, autograd, mixed-precision, distributed-training, and training-loop fundamentals → PyTorch Deep Dive.

When this guide touches an out-of-scope topic, it does so only to the depth needed to make a multimodal modeling decision, and it links to the owning guide.

Phase 0: Prerequisites
Phase 1: Foundations — What "Multimodal" Actually Means
Phase 2: Encoders for Each Modality
Phase 3: Contrastive Learning — CLIP and Friends
Phase 4: Fusion Architectures — How Modalities Talk to Each Other
Phase 5: Vision-Language Models (VLMs)
Phase 6: Audio, Speech, and Video
Phase 7: Unified and Any-to-Any Models
Phase 8: Training at Scale — Data, Compute, and Alignment
Phase 9: Evaluation and Benchmarks
Phase 10: Frontier Topics
Suggested Timeline
Key Advice
Common Pitfalls to Avoid
Additional Resources
Glossary

Phase 0: Prerequisites

Multimodal learning sits on top of two stacks (NLP and vision) and borrows from a third (audio). You do not need to be an expert in all of them, but the foundations cannot be skipped.

Concepts to Know

Transformers: self-attention, cross-attention, positional embeddings, layer norm, residual connections — from the LLM guide Phase 2
Vision basics: convolution, ViT (Vision Transformer), how an image becomes a sequence of tokens
Text basics: tokenization (BPE, SentencePiece), language modeling, the next-token-prediction objective — see LLM Phase 1
PyTorch fluency: nn.Module, autograd, mixed precision, basic training loops — see PyTorch Deep Dive
Embedding spaces: what an L2-normalized vector looks like, cosine similarity, the geometry of high-dimensional spaces
Contrastive intuition (helpful but not required yet): pulling similar things together, pushing different things apart

What you do not need yet. You don't need diffusion or GAN internals to start — those are the Image Generation guide's territory, and you only need them once you want a model that outputs pixels (Phases 7 and 10 here). You also don't need RLHF internals; Phase 8 covers only what's different about multimodal preference data.

The One Equation Everything Comes Back To

Multimodal learning = map every modality into a SHARED representation,
                      then either:
                        (a) compare them (retrieval, classification), or
                        (b) generate one from the other (caption, image-from-text), or
                        (c) reason jointly over all of them (VQA, agents).

The shared representation can be:
  - a single vector per item     (CLIP-style)
  - a sequence of tokens         (LLaVA / VLM-style)
  - a discrete token alphabet    (any-to-any, e.g. Chameleon, Gemini-style)

Resources

Transformers from Scratch — Brandon Rohrer
Andrej Karpathy — Let's build the GPT tokenizer
ViT paper (Dosovitskiy et al., 2020) — required reading
CLIP paper (Radford et al., 2021) — read this before Phase 3

Phase 1: Foundations — What "Multimodal" Actually Means

Before architectures, get the conceptual map right. "Multimodal" is a fuzzy umbrella term that hides at least four distinct problems.

Concepts to Learn

The four canonical tasks:
- Cross-modal retrieval — given an image, find the matching caption (or vice versa)
- Cross-modal generation — given text, produce an image (or audio, or video). The generative half lives in Image Generation / Video Generation; here we care about the conditioning and the cross-modal interface.
- Multimodal understanding — given image + text, answer a question (VQA, captioning)
- Joint/any-to-any — flexibly map any subset of modalities to any other
Modality gap — even well-trained models keep text and image embeddings in noticeably different regions of the shared space
Alignment vs fusion — alignment is making spaces comparable; fusion is combining information
Early vs late fusion:
- Late fusion: encode each modality separately, combine at the end (CLIP-style)
- Early fusion: interleave modalities into one sequence from the start (Chameleon-style)
- Middle fusion: encode separately, then attend across (Flamingo, LLaVA)
Pretraining objectives: contrastive, masked, generative, and combinations

A Taxonomy Diagram

                          MULTIMODAL MODELS
                                 │
        ┌────────────────────────┼────────────────────────┐
        │                        │                        │
   DUAL-ENCODER           ENCODER-DECODER             UNIFIED
   (alignment)            (understanding)         (any-to-any)
        │                        │                        │
   CLIP, SigLIP,           Flamingo, BLIP-2,         Chameleon,
   ImageBind               LLaVA, Qwen2.5-VL,         Gemini, GPT-4o,
                           PaliGemma                  Janus, Transfusion
        │                        │                        │
   Best at:                Best at:                  Best at:
   - retrieval             - VQA                     - everything
   - zero-shot             - captioning              - but expensive
     classification        - dialogue                  to train
   - data filtering        - reasoning

Projects

Project	Description	Difficulty
Modality survey	Pick 5 multimodal papers, classify each by fusion type and objective; write a one-paragraph summary of each	⭐
Visualize the modality gap	Encode 1k images and 1k captions with a pretrained CLIP; PCA them; observe the separation	⭐⭐
Toy retrieval	Build a tiny retrieval system: encode a few hundred images and captions with CLIP, retrieve top-5 for each query	⭐⭐

Key Insight

The choice of fusion strategy determines what your model can do. Dual encoders (CLIP) are fast and great at retrieval but can't reason or generate. Encoder-decoder VLMs (LLaVA) reason and generate text but not images. Unified models (Chameleon, GPT-4o) do everything but need vastly more data and compute. There is no free lunch; pick the architecture that matches your task.

Resources

On the Opportunities and Risks of Foundation Models (Stanford CRFM) — long but the multimodal sections are excellent context
A Survey on Multimodal Large Language Models (Yin et al., 2024)
A Survey on Vision-Language Models (Zhang et al., 2024)
Hugging Face — Vision Language Models Explained

Phase 2: Encoders for Each Modality

Before you can fuse modalities, you have to encode each one into vectors. This phase is about the representation building blocks — the encoders that turn pixels, waveforms, and frames into sequences a transformer can align.

MECE note. This phase teaches encoders as feature extractors for alignment and understanding. The generative backbones that turn latents back into pixels (U-Nets, DiTs) belong to Image Generation and Video Generation. The discrete tokenizers (VQ-VAE, VQ-GAN, FSQ, MagViT-v2) are taught in Image Generation Phase 3; here we use them (Phase 7) and only summarize.

Concepts to Learn

Image encoders:
- CNNs (ResNet, EfficientNet) — still useful, especially for small models
- Vision Transformers (ViT) — the modern default; how patchification works
- Patch size and resolution tradeoffs — smaller patches = more tokens = better detail = quadratically more compute
- SigLIP / SigLIP 2 / DFN / EVA-CLIP / DINOv2 — modern improvements over the original CLIP vision tower; DINOv2 is the dominant self-supervised (non-contrastive) choice
Text encoders:
- BERT-style bidirectional encoders (for dual-encoder models)
- Decoder-only LLMs as encoders (just take hidden states)
- The tradeoff: bidirectional sees both sides but is non-causal; decoder-only is causal but composes naturally with generation
Audio encoders:
- Mel spectrograms — the standard input representation
- Whisper-style encoders for speech
- HuBERT, wav2vec 2.0 for general audio
- Neural audio codecs (EnCodec, SoundStream, DAC, Mimi) for discrete audio tokens
Video encoders:
- Frame-by-frame ViT (cheap but loses motion)
- 3D convolutions or 3D-ViT for spatiotemporal patches
- Temporal pooling, hierarchical encoding

Image Patchification, Visualized

Input image: 224×224×3

Split into 16×16 patches:    14 × 14 = 196 patches

Each patch: 16×16×3 = 768 numbers

Linear projection → embedding of dim D (e.g., 768)

Add positional embedding (learned or sinusoidal)

+ [CLS] token at position 0

→ sequence of 197 tokens, each D-dim
→ feed to a stack of transformer blocks
→ output is 197 contextualized vectors
→ pool (CLS token, mean, attention pool) → single image vec

Projects

Project	Description	Difficulty
Implement ViT from scratch	Patchify, linear-project, transformer blocks, CLS pooling — train on CIFAR-10	⭐⭐⭐
Compare encoders	Take ResNet-50, ViT-B/16, SigLIP, and DINOv2 — extract features for 1k ImageNet images, compare via linear probe	⭐⭐
Mel spectrogram pipeline	Take a 10-second `.wav` file, produce a mel spectrogram, feed through a small CNN	⭐⭐
Whisper encoder reuse	Use just the encoder of Whisper to get audio embeddings; build a simple audio classifier on top	⭐⭐⭐
Patch-size study	Train ViT with patches of 8, 16, 32 — measure accuracy and FLOPs	⭐⭐⭐

Sample Code: A Minimal ViT Patch Embedding

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2
        # A conv with stride=patch_size is the standard trick:
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_patches + 1, embed_dim))

    def forward(self, x):
        # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # (B, N+1, D)
        return x + self.pos_embed

Key Insight

The patchification trick — using a single strided convolution to both split the image into patches and project them to the embedding dimension — is one of those "obvious in hindsight" moves that made ViT practical. It's mathematically identical to the unfold-then-linear approach but is dramatically faster. The deeper lesson: every modality reduces to "turn it into a sequence of D-dimensional vectors," after which a transformer doesn't care whether those vectors came from pixels, waveforms, or words.

Resources

ViT paper
SigLIP paper — sigmoid loss replaces softmax, scales better
DINOv2 paper — strong self-supervised vision features
Whisper paper — best read on a modern speech encoder
EnCodec paper — for discrete audio tokens
VideoMAE paper — masked autoencoding for video

Phase 3: Contrastive Learning — CLIP and Friends

CLIP is the model that made modern multimodal learning take off, and contrastive alignment is this guide's home turf — the Image Generation guide explicitly defers CLIP/contrastive pretraining here. Understanding it cold pays compound interest.

Concepts to Learn

The contrastive objective: pull matched (image, caption) pairs together, push unmatched pairs apart
InfoNCE loss — the workhorse contrastive loss; relationship to mutual information
The temperature parameter τ — what it controls, why it's learnable, and why it matters more than you'd think
Batch size in contrastive learning — why bigger is dramatically better, and the tricks (memory bank, MoCo, distributed gathering) to fake it cheaply
Hard negatives — easy negatives don't teach the model anything; mining hard ones helps
CLIP variants:
- SigLIP / SigLIP 2 — sigmoid (per-pair) loss instead of softmax (over-batch) loss; works at smaller batch sizes
- ALIGN — Google's CLIP-equivalent, trained on noisier web data
- OpenCLIP, EVA-CLIP, DFN, MetaCLIP — community and Meta scaling efforts
- ImageBind — extends contrastive learning to 6 modalities (text, image, audio, depth, thermal, IMU)
Zero-shot classification — how CLIP does classification without ever seeing labels
CLIP as a filter — using CLIP scores to clean web-scale training data (e.g., LAION); the same trick reappears for data curation in Image Generation Phase 10 and Video Generation Phase 10

The CLIP Training Step

Batch of N (image, text) pairs:

    images:    [I₁, I₂, ..., I_N]   → image encoder  → [v₁, v₂, ..., v_N]
    captions:  [T₁, T₂, ..., T_N]   → text encoder   → [u₁, u₂, ..., u_N]

L2-normalize both, then compute similarity matrix:

    S = (V · Uᵀ) / τ      # shape (N, N)

Diagonal entries S[i, i] are matched pairs (should be high).
Off-diagonal entries S[i, j], j≠i are unmatched (should be low).

Loss = (cross_entropy(S, labels=identity, axis=rows)
      + cross_entropy(S, labels=identity, axis=cols)) / 2

i.e. for each image, the matched caption should be the most similar
out of all N captions in the batch — and vice versa.

Projects

Project	Description	Difficulty
Implement InfoNCE	Write the symmetric contrastive loss from scratch; verify gradients	⭐⭐
Tiny CLIP	Train a small CLIP on Flickr30k or COCO captions; image encoder = small ViT, text encoder = small transformer	⭐⭐⭐⭐
Zero-shot ImageNet	Use a pretrained CLIP to classify ImageNet without ever training on its labels; tune the prompt template	⭐⭐
Hard-negative mining	Train CLIP with mined hard negatives vs random; measure retrieval improvement	⭐⭐⭐⭐
Temperature ablation	Vary τ from 0.01 to 1.0; observe accuracy and the geometry of the embedding space	⭐⭐⭐
Data filtering with CLIP	Filter a noisy image-text dataset by CLIP similarity score; train a downstream model on filtered vs unfiltered	⭐⭐⭐

Sample Code: CLIP-Style Contrastive Loss

import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # L2-normalize
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Cosine similarity matrix, scaled by learned temperature
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text  = logits_per_image.T

    n = image_features.size(0)
    labels = torch.arange(n, device=image_features.device)

    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text,  labels)
    return (loss_i + loss_t) / 2

# logit_scale is usually parameterized as exp(theta) where theta is learned,
# initialized to ln(1/0.07) ≈ 2.66, clamped to ≤ ln(100) for stability.

Key Insight

CLIP's most important contribution was not the architecture; it was the realization that the internet is a labeled dataset. Every image with an alt-text or surrounding caption is a free supervised example. The contrastive objective turned this firehose of noisy data into a useful signal. The architecture (two transformers + cosine similarity) is almost an afterthought.

Resources

CLIP paper (Radford et al., 2021)
SigLIP paper (Zhai et al., 2023) and SigLIP 2 (2025)
OpenCLIP — reproducible CLIP training in the open
ImageBind — contrastive across 6 modalities
Lucas Beyer — Lecture on CLIP and SigLIP

Phase 4: Fusion Architectures — How Modalities Talk to Each Other

Once each modality is encoded, you have to combine them. There are more options than people realize.

Concepts to Learn

Concatenation — the trivial baseline; works surprisingly well sometimes
Cross-attention — one modality's tokens attend to another's; the most common fusion in modern VLMs
Q-Former (BLIP-2) — learnable queries that distill an image into a fixed number of tokens for an LLM
Perceiver IO — cross-attention with a small latent set, modality-agnostic
Adapter modules — small layers inserted into a frozen backbone, trained to adapt to a new modality
Gated cross-attention (Flamingo) — adds new cross-attention layers between LLM layers, gated so the pretrained behavior isn't broken at init
Projector-only fusion (LLaVA) — just a linear or MLP projection from image features to LLM token space; surprisingly effective
Interleaved sequences — treat image tokens and text tokens as one sequence (early fusion)

Five Fusion Patterns Side by Side

1. Concatenation (simplest):
     [text emb] ─┐
                 ├──► classifier
     [image emb]─┘

2. Cross-attention (Flamingo-style):
     LLM block ──► cross-attn(text tokens, image tokens) ──► next LLM block
                                       ▲
                                       │
                              image encoder features

3. Q-Former (BLIP-2):
     image encoder ──► 32 learned queries ◄── cross-attn ◄── frozen
                              │                              image features
                              ▼
                          → projected → fed into LLM as 32 "tokens"

4. Projector + interleaved tokens (LLaVA):
     image encoder ──► linear/MLP ──► N "image tokens"
                                          │
                                          ▼
     LLM sees: [<system>...<image_tokens>...<user_text>...<assistant>]

5. Early fusion (Chameleon, native multimodal):
     image ──► VQ tokenizer ──► discrete image tokens
                                   │
                                   ▼
     unified sequence: [text tokens][image tokens][text tokens]...
     trained with one next-token prediction loss over the whole alphabet

Projects

Project	Description	Difficulty
Concat vs cross-attn	On a small VQA task, compare concatenation, cross-attention, and projector fusion; report accuracy and parameter counts	⭐⭐⭐
Implement Q-Former	A small Q-Former with 16 learned queries; train on COCO captions	⭐⭐⭐⭐
Adapter for a new modality	Add a depth-image input to a frozen CLIP image encoder via an adapter layer	⭐⭐⭐
Perceiver IO	Implement the Perceiver IO architecture on a small toy task	⭐⭐⭐⭐
Gated cross-attention	Implement Flamingo's gated mechanism; verify that at init, output equals the unimodal LLM	⭐⭐⭐⭐

Sample Code: A Cross-Attention Block

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """LLM hidden states attend to image features."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))   # Flamingo trick

    def forward(self, x_text, x_image):
        # x_text: (B, T, D)   — query (LLM hidden states)
        # x_image: (B, N, D)  — key/value (image features, projected to dim D)
        q = self.norm_q(x_text)
        kv = self.norm_kv(x_image)
        out, _ = self.attn(q, kv, kv)
        return x_text + self.gate.tanh() * out     # tanh(0) = 0 at init → identity

Key Insight

The progression from BLIP-2 → LLaVA → Chameleon is a story of simplification. BLIP-2 used a complex Q-Former with a multi-stage training recipe. LLaVA showed that a single linear projection works almost as well if you have good instruction data. Chameleon showed that you don't even need separate encoders — just tokenize everything. The lesson: in deep learning, "simpler + more data" usually wins.

Resources

Phase 5: Vision-Language Models (VLMs)

The current workhorse class. A VLM takes images (+ text) in and produces text. Most "multimodal AI" products you can name are VLMs.

Concepts to Learn

The standard recipe: pretrained vision encoder + projector + pretrained LLM → train projector first, then jointly fine-tune. (The vision encoder comes from Phase 2/3; the LLM comes from the LLM guide — a VLM is mostly glue and data.)
Image preprocessing for VLMs:
- Fixed resolution vs dynamic resolution / AnyRes (Qwen2-VL, InternVL2): tile the image to handle any aspect ratio
- Native-resolution ViT (Qwen2-VL's NaViT-style approach) — process the image at its true resolution instead of a fixed grid
- Token budget per image — typically 256 to a few thousand image tokens
Instruction tuning for VLMs — the "LLaVA-Instruct" recipe: GPT-4-generated multimodal instructions
Visual question answering (VQA) — classic benchmark task
OCR-heavy VLMs — Donut, Nougat, GOT — for documents
Grounding — output bounding boxes or pixel coordinates; teaching the LLM to "point" (Molmo's pointing supervision is a clean recent example)
Modern frontier VLMs (2025–2026):
- Qwen2.5-VL / Qwen3-VL — Alibaba, strong open VLM family
- InternVL2.5 / InternVL3 — Shanghai AI Lab
- PaliGemma 2 — Google's small, strong VLM family
- Molmo — Allen AI, open weights and open data, strong grounding/pointing
- Pixtral — Mistral's VLM
- Llama 3.2 Vision / Llama 4 — Meta's official VLM line (Llama 4 is natively multimodal, MoE, early-fusion)
- DeepSeek-VL2 — MoE VLM
- Closed: GPT-4o, Claude (with vision), Gemini 2.0/2.5

The Standard VLM Training Pipeline

Stage 1 (alignment):       Stage 2 (visual instruction tuning):
─────────────────────      ─────────────────────────────────────
Freeze: vision encoder     Unfreeze: projector + LLM
Freeze: LLM                Freeze:   vision encoder (often)
Train:  projector only     Train on: ~500k–5M instruction pairs
Data:   ~500k–10M           Format:   conversational (image, Q, A)
        image-caption       Sources:  LLaVA-Instruct, ShareGPT4V,
        pairs                         Cauldron, custom

Result: LLM "speaks image"  Result: VLM that follows visual instructions

Projects

Project	Description	Difficulty
LLaVA from scratch	Connect a CLIP-ViT-L/14 to a 1–3B LLM with an MLP projector; do stage-1 alignment on COCO captions	⭐⭐⭐⭐
Visual instruction tuning	Fine-tune the above on the LLaVA-Instruct dataset; evaluate on a few VQA benchmarks	⭐⭐⭐⭐⭐
Dynamic resolution	Implement AnyRes tiling; verify it improves OCR-heavy benchmarks	⭐⭐⭐⭐
Grounding head	Add bounding-box outputs to a VLM via a special `<box>` token vocabulary	⭐⭐⭐⭐
Compare projectors	Linear vs 2-layer MLP vs Q-Former on the same downstream task; report quality and speed	⭐⭐⭐
Inference optimization	Take an open VLM, serve it with vLLM or SGLang; measure tokens/sec at different image counts (deep dive: Inference Systems)	⭐⭐⭐

Sample Code: A LLaVA-Style Forward Pass

import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder       # frozen in stage 1
        self.llm = llm                              # frozen in stage 1
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image, input_ids, image_token_idx):
        # 1. Encode image → sequence of visual tokens
        v = self.vision_encoder(image)              # (B, N_img, vision_dim)
        v = self.projector(v)                       # (B, N_img, llm_dim)

        # 2. Get text embeddings
        text_emb = self.llm.get_input_embeddings()(input_ids)  # (B, T, llm_dim)

        # 3. Replace the placeholder <image> token with the visual tokens
        # (in practice you splice them in at image_token_idx)
        # ... splicing logic ...
        merged = splice_in_visual_tokens(text_emb, v, image_token_idx)

        # 4. Run the LLM over the merged sequence
        return self.llm(inputs_embeds=merged).logits

Key Insight

The biggest difference between a "good" and "great" VLM is rarely the architecture — it's the data. The projector is trivial. The vision encoder and LLM are both pretrained. What separates Qwen2.5-VL from LLaVA-1.5 is millions of carefully curated visual instructions and high-quality OCR data. If you want to build a competitive VLM, budget 70% of your effort for data, not modeling.

Resources

LLaVA paper and LLaVA-1.5
Qwen2-VL paper and Qwen2.5-VL
PaliGemma paper
Molmo paper (Allen AI, 2024) — open weights and open data
InternVL2.5 / InternVL3
The Cauldron dataset (HF) — 50 VLM instruction datasets unified

Phase 6: Audio, Speech, and Video

Vision is the most popular non-text modality, but audio and video are catching up fast. There is no dedicated audio guide in this collection, so this phase is the home for audio and speech. For video, this phase owns understanding (encoding video for reasoning); video synthesis belongs to Video Generation.

Concepts to Learn

Audio representations:
- Raw waveform — high resolution but very long sequences
- Mel spectrogram — compact, perceptually grounded, the usual choice
- Discrete audio tokens (EnCodec, SoundStream, DAC, Mimi) — the audio analog of BPE; enables LM-style modeling of audio
Speech recognition (ASR) — Whisper, Conformer; encoder-decoder transformers on mel spectrograms
Text-to-speech (TTS) — non-autoregressive (FastSpeech), autoregressive (Tortoise, VALL-E), and modern hybrid neural codec approaches
Music and general audio generation — MusicGen, AudioGen, AudioLDM, Stable Audio (the audio analog of diffusion image models)
Speech LLMs and full-duplex voice — taking the LLM-with-projector recipe and replacing the vision encoder with an audio encoder (Qwen2-Audio); real-time, full-duplex speech (Moshi, GPT-4o voice mode, Gemini Live)
Video as a modality:
- Frame-by-frame ViT encoding — cheap but discards motion
- Spatiotemporal transformers — 3D attention over (frame, height, width)
- Video tokenization (MagViT-v2, OmniTokenizer) — compress video into discrete tokens (the same tokenizers used for video generation)
Video-language models: Video-LLaVA, VideoChat, VideoLLaMA, LLaVA-Video, Qwen2.5-VL (handles video natively)

The Audio Spectrogram Pipeline

Raw audio waveform (16 kHz, 1 channel):
    [-0.02, 0.01, 0.05, -0.03, ...]   ← millions of samples for a minute

Short-Time Fourier Transform (STFT):
    Sliding window (e.g., 25 ms hop 10 ms) → magnitude spectrogram (freq × time)

Mel-scale filterbank:
    Project linear-frequency bins onto a perceptually-spaced mel scale (e.g., 80 mel bins)

Log:
    log(mel_spectrogram + eps)   → range ~ [-10, 10]

Result shape: (T, 80)   ← T is the number of time frames
              ↓
              Treat as a "1D image" with 80 channels, feed to a convolutional or transformer encoder.

Projects

Project	Description	Difficulty
Mel spectrogram from scratch	Implement STFT and a mel filterbank; visualize a 10-second clip	⭐⭐
Whisper fine-tune	Fine-tune Whisper-small on a low-resource language or custom domain	⭐⭐⭐
EnCodec tour	Use EnCodec to encode/decode audio at multiple bandwidths; listen to the reconstructions	⭐⭐
Speech LLM	Glue an audio encoder to a small LLM with a projector; train on AudioSet captions	⭐⭐⭐⭐⭐
Video frame VLM	Sample 8 frames from a video, treat them as 8 images for a VLM, do video QA	⭐⭐⭐
Native video model	Use spatiotemporal patches (TubeViT-style) and train a small video classifier	⭐⭐⭐⭐

Sample Code: Mel Spectrogram with `torchaudio`

import torch
import torchaudio
import torchaudio.transforms as T

waveform, sr = torchaudio.load("audio.wav")        # (channels, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

mel = T.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    hop_length=160,        # 10 ms hop
    n_mels=80,
)(waveform)                # (channels, n_mels, time)

log_mel = torch.log(mel + 1e-6)
# log_mel is now ready to feed to an encoder — same shape semantics as an image with 80 channels.

Key Insight

Once you tokenize a modality — turn it into a discrete sequence with a fixed vocabulary — it becomes "just another language" for a transformer. This is why neural audio codecs are such a big deal: they let you do language-model-style generation on audio. The same applies to images (VQ-VAE → discrete image tokens, Image Generation Phase 3), video (MagViT-v2, Video Generation Phase 5), and even actions (in Robotics, action tokenizers). The unified-token view is the path to true any-to-any models — the subject of Phase 7.

Resources

Whisper paper
EnCodec paper
MusicGen paper
VALL-E paper — neural codec language model for TTS
Moshi paper (Kyutai, 2024) — full-duplex speech LM with the Mimi codec
MagViT-v2 paper — best video tokenizer
Video-LLaVA paper

Phase 7: Unified and Any-to-Any Models

The frontier, and this guide's signature territory — both Image Generation and Video Generation defer the "one transformer for every modality" framing to here. A single model that takes any combination of modalities in and produces any combination out.

Concepts to Learn

The unified-token hypothesis — if you can tokenize every modality, you can train one model on the union. (The tokenizers themselves are built in Image Generation Phase 3; here we assemble them into a joint model.)
Native multimodal models (Chameleon, GPT-4o, Gemini, Llama 4): trained from scratch on all modalities, no separate "vision tower"
Mixture-of-Experts (MoE) — a common scaling tool for unified models; experts can specialize by modality (Llama 4, DeepSeek-VL2)
Generation across modalities: a unified model can in principle output <image_token> sequences as easily as text tokens, then a decoder turns them into pixels
AR + diffusion hybrids — Transfusion (one transformer, next-token loss on text + diffusion loss on image patches) and Janus / Janus-Pro (decoupled visual encoders for understanding vs generation) are the 2024–2025 designs that close the gap with diffusion image quality while keeping a single backbone
Omni models — single model, single inference path, all modalities in and out (GPT-4o for text/audio/vision; Qwen2.5-Omni, MiniCPM-o on the open side)
Late-stage vs early-stage fusion at scale — the empirical evidence is increasingly that earlier fusion wins when you have enough compute
Trade-offs: unified models lose some specialist quality; the question is whether the joint flexibility makes up for it

Two Architectural Stances

Stance A: "Bolt-on" multimodality (LLaVA, Qwen2.5-VL)
─────────────────────────────────────────────────────
[image] → [vision encoder] → [projector] → fed as tokens to a pretrained LLM
[audio] → [audio encoder]  → [projector] → fed as tokens to the same LLM
Output: text only (or text + tool calls)

Pros: efficient; reuses huge pretrained LLMs; modular
Cons: can't generate non-text modalities; bottleneck at the projector


Stance B: "Native" multimodality (Chameleon, GPT-4o, Llama 4)
─────────────────────────────────────────────────────
Tokenize every modality into one shared discrete vocabulary:
   text tokens     [50000 entries]
   image tokens    [8192 entries from a VQ-VAE]
   audio tokens    [4096 entries from a neural codec]

One sequence: [text][image][text][audio][image]...
One transformer with one next-token prediction loss over the union.
(Variant — Transfusion: keep image patches continuous and apply a
 diffusion loss on them inside the same transformer.)

Pros: any-to-any natively; one model, one loss
Cons: enormous compute; harder to leverage existing LLMs;
      discrete-image-token decoder quality historically lagged diffusion
      (the gap Transfusion/Janus are closing)

Projects

Project	Description	Difficulty
Discrete image tokens	Train a small VQ-VAE on a face dataset; verify reconstruction at 1024 tokens/image (the tokenizer recipe is Image Gen Phase 3)	⭐⭐⭐⭐
Tiny Chameleon	Tokenize images with the above VQ-VAE, interleave with text from COCO captions, train one transformer over the unified sequence	⭐⭐⭐⭐⭐
Modality balancing	Train a unified model on text+image+audio; observe and fix one modality dominating loss	⭐⭐⭐⭐
MoE for multimodal	Add a small MoE layer to a multimodal model; observe whether experts naturally specialize	⭐⭐⭐⭐⭐
Reverse direction	Take a VLM (image-in → text-out) and add image generation by training an image-token output head	⭐⭐⭐⭐⭐

Key Insight

The bet behind native multimodal models is that the same scaling laws that gave us GPT-4 from text will give us GPT-4o from text+vision+audio together — if you have enough data and compute, the model will figure out the cross-modal structure on its own. The bet behind bolt-on multimodal models is that you can get 90% of the benefit at 10% of the cost. Both bets are still being played out; as of 2026 the bolt-on architecture still wins on cost/quality for most understanding tasks, while native + AR/diffusion-hybrid architectures (GPT-4o image generation, Janus-Pro, Transfusion) are pulling ahead on the generation side and on the capability ceiling.

Resources

Chameleon paper (Meta, 2024)
Transfusion paper (Meta, 2024) — one transformer, next-token + diffusion losses
Janus-Pro paper (DeepSeek, 2025) — decoupled understanding/generation encoders
Unified-IO 2 paper
Emu3 paper (BAAI, 2024) — next-token prediction, all modalities
Qwen2.5-Omni paper (2025) — open any-to-any omni model
Gemini technical report
GPT-4o announcement and analyses

Phase 8: Training at Scale — Data, Compute, and Alignment

Multimodal models are dataset-bound long before they are compute-bound. This phase is about everything that surrounds the model itself — the parts that are specific to multimodality. (The generic distributed-training machinery is PyTorch Deep Dive Phase 7; the RLHF/DPO algorithms are RL Phase 9.)

Concepts to Learn

Web-scale multimodal datasets:
- LAION-5B, LAION-2B-en — the canonical open image-text corpus (with all its problems)
- DataComp, COYO-700M — alternatives and successors
- OBELICS — interleaved image-text web documents
- WebLI — Google's large internal alternative
Data filtering — most of LAION is unusable; CLIP-score filtering, NSFW filtering, dedup, OCR filtering, aesthetic filtering (the CLIP-score filter is the Phase 3 trick reused at scale)
Synthetic captions — recaptioning web images with a strong VLM dramatically improves downstream training (the trick behind DALL-E 3, ShareGPT4V); this is shared lore with Image Gen Phase 10 and Video Gen Phase 10
Curriculum and staged training — start with clean alignment data, then noisier scale data, then instruction data
Modality balancing — in a unified model, if 99% of your tokens are text, the image loss will be ignored; need to upsample or reweight
Multimodal alignment / RLHF — preference data with image inputs; sycophancy and hallucination are harder to fix when the model has multiple modalities to "hallucinate from." The algorithms (PPO, DPO, GRPO) are owned by RL Phase 9; what's multimodal-specific is the preference-data collection and the visual grounding of the reward.
Safety: NSFW filtering, CSAM detection (mandatory), bias evaluation across demographics, hallucination benchmarks
Compute budgets — typical pretraining for an open VLM is 10⁸–10⁹ image-text pairs; native multimodal is 10× more

A Pragmatic Data Pipeline

Raw web crawl (e.g., Common Crawl + image URLs)
        │
        ▼
Deduplicate (URL, perceptual hash)
        │
        ▼
NSFW + CSAM filtering (must, not optional)
        │
        ▼
CLIP-score filtering (keep top ~30%)
        │
        ▼
Aspect-ratio and resolution filtering (drop tiny / extreme ratios)
        │
        ▼
Synthetic recaption with a strong VLM (recommended)
        │
        ▼
Aesthetic + OCR-quality scoring (task-dependent)
        │
        ▼
Tokenize text, store as shards (WebDataset / Parquet / Arrow)
        │
        ▼
Training-ready: ~10–20% of the original crawl, dramatically higher quality

Projects

Project	Description	Difficulty
Mini LAION pipeline	Take 1M LAION URLs, download, filter with CLIP, dedup, recaption with a small VLM — produce a clean shard	⭐⭐⭐⭐
Caption ablation	Train two small VLMs: one on original alt-text, one on recaptioned text; compare downstream	⭐⭐⭐⭐
Modality ratio sweep	In a unified model run, deliberately under/oversample one modality; sweep the sampling ratio and measure each modality's loss curve as a diagnostic	⭐⭐⭐⭐
Multimodal DPO	Collect a small set of preference pairs over VLM outputs; fine-tune with DPO (algorithm reference)	⭐⭐⭐⭐
Hallucination eval	Build a small benchmark of trick questions ("is there a dog in this image?" when there is none); evaluate several open VLMs	⭐⭐⭐

Key Insight

Two facts that dominate multimodal training at scale: (1) web alt-text is terrible — short, generic, often wrong; and (2) synthetic captions from a strong VLM are much better than human-written captions on average. The implication: data quality is itself a model-output problem. Better models make better captions, which make better models. This recursive improvement is one of the unexplained engines of recent progress.

Resources

LAION-5B paper
DataComp paper — careful study of data curation
DALL-E 3 technical report — recaptioning recipe
ShareGPT4V — recaptioning for VLMs

Phase 9: Evaluation and Benchmarks

Multimodal evaluation is notoriously broken. Knowing which benchmarks to trust (and how they fail) is its own skill. (This is the multimodal evaluation playbook; text-only LM evaluation is LLM Phase 8.)

Concepts to Learn

The benchmark landscape:
- MMMU / MMMU-Pro — multidiscipline multimodal understanding, hardest open VQA benchmark
- MMBench, MME — general multimodal capability
- DocVQA, OCRBench — OCR-heavy document understanding
- MathVista, MathVision — math + diagrams
- ChartQA, AI2D — charts and diagrams
- POPE, HallusionBench — object/visual hallucination
- RefCOCO — referring expressions / grounding
- VideoMME, MLVU, LongVideoBench — video understanding
The captioning benchmarks (CIDEr, BLEU, METEOR on COCO, NoCaps): largely solved and increasingly meaningless
LLM-as-judge — using a strong model (often GPT-4 or Claude) to grade open-ended outputs; introduces its own biases
Hallucination measurement — counting "things in the caption that aren't in the image"
Robustness probes — adversarial images, distribution shifts, demographic balance
Reasoning benchmarks — multimodal CoT, M³CoT, ScienceQA; the rise of multimodal reasoning models (QVQ, Gemini/o-series thinking with vision)
The leakage problem — many benchmarks are now in pretraining corpora; suspect any too-good result

A Sane Evaluation Suite for a New VLM

Capability                    Benchmark          What it measures
──────────────────────────    ───────────────    ─────────────────────
General multimodal QA         MMBench, MMMU      breadth, hard problems
OCR / docs                    DocVQA, OCRBench   text in images
Grounding                     RefCOCO            spatial precision
Hallucination                 POPE               object existence
Math + diagrams               MathVista          structured reasoning
Charts                        ChartQA            quantitative reading
Video (if applicable)         VideoMME           temporal understanding
Open-ended (LLM-judge)        MM-Vet, LLaVA-Wild user-style queries

Projects

Project	Description	Difficulty
Run a VLM evaluation harness	Use `lmms-eval` or `VLMEvalKit` to score an open VLM across 6+ benchmarks	⭐⭐
Build a hallucination probe	Construct 200 "is X in this image" questions; half true, half false; measure precision/recall	⭐⭐⭐
Reproduce a leaderboard result	Pick a paper's MMBench number; reproduce; document the gap	⭐⭐⭐
Benchmark contamination check	Search for a benchmark's test questions in a pretraining corpus shard	⭐⭐⭐
Human-correlated eval	For 100 outputs, get 3 human ratings and 3 LLM-as-judge ratings; measure agreement	⭐⭐⭐

Key Insight

There is no single number that captures "VLM quality." MMMU measures different things from POPE which measures different things from MathVista. The right move when launching a new VLM is to publish a suite — and to explicitly report the benchmarks where your model is worse than the prior state of the art. The field rewards honesty; reviewers see through cherry-picking.

Resources

lmms-eval — modern multimodal eval harness
VLMEvalKit — alternative, broad coverage
MMMU paper
POPE paper — hallucination eval
Are Emergent Abilities a Mirage? (Schaeffer et al., 2023) — required reading on eval skepticism

Phase 10: Frontier Topics

Where the field is going. Pick one or two threads and go deep. Several of these threads live on a border with another guide — the cross-links below tell you who owns the rest of the story.

Vision-Language-Action (VLA) Models for Robotics

A VLA takes (image, instruction) → action. From the multimodal angle, a VLA is "a VLM whose output head emits action tokens instead of text" — the foundational works are RT-2, OpenVLA, π0, Gemini Robotics. The robot policy, imitation-learning, and sim-to-real side is owned by Robotics Phase 8 (which reads better after this guide's Phase 5); here we care only about the multimodal-modeling interface.

Long-Context Multimodal

A 1-hour video has tens of thousands of frames. How do you fit that into a context window? Streaming attention, hierarchical encoding, learned memory, image-token compression. (The serving of long multimodal contexts — KV-cache sizing, prefix caching — is Inference Systems.)

Generative Unified Models

Models that produce images, audio, and video natively from one transformer: Chameleon's image-generation half, Show-o, Janus-Pro, Emu3, Transfusion. The cross-modal modeling is this guide's Phase 7; the underlying image/video generation machinery they decode through is Image Generation / Video Generation.

Multimodal Reasoning

Multimodal chain-of-thought, visual program synthesis (ViperGPT, VisProg), self-consistency over visual problems, and the new wave of multimodal thinking models (QVQ, Gemini/o-series with vision). The text-reasoning core is LLM Phase 6; what's new here is reasoning grounded in pixels.

Embodied and Agentic Multimodal

Web/GUI agents that see screenshots (SeeClick, ShowUI, OS-Atlas, UI-TARS), computer use (Claude Computer Use, Gemini's GUI mode), mobile agents. The screen is just another modality. (Tool-use and agent orchestration is LLM Phase 7; the embodied-control side is Robotics.)

Multimodal Interpretability

What does the projector actually do? Where do image features live inside the LLM? Mechanistic interpretability for VLMs is largely virgin territory (text-side interpretability is LLM Phase 10).

Efficient Multimodal Inference

Token merging, pruning, image-token compression, KV-cache sharing across modalities — the model-side techniques. The serving-stack side (batching, paged KV cache, multi-LoRA, throughput) is owned by Inference Systems; kernel-level work and quantization are AI Hardware.

Safety in Multimodal Models

Visual jailbreaks (typographic attacks, adversarial images), CSAM detection, deepfake detection, watermarking for generative outputs (SynthID). New attack surfaces and new defenses.

Resources for the Frontier

Awesome-Multimodal-Large-Language-Models — actively maintained survey
OpenVLA paper
π0 paper (Physical Intelligence)
Show-o paper
UI-TARS paper (ByteDance, 2025) — native GUI agent
Anthropic — Computer Use blog

Suggested Timeline

Phase	Duration	Outcome
0. Prerequisites	0–2 weeks	Transformers + PyTorch fluent; ViT and BPE understood
1. Foundations	1 week	Can map any multimodal paper to a taxonomy
2. Encoders	1–2 weeks	Implemented ViT; comfortable with mel spectrograms
3. Contrastive	2 weeks	Trained a tiny CLIP; zero-shot ImageNet working
4. Fusion	1–2 weeks	Implemented at least three fusion patterns
5. VLMs	3–4 weeks	Built and instruction-tuned a small VLM end to end
6. Audio + video	2 weeks	Trained a speech LM or video classifier
7. Unified models	2–4 weeks	Built a tiny Chameleon over interleaved tokens
8. Scale + data	2 weeks	Ran a real data filtering pipeline; understand recaptioning
9. Evaluation	1 week	Ran a real eval harness; built one hallucination probe
10. Frontier	Ongoing	Picked one thread (VLA, unified, agents, ...) and going deep

Total to research-comfortable: ~4 months of focused study. Longer if combined with serious projects (recommended).

Key Advice

Start with CLIP, end with CLIP. Every multimodal model traces back to or generalizes CLIP. If you understand contrastive learning deeply, the rest comes faster.
Build the smallest end-to-end VLM you can, early. A 100M-parameter VLM on COCO is one weekend's work. The lessons from doing it transfer directly to everything bigger.
Data is the model. Spend at least as much time on data filtering and recaptioning as on architecture. The papers don't emphasize this enough.
Beware the modality gap. Even well-aligned dual encoders keep text and image embeddings in separable regions. This affects retrieval, generation, and downstream fine-tuning.
Don't trust a single benchmark. Especially captioning metrics. Always evaluate on a suite, including LLM-as-judge for open-ended outputs.
Image tokens are expensive. A single image is often 256–2000 tokens. Multi-image and video contexts blow up fast. Know your token budget — and see Inference Systems when you deploy.
Reuse pretrained components. Almost no one trains a vision encoder from scratch in 2026; you start from SigLIP, DINOv2, or similar, and from a pretrained LLM from the LLM guide.
Synthetic captions are a superpower. Recaptioning with a strong VLM is the highest-leverage data trick in the field.
bf16 everywhere. The same advice as for LLMs and PyTorch; multimodal training is no different.
Visualize your model's attention. Especially the cross-attention in VLMs. It tells you whether the model is actually looking at the right region.

Common Pitfalls to Avoid

❌ Comparing CLIP variants on retrieval and concluding one "is better" — depends entirely on the dataset
❌ Using captioning metrics (BLEU, CIDEr) as your primary VLM evaluation
❌ Training a VLM on COCO and being surprised it can't OCR a screenshot
❌ Ignoring the temperature parameter in contrastive learning
❌ Loading a VLM at fp32 and wondering why inference is slow
❌ Forgetting that web data contains CSAM; not filtering for it
❌ Trusting a single MMMU score without checking the rest of the suite
❌ Building a unified model with 1M tokens of text and 10k tokens of image and wondering why image quality is bad
❌ Skipping the alignment stage and going straight to instruction tuning
❌ Evaluating on a benchmark that's already in your pretraining corpus
❌ Re-deriving diffusion/VQ-VAE internals here instead of reading Image Generation — this guide uses them, it doesn't re-teach them

Additional Resources

Books and Long-Form Reading

Talks and Lectures

Lucas Beyer — Vision and Multimodal Models
CMU 11-777 Multimodal Machine Learning
Stanford CS231n — for the vision foundations

Key Papers, Chronologically

Year	Paper	Contribution
2020	ViT	Transformers for vision
2021	CLIP	Web-scale contrastive
2022	Flamingo	Gated cross-attention into LLMs
2023	BLIP-2	Q-Former, modular fusion
2023	LLaVA	Linear projector + instruction tuning
2023	SigLIP	Sigmoid contrastive, scales better
2024	Qwen2-VL	Dynamic / native resolution, strong open VLM
2024	Chameleon	Native any-to-any with discrete tokens
2024	OpenVLA	VLA for robotics, open-source
2024	Emu3	Next-token prediction, all modalities
2024	Transfusion	AR text + diffusion image in one transformer
2025	Janus-Pro	Decoupled understanding/generation
2025	Qwen2.5-Omni	Open any-to-any omni model

Tools You Should Know

transformers (Hugging Face) — VLMs, multimodal pipelines
open_clip — reproducible CLIP training
lmms-eval / VLMEvalKit — evaluation harnesses
vLLM / SGLang — multimodal inference serving (see Inference Systems)
torchaudio — audio loading and transforms
decord / pyav — fast video frame loading
webdataset — streaming multimodal data

Communities

Hugging Face forums — most VLM Q&A
EleutherAI Discord — open research
r/MachineLearning, X/Twitter ML community

License

MIT License. See the LICENSE file for details.

Scope and boundaries​

Table of Contents​

Phase 0: Prerequisites​

Concepts to Know​

The One Equation Everything Comes Back To​

Resources​

Phase 1: Foundations — What "Multimodal" Actually Means​

Concepts to Learn​

A Taxonomy Diagram​

Projects​

Key Insight​

Resources​

Phase 2: Encoders for Each Modality​

Concepts to Learn​

Image Patchification, Visualized​

Projects​

Sample Code: A Minimal ViT Patch Embedding​

Key Insight​

Resources​

Phase 3: Contrastive Learning — CLIP and Friends​

Concepts to Learn​

The CLIP Training Step​

Projects​

Sample Code: CLIP-Style Contrastive Loss​

Key Insight​

Resources​

Phase 4: Fusion Architectures — How Modalities Talk to Each Other​

Concepts to Learn​

Five Fusion Patterns Side by Side​

Projects​

Sample Code: A Cross-Attention Block​

Key Insight​

Resources​

Phase 5: Vision-Language Models (VLMs)​

Concepts to Learn​

The Standard VLM Training Pipeline​

Projects​

Sample Code: A LLaVA-Style Forward Pass​

Key Insight​

Resources​

Phase 6: Audio, Speech, and Video​

Concepts to Learn​

The Audio Spectrogram Pipeline​

Projects​

Sample Code: Mel Spectrogram with torchaudio​

Key Insight​

Resources​

Phase 7: Unified and Any-to-Any Models​

Concepts to Learn​

Two Architectural Stances​

Projects​

Key Insight​

Resources​

Phase 8: Training at Scale — Data, Compute, and Alignment​

Concepts to Learn​

A Pragmatic Data Pipeline​

Projects​

Key Insight​

Resources​

Phase 9: Evaluation and Benchmarks​

Concepts to Learn​

A Sane Evaluation Suite for a New VLM​

Projects​

Key Insight​

Resources​

Phase 10: Frontier Topics​

Vision-Language-Action (VLA) Models for Robotics​

Long-Context Multimodal​

Generative Unified Models​

Multimodal Reasoning​

Embodied and Agentic Multimodal​

Multimodal Interpretability​

Efficient Multimodal Inference​

Safety in Multimodal Models​

Resources for the Frontier​

Suggested Timeline​

Key Advice​

Common Pitfalls to Avoid​

Additional Resources​

Books and Long-Form Reading​

Scope and boundaries

Table of Contents

Phase 0: Prerequisites

Concepts to Know

The One Equation Everything Comes Back To

Resources

Phase 1: Foundations — What "Multimodal" Actually Means

Concepts to Learn

A Taxonomy Diagram

Projects

Key Insight

Resources

Phase 2: Encoders for Each Modality

Concepts to Learn

Image Patchification, Visualized

Projects

Sample Code: A Minimal ViT Patch Embedding

Key Insight

Resources

Phase 3: Contrastive Learning — CLIP and Friends

Concepts to Learn

The CLIP Training Step

Projects

Sample Code: CLIP-Style Contrastive Loss

Key Insight

Resources

Phase 4: Fusion Architectures — How Modalities Talk to Each Other

Concepts to Learn

Five Fusion Patterns Side by Side

Projects

Sample Code: A Cross-Attention Block

Key Insight

Resources

Phase 5: Vision-Language Models (VLMs)

Concepts to Learn

The Standard VLM Training Pipeline

Projects

Sample Code: A LLaVA-Style Forward Pass

Key Insight

Resources

Phase 6: Audio, Speech, and Video

Concepts to Learn

The Audio Spectrogram Pipeline

Projects

Sample Code: Mel Spectrogram with `torchaudio`

Key Insight

Resources

Phase 7: Unified and Any-to-Any Models

Concepts to Learn

Two Architectural Stances

Projects

Key Insight

Resources

Phase 8: Training at Scale — Data, Compute, and Alignment

Concepts to Learn

A Pragmatic Data Pipeline

Projects

Key Insight

Resources

Phase 9: Evaluation and Benchmarks

Concepts to Learn

A Sane Evaluation Suite for a New VLM

Projects

Key Insight

Resources

Phase 10: Frontier Topics

Vision-Language-Action (VLA) Models for Robotics

Long-Context Multimodal

Generative Unified Models

Multimodal Reasoning

Embodied and Agentic Multimodal

Multimodal Interpretability

Efficient Multimodal Inference

Safety in Multimodal Models

Resources for the Frontier

Suggested Timeline

Key Advice

Common Pitfalls to Avoid

Additional Resources

Books and Long-Form Reading