Multimodal Learning: From Beginner to Advanced
A comprehensive guide to understanding and building systems that learn from and reason across multiple modalities — text, images, audio, video, and beyond — from contrastive pretraining to modern vision-language models and unified any-to-any architectures.
"Multimodal learning" is the slice of machine learning that operates on more than one input/output type. A model that reads an image and writes a caption is multimodal. A model that listens to audio and produces text is multimodal. A model that takes a text prompt and a reference image and produces a video is very multimodal. This guide is about how those systems learn a shared representation across modalities, how to train them, and where the field is going.
Scope and boundaries
This guide owns the problem of getting two or more modalities to share one representation — aligning them, fusing them, and reasoning jointly over them. To keep the AI Learning Guides mutually exclusive and collectively exhaustive (MECE), it deliberately stops at a few borders and links forward to the guide that owns each one.
In scope — this guide owns these topics:
- Contrastive / aligned pretraining (CLIP, SigLIP, ImageBind) — making separate modality spaces comparable
- Fusion architectures — cross-attention, Q-Former, projectors, gated attention, interleaved/early fusion
- Vision-language models (VLMs) — image(+text)→text understanding, VQA, grounding, OCR-heavy models
- Audio and speech as modalities — encoders, ASR/TTS, audio LMs, neural audio codecs (no dedicated audio guide exists, so multimodal is the home for them)
- Video, audio, and image understanding — encoding non-text modalities for joint reasoning
- Unified / any-to-any models — the "one transformer for every modality" framing (Chameleon, GPT-4o, Gemini, Janus, Transfusion)
- Multimodal-specific training, data, alignment, and evaluation — the parts that differ from the single-modality recipes
Out of scope — deferred to the owning guide:
- Transformer/LLM architecture, tokenization, pretraining, post-training, and text-only agents → LLM. This guide uses a pretrained LLM as a frozen (or fine-tuned) backbone but does not re-teach how to build one.
- Generative modeling of images (VAEs, GANs, diffusion, latent diffusion, DiT, flow matching) and image tokenizers (VQ-VAE, VQ-GAN, FSQ) → Image Generation. When a multimodal model needs to produce pixels, the generation machinery lives there; here we care about the cross-modal modeling.
- Generative modeling of video (video diffusion, 3D VAEs, world models) → Video Generation. This guide owns video understanding; video synthesis is theirs.
- Vision-Language-Action models as a robot policy class, imitation learning, sim-to-real → Robotics Phase 8. We cover the multimodal-modeling side of VLAs and link out for the control side.
- RLHF / DPO / GRPO algorithm internals → RL Phase 9 and LLM Phase 5. We cover what is different about preference data with image/audio inputs.
- Serving, batching, KV-cache, and inference latency for deployed multimodal models → Inference Systems. Kernel-level performance and quantization → AI Hardware.
- Tensor, autograd, mixed-precision, distributed-training, and training-loop fundamentals → PyTorch Deep Dive.
When this guide touches an out-of-scope topic, it does so only to the depth needed to make a multimodal modeling decision, and it links to the owning guide.
Table of Contents
- Phase 0: Prerequisites
- Phase 1: Foundations — What "Multimodal" Actually Means
- Phase 2: Encoders for Each Modality
- Phase 3: Contrastive Learning — CLIP and Friends
- Phase 4: Fusion Architectures — How Modalities Talk to Each Other
- Phase 5: Vision-Language Models (VLMs)
- Phase 6: Audio, Speech, and Video
- Phase 7: Unified and Any-to-Any Models
- Phase 8: Training at Scale — Data, Compute, and Alignment
- Phase 9: Evaluation and Benchmarks
- Phase 10: Frontier Topics
- Suggested Timeline
- Key Advice
- Common Pitfalls to Avoid
- Additional Resources
- Glossary
Phase 0: Prerequisites
Multimodal learning sits on top of two stacks (NLP and vision) and borrows from a third (audio). You do not need to be an expert in all of them, but the foundations cannot be skipped.
Concepts to Know
- Transformers: self-attention, cross-attention, positional embeddings, layer norm, residual connections — from the LLM guide Phase 2
- Vision basics: convolution, ViT (Vision Transformer), how an image becomes a sequence of tokens
- Text basics: tokenization (BPE, SentencePiece), language modeling, the next-token-prediction objective — see LLM Phase 1
- PyTorch fluency: nn.Module, autograd, mixed precision, basic training loops (see PyTorch Deep Dive)
- Embedding spaces: what an L2-normalized vector looks like, cosine similarity, the geometry of high-dimensional spaces
- Contrastive intuition (helpful but not required yet): pulling similar things together, pushing different things apart
What you do not need yet. You don't need diffusion or GAN internals to start — those are the Image Generation guide's territory, and you only need them once you want a model that outputs pixels (Phases 7 and 10 here). You also don't need RLHF internals; Phase 8 covers only what's different about multimodal preference data.
The One Equation Everything Comes Back To
Multimodal learning = map every modality into a SHARED representation,
then either:
(a) compare them (retrieval, classification), or
(b) generate one from the other (caption, image-from-text), or
(c) reason jointly over all of them (VQA, agents).
The shared representation can be:
- a single vector per item (CLIP-style)
- a sequence of tokens (LLaVA / VLM-style)
- a discrete token alphabet (any-to-any, e.g. Chameleon, Gemini-style)
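Option (a), comparing items in a shared space, can be sketched in a few lines. This is a dependency-free illustration, not a production retrieval system: the 3-dimensional vectors are hypothetical stand-ins for real encoder outputs, and the function names are made up for this example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Hypothetical embeddings: one image vector and three caption vectors.
image_vec = [0.9, 0.1, 0.0]
caption_vecs = [[0.0, 1.0, 0.0], [1.0, 0.2, 0.0], [0.0, 0.0, 1.0]]
ranking = retrieve(image_vec, caption_vecs)
print(ranking)
```

With real models the only change is where the vectors come from; the comparison step stays this simple, which is why single-vector representations are cheap to search.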
Resources
- Transformers from Scratch — Brandon Rohrer
- Andrej Karpathy — Let's build the GPT tokenizer
- ViT paper (Dosovitskiy et al., 2020) — required reading
- CLIP paper (Radford et al., 2021) — read this before Phase 3
Phase 1: Foundations — What "Multimodal" Actually Means
Before architectures, get the conceptual map right. "Multimodal" is a fuzzy umbrella term that hides at least four distinct problems.
Concepts to Learn
- The four canonical tasks:
- Cross-modal retrieval — given an image, find the matching caption (or vice versa)
- Cross-modal generation — given text, produce an image (or audio, or video). The generative half lives in Image Generation / Video Generation; here we care about the conditioning and the cross-modal interface.
- Multimodal understanding — given image + text, answer a question (VQA, captioning)
- Joint/any-to-any — flexibly map any subset of modalities to any other
- Modality gap — even well-trained models keep text and image embeddings in noticeably different regions of the shared space
- Alignment vs fusion — alignment is making spaces comparable; fusion is combining information
- Early vs late fusion:
- Late fusion: encode each modality separately, combine at the end (CLIP-style)
- Early fusion: interleave modalities into one sequence from the start (Chameleon-style)
- Middle fusion: encode separately, then attend across (Flamingo, LLaVA)
- Pretraining objectives: contrastive, masked, generative, and combinations
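The modality gap from the list above can be made measurable. One simple proxy, assumed here for illustration, is the Euclidean distance between the centroid of the L2-normalized image embeddings and the centroid of the L2-normalized text embeddings. The toy 2-d vectors below are hypothetical; with a real model you would pass in encoder outputs.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def modality_gap(image_embs, text_embs):
    """Distance between the centroids of the two normalized embedding clouds."""
    img_c = centroid([l2_normalize(v) for v in image_embs])
    txt_c = centroid([l2_normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_c, txt_c)))

# Hypothetical embeddings: images cluster near the x-axis, captions near the y-axis.
image_embs = [[1.0, 0.1], [1.0, -0.1]]
text_embs = [[0.1, 1.0], [-0.1, 1.0]]
gap = modality_gap(image_embs, text_embs)
print(round(gap, 4))
```

A gap of zero would mean the two clouds share a center; a clearly non-zero value on matched image-caption data is the separation the PCA project below makes visible.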
A Taxonomy Diagram
Multimodal models fall into three families:

| | Dual-encoder (alignment) | Encoder-decoder (understanding) | Unified (any-to-any) |
|---|---|---|---|
| Examples | CLIP, SigLIP, ImageBind | Flamingo, BLIP-2, LLaVA, Qwen2.5-VL, PaliGemma | Chameleon, Gemini, GPT-4o, Janus, Transfusion |
| Best at | retrieval, zero-shot classification, data filtering | VQA, captioning, dialogue, reasoning | everything, but expensive to train |
Projects
| Project | Description | Difficulty |
|---|---|---|
| Modality survey | Pick 5 multimodal papers, classify each by fusion type and objective; write a one-paragraph summary of each | ⭐ |
| Visualize the modality gap | Encode 1k images and 1k captions with a pretrained CLIP; PCA them; observe the separation | ⭐⭐ |
| Toy retrieval | Build a tiny retrieval system: encode a few hundred images and captions with CLIP, retrieve top-5 for each query | ⭐⭐ |
Key Insight
The choice of fusion strategy determines what your model can do. Dual encoders (CLIP) are fast and great at retrieval but can't reason or generate. Encoder-decoder VLMs (LLaVA) reason and generate text but not images. Unified models (Chameleon, GPT-4o) do everything but need vastly more data and compute. There is no free lunch; pick the architecture that matches your task.
Resources
- On the Opportunities and Risks of Foundation Models (Stanford CRFM) — long but the multimodal sections are excellent context
- A Survey on Multimodal Large Language Models (Yin et al., 2024)
- A Survey on Vision-Language Models (Zhang et al., 2024)
- Hugging Face — Vision Language Models Explained
Phase 2: Encoders for Each Modality
Before you can fuse modalities, you have to encode each one into vectors. This phase is about the representation building blocks — the encoders that turn pixels, waveforms, and frames into sequences a transformer can align.
MECE note. This phase teaches encoders as feature extractors for alignment and understanding. The generative backbones that turn latents back into pixels (U-Nets, DiTs) belong to Image Generation and Video Generation. The discrete tokenizers (VQ-VAE, VQ-GAN, FSQ, MagViT-v2) are taught in Image Generation Phase 3; here we use them (Phase 7) and only summarize.
Concepts to Learn
- Image encoders:
- CNNs (ResNet, EfficientNet) — still useful, especially for small models
- Vision Transformers (ViT) — the modern default; how patchification works
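- Patch size and resolution tradeoffs: smaller patches give more tokens and finer detail, but self-attention cost grows quadratically with token count (halving the patch side quadruples the tokens and multiplies attention cost by roughly 16)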
- SigLIP / SigLIP 2 / DFN / EVA-CLIP / DINOv2 — modern improvements over the original CLIP vision tower; DINOv2 is the dominant self-supervised (non-contrastive) choice
- Text encoders:
- BERT-style bidirectional encoders (for dual-encoder models)
- Decoder-only LLMs as encoders (just take hidden states)
- The tradeoff: bidirectional sees both sides but is non-causal; decoder-only is causal but composes naturally with generation
- Audio encoders:
- Mel spectrograms — the standard input representation
- Whisper-style encoders for speech
- HuBERT, wav2vec 2.0 for self-supervised speech representations learned from raw waveforms
- Neural audio codecs (EnCodec, SoundStream, DAC, Mimi) for discrete audio tokens
- Video encoders:
- Frame-by-frame ViT (cheap but loses motion)
- 3D convolutions or 3D-ViT for spatiotemporal patches
- Temporal pooling, hierarchical encoding
Image Patchification, Visualized
Input image: 224×224×3
Split into 16×16 patches: 14 × 14 = 196 patches
Each patch: 16×16×3 = 768 numbers
Linear projection → embedding of dim D (e.g., 768)
Add positional embedding (learned or sinusoidal)
+ [CLS] token at position 0
→ sequence of 197 tokens, each D-dim
→ feed to a stack of transformer blocks
→ output is 197 contextualized vectors
→ pool (CLS token, mean, attention pool) → single image vec
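The arithmetic in the walkthrough above generalizes to any image and patch size. The sketch below is a small helper written for this guide (the function names are not from any library): it counts tokens and reports self-attention cost relative to a 16×16 patch baseline, using the assumption that attention cost scales with the square of the token count.

```python
def vit_token_count(img_size, patch_size, cls_token=True):
    """Number of tokens a square image produces after patchification."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    n_patches = (img_size // patch_size) ** 2
    return n_patches + (1 if cls_token else 0)

def relative_attention_cost(img_size, patch_size, baseline_patch=16):
    """Self-attention cost relative to the baseline patch size (cost ~ tokens**2)."""
    n = vit_token_count(img_size, patch_size, cls_token=False)
    n0 = vit_token_count(img_size, baseline_patch, cls_token=False)
    return (n / n0) ** 2

for patch in (32, 16, 8):
    tokens = vit_token_count(224, patch)
    cost = relative_attention_cost(224, patch)
    print(f"patch={patch:>2}  tokens={tokens:>4}  attention cost vs 16px: {cost}x")
```

Going from 16-pixel to 8-pixel patches quadruples the patch count, so the attention term grows about 16 times; this is the tradeoff the patch-size study project measures empirically.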
Projects
| Project | Description | Difficulty |
|---|---|---|
| Implement ViT from scratch | Patchify, linear-project, transformer blocks, CLS pooling — train on CIFAR-10 | ⭐⭐⭐ |
| Compare encoders | Take ResNet-50, ViT-B/16, SigLIP, and DINOv2 — extract features for 1k ImageNet images, compare via linear probe | ⭐⭐ |
| Mel spectrogram pipeline | Take a 10-second .wav file, produce a mel spectrogram, feed through a small CNN | ⭐⭐ |
| Whisper encoder reuse | Use just the encoder of Whisper to get audio embeddings; build a simple audio classifier on top | ⭐⭐⭐ |
| Patch-size study | Train ViT with patches of 8, 16, 32 — measure accuracy and FLOPs | ⭐⭐⭐ |
Sample Code: A Minimal ViT Patch Embedding
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.n_patches = (img_size // patch_size) ** 2
        # A conv with stride=patch_size is the standard trick:
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_patches + 1, embed_dim))

    def forward(self, x):
        # x: (B, 3, H, W)
        x = self.proj(x)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)    # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)      # (B, N+1, D)
        return x + self.pos_embed
Key Insight
The patchification trick, using a single strided convolution to both split the image into patches and project them to the embedding dimension, is one of those "obvious in hindsight" moves that made ViT practical. It is mathematically equivalent to the unfold-then-linear approach (the conv kernel is the linear layer's weight matrix, reshaped) and is usually faster in practice because it never materializes the unfolded patches. The deeper lesson: every modality reduces to "turn it into a sequence of D-dimensional vectors," after which a transformer doesn't care whether those vectors came from pixels, waveforms, or words.
Resources
- ViT paper
- SigLIP paper — sigmoid loss replaces softmax, scales better
- DINOv2 paper — strong self-supervised vision features
- Whisper paper — best read on a modern speech encoder
- EnCodec paper — for discrete audio tokens
- VideoMAE paper — masked autoencoding for video
Phase 3: Contrastive Learning — CLIP and Friends
CLIP is the model that made modern multimodal learning take off, and contrastive alignment is this guide's home turf — the Image Generation guide explicitly defers CLIP/contrastive pretraining here. Understanding it cold pays compound interest.
Concepts to Learn
- The contrastive objective: pull matched (image, caption) pairs together, push unmatched pairs apart
- InfoNCE loss — the workhorse contrastive loss; relationship to mutual information
- The temperature parameter τ — what it controls, why it's learnable, and why it matters more than you'd think
- Batch size in contrastive learning — why bigger is dramatically better, and the tricks (memory bank, MoCo, distributed gathering) to fake it cheaply
- Hard negatives — easy negatives don't teach the model anything; mining hard ones helps
- CLIP variants:
- SigLIP / SigLIP 2 — sigmoid (per-pair) loss instead of softmax (over-batch) loss; works at smaller batch sizes
- ALIGN — Google's CLIP-equivalent, trained on noisier web data
- OpenCLIP, EVA-CLIP, DFN, MetaCLIP — community and Meta scaling efforts
- ImageBind — extends contrastive learning to 6 modalities (text, image, audio, depth, thermal, IMU)
- Zero-shot classification — how CLIP does classification without ever seeing labels
- CLIP as a filter — using CLIP scores to clean web-scale training data (e.g., LAION); the same trick reappears for data curation in Image Generation Phase 10 and Video Generation Phase 10
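Zero-shot classification from the list above reduces to "embed one prompt per class, pick the closest". The sketch below assumes both embeddings are already L2-normalized; the 2-d vectors are hypothetical stand-ins for real text-encoder and image-encoder outputs, and the scale of 100 is only an illustrative value for the learned logit scale.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, class_text_embs, logit_scale=100.0):
    """Score an image against one text embedding per class; inputs are unit vectors."""
    logits = [logit_scale * sum(a * b for a, b in zip(image_emb, t))
              for t in class_text_embs]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

class_names = ["cat", "dog"]
prompts = [f"a photo of a {name}" for name in class_names]  # prompt template

# Hypothetical stand-ins for text_encoder(prompt) and image_encoder(image).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [0.6, 0.8]

best, probs = zero_shot_classify(image_emb, text_embs)
print(class_names[best], [round(p, 4) for p in probs])
```

No label from the target dataset is used in training; the class names enter only through the prompts, which is why the prompt template is worth tuning.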
The CLIP Training Step
Batch of N (image, text) pairs:
images: [I₁, I₂, ..., I_N] → image encoder → [v₁, v₂, ..., v_N]
captions: [T₁, T₂, ..., T_N] → text encoder → [u₁, u₂, ..., u_N]
L2-normalize both, then compute similarity matrix:
S = (V · Uᵀ) / τ # shape (N, N)
Diagonal entries S[i, i] are matched pairs (should be high).
Off-diagonal entries S[i, j], j≠i are unmatched (should be low).
Loss = (cross_entropy(S, labels=identity, axis=rows)
+ cross_entropy(S, labels=identity, axis=cols)) / 2
i.e. for each image, the matched caption should be the most similar
out of all N captions in the batch — and vice versa.
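The loss above normalizes each row and column over the whole batch. For contrast, a SigLIP-style sigmoid loss treats every (image, text) cell as an independent binary decision. The sketch below is a dependency-free illustration of that idea, assuming unit-norm embeddings; the scale `t=10.0` and bias `b=-10.0` are fixed example values here, whereas in a trained model both are learned.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(img, txt, t=10.0, b=-10.0):
    """Per-pair binary loss: label +1 on the diagonal, -1 everywhere else."""
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * c for a, c in zip(img[i], txt[j]))
            z = 1.0 if i == j else -1.0
            total += -log_sigmoid(z * (t * sim + b))
    return total / n

# Hypothetical unit-norm embeddings for a batch of two pairs.
img = [[1.0, 0.0], [0.0, 1.0]]
aligned = sigmoid_pairwise_loss(img, [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_pairwise_loss(img, [[0.0, 1.0], [1.0, 0.0]])
print(round(aligned, 4), round(swapped, 4))
```

Because no term depends on a batch-wide normalizer, each pair contributes to the loss on its own, which is the property behind the claim that this loss behaves well at smaller batch sizes.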
Projects
| Project | Description | Difficulty |
|---|---|---|
| Implement InfoNCE | Write the symmetric contrastive loss from scratch; verify gradients | ⭐⭐ |
| Tiny CLIP | Train a small CLIP on Flickr30k or COCO captions; image encoder = small ViT, text encoder = small transformer | ⭐⭐⭐⭐ |
| Zero-shot ImageNet | Use a pretrained CLIP to classify ImageNet without ever training on its labels; tune the prompt template | ⭐⭐ |
| Hard-negative mining | Train CLIP with mined hard negatives vs random; measure retrieval improvement | ⭐⭐⭐⭐ |
| Temperature ablation | Vary τ from 0.01 to 1.0; observe accuracy and the geometry of the embedding space | ⭐⭐⭐ |
| Data filtering with CLIP | Filter a noisy image-text dataset by CLIP similarity score; train a downstream model on filtered vs unfiltered | ⭐⭐⭐ |
Sample Code: CLIP-Style Contrastive Loss
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # L2-normalize
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Cosine similarity matrix, scaled by learned temperature
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T
    n = image_features.size(0)
    labels = torch.arange(n, device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2

# logit_scale is usually parameterized as exp(theta) where theta is learned,
# initialized to ln(1/0.07) ≈ 2.66, clamped to ≤ ln(100) for stability.
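The parameterization described in that closing comment can be checked numerically. The sketch below is a plain-Python illustration (the helper name `logit_scale_from_theta` is made up for this example); it shows the initial scale, the clamp, and how a larger scale sharpens the softmax over the same cosine similarities.

```python
import math

def logit_scale_from_theta(theta, max_scale=100.0):
    """logit_scale = exp(theta), clamped so the scale never exceeds max_scale."""
    return min(math.exp(theta), max_scale)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

theta_init = math.log(1 / 0.07)            # about 2.66
scale_init = logit_scale_from_theta(theta_init)   # equals 1 / 0.07, about 14.3

# Hypothetical cosine similarities between one image and three captions.
cosines = [0.30, 0.25, 0.10]
probs_init = softmax([scale_init * c for c in cosines])
probs_sharp = softmax([100.0 * c for c in cosines])
print([round(p, 3) for p in probs_init])
print([round(p, 3) for p in probs_sharp])
```

Cosine similarities live in a narrow range, so without a large scale the softmax stays nearly flat and gradients are weak; the clamp keeps the scale from growing without bound during training.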
Key Insight
CLIP's most important contribution was not the architecture; it was the realization that the internet is a labeled dataset. Every image with an alt-text or surrounding caption is a free supervised example. The contrastive objective turned this firehose of noisy data into a useful signal. The architecture (two transformers + cosine similarity) is almost an afterthought.
Resources
- CLIP paper (Radford et al., 2021)
- SigLIP paper (Zhai et al., 2023) and SigLIP 2 (2025)
- OpenCLIP — reproducible CLIP training in the open
- ImageBind — contrastive across 6 modalities
- Lucas Beyer — Lecture on CLIP and SigLIP
Phase 4: Fusion Architectures — How Modalities Talk to Each Other
Once each modality is encoded, you have to combine them. There are more options than people realize.
Concepts to Learn
- Concatenation — the trivial baseline; works surprisingly well sometimes
- Cross-attention — one modality's tokens attend to another's; the most common fusion in modern VLMs
- Q-Former (BLIP-2) — learnable queries that distill an image into a fixed number of tokens for an LLM
- Perceiver IO — cross-attention with a small latent set, modality-agnostic
- Adapter modules — small layers inserted into a frozen backbone, trained to adapt to a new modality
- Gated cross-attention (Flamingo) — adds new cross-attention layers between LLM layers, gated so the pretrained behavior isn't broken at init
- Projector-only fusion (LLaVA) — just a linear or MLP projection from image features to LLM token space; surprisingly effective
- Interleaved sequences — treat image tokens and text tokens as one sequence (early fusion)
Five Fusion Patterns Side by Side
1. Concatenation (simplest):
[text emb] ─┐
├──► classifier
[image emb]─┘
2. Cross-attention (Flamingo-style):
LLM block ──► cross-attn(text tokens, image tokens) ──► next LLM block
▲
│
image encoder features
3. Q-Former (BLIP-2):
frozen image encoder ──► image features
32 learned queries ──► cross-attn over image features ──► 32 output vectors
32 output vectors ──► projected ──► fed into LLM as 32 "tokens"
4. Projector + interleaved tokens (LLaVA):
image encoder ──► linear/MLP ──► N "image tokens"
│
▼
LLM sees: [<system>...<image_tokens>...<user_text>...<assistant>]
5. Early fusion (Chameleon, native multimodal):
image ──► VQ tokenizer ──► discrete image tokens
│
▼
unified sequence: [text tokens][image tokens][text tokens]...
trained with one next-token prediction loss over the whole alphabet
Projects
| Project | Description | Difficulty |
|---|---|---|
| Concat vs cross-attn | On a small VQA task, compare concatenation, cross-attention, and projector fusion; report accuracy and parameter counts | ⭐⭐⭐ |
| Implement Q-Former | A small Q-Former with 16 learned queries; train on COCO captions | ⭐⭐⭐⭐ |
| Adapter for a new modality | Add a depth-image input to a frozen CLIP image encoder via an adapter layer | ⭐⭐⭐ |
| Perceiver IO | Implement the Perceiver IO architecture on a small toy task | ⭐⭐⭐⭐ |
| Gated cross-attention | Implement Flamingo's gated mechanism; verify that at init, output equals the unimodal LLM | ⭐⭐⭐⭐ |
Sample Code: A Cross-Attention Block
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """LLM hidden states attend to image features."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # Flamingo trick

    def forward(self, x_text, x_image):
        # x_text:  (B, T, D), the query (LLM hidden states)
        # x_image: (B, N, D), the key/value (image features, projected to dim D)
        q = self.norm_q(x_text)
        kv = self.norm_kv(x_image)
        out, _ = self.attn(q, kv, kv)
        return x_text + self.gate.tanh() * out  # tanh(0) = 0 at init → identity
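The identity-at-init property of the gate can be verified without any framework. The sketch below is a deliberately simplified stand-in for the block above: single-head attention with identity projections and no layer norm (both simplifications are assumptions for brevity), applied to hypothetical 2-d token vectors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Single-head attention with identity projections (illustrative simplification)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[c] for w, kv in zip(weights, keys_values))
                    for c in range(d)])
    return out

def gated_cross_attention(x_text, x_image, gate):
    """Residual update scaled by tanh(gate); gate=0 leaves the text stream untouched."""
    attended = cross_attend(x_text, x_image)
    g = math.tanh(gate)
    return [[x + g * a for x, a in zip(x_row, a_row)]
            for x_row, a_row in zip(x_text, attended)]

x_text = [[1.0, 0.0], [0.0, 1.0]]                  # two text tokens
x_image = [[0.5, 0.5], [2.0, -1.0], [0.0, 3.0]]    # three image tokens

at_init = gated_cross_attention(x_text, x_image, gate=0.0)
opened = gated_cross_attention(x_text, x_image, gate=1.0)
print(at_init == x_text, opened != x_text)
```

Starting from an exact identity means the pretrained LLM's behavior is preserved on step zero, and the model learns how much visual signal to admit as the gate moves away from zero.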
Key Insight
The progression from BLIP-2 → LLaVA → Chameleon is a story of simplification. BLIP-2 used a complex Q-Former with a multi-stage training recipe. LLaVA showed that a single linear projection works almost as well if you have good instruction data. Chameleon showed that you don't even need separate encoders — just tokenize everything. The lesson: in deep learning, "simpler + more data" usually wins.
Resources
- Flamingo paper (Alayrac et al., 2022)
- BLIP-2 paper (Li et al., 2023)
- LLaVA paper (Liu et al., 2023)
- Perceiver IO paper
- Chameleon paper (Meta, 2024)
Phase 5: Vision-Language Models (VLMs)
The current workhorse class. A VLM takes images (+ text) in and produces text. Most "multimodal AI" products you can name are VLMs.
Concepts to Learn
- The standard recipe: pretrained vision encoder + projector + pretrained LLM → train projector first, then jointly fine-tune. (The vision encoder comes from Phase 2/3; the LLM comes from the LLM guide — a VLM is mostly glue and data.)
- Image preprocessing for VLMs:
- Fixed resolution vs dynamic resolution / AnyRes (LLaVA-NeXT, InternVL2): tile the image to handle any aspect ratio
- Native-resolution ViT (Qwen2-VL's NaViT-style approach) — process the image at its true resolution instead of a fixed grid
- Token budget per image — typically 256 to a few thousand image tokens
- Instruction tuning for VLMs: the "LLaVA-Instruct" recipe, where a text-only GPT-4 writes multimodal instructions from image captions and bounding boxes
- Visual question answering (VQA) — classic benchmark task
- OCR-heavy VLMs — Donut, Nougat, GOT — for documents
- Grounding — output bounding boxes or pixel coordinates; teaching the LLM to "point" (Molmo's pointing supervision is a clean recent example)
- Modern frontier VLMs (2025–2026):
- Qwen2.5-VL / Qwen3-VL — Alibaba, strong open VLM family
- InternVL2.5 / InternVL3 — Shanghai AI Lab
- PaliGemma 2 — Google's small, strong VLM family
- Molmo — Allen AI, open weights and open data, strong grounding/pointing
- Pixtral — Mistral's VLM
- Llama 3.2 Vision / Llama 4 — Meta's official VLM line (Llama 4 is natively multimodal, MoE, early-fusion)
- DeepSeek-VL2 — MoE VLM
- Closed: GPT-4o, Claude (with vision), Gemini 2.0/2.5
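The interaction between tiling and the per-image token budget is easy to underestimate. The sketch below is a simplified tiling rule written for this guide, not the exact algorithm of any released model: it picks the tile grid whose aspect ratio best matches the image, adds one downscaled overview tile, and multiplies by an assumed 576 tokens per tile (a 24×24 patch grid).

```python
def choose_grid(width, height, max_tiles=6):
    """Pick (cols, rows) with at most max_tiles tiles, closest to the image aspect ratio."""
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            err = abs(cols / rows - target)
            more_tiles = cols * rows > best[0] * best[1]
            # On ties, prefer the grid with more tiles to keep more detail.
            if err < best_err or (err == best_err and more_tiles):
                best, best_err = (cols, rows), err
    return best

def image_token_budget(width, height, tokens_per_tile=576, max_tiles=6, thumbnail=True):
    """Total visual tokens: one block per tile plus an optional overview tile."""
    cols, rows = choose_grid(width, height, max_tiles)
    n_tiles = cols * rows + (1 if thumbnail else 0)
    return n_tiles * tokens_per_tile

for w, h in [(1344, 448), (800, 800)]:
    print((w, h), choose_grid(w, h), image_token_budget(w, h))
```

A wide banner and a square photo land on very different grids, and both cost thousands of tokens, which is why the token budget per image is a first-class design decision for OCR-heavy workloads.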
The Standard VLM Training Pipeline
| | Stage 1 (alignment) | Stage 2 (visual instruction tuning) |
|---|---|---|
| Frozen | vision encoder, LLM | vision encoder (often) |
| Trained | projector only | projector + LLM |
| Data | ~500k–10M image-caption pairs | ~500k–5M instruction pairs in conversational (image, Q, A) format; sources: LLaVA-Instruct, ShareGPT4V, Cauldron, custom |
| Result | LLM "speaks image" | VLM that follows visual instructions |
Projects
| Project | Description | Difficulty |
|---|---|---|
| LLaVA from scratch | Connect a CLIP-ViT-L/14 to a 1–3B LLM with an MLP projector; do stage-1 alignment on COCO captions | ⭐⭐⭐⭐ |
| Visual instruction tuning | Fine-tune the above on the LLaVA-Instruct dataset; evaluate on a few VQA benchmarks | ⭐⭐⭐⭐⭐ |
| Dynamic resolution | Implement AnyRes tiling; verify it improves OCR-heavy benchmarks | ⭐⭐⭐⭐ |
| Grounding head | Add bounding-box outputs to a VLM via a special <box> token vocabulary | ⭐⭐⭐⭐ |
| Compare projectors | Linear vs 2-layer MLP vs Q-Former on the same downstream task; report quality and speed | ⭐⭐⭐ |
| Inference optimization | Take an open VLM, serve it with vLLM or SGLang; measure tokens/sec at different image counts (deep dive: Inference Systems) | ⭐⭐⭐ |
Sample Code: A LLaVA-Style Forward Pass
import torch
import torch.nn as nn

def splice_in_visual_tokens(text_emb, visual_emb, image_token_idx):
    # Illustrative helper (an assumption, not LLaVA's actual code): every sequence
    # holds exactly one <image> placeholder, at the same position image_token_idx.
    before = text_emb[:, :image_token_idx]
    after = text_emb[:, image_token_idx + 1:]
    return torch.cat([before, visual_emb, after], dim=1)  # (B, T - 1 + N_img, llm_dim)

class TinyVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder  # frozen in stage 1
        self.llm = llm                        # frozen in stage 1
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image, input_ids, image_token_idx):
        # 1. Encode image → sequence of visual tokens
        v = self.vision_encoder(image)  # assumed to return (B, N_img, vision_dim)
        v = self.projector(v)           # (B, N_img, llm_dim)
        # 2. Get text embeddings
        text_emb = self.llm.get_input_embeddings()(input_ids)  # (B, T, llm_dim)
        # 3. Replace the placeholder <image> token with the visual tokens
        merged = splice_in_visual_tokens(text_emb, v, image_token_idx)
        # 4. Run the LLM over the merged sequence
        return self.llm(inputs_embeds=merged).logits
Key Insight
The biggest difference between a "good" and "great" VLM is rarely the architecture; it's the data. The projector is trivial. The vision encoder and LLM are both pretrained. What separates Qwen2.5-VL from LLaVA-1.5 is largely millions of carefully curated visual instructions and high-quality OCR data. If you want to build a competitive VLM, budget most of your effort (as a rough guide, around 70%) for data, not modeling.
Resources
- LLaVA paper and LLaVA-1.5
- Qwen2-VL paper and Qwen2.5-VL
- PaliGemma paper
- Molmo paper (Allen AI, 2024) — open weights and open data
- InternVL2.5 / InternVL3
- The Cauldron dataset (HF) — 50 VLM instruction datasets unified
Phase 6: Audio, Speech, and Video
Vision is the most popular non-text modality, but audio and video are catching up fast. There is no dedicated audio guide in this collection, so this phase is the home for audio and speech. For video, this phase owns understanding (encoding video for reasoning); video synthesis belongs to Video Generation.
Concepts to Learn
- Audio representations:
- Raw waveform — high resolution but very long sequences
- Mel spectrogram — compact, perceptually grounded, the usual choice
- Discrete audio tokens (EnCodec, SoundStream, DAC, Mimi) — the audio analog of BPE; enables LM-style modeling of audio
- Speech recognition (ASR) — Whisper, Conformer; encoder-decoder transformers on mel spectrograms
- Text-to-speech (TTS) — non-autoregressive (FastSpeech), autoregressive (Tortoise, VALL-E), and modern hybrid neural codec approaches
- Music and general audio generation: MusicGen and AudioGen (autoregressive LMs over neural codec tokens); AudioLDM and Stable Audio (latent diffusion, the audio analog of diffusion image models)
- Speech LLMs and full-duplex voice — taking the LLM-with-projector recipe and replacing the vision encoder with an audio encoder (Qwen2-Audio); real-time, full-duplex speech (Moshi, GPT-4o voice mode, Gemini Live)
- Video as a modality:
- Frame-by-frame ViT encoding — cheap but discards motion
- Spatiotemporal transformers — 3D attention over (frame, height, width)
- Video tokenization (MagViT-v2, OmniTokenizer) — compress video into discrete tokens (the same tokenizers used for video generation)
- Video-language models: Video-LLaVA, VideoChat, VideoLLaMA, LLaVA-Video, Qwen2.5-VL (handles video natively)
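The cheapest way to hand video to an image-based VLM is to sample a fixed number of frames. The sketch below shows one common choice, taking the center frame of each equal-length segment; the function names and the 256 tokens-per-frame figure are illustrative assumptions, not values from a specific model.

```python
def sample_frame_indices(num_frames, num_samples=8):
    """Pick the center frame of each of num_samples equal segments of the video."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_frames <= num_samples:
        return list(range(num_frames))  # short clip: keep every frame
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

def video_token_budget(num_frames, num_samples=8, tokens_per_frame=256):
    """Visual tokens the LLM sees if each sampled frame is encoded independently."""
    return len(sample_frame_indices(num_frames, num_samples)) * tokens_per_frame

indices = sample_frame_indices(240, num_samples=8)   # e.g. 8 s of video at 30 fps
print(indices, video_token_budget(240))
```

Uniform sampling keeps the token count independent of clip length, at the price of missing anything that happens between the sampled frames; that is the motion information spatiotemporal encoders are designed to recover.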
The Audio Spectrogram Pipeline
Raw audio waveform (16 kHz, 1 channel):
[-0.02, 0.01, 0.05, -0.03, ...] ← ~960k samples for one minute (16,000 × 60)
Short-Time Fourier Transform (STFT):
Sliding window (e.g., 25 ms window, 10 ms hop) → magnitude spectrogram (freq × time)
Mel-scale filterbank:
Project linear-frequency bins onto a perceptually-spaced mel scale (e.g., 80 mel bins)
Log:
log(mel_spectrogram + eps) → range ~ [-10, 10]
Result shape: (T, 80) ← T is the number of time frames
↓
Treat as a "1D image" with 80 channels, feed to a convolutional or transformer encoder.
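Two pieces of arithmetic from the pipeline above are worth working through: where the 80 mel bands sit on the frequency axis, and how many time frames T a clip produces. The sketch below uses the HTK-style mel formula, which is one of several conventions in use, and assumes center padding for the frame count; both choices are assumptions of this example.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale (one common convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels=80, f_min=0.0, f_max=8000.0):
    """Edge frequencies in Hz of n_mels triangular filters, evenly spaced in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

def num_frames(num_samples, hop_length=160):
    """Frame count assuming the signal is center-padded before the STFT."""
    return 1 + num_samples // hop_length

edges = mel_band_edges()
print(f"first band width: {edges[1] - edges[0]:.1f} Hz")
print(f"last band width:  {edges[-1] - edges[-2]:.1f} Hz")
print("frames for 10 s at 16 kHz:", num_frames(10 * 16000))
```

The bands are narrow at low frequencies and wide at high ones, which is the "perceptually spaced" part; and a 10 ms hop gives roughly 100 frames per second, so T grows linearly with clip length.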
Projects
| Project | Description | Difficulty |
|---|---|---|
| Mel spectrogram from scratch | Implement STFT and a mel filterbank; visualize a 10-second clip | ⭐⭐ |
| Whisper fine-tune | Fine-tune Whisper-small on a low-resource language or custom domain | ⭐⭐⭐ |
| EnCodec tour | Use EnCodec to encode/decode audio at multiple bandwidths; listen to the reconstructions | ⭐⭐ |
| Speech LLM | Glue an audio encoder to a small LLM with a projector; train on captioned AudioSet clips (e.g., AudioCaps) | ⭐⭐⭐⭐⭐ |
| Video frame VLM | Sample 8 frames from a video, treat them as 8 images for a VLM, do video QA | ⭐⭐⭐ |
| Native video model | Use spatiotemporal patches (TubeViT-style) and train a small video classifier | ⭐⭐⭐⭐ |
Sample Code: Mel Spectrogram with torchaudio
import torch
import torchaudio
import torchaudio.transforms as T

waveform, sr = torchaudio.load("audio.wav")  # (channels, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

mel = T.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    hop_length=160,  # 10 ms hop
    n_mels=80,
)(waveform)  # (channels, n_mels, time)
log_mel = torch.log(mel + 1e-6)
# log_mel is now ready to feed to an encoder: same shape semantics as an image with 80 channels.
Key Insight
Once you tokenize a modality — turn it into a discrete sequence with a fixed vocabulary — it becomes "just another language" for a transformer. This is why neural audio codecs are such a big deal: they let you do language-model-style generation on audio. The same applies to images (VQ-VAE → discrete image tokens, Image Generation Phase 3), video (MagViT-v2, Video Generation Phase 5), and even actions (in Robotics, action tokenizers). The unified-token view is the path to true any-to-any models — the subject of Phase 7.
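To see what "just another language" costs in sequence length, it helps to count codec tokens. The sketch below is simple arithmetic for a residual-codebook codec; the example settings (75 frames per second, 8 codebooks of 1024 entries) are illustrative values in the range reported for EnCodec-style codecs, not a specification of any particular checkpoint.

```python
import math

def codec_token_stats(frame_rate_hz, num_codebooks, codebook_size, seconds):
    """Token count and bitrate for a codec that emits num_codebooks codes per frame."""
    bits_per_token = math.log2(codebook_size)
    tokens_per_second = frame_rate_hz * num_codebooks
    return {
        "tokens": int(tokens_per_second * seconds),
        "bitrate_bps": tokens_per_second * bits_per_token,
    }

stats = codec_token_stats(frame_rate_hz=75, num_codebooks=8,
                          codebook_size=1024, seconds=10)
print(stats)
```

Ten seconds of audio already costs thousands of tokens when every codebook is flattened into one stream, which is why audio language models care about how codebooks are interleaved or predicted in parallel.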
Resources
- Whisper paper
- EnCodec paper
- MusicGen paper
- VALL-E paper — neural codec language model for TTS
- Moshi paper (Kyutai, 2024) — full-duplex speech LM with the Mimi codec
- MagViT-v2 paper: a strong video tokenizer
- Video-LLaVA paper
Phase 7: Unified and Any-to-Any Models
The frontier, and this guide's signature territory — both Image Generation and Video Generation defer the "one transformer for every modality" framing to here. A single model that takes any combination of modalities in and produces any combination out.
Concepts to Learn
- The unified-token hypothesis — if you can tokenize every modality, you can train one model on the union. (The tokenizers themselves are built in Image Generation Phase 3; here we assemble them into a joint model.)
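- Native multimodal models (Chameleon, GPT-4o, Gemini, Llama 4): trained on all modalities jointly from the start rather than bolting a vision tower onto a finished text-only LLM. Chameleon drops the separate vision tower entirely and tokenizes images; the internals of the closed models are not public.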
- Mixture-of-Experts (MoE) — a common scaling tool for unified models; experts can specialize by modality (Llama 4, DeepSeek-VL2)
- Generation across modalities: a unified model can in principle output <image_token> sequences as easily as text tokens, then a decoder turns them into pixels
- AR + diffusion hybrids and decoupled designs: Transfusion (one transformer, next-token loss on text + diffusion loss on image patches) and Janus / Janus-Pro (decoupled visual encoders for understanding vs generation) are the 2024–2025 designs that narrow the gap with diffusion image quality while keeping a single backbone
- Omni models — single model, single inference path, all modalities in and out (GPT-4o for text/audio/vision; Qwen2.5-Omni, MiniCPM-o on the open side)
- Late-stage vs early-stage fusion at scale — the empirical evidence is increasingly that earlier fusion wins when you have enough compute
- Trade-offs: unified models lose some specialist quality; the question is whether the joint flexibility makes up for it
Two Architectural Stances
Stance A: "Bolt-on" multimodality (LLaVA, Qwen2.5-VL)
─────────────────────────────────────────────────────
[image] → [vision encoder] → [projector] → fed as tokens to a pretrained LLM
[audio] → [audio encoder] → [projector] → fed as tokens to the same LLM
Output: text only (or text + tool calls)
Pros: efficient; reuses huge pretrained LLMs; modular
Cons: can't generate non-text modalities; bottleneck at the projector
Stance B: "Native" multimodality (Chameleon, GPT-4o, Llama 4)
─────────────────────────────────────────────────────
Tokenize every modality into one shared discrete vocabulary:
text tokens [50000 entries]
image tokens [8192 entries from a VQ-VAE]
audio tokens [4096 entries from a neural codec]
One sequence: [text][image][text][audio][image]...
One transformer with one next-token prediction loss over the union.
(Variant — Transfusion: keep image patches continuous and apply a
diffusion loss on them inside the same transformer.)
Pros: any-to-any natively; one model, one loss
Cons: enormous compute; harder to leverage existing LLMs;
discrete-image-token decoder quality historically lagged diffusion
(the gap Transfusion/Janus are closing)
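The "one shared discrete vocabulary" in Stance B can be sketched as simple ID-range bookkeeping: each modality keeps its own local token IDs and gets an offset into a single global ID space. The following is a minimal illustration, not how any particular model does it. The vocabulary sizes are copied from the sketch above, and the helper names (`build_offsets`, `to_unified`, `modality_of`) are hypothetical.

```python
# Vocabulary sizes copied from the Stance B sketch; real models use other sizes.
VOCAB_SIZES = {"text": 50_000, "image": 8_192, "audio": 4_096}


def build_offsets(vocab_sizes):
    """Give each modality a contiguous ID range inside one shared vocabulary."""
    offsets, next_start = {}, 0
    for modality, size in vocab_sizes.items():
        offsets[modality] = next_start
        next_start += size
    return offsets, next_start  # next_start is the total vocabulary size


def to_unified(segments, offsets, vocab_sizes):
    """Flatten [(modality, local_ids), ...] into one list of global token IDs."""
    unified = []
    for modality, local_ids in segments:
        size = vocab_sizes[modality]
        for tok in local_ids:
            if not 0 <= tok < size:
                raise ValueError(f"{modality} token {tok} is outside [0, {size})")
            unified.append(offsets[modality] + tok)
    return unified


def modality_of(global_id, offsets, vocab_sizes):
    """Recover which modality a global token ID belongs to."""
    for modality, start in offsets.items():
        if start <= global_id < start + vocab_sizes[modality]:
            return modality
    raise ValueError(f"token {global_id} is outside the unified vocabulary")


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)
sequence = to_unified(
    [("text", [5, 7]), ("image", [0, 8191]), ("audio", [1])],
    OFFSETS,
    VOCAB_SIZES,
)
```

At generation time the same ranges run in reverse: a sampled ID that `modality_of` maps to `image` would be handed to the image decoder, while text-range IDs go to the detokenizer.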
Projects
| Project | Description | Difficulty |
|---|---|---|
| Discrete image tokens | Train a small VQ-VAE on a face dataset; verify reconstruction at 1024 tokens/image (the tokenizer recipe is Image Gen Phase 3) | ⭐⭐⭐⭐ |
| Tiny Chameleon | Tokenize images with the above VQ-VAE, interleave with text from COCO captions, train one transformer over the unified sequence | ⭐⭐⭐⭐⭐ |
| Modality balancing | Train a unified model on text+image+audio; observe and fix one modality dominating loss | ⭐⭐⭐⭐ |
| MoE for multimodal | Add a small MoE layer to a multimodal model; observe whether experts naturally specialize | ⭐⭐⭐⭐⭐ |
| Reverse direction | Take a VLM (image-in → text-out) and add image generation by training an image-token output head | ⭐⭐⭐⭐⭐ |
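For the "MoE for multimodal" project, one way to check whether experts specialize is to tally which expert each token is routed to, broken down by modality. The sketch below assumes top-1 routing and assumes you have already captured per-token router logits and modality labels from your model; the function names are hypothetical and no specific MoE library is implied.

```python
import math
from collections import Counter, defaultdict


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(gate_logits):
    """Return (expert_index, gate_probability) for a single token."""
    probs = softmax(gate_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def expert_modality_histogram(gate_logits_per_token, modalities):
    """Count, for each expert, how many tokens of each modality it received."""
    if len(gate_logits_per_token) != len(modalities):
        raise ValueError("need exactly one modality label per token")
    hist = defaultdict(Counter)
    for logits, modality in zip(gate_logits_per_token, modalities):
        expert, _ = top1_route(logits)
        hist[expert][modality] += 1
    return {expert: dict(counts) for expert, counts in hist.items()}
```

If the histogram for each expert is dominated by one modality, the experts have specialized; a roughly even split suggests the router is keying on something else, such as token position or frequency.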
Key Insight
The bet behind native multimodal models is that the same scaling laws that gave us GPT-4 from text will give us GPT-4o from text+vision+audio together — if you have enough data and compute, the model will figure out the cross-modal structure on its own. The bet behind bolt-on multimodal models is that you can get 90% of the benefit at 10% of the cost. Both bets are still being played out; as of 2026 the bolt-on architecture still wins on cost/quality for most understanding tasks, while native + AR/diffusion-hybrid architectures (GPT-4o image generation, Janus-Pro, Transfusion) are pulling ahead on the generation side and on the capability ceiling.
Resources
- Chameleon paper (Meta, 2024)
- Transfusion paper (Meta, 2024) — one transformer, next-token + diffusion losses
- Janus-Pro paper (DeepSeek, 2025) — decoupled understanding/generation encoders
- Unified-IO 2 paper
- Emu3 paper (BAAI, 2024) — next-token prediction, all modalities
- Qwen2.5-Omni paper (2025) — open any-to-any omni model
- Gemini technical report
- GPT-4o announcement and analyses
Phase 8: Training at Scale — Data, Compute, and Alignment
Multimodal models are dataset-bound long before they are compute-bound. This phase is about everything that surrounds the model itself — the parts that are specific to multimodality. (The generic distributed-training machinery is PyTorch Deep Dive Phase 7; the RLHF/DPO algorithms are RL Phase 9.)
Concepts to Learn
- Web-scale multimodal datasets:
- LAION-5B, LAION-2B-en — the canonical open image-text corpus (with all its problems)
- DataComp, COYO-700M — alternatives and successors
- OBELICS — interleaved image-text web documents
- WebLI — Google's large internal alternative
- Data filtering — most of LAION is unusable; CLIP-score filtering, NSFW filtering, dedup, OCR filtering, aesthetic filtering (the CLIP-score filter is the Phase 3 trick reused at scale)
- Synthetic captions — recaptioning web images with a strong VLM dramatically improves downstream training (the trick behind DALL-E 3, ShareGPT4V); this is shared lore with Image Gen Phase 10 and Video Gen Phase 10
- Curriculum and staged training — start with clean alignment data, then noisier scale data, then instruction data
- Modality balancing — in a unified model, if 99% of your tokens are text, the image loss will be ignored; need to upsample or reweight
- Multimodal alignment / RLHF — preference data with image inputs; sycophancy and hallucination are harder to fix when the model has multiple modalities to "hallucinate from." The algorithms (PPO, DPO, GRPO) are owned by RL Phase 9; what's multimodal-specific is the preference-data collection and the visual grounding of the reward.
- Safety: NSFW filtering, CSAM detection (mandatory), bias evaluation across demographics, hallucination benchmarks
- Data and compute budgets — typical pretraining for an open VLM uses 10⁸–10⁹ image-text pairs; native multimodal training needs roughly 10× more
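The modality-balancing bullet above names two fixes: upsampling and reweighting. Both can be sketched in a few lines. The exponent-based sampling rule below (probability proportional to `count ** alpha`) is one common heuristic and an assumption here, not something this guide prescribes; the function names are hypothetical.

```python
def sampling_probs(token_counts, alpha=0.5):
    """Sampling probability per modality, proportional to count ** alpha.

    alpha=1.0 keeps the natural data mix; alpha=0.0 samples modalities uniformly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    scaled = {m: float(n) ** alpha for m, n in token_counts.items()}
    total = sum(scaled.values())
    return {m: s / total for m, s in scaled.items()}


def token_mean_loss(losses_by_modality):
    """Plain mean over all tokens: the majority modality dominates."""
    all_losses = [x for losses in losses_by_modality.values() for x in losses]
    return sum(all_losses) / len(all_losses)


def balanced_loss(losses_by_modality, weights=None):
    """Average within each modality first, then combine across modalities."""
    if weights is None:
        weights = {m: 1.0 for m in losses_by_modality}
    weighted = sum(
        weights[m] * sum(losses) / len(losses)
        for m, losses in losses_by_modality.items()
    )
    return weighted / sum(weights[m] for m in losses_by_modality)
```

With 99 text tokens at loss 1.0 and one image token at loss 3.0, `token_mean_loss` is about 1.02, so the image error is nearly invisible, while `balanced_loss` gives 2.0 and lets the image term contribute half of the gradient signal.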
A Pragmatic Data Pipeline
Raw web crawl (e.g., Common Crawl + image URLs)
│
▼
Deduplicate (URL, perceptual hash)
│
▼
NSFW + CSAM filtering (must, not optional)
│
▼
CLIP-score filtering (keep top ~30%)
│
▼
Aspect-ratio and resolution filtering (drop tiny / extreme ratios)
│
▼
Synthetic recaption with a strong VLM (recommended)
│
▼
Aesthetic + OCR-quality scoring (task-dependent)
│
▼
Tokenize text, store as shards (WebDataset / Parquet / Arrow)
│
▼
Training-ready: ~10–20% of the original crawl, dramatically higher quality
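The first four filtering stages of this pipeline can be sketched as plain functions over metadata records. This is a simplified illustration under several assumptions: every record already carries precomputed `phash`, `clip_score`, and `unsafe` fields (all hypothetical names), duplicates are matched exactly, where real pipelines usually compare perceptual hashes by distance, and the `unsafe` flag stands in for dedicated NSFW and CSAM detection tooling that this code does not implement.

```python
import math


def dedup(records):
    """Drop records whose URL or perceptual hash was already seen."""
    seen_urls, seen_hashes, kept = set(), set(), []
    for rec in records:
        if rec["url"] in seen_urls or rec["phash"] in seen_hashes:
            continue
        seen_urls.add(rec["url"])
        seen_hashes.add(rec["phash"])
        kept.append(rec)
    return kept


def drop_unsafe(records):
    """Placeholder for the safety stage; relies on an upstream 'unsafe' flag."""
    return [rec for rec in records if not rec["unsafe"]]


def keep_top_clip(records, fraction=0.3):
    """Keep the top `fraction` of records ranked by image-text CLIP score."""
    if not records:
        return []
    k = max(1, math.ceil(len(records) * fraction))
    ranked = sorted(records, key=lambda rec: rec["clip_score"], reverse=True)
    return ranked[:k]


def size_ok(rec, min_side=64, max_ratio=3.0):
    """Reject tiny images and extreme aspect ratios (thresholds are illustrative)."""
    short, long_ = sorted((rec["width"], rec["height"]))
    return short >= min_side and long_ / short <= max_ratio


def run_pipeline(records, clip_fraction=0.3):
    """Apply the stages in the same order as the diagram above."""
    records = dedup(records)
    records = drop_unsafe(records)
    records = keep_top_clip(records, clip_fraction)
    return [rec for rec in records if size_ok(rec)]
```

Running the cheap metadata stages before recaptioning matters in practice, because the recaptioning VLM is by far the most expensive step per image.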
Projects
| Project | Description | Difficulty |
|---|---|---|
| Mini LAION pipeline | Take 1M LAION URLs, download, filter with CLIP, dedup, recaption with a small VLM — produce a clean shard | ⭐⭐⭐⭐ |
| Caption ablation | Train two small VLMs: one on original alt-text, one on recaptioned text; compare downstream | ⭐⭐⭐⭐ |
| Modality ratio sweep | In a unified model run, deliberately under/oversample one modality; sweep the sampling ratio and measure each modality's loss curve as a diagnostic | ⭐⭐⭐⭐ |
| Multimodal DPO | Collect a small set of preference pairs over VLM outputs; fine-tune with DPO (algorithm reference) | ⭐⭐⭐⭐ |
| Hallucination eval | Build a small benchmark of trick questions ("is there a dog in this image?" when there is none); evaluate several open VLMs | ⭐⭐⭐ |
Key Insight
Two facts dominate multimodal training at scale: (1) web alt-text is terrible (short, generic, often wrong); and (2) synthetic captions from a strong VLM are, on average, much more detailed and accurate than the human-written alt-text they replace. The implication: data quality is itself a model-output problem. Better models make better captions, which make better models. This recursive improvement is one of the underappreciated engines of recent progress.
Resources
- LAION-5B paper
- DataComp paper — careful study of data curation
- DALL-E 3 technical report — recaptioning recipe
- ShareGPT4V — recaptioning for VLMs
Phase 9: Evaluation and Benchmarks
Multimodal evaluation is notoriously broken. Knowing which benchmarks to trust (and how they fail) is its own skill. (This is the multimodal evaluation playbook; text-only LM evaluation is LLM Phase 8.)
Concepts to Learn
- The benchmark landscape:
- MMMU / MMMU-Pro — multi-discipline multimodal understanding; among the hardest open VQA-style benchmarks
- MMBench, MME — general multimodal capability
- DocVQA, OCRBench — OCR-heavy document understanding
- MathVista, MathVision — math + diagrams
- ChartQA, AI2D — charts and diagrams
- POPE, HallusionBench — object/visual hallucination
- RefCOCO — referring expressions / grounding
- VideoMME, MLVU, LongVideoBench — video understanding
- The captioning benchmarks (CIDEr, BLEU, METEOR on COCO, NoCaps): largely solved and increasingly meaningless
- LLM-as-judge — using a strong model (often GPT-4 or Claude) to grade open-ended outputs; introduces its own biases
- Hallucination measurement — counting "things in the caption that aren't in the image"
- Robustness probes — adversarial images, distribution shifts, demographic balance
- Reasoning benchmarks — multimodal CoT, M³CoT, ScienceQA; the rise of multimodal reasoning models (QVQ, Gemini/o-series thinking with vision)
- The leakage problem — many benchmarks are now in pretraining corpora; suspect any too-good result
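The leakage check in the last bullet can be approximated with exact n-gram overlap between benchmark questions and corpus text. This is a rough sketch under stated assumptions: it normalizes to lowercase alphanumeric tokens, uses 8-grams (an arbitrary illustrative choice), and only catches near-verbatim copies, not paraphrases or leaked images. The function names are hypothetical.

```python
import re


def ngrams(text, n=8):
    """Set of lowercase word n-grams in `text`; empty if text has fewer than n words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(questions, corpus_docs, n=8):
    """Return (fraction flagged, flagged questions) by shared n-grams with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [q for q in questions if ngrams(q, n) & corpus_grams]
    rate = len(flagged) / len(questions) if questions else 0.0
    return rate, flagged
```

Questions shorter than `n` words produce no n-grams and are never flagged, so a low rate on a benchmark of short questions says little; lowering `n` trades that blind spot for more false positives.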
A Sane Evaluation Suite for a New VLM
| Capability | Benchmark | What it measures |
|---|---|---|
| General multimodal QA | MMBench, MMMU | breadth, hard problems |
| OCR / docs | DocVQA, OCRBench | text in images |
| Grounding | RefCOCO | spatial precision |
| Hallucination | POPE | object existence |
| Math + diagrams | MathVista | structured reasoning |
| Charts | ChartQA | quantitative reading |
| Video (if applicable) | VideoMME | temporal understanding |
| Open-ended (LLM-judge) | MM-Vet, LLaVA-Wild | user-style queries |
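For the hallucination row, object-existence probes reduce to binary classification, so the usual confusion-matrix metrics apply. The sketch below is a POPE-style scorer written from scratch, not the official evaluation script; it assumes model answers have already been parsed into booleans, and the `yes_ratio` field is included because a model that answers "yes" to everything scores perfect recall.

```python
def yes_no_metrics(predictions, labels):
    """Score yes/no existence answers. True means 'yes, the object is present'."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need equally sized, non-empty prediction and label lists")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    tn = sum(1 for p, l in pairs if not p and not l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),
    }
```

On a balanced probe set, a `yes_ratio` well above 0.5 combined with low precision is the typical signature of a model that hallucinates objects.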
Projects
| Project | Description | Difficulty |
|---|---|---|
| Run a VLM evaluation harness | Use lmms-eval or VLMEvalKit to score an open VLM across 6+ benchmarks | ⭐⭐ |
| Build a hallucination probe | Construct 200 "is X in this image" questions; half true, half false; measure precision/recall | ⭐⭐⭐ |
| Reproduce a leaderboard result | Pick a paper's MMBench number; reproduce; document the gap | ⭐⭐⭐ |
| Benchmark contamination check | Search for a benchmark's test questions in a pretraining corpus shard | ⭐⭐⭐ |
| Human-correlated eval | For 100 outputs, get 3 human ratings and 3 LLM-as-judge ratings; measure agreement | ⭐⭐⭐ |
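For the human-correlated eval project, raw percent agreement overstates how well an LLM judge tracks humans when one label is much more common. Cohen's kappa corrects for chance agreement. The sketch below assumes categorical ratings and collapses each group of three raters by majority vote first; that aggregation step is one simple choice among several, and the function names are hypothetical.

```python
from collections import Counter


def majority(votes):
    """Most common label among one item's ratings (ties go to the first seen)."""
    return Counter(votes).most_common(1)[0][0]


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two equally long lists of labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need equally sized, non-empty rating lists")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A typical use is `cohens_kappa([majority(h) for h in human_votes], [majority(j) for j in judge_votes])`, where each element of `human_votes` and `judge_votes` holds the three ratings for one output.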
Key Insight
There is no single number that captures "VLM quality." MMMU measures different things from POPE which measures different things from MathVista. The right move when launching a new VLM is to publish a suite — and to explicitly report the benchmarks where your model is worse than the prior state of the art. The field rewards honesty; reviewers see through cherry-picking.
Resources
- `lmms-eval` — modern multimodal eval harness
- `VLMEvalKit` — alternative, broad coverage
- MMMU paper
- POPE paper — hallucination eval
- Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al., 2023) — required reading on eval skepticism
Phase 10: Frontier Topics
Where the field is going. Pick one or two threads and go deep. Several of these threads live on a border with another guide — the cross-links below tell you who owns the rest of the story.
Vision-Language-Action (VLA) Models for Robotics
A VLA takes (image, instruction) → action. From the multimodal angle, a VLA is "a VLM whose output head emits action tokens instead of text" — the foundational works are RT-2, OpenVLA, π0, Gemini Robotics. The robot policy, imitation-learning, and sim-to-real side is owned by Robotics Phase 8 (which reads better after this guide's Phase 5); here we care only about the multimodal-modeling interface.
Long-Context Multimodal
A 1-hour video has roughly 100,000 frames at 30 fps. How do you fit that into a context window? Streaming attention, hierarchical encoding, learned memory, image-token compression. (Serving long multimodal contexts, including KV-cache sizing and prefix caching, is Inference Systems.)
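The simplest baseline for the context-window problem is to work out how many frames fit in the token budget and then sample that many frames uniformly. The sketch below is only that baseline, not any of the techniques listed above; the context size and tokens-per-frame figures in the usage note are illustrative assumptions, and the function names are hypothetical.

```python
def max_frames(context_tokens, tokens_per_frame, reserved_text_tokens=0):
    """How many whole frames fit after reserving room for the text prompt and answer."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    return max(0, (context_tokens - reserved_text_tokens) // tokens_per_frame)


def uniform_frame_indices(total_frames, num_samples):
    """Pick one frame index from the centre of each of num_samples equal segments."""
    if num_samples <= 0 or total_frames <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]
```

For example, with an assumed 128,000-token context, 256 tokens per frame, and 2,000 tokens reserved for text, `max_frames` returns 492. Spread over the 108,000 frames of a 1-hour, 30 fps video, that is about one frame every 7 seconds, which is why the compression and memory techniques above exist.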
Generative Unified Models
Models that produce images, audio, and video natively from one transformer: Chameleon's image-generation half, Show-o, Janus-Pro, Emu3, Transfusion. The cross-modal modeling is this guide's Phase 7; the underlying image/video generation machinery they decode through is Image Generation / Video Generation.
Multimodal Reasoning
Multimodal chain-of-thought, visual program synthesis (ViperGPT, VisProg), self-consistency over visual problems, and the new wave of multimodal thinking models (QVQ, Gemini/o-series with vision). The text-reasoning core is LLM Phase 6; what's new here is reasoning grounded in pixels.
Embodied and Agentic Multimodal
Web/GUI agents that see screenshots (SeeClick, ShowUI, OS-Atlas, UI-TARS), computer use (Claude Computer Use, Gemini's GUI mode), mobile agents. The screen is just another modality. (Tool-use and agent orchestration is LLM Phase 7; the embodied-control side is Robotics.)
Multimodal Interpretability
What does the projector actually do? Where do image features live inside the LLM? Mechanistic interpretability for VLMs is largely virgin territory (text-side interpretability is LLM Phase 10).
Efficient Multimodal Inference
Token merging, pruning, image-token compression, KV-cache sharing across modalities — the model-side techniques. The serving-stack side (batching, paged KV cache, multi-LoRA, throughput) is owned by Inference Systems; kernel-level work and quantization are AI Hardware.
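As a concrete picture of image-token compression, the sketch below average-pools each 2×2 neighbourhood of a patch-token grid into one token, cutting the token count by 4×. This is deliberately the simplest possible stand-in: it is plain spatial pooling, not similarity-based token merging or learned pruning, and it uses nested Python lists in place of tensors so that it runs without dependencies.

```python
def pool_tokens_2x2(grid):
    """Average each 2x2 block of token embeddings into a single embedding.

    grid[r][c] is one token's embedding (a list of floats).
    The number of rows and columns must both be even.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid height and width must be even")
    pooled = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            # zip(*block) walks the four embeddings one dimension at a time.
            out_row.append([sum(values) / 4 for values in zip(*block)])
        pooled.append(out_row)
    return pooled
```

A 24×24 grid of 576 tokens becomes a 12×12 grid of 144 tokens. The cost is spatial detail, which tends to hurt OCR-heavy and grounding tasks first.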
Safety in Multimodal Models
Visual jailbreaks (typographic attacks, adversarial images), CSAM detection, deepfake detection, watermarking for generative outputs (SynthID). New attack surfaces and new defenses.
Resources for the Frontier
- Awesome-Multimodal-Large-Language-Models — actively maintained survey
- OpenVLA paper
- π0 paper (Physical Intelligence)
- Show-o paper
- UI-TARS paper (ByteDance, 2025) — native GUI agent
- Anthropic — Computer Use blog
Suggested Timeline
| Phase | Duration | Outcome |
|---|---|---|
| 0. Prerequisites | 0–2 weeks | Transformers + PyTorch fluent; ViT and BPE understood |
| 1. Foundations | 1 week | Can map any multimodal paper to a taxonomy |
| 2. Encoders | 1–2 weeks | Implemented ViT; comfortable with mel spectrograms |
| 3. Contrastive | 2 weeks | Trained a tiny CLIP; zero-shot ImageNet working |
| 4. Fusion | 1–2 weeks | Implemented at least three fusion patterns |
| 5. VLMs | 3–4 weeks | Built and instruction-tuned a small VLM end to end |
| 6. Audio + video | 2 weeks | Trained a speech LM or video classifier |
| 7. Unified models | 2–4 weeks | Built a tiny Chameleon over interleaved tokens |
| 8. Scale + data | 2 weeks | Ran a real data filtering pipeline; understand recaptioning |
| 9. Evaluation | 1 week | Ran a real eval harness; built one hallucination probe |
| 10. Frontier | Ongoing | Picked one thread (VLA, unified, agents, ...) and going deep |
Total to research-comfortable: roughly 4–5 months of focused study (the phase ranges above sum to 15–22 weeks). Longer if combined with serious projects (recommended).
Key Advice
- Start with CLIP, end with CLIP. Most modern vision-language models trace back to or generalize CLIP. If you understand contrastive learning deeply, the rest comes faster.
- Build the smallest end-to-end VLM you can, early. A 100M-parameter VLM on COCO is one weekend's work. The lessons from doing it transfer directly to everything bigger.
- Data is the model. Spend at least as much time on data filtering and recaptioning as on architecture. The papers don't emphasize this enough.
- Beware the modality gap. Even well-aligned dual encoders keep text and image embeddings in separable regions. This affects retrieval, generation, and downstream fine-tuning.
- Don't trust a single benchmark. Especially captioning metrics. Always evaluate on a suite, including LLM-as-judge for open-ended outputs.
- Image tokens are expensive. A single image is often 256–2000 tokens. Multi-image and video contexts blow up fast. Know your token budget — and see Inference Systems when you deploy.
- Reuse pretrained components. Almost no one trains a vision encoder from scratch in 2026; you start from SigLIP, DINOv2, or similar, and from a pretrained LLM from the LLM guide.
- Synthetic captions are a superpower. Recaptioning with a strong VLM is the highest-leverage data trick in the field.
- `bf16` everywhere. The same advice as for LLMs and PyTorch; multimodal training is no different.
- Visualize your model's attention. Especially the cross-attention in VLMs. It tells you whether the model is actually looking at the right region.
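The "know your token budget" advice above is easy to turn into arithmetic for a plain ViT-style encoder: the image is cut into a grid of square patches, and each patch becomes one token. The sketch below assumes that setup, ignores any class or separator tokens, and rounds partial patches up as if the image were padded; the optional `merge` factor models encoders that fuse each `merge × merge` group of patches into one token. All names are hypothetical.

```python
import math


def num_image_tokens(height, width, patch_size, merge=1):
    """Token count for one image under a patch-grid encoder.

    merge=2 means every 2x2 group of patches is fused into a single token.
    """
    if min(height, width, patch_size, merge) <= 0:
        raise ValueError("all arguments must be positive")
    rows = math.ceil(math.ceil(height / patch_size) / merge)
    cols = math.ceil(math.ceil(width / patch_size) / merge)
    return rows * cols
```

A 336×336 image with 14-pixel patches gives 24 × 24 = 576 tokens, and a 448×448 image with the same patches and `merge=2` gives 256, which matches the low end of the range quoted above. Multiply by the number of images or sampled video frames to see how quickly a context fills.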
Common Pitfalls to Avoid
- ❌ Comparing CLIP variants on retrieval and concluding one "is better" — depends entirely on the dataset
- ❌ Using captioning metrics (BLEU, CIDEr) as your primary VLM evaluation
- ❌ Training a VLM on COCO and being surprised it can't OCR a screenshot
- ❌ Ignoring the temperature parameter in contrastive learning
- ❌ Loading a VLM in fp32 and wondering why inference is slow
- ❌ Forgetting that web data contains CSAM; not filtering for it
- ❌ Trusting a single MMMU score without checking the rest of the suite
- ❌ Building a unified model with 1M tokens of text and 10k tokens of image and wondering why image quality is bad
- ❌ Skipping the alignment stage and going straight to instruction tuning
- ❌ Evaluating on a benchmark that's already in your pretraining corpus
- ❌ Re-deriving diffusion/VQ-VAE internals here instead of reading Image Generation — this guide uses them, it doesn't re-teach them
Additional Resources
Books and Long-Form Reading
- Dive into Deep Learning — Multimodal chapter
- Foundations of Multimodal Machine Learning (CMU 11-777 course notes)
Talks and Lectures
- Lucas Beyer — Vision and Multimodal Models
- CMU 11-777 Multimodal Machine Learning
- Stanford CS231n — for the vision foundations
Key Papers, Chronologically
| Year | Paper | Contribution |
|---|---|---|
| 2020 | ViT | Transformers for vision |
| 2021 | CLIP | Web-scale contrastive |
| 2022 | Flamingo | Gated cross-attention into LLMs |
| 2023 | BLIP-2 | Q-Former, modular fusion |
| 2023 | LLaVA | Linear projector + instruction tuning |
| 2023 | SigLIP | Sigmoid contrastive, scales better |
| 2024 | Qwen2-VL | Dynamic / native resolution, strong open VLM |
| 2024 | Chameleon | Native any-to-any with discrete tokens |
| 2024 | OpenVLA | VLA for robotics, open-source |
| 2024 | Emu3 | Next-token prediction, all modalities |
| 2024 | Transfusion | AR text + diffusion image in one transformer |
| 2025 | Janus-Pro | Decoupled understanding/generation |
| 2025 | Qwen2.5-Omni | Open any-to-any omni model |
Tools You Should Know
- `transformers` (Hugging Face) — VLMs, multimodal pipelines
- `open_clip` — reproducible CLIP training
- `lmms-eval` / `VLMEvalKit` — evaluation harnesses
- `vLLM` / `SGLang` — multimodal inference serving (see Inference Systems)
- `torchaudio` — audio loading and transforms
- `decord` / `pyav` — fast video frame loading
- `webdataset` — streaming multimodal data
Communities
- Hugging Face forums — most VLM Q&A
- EleutherAI Discord — open research
- r/MachineLearning, X/Twitter ML community
License
MIT License. See the LICENSE file for details.