Reinforcement Learning: From Beginner to Advanced

A comprehensive guide to reinforcement learning as a subject — not just as the toolbox people pull stable-baselines from. The goal is to take you from "I have heard of Q-learning" to "I can read a 2026 RL paper, understand why the authors made every design choice, and reproduce its core algorithm." The guide is biased toward what matters in practice now: the things that ship.

An honest framing. Classical RL — the part of the field taught in textbooks — is a beautiful theory built on Markov decision processes, dynamic programming, and stochastic approximation. Most of it does not work directly on the problems people actually care about in 2026. What does work is a handful of robust algorithms (PPO, SAC, GRPO, MuZero-style search, DPO) applied with a mountain of practical knowledge about reward shaping, exploration, normalization, and infrastructure. This guide teaches the theory because you cannot debug what you do not understand, and then teaches the practice that the theory does not cover.

Phase 0: Prerequisites
Phase 1: MDPs and the Bellman Equations
Phase 2: Tabular Methods — DP, Monte Carlo, and TD
Phase 3: Function Approximation and DQN
Phase 4: Policy Gradients — REINFORCE to PPO
Phase 5: Continuous Control — DDPG, TD3, SAC
Phase 6: Model-Based RL
Phase 7: Offline RL
Phase 8: Exploration
Phase 9: RL for Language Models — RLHF, DPO, GRPO, RLVR
Phase 10: Frontier Topics
Suggested Timeline
Key Advice
Common Pitfalls
Additional Resources
Glossary

Phase 0: Prerequisites

RL combines probability, optimization, and deep learning under a deceptively simple wrapper. You can get started without mastering everything below, but if more than a couple are unfamiliar, slow down before continuing.

Concepts to Know

Probability: random variables, expectation, conditional expectation, law of total expectation, basic Markov chains
Linear algebra: matrix multiplication, eigenvalues (for contraction arguments), vector norms
Calculus: gradients, chain rule, the log-derivative trick (∇ log p = ∇p / p) — you'll use this constantly
Optimization: SGD, Adam, learning rates, the difference between an objective you maximize vs. minimize
Deep learning: training loops, nn.Module, basic CNNs and transformers. If shaky, do the PyTorch Deep Dive first
Some programming maturity: you will debug stochastic, asynchronous code where the bug only appears every 50k steps

The One Equation Everything Comes Back To

              ┌─────────────────────────────────────────┐
              │  V(s) = E[ Σ γᵏ r_{t+k} | s_t = s ]    │
              └─────────────────────────────────────────┘

The value of being in state s is the expected sum of (discounted) future rewards
if you act according to your policy from s onward.

Everything in RL — every algorithm, every loss function, every trick —
is some clever way of estimating this expectation when you don't know
the dynamics, the reward function, or even what "state" means.

If that sentence is fuzzy now, it will be sharp by the end of Phase 2.

What You Need Installed

Python 3.10+, PyTorch, NumPy
Gymnasium (the maintained fork of OpenAI Gym) — the de facto environment API
Stable-Baselines3 — well-tested reference implementations; read its source
CleanRL: single-file implementations of most major algorithms; the best teaching resource you can clone and run (it is a collection of scripts, not an importable library)
MuJoCo (via gymnasium[mujoco]) — for continuous control benchmarks
A GPU — not strictly required for tabular work, but essential by Phase 4

Resources

Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.) — the bible, free online. Read it cover to cover. Yes, really.
David Silver's RL course (DeepMind/UCL, 2015) — still the best lecture series for the fundamentals
Spinning Up in Deep RL (OpenAI) — the cleanest practical on-ramp
CleanRL docs and code — every algorithm in one file

Phase 1: MDPs and the Bellman Equations

The single most important phase. Everything else is variations on the math you learn here.

Concepts to Learn

Markov Decision Process (MDP): the tuple (S, A, P, R, γ) — states, actions, transition probabilities, reward function, discount factor
The Markov property: the future depends only on the present state, not the past. When this is violated (partial observability), you have a POMDP and life becomes harder
Policy π(a | s): a distribution over actions given a state. Deterministic policies are a special case
Return G_t = Σ γᵏ r_{t+k}: the discounted sum of future rewards from step t
Value functions:
- State value V^π(s) = E_π[G_t | s_t = s]
- Action value Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]
- Advantage A^π(s, a) = Q^π(s, a) − V^π(s) — "how much better than average is this action?"
The Bellman equations for V^π and Q^π — recursive consistency conditions
The Bellman optimality equations for V* and Q* — the same thing but for the best policy
Discount factor γ: why γ < 1 makes the math tractable (geometric series convergence) and what it means in practice (effective horizon ≈ 1/(1−γ))
Episodic vs continuing tasks, finite vs infinite horizon
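
The return and effective-horizon definitions above can be checked numerically. The sketch below is plain Python; the names `discounted_returns` and `effective_horizon` are illustrative helpers, not part of any library.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = sum_k gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def effective_horizon(gamma):
    """Rule-of-thumb horizon 1 / (1 - gamma): the return of a constant reward of 1 forever."""
    return 1.0 / (1.0 - gamma)


rewards = [1.0, 0.0, 0.0, 10.0]
print(discounted_returns(rewards, gamma=0.9))   # G_0 = 1 + 0.9**3 * 10
for g in (0.5, 0.9, 0.99, 0.999):
    print(g, effective_horizon(g))
```

The backward recursion is the same identity the Bellman equations are built on: the return from step t is the immediate reward plus the discounted return from step t+1.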

The Two Bellman Equations You Must Know Cold

For a fixed policy π:
   V^π(s) = E_a~π(·|s) [ R(s, a) + γ · E_s'~P(·|s,a) [ V^π(s') ] ]

For the optimal policy:
   V*(s) = max_a [ R(s, a) + γ · E_s'~P(·|s,a) [ V*(s') ] ]

In words:
   "The value of a state is the expected reward right now plus the
    discounted value of where you end up."

The max in the optimal version is what makes RL non-trivial.
Without it (policy fixed), value estimation is a linear system.
With it, you're solving a fixed-point problem on a nonlinear operator.

The Geometric Intuition

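The Bellman operator is a γ-contraction in the supremum norm: each application shrinks the distance between any two value functions by a factor of at least γ, so repeated application from any starting V converges geometrically to the fixed point (V^π for the fixed-policy operator, V* for the optimality operator). This single fact justifies the iterative algorithms of tabular RL. When your training "isn't converging," it's almost always because something has broken contraction: function approximation, off-policy data, or both.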
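
The contraction property can be observed directly. The following is a minimal sketch in plain Python on a made-up 2-state, 2-action MDP (the numbers in `P` and `R` are assumptions chosen for illustration): two arbitrary value functions are pushed through the Bellman optimality backup, and the sup-norm distance between them should shrink by at least a factor of γ each time.

```python
import random


def bellman_optimality_backup(V, P, R, gamma):
    """Apply (T V)(s) = max_a [ R[s][a] + gamma * sum_s' P[s][a][s'] * V[s'] ]."""
    return [
        max(
            R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
            for a in range(len(R[s]))
        )
        for s in range(len(V))
    ]


def sup_norm(u, v):
    return max(abs(x - y) for x, y in zip(u, v))


# A hypothetical MDP; each P[s][a] is a probability distribution over next states.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
gamma = 0.9

rng = random.Random(0)
V1 = [rng.uniform(-10, 10) for _ in range(2)]
V2 = [rng.uniform(-10, 10) for _ in range(2)]

ratios = []
for _ in range(20):
    before = sup_norm(V1, V2)
    V1 = bellman_optimality_backup(V1, P, R, gamma)
    V2 = bellman_optimality_backup(V2, P, R, gamma)
    after = sup_norm(V1, V2)
    ratios.append(after / before if before > 0 else 0.0)

print("largest observed shrink factor:", max(ratios))
```

Both sequences head toward the same fixed point regardless of where they start, which is exactly what the contraction argument promises in the tabular case.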

Projects

Project	Description	Difficulty
Build a gridworld	5×5 grid with rewards and obstacles; expose `(S, A, P, R, γ)` explicitly	⭐
Policy evaluation by matrix inverse	For a small MDP, solve `V^π = (I − γP^π)⁻¹ r^π` directly; verify against iterative evaluation	⭐⭐
Hand-trace Bellman backups	On a 3-state MDP, do 10 Bellman backups by hand and plot `V` over iterations	⭐⭐
Discount factor study	Same task, sweep `γ ∈ {0.5, 0.9, 0.99, 0.999}`; observe how the optimal policy changes	⭐⭐
POMDP exercise	Build a gridworld where the agent only sees its row, not its column; verify that the best memoryless policy (current observation → action) is worse than the optimal policy with full state access	⭐⭐⭐

Sample Code: Policy Evaluation on a Small MDP

import numpy as np

# 3-state MDP, 2 actions
n_states, n_actions = 3, 2
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # (S, A, S')
R = np.random.randn(n_states, n_actions)                                # (S, A)
gamma = 0.9

# A uniform random policy
pi = np.full((n_states, n_actions), 1.0 / n_actions)                    # (S, A)

# Iterative policy evaluation
V = np.zeros(n_states)
for _ in range(1000):
    V_new = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            V_new[s] += pi[s, a] * (R[s, a] + gamma * P[s, a] @ V)
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new
print("V^pi =", V)

# Verify via the closed-form solution V = (I - gamma P^pi)^-1 r^pi:
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = np.einsum("sa,sa->s",   pi, R)
V_closed = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
np.testing.assert_allclose(V, V_closed, atol=1e-6)

Key Insight

The Bellman equation is just a consistency check: "what I think about now must agree with what I think about next, plus the reward I get in between." Every RL algorithm is some way of enforcing this consistency when you can't compute the right-hand side exactly. Q-learning, DQN, TD(λ), advantage actor-critic — they differ in which version of the Bellman equation they're trying to satisfy and which approximations they make on the way.

Resources

Sutton & Barto, Chapters 3–4 — MDPs and DP
David Silver Lectures 2–3 — same material, video form
Csaba Szepesvári, Algorithms for Reinforcement Learning (free PDF) — for the theory-minded

Phase 2: Tabular Methods — DP, Monte Carlo, and TD

Tabular methods assume you can keep one number per state (or state-action pair) in a table. This is a toy assumption — but every modern algorithm is the tabular version with neural-network-shaped scaffolding. Master the tabular version and the neural version becomes a footnote.

Concepts to Learn

Dynamic programming (requires knowing P and R):
- Policy iteration — alternate policy evaluation and policy improvement
- Value iteration — Bellman optimality backup until convergence; extract the greedy policy at the end
- Generalized policy iteration (GPI) — the unifying picture
Monte Carlo (MC) methods (don't need P or R, only sampled returns):
- First-visit vs every-visit MC
- On-policy MC control with ε-greedy exploration
Temporal-difference (TD) learning — the central idea of RL:
- TD(0): V(s) ← V(s) + α(r + γ V(s') − V(s))
- SARSA (on-policy): Q(s,a) ← Q(s,a) + α(r + γ Q(s',a') − Q(s,a))
- Q-learning (off-policy): Q(s,a) ← Q(s,a) + α(r + γ max_{a'} Q(s',a') − Q(s,a))
Bias vs variance: MC is unbiased high-variance, TD is biased low-variance
n-step methods and TD(λ) — eligibility traces, the unifying parameter
Exploration vs exploitation: ε-greedy, Boltzmann/softmax, UCB
The deadly triad: function approximation + bootstrapping + off-policy training — the combination that breaks convergence guarantees
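
As a concrete instance of the dynamic-programming branch above, here is a minimal value-iteration sketch in plain Python. The 3-state chain MDP and the function name `value_iteration` are assumptions made for illustration; the algorithm itself is the Bellman optimality backup repeated until the values stop changing, followed by greedy policy extraction.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """Tabular value iteration.

    P[s][a] is a list of next-state probabilities, R[s][a] the expected reward.
    Returns (V, greedy_policy).
    """
    n_states = len(P)
    V = [0.0] * n_states

    def q_value(s, a):
        return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))

    for _ in range(max_iters):
        V_new = [max(q_value(s, a) for a in range(len(P[s]))) for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break

    policy = [max(range(len(P[s])), key=lambda a: q_value(s, a)) for s in range(n_states)]
    return V, policy


# Hypothetical chain: states 0 and 1 are ordinary, state 2 is absorbing with zero reward.
# Action 0 moves left (or stays at the left edge), action 1 moves right.
P = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
R = [
    [0.0, 0.0],
    [0.0, 1.0],   # reward 1 for stepping right from state 1 into the absorbing state
    [0.0, 0.0],
]

V, policy = value_iteration(P, R, gamma=0.9)
print("V* =", V)
print("greedy policy =", policy)
```

Q-learning targets the same fixed point, but replaces the sum over `P[s][a]` with a single sampled transition.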

The Algorithm Family Tree (Tabular)

                     Know dynamics P, R?
                          │
                ┌─────────┴────────┐
               yes                 no
                │                   │
            DP methods       Sample-based methods
            (Phase 2a)          /         \
                              MC            TD
                          (full return)  (bootstrap)
                                          /    \
                                       SARSA   Q-learning
                                      (on-pol) (off-pol)
                                          \    /
                                       n-step / TD(λ)
                                        unifies both

Why TD is the Idea

MC update target:    G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ...     (a whole trajectory)
TD(0) update target: r_t + γ V(s_{t+1})                            (one step + bootstrap)

MC is unbiased (the target IS the true return),
but high-variance (depends on every future step).

TD is biased (V(s_{t+1}) is an estimate, not the truth),
but low-variance (only one transition's randomness).

For most problems, the variance reduction from bootstrapping outweighs the bias it introduces.
This is why nearly every modern algorithm uses bootstrapped targets.
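
The variance gap between the two targets is easy to measure. The sketch below uses a five-state random walk of the kind found in Sutton & Barto's TD chapter (start in the middle, step left or right with equal probability, reward 1 only when exiting on the right, γ = 1). One simplifying assumption to note: the TD target here bootstraps from the true values, so it shows the variance side of the trade-off only; with a learned V the TD target would also carry bias.

```python
import random
import statistics

TRUE_V = [1 / 6, 2 / 6, 3 / 6, 4 / 6, 5 / 6]   # true values of the five non-terminal states


def step(s, rng):
    """One random-walk step. Returns (reward, next_state), with next_state None if terminal."""
    s2 = s + rng.choice((-1, 1))
    if s2 < 0:
        return 0.0, None
    if s2 > 4:
        return 1.0, None
    return 0.0, s2


def mc_target(s, rng):
    """Monte Carlo target: the full undiscounted return from state s."""
    g = 0.0
    while s is not None:
        r, s = step(s, rng)
        g += r
    return g


def td_target(s, rng, V):
    """TD(0) target: r + V(s'), bootstrapping from the value table V."""
    r, s2 = step(s, rng)
    return r + (V[s2] if s2 is not None else 0.0)


rng = random.Random(0)
start = 2
mc = [mc_target(start, rng) for _ in range(20_000)]
td = [td_target(start, rng, TRUE_V) for _ in range(20_000)]
print("MC mean/var:", statistics.mean(mc), statistics.pvariance(mc))
print("TD mean/var:", statistics.mean(td), statistics.pvariance(td))
```

Both targets average to the same value, but the Monte Carlo target is all-or-nothing (0 or 1), while the TD target only varies with a single transition.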

Projects

Project	Description	Difficulty
Value iteration on FrozenLake	Solve the Gym/Gymnasium FrozenLake-v1 with value iteration; visualize `V*` and the greedy policy	⭐⭐
Policy iteration vs value iteration	Same problem, both algorithms; count iterations to convergence	⭐⭐
First-visit MC	On Blackjack-v1, learn `V^π` for a fixed policy by MC; verify against an analytic solution if possible	⭐⭐
Q-learning on FrozenLake	Tabular Q-learning, ε-greedy, decaying ε; report final policy success rate	⭐⭐
SARSA vs Q-learning on Cliff Walking	Reproduce Sutton & Barto Example 6.6 (Cliff Walking); explain why SARSA prefers the safe path	⭐⭐⭐
Eligibility traces	Implement TD(λ) with replacing traces; sweep `λ ∈ {0, 0.5, 0.9, 1.0}`	⭐⭐⭐

Sample Code: Tabular Q-Learning

import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma = 0.1, 0.99

for ep in range(20000):
    s, _ = env.reset()
    eps = max(0.01, 1.0 - ep / 10000)             # decaying exploration
    done = False
    while not done:
        a = env.action_space.sample() if np.random.rand() < eps else int(Q[s].argmax())
        s2, r, term, trunc, _ = env.step(a)
        done = term or trunc
        target = r + gamma * Q[s2].max() * (not term)
        Q[s, a] += alpha * (target - Q[s, a])     # the Bellman update, one row at a time
        s = s2

print("greedy policy =", Q.argmax(axis=1))

Key Insight

The single sentence: TD learning is gradient descent on the Bellman error, with the gradient through the target term stopped. That stopped gradient — the fact that V(s') in the target is treated as a constant rather than as a function of the same parameters — is what makes TD work and what makes it fragile under function approximation. Phase 3 is mostly about managing the consequences.

Resources

Sutton & Barto, Chapters 4–7 — DP, MC, TD, n-step
David Silver Lectures 3–5
Spinning Up — Key Equations

Phase 3: Function Approximation and DQN

Tabular methods need one entry per state. Real problems have continuous or astronomically large state spaces: an Atari frame, even after downsampling to 84×84 grayscale, can take up to 256^(84×84) values. The fix is function approximation: replace the table with a neural net. This is where everything gets harder.

Concepts to Learn

Linear function approximation — features × weights; convergence guarantees mostly survive
Nonlinear (neural) function approximation — convergence guarantees mostly evaporate; empirical care required
The deadly triad in detail — and the tricks that tame it
DQN (Deep Q-Network) — the canonical recipe that made deep RL work on Atari:
- Experience replay buffer — break correlation between consecutive samples; reuse data
- Target network — a periodically-updated copy of the Q-network used in the bootstrap target; prevents instability
- ε-greedy with annealed ε, frame stacking, reward clipping
DQN family improvements (all worth knowing, all small wins individually, large in aggregate):
- Double DQN — decouple action selection and evaluation; reduces overestimation
- Dueling DQN — factor Q(s, a) = V(s) + A(s, a); better value estimation
- Prioritized Experience Replay (PER) — sample transitions by TD-error magnitude
- n-step returns — r_t + γ r_{t+1} + ... + γⁿ max_a Q(s_{t+n}, a)
- Noisy nets — parametric exploration noise
- Distributional RL (C51, QR-DQN, IQN) — predict the distribution of returns, not just the mean
- Rainbow — all of the above combined
When DQN-family algorithms are the right tool: discrete actions, off-policy data reuse, sample-efficiency matters
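
To make the Double DQN bullet concrete, here is a small sketch of the two bootstrap targets for a single transition, written in plain Python with lists standing in for network outputs. The function names and the Q-value numbers are assumptions for illustration only.

```python
def dqn_target(r, q_target_next, gamma, terminated):
    """Standard DQN: the target network both selects and evaluates the next action."""
    bootstrap = 0.0 if terminated else max(q_target_next)
    return r + gamma * bootstrap


def double_dqn_target(r, q_online_next, q_target_next, gamma, terminated):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    if terminated:
        return r
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]


# Hypothetical Q-value estimates for the next state s' over three actions.
q_online_next = [1.0, 2.5, 2.0]
q_target_next = [1.2, 1.8, 3.0]   # suppose the target net overestimates action 2

print(dqn_target(0.0, q_target_next, 0.99, False))
print(double_dqn_target(0.0, q_online_next, q_target_next, 0.99, False))
```

Because a maximum is never smaller than any single entry, the Double DQN target can never exceed the standard one for the same inputs, which is the mechanism behind the reduced overestimation.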

The DQN Loss, Annotated

loss = ((Q(s, a) - target)**2).mean()

           where target = r + γ · max_{a'} Q_target(s', a') · (not done)
                                 │              │
                                 │              └─ frozen target network
                                 └─ greedy action selection
                                    (Double DQN: use online net here, target net to evaluate)

  Crucial implementation details:
   - target = target.detach()           ← no gradients through the target
   - sample (s, a, r, s', d) from a big replay buffer
   - update Q_target ← Q every N steps (or use Polyak averaging)
   - clip rewards to [-1, 1] for Atari
   - Huber loss instead of MSE for robustness to outliers

Projects

Project	Description	Difficulty
DQN on CartPole	Single-file DQN that solves CartPole-v1 in <30k steps; no replay buffer tricks	⭐⭐⭐
Add a replay buffer	Now add experience replay and a target network; verify stability	⭐⭐⭐
Atari Pong	Full DQN with frame stacking and reward clipping; solve Pong	⭐⭐⭐⭐
Double + Dueling	Add both to your DQN; ablate each on Pong or Breakout	⭐⭐⭐⭐
Prioritized replay	Implement PER with a sum-tree; verify the priorities improve sample efficiency	⭐⭐⭐⭐
Mini Rainbow	Combine Double + Dueling + PER + n-step; reproduce a ~Rainbow-lite ablation	⭐⭐⭐⭐⭐
Distributional DQN (C51)	Predict a categorical distribution over returns; verify on a small env	⭐⭐⭐⭐⭐

Sample Code: A Minimal Working DQN

import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque
import random

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128),     nn.ReLU(),
            nn.Linear(128, n_actions),
        )
    def forward(self, x):
        return self.net(x)

q       = QNet(4, 2)
q_targ  = QNet(4, 2); q_targ.load_state_dict(q.state_dict())
optim   = torch.optim.Adam(q.parameters(), lr=2.5e-4)
buf     = deque(maxlen=50_000)
gamma   = 0.99

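def update():
    """One gradient step on a minibatch of (s, a, r, s2, done) tuples sampled from `buf`."""
    batch = random.sample(buf, 64)
    # Transpose the list of tuples into five tensors, one per field.
    s, a, r, s2, d = map(torch.as_tensor, zip(*batch))
    with torch.no_grad():                              # no gradients through the target
        # Standard DQN target. For Double DQN:
        #   a_star = q(s2.float()).argmax(-1)
        #   target_q = q_targ(s2.float()).gather(-1, a_star[:, None]).squeeze(-1)
        target = r.float() + gamma * q_targ(s2.float()).max(-1).values * (1 - d.float())
    pred = q(s.float()).gather(-1, a[:, None].long()).squeeze(-1)   # Q(s, a) for the actions taken
    loss = F.smooth_l1_loss(pred, target)              # Huber
    optim.zero_grad(); loss.backward(); optim.step()

# Elsewhere in the training loop, sync the target network every N updates:
#   q_targ.load_state_dict(q.state_dict())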

Key Insight

DQN's core additions each fixed a specific problem: the replay buffer and target network address instabilities of the TD update under nonlinear function approximation, while frame stacking and reward clipping handle partial observability and reward scale on Atari. The deadly triad doesn't go away; it gets managed. Every "modern" RL algorithm makes a slightly different bet about how to manage it. SAC bets on entropy regularization. PPO bets on small policy steps. Offline RL bets on staying near the data. None of them is mathematically clean. All of them work empirically because the bets are well-chosen.

Resources

Mnih et al. — Human-level control through deep reinforcement learning (Nature, 2015) — the original DQN paper
Rainbow paper (Hessel et al., 2017) — the survey of improvements
CleanRL's dqn_atari.py — single-file reference
Spinning Up — DQN
The 37 Implementation Details of PPO — same energy as DQN's implementation details; required reading

Phase 4: Policy Gradients — REINFORCE to PPO

Value-based methods estimate Q and act greedily. Policy-based methods directly parameterize π_θ(a | s) and optimize it, with a value function playing at most a supporting role as a baseline or critic. Policy gradients are the foundation of most algorithms that ship in production today: PPO, GRPO, and the bulk of the RLHF stack.

Concepts to Learn

The policy gradient theorem: ∇_θ J(θ) = E_π[ ∇_θ log π_θ(a | s) · Q^π(s, a) ]
The log-derivative (REINFORCE) trick — why the policy gradient is a pure expectation; why this matters
REINFORCE — vanilla Monte-Carlo policy gradient with returns G_t as the weight
Baselines — subtract a state-dependent baseline b(s) (typically V(s)) from the return; reduces variance without bias
Advantage actor-critic — use a learned V as baseline; the weight becomes A_t = G_t − V(s_t) or its bootstrapped version
Generalized Advantage Estimation (GAE) — interpolates between high-bias-low-variance TD and low-bias-high-variance MC via the λ parameter
A2C / A3C — synchronous and asynchronous advantage actor-critic
Trust regions:
- TRPO — constrain the KL divergence between old and new policy; works but is awkward to implement
- PPO — clip the importance ratio instead; same spirit, far simpler. The default on-policy algorithm in 2026
On-policy vs off-policy: PPO is on-policy, so each batch of rollouts must come from the current policy; it is reused for a few epochs of minibatch updates and then discarded. This is why PPO needs vast amounts of data and parallel envs
Entropy regularization — add β · H(π) to the objective to keep the policy from collapsing to a single action prematurely
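
The policy gradient theorem and the baseline claim can both be verified on the smallest possible problem: a one-state bandit with a softmax policy. This is a sketch in plain Python under that assumption; the helper names are made up for illustration. It compares the exact score-function gradient against a finite-difference gradient of the expected reward.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a) for a one-state problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def policy_gradient(theta, rewards):
    """Exact score-function gradient: sum_a pi(a) * grad log pi(a) * r(a).

    For a softmax policy, d log pi(a) / d theta_k = 1[a == k] - pi(k).
    """
    pi = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            score = (1.0 if a == k else 0.0) - pi[k]
            grad[k] += pi[a] * score * rewards[a]
    return grad


def finite_difference(theta, rewards, h=1e-6):
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += h
        down = list(theta)
        down[k] -= h
        grad.append((expected_reward(up, rewards) - expected_reward(down, rewards)) / (2 * h))
    return grad


theta = [0.1, -0.3, 0.5]
rewards = [1.0, 0.0, 2.0]
print(policy_gradient(theta, rewards))
print(finite_difference(theta, rewards))
# Subtracting a constant baseline from every reward leaves the gradient unchanged:
print(policy_gradient(theta, [r - 5.0 for r in rewards]))
```

In practice the sum over actions is replaced by samples from the policy, which is where the variance comes from and why the baseline matters even though it does not change the expectation.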

The Five Lines That Are PPO

ratio = (new_logp - old_logp).exp()                  # π_new / π_old
clip_ratio = ratio.clamp(1 - eps, 1 + eps)
policy_loss = -torch.min(ratio * adv, clip_ratio * adv).mean()
value_loss  = (V_pred - returns).pow(2).mean()
loss = policy_loss + c1 * value_loss - c2 * entropy

That's it. Every other line in a PPO implementation is data wrangling, normalization, or logging.
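
To see what the clip actually does, it helps to evaluate the per-sample objective by hand for the four sign combinations. The helper below is an illustrative plain-Python restatement of the `torch.min` line above, not code from any PPO implementation.

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate: min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped * adv)


# (ratio, advantage) pairs covering: good action pushed too far, good action made less
# likely, bad action made more likely, bad action pushed too far down.
for ratio, adv in [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]:
    print(ratio, adv, ppo_clip_term(ratio, adv))
```

The objective goes flat (zero gradient) once the ratio has moved past the clip range in the direction the advantage favors, but it keeps its full slope when the ratio has moved the wrong way. That asymmetry is why the `min` is there and not just the clamp.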

Generalized Advantage Estimation, Visualized

TD(0) advantage:    A_t = r_t + γ V(s_{t+1}) - V(s_t)       (1-step; low var, high bias)
MC  advantage:      A_t = G_t - V(s_t)                       (full return; high var, low bias)

GAE(γ, λ):
    δ_t = r_t + γ V(s_{t+1}) - V(s_t)                        (the 1-step TD residual)
    A_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ...

    λ = 0  → TD(0) advantage
    λ = 1  → MC advantage (minus value baseline)
    λ ≈ 0.95–0.97 → the sweet spot used in nearly every paper
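
The GAE sum above is normally computed with a backward recursion, A_t = δ_t + γλ · A_{t+1}. Here is a minimal plain-Python sketch for a single rollout segment; it assumes no episode boundary occurs inside the segment (real implementations also mask on termination flags), and the function name `compute_gae` is illustrative.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE for one rollout segment with no terminations inside it.

    rewards[t] and values[t] = V(s_t) for t = 0..T-1; last_value = V(s_T) bootstraps the tail.
    Returns (advantages, returns) where returns[t] = advantages[t] + values[t].
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]   # 1-step TD residual
        gae = delta + gamma * lam * gae                        # A_t = delta_t + (gamma * lam) * A_{t+1}
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
adv_td, _ = compute_gae(rewards, values, last_value=0.2, gamma=0.9, lam=0.0)
adv_mc, ret_mc = compute_gae(rewards, values, last_value=0.2, gamma=0.9, lam=1.0)
print("lambda = 0:", adv_td)
print("lambda = 1:", adv_mc, ret_mc)
```

With λ = 0 each advantage is just its own TD residual; with λ = 1 the `returns` list is the discounted reward sum bootstrapped from `last_value` at the end of the segment.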

Projects

Project	Description	Difficulty
REINFORCE on CartPole	Vanilla policy gradient, no baseline; observe the variance	⭐⭐
Add a value baseline	Same task, subtract a learned `V(s)`; verify variance drops	⭐⭐⭐
A2C with parallel envs	8 parallel envs, n-step returns, GAE; solve LunarLander	⭐⭐⭐⭐
PPO from scratch	Reproduce CleanRL's `ppo.py` line by line; explain each detail	⭐⭐⭐⭐
The 37 details	Implement (or audit) every one of the 37 PPO implementation details; measure each ablation	⭐⭐⭐⭐⭐
PPO on Atari	Apply your PPO to a few Atari games; compare against published numbers	⭐⭐⭐⭐⭐
TRPO for comparison	Implement TRPO; compare on a MuJoCo task; see why nobody uses it anymore	⭐⭐⭐⭐⭐

Sample Code: The PPO Update Loop (Sketch)

# After collecting `rollout_len` steps from `n_envs` parallel environments...
# obs, actions, log_probs_old, advantages, returns are all shape (rollout_len * n_envs, ...)

for epoch in range(n_epochs):
    for batch in minibatches(data, size=64):
        new_logp, entropy, V_pred = policy.evaluate(batch.obs, batch.actions)
        ratio = (new_logp - batch.log_probs_old).exp()

        adv = batch.advantages
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)            # the normalization that always matters

        pg_loss1 = -adv * ratio
        pg_loss2 = -adv * ratio.clamp(1 - eps, 1 + eps)
        pg_loss  = torch.max(pg_loss1, pg_loss2).mean()

        v_loss   = (V_pred - batch.returns).pow(2).mean()
        ent_loss = -entropy.mean()

        loss = pg_loss + 0.5 * v_loss + 0.01 * ent_loss
        optim.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(policy.parameters(), 0.5)        # the grad clip that always matters
        optim.step()

Key Insight

PPO is famous less for its math than for its robustness to implementation mistakes. It survives wrong learning rates, missing normalizations, and bad reward shaping in a way that TRPO, vanilla policy gradients, and most off-policy algorithms do not. This is why it's the default starting point for almost every new RL project — not because it's the best (it usually isn't), but because it works first try while you're still wrong about everything else.

Resources

Sutton & Barto, Chapter 13 — policy gradient methods
Schulman et al. — PPO paper (2017)
Schulman et al. — GAE paper (2015)
The 37 Implementation Details of PPO — the most useful single document in applied RL
CleanRL's ppo.py — one file, fully annotated

Phase 5: Continuous Control — DDPG, TD3, SAC

When actions are continuous (joint torques, steering angles, end-effector deltas), the max over actions in Q-learning is no longer a table lookup; it becomes an inner optimization problem at every step. The fix is a family of deterministic policy gradient and soft actor-critic methods that learn a continuous-action policy alongside a critic.

Concepts to Learn

Why discrete-action methods don't transfer directly — max_a Q(s, a) over continuous a is an optimization, not a lookup
The deterministic policy gradient theorem: ∇_θ J = E_s[ ∇_a Q(s, a)|_{a=μ_θ(s)} · ∇_θ μ_θ(s) ] — chain rule through the actor
DDPG (Deep Deterministic Policy Gradient):
- Actor μ_θ(s) and critic Q_φ(s, a)
- Off-policy, replay buffer, target networks (one for actor, one for critic)
- Exploration via additive noise on the action (Ornstein-Uhlenbeck or Gaussian)
TD3 (Twin Delayed DDPG) — DDPG with three fixes:
- Twin critics: take the min of two Q-networks → reduces overestimation
- Delayed actor updates: critic moves faster than actor → stability
- Target policy smoothing: add noise to the target action → regularization
SAC (Soft Actor-Critic) — the modern default for continuous control:
- Maximum entropy RL: maximize E[ Σ γᵗ ( r_t + α H(π(·|s_t)) ) ]
- Stochastic policy with learned mean and std; reparameterization trick for sampling
- Automatic temperature tuning (α) to hit a target entropy
When to use which: SAC is the default. TD3 is competitive on deterministic tasks. DDPG is a pedagogical stepping stone — don't ship it
Action saturation, tanh squashing, and the log-prob correction — small implementation details that matter a lot
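
The log-prob correction mentioned in the last bullet follows from the change-of-variables formula: if u ~ Normal(mean, std) and a = tanh(u), then log p(a) = log N(u) − log(1 − tanh(u)²). Below is a one-dimensional sketch in plain Python. The numerically stable form of the correction is a standard identity, and the helper names are assumptions for illustration. The final loop is a sanity check that the squashed density integrates to roughly 1 over (−1, 1).

```python
import math


def gaussian_logpdf(u, mean, std):
    return -0.5 * ((u - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squashed_logprob(u, mean, std):
    """Log-density of a = tanh(u) when u ~ Normal(mean, std), for a 1-D action.

    Uses log(1 - tanh(u)**2) = 2 * (log 2 - u - softplus(-2u)),
    which avoids log(0) when |u| is large.
    """
    correction = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return gaussian_logpdf(u, mean, std) - correction


def naive_correction(u):
    return math.log(1.0 - math.tanh(u) ** 2)


# Sanity check: integrate the density of a over (-1, 1) with the midpoint rule.
mean, std = 0.3, 0.5
n = 20_000
h = 2.0 / n
total = 0.0
for i in range(n):
    a = -1.0 + (i + 0.5) * h
    u = math.atanh(a)
    total += math.exp(squashed_logprob(u, mean, std)) * h
print("integral of squashed density:", total)
```

For multi-dimensional actions the correction is summed over action dimensions. Forgetting it, or flipping its sign, still produces a number that looks like a log-probability, which is why this bug tends to go unnoticed.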

The SAC Mental Model

Standard RL objective:                  J = E[ Σ γᵗ r_t ]
Maximum entropy RL objective:           J = E[ Σ γᵗ ( r_t + α H(π(·|s_t)) ) ]

The entropy bonus encourages:
  - Exploration (the policy stays stochastic)
  - Robust policies (no commitment to a fragile near-deterministic action)
  - Better critic learning (the Q-function sees a richer action distribution)

The temperature α controls how much you weight entropy.
SAC tunes α automatically by gradient descent on the dual problem.
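
One common way to implement the automatic temperature update is a gradient step on J(α) = E[ −α · (log π(a|s) + H_target) ], parameterized through log α so that α stays positive. The sketch below is plain Python with made-up log-probabilities; the function name, learning rate, and the target entropy of −1 (the usual −dim(A) heuristic for a 1-D action space) are assumptions for illustration.

```python
import math


def alpha_update(log_alpha, logps, target_entropy, lr=0.01):
    """One gradient-descent step on J = mean(-alpha * (logp + target_entropy)) w.r.t. log_alpha."""
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in logps) / len(logps)
    grad = -alpha * mean_term            # dJ / d log_alpha
    return log_alpha - lr * grad


target_entropy = -1.0

# Policy is too deterministic: entropy estimate -mean(logp) = -1.6 is below the target.
low_entropy_logps = [1.5, 1.7, 1.6]
# Policy is more random than required: entropy estimate -0.6 is above the target.
high_entropy_logps = [0.5, 0.7, 0.6]

print(alpha_update(0.0, low_entropy_logps, target_entropy))    # log_alpha goes up
print(alpha_update(0.0, high_entropy_logps, target_entropy))   # log_alpha goes down
```

When entropy falls below the target, α rises and the entropy bonus in the actor and critic losses grows; when entropy is above the target, α decays toward zero and SAC behaves more like a standard actor-critic.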

Projects

Project	Description	Difficulty
DDPG on Pendulum	Single-file DDPG; verify it learns; observe the instability	⭐⭐⭐
TD3 on HalfCheetah	Full TD3 implementation with twin critics; compare to DDPG	⭐⭐⭐⭐
SAC on a MuJoCo suite	Implement SAC; run on HalfCheetah, Walker2d, Ant, Humanoid; report final returns	⭐⭐⭐⭐
Automatic temperature tuning	Add the auto-α update to SAC; verify it stabilizes entropy across tasks	⭐⭐⭐⭐
Reparameterization audit	Verify your `tanh`-squashed Gaussian's log-prob correction is right; a sign error or a dropped Jacobian term here silently breaks SAC	⭐⭐⭐
Sample efficiency study	Compare PPO vs SAC on the same MuJoCo task in wall-clock time and samples used	⭐⭐⭐⭐

Sample Code: The SAC Critic Update

# Sampled (s, a, r, s2, d) batch
with torch.no_grad():
    a2, logp2 = actor.sample(s2)                      # reparameterized + log-prob
    q1_t = q1_targ(s2, a2)
    q2_t = q2_targ(s2, a2)
    q_t  = torch.min(q1_t, q2_t) - alpha * logp2      # entropy bonus baked into target
    target = r + gamma * (1 - d) * q_t

q1_loss = F.mse_loss(q1(s, a), target)
q2_loss = F.mse_loss(q2(s, a), target)
critic_loss = q1_loss + q2_loss
optim_critic.zero_grad(); critic_loss.backward(); optim_critic.step()

Key Insight

SAC's "entropy bonus" looks like a small modification, but it's the deepest practical difference between modern continuous-control RL and the 2014–2016 generation. Maximum-entropy policies are robust to environment perturbations, sample-efficient because the critic learns from a wide action distribution, and naturally explore without hand-tuned noise. Much of the sample-efficient work on real-robot locomotion and manipulation between 2018 and 2023 used SAC or one of its near-cousins, while large-scale simulation pipelines often relied on PPO instead.

Resources

Lillicrap et al. — DDPG (2015)
Fujimoto et al. — TD3 (2018)
Haarnoja et al. — SAC (2018) — paper 1 of 2
Haarnoja et al. — SAC v2 (2018) — auto-tuned temperature
Spinning Up — SAC
CleanRL's sac_continuous_action.py

Phase 6: Model-Based RL

Everything so far has been model-free: learn a policy or value function from samples, never explicitly modeling the environment. Model-based RL (MBRL) learns a model of the dynamics P(s' | s, a) and the reward R(s, a), then uses that model to plan, generate synthetic data, or both. The payoff is sample efficiency — orders of magnitude in some cases — and the cost is the new burden of model error.

Concepts to Learn

What "the model" can be:
- A learned forward dynamics network f_θ(s, a) → s', r
- An ensemble of forward dynamics networks (the standard way to estimate uncertainty)
- A latent dynamics model: encoder s → z, transition z, a → z', decoder z → o (image obs); the Dreamer family
- A generative video model doing the same job (world models — see Phase 9 of the Video Generation Guide)
Three ways to use a model:
- Dyna-style — generate fake transitions to augment a model-free learner's replay buffer (MBPO is the modern version)
- Planning — use the model directly at decision time (MPC, CEM, MPPI)
- Search + learning — combine learned value + learned policy + model-based search (MuZero, AlphaZero, EfficientZero)
Model error and pessimism — short rollouts vs long rollouts; how error compounds
MuZero — the algorithm that learned chess, Go, shogi, and Atari from scratch by combining MCTS with a learned model that doesn't even reconstruct observations
Dreamer (V1/V2/V3): image-based world models that learn behaviors by imagining rollouts in latent space; V3 is shockingly general, covering more than 150 tasks with one hyperparameter setting
TD-MPC / TD-MPC2 — short-horizon learned-model planning combined with a learned value to bootstrap from the planning horizon
The model-based / model-free divide in 2026: model-based wins on sample efficiency and few-shot adaptation; model-free still wins on raw asymptotic performance in many settings. The line is blurring fast
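
The point about compounding model error can be made with a toy example. The sketch below rolls the same action sequence through a made-up one-dimensional "true" system and a "learned" model whose single coefficient is off by 2%. Both dynamics functions are assumptions invented for illustration, not a real benchmark.

```python
def rollout(step_fn, s0, actions):
    """Roll a 1-D state forward through step_fn(s, a) and return all visited states."""
    states = [s0]
    for a in actions:
        states.append(step_fn(states[-1], a))
    return states


def true_step(s, a):
    return 1.00 * s + 0.1 * a       # hypothetical ground-truth dynamics


def model_step(s, a):
    return 1.02 * s + 0.1 * a       # hypothetical learned model, 2% error in one coefficient


actions = [1.0] * 50
real = rollout(true_step, 1.0, actions)
imagined = rollout(model_step, 1.0, actions)
errors = [abs(r - m) for r, m in zip(real, imagined)]

for k in (1, 5, 20, 50):
    print(f"rollout length {k:2d}: |error| = {errors[k]:.3f}")
```

A model that looks nearly perfect on one-step prediction drifts far from reality after a few dozen steps, which is the reason Dyna-style methods keep imagined rollouts short and planners bootstrap from a learned value at the end of the horizon.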

The Three Uses of a Model

┌─────────────────────────────────────────────────────────────────────┐
│ 1. AUGMENT DATA (Dyna, MBPO)                                        │
│    Train a model. Generate fake (s, a, r, s') tuples. Train SAC/PPO │
│    on real + fake data. Roll the model only k=1–5 steps to limit    │
│    compounding error.                                               │
├─────────────────────────────────────────────────────────────────────┤
│ 2. PLAN AT DECISION TIME (MPC, CEM, MPPI)                           │
│    Don't learn a policy. At each step, sample many action sequences,│
│    roll them out through the model, pick the best one, execute the  │
│    first action, replan. Model is everything; policy is implicit.   │
├─────────────────────────────────────────────────────────────────────┤
│ 3. SEARCH + LEARN (MuZero, EfficientZero, TD-MPC2)                  │
│    Learn a model, value, and policy jointly. Use Monte Carlo Tree   │
│    Search (MuZero) or short-horizon shooting (TD-MPC2) at decision  │
│    time, bootstrapping from the learned value beyond the search     │
│    horizon. The current frontier of model-based RL.                 │
└─────────────────────────────────────────────────────────────────────┘

Projects

Project	Description	Difficulty
PETS / random shooting MPC	Learn a 1-step dynamics model on Pendulum; do random-shooting MPC; compare to SAC	⭐⭐⭐
CEM-MPC	Replace random shooting with Cross-Entropy Method action search	⭐⭐⭐⭐
Mini MBPO	Train SAC with short model rollouts mixed into the replay buffer; verify the sample-efficiency win	⭐⭐⭐⭐
Dreamer V3 reproduction	Port the official Dreamer V3 to a custom env; observe how few hyperparameters it needs	⭐⭐⭐⭐⭐
Mini MuZero	Implement MuZero on a small game (Tic-Tac-Toe or 4x4 Connect4); see the recurrence between policy, value, and dynamics heads	⭐⭐⭐⭐⭐
TD-MPC2 study	Read TD-MPC2 paper; reproduce its DMC suite results	⭐⭐⭐⭐⭐

Sample Code: Random-Shooting MPC

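import torch

@torch.no_grad()
def mpc_action(model, s, horizon=20, n_samples=1024, action_dim=1,
               action_lo=-1.0, action_hi=1.0, gamma=0.99):
    """Random-shooting MPC: sample action sequences, roll them out through the model,
    and return the first action of the best sequence.

    Assumes `model(states, actions)` returns (next_states, rewards) with shapes
    (n_samples, state_dim) and (n_samples,), and that `s` is a 1-D state tensor.
    """
    actions = torch.empty(n_samples, horizon, action_dim).uniform_(action_lo, action_hi)
    s_curr = s.expand(n_samples, -1).clone()      # one copy of the state per candidate sequence
    returns = torch.zeros(n_samples)
    for t in range(horizon):
        s_next, r = model(s_curr, actions[:, t])
        returns += (gamma ** t) * r               # discounted model-predicted return
        s_curr = s_next
    best = returns.argmax()
    return actions[best, 0]                       # execute only the first action, replan next step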

Key Insight

Model-based RL has been "one breakthrough away from taking over" for a decade. What changed in 2023–2025 is that the model side and the policy side started to share representations — Dreamer's latent dynamics, MuZero's abstract state, TD-MPC2's encoder — so the model only needs to be accurate where it matters for the policy. This is the same trick that made transformers work: don't try to model everything, model what you'll be queried on. Expect MBRL to be the default by 2027–2028, especially on data-scarce embodied tasks where every real sample is expensive.

Resources

Sutton & Barto, Chapter 8 — planning and learning
Schrittwieser et al. — MuZero (Nature, 2020)
Hafner et al. — DreamerV3 (2023)
Hansen et al. — TD-MPC2 (2024)
Janner et al. — MBPO (2019)

Phase 7: Offline RL

In offline RL (also called batch RL), you have a fixed dataset of past interactions and cannot collect more. This is the regime that matches most real applications: medical records, recommender system logs, robot-fleet data, historical trading data. The challenge is distribution shift: the learned policy will want to take actions that aren't in the dataset, and the Q-function will hallucinate huge values for those out-of-distribution actions.

Concepts to Learn

Why "just run Q-learning on the data" fails — the bootstrapped target uses max_a Q(s', a), which is maximized at unseen actions where Q is wildly wrong
The two families of fixes:
- Policy constraint: stay close to the behavior policy that generated the data (BCQ, BEAR, AWAC, BRAC)
- Value pessimism: penalize Q-values at out-of-distribution actions (CQL), or avoid evaluating Q there at all (IQL); this family is the modern default
CQL (Conservative Q-Learning) — adds a penalty to the standard Bellman loss that pushes down Q-values at all actions and pulls them up only at the actions in the data
IQL (Implicit Q-Learning) — never queries Q at out-of-distribution actions at all; uses expectile regression on V. Simpler and more robust than CQL
Decision Transformer / Trajectory Transformer — reframe offline RL as autoregressive sequence modeling: condition on a desired return-to-go, predict the next action. Strong when the dataset is large and diverse
Behavior cloning baselines — sometimes BC is shockingly competitive with offline RL, especially when the dataset is high-quality. Always run it as a baseline
D4RL — the standard offline-RL benchmark suite
The relationship to RLHF: RLHF is a constrained-policy-improvement problem; offline-RL methods (especially DPO-as-offline-RL views) directly inform the RLHF stack
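
To make the CQL bullet above concrete, here is a minimal sketch of the conservative penalty for a single state with a small discrete action set. It assumes the log-sum-exp form of the penalty; `cql_penalty`, `cql_loss`, and `alpha` are names chosen for this illustration rather than any library's API.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def cql_penalty(q_values, data_action):
    """
    q_values:    Q(s, a) for every discrete action a at one state s
    data_action: index of the action actually stored in the dataset
    Minimizing this pushes Q down across all actions (soft max) and up at the data action.
    """
    return logsumexp(q_values) - q_values[data_action]

def cql_loss(q_values, data_action, td_target, alpha=1.0):
    """Standard squared TD error plus the conservative penalty, weighted by alpha."""
    td_error = q_values[data_action] - td_target
    return 0.5 * td_error ** 2 + alpha * cql_penalty(q_values, data_action)

# An inflated Q-value at an unseen action (index 1) makes the penalty larger.
print(cql_penalty([1.0, 1.0, 1.0], data_action=0))
print(cql_penalty([1.0, 10.0, 1.0], data_action=0))
```

The penalty is never negative, and it shrinks toward zero only when the dataset action dominates the soft maximum.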

The Distribution-Shift Picture

Dataset D = {(s, a, r, s')} collected by some unknown behavior policy π_β

Naive Q-learning on D:
   target(s, a) = r + γ max_{a'} Q(s', a')   ← evaluated at the GREEDY action a',
                                                which may have NEVER been seen

   ↓
   Q estimates explode at unseen actions
   ↓
   Learned policy preferentially takes those exploded actions
   ↓
   Catastrophic failure on the real environment

Offline RL fix: either keep the policy close to π_β, or pessimistically
under-estimate Q for actions not in the data. Both work; pessimism is
the modern default.
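
The first arrow in the picture above can be reproduced with a toy experiment. Every true Q(s', a') is set to zero and each estimate gets independent Gaussian noise, a deliberately crude stand-in for extrapolation error at unseen actions. The function name and noise model are assumptions made for this sketch.

```python
import random

def max_bias(n_actions, noise_std=1.0, n_trials=2000, seed=0):
    """Average of max_a Q_hat(s', a) when every true Q(s', a) is exactly 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        q_hat = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(q_hat)          # the bootstrapped target takes this max
    return total / n_trials

# The true max is 0 in every row; the estimated max grows with the number of actions.
for n in (1, 4, 16, 64):
    print(n, round(max_bias(n), 2))
```

The max operator turns zero-mean estimation noise into a positive bias, and bootstrapping then feeds that bias back into the targets. Online RL corrects it by actually trying the over-valued action; a fixed dataset gives no such correction.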

Projects

Project	Description	Difficulty
BC baseline on D4RL	Behavior cloning on the `walker2d-medium-v2` task; report return	⭐⭐
Naive Q-learning on the same dataset	Verify the catastrophic failure	⭐⭐⭐
Implement CQL	Add the conservative penalty; reproduce CQL's D4RL numbers	⭐⭐⭐⭐
Implement IQL	Verify it works on the same tasks with fewer knobs	⭐⭐⭐⭐
Decision Transformer	Implement on D4RL; condition on return-to-go; compare to IQL	⭐⭐⭐⭐
Dataset-quality study	Same algorithm, three datasets (random, medium, expert); plot return-vs-data-quality	⭐⭐⭐

Sample Code: The Heart of IQL

import torch
import torch.nn.functional as F

# IQL's three losses. Assumptions: Q1 and Q2 are twin critics, V is the state-value network,
# and pi(s) returns a torch.distributions object whose log_prob gives one value per sample.

# 1. Value function: expectile regression on the dataset Q-values
def value_loss(V, Q1, Q2, s, a, tau=0.7):
    with torch.no_grad():
        q = torch.min(Q1(s, a), Q2(s, a))           # double critic, dataset actions only
    v = V(s)
    diff = q - v
    # Asymmetric (expectile) loss: weight positive errors by tau, negative by (1 - tau)
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

# 2. Critic: standard TD with NO max over actions
def critic_loss(Q1, Q2, V, s, a, r, s2, d, gamma=0.99):
    with torch.no_grad():
        target = r + gamma * V(s2) * (1 - d)
    return F.mse_loss(Q1(s, a), target) + F.mse_loss(Q2(s, a), target)

# 3. Actor: advantage-weighted regression
def actor_loss(pi, Q1, Q2, V, s, a, beta=3.0):
    with torch.no_grad():
        adv = torch.min(Q1(s, a), Q2(s, a)) - V(s)
        weight = torch.clamp((beta * adv).exp(), max=100.0)
    return -(weight * pi(s).log_prob(a)).mean()     # log-prob of the dataset action at state s
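
To see what the expectile loss in the value function does, the sketch below minimizes the same asymmetric squared loss over a single scalar v for a handful of made-up Q-values. The function name, sample values, and step size are illustrative assumptions.

```python
def expectile(values, tau, n_iters=2000, lr=0.1):
    """Minimize mean_i w_i * (x_i - v)^2 over v, with w_i = tau if x_i > v else 1 - tau."""
    v = sum(values) / len(values)
    for _ in range(n_iters):
        grad = 0.0
        for x in values:
            diff = x - v
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff
        v -= lr * grad / len(values)     # plain gradient descent on a convex loss
    return v

q_samples = [0.0, 1.0, 2.0, 3.0, 10.0]   # hypothetical Q(s, a) over dataset actions at one state
for tau in (0.5, 0.7, 0.9):
    print(tau, round(expectile(q_samples, tau), 3))
```

At tau = 0.5 the result is the ordinary mean. As tau moves toward 1 the estimate moves toward the largest value in the sample without exceeding it, which is how V can approximate a max over in-dataset actions without Q ever being queried at an unseen one.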

Key Insight

The offline-RL field oscillated between policy-constraint and value-pessimism approaches for several years before IQL clarified things: you don't actually need to query Q at out-of-distribution actions to make a good policy. By only using in-distribution Q-values to estimate V, then using advantage-weighted regression for the policy, IQL sidesteps the whole over-extrapolation problem. This insight — constrain what you query, not what you output — is increasingly visible in RLHF and reasoning-model training too.

Resources

Phase 8: Exploration

Reward-hungry agents in sparse-reward worlds spend almost all of their time wandering. Exploration is the part of RL where mathematical guarantees and practical methods diverge most sharply. Theory says use UCB or Thompson sampling; practice says... it's complicated.

Concepts to Learn

Exploration vs exploitation as a fundamental trade-off — and why ε-greedy is "good enough" in dense-reward environments and disastrous in sparse-reward ones
Optimism in the face of uncertainty — UCB, Bayesian posteriors over Q, bootstrapped DQN
Intrinsic motivation:
- Count-based exploration — bonus for novel states, scaled by 1/√N(s)
- Curiosity-driven exploration (ICM, RND) — bonus for high prediction error of a learned forward or random model
- Empowerment — maximize mutual information between actions and future state
- Disagreement — bonus for ensemble disagreement on the next state
Go-Explore — separate "go to a known state" from "explore from there"; spectacular results on Montezuma's Revenge
Maximum-entropy exploration — what SAC does implicitly
Adversarial / unsupervised exploration — DIAYN, APT, ProtoRL: learn skills with no reward at all
Hard exploration benchmarks: Montezuma's Revenge, Pitfall, NetHack, MineRL
Why this is unsolved: no algorithm robustly explores arbitrary sparse-reward worlds. Every method works on the benchmarks it was designed for and breaks on adversarial new ones
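
The count-based bonus from the list above fits in a few lines for hashable (tabular) states. `CountBonus` and its `scale` parameter are names invented for this sketch.

```python
import math
from collections import defaultdict

class CountBonus:
    """Exploration bonus scale / sqrt(N(s)), where N(s) counts visits to state s."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.counts = defaultdict(int)

    def __call__(self, state):
        self.counts[state] += 1                      # count this visit first, so N(s) >= 1
        return self.scale / math.sqrt(self.counts[state])

bonus = CountBonus(scale=1.0)
extrinsic_reward = 0.0
for state in ["A", "A", "A", "A", "B"]:
    shaped = extrinsic_reward + bonus(state)         # feed this into the usual Q-learning update
    print(state, shaped)
```

The bonus for a state decays as it is revisited, so the agent is paid for leaving familiar regions. In continuous or pixel observations exact counts are nearly always 1, which is why pseudo-counts from density models are used there instead.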

A Taxonomy

                         How is the bonus computed?
                                 │
          ┌──────────────────────┼─────────────────────────┐
          ↓                      ↓                         ↓
   Count-based            Prediction-error          Information-gain
   (visit counts,         (ICM, RND: train a        (Bayesian / ensemble
   pseudo-counts          predictor; bonus = its    disagreement; bonus
   from density           own error)                = uncertainty in V or P)
   models)
          ↓                      ↓                         ↓
   Works in tabular /      Works on Atari but      Theoretically clean,
   discrete state.         brittle; can latch onto  computationally heavy.
   Hard in continuous /    "noisy TV problem"
   pixel state.            (stochastic noise = high
                            error = high reward, no
                            real exploration value)
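
The noisy-TV failure in the middle column can be shown with a one-parameter predictor. This is a toy stand-in for ICM or RND, not an implementation of either; the two observation sources and all constants are assumptions.

```python
import random

def final_prediction_error(sample_obs, n_steps=5000, lr=0.05, seed=0):
    """Train a scalar predictor online; return its mean squared error over the last 1000 steps."""
    rng = random.Random(seed)
    pred = 0.0
    errors = []
    for _ in range(n_steps):
        obs = sample_obs(rng)
        errors.append((obs - pred) ** 2)     # this error is what would be paid out as the bonus
        pred += lr * (obs - pred)            # gradient step on 0.5 * (obs - pred)^2
    tail = errors[-1000:]
    return sum(tail) / len(tail)

def wall(rng):
    """A boring but learnable observation: always the same value."""
    return 1.0

def noisy_tv(rng):
    """Pure noise: nothing to learn, so the error never goes away."""
    return rng.uniform(-1.0, 1.0)

print("wall bonus:    ", final_prediction_error(wall))
print("noisy-TV bonus:", final_prediction_error(noisy_tv))
```

The bonus for the wall vanishes once it has been learned, while the bonus for the noise source settles near the variance of the noise and stays there, so an agent maximizing this bonus keeps watching the TV.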

Projects

Project	Description	Difficulty
ε-greedy on a chain	A 10-state chain MDP with reward only at one end; observe how ε-greedy fails as chain length grows	⭐⭐
Count-based on a small env	Add a `1/√N(s)` bonus to Q-learning; verify exploration accelerates	⭐⭐⭐
RND on Atari	Implement Random Network Distillation; apply to Montezuma's Revenge; see the famous result	⭐⭐⭐⭐
ICM	Implement the Intrinsic Curiosity Module; compare to RND on the same task	⭐⭐⭐⭐
DIAYN	Train diverse skills with no extrinsic reward; visualize the skill space	⭐⭐⭐⭐⭐
Noisy-TV experiment	Construct an environment with a TV that shows random noise; verify that prediction-error methods get stuck staring at it	⭐⭐⭐

Sample Code: An RND Bonus

import torch
import torch.nn as nn

# Random Network Distillation: a fixed random target, a trainable predictor.
# Bonus = prediction error. Novel states have high error -> high bonus -> visited more often.
def MLP(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

obs_dim = 8                                                # placeholder: set to your environment's observation size
target_net    = MLP(obs_dim, 128).eval()                  # frozen, random init
predictor_net = MLP(obs_dim, 128)                          # trained on observed states
for p in target_net.parameters():
    p.requires_grad = False

def intrinsic_bonus(obs):
    with torch.no_grad():                                  # the bonus is a reward, so no gradient is needed
        err = (target_net(obs) - predictor_net(obs)).pow(2).mean(dim=-1)
    return err                                             # add this to the extrinsic reward

# Separately, train the predictor on the same observations:
def predictor_loss(obs):
    return (target_net(obs).detach() - predictor_net(obs)).pow(2).mean()

Key Insight

Exploration is the part of RL where you most need to know the structure of your problem. General-purpose exploration is not solved, may not be solvable, and most papers that claim to solve it are evaluated only on the few benchmarks where their assumptions hold. Problem-specific exploration — using domain knowledge to design demonstrations, reset distributions, curricula, or shaped intrinsic rewards — almost always wins in practice. The pragmatic advice in 2026: bake exploration into your data collection (curated resets, demonstrations, curricula) rather than into your algorithm. The frontier may shift back the other way as foundation models give us better priors for what's worth exploring.

Resources

Phase 9: RL for Language Models — RLHF, DPO, GRPO, RLVR

This is where the modern field has its center of gravity. The RL ideas you learned in Phases 1–4 directly drive the post-training pipelines for every major frontier LLM. Most innovation in RL between 2023 and 2026 happened in this phase, not in classical control.

Concepts to Learn

The basic RLHF pipeline (post-2022)

Pretraining — a base language model on web data
Supervised fine-tuning (SFT) — fine-tune on human-written demonstrations
Reward model (RM) training — train a model to score completions, supervised by human pairwise preferences
RL fine-tuning — optimize the policy (the LLM) to maximize the RM's score, with a KL penalty back to the SFT model

The RLHF objective

J(π) = E_{x ~ data, y ~ π(·|x)} [ R_φ(x, y) ]  −  β · KL( π || π_SFT )

R_φ(x, y) :   reward model's score for completion y on prompt x
π          :   the language model being trained
π_SFT      :   the SFT model, treated as a fixed reference
β          :   how much to penalize drifting from the reference

The KL penalty is crucial. Without it, the policy reward-hacks the RM
(produces gibberish that the RM happens to score high) within hundreds of steps.
With it, the policy gets better at what humans actually want, slowly.
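
In practice the KL term is estimated from the sampled completion itself. The sketch below assumes the common single-sample estimate log π(y|x) − log π_SFT(y|x), summed over completion tokens; the function and argument names are hypothetical.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """
    rm_score:    scalar reward-model score R_phi(x, y) for one sampled completion y ~ pi
    logp_policy: per-token log-probs of the completion under the policy pi
    logp_ref:    per-token log-probs of the same tokens under pi_SFT
    Returns R_phi(x, y) - beta * (log pi(y|x) - log pi_SFT(y|x)).
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate

# Policy identical to the reference: no penalty.
print(kl_penalized_reward(2.0, [-1.0, -1.0], [-1.0, -1.0]))
# Policy much more confident than the reference on these tokens: reward is reduced.
print(kl_penalized_reward(2.0, [-0.5, -0.5], [-1.5, -1.5]))
```

A completion the reward model loves but the reference model finds very unlikely pays a large penalty, which is what slows the drift toward reward-hacked gibberish.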

Algorithms

PPO for RLHF — the original recipe (InstructGPT, ChatGPT). Treat the prompt as the state, the completion as a sequence of actions, the RM score as terminal reward, and run PPO. Notoriously fiddly: value head, GAE on token-level returns, KL scheduling
DPO (Direct Preference Optimization) — closed-form derivation showing that the RLHF objective is equivalent to a supervised loss on preference pairs, when the optimal policy is parameterized correctly. No reward model, no rollouts, no PPO. The default in many open-source post-training pipelines
IPO, KTO, ORPO, SimPO, R-DPO — DPO-family variants tweaking the loss to fix specific failure modes (overfitting, length bias, asymmetric preferences)
GRPO (Group Relative Policy Optimization) — DeepSeek's PPO variant. For each prompt, sample a group of completions, compute relative advantages within the group (no value baseline needed), and apply PPO-style updates. Memory-efficient, simpler than PPO, the backbone of recent reasoning-model training
RLVR (RL with Verifiable Rewards) — when answers can be checked programmatically (math, code, formal proofs), skip the reward model entirely and use the verifier directly. This is the engine of the reasoning-model wave (o1, R1, etc.)
REINFORCE-style algorithms returning — RLOO and similar; once you have a working verifier, vanilla policy gradients with leave-one-out baselines often match or beat PPO
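
The GRPO, RLVR, and RLOO entries above share one pattern: sample several completions for a prompt, score them, and compare each to its group. The sketch below assumes a toy string-matching verifier and population standard deviation for the normalization; implementations differ on such details, and all names here are hypothetical.

```python
import statistics

def verifier_reward(completion, reference_answer):
    """Hypothetical verifier: 1.0 if the text after the last 'Answer:' matches exactly, else 0.0."""
    if "Answer:" not in completion:
        return 0.0
    answer = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == reference_answer else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style: standardize each reward against its own group; no value network involved."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def leave_one_out_advantages(rewards):
    """RLOO-style: compare each reward to the mean of the other samples in the group."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

group = ["2 + 2 = 4. Answer: 4", "Answer: 5", "I think it is 4", "Answer: 4"]
rewards = [verifier_reward(c, "4") for c in group]
print(rewards)
print(group_relative_advantages(rewards))
print(leave_one_out_advantages(rewards))
```

When every completion in a group gets the same reward, both functions return zeros and that prompt contributes no gradient, so prompts that are always solved or never solved carry no learning signal.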

What changed in 2024–2026

Reasoning models: RLVR on math and code at scale produces models that learn to think before answering. The "RL on chain-of-thought" loop turns out to scale beautifully
Process reward models vs outcome reward models: scoring each reasoning step vs scoring only the final answer
Self-play and self-improvement loops: model generates problems, attempts solutions, verifies, learns from successes (STaR, V-STaR, RFT, R*)
Multi-turn RLHF: optimizing across full conversations, not just single completions

The DPO Derivation, in Words

Start with the RLHF objective:
   max_π  E[R(x, y)] − β KL(π || π_ref)

Solve for the optimal π analytically (Lagrangian):
   π*(y|x) ∝ π_ref(y|x) · exp(R(x, y) / β)

Rearrange to express R in terms of π* and π_ref:
   R(x, y) = β log( π*(y|x) / π_ref(y|x) ) + const(x)

Substitute this into the Bradley-Terry preference model
P(y_w > y_l | x) = σ(R(x, y_w) − R(x, y_l)).

You get a purely supervised loss on preference pairs:
   L_DPO = −log σ( β log(π(y_w|x)/π_ref(y_w|x)) − β log(π(y_l|x)/π_ref(y_l|x)) )

No reward model. No PPO. Just a clever loss on (prompt, chosen, rejected) triples.
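
The middle two steps of the derivation can be checked numerically on a tiny response space. The three completions, their rewards, and β below are made-up values for illustration.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(R(x, y) / beta), over a small discrete set of y."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def implicit_reward(pi, pi_ref, beta):
    """beta * log(pi(y|x) / pi_ref(y|x)): the reward, up to a constant that depends only on x."""
    return [beta * math.log(p / q) for p, q in zip(pi, pi_ref)]

pi_ref  = [0.5, 0.3, 0.2]      # reference probabilities of three candidate completions
rewards = [1.0, 2.0, 0.0]      # their true rewards
beta    = 0.5

pi_star = optimal_policy(pi_ref, rewards, beta)
r_hat   = implicit_reward(pi_star, pi_ref, beta)

# The recovered rewards are shifted by a constant, so pairwise differences match exactly.
print(r_hat[1] - r_hat[0], rewards[1] - rewards[0])
print(r_hat[0] - r_hat[2], rewards[0] - rewards[2])
```

Because the Bradley-Terry model only ever sees reward differences, the unknown constant cancels, which is why the DPO loss needs no normalizer.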

Projects

Project	Description	Difficulty
SFT a small base model	Fine-tune Qwen-0.5B or similar on a small instruction dataset; observe baseline behavior	⭐⭐
Train a reward model	Pairwise classifier over SFT outputs; verify on held-out preferences	⭐⭐⭐
PPO-style RLHF	Mini-RLHF on a small model and small RM; track KL to reference; watch for reward hacking	⭐⭐⭐⭐⭐
DPO	Same dataset, DPO instead of PPO; compare quality, training time, stability	⭐⭐⭐⭐
GRPO from scratch	Implement GRPO for a small math task (GSM8K-style) with a verifiable reward	⭐⭐⭐⭐⭐
RLVR on math	Train a small reasoning loop on a verifiable math subset; observe the emergence of longer chain-of-thought	⭐⭐⭐⭐⭐
Length-bias audit	Plot completion-length distributions of your DPO/PPO models; verify the well-known drift	⭐⭐⭐
Reward hacking demo	Intentionally over-train against an RM; characterize the gibberish that emerges	⭐⭐⭐

Sample Code: The DPO Loss

import torch
import torch.nn.functional as F

def dpo_loss(policy_logits_w, policy_logits_l,
             ref_logits_w,    ref_logits_l,
             chosen_ids,      rejected_ids,
             beta=0.1):
    """
    policy_logits_*: logits from the model being trained, on chosen / rejected sequences
    ref_logits_*:    logits from the frozen reference model
    *_ids:           target token ids
    """
    logp_w_pi  = sequence_logp(policy_logits_w, chosen_ids)
    logp_l_pi  = sequence_logp(policy_logits_l, rejected_ids)
    logp_w_ref = sequence_logp(ref_logits_w,    chosen_ids).detach()
    logp_l_ref = sequence_logp(ref_logits_l,    rejected_ids).detach()

    pi_logratios  = logp_w_pi  - logp_l_pi
    ref_logratios = logp_w_ref - logp_l_ref

    logits = beta * (pi_logratios - ref_logratios)
    return -F.logsigmoid(logits).mean()

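def sequence_logp(logits, ids, mask=None):
    """Sum of log p(token_t | tokens_<t) along a sequence.

    mask (optional, same shape as ids): 1 for tokens to score, 0 for padding.
    Without it, padding positions are summed too.
    """
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    token_logp = logp.gather(-1, ids[:, 1:, None]).squeeze(-1)
    if mask is not None:
        token_logp = token_logp * mask[:, 1:]
    return token_logp.sum(-1)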
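
A scalar version of the same loss is handy as a sanity check when debugging. This is a plain-Python sketch that takes already-summed sequence log-probs; the function name and the example numbers are assumptions.

```python
import math

def dpo_loss_from_logps(logp_w_pi, logp_l_pi, logp_w_ref, logp_l_ref, beta=0.1):
    """-log sigmoid(beta * [(logp_w_pi - logp_w_ref) - (logp_l_pi - logp_l_ref)]) for one pair."""
    margin = beta * ((logp_w_pi - logp_w_ref) - (logp_l_pi - logp_l_ref))
    # -log(sigmoid(m)) = softplus(-m), written in a numerically stable form
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# At initialization the policy equals the reference, so the margin is 0 and the loss is log 2.
print(dpo_loss_from_logps(-12.0, -15.0, -12.0, -15.0))
# Policy has moved toward the chosen completion and away from the rejected one: loss drops.
print(dpo_loss_from_logps(-10.0, -20.0, -12.0, -12.0))
# Policy has moved the wrong way: loss rises above log 2.
print(dpo_loss_from_logps(-20.0, -10.0, -12.0, -12.0))
```

If the first training step of a DPO run does not report a loss near log 2 ≈ 0.693, the policy and reference log-probs probably disagree, which usually points at masking or at a reference model that is not a copy of the starting policy.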

Key Insight

The RLHF/DPO/GRPO/RLVR progression is the clearest example in recent ML of a research field collapsing complexity. Each step removed a moving part: DPO removed the explicit reward model and PPO. GRPO removed the value function. RLVR (when applicable) removed the reward model and replaced it with a deterministic verifier. The remaining components — preference pairs or verifiers, a reference model, a KL penalty — are mostly irreducible. When a verifier exists, RL on LLMs becomes embarrassingly straightforward. When it doesn't (everything involving subjective quality, style, helpfulness), you still need the messy preference pipeline. The grand bet of 2026 is: how many domains can be converted to verifiable form?

Resources

Christiano et al. — Deep RL from Human Preferences (2017) — the foundational paper
Ouyang et al. — InstructGPT (2022)
Rafailov et al. — DPO (2023)
Shao et al. — DeepSeekMath + GRPO (2024)
DeepSeek-R1 paper (2025) — the RLVR-for-reasoning playbook
Tülu 3 (Allen AI, 2024) — open-source RLHF/RLVR recipe
Hugging Face TRL library — the practical implementation
Lambert — Reinforcement Learning from Human Feedback — the book on this material

Phase 10: Frontier Topics

Where the field is going. Pick one or two threads and follow them; you cannot follow all of them.

Reasoning Models and the RLVR Wave

The most active research area in RL right now. Long chain-of-thought, RL with verifiable rewards (math, code, formal proofs), self-correction, search-augmented inference. The o1/R1/Claude-3.7-style reasoning trace is the visible part; the post-training recipes that produce it are the iceberg. Expect this thread to dominate 2026–2027.

Multi-Agent RL

Cooperative (StarCraft II unit micromanagement as in SMAC, team play in hide-and-seek), competitive (Go, poker), and mixed (negotiation, market simulation). Self-play, population-based training, league play. Non-stationarity from the other agents' learning is the central technical challenge.

Meta-RL and Few-Shot Adaptation

Learning to learn: RL² (an LSTM as the inner RL algorithm), MAML, PEARL. Increasingly subsumed by in-context learning in foundation-model policies.

Hierarchical RL

Options, sub-goals, feudal networks. The dream of compositional skills has been hard to realize cleanly, but VLAs and language-conditioned policies are achieving similar effects through different means (Phase 6 of the Robotics Guide).

World Models and Generative RL

Generative video models that are also simulators. Crosses directly into Video Generation Guide Phase 9. Genie 2, Sora-style for sim-replacement, Dreamer-class for control. The convergence of generative modeling and RL.

Inverse RL and Reward Learning

When the reward is unknown but the demonstrations exist: GAIL, AIRL, MaxEnt IRL. Connects to the preference modeling in RLHF and to safety (learning what humans value).

Constrained RL and Safe RL

Optimize reward subject to safety constraints. Critical for real-world deployment. CMDP formalism, Lagrangian methods, shielded RL. Adjacent to formal-verification approaches for embodied AI.

RL at Scale on Foundation Models

The infrastructure side: efficient PPO/GRPO on tens of thousands of GPUs, asynchronous generation/training, KV-cache management for rollout, vLLM/SGLang integrations. Most of the engineering work behind reasoning-model training lives here.

Open-Endedness and Curricula

Procedurally-generated environments (XLand, MineRL BASALT, Crafter), automatic curriculum learning, generative agents. The bet that "general intelligence comes from a diverse stream of tasks."

Theoretical RL

Sample complexity bounds, regret bounds, PAC-RL, distributional shifts. Currently disconnected from practice but slowly catching up. Worth watching if you like math.

Resources for the Frontier

OpenAI o1 system card and follow-ups
DeepSeek-R1, V3, V3.1
Tülu 3 (open-source frontier post-training recipe)
GPU Mode RL track
NeurIPS, ICML, ICLR — RL tracks at the top ML venues
RLHF book — Lambert

Suggested Timeline

Phase	Duration	Outcome
0. Prerequisites	0–2 weeks	Gymnasium, PyTorch, CleanRL installed; Sutton & Barto Ch. 1–2 read
1. MDPs and Bellman	1 week	Gridworld coded, value iteration converges, Bellman equations are reflex
2. Tabular methods	1–2 weeks	Q-learning, SARSA, TD(λ) implemented; understand the deadly triad
3. DQN and friends	2 weeks	DQN on Atari; Double + Dueling + PER ablation done
4. Policy gradients	2 weeks	PPO from scratch, all 37 details understood
5. Continuous control	2 weeks	SAC on Mujoco suite; understand why it dominates
6. Model-based	2–3 weeks	MPC done; mini Dreamer or mini MuZero attempted
7. Offline RL	1–2 weeks	IQL on D4RL; BC baseline run honestly
8. Exploration	1–2 weeks	RND on Montezuma's Revenge; opinions on what works and what doesn't
9. RL for LLMs	3–4 weeks	DPO done; GRPO understood; RLVR loop on a verifiable task working
10. Frontier	Ongoing	Picked one thread and going deep

Total to "comfortable practitioner": roughly 4–5 months of focused study (the phase durations above sum to 15–22 weeks), longer if combined with real projects (recommended). To "research-comfortable on one frontier thread": 6–12 months beyond that.

Key Advice

Read Sutton & Barto. Yes, the whole thing. It is the only book in ML where the gap between "skimmed it" and "actually read it" is night-and-day. Most "RL doesn't work" stories trace back to skimming.
Implement from scratch at least once. Tabular Q-learning, DQN, PPO, SAC, DPO. Not "use the library version"; from scratch. The exercise teaches you which knobs are real and which are accidents.
Read CleanRL before reading anything else. Each algorithm is in a single self-contained file. The standard reference implementation. Pin its commit when you reproduce a result.
Read the 37 PPO implementation details. Not optional. The gap between "PPO works" and "PPO works for me" is mostly that document.
Normalize. Everything. Observations, advantages, returns, rewards. PPO and SAC are weirdly sensitive to scale. The default in every working repo: running mean-and-variance normalization.
Seed and run multiple times. RL has enormous run-to-run variance. A single curve tells you almost nothing. Always plot 3–5 seeds with min/max bands.
Always run a BC / random / hand-coded baseline. Half the "RL works" claims in the literature are not robust to comparing against a strong non-RL baseline on the same task.
Define your reward signal before your algorithm. Reward shaping decisions dominate algorithm decisions. A poorly-shaped reward will defeat any algorithm; a well-shaped reward will be solvable by REINFORCE.
detach() the target. The single most common deep-RL bug: gradients flowing through the bootstrap target. Wrap every TD-target computation in with torch.no_grad(): or sprinkle .detach() aggressively.
Profile before scaling. RL training spends shockingly little time in the forward pass and shockingly much in environment stepping, data movement, and synchronization. Always profile.
In the LLM-RL era, ask: do I have a verifier? If yes, use RLVR. If no, decide between preference learning (DPO) and full RLHF (PPO/GRPO). The decision dominates everything that follows.
Avoid PPO's value head bug. A common implementation has the value head share the trunk with the policy and use a separate normalization. The interaction between value-loss clipping and observation normalization is responsible for many silent failures.
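
The running mean-and-variance normalization mentioned in the list above can be sketched for a scalar stream with Welford's online update. Real implementations usually do this per observation dimension and in batches; the class and method names here are hypothetical.

```python
class RunningMeanStd:
    """Tracks mean and population variance of a scalar stream, one value at a time."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0        # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Fall back to 1.0 until there are at least two samples.
        return self.m2 / self.count if self.count > 1 else 1.0

    def normalize(self, x):
        return (x - self.mean) / (self.var + self.eps) ** 0.5

rms = RunningMeanStd()
for obs in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rms.update(obs)
print(rms.mean, rms.var, rms.normalize(7.0))
```

Save the statistics with the model checkpoint and freeze them at evaluation time; a policy evaluated with different normalization statistics than it was trained with sees a different input distribution.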

Common Pitfalls

❌ Forgetting detach() on the TD target → gradients flow through both sides of the Bellman equation, training diverges
❌ Off-by-one in the discounted return (G_t vs G_{t+1}) → subtle, takes days to find
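❌ Wrong done handling at episode boundaries → on true termination the bootstrap value V(s_{T+1}) should be zero, not predicted; on a time-limit truncation you should still bootstrap from the predicted value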
❌ Using float16 for value functions → silent NaNs from the wide value range
❌ Not normalizing observations → wildly different scales blow up the policy gradient
❌ Tuning hyperparameters on a single seed → publishing results that don't replicate
❌ Forgetting the log-prob correction for tanh-squashed actions in SAC → silently broken policy
❌ Treating the SAC actor's log_prob as a categorical when it's Gaussian (or vice versa) → wrong loss, wrong everything
❌ Computing GAE across episode boundaries → corrupted advantages
❌ Using optimizer.zero_grad() once per epoch instead of per minibatch → gradients accumulate incorrectly
❌ Reading "PPO with default hyperparameters" and assuming they're the same across papers → they're not; always check
❌ Training RLHF without a strong reference model → reward hacking within hours
❌ Skipping the KL penalty in RLHF → catastrophic policy collapse
❌ Running RL with an environment that has non-Markov observations and assuming the policy will "figure it out" → it won't
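
Several of the pitfalls above (done handling, GAE across episode boundaries, off-by-one returns) meet in one function. The sketch below is one way to write GAE over a rollout buffer that may span several episodes; the signature is an assumption, and time-limit truncation is deliberately not handled.

```python
def compute_gae(rewards, values, terminated, last_value, gamma=0.99, lam=0.95):
    """
    rewards[t], values[t], terminated[t] for t = 0..T-1 from one rollout buffer.
    terminated[t] is True when the transition at step t ended its episode.
    last_value is V(s_T), used to bootstrap past the end of the buffer.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 0.0 if terminated[t] else 1.0
        # Zero the bootstrap at a terminal step so the next episode's value cannot leak in.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Also cut the recursion, so advantages do not accumulate across the boundary.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

# Two episodes in one buffer: the first ends at step 1.
adv, ret = compute_gae(
    rewards=[1.0, 1.0, 1.0], values=[0.0, 0.0, 0.0],
    terminated=[False, True, False], last_value=0.0, gamma=1.0, lam=1.0,
)
print(adv)   # [2.0, 1.0, 1.0]; ignoring the boundary would give [3.0, 2.0, 1.0]
```

With `gamma=1`, `lam=1`, and zero values, each advantage is just the remaining reward in its own episode, which makes the boundary handling easy to verify by hand.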

Additional Resources

Books

Sutton & Barto — Reinforcement Learning: An Introduction (2nd ed.) — the bible
Szepesvári — Algorithms for Reinforcement Learning — concise, mathematical
Lambert — Reinforcement Learning from Human Feedback — the modern post-training book
Bertsekas — Reinforcement Learning and Optimal Control — for the control-theoretically inclined

Courses

David Silver — RL course (UCL/DeepMind, 2015) — the canonical lecture series
Sergey Levine — Deep RL (UC Berkeley, CS285) — the modern course
Emma Brunskill — RL (Stanford, CS234)
Pieter Abbeel — Foundations of Deep RL (YouTube) — 6-lecture series, excellent intro

Code You Should Read

CleanRL — single-file implementations of every major algorithm
Stable-Baselines3 — the production reference
TRL (Hugging Face) — RLHF/DPO/GRPO at scale
trlX — the older RLHF reference; still useful
DreamerV3 (official) — model-based reference
TD-MPC2

Environment Libraries

Gymnasium — the standard API
MuJoCo via Gymnasium — continuous control benchmarks
Atari via Gymnasium — the classic discrete-action benchmarks
DM Control — DeepMind's continuous control suite
Isaac Lab — massively parallel sim for locomotion and manipulation
PettingZoo — multi-agent environments
MineRL — Minecraft as a hard-exploration benchmark
NetHack Learning Environment

Communities

r/reinforcementlearning — the most active general forum
GPU Mode Discord — for the RL-at-scale engineering side
Eleuther AI Discord — for the RL-for-LLMs work
The RL Reading Group — weekly paper reading

Talks Worth Watching

Sergey Levine — "Underlying Assumptions of Deep RL"
John Schulman — "The Nuts and Bolts of Deep RL Research" — the practical advice document
Pieter Abbeel — many lectures

Quick Start Checklist

License

MIT License. See the LICENSE file for details.
