Score Matching from Scratch
Key Insight
Instead of learning the data density p(x) directly — which requires an intractable normalizing constant — score matching learns its score, the gradient of log-density ∇_x log p(x) that tells you which way to nudge a point to make it more likely. The practical trick is denoising score matching: add a known amount of Gaussian noise to each sample and train the network to point back toward the clean point, which provably equals the score of the noised distribution — so you never need the true density. Once you have the score, you generate with Langevin dynamics: repeatedly step in the score direction plus a little random jitter, like a ball rolling into high-probability valleys while being shaken so it doesn't get stuck in one spot. This project builds the whole loop on 2D toy data (8-Gaussians, swiss roll) — small enough that you can plot the learned vector field and watch the sampled points line up with the true distribution.