Skip to main content

Linear Probes for Facts


Open the black box one layer at a time and ask "where in here does the model know X?"


Key Insight

This project trains a linear probe — a single linear classifier — on the hidden activations at each transformer layer to recover whether a statement is factually true, mapping out which layers actually encode that knowledge.

Why This Matters

A working probe is a window into mechanistic interpretability: it shows that the model carries the factuality signal internally before it generates its answer, which is the starting point for explaining why and where a behavior happens inside the network.