Memorization Probe

Give the right opening line and the model finishes the page word for word.

Key Insight

This project feeds an open model the first few sentences of documents from its known pretraining corpus and counts how often the model continues with the verbatim original text, which is direct evidence of memorization.
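The counting step can be sketched as follows. This is a minimal illustration, not the project's actual code: `memorization_rate`, `continue_fn`, `prefix_len`, and `suffix_len` are hypothetical names, documents are assumed to be pre-tokenized into integer IDs, and the prompt is cut at a fixed token count rather than at sentence boundaries. In a real run, `continue_fn` would wrap the model's (typically greedy) decoding; here a toy lookup table stands in for the model.

```python
from typing import Callable, List, Sequence


def memorization_rate(
    docs: Sequence[Sequence[int]],
    continue_fn: Callable[[Sequence[int], int], Sequence[int]],
    prefix_len: int = 50,   # assumed prompt length in tokens
    suffix_len: int = 50,   # assumed number of tokens that must match exactly
) -> float:
    """Fraction of probed documents whose true continuation is reproduced verbatim."""
    hits = 0
    probed = 0
    for tokens in docs:
        # Skip documents too short to supply both a prompt and a target.
        if len(tokens) < prefix_len + suffix_len:
            continue
        prefix = list(tokens[:prefix_len])
        target = list(tokens[prefix_len:prefix_len + suffix_len])
        output: List[int] = list(continue_fn(prefix, suffix_len))[:suffix_len]
        probed += 1
        if output == target:
            hits += 1
    return hits / probed if probed else 0.0


# Toy stand-in for a model: it has "memorized" exactly one document.
memorized = {(1, 2, 3): [4, 5, 6]}


def toy_continue(prefix: Sequence[int], n: int) -> Sequence[int]:
    # Return the stored continuation if the prefix is known, else filler tokens.
    return memorized.get(tuple(prefix), [0] * n)[:n]


corpus = [
    [1, 2, 3, 4, 5, 6],     # memorized: continuation matches
    [7, 8, 9, 10, 11, 12],  # not memorized
    [1, 2, 9, 9, 9, 9],     # near-duplicate prefix, not memorized
]
rate = memorization_rate(corpus, toy_continue, prefix_len=3, suffix_len=3)
print(f"verbatim continuation rate: {rate:.3f}")
```

Exact match over a fixed-length suffix is the strictest criterion; a looser variant could count partial overlap, but that would measure something other than word-for-word reproduction.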

Why This Matters

LLMs reproduce a small but non-trivial fraction of their training data exactly, with consequences for copyright, privacy (leaked PII), and security. Knowing how much your model memorizes is a precondition for any deployment that touches sensitive data, and it doubles as a check on whether your deduplication pipeline is doing its job.
