Skip to main content

Memory-Mapped Tokens


Don't load the file into memory — let the operating system pretend it already is.


Key Insight

Memory mapping (via numpy.memmap) makes a file on disk look like an array in memory: you can read any slice of it without loading the whole file into RAM. After tokenizing a huge text corpus into one flat .bin file, training reads small chunks on demand.

Why This Matters

Language-model datasets are often far larger than RAM. Memory mapping lets you train on a corpus of any size while using almost no memory, because the operating system pulls in only the pieces you actually touch.