Loss-Spike Forensics


Make the loss explode on purpose, then practice the recovery you will need at 3 a.m.


Key Insight

This project deliberately triggers a loss spike, using either a too-large learning rate or a poisoned batch, then diagnoses it, fixes the cause, and resumes training from the last good checkpoint. It is forensics applied to a training run.

Why This Matters

Real pretraining runs spike, and a frontier run that cannot recover cleanly wastes days of GPU time. Rehearsing the detect-rollback-skip-resume loop on a toy model builds the muscle memory that protects expensive runs.
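
As a rough illustration of the detect-rollback-skip-resume loop, here is a minimal sketch on a one-parameter regression model in plain Python. Everything in it is an assumption made for the example rather than part of the project: the names `make_batches`, `loss_and_grad`, and `train`, the poisoned batch at index 27, the spike rule (loss above 10x the median of the last 5 losses), and the in-memory checkpoint taken every 5 steps. A real run would checkpoint optimizer and data-loader state to disk as well.

```python
import copy
from statistics import median


def make_batches(n_batches=40, batch_size=8, poisoned=(27,)):
    """Deterministic toy data for y = 3x. Batches listed in `poisoned`
    get targets scaled by 1000 to imitate a corrupted batch."""
    batches = []
    for i in range(n_batches):
        xs = [((i + j) % 7 - 3) / 3.0 for j in range(batch_size)]
        scale = 1000.0 if i in poisoned else 1.0
        ys = [3.0 * scale * x for x in xs]
        batches.append((xs, ys))
    return batches


def loss_and_grad(w, batch):
    """Mean squared error of the model y_hat = w * x, and its gradient in w."""
    xs, ys = batch
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


def train(batches, lr=0.1, window=5, spike_factor=10.0, ckpt_every=5):
    # All training state lives in one dict so a rollback restores it together.
    state = {"w": 0.0, "step": 0, "recent": []}
    checkpoint = copy.deepcopy(state)
    skipped = set()   # batch indices we refuse to train on again
    events = []       # (batch index, loss) for every detected spike

    while state["step"] < len(batches):
        i = state["step"]
        if i in skipped:              # skip: step past a known-bad batch
            state["step"] += 1
            continue

        loss, grad = loss_and_grad(state["w"], batches[i])
        # The update is applied before the spike is noticed, as in a real
        # run where the bad step has already landed when the loss is logged.
        state["w"] -= lr * grad
        state["step"] += 1

        # Detect: compare against the median of recent losses (with a floor).
        armed = len(state["recent"]) == window
        if armed and loss > spike_factor * max(median(state["recent"]), 1e-12):
            events.append((i, loss))
            skipped.add(i)
            state = copy.deepcopy(checkpoint)   # rollback to last good state
            continue                            # resume from the checkpoint

        state["recent"] = (state["recent"] + [loss])[-window:]
        # Checkpoint only after the step has passed the spike check.
        if state["step"] % ckpt_every == 0:
            checkpoint = copy.deepcopy(state)

    return state["w"], events, skipped


if __name__ == "__main__":
    w, events, skipped = train(make_batches())
    print("final w:", w)
    print("spikes:", events)
    print("skipped batches:", sorted(skipped))
```

In this sketch the checkpoint from step 25 is restored when batch 27 spikes, batches 25 and 26 are replayed, batch 27 is skipped, and training continues. Keeping the recent-loss window inside the checkpointed state means the detector stays armed immediately after a rollback instead of needing to warm up again.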