GCG Jailbreak

Optimize a string of token gibberish until even a safety-tuned model agrees to break its rules.

Key Insight

This project reproduces a small GCG attack — a gradient-based search that finds an adversarial suffix to append to a prompt so that an aligned open model produces unsafe output — and traces what the optimization is actually doing token by token.

Why This Matters

GCG-style attacks demonstrate that safety training is brittle: a suffix found against one open model often transfers to other models, including closed APIs; understanding the optimization is the basis for any serious jailbreak defense or evaluation.

Key Insight​

Why This Matters​

Key Insight

Why This Matters