Detokenizer Fuzzer
decode("a") + decode("b")does not always equaldecode("ab")— and a fuzzer will find the case that proves it.
Key Insight
This project generates random token sequences and compares two ways of turning them back into text: piece-by-piece detokenization (one token at a time, as a streaming server does) versus all-at-once detokenization (the entire sequence in one call). Because BPE merges bytes that may be invalid UTF-8 on their own, the two paths can disagree at multi-byte character boundaries — and the fuzzer hunts for examples where they do.
A Concrete Example
The emoji 🎉 is four bytes in UTF-8: F0 9F 8E 89. A BPE tokenizer might split those four bytes across two tokens — say token A decodes to F0 9F and token B decodes to 8E 89. Neither half is valid UTF-8 on its own.
- Piece-by-piece (streaming):
decode(A)seesF0 9F, can't form a character, and hands back the�replacement glyph;decode(B)does the same with8E 89. The user sees��. - All-at-once:
decode([A, B])concatenates the bytes first —F0 9F 8E 89— and only then interprets them, recovering the real 🎉.
Same two tokens, two different answers. That disagreement at the multi-byte boundary is exactly the bug the fuzzer is built to surface, and it explains why a streaming server needs more machinery than a single decode call.
Implementing all-at-once detokenization
All-at-once decoding is the simpler path to write, because you never have to deal with half a character:
- Take the full list of token IDs.
- For each token, look up the raw byte string it maps to and concatenate those bytes (not the decoded text) into one buffer.
- Decode the whole buffer to text in a single UTF-8 pass — e.g.
byte_buffer.decode("utf-8")in Python.
Because every multi-byte character's bytes are already sitting next to each other before decoding, valid UTF-8 always reassembles correctly. The streaming path is the hard one: it must hold back trailing bytes that might be the start of a multi-byte character and wait for the next token before emitting them — the same rolling-buffer idea the stop-string matcher relies on.
Why This Matters
Users notice when a streamed reply shows a � replacement glyph or a half-finished emoji, and the bug only appears on certain prompts. Production tokenizers ship a dedicated incremental decoder that buffers partial bytes until they form valid characters — finding a disagreement by hand is the cleanest way to understand why that extra machinery exists.