Custom Vocab Extension

Teaching a model a new "alphabet" means adding tokens and growing its embedding table to match.

Key Insight

Adding tokens to a tokenizer requires resizing the model's embedding matrix so every new ID gets a vector. Those new rows start untrained, so the model must learn what they mean.

Why This Matters

Specialized notations — like the SMILES strings that describe molecules — often tokenize poorly out of the box. Extending the vocabulary cleanly, without breaking the model's existing English, is a practical fine-tuning skill.

Key Insight​

Why This Matters​

Key Insight

Why This Matters