Custom Vocab Extension
Teaching a model a new "alphabet" means adding tokens and growing its embedding table to match.
Key Insight
Adding tokens to a tokenizer requires resizing the model's embedding matrix so every new ID gets a vector. Those new rows start untrained, so the model must learn what they mean.
Why This Matters
Specialized notations — like the SMILES strings that describe molecules — often tokenize poorly out of the box. Extending the vocabulary cleanly, without breaking the model's existing English, is a practical fine-tuning skill.