Grounding Head
Key Insight
Grounding means making a VLM point at where something is, not just say that it is there. This project adds it in the simplest possible way: extend the model's vocabulary with special tokens like <box> plus a small set of tokens that stand for quantized coordinates, so a bounding box becomes just a few extra tokens the model emits inside its normal text stream. The elegance is that no new architecture or loss is needed: predicting "the cat is at <box> 0.1 0.2 0.4 0.6" is the very same next-token prediction the LLM already performs, so spatial output is learned with the objective the model was built around, given training examples that contain boxes. Coordinates are quantized into a fixed grid of bins (rather than predicted as raw floats) precisely so that each one collapses to a single discrete token in the extended vocabulary; the decimals in the example are a readable stand-in for those bin tokens.
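As a minimal sketch of the vocabulary extension described above, the following uses a toy dictionary in place of a real tokenizer. The token names <loc_0> to <loc_999>, the 1000-bin grid, and the helper extend_vocab are illustrative assumptions, not names taken from this project. It shows that a grounded sentence is just a sequence of token ids, and that the training pairs are the usual shifted inputs and targets.

```python
# Toy stand-in for a tokenizer vocabulary: token string -> integer id.
base_vocab = {"the": 0, "cat": 1, "is": 2, "at": 3}

NUM_BINS = 1000  # assumed grid resolution


def extend_vocab(vocab, num_bins=NUM_BINS):
    """Return a copy of vocab with <box> and one token per coordinate bin appended."""
    extended = dict(vocab)
    # Hypothetical token names; a real project may name these differently.
    new_tokens = ["<box>"] + [f"<loc_{i}>" for i in range(num_bins)]
    for tok in new_tokens:
        if tok not in extended:
            extended[tok] = len(extended)  # ids stay contiguous
    return extended


vocab = extend_vocab(base_vocab)
print(len(vocab))  # 1005 = 4 base tokens + <box> + 1000 coordinate tokens

# "the cat is at <box> 0.1 0.2 0.4 0.6" with each coordinate as one bin token.
text = "the cat is at <box> <loc_100> <loc_200> <loc_400> <loc_600>"
ids = [vocab[tok] for tok in text.split()]
print(ids)  # [0, 1, 2, 3, 4, 105, 205, 405, 605]

# Standard next-token prediction: each position is trained to predict the next id.
inputs, targets = ids[:-1], ids[1:]
```

In a real model the embedding matrix and the output layer would have to grow by the same 1001 rows. With Hugging Face Transformers this is typically done with tokenizer.add_tokens followed by model.resize_token_embeddings.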
Data Preparation Pipeline
To teach a model this spatial mapping, the training data must be formatted to match standard instruction tuning structures:
- Normalization: Convert absolute pixel coordinates into relative coordinates between 0.0 and 1.0.
- Quantization: Map these continuous floats into discrete bins (e.g., 1000 bins) so they align with the newly added coordinate tokens in the vocabulary.
  - Example: [x1: 0.15, y1: 0.25, x2: 0.45, y2: 0.65] becomes the discrete sequence <box> 0.15 0.25 0.45 0.65 (each value is written here in decimal form but corresponds to one coordinate token; with 1000 bins, 0.15 falls in bin 150).
- Formatting: Wrap the quantized coordinates in a standard conversational JSON structure so the model learns to emit them naturally in response to user prompts:
  {
    "image": "cat_photo_001.jpg",
    "conversations": [
      { "role": "user", "content": "Where is the cat in this image?" },
      { "role": "assistant", "content": "The cat is located at <box> 0.15 0.25 0.45 0.65." }
    ]
  }
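The three steps above (normalization, quantization, formatting) can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual pipeline: the 640x480 image size, the pixel box, the helper names, and the <loc_i> token naming are all hypothetical, and the bin index is taken as floor(value * 1000). Unlike the JSON above, which writes coordinates as decimals, the sketch writes each coordinate as an explicit bin token.

```python
import json

NUM_BINS = 1000  # assumed number of coordinate bins


def normalize_box(box_px, width, height):
    """Convert an absolute-pixel box (x1, y1, x2, y2) to relative [0, 1] coordinates."""
    x1, y1, x2, y2 = box_px
    return (x1 / width, y1 / height, x2 / width, y2 / height)


def quantize(value, num_bins=NUM_BINS):
    """Map a float in [0, 1] to an integer bin index in [0, num_bins - 1]."""
    clamped = min(max(value, 0.0), 1.0)
    # Floor into equal-width bins; 1.0 is folded into the last bin.
    return min(int(clamped * num_bins), num_bins - 1)


def box_to_text(bins):
    """Render bin indices as <box> followed by hypothetical <loc_i> tokens."""
    return "<box> " + " ".join(f"<loc_{b}>" for b in bins)


def make_sample(image, question, answer_prefix, box_px, width, height):
    """Build one conversational training record containing a grounded answer."""
    bins = [quantize(v) for v in normalize_box(box_px, width, height)]
    return {
        "image": image,
        "conversations": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": f"{answer_prefix} {box_to_text(bins)}."},
        ],
    }


# Assumed example: a 640x480 image with the cat at pixels (96, 120) to (288, 312),
# which normalizes to (0.15, 0.25, 0.45, 0.65).
sample = make_sample(
    "cat_photo_001.jpg",
    "Where is the cat in this image?",
    "The cat is located at",
    (96, 120, 288, 312),
    640,
    480,
)
print(json.dumps(sample, indent=2))
```

The assistant turn in the printed record reads "The cat is located at <box> <loc_150> <loc_250> <loc_450> <loc_650>.", so each coordinate costs exactly one token once the vocabulary has been extended.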
How Frontier Models Handle Grounding
Modern frontier models increasingly streamline this process by moving away from glued-together components (a frozen vision encoder plus a projector) toward natively multimodal architectures:
- Training (Native Multimodal): Images and text are mapped into a shared embedding space from the start of training. The model is instruction-tuned to predict coordinate tokens alongside normal text using standard next-token prediction, so it learns to relate visual features to text tokens without grounding-specific architectural additions.
- Inference Pipeline:
  - Prefill: The model ingests the user's text and the entire sequence of image tokens in one pass, computing the visual context and storing it in the KV cache.
  - Decode: The model autoregressively generates the response ("The cat is located at..."). When it is time to output spatial tokens, each new token attends to the cached keys and values of the image tokens; in a natively multimodal decoder this is ordinary self-attention over the cache rather than a separate cross-attention module. Greedy decoding is often used for the coordinate tokens so that sampling noise does not shift the predicted box.
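The decoding policy mentioned above (greedy for coordinate tokens, sampling elsewhere) might look like the sketch below. It covers only token selection; prefill and the KV cache are left to the underlying model. The logits are toy dictionaries, and the <loc_i> naming, the helper names, and the temperature of 0.8 are assumptions made for illustration.

```python
import math
import random
import re

COORD_RE = re.compile(r"^<loc_\d+>$")  # hypothetical coordinate-token naming


def coords_pending(generated):
    """Return True if the most recent <box> still has fewer than 4 coordinates."""
    if "<box>" not in generated:
        return False
    last_box = len(generated) - 1 - generated[::-1].index("<box>")
    tail = generated[last_box + 1:]
    return len(tail) < 4 and all(COORD_RE.match(tok) for tok in tail)


def pick_next_token(logits, generated, rng, temperature=0.8):
    """Greedy inside a box span, temperature sampling everywhere else."""
    if coords_pending(generated):
        # Argmax: deterministic, so the box does not depend on sampling noise.
        return max(logits, key=logits.get)
    tokens = list(logits)
    # Unnormalized softmax weights; random.choices normalizes them.
    weights = [math.exp(logits[tok] / temperature) for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)

# Inside a box: one coordinate emitted so far, three still pending.
inside = ["The", "cat", "is", "located", "at", "<box>", "<loc_150>"]
logits = {"<loc_250>": 2.0, "<loc_251>": 1.9, ".": 0.1}
print(pick_next_token(logits, inside, rng))  # <loc_250>, regardless of the seed

# Outside a box: the token is sampled, so the result can vary with the seed.
outside = ["The", "cat"]
print(pick_next_token({"is": 1.0, "sits": 0.9}, outside, rng))
```

Switching policy per token keeps the free-text part of the answer varied while making the box reproducible for a given image and prompt.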