Inference Optimization

Key Insight

A trained VLM is only useful if it can serve answers quickly. This project takes an open VLM, runs it on a production inference engine such as vLLM or SGLang, and measures throughput (tokens per second) as the number of images per request grows. Image count is the knob that matters because each image expands into many image tokens, all of which occupy the KV cache and must be attended to at every decoding step. More images therefore mean a longer sequence, more memory, and lower throughput: the multimodal version of the usual long-context squeeze. The full serving toolkit (continuous batching, paged attention, quantization) is covered in the Inference Systems guide; here the goal is simply to get a feel for how the image-token budget trades against speed.
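To make the memory side of that trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a standard decoder where each token stores one key and one value vector per layer per KV head. Every number in it (layer count, KV heads, head dimension, 576 tokens per image, 128 text tokens) is a hypothetical placeholder rather than the configuration of any particular model, and the function names are illustrative only.

```python
def sequence_length(num_images: int, tokens_per_image: int, text_tokens: int) -> int:
    """Total tokens in one request: the text prompt plus all expanded image tokens."""
    return text_tokens + num_images * tokens_per_image


def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Hypothetical model and request settings (assumptions, not measured values).
CONFIG = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)
TOKENS_PER_IMAGE = 576   # assumed image-token expansion per image
TEXT_TOKENS = 128        # assumed text prompt length


def kv_cache_table(image_counts):
    """Return (num_images, sequence_length, kv_cache_bytes) for each image count."""
    rows = []
    for n in image_counts:
        seq = sequence_length(n, TOKENS_PER_IMAGE, TEXT_TOKENS)
        rows.append((n, seq, kv_cache_bytes(seq, **CONFIG)))
    return rows


if __name__ == "__main__":
    for n, seq, nbytes in kv_cache_table([0, 1, 2, 4, 8]):
        print(f"{n:>2} images  {seq:>5} tokens  {nbytes / 2**20:8.1f} MiB KV cache")
```

Under these assumed settings each token costs 128 KiB of cache, so the per-request footprint grows linearly with image count. This estimates memory only; actual throughput depends on the engine, the hardware, and the batch size, and has to be measured.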