Glossary
Terms from all guides in this repository, sorted alphabetically. Each guide's own concepts are included here; see individual guides for deeper context.
(2+1)D
A way to build a video network cheaply by factorizing a full 3D convolution (which mixes space and time at once) into two smaller steps: first a 2D spatial layer that processes each frame on its own, then a separate 1D temporal layer that mixes information across time at each pixel location. The name reads "2 plus 1": two spatial dimensions handled together, plus one time dimension handled separately. Splitting them this way costs far less compute than a true 3D layer and — crucially — lets you initialize the spatial half from a pretrained image model and add the temporal half fresh (see temporal inflation). The trade-off is that space and time never interact within a single layer, which can miss fast, complex motion that a full spatiotemporal layer would catch.
3D VAE
A VAE (variational autoencoder: a network that squeezes data into a small code and reconstructs it) built for video, so it compresses along time as well as the two spatial dimensions. A plain image VAE shrinks each frame's height and width; a 3D VAE also merges groups of nearby frames, exploiting the fact that consecutive frames barely differ. Typical ratios are about 4× in time and 8× in each spatial direction, so the latent grid has 4 × 8 × 8 = 256 times fewer positions than the pixel grid. The number of channels usually grows, though (e.g., from 3 RGB channels to 8 latent channels), so the overall footprint is about 256 × 3/8 ≈ 100× smaller. This is what makes modern video diffusion affordable: the diffusion model runs on the small compressed latent grid instead of raw pixels, so it sees maybe 30 latent "frames" where the original clip had 120. Because the same compressor is reused for every clip, the heavy work of learning to reconstruct video is paid once while the VAE is trained, not on every generation.
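The compression arithmetic in this entry can be checked with a short sketch. The function name and the default channel counts (3 in, 8 latent) are illustrative assumptions taken from the example figures above, not fixed properties of any particular VAE.

```python
def latent_compression(t_ratio=4, s_ratio=8, in_channels=3, latent_channels=8):
    """Return (grid_reduction, overall_reduction) for a video VAE."""
    # Fewer positions: the time ratio times the spatial ratio for height and width.
    grid_reduction = t_ratio * s_ratio * s_ratio
    # Each latent position stores more channels than an RGB pixel,
    # which gives back part of the saving.
    overall_reduction = grid_reduction * in_channels / latent_channels
    return grid_reduction, overall_reduction


grid, overall = latent_compression()
print(grid, overall)  # 256 96.0
```

With these assumed numbers the grid shrinks 256× and the stored data about 96×, which is the "roughly 100×" quoted above.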
ABA
Articulated-Body Algorithm — O(n) forward dynamics for rigid-body chains
Ablation
A controlled experiment that changes exactly one factor (a data step, a layer, a hyperparameter) while holding everything else fixed, to measure that factor's true effect.
Acceptance rate
In speculative decoding, the share of the draft model's guessed tokens that the big target model agrees with and keeps — accepted ÷ proposed. Like a junior writer drafting sentences that the editor either approves or crosses out: the higher the approval rate, the less the editor has to redo and the faster the work goes. Higher acceptance means bigger speedups.
Activation checkpointing
A memory-saving trick that discards most intermediate activations from the forward pass, keeping only a few checkpoints, and recomputes the rest during the backward pass, trading some extra compute (roughly one additional forward pass) for a lot less memory. Also called gradient checkpointing.
Activations
The intermediate outputs that flow between the layers of a network — the numbers each layer hands to the next during the forward pass. If weights are the fixed recipe a model learned, activations are the half-finished dish moving down the kitchen line, changing with every new input. Unlike weights, they are not saved after training; they are recomputed fresh each time the model runs on a new input.
AdaGN (adaptive group normalization)
The trick diffusion models use to push a condition — like a class label or the current denoising step — into the network through its normalization layers. First group normalization wipes a group of activations clean to mean 0 and variance 1, erasing their current style; then a tiny layer — a single small linear layer, just one little matrix of weights rather than a deep stack of layers — reads the condition and predicts two numbers, a scale and a shift (scale-and-shift), that re-stretch and re-center those activations. The layer can stay this small because its only job is to translate the condition into those two knobs — not to do any heavy image work — so a lightweight layer is plenty, and being tiny means it costs almost nothing even when one is dropped in at every normalization layer. Picture resetting a photo to neutral brightness and contrast, then letting the label "cat" turn those two knobs to a setting the model learned for cats. It is the group-norm cousin of AdaIN and AdaLN, and it is how a single model can be steered to generate one chosen class on demand. The adaptive part is exactly this: those two knobs are not frozen constants but are re-predicted for whatever condition you hand in, so the layer re-tunes itself for "cat" versus "dog" instead of behaving the same way every time.
AdaLN
Adaptive layer normalization; the conditioning mechanism in DiT. It is the layer-normalization cousin of AdaGN and AdaIN: a normalization layer first wipes a block's activations to a neutral mean-0, variance-1 state, then a tiny layer reads the condition (usually the current denoising step plus a class label) and predicts a fresh scale-and-shift. It is called adaptive because those scale and shift numbers are not fixed once and reused — they are re-computed for each condition, so the layer adapts itself to whatever you are asking it to generate. See also AdaLN-Zero.
AdaLN-Zero
The conditioning trick that powers DiT. Start with plain AdaLN (adaptive layer normalization): a normalization layer first wipes a block's activations to a neutral mean-0, variance-1 state, then a tiny layer reads the condition — usually the current denoising step plus a class label — and predicts a fresh scale-and-shift to re-stretch and re-center them, the layer-normalization cousin of AdaGN and AdaIN. The "-Zero" part makes two changes. First, the tiny layer predicts a third number — a gate — that multiplies the whole block's output before it is added back onto the residual stream (a gated path). Second, it initializes the layer that produces scale, shift, and gate so they all start at zero. With the gate at zero, every transformer block contributes nothing at the very start of training — the input slides straight through untouched, exactly like the identity shortcut of a residual connection. As training proceeds the gate gradually lifts off zero, so each block learns to add its effect gently instead of jolting a fragile, freshly-initialized network. Picture adding a new musician to a band but starting them on mute: you slowly turn up their volume from zero as they learn the song, rather than letting them blast over everyone on the first beat. That gentle start is a big part of why deep DiTs train stably.
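A minimal sketch of the gated residual path, assuming a deliberately simplified setup: the condition is a single number, the "tiny layer" is three scalar weights (the hypothetical `weights` dict), and the scale is applied as `1 + scale`, the convention used in the DiT reference code. Real models predict per-channel vectors instead.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to mean 0 and variance 1 (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_zero_block(x, cond, weights, sublayer):
    """One gated residual block; `weights` stands in for the tiny conditioning layer."""
    # Hypothetical simplification: one scalar each for scale, shift, and gate.
    scale = weights["scale"] * cond
    shift = weights["shift"] * cond
    gate = weights["gate"] * cond
    h = [v * (1.0 + scale) + shift for v in layer_norm(x)]
    out = sublayer(h)
    # The gate multiplies the block's output before it joins the residual stream.
    return [xi + gate * oi for xi, oi in zip(x, out)]


zero_init = {"scale": 0.0, "shift": 0.0, "gate": 0.0}
x = [0.5, -1.0, 2.0]
y = adaln_zero_block(x, cond=3.0, weights=zero_init, sublayer=lambda h: [10 * v for v in h])
print(y == x)  # True: with a zero gate the block is an identity map
```

Once training moves the gate weight away from zero, the same call starts adding the sublayer's output to the residual stream.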
Adam
Adaptive Moment Estimation — gradient-descent optimizer that maintains per-parameter running averages of the first (mean) and second (uncentered variance) moments of the gradients to compute individual adaptive learning rates.
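As an illustration, here is one Adam update for a single scalar parameter. The function name is hypothetical and the defaults are the commonly used ones (lr = 1e-3, β₁ = 0.9, β₂ = 0.999); this is a sketch of the textbook update, not any library's implementation.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


p, m, v = adam_step(param=1.0, grad=50.0, m=0.0, v=0.0, t=1)
print(round(p, 6))  # 0.999
```

On the first step the update has magnitude close to `lr` whatever the gradient's scale, because the corrected moments reduce to the gradient and its square.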
AdamW
Adam optimizer with decoupled weight decay: the regularization term shrinks the parameter directly rather than being folded into the gradient update
Adapter
A small trainable layer slipped into an otherwise frozen pretrained network so it can pick up a new skill — or accept a brand-new input modality — without the cost of retraining the whole thing. The usual design is a bottleneck: squeeze the incoming features down to a tiny dimension, push them through a nonlinearity, expand them back to the original size, and add the result onto the untouched main path, so at the start the adapter outputs almost nothing and only gradually learns its small correction. Like a travel plug adapter that lets your existing appliance work in a foreign socket — the appliance (the big pretrained model) is left exactly as it is, and the cheap little adapter does all the converting. Example: bolt an adapter onto a frozen CLIP image encoder so it can suddenly read depth maps, training only the adapter's handful of weights while millions of frozen ones stay put. LoRA is one popular variety of adapter.
Adaptive
"Adaptive" means a layer does not use one frozen setting for every input — it adjusts its setting on the fly based on what you ask for. In a plain normalization layer the scale-and-shift knobs are learned once during training and then locked: every single input gets the exact same two numbers, like a radio permanently soldered to one station. The word adaptive flips that. Instead of fixed knobs, a tiny layer reads a condition you hand in — a style code, a class label like "cat," or the current denoising step — and predicts fresh knobs for that specific input, so the same layer re-tunes itself every time. Picture the difference between an old thermostat bolted to 70°F no matter who walks in, and a smart thermostat that reads the room — who is home, the time of day, the weather — and picks a new target temperature on its own. Same machine, but its behavior adapts to the situation. That is exactly what the "Ada" stands for in AdaGN, AdaIN, and AdaLN: the scale and shift are predicted from the condition instead of staying frozen, so the network adapts to whatever you are generating right now.
Adaptive instance normalization (AdaIN)
A way to push a "style" into a network's features: first normalize a feature map so it has mean 0 and variance 1 (wiping out its current style), then rescale and shift it using two numbers — a scale and a bias — predicted from a style code. Like erasing a drawing down to a plain pencil outline and then re-coloring it from a palette you hand in. StyleGAN applies AdaIN at every layer so a single style code can steer image features at every scale. The adaptive in the name means the scale and bias are not baked-in constants: they change with each style code you provide, so the very same layer re-paints features differently for every style instead of applying one fixed look.
ADD (Adversarial Diffusion Distillation)
The recipe behind few-step models like SDXL Turbo: distill a slow multi-step diffusion model into a 1–4-step student, but add a GAN-style discriminator that judges whether the student's quick output looks real. Plain distillation alone makes few-step images blurry, because regressing toward an average washes out detail; the discriminator punishes that blur and forces crisp results. Like training a sprinter to copy a marathoner's route in a fraction of the strides while a sharp-eyed judge rejects any shortcut that looks sloppy — so speed rises without the output going soft. Compared with an LCM (pure consistency distillation), ADD trades a fiddlier training setup for sharper few-step samples.
Admission control
Refusing requests early when capacity is saturated, to protect SLOs for accepted requests
Advantage
A(s, a) = Q(s, a) − V(s) — how much better than the baseline this action is
Aesthetic score
A single number predicting how visually pleasing a human would find an image — used to judge generators when realism metrics like FID miss the question of "is it beautiful?" You produce it with a small predictor — usually a tiny linear head bolted on top of frozen CLIP image embeddings — that has been fit to a dataset of images people rated on a 1–10 scale (the LAION-Aesthetics predictor is the best-known example). To score a new image you embed it with CLIP and pass that embedding through the trained head; the output approximates the average rating a person would give. Think of a film critic who has watched thousands of movies alongside their audience scores and can now glance at a new one and guess its rating. Because it is a learned proxy for taste, it inherits the biases of whoever did the original rating.
Agent
An LLM placed in a loop so it can plan, choose a tool, act, observe the result, and repeat until a task is finished — turning a one-shot answerer into something that carries out multi-step work, like a worker who keeps taking the next action until the whole job is done.
AI (arithmetic intensity)
FLOPs per byte of memory accessed; determines roofline position
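A rough worked example for a matrix multiply of shape (m × k) · (k × n), assuming each matrix crosses the memory bus exactly once (an idealization) and 2-byte elements such as bfloat16. The function name is hypothetical.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C = A @ B with A of shape (m, k) and B of shape (k, n)."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved


big = matmul_arithmetic_intensity(4096, 4096, 4096)
skinny = matmul_arithmetic_intensity(1, 4096, 4096)
print(round(big, 1))   # 1365.3
print(skinny < 1.0)    # True: a single-row matmul does under 1 FLOP per byte
```

The large square case sits far to the right on a roofline plot (compute-bound), while the single-row case, typical of token-by-token decoding, sits far to the left (memory-bound).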
Alignment (multimodal)
Making embeddings from different modalities comparable in a shared space
Alignment stack
The layered sequence of post-training steps that turns a raw base model into a helpful, safe assistant — typically SFT, then a reward model, then RLHF (or DPO). Like the stations on an assembly line, each layer builds on the one below it: the model first learns to follow instructions, then learns what people prefer, then is tuned to actually prefer it. "Alignment" here means getting the model's behavior to match human intent.
All-to-all token routing
In a Mixture-of-Experts (MoE) model spread across many GPUs, tokens must be sent to the specific GPU that holds the expert they need. "All-to-all" is the massive communication step where every GPU simultaneously sends its tokens to every other GPU and receives tokens in return. Imagine a busy postal sorting center where workers at different tables all throw packages to each other's tables at the exact same time—it requires incredibly fast network connections to prevent a traffic jam.
AllReduce
A team operation in distributed computing: every worker (rank) starts with its own tensor of the same shape, and AllReduce combines them element by element (most often by summing) and hands the same combined result back to everyone. (A tensor here is just a grid of numbers; "summing tensors" means lining up equal-shaped grids and adding matching cells: [1,2,3] + [10,20,30] = [11,22,33].) Imagine four friends who each counted part of a crowd: they pool their counts, add them up, and all walk away knowing the same total. In tensor-parallel inference each GPU computes part of a layer, and an AllReduce combines those partial results so every GPU ends up holding the full answer before the next layer runs.
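The semantics (not the communication pattern) can be simulated in a few lines. This sketch keeps all "ranks" in one process and uses a hypothetical helper name; real systems do the same reduction over the network, for example with a ring algorithm.

```python
def all_reduce_sum(rank_tensors):
    """Simulate a sum AllReduce: every rank ends up with the element-wise total."""
    total = [sum(values) for values in zip(*rank_tensors)]
    # Each rank receives its own copy of the identical result.
    return [list(total) for _ in rank_tensors]


ranks = [[1, 2, 3], [10, 20, 30]]
print(all_reduce_sum(ranks))  # [[11, 22, 33], [11, 22, 33]]
```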
Alt-text
The short text description attached to an image in a web page's HTML so screen readers can announce it and so the text still shows if the picture fails to load (the "alt" is short for alternative text). Because it sits right next to billions of web images, alt-text is the free, ready-made caption that web-scraped datasets like LAION use as each image's label — which is why it is the raw material the whole multimodal-data pipeline starts from. The catch is that it was written for accessibility, not for training: it is often missing, a bare filename like "IMG_2025.jpg", keyword spam stuffed in for search ranking, or simply unrelated to the picture — which is exactly why pipelines filter it by CLIP score and rewrite it into synthetic captions. Like the one-line label taped to the back of a photo in a shoebox: handy when it is accurate, useless when someone scribbled the wrong date.
AMP
Automatic Mixed Precision — running operations in 16-bit floats (float16 or bfloat16) where it is safe, to save memory and speed up training while keeping a float32 copy of the weights.
AnimateDiff
A 2023 technique that adds motion to an existing Stable Diffusion image model without retraining it. The trick is a separately trained motion module — a small stack of time-aware (temporal) layers — that you slide in between the frozen image model's blocks: the image model still draws each frame, and the motion module makes consecutive frames move together coherently. Because the module is trained once on generic video and then frozen, you can drop it into almost any community checkpoint (a custom-art-style fine-tune, say) and animate that style for free. It is the most popular concrete instance of temporal inflation packaged as a reusable add-on rather than a full model.
Anomaly detection
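A debugging mode (torch.autograd.set_detect_anomaly(True)) that makes autograd check each backward operation and raise an error as soon as one produces a NaN gradient, along with a traceback of the forward operation that created the failing backward function. It slows execution noticeably, so enable it only while debugging.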
AnyRes
A way to feed a VLM images of any size and shape instead of squashing every picture to one fixed square. AnyRes splits the image into a grid of tiles that follows its native aspect ratio, runs the image encoder on each tile separately, and concatenates all the resulting image tokens, usually alongside one extra down-scaled copy of the whole image for global context. Analogy: rather than shrinking a newspaper page until the text is an unreadable blur, you photograph it column by column at full zoom and lay the close-ups side by side. Example: a tall screenshot 768 pixels wide and 1536 pixels high might be cut into two 768×768 tiles stacked vertically, doubling the tile tokens (plus those of the global copy) but keeping small text legible. That is why AnyRes (introduced with LLaVA-NeXT; InternVL2 uses a similar dynamic-tiling scheme) sharply improves OCR-heavy and dense-chart benchmarks, at the cost of a longer, slower token sequence.
AOTInductor
Ahead-of-Time Inductor: a deployment path built on torch.export that compiles a captured model graph into a standalone shared library (.so) ahead of time, enabling C++-only inference without a Python runtime.
Application
The specific real-world job a model is being built to do — for example, "answer customer-support questions about our refund policy," "summarize internal engineering tickets," or "write product descriptions in our brand voice." A model that scores high on a generic public benchmark can still flop on your application if the two don't match, the way a chef who aces a fine-dining contest may still be the wrong hire for your taco truck. That mismatch is why teams build a small targeted eval shaped like their application instead of trusting a famous leaderboard number.
AprilTag
Square fiducial marker with a known code; widely used for pose ground truth
Arena
A way to rank chat models by having them go head-to-head: two models answer the same prompt, a human or LLM judge picks the winner, and many such duels are turned into Elo ratings — the scoring system used for chess players. The public LMSys Chatbot Arena is the best-known example.
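A minimal sketch of the classic online Elo update applied to one duel. The K-factor of 32 and the starting rating of 1000 are assumptions for illustration; public leaderboards may instead fit ratings over all battles at once.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a, r_b, a_won, k=32):
    """Return the new (r_a, r_b) after one head-to-head comparison."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


print(elo_update(1000, 1000, a_won=True))  # (1016.0, 984.0)
```

A win against an equally rated model moves both ratings by K/2; beating a much weaker model moves them far less, because the win was already expected.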
argmax
The "which one is biggest?" operation: given a list of scores it returns the position of the largest one, not the value itself. The name is short for argument of the maximum — in math the "argument" is the input you hand a function, so argmax answers "which input gives the biggest output?" and returns that input's position. If the logits are [1.2, 4.8, 0.3], argmax is 1 — the index of 4.8 — which the model reads as "pick token #1." Like scanning a class's test scores and naming the top student rather than reading out their mark. Greedy decoding is just argmax applied to the logits at every step, so it always makes the same choice and never gambles.
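The example above in code, as a small pure-Python sketch (libraries such as NumPy and PyTorch ship their own `argmax`):

```python
def argmax(scores):
    """Return the index of the largest score (the first one if there is a tie)."""
    return max(range(len(scores)), key=lambda i: scores[i])


logits = [1.2, 4.8, 0.3]
print(argmax(logits))  # 1
```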
Artifact
An unwanted distortion that a process adds to a signal — something that was not in the original but appears in the output, usually because detail was thrown away to save space. In lossy audio or image compression (an MP3, a JPEG, or a neural codec), squeezing the data too hard leaves audible smears, metallic ringing, or muffled detail in sound, and blocky squares or halos around edges in images. Like a photocopy of a photocopy — each pass loses fidelity and adds smudges the original never had. Example: encoding music at a very low bitrate can make cymbals sound watery or add a faint "underwater" warble — those are compression artifacts.
Aspect-ratio bucketing
A training trick for image generators: instead of forcing every training image into one square shape by cropping (which slices off the edges of tall portraits and wide landscapes, so the model never learns to compose anything but squares), you sort images into a handful of "buckets" by their shape — tall, wide, square — and build each batch from a single bucket so every image in it shares one resolution. This is necessary because a batch must be a single tensor of one shape, so images of different sizes cannot share a batch unless they are grouped first. The model then learns to generate at many aspect ratios, not just 1:1. Like a photo lab that sorts prints into 4×6, 5×7, and 8×8 trays before processing, so each tray runs through the machine at its own size instead of every photo being trimmed square. Example: a 1280×720 photo goes in the 16:9 bucket; afterward you can ask for a 16:9 image and get a properly-framed one instead of a cropped square.
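A sketch of the sorting step, assuming a hypothetical three-bucket table; real training setups typically use many more buckets and also resize each image to its bucket's resolution.

```python
import math

# Hypothetical bucket table: name -> (width, height) shared by every image in the bucket.
BUCKETS = {"1:1": (1024, 1024), "16:9": (1280, 720), "9:16": (720, 1280)}


def assign_bucket(width, height, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest (in log space) to the image's."""
    log_ratio = math.log(width / height)
    return min(
        buckets,
        key=lambda name: abs(log_ratio - math.log(buckets[name][0] / buckets[name][1])),
    )


def group_by_bucket(image_sizes):
    """Group image indices by bucket so each batch can be drawn from a single bucket."""
    groups = {}
    for index, (w, h) in enumerate(image_sizes):
        groups.setdefault(assign_bucket(w, h), []).append(index)
    return groups


sizes = [(1280, 720), (600, 900), (512, 512), (1920, 1080)]
print(group_by_bucket(sizes))  # {'16:9': [0, 3], '9:16': [1], '1:1': [2]}
```

Comparing ratios in log space treats "twice as wide" and "twice as tall" as equally far from square, which is one reasonable choice among several.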
ASR (Automatic Speech Recognition)
Automatic Speech Recognition — the task of turning recorded speech into written text, what your phone does when it transcribes a voice message. A modern ASR model such as Whisper reads a mel spectrogram of the audio and emits the words one piece at a time, just as a person listens and types along. Concrete example: feed it a clip of someone saying "turn on the lights" and it returns the string "turn on the lights". Because the model must map a long, wobbly sound wave onto a short line of text, the hard parts are accents, background noise, and rare words — which is why fine-tuning on a specific domain or language helps so much.
ATen
The C++ tensor library underneath PyTorch's Python frontend
Attention
The operation softmax(QKᵀ/√d) V, where d is the dimension of each key vector: content-addressable token mixing, and the core of every transformer. The softmax step turns the raw query–key match scores into weights between 0 and 1 (summing to 1 for each query) that decide how much each attended token's value contributes to that query's output.
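The formula can be written out in plain Python for small lists of vectors. This is an unoptimized sketch with no causal mask and a single head; the helper names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors, one output row per query.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return outputs


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]   # two identical keys -> equal weights
V = [[0.0, 2.0], [4.0, 6.0]]
print(attention(Q, K, V))  # [[2.0, 4.0]]
```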
Attention sink
The first few tokens of a sequence, which attention heads keep putting weight on no matter what those tokens actually say. They are called a sink in the plumbing sense — a drain where leftover water collects: on every step the softmax has to spread a full 100% of attention across the tokens, so when a head has nothing important to look at, that spare attention drains into these first tokens. Because the model leans on them, KV cache eviction schemes deliberately keep these tokens even when they look unimportant, which keeps quality stable in long-context serving.
AudioSet
A large public dataset from Google of about two million 10-second clips taken from YouTube, each tagged with the kinds of sound it contains (dog bark, guitar, rain, speech) drawn from a vocabulary of 527 labels. Think of it as "ImageNet for sound" — a big, broadly labeled collection people use to teach models what everyday audio events sound like. Because each clip comes with short descriptive tags, AudioSet is also a handy source of (audio, caption) pairs for training a model to describe what it hears.
Autoencoder
A neural network that learns to copy its input to its output through a narrow middle layer. It has two halves: an encoder that squeezes the input down to a small set of numbers, and a decoder that rebuilds the original from those numbers. Because the middle is much smaller than the input, the network cannot simply memorize — it is forced to keep only the most important features, like writing a short summary of a long article and then reconstructing the article from the summary. That small middle representation is called the latent space.
autograd
The reverse-mode automatic differentiation engine
Autoregressive model
A model that generates a sequence one piece at a time, where each new piece is predicted from all the pieces produced so far — like writing a sentence word by word, with every word depending on the words already on the page. The name says what it does: auto means "self" and regression means "predicting a value from earlier values," so the model predicts each new piece by regressing on its own previous outputs — it feeds on itself. For images, an autoregressive model (such as PixelCNN) draws pixel by pixel in a fixed order. This makes the math clean and the samples sharp, but generation is slow because each step has to wait for the previous one, with no way to compute them all at once.
AV1
A modern, royalty-free video codec (a codec is the set of rules for squeezing video into a small file). AV1 compresses noticeably better than older codecs like H.264, often 30–50% smaller files at the same quality, but software decoding is slower, so reading AV1 video back into frames costs more CPU time. Analogy: it is like a denser ZIP format that saves disk space but takes longer to unzip. Example: storing 100 clips as AV1 .webm files might use half to two-thirds of the disk of the same clips as H.264 .mp4, but take noticeably longer to decode each clip during training.
AWQ
Activation-aware Weight Quantization: preserve weights important to large activations
Backend
A device- or library-specific implementation that actually executes an operation's kernel — for example the CPU, CUDA, or MPS backend. The dispatcher routes each call to the correct backend based on the tensor's device and dtype.
Backward pass
The process of going through the network in reverse — from the output back to the first layer — to compute gradients: how much each weight should change to lower the error. It is used only during training, right after each forward pass: the forward pass makes a prediction, the loss measures how wrong it was, and the backward pass traces that error back to assign blame to each weight (using the chain rule). Like a chef tasting a dish that came out too salty and working backwards through the recipe to figure out which step added too much. A model that is only serving answers (inference) never runs the backward pass — that is why serving is cheaper than training.
Base model
A model fresh out of pretraining that only continues text and has not yet been taught to follow instructions — a brilliant autocomplete, not yet an assistant.
Batch
A small group of examples (sentences, images, prompts) that the model processes together in a single forward pass instead of one at a time. Like a chef who slices a whole basket of onions at once rather than picking up the knife for each onion separately — the GPU pays a fixed startup cost per pass, so doing 32 examples in one shot is far faster than 32 single passes. In training, the batch size sets how many examples contribute to each gradient update; in quantization methods like GPTQ, a small calibration batch of representative inputs is run through the model to estimate which weights matter most. See also Batching, which is the same idea applied to grouping inference requests on a serving stack.
Batching
Grouping several inference requests so the GPU runs them together in one forward pass instead of one at a time. Like an elevator that waits a moment to gather a few people and carry them up in a single trip rather than going up and down for each person separately — every rider's start is a touch slower, but far more people move per minute. That is the trade-off batching makes: higher throughput (requests finished per second) at a small cost in latency (how long one request waits). Production servers like Triton Inference Server do this grouping for you automatically; see continuous batching for the version that lets riders hop on and off mid-trip.
BC
Behavior Cloning — supervised imitation of demonstrator actions
Behavior policy
The policy that generated the data, in off-policy or offline RL
Bellman equation
The recursive consistency condition V(s) = E[r + γV(s')]
Benchmark
A fixed, shared test set used to measure and compare models on a task — like a standardized exam everyone sits so scores line up side by side. MMLU tests knowledge and GSM8K tests math; a benchmark is only meaningful while models have not already seen its answers (see contamination).
Best-of-N
An inference trick that samples N candidate answers to the same prompt and keeps the single one a scorer — usually a reward model or verifier — rates highest, like writing several drafts of an email and sending only the best.
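A minimal sketch with the sampler and scorer passed in as plain functions. Both stand-ins below are toys (hypothetical): in practice `generate` would call a model with sampling enabled and `score` would be a reward model or verifier.

```python
def best_of_n(prompt, generate, score, n=4):
    """Sample n candidates and return the one the scorer rates highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))


# Toy stand-ins: a "sampler" that hands out canned drafts in order,
# and a "scorer" that simply prefers longer answers.
drafts = iter(["ok", "sure, here is a full answer", "no", "maybe"])
best = best_of_n("Explain X", generate=lambda p: next(drafts), score=lambda p, a: len(a))
print(best)  # sure, here is a full answer
```

Cost grows linearly with N, since every candidate is a full generation plus one scoring call.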
bfloat16
16-bit float with fp32's exponent range — the modern default for training (also written bf16, BF16)
Bias correction
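An adjustment applied in the Adam family of optimizers to counteract the zero-initialization of the moment estimates. Because both running averages start at zero, they are biased toward zero during the first steps; dividing them by 1 − β₁ᵗ and 1 − β₂ᵗ removes that bias. Without it, the early updates are mis-scaled: with the default β values they come out too large, because the second-moment estimate in the denominator is underestimated more than the first-moment estimate in the numerator.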
Biases
The smaller, additive group of learned parameters in a layer — the b in y = xW + b. After the weights combine the inputs, each output neuron adds its own bias: a fixed offset that shifts the result up or down no matter what the input was. Like the + b that lets a line y = mx + b sit above or below the origin, or a starting balance in a bank account before any transactions — it gives each neuron a baseline to lean toward.
BigGAN
A large class-conditional GAN from 2018 that was the first to make GAN-generated images look convincingly realistic at high resolution across the thousand everyday categories of ImageNet (dogs, mushrooms, coffee mugs, and so on). Its recipe was mostly "make everything bigger and steadier": much larger batches, more parameters, and a projection discriminator to feed in the class label cleanly. Like discovering that a decent home cake recipe just needed a bigger oven, more eggs, and a steadier hand to reach bakery quality. Its best-known trick is the truncation trade-off: drawing the input noise closer to the average gives cleaner, more typical images at the cost of variety, so you can dial between "safe and pretty" and "wild and diverse."
Bitrate
How many bits are used to represent one second of sound (or video) — the data budget per second. A higher bitrate keeps more detail and sounds closer to the original; a lower bitrate saves storage and bandwidth but blurs detail and adds artifacts. For a neural codec it maps directly to how many tokens per second the audio is turned into — fewer tokens, lower bitrate. Like the quality slider when you export a photo: more data per image captures finer detail at the cost of a bigger file. Example: MP3 music is often 128–320 kbps (kilobits per second), while EnCodec can compress speech down to about 1.5 kbps — far smaller, but with more audible loss.
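For a neural codec the token-to-bitrate mapping is simple arithmetic. The sketch below assumes a codec that emits a fixed number of frames per second with several codebook indices per frame; the figures used (75 frames per second, 1024-entry codebooks, 2 codebooks) are assumptions chosen to match the commonly cited EnCodec 24 kHz low-bitrate setting, and the function name is hypothetical.

```python
import math


def codec_bitrate_bps(frames_per_second, num_codebooks, codebook_size):
    """Bits per second when each frame stores one index per codebook."""
    bits_per_token = math.log2(codebook_size)  # 1024 entries -> 10 bits
    return frames_per_second * num_codebooks * bits_per_token


print(codec_bitrate_bps(75, 2, 1024))  # 1500.0 bits per second, i.e. 1.5 kbps
```

Doubling the number of codebooks doubles both the tokens per second and the bitrate.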
Blackwell
NVIDIA's 2024 GPU architecture (B100, B200, and the later B300 "Blackwell Ultra") and the successor to Hopper. Like swapping a sports car engine for a more powerful one of the same shape, it keeps the same overall programming model as Hopper but doubles down on low-precision math (higher FP8 throughput and brand-new FP4 Tensor Core support), which is what makes it a preferred chip for the largest 2025-era training and serving runs.
BM25 (Best Matching 25)
A classic keyword-search ranking function that scores a document by how often the query's words appear in it, weighting rare words more heavily, giving diminishing returns to repeated occurrences, and correcting for document length so long documents do not win by sheer size. Think of it like a librarian scanning pages for your exact search words and ranking pages where those words appear most often (especially unusual words) higher on the list. It is the sparse (exact-word) counterpart to dense embedding search.
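A compact sketch of the Okapi BM25 score. It assumes whitespace tokenization, the common parameter choices k1 = 1.5 and b = 0.75, and the "+1 inside the log" form of IDF; production search engines differ in these details.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query`; higher means a better match."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "quantum entanglement in the lab",
]
scores = bm25_scores("quantum cat", docs)
print(scores.index(max(scores)))  # 2: the rare word "quantum" outweighs the common "cat"
```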
Bootstrapping
Using a current estimate (e.g., V(s')) in the target instead of a full return
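A sketch of the simplest bootstrapped update, tabular TD(0), assuming a value table stored as a dict and hypothetical state names. The target uses the current estimate of the next state's value rather than waiting for the full return.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """Move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return target


V = {"a": 0.0, "b": 10.0}
target = td0_update(V, "a", r=1.0, s_next="b", alpha=0.5, gamma=0.9)
print(target, V["a"])  # 10.0 5.0
```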
Bottleneck
The single slowest stage in a pipeline, which caps the overall speed; in training this is often the data loader rather than the model.
bpd (bits per dimension)
The standard likelihood metric for image models: the average number of bits needed to store each number (dimension) in an image, computed as -log₂ p(x) / D. Think of it as how "surprised" the model is per pixel-value — a better model predicts the data more confidently and so needs fewer bits, like a good compressor that zips a file smaller. A model that thinks all 256 pixel values are equally likely scores exactly 8 bits per dimension, so any real model must come in under 8 to show it learned something.
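The formula and the 8-bit baseline can be checked directly. This sketch assumes the log-likelihood is given in nats (natural log), as most frameworks report it, and uses a 32×32 RGB image as the example size.

```python
import math


def bits_per_dim(log_prob_nats, num_dims):
    """Convert a log-likelihood in nats into bits per dimension."""
    return -log_prob_nats / (num_dims * math.log(2))


num_dims = 32 * 32 * 3                       # one 32x32 RGB image
uniform_log_prob = num_dims * math.log(1 / 256)  # every value equally likely
print(round(bits_per_dim(uniform_log_prob, num_dims), 6))  # 8.0
```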
BPE
Byte-Pair Encoding — subword tokenization by greedy frequent-pair merges. It starts from raw bytes and repeatedly glues together the neighboring pair that appears most often, building up reusable chunks. For example, on lots of English text BPE notices t and h sit side by side constantly and merges them into th; a later round merges th + e into the. So a common word like the ends up as a single token, while a rarer word like tokenizer is left as familiar pieces such as token + izer. "Greedy" means each round simply takes the single most-frequent merge available, never looking ahead to see whether a different choice would pay off later.
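Two greedy merge rounds on a toy corpus, sketched in plain Python. For readability the sketch works on characters rather than raw bytes, and the word counts are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency; return the top pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word is split into symbols and mapped to its count.
words = {("t", "h", "e"): 5, ("t", "h", "i", "s"): 3, ("h", "a", "t"): 2}
first = most_frequent_pair(words)        # ('t', 'h')
words = merge_pair(words, first)
second = most_frequent_pair(words)       # ('th', 'e')
words = merge_pair(words, second)
print(first, second, ("the",) in words)  # ('t', 'h') ('th', 'e') True
```

Each round only looks at the current counts, which is the "greedy" part: no merge is chosen for what it might enable later.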
Broadcasting
A tensor operation trick where a smaller tensor is automatically stretched to match the shape of a larger one without actually copying data in memory. Like painting a stripe down a wall: you only load the stripe pattern once, but apply it everywhere as you roll.
C++ extension
A custom operation written in C++ (optionally with CUDA), compiled and loaded so it can be called from Python like a built-in PyTorch op.
C-space
Configuration space — the abstract space of joint configurations
c10
PyTorch's core C++ library; the name is usually explained as a pun on Caffe2 and ATen ("C" plus "ten")
Calibration
Running a few representative batches of data through a model to learn how big its activations typically get — their usual smallest and largest values — before quantizing it. Knowing that range lets static quantization choose one fixed int8 "scale": the conversion factor that maps the real numbers onto the 256 slots an int8 can hold. It is like measuring the tallest guest you expect before setting a doorframe height — check the real range once, then size the fixed scale so almost nothing gets clipped.
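A minimal sketch of the idea, assuming symmetric quantization (one scale derived from the largest absolute activation seen during calibration). Real toolkits offer other range estimators; the names and sample values here are hypothetical.

```python
def calibrate_scale(batches, qmax=127):
    """Observe activations on sample batches and pick one fixed symmetric int8 scale."""
    max_abs = max(abs(x) for batch in batches for x in batch)
    return max_abs / qmax

def quantize(x, scale, qmin=-128, qmax=127):
    """Map a real value onto the int8 grid, clipping anything out of range."""
    return max(qmin, min(qmax, round(x / scale)))

calibration_batches = [[-0.5, 1.2, 0.3], [2.54, -1.0, 0.0]]
scale = calibrate_scale(calibration_batches)   # about 0.02 per int8 step
# 5.0 was never seen during calibration, so it gets clipped to the top slot.
q_values = [quantize(x, scale) for x in [2.54, -1.0, 5.0]]
```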
Camera control
Steering not just what moves in a generated video but where the virtual camera goes — pan, zoom, orbit, dolly — by feeding the model an explicit camera path. The dominant method represents each frame's camera as Plücker coordinates (a six-number description of the ray every pixel looks along) and adds those as an extra input, so the model can keep objects placed consistently as the viewpoint moves. Named systems that do this include CameraCtrl and MotionCtrl. It is one of the control surfaces — alongside the motion score and depth- or pose-conditioning — that turn a raw video generator into a directable tool.
Canny edge detector
A classic (non-neural) algorithm that reduces a photo to a clean black-and-white map of its outlines — the lines where brightness changes sharply, such as the border between a face and the background. It works by measuring the gradient (how fast pixel brightness changes) at every point, keeping only the local peaks so the edges come out one pixel thin, and then linking those peaks into continuous contours. Picture tracing all the hard boundaries of a photo with a fine pen and throwing away the shading in between. ControlNet uses such an edge map as a conditioning signal: the outline says where shapes must go while the prompt decides what fills them. Named after its inventor, John Canny.
Cascaded diffusion
A way to generate high-resolution images or video by chaining several diffusion models in sequence rather than asking one model to do everything at once: the first model produces a small, coarse result, and each later model takes that output and adds detail or resolution, conditioned on the previous stage's blurry version. "Cascade" is the picture of water falling down a series of steps: each pool feeds the next. Splitting the work this way lets each model specialize (rough layout at low resolution, fine texture at high resolution) and was the dominant recipe for high-resolution video before latent models; Imagen Video and Make-A-Video both built super-resolution cascades. Modern latent-diffusion systems mostly dropped it because compressing into a small latent space up front already makes full-resolution generation affordable inside a single model.
Catastrophic forgetting
When training a model on new data erases skills it had already learned, because the new gradients overwrite the old weights.
Causal 3D VAE
A 3D VAE built so that each frame is encoded using only itself and earlier frames, never later ones. In machine learning, "causal" means "respecting the flow of time": just as an effect cannot precede its cause, a causal model cannot look into the future to process the present. This is the same "look only backward" rule a causal mask enforces in language models. A plain 3D VAE merges a fixed block of frames together (e.g., always merging 4 frames into 1), so it has no clean way to handle a lone still image: it expects a full block, and a single image has no neighboring frames to merge with. The causal version sidesteps this: because the very first frame depends on nothing after it, a single-image input (T=1) compresses to a single latent frame (T'=1), and the one model can encode both still images and full video. This is what lets frontier video systems train a single shared compressor on a mix of images and clips (see joint image-video training) instead of maintaining separate image and video encoders.
Causal mask
A mask applied to attention scores that hides future positions, so each token can attend only to itself and the tokens before it. In machine learning, "causal" means "respecting the flow of time" — just as an effect cannot precede its cause, a causal model cannot look into the future to process the present.
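A small sketch of the mask and its effect on attention weights, written in plain Python for clarity. Production code typically adds negative infinity to the hidden scores before a tensor softmax; the helper names here are illustrative.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax where hidden positions get exactly zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0, 0.0, 0.0] for _ in range(3)]   # equal raw scores everywhere
weights = masked_softmax(scores, causal_mask(3))
```

With equal raw scores, the first token attends only to itself, the second splits its attention over two positions, and the third over all three.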
CBF
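Control Barrier Function, a runtime safety filter that keeps the state inside a safe set where h(x) ≥ 0 by enforcing a constraint on ḣ, the time derivative of the barrier function h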
CDNA / RDNA
AMD's datacenter / consumer GPU architectures
CelebA
A dataset of about 200,000 photos of celebrity faces, each labeled with attributes such as "smiling," "wearing glasses," or "blond hair." Because every image is a face, it is a favorite for studying generative models of a single, well-defined kind of picture — you can easily judge whether a generated face looks real, and the attribute labels let you check whether the model learned to control features like hair or expression.
CFG (classifier-free guidance)
Classifier-free guidance — the standard inference trick for making a diffusion model follow its prompt more closely. The model is trained to run both with the condition (the prompt or label) and without it; at sampling time you take the difference between the two predictions and amplify it, pushing the output away from "generic" and toward "matches the prompt." Unlike classifier guidance, it needs no separate classifier — the same generator provides both signals — which is why it became universal in text-to-image models. A guidance-scale knob trades diversity for prompt adherence.
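The "take the difference and amplify it" step is a one-line formula. The sketch below assumes the two predictions are already available as plain lists; the values and the name `cfg_combine` are hypothetical.

```python
def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Amplify the difference between conditioned and unconditioned predictions."""
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]

pred_uncond = [0.0, 1.0]   # hypothetical noise prediction without the prompt
pred_cond = [1.0, 1.0]     # hypothetical noise prediction with the prompt
guided = cfg_combine(pred_uncond, pred_cond, guidance_scale=7.5)
```

A scale of 1 reproduces the conditioned prediction and a scale of 0 the unconditioned one; larger values push further along the direction in which the two disagree.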
CFG fusion
A diffusion-serving optimization for classifier-free guidance, which normally needs two model passes per denoising step: one conditioned on the prompt, one unconditioned. CFG fusion runs both in a single batched forward pass (stacking them as a batch of two) instead of two separate calls, so the model is invoked once per step rather than twice and the per-call launch overhead is paid once. Like cooking two portions in one pan instead of washing up between them: same result, far less overhead.
Chain rule
A calculus principle used to compute the derivative of a composite function by multiplying the derivatives of its parts.
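A quick numeric check on one example, assuming the composite function f(x) = sin(x²), chosen purely for illustration.

```python
import math

def f(x):
    return math.sin(x ** 2)          # outer function sin(u), inner u = x^2

def df_chain_rule(x):
    return math.cos(x ** 2) * 2 * x  # d(sin u)/du times du/dx

def df_numeric(x, h=1e-6):
    # Central finite difference as an independent estimate of the slope.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3
analytic, numeric = df_chain_rule(x), df_numeric(x)
```

The two derivatives should agree to several decimal places, which is the same kind of check used when testing a hand-written backward pass.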
Chameleon
Meta's family of native multimodal models that treat text and images as one single stream of tokens: pictures are turned into image tokens by a VQ-VAE, mixed in with ordinary text tokens, and a single transformer is trained from scratch over the combined sequence with one plain next-token-prediction objective. This is the early-fusion recipe taken to its extreme — there is no separate vision encoder bolted on, so the model can read and write any interleaving of words and image patches. Analogy: instead of a writer and an illustrator passing a notebook back and forth, one person who was taught from the start to "write" in both words and pictures sketches and types along the same flowing line. Example: handed a recipe that is half text and half photos, Chameleon continues it by emitting the next word or the next patch of an image, whichever comes next; the name nods to the lizard that blends seamlessly into any surroundings — here, any mix of modalities.
Chat template
The structured format (system/user/assistant) the model is fine-tuned on
Checkpoint
A saved snapshot of a model's weights (and optimizer state) at a point in training, so a run can be resumed or rolled back to it after a failure.
Chinchilla
The scaling law showing compute-optimal training uses ~20 tokens per parameter
Chunked prefill
Splitting long prompts across multiple iterations to interleave with decode steps
Chunking
Splitting documents into smaller passages (often a few hundred tokens each) before indexing them for retrieval, so a search returns a focused snippet instead of a whole book.
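One possible sketch of fixed-size chunking. The overlap between neighboring chunks is an added assumption (a common practice so that a sentence cut at a boundary still appears whole in one chunk), and `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # this window already reached the end of the document
    return chunks

chunks = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
```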
CIFAR-10
A classic dataset of 60,000 tiny 32×32 color photos sorted into 10 everyday categories (airplane, cat, dog, ship, truck, and so on). Because the images are small and the whole set downloads in seconds, it is a go-to "hello world" for image models — big enough to be interesting, small enough to train on a laptop. The name stands for "Canadian Institute For Advanced Research, 10 classes." See also MNIST, its even simpler grayscale cousin.
Class conditioning
Telling a generative model which category to produce instead of leaving it to chance. You feed the model a label (for example, the digit "7" or the class "cat") alongside its usual input, so at generation time you can ask for exactly that class. Without it, the model draws a random sample from everything it learned; with it, you steer the output — like ordering a specific flavor instead of accepting whatever scoop you are handed.
Classifier guidance
An early technique for steering a diffusion model toward a chosen class or label: you train a separate image classifier that can read noisy images, then at each denoising step add a nudge in the direction of its gradient — the direction that makes the target class more likely. Like a critic standing over a painter and pointing "more toward a cat" at every brushstroke, it trades a little sample diversity for much stronger adherence to the condition. Its drawback is the extra cost of training and running that dedicated noisy classifier, which classifier-free guidance (CFG) later eliminated by getting the same steering from the generator itself.
CLIP
Contrastive Language-Image Pretraining — a model that learns to match pictures with the words that describe them. It has two separate encoders: one reads an image, the other reads text, and both map their input into the same shared space of embeddings, so a photo of a dog and the caption "a dog" land near each other while the caption "a bicycle" lands far away. It is trained on hundreds of millions of image–caption pairs scraped from the web with a contrastive objective: pull each true image–caption pair together and push every mismatched pair apart. Think of it as teaching two translators — one who only speaks "image" and one who only speaks "text" — to agree on a common language, so any picture and its description end up pointing at the same spot. Once trained you can measure how well a caption fits an image (a CLIP score), classify images with no extra training by comparing them to label phrases (zero-shot), or feed the text encoder into a generator — it is the text encoder inside early Stable Diffusion.
Closed-form
A solution you can write down and compute directly with a fixed formula, instead of reaching it through many rounds of trial-and-error. Solving 2x = 10 by writing x = 5 is closed-form; nudging x up and down until both sides match is not. In DPO a closed-form objective lets the model learn straight from preference pairs with one training loss, skipping the slow reward model-plus-PPO loop of classic RLHF.
CLS token
A special extra "summary" token (short for classification) glued to the front of a transformer's input sequence whose only job is to soak up information from all the real tokens, so its final output vector can stand in for the whole input. In a ViT it has no patch of its own — it starts as a learned placeholder and, through attention, gathers a single image-wide description you then hand to a classifier. Like a meeting secretary who owns none of the agenda items but listens to every speaker and writes the one-line summary everyone refers to afterward. (Many models instead average all token outputs — mean pooling — which often works just as well.)
CNN
Convolutional Neural Network — a neural network built mainly from convolution layers; the standard architecture for image tasks. Instead of staring at the whole picture at once, a CNN slides a small magnifying glass across the image, checking one little patch at a time for simple features — an edge here, a splash of color there. Early layers spot these tiny patterns; deeper layers stitch them into bigger ideas (edges become a whisker, whiskers become a cat). Because the same magnifying glass is reused over every patch, a CNN needs far fewer parameters than a network that wired up every pixel separately — and it can recognize a cat whether it sits in the corner or the center of the photo.
COCO
Common Objects in Context — a widely used image dataset of roughly 120,000 everyday photos, each paired with five short human-written captions plus labeled object outlines, so it serves as a shared benchmark for both captioning and object detection. Think of it as the field's standard "practice set," the way students everywhere drill the same well-known textbook problems so their results can be compared. In this guide, COCO's image-caption pairs are the convenient small-scale fuel for training toy CLIP, Q-Former, and captioning models — big enough to be realistic, small enough to fit a weekend.
Codebook
The fixed list of code vectors a VQ-VAE is allowed to use to describe an image — think of it as a numbered paint set, where every patch of the picture must be painted using one of the colors on the palette rather than any color imaginable. The encoder looks at a patch, finds the closest entry in this list, and stores just that entry's index, which is what makes the latent code discrete. A bigger codebook offers more "colors" (finer detail) but is harder to use fully — see codebook collapse.
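The nearest-entry lookup can be sketched as follows, assuming squared Euclidean distance and a toy two-dimensional codebook; the vectors are hypothetical.

```python
def quantize_to_codebook(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy 4-entry palette
encoder_outputs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]         # hypothetical patch features
indices = [quantize_to_codebook(v, codebook) for v in encoder_outputs]
reconstructed = [codebook[i] for i in indices]                 # what the decoder would see
```

Only the integer indices need to be stored; the decoder looks the vectors back up in the same codebook.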
Codebook collapse
A failure where a VQ-VAE ends up using only a few entries of its codebook and ignores the rest — like owning a 64-color crayon box but only ever drawing with three. The unused entries are wasted capacity, so the model stores less detail than its codebook size suggests and reconstructions stay blurry. Common fixes are EMA codebook updates, re-initializing dead (never-chosen) entries near popular ones, and k-means warmup. It is the discrete-latent cousin of mode collapse in GANs.
Collate function
The function a DataLoader uses to combine a list of individual samples into one batched tensor; a custom one can pad variable-length data.
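A framework-free sketch of a padding collate function. With PyTorch, a function of this shape would be passed as the `collate_fn` argument of `DataLoader` and would usually return tensors instead of lists; `pad_collate` and the sample data are illustrative.

```python
def pad_collate(samples, pad_value=0):
    """Pad variable-length sequences to the longest one and group them into a batch."""
    max_len = max(len(s) for s in samples)
    batch = [list(s) + [pad_value] * (max_len - len(s)) for s in samples]
    lengths = [len(s) for s in samples]   # keep true lengths so padding can be masked later
    return batch, lengths

batch, lengths = pad_collate([[5, 6, 7], [8], [9, 10]])
```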
Collective operation
A communication step that all processes (ranks) in a distributed job perform together — such as AllReduce; if one rank skips it, the others wait forever.
Collision mesh
Simplified geometry used for collision tests, distinct from visual mesh
Column-wise partitioning
Splitting a weight matrix along its column (output) dimension so that each GPU holds a vertical slice and computes part of the output independently — the standard first step in Megatron-style tensor parallelism.
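The claim that each slice computes part of the output independently can be checked on a tiny example. This sketch simulates two devices with plain Python lists; the matrices are made up.

```python
def matmul(x, w):
    """Plain matrix multiply for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split weight matrix w into `parts` equal vertical slices."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

x = [[1, 2], [3, 4]]                    # activations, replicated on every device
w = [[1, 0, 2, 1], [0, 1, 1, 3]]        # full weight: 2 inputs, 4 outputs
shards = split_columns(w, parts=2)      # each simulated device holds 2 output columns
partial = [matmul(x, shard) for shard in shards]
gathered = [a + b for a, b in zip(*partial)]   # concatenate outputs along columns
```

No communication is needed until the partial outputs are gathered, which is why this split is a convenient first step.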
Concatenation
The most basic way to fuse two modalities: just stick their feature vectors end to end into one longer vector and hand that to the next layer. If an image embedding has 512 numbers and a text embedding has 512, concatenation glues them into one 1024-number vector — like taping two index cards side by side and reading them as a single wider card. It adds almost no parameters and is a surprisingly strong baseline, but the two streams never actually look at each other the way cross-attention lets them; they only get combined once the next layer mixes the stacked numbers, which is why richer fusion often wins when the task needs the modalities to interact.
Conditional GAN (cGAN)
A GAN that is told which kind of image to make instead of producing a random one. The class label (for example, the digit "7") is fed to both the generator and the discriminator, so generation becomes class-conditioned — you ask for a category and get it. Like a vending machine where you press a button for the snack you want rather than taking whatever drops. See also projection discriminator, an efficient way to feed the label to the critic.
Consistency model
A diffusion-derived model trained so that every point along a noisy-to-clean denoising path maps directly to the same final clean image, so at sampling time you can jump from pure noise to a finished picture in one (or a handful of) steps instead of the usual dozens. It is commonly built by consistency distillation: a student learns to agree with itself at neighboring noise levels along a teacher's ODE trajectory. Like a winding park path where every bench has a sign pointing straight to the exit: wherever you start, one glance gets you to the end. Trade-off: a huge speedup (1–4 steps vs ~50) for a modest dip in quality. The latent-space version is the LCM.
Constitutional AI
An alignment recipe (introduced by Anthropic) where some or all human preference labels are replaced by an AI judge that grades responses against a written "constitution" — a short list of principles like "be helpful," "refuse to assist with harm," "don't pretend to be human." Like running a debate club with a published rulebook instead of asking the audience to vote: cheaper, more consistent, and easier to update than collecting fresh human labels for every new behavior. The technique is the foundation of RLAIF.
Constrained generation
A decoding-time technique that masks out any next-token choices that would break a target structure — a regex, a JSON schema, a grammar — so the model is only allowed to pick valid continuations. Like a Mad Libs game whose blanks accept only nouns or only numbers: the writer can be creative inside each blank but cannot break the form. Libraries such as Outlines and sglang are common implementations, and the technique is what makes reliable function calling and tool-using agents possible.
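The masking step can be illustrated on a toy vocabulary. This sketch assumes greedy decoding and a constraint expressed as a simple predicate; it is not the API of Outlines or any other library, and the logits are invented.

```python
import math

def constrained_argmax(logits, vocab, is_valid):
    """Pick the highest-scoring token among those the constraint allows."""
    masked = [l if is_valid(tok) else -math.inf for l, tok in zip(logits, vocab)]
    return vocab[max(range(len(vocab)), key=masked.__getitem__)]

vocab = ["hello", "7", "42", "}"]
logits = [3.0, 1.0, 2.5, 0.5]   # hypothetical model scores for the next token
# The target structure only accepts digits at this position.
token = constrained_argmax(logits, vocab, is_valid=str.isdigit)
```

Without the mask the model would emit "hello"; with it, the best digit token wins instead.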
Contamination
When items from an evaluation benchmark accidentally end up in a model's training data, so its score reflects memorization rather than skill — like a student who studied from a leaked copy of the exam. Also called train-test contamination, it is a leading reason a high benchmark number can mislead.
Content-addressable token mixing
The routing and retrieval of information between tokens based on their query-key similarity (as in attention) rather than their positions
Context parallelism
Splitting one very long prompt across several GPUs by sequence position, so each GPU holds and processes a different slice of the tokens. Like handing each of four friends one chapter of the same long book to read at the same time, instead of one person reading all four chapters alone. It is how engines serve 100k–1M-token contexts whose KV cache would never fit on a single GPU.
Context window
The maximum number of tokens the model can attend over in one forward pass
Continued pretraining
Taking an already-pretrained model and training it further on a new corpus to add domain knowledge, rather than starting from random weights.
Continuous batching
A serving trick where the engine adds new requests into the running batch, and drops finished ones, at every decode step, instead of waiting for the whole batch to finish together. Like a hotel shuttle that can pick up and drop off passengers anywhere along its loop rather than only at the start and end: far fewer empty seats overall, so throughput goes up dramatically. It is one of the largest throughput gains in modern LLM serving and is the default in vLLM and TGI.
ControlNet
An add-on that gives a frozen diffusion model precise spatial control. It clones the U-Net's encoder into a parallel branch that reads an auxiliary conditioning image — a depth map, a pose skeleton, a Canny edge map, a segmentation map — and feeds that branch's features back into the original network so the output follows the supplied structure. The base model stays untouched (so its quality and prompt-following are preserved) and only the new branch is trained; the connections use zero-convolutions so the branch contributes nothing at first and is learned gradually. Like laying tracing paper with an outline over a painter's canvas: the prompt still chooses colors and texture, but every shape must follow the lines you drew.
ConvLSTM
An LSTM that swaps its internal matrix multiplications for convolutions, so it can carry memory across time and keep the 2D spatial layout of each frame instead of flattening it into a single vector. A plain LSTM treats its input as a flat list of numbers, which throws away which pixel sat next to which; a ConvLSTM keeps the grid intact, so a local fact like "this corner is getting brighter" stays local. That makes it a natural fit for future frame prediction, where both what changes and where it changes matter. It was the standard baseline for video prediction before transformers and diffusion took over, and later recurrent variants such as PredRNN refined the same idea with extra memory paths between layers.
Convolution layers
These are the foundational building blocks of a Convolutional Neural Network (CNN). Their job is to scan an image and hunt for specific patterns.
Each layer uses a small grid of numbers—called a filter or kernel—that acts like a tiny pattern detector. The network slides this filter systematically, step by step, across the entire image. At every pause, the filter looks exclusively at the small patch of the image directly underneath it, checks how well that patch matches the pattern it is hunting for, and spits out a single "match score." As the filter sweeps over the whole image, it records these scores onto a new, blank grid called a feature map.
Picture a small, transparent stencil painted with red-and-white stripes. You drag this stencil step by step over a crowded "Where's Waldo?" poster:
- When the stencil is over a patch of blue sky or a green tree, the patterns don't match, so it leaves a "0" (a dark mark) on your feature map.
- But when you slide the stencil directly over Waldo's shirt, the stripes align perfectly, leaving a high score (a bright mark) on your feature map.
By the end of the sweep, your feature map acts as a glowing treasure map, lighting up exactly where Waldo's shirt is located.
In a real network, one filter might hunt for stripes, another for glasses, and another for the curve of a beanie cap. By stacking many of these convolution layers together, the network pieces together simple clues to eventually recognize a complex object like Waldo himself. Because the network reuses the same tiny filter across the entire poster, the process stays incredibly efficient—and ensures that the pattern is found no matter where it is hiding in the picture.
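The sliding-filter sweep can be sketched in a few lines, assuming stride 1, no padding, and a single channel. As in most deep-learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation. The tiny image and kernel are invented for illustration.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and record match scores."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A 2x2 detector for "bright on the left, dark on the right".
kernel = [[1, -1],
          [1, -1]]
# A short vertical bright bar on a dark background.
image = [[0, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
feature_map = conv2d(image, kernel)
```

The feature map peaks where the bar's right edge lines up with the filter and goes negative where the pattern is reversed.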
copy
A tensor that owns its own storage, independent of any source tensor; created explicitly by .clone(), or automatically by .contiguous() when the tensor is not already contiguous and by reshape when a view is not possible
Cosine decay
A learning-rate schedule that, after warmup, lowers the rate along the smooth downward half of a cosine curve until it reaches near zero by the end of training. The step size starts large and eases off gently, like braking smoothly as you coast up to a stop sign instead of slamming the pedal at the last moment, which helps the model settle into a good solution. It has long been the default schedule, predating newer recipes such as WSD.
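A sketch of one common form of the schedule, assuming linear warmup and a decay floor of `min_lr`. The function name, step counts, and peak rate are illustrative choices.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # cos goes from 1 to -1 as progress goes from 0 to 1, so the factor goes 1 -> 0.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=100, peak_lr=1e-3, warmup_steps=10) for s in range(101)]
```

The rate reaches its peak at the end of warmup, is at half the peak midway through the decay, and lands on `min_lr` at the final step.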
Cosine similarity
A score from −1 to +1 for how closely two vectors point in the same direction, ignoring how long they are. You get it by taking the dot product of the two vectors and dividing by both of their lengths — which is the same as first L2-normalizing each vector (rescaling it to length 1 so it sits on the unit sphere) and then taking a plain dot product. Worked example: for a = [3, 4] (length 5) and b = [4, 3] (length 5), the dot product is 3·4 + 4·3 = 24, so cosine similarity is 24 / (5·5) = 0.96 — nearly 1, meaning they point almost the same way. A value of 1 means identical direction, 0 means unrelated (at right angles), and −1 means exactly opposite. Analogy: two people pointing at the night sky — cosine similarity asks only "are your arms aimed at the same star?", not "whose arm is longer." This is the standard way to compare embeddings, because in most models meaning lives in a vector's direction, not its magnitude; it is the score inside CLIP matching and the building block of InfoNCE.
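The worked example can be reproduced directly; `cosine_similarity` is a plain-Python helper written for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim = cosine_similarity([3, 4], [4, 3])           # the example from the text
right_angle = cosine_similarity([1, 0], [0, 1])   # unrelated directions
opposite = cosine_similarity([1, 2], [-2, -4])    # same line, opposite way, different length
```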
Cost per million tokens
The standard price unit for running a model in production: how many dollars it costs to generate one million tokens of output. You get it by dividing the hardware's hourly cost by how many tokens it produces per hour — like working out a car's cost per mile from its fuel bill and the distance it covers. Almost every serving optimization, from batching to quantization, is ultimately a way to push this one number down.
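The division described above, as a small sketch. The hourly price and throughput are hypothetical numbers, not measurements.

```python
def cost_per_million_tokens(hourly_cost_usd, tokens_per_second):
    """Hourly hardware cost divided by hourly token output, scaled to one million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Assumed example: a $2.00/hour GPU sustaining 1,000 output tokens per second.
cost = cost_per_million_tokens(hourly_cost_usd=2.0, tokens_per_second=1000)
# Doubling throughput (for example through better batching) halves the cost.
cost_batched = cost_per_million_tokens(hourly_cost_usd=2.0, tokens_per_second=2000)
```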
CoT
Chain of Thought — prompting or training a model to write out its reasoning step by step before giving a final answer, the way a student shows their work on a math problem instead of blurting out just the result.
Covariance
A measure of how two quantities move together: when one is above its average, does the other tend to be above too (positive covariance), below (negative), or neither (near zero)? Stacked up for many quantities at once it becomes a covariance matrix, which describes the overall shape and spread of a cloud of points — how wide it is in each direction and how tilted. Picture a scatter of darts on a board: the covariance tells you whether the cloud is a tight circle, a wide oval, or a diagonal streak. FID compares the covariances of real and generated image features to check that the two clouds have the same shape, not just the same center.
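A short sketch of the computation, assuming the sample covariance (dividing by n − 1); the data points are invented.

```python
def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means (n - 1 divisor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

xs = [1.0, 2.0, 3.0, 4.0]
up = covariance(xs, [2.0, 4.0, 6.0, 8.0])     # ys rise with xs
down = covariance(xs, [8.0, 6.0, 4.0, 2.0])   # ys fall as xs rise
flat = covariance(xs, [5.0, 5.0, 5.0, 5.0])   # ys ignore xs
```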
CQL
Conservative Q-Learning — offline RL with a pessimistic Q penalty
Cross-attention
A form of attention that lets one stream of data look at and pull in information from a different source. In ordinary self-attention a sequence attends to itself; in cross-attention the queries come from one place (say, the image being denoised) while the keys and values come from another (say, the text prompt's embeddings) — often a different modality entirely. Picture a painter who keeps glancing at a written description while working: each patch of canvas asks the words "which of you matters to me?" and pulls in the answer to decide what to paint. This is exactly how diffusion models inject a text prompt into the image: inside the U-Net the image patches are the queries and the text tokens are the keys and values, so every region of the picture can attend to the words most relevant to it.
Cross-encoder
A model that reads a query and one candidate document together in a single pass and outputs one relevance score — far more accurate than comparing their separate embeddings, but too slow to run over a whole corpus, so it is used to rerank a short candidate list.
Cross-entropy
A loss function that scores how surprised a model is by the correct answer: it stays small when the model gave the true next word a high probability and grows large when it was confidently wrong. Like grading a weather forecaster on confidence and not just on being right — announcing "90% chance of sun" and then getting rain costs far more points than a hedged "50%." Training an LLM means adjusting the weights to push this surprise as low as it will go.
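The forecaster analogy in numbers, as a minimal sketch for a single prediction using the natural logarithm; the probabilities are illustrative.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log of the probability the model gave to the correct answer."""
    return -math.log(probs[target_index])

# Outcomes are [sun, rain].
confident_right = cross_entropy([0.9, 0.1], 0)   # said 90% sun, and it was sunny
hedged = cross_entropy([0.5, 0.5], 1)            # said 50/50, and it rained
confident_wrong = cross_entropy([0.9, 0.1], 1)   # said 90% sun, and it rained
```

The loss rises from about 0.11 for the confident correct forecast to about 0.69 for the hedge and about 2.30 for the confident miss.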
Cross-modal retrieval
Searching with one modality to find matches in another — typing a caption to pull up the right photo, or handing in an image to find the text that describes it. It works by mapping both modalities into one shared space (for instance with CLIP), so that a query and its true match land near each other; you then keep the few stored items with the highest cosine similarity to the query (a top-k nearest-neighbor lookup). Because every item is encoded just once, answering a query is only a batch of dot products — one matmul — which is why it scales to huge collections. Analogy: a library where books and their summaries are shelved by meaning instead of by title, so a summary in your hand leads you straight to the shelf holding the matching book. It is the first of the four canonical multimodal tasks and the thing dual encoders are best at.
Cross product
A way to combine two 3D vectors that — unlike the dot product, which boils them down to a single number — returns a third vector. That new vector points at a right angle (perpendicular) to both of the originals. For a = [a₁, a₂, a₃] and b = [b₁, b₂, b₃] it is computed slot by slot as a × b = [a₂·b₃ − a₃·b₂, a₃·b₁ − a₁·b₃, a₁·b₂ − a₂·b₁]. For example, [1, 0, 0] × [0, 1, 0] = [0, 0, 1]: two arrows lying flat on a table (one pointing "east", one "north") produce one pointing straight up, out of the table.
What it does (the effect). Where the dot product measures how aligned two vectors are, the cross product hands you the axis they are not aligned along. Analogy: Imagine laying two pens flat on a desk so their ends touch, forming a "V" shape. Now, imagine taking a pencil and standing it perfectly upright exactly where the two pens meet, pointing directly at the ceiling. That standing pencil is the cross product. Furthermore, the length of that pencil depends on how wide you open the "V" shape. If you open the pens to a perfect 90-degree corner, the pencil grows to its maximum height. If you close the pens together so they overlap and point the same way, the pencil vanishes entirely (its length shrinks to zero).
Why Plücker coordinates use it. If you just say "a line pointing North," you haven't given enough information to pin it down — imagine two parallel train tracks that both point North, but sit in completely different places. You need a way to tell them apart.
This is where the cross product comes in. By taking the cross product of a position vector (pointing to any spot on the track) and the track's direction, you create a new vector called the line's moment. Think of this moment as a unique fingerprint for the line's exact location in space. The magic of this math is that no matter which spot you pick along that specific track, the cross product always spits out the exact same fingerprint. But if you do the math on the other parallel track, you get a completely different fingerprint. So, by keeping just two things — the direction and this fingerprint — you perfectly lock down exactly which line you are talking about.
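The train-track argument can be checked numerically. The coordinates below are made up, with "north" taken as the +y axis.

```python
def cross(a, b):
    """Cross product of two 3D vectors, computed slot by slot."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def moment(point, direction):
    """Plücker moment of the line through `point` along `direction`."""
    return cross(point, direction)

north = [0, 1, 0]
m_a = moment([2, 0, 0], north)     # track A, measured at one spot
m_a2 = moment([2, 5, 0], north)    # track A, 5 units farther along
m_b = moment([7, 0, 0], north)     # parallel track B
```

Both points on track A give the same moment, while the parallel track B gives a different one, so the pair (direction, moment) identifies the line.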
cuBLAS
NVIDIA's optimized library of dense linear-algebra kernels; PyTorch calls it for matrix multiplication on CUDA.
CUDA
NVIDIA's GPU compute backend; tensors on the cuda device run their kernels here
CUDA Graphs
A way to record a whole sequence of GPU kernel launches once and then replay them all with a single command, instead of telling the GPU what to do step by step every time. Like pressing "play" on a saved macro instead of retyping the same keystrokes — it skips the per-launch bookkeeping. What has to stay fixed is the list of steps, not the data they run on: every decode step runs the exact same kernels in the exact same order, just on a different token, so it can be recorded once and replayed each step while the actual tokens keep changing. (It only stops helping if the steps themselves change — say a different model path on every call.) This saves a noticeable 5–20% on small models, where launching dozens of tiny kernels per token is itself a real cost.
CUDA stream
A queue of GPU work that runs in order, but independently of other streams — so the GPU can be doing one stream's job while the CPU prepares the next, or two streams can overlap. Like separate checkout lanes at a store: putting independent tasks in different lanes lets them progress at the same time instead of waiting in one long line, which is how a serving stack overlaps detokenization or KV transfer with the next forward pass.
Custom op
A user-defined operation registered with PyTorch (e.g. via torch.library.custom_op) so it behaves like a built-in — including working with torch.compile.
CUTLASS
NVIDIA's open template library for matmul kernels
DALL·E 3
OpenAI's text-to-image model, best known for following long, detailed prompts faithfully — it reliably places the right objects, counts, and spatial relationships you asked for. Its standout trick was training on synthetic captions: instead of messy web alt-text, the team rewrote the training captions to richly describe each image, so the model learned exactly which words map to which pictures. Like a student who finally aces reading comprehension once their textbook is rewritten in clear, complete sentences. It is the proprietary counterpart to open models like Stable Diffusion, and the public demonstration that better captions can beat a bigger model.
Data parallelism
The default way to train across many GPUs: put a full copy of the model on each GPU, feed each one a different slice of the batch, then average their gradients so all copies stay identical — like several graders each marking part of an exam pile and then pooling the scores. (See DDP.)
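The gradient-averaging step can be checked on a toy problem. The sketch below is a pure-Python illustration, not a distributed implementation: the one-weight model, the data, and the two simulated "GPUs" are all made up, and it assumes equal-sized slices so that a plain average of the per-replica gradients matches the full-batch gradient.

```python
# Toy model: y = w * x with a mean-squared-error loss. All names are illustrative.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one slice of data."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Split the batch into equal slices, one per simulated "GPU".
shards = [(xs[0:2], ys[0:2]), (xs[2:4], ys[2:4])]
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]

# Averaging the per-replica gradients stands in for the all-reduce step.
averaged_grad = sum(local_grads) / len(local_grads)
full_batch_grad = grad_mse(w, xs, ys)
```

Because every replica ends up with the same averaged gradient, each one applies the same update and the copies stay identical.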
DataLoader
PyTorch's iterator that pulls samples from a Dataset, groups them into batches, and can load them in parallel using worker processes.
DCGAN
Deep Convolutional GAN — the 2015 recipe that first made GAN training reliable, by building both the generator and discriminator out of convolution layers with a few simple rules (batch normalization, no pooling layers, specific activations). Before it, GANs often fell apart mid-training; DCGAN's architecture became the default starting point that almost every later image GAN built on.
DDIM
Denoising Diffusion Implicit Models: a way to sample from an already-trained DDPM far faster. Where DDPM's reverse process is stochastic (it injects fresh randomness at every step and may need ~1000 steps), DDIM makes the path deterministic: the same starting noise always yields the same image, and the smooth path lets you skip most steps, so ~50 steps typically come close to 1000-step quality. Crucially, it reuses the same trained network; DDIM changes only how you sample, not how you train. Like taking a few long, confident strides across a room instead of many tiny shuffles.
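A minimal sketch of the deterministic update on a single scalar "image" may help. It assumes the common alpha-bar parameterization (the cumulative fraction of signal kept at each noise level); the schedule values are made up, and an oracle that already knows the true noise stands in for the trained network, which is why two strides recover the clean value exactly here.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (no added randomness) on a scalar sample.

    abar_t / abar_prev are the cumulative signal fractions at the current and
    the target noise level; eps_pred is the network's noise guess.
    """
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - math.sqrt(1 - abar_t) * eps_pred) / math.sqrt(abar_t)
    # Re-noise that estimate to the lower target level using the same eps.
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1 - abar_prev) * eps_pred

# Hypothetical setup: a known clean value and noise, so an oracle can replace the network.
x0, eps = 0.8, -1.3
abar = {1000: 0.0001, 500: 0.3, 0: 1.0}   # made-up schedule values
x_t = math.sqrt(abar[1000]) * x0 + math.sqrt(1 - abar[1000]) * eps

# Two large strides (1000 -> 500 -> 0) instead of 1000 small ones.
x_mid = ddim_step(x_t, eps, abar[1000], abar[500])
x_final = ddim_step(x_mid, eps, abar[500], abar[0])
```

With a real network the noise guess is imperfect, so skipping steps trades some accuracy for speed; the update rule itself has no random term, which is what makes the path repeatable.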
DDIM inversion
Running the deterministic DDIM sampler in reverse to find the starting noise that would regenerate a given real image. Normal sampling goes noise → image by removing a little noise each step; inversion walks the same path backward, image → noise, adding the noise the model would have removed. Once you hold that noise you can change the prompt and denoise forward again, and because the path is largely reused the edit keeps the original's layout and pose. The catch is drift: each backward step is only approximate, so the recovered noise does not reconstruct the photo perfectly — null-text inversion is a follow-up that fixes this by optimizing the empty-prompt embedding so the reconstruction stays faithful. Like reverse-engineering the exact recipe behind a finished cake so you can bake it again from scratch — and, this time, swap in one ingredient to change just that part while everything else comes out the same.
DDP
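Distributed Data Parallel: PyTorch's implementation of data parallelism. It replicates the model on every GPU, splits each batch across the replicas, and all-reduces the gradients so every copy applies the same update.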
DDPG
Deep Deterministic Policy Gradient: an off-policy actor-critic algorithm, and one of the first deep-RL methods for continuous-action control.
DDPM
Denoising Diffusion Probabilistic Models — the foundational 2020 paper and recipe that kicked off the modern diffusion era. Training is disarmingly simple: take a clean image, add a known amount of random (Gaussian) noise, and teach a network (usually a U-Net) to predict that noise so it can be subtracted back off; the loss is just mean squared error on the noise. To generate, start from pure static and repeat the learned "remove a little noise" step many times (classically 1000) until an image appears. Because there is no adversarial game, it sidesteps the mode collapse that plagues GANs.
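The training objective can be sketched in a few lines. This is a toy on scalar "images" with a made-up four-level schedule, and the "network" is a placeholder function that always predicts zero noise; a real setup would use a U-Net and an optimizer, neither of which is shown.

```python
import math
import random

def noisy_sample(x0, abar_t, eps):
    """Forward process in closed form: blend the clean value with Gaussian noise."""
    return math.sqrt(abar_t) * x0 + math.sqrt(1 - abar_t) * eps

def ddpm_loss(predict_noise, x0_batch, abar_schedule, rng):
    """Mean squared error between the noise that was added and the predicted noise."""
    total = 0.0
    for x0 in x0_batch:
        t = rng.randrange(len(abar_schedule))   # pick a random noise level
        eps = rng.gauss(0.0, 1.0)               # the noise we add
        x_t = noisy_sample(x0, abar_schedule[t], eps)
        total += (predict_noise(x_t, t) - eps) ** 2
    return total / len(x0_batch)

rng = random.Random(0)
abar_schedule = [0.99, 0.9, 0.5, 0.1]           # made-up schedule (signal fraction kept)
x0_batch = [0.2, -0.7, 1.1]

# An untrained stand-in "network" that always predicts zero noise.
loss_untrained = ddpm_loss(lambda x_t, t: 0.0, x0_batch, abar_schedule, rng)
```

Training would adjust the predictor to drive this loss down; nothing else (no discriminator, no adversarial term) is involved.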
Deadly triad
The combination of function approximation, bootstrapping, and off-policy data; when all three are present, value-based RL training can become unstable or diverge.
Decode
The token-by-token half of LLM inference: after prefill digests the prompt, the model generates one new token per forward pass, each step reading the whole KV cache before producing the next logits. Like writing a sentence one word at a time while glancing back over every word already written — fast per step, but the constant re-reading of the page is what bounds speed. Decode is memory-bandwidth-bound on a GPU, the opposite of prefill, and is what most serving optimizations target.
decord
A fast video-reading library that decodes frames straight into tensors, built for deep-learning data loaders. Its key trick is efficient random access: you can ask for "frames 0, 30, and 90" and it jumps to them without decoding everything in between, which is exactly what frame sampling needs. Analogy: a regular video player reads a movie front to back like a cassette tape, while decord works like a book with an index — it flips straight to the page you want. Example: vr.get_batch([0, 30, 90]) returns just those three frames as a single tensor, ready for the model.
Decoupled
Describes an optimizer design in which two effects that coincide in plain SGD (here, L2 regularization and weight decay) are applied as separate operations. In AdamW, weight decay is decoupled from the gradient update: the decay shrinks the weights directly, so its strength is not rescaled by Adam's per-parameter adaptive step size.
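The difference shows up clearly on a single scalar weight. The sketch below writes out one step of Adam with L2 regularization and one step of AdamW by hand; the hyperparameter values are illustrative, and the zero loss gradient is chosen so that only the regularizer acts.

```python
import math

def adam_l2_step(w, grad, m, v, t, lr=0.1, wd=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with L2 regularization: the decay term is folded into the gradient."""
    g = grad + wd * w                      # decay passes through the adaptive scaling
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t, lr=0.1, wd=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: the decay is applied to the weight directly, outside the adaptive step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    w = w - lr * wd * w                    # decoupled decay: shrink by lr * wd * w
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# With a zero loss gradient, only the regularizer moves the weight.
w_l2, _, _ = adam_l2_step(w=2.0, grad=0.0, m=0.0, v=0.0, t=1)
w_decoupled, _, _ = adamw_step(w=2.0, grad=0.0, m=0.0, v=0.0, t=1)
```

In the L2 version the decay is normalized along with the gradient, so the weight moves by roughly the learning rate no matter how small `wd` is; in the decoupled version it shrinks by exactly `lr * wd * w`.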
Deduplication
Removing repeated or near-repeated documents from a training corpus; one of the highest-return cleaning steps in pretraining.
Deep network
A neural network with many layers stacked one after another, so the input passes through a long chain of transformations before reaching the output — "deep" literally refers to that depth (the number of stacked layers), in contrast to a "shallow" network of just one or two. Each layer builds on the features the previous one produced: in an image model the early layers might pick out edges, the middle layers shapes, and the later layers whole objects — like an assembly line where every station adds a little more refinement. Depth is what lets these models learn rich, abstract patterns, but it also makes them hard to train, because the learning signal (gradients) has to travel back through every layer and tends to fade or blow up along the way — which is precisely the problem residual connections and normalization were invented to tame.
DeepSpeed
Microsoft's open-source library for training very large models efficiently. It is best known for ZeRO, which shards a model's parameters, gradients, and optimizer state across GPUs so no single GPU has to hold the whole model — the same idea as PyTorch's FSDP. Think of it as a moving company that splits one giant load across several trucks instead of trying to cram everything into one.
Depth map
A grayscale image that records how far away each pixel is rather than its color — near things are drawn light and far things dark (or the reverse), like a black-and-white fog where closer objects glow brighter. It throws away texture and color and keeps only the 3D shape of a scene: which parts stick out toward the camera and which recede into the distance. You can estimate one from an ordinary photo with a depth-prediction network, or capture it directly with a depth sensor. ControlNet uses a depth map as a conditioning signal so a generated image keeps the same sense of near-and-far layout — the prompt repaints the surfaces, but a person standing in front of a wall stays in front of the wall.
Derivative
The instantaneous rate of change of a function with respect to its input. In deep learning, derivatives are computed via the chain rule during backpropagation to produce the gradients used to update model parameters.
Detached tensor
A tensor returned by the .detach() method: it shares data with the original but is cut off from the dynamic computation graph, so operations performed on it are not tracked by autograd and no gradient flows back through it.
Deterministic algorithms
Operations that produce bit-identical outputs for identical inputs every time; enabled in PyTorch via torch.use_deterministic_algorithms(True) at the cost of some performance
Detokenization
Turning a sequence of token IDs back into a UTF-8 string — the reverse of what the tokenizer did on the way in. The tricky part for streaming servers is that a single visible character (like an emoji or a Chinese character) is often spread across several BPE pieces, so emitting each token's text the moment it arrives can produce broken bytes; a correct streaming detokenizer buffers the partial bytes until they form a complete character.
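The buffering behavior can be sketched with Python's standard incremental UTF-8 decoder. The byte pieces below are hypothetical stand-ins for what a byte-level tokenizer might emit; a real server would get them from its tokenizer rather than from a hand-written list.

```python
import codecs

def stream_detokenize(byte_pieces):
    """Yield text as token bytes arrive, holding back incomplete UTF-8 sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for piece in byte_pieces:
        text = decoder.decode(piece)   # returns "" while a character is still partial
        if text:
            yield text

# Hypothetical token byte pieces: the emoji's four bytes are split across two tokens.
emoji = "😀".encode("utf-8")
pieces = [b"Hi ", emoji[:2], emoji[2:], b"!"]
chunks = list(stream_detokenize(pieces))
```

Decoding each piece on its own would fail or emit replacement characters for the two half-emoji pieces; the incremental decoder instead emits nothing until the character is complete.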
DH parameters
Denavit-Hartenberg parameters — textbook arm-geometry description
Diffusion model
A generative model that learns to un-noise an image (or video, or audio).
The key intuition: the model only learns the reverse process. The forward process (adding noise) is a fixed mathematical corruption, like randomly shuffling a puzzle, and requires no learning. The reverse process is where the learning happens: the network is handed a noised image together with an explicit indication of how much noise is currently present (usually a timestep t or a standard deviation σ). This noise level is a critical conditioning input, fed into the network through mechanisms like AdaGN, so the model knows whether to focus on forming broad outlines (high noise) or refining fine details (low noise).
DINOv2
A strong, off-the-shelf image encoder from Meta trained in a self-supervised way via self-distillation — it learns purely from images, with no human labels, by teaching the network to give two different crops of the same photo matching internal descriptions. The result is a general-purpose ViT backbone whose features work well for many tasks (classification, segmentation, depth) right out of the box, often beating label-trained encoders on a linear probe. Like a student who learns to recognize objects just by looking at millions of pictures and noticing what stays the same when an object is moved or cropped, never being told any object's name. The "v2" marks the second, larger and cleaner-data version; the name comes from self-distillation with no labels.
Disaggregated serving
Running prefill and decode on separate GPU pools with KV cache transfer between them
Discriminator
The "critic" half of a GAN: a network that looks at an image and outputs how likely it is to be real rather than made by the generator. It is trained like a detective spotting fakes, and its verdicts are the only teaching signal the generator ever gets — as the discriminator sharpens, the generator is forced to make more convincing images. In Wasserstein GANs it outputs an unbounded score instead of a 0–1 probability and is usually called a critic.
Dispatcher
The PyTorch component that routes torch.foo(...) calls to the right backend/dtype kernel
Distillation
Training a smaller "student" model to copy the output of a larger, more capable "teacher" so the student inherits most of the teacher's behavior at a fraction of the cost. Like a junior cook shadowing a head chef and learning each recipe by mimicking the dish — they may never match the master, but they can plate most of the menu for far less money. Distillation works for skills the teacher already has but cannot conjure new abilities the teacher lacks.
Distribution drift
When the kind of data a model sees in production slowly changes away from the data it was tuned on — like a store whose regular customers gradually change their tastes, so last year's best-selling stock starts to sit on the shelf. For a quantized model it matters because calibration was fitted to the old traffic, so quality can quietly slip as the new traffic drifts further away.
DiT
Diffusion Transformer — Peebles & Xie's diffusion backbone that replaces the U-Net with a pure transformer. It chops the noisy image (really its VAE latent) into a grid of small patches, turns each patch into a token, and lets attention mix them — the same recipe that took over language modeling, now pointed at denoising. The name simply joins "diffusion" (the denoising task) with "transformer" (the architecture). Its big draw is scaling: make it wider or deeper and quality improves along a predictable curve, the way a bigger language model reliably gets better. Like swapping a custom-built, image-shaped machine (the U-Net) for a general-purpose assembly line you can just make longer to produce more. Sizes are named DiT-S (small), DiT-B (base), DiT-L (large), and a suffix like "/2" gives the patch size — DiT-S/2 is the small model with 2×2 patches.
Dolly
A camera move where the whole camera physically travels toward or away from the subject — the name comes from the wheeled cart (a "dolly") that camera operators roll along a track. Unlike a zoom, which only magnifies the image from a fixed spot, a dolly actually changes the camera's position, so the background shifts relative to the foreground and you get a real sense of moving through the scene. It is one of the moves a video model can be directed through with camera control.
Dot product
A way to boil two equal-length lists of numbers (two vectors) down to a single number: multiply them position by position, then add up all the products. For [1, 2, 3] · [4, 5, 6] you compute 1·4 + 2·5 + 3·6 = 4 + 10 + 18 = 32.
What it does (the effect). The dot product works as a similarity score between two vectors. It comes out large and positive when the two lists "point the same way" — their big numbers sit in the same slots — near zero when they are unrelated, and negative when they pull in opposite directions. Analogy: imagine two friends each rate ten movies from −5 to +5. Multiply their scores movie by movie and add them up: if they both loved and both hated the same films the total is a big positive number (very alike); if their tastes are unrelated the pluses and minuses cancel out near zero; if one loved what the other hated it goes negative. The dot product is exactly that "how aligned are we?" number.
How to calculate it with vectors (and matmul). Doing one dot product is the multiply-and-sum above. Doing many at once is precisely what matrix multiplication (matmul, written A @ B) is built from: each number in the output grid is the dot product of one row of the left matrix with one column of the right matrix. So a single matmul is just a big batch of dot products computed together. This is why, in a projection discriminator, "taking a dot product between the image's features and a learned class vector" is simply measuring how much the image lines up with that class — a high dot product means "this really looks like that category."
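Both ideas fit in a few lines of plain Python. This is a teaching sketch using nested lists, not how a real library computes matmul; the function names are illustrative.

```python
def dot(u, v):
    """Multiply position by position, then add up the products."""
    assert len(u) == len(v), "vectors must have the same length"
    return sum(a * b for a, b in zip(u, v))

def matmul(A, B):
    """Each output entry is the dot product of a row of A with a column of B."""
    cols_of_B = list(zip(*B))
    return [[dot(row, col) for col in cols_of_B] for row in A]

score = dot([1, 2, 3], [4, 5, 6])          # 32, the worked example above
opposite = dot([1, 2, 3], [-4, -5, -6])    # -32: the vectors pull in opposite directions
product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])   # [[19, 22], [43, 50]]
```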
Double backward
Computing the gradient of a gradient by tracking the backward pass operations in a new computation graph.
Downstream
The later, real-world tasks a model is eventually judged on — such as question answering or coding — as opposed to the pretraining objective it was trained on. "Downstream scores" measure how much a change (cleaner data, a better learning rate) actually pays off on those end tasks, the way a river's health downstream reflects what happened upstream at the source.
DPM-Solver
Short for Diffusion Probabilistic Model Solver, a fast sampler for diffusion models. The baseline Euler method takes small straight-line steps based only on the immediate slope, so it needs hundreds of tiny steps to avoid wandering off-path. DPM-Solver instead exploits the known structure of the sampling ODE: part of the equation is linear and can be solved exactly ahead of time, leaving only the network-dependent part to approximate. That lets it take large, confident strides and reach comparable quality in just 10–20 steps.
DPM-Solver++. An upgraded version designed for sampling at high CFG scales. Strong guidance makes the predicted noise direction (ε) swing widely from step to step, which destabilizes the standard DPM-Solver. DPM-Solver++ reformulates the update to predict the final clean image (x₀) instead of the noise, and anchoring each step to that more stable target helps prevent image burn and artifacts even under aggressive guidance. Those artifacts are the oversaturated, blown-out colors and blotchy fake textures that appear when too-high guidance pushes pixel values past their valid range, like overexposing a photo until the bright spots turn into harsh white patches.
DPO
Direct Preference Optimization — a way to do RLHF-style alignment that skips the usual two-step machinery (first train a separate reward model, then optimize against it with PPO) and instead tunes the model directly on pairs of (chosen, rejected) answers to the same prompt. The trick is a math shortcut — a closed-form result, meaning an exact formula you can write down directly instead of searching for the answer by trial and error. It proves you never have to build the separate reward model at all: the score that reward model would have given is already hidden inside the language model's own answer probabilities (how likely the model thinks each answer is). So one simple training step does the whole job — nudge the model to make the human-preferred answer a little more likely and the rejected answer a little less likely. To stop it from over-correcting and wandering off into nonsense, every nudge is measured relative to a frozen reference model — a saved, unchanging copy of the model from before tuning — so the tuned model can only drift a small distance from where it started, like a climber clipped to a fixed anchor who can move around but not fall far. Like teaching a cook by repeatedly showing them two plates and saying "this one, not that one," instead of first writing a detailed scoring rubric (the reward model) and then training against the rubric. Example: given a prompt and two responses a human marked better/worse, one DPO step nudges the model toward the better one — no reward network and no RL rollouts required, which makes it far simpler and cheaper to run than PPO.
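The per-pair loss is short enough to write out. The sketch below assumes the standard DPO form, a negative log-sigmoid of the scaled preference margin, and takes the answers' summed log-probabilities as plain numbers; the values and the `beta` default are illustrative, and producing those log-probabilities from a real model is not shown.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given summed log-probabilities.

    The margin measures how much more the tuned model prefers the chosen answer
    than the frozen reference model does.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written with log1p so large |z| does not overflow.
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))

# Before tuning, the model equals the reference, so the margin is zero.
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After a nudge: the chosen answer became likelier, the rejected one less likely.
loss_after_nudge = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The reference log-probabilities never change during training, which is what keeps the tuned model anchored to its starting point.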
DQN
Deep Q-Network — Q-learning with neural-net function approximation + experience replay + target network
Draft model
In speculative decoding, a small, fast model that guesses the next few tokens so the big target model can check them all at once. Like a quick assistant who scribbles a rough draft for the expert to approve or correct — cheap to run, and most of its guesses turn out right, so the slow expert is consulted far less often.
DreamBooth
A personalization recipe that fine-tunes the whole diffusion model on just 3–5 photos of one subject and binds it to a rare trigger word, so you can afterwards prompt "a photo of [V] dog surfing." Because every weight is updated the likeness is excellent, but the saved model is full-sized — the opposite trade-off from a lightweight LoRA. To stop the model from catastrophically forgetting what other dogs look like, it adds a prior-preservation loss that keeps training on the model's own generic class images. Like memorizing one specific face in such detail that you must consciously remind yourself other faces still exist.
dtype
A tensor's element data type — e.g. float32, float16, bfloat16, int8, bool
Dual encoder
An architecture with two separate encoders — one per modality, e.g. an image tower and a text tower — that each map their input into the same shared space, where the two are compared by cosine similarity. The key trait is that the modalities never mix until the very end (late fusion): each side is encoded entirely on its own. CLIP is the classic example. Analogy: two translators who never talk to each other but have both been trained to render anything into one common interlingua, so their outputs can be lined up afterward. This separation is exactly what makes dual encoders fast for cross-modal retrieval — you encode the whole image collection once, in advance, and a new text query only has to be encoded and compared — but it also means they can only match or score, never reason over or generate the other modality the way a VLM can.
Dynamic computation graph
A graph of operations built on-the-fly as code executes, representing the forward pass used for autograd.
Dynamic quantization
A quantization method that stores weights as int8 ahead of time but computes each layer's activation scale at runtime, just before the layer runs.
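A minimal sketch of the idea for one output of a linear layer, assuming symmetric per-tensor int8 quantization (one of several possible schemes). The weights are quantized once up front, while the activation scale is recomputed from each incoming input.

```python
def quantize_int8(values):
    """Symmetric int8 quantization: derive the scale from the values themselves."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # fall back to 1.0 for all-zero input
    return [max(-127, min(127, round(v / scale))) for v in values], scale

def dynamic_quant_linear(x, w_q, w_scale):
    """One output of a linear layer with int8 weights and a runtime activation scale."""
    x_q, x_scale = quantize_int8(x)              # activation scale computed per call
    acc = sum(a * b for a, b in zip(x_q, w_q))   # integer multiply-accumulate
    return acc * x_scale * w_scale               # rescale back to floating point

weights = [0.5, -1.0, 0.25]
w_q, w_scale = quantize_int8(weights)            # done once, ahead of time

x = [2.0, 1.0, -4.0]
approx = dynamic_quant_linear(x, w_q, w_scale)
exact = sum(a * b for a, b in zip(x, weights))   # -1.0 in full precision
```

Because the scale follows each input, no calibration dataset is needed for activations, at the cost of a little extra work per call.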
Eager mode
PyTorch's default execution, where each operation runs immediately as its Python line is reached — flexible and easy to debug, but without the cross-operation optimizations a compiler can apply.
EAGLE / Medusa
Self-speculation: extra heads on the target model propose tokens, no separate draft model
Earth Mover's Distance
A way to measure how far apart two distributions are by the smallest amount of "work" needed to reshape one pile into the other — imagine shovelling a heap of dirt into the shape of a second heap, where work is dirt moved times distance carried. Also called the Wasserstein distance, it gives a smooth, meaningful number even when the two piles barely overlap, which is exactly why Wasserstein GANs use it in place of the original GAN loss that goes flat in that case.
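In one dimension the minimal work has a simple form, which makes the "barely overlapping piles" point easy to check. The sketch assumes two histograms over the same evenly spaced bins with equal total mass; the general multi-dimensional case needs an optimal-transport solver and is not covered here.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on the same evenly spaced bins.

    In one dimension the minimal work equals the sum of absolute running
    surpluses: whatever excess has piled up so far must be carried one bin on.
    """
    assert len(p) == len(q) and abs(sum(p) - sum(q)) < 1e-9
    work, carried = 0.0, 0.0
    for a, b in zip(p, q):
        carried += a - b          # dirt still to be moved past this bin
        work += abs(carried)
    return work

near = emd_1d([1, 0, 0, 0], [0, 1, 0, 0])   # one unit moved one bin
far = emd_1d([1, 0, 0, 0], [0, 0, 0, 1])    # one unit moved three bins
```

Neither pair of piles overlaps at all, yet the distance still grows with how far apart they are, which is the smooth signal the definition describes.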
Edge inference
Running a model directly on the device in front of the user — a phone, laptop, car, or small embedded board — instead of sending the request to a data-center GPU. Like cooking at home rather than ordering delivery: it is private and works without a network, but you are limited to the small "kitchen" the device has, so models are kept small (1–8B), heavily quantized, and tuned to sip battery and fit in shared memory.
EDM
A cleaned-up reformulation of diffusion from the paper Elucidating the Design Space of Diffusion-Based Generative Models (Karras et al. 2022) that strips away historical baggage and makes training and sampling much easier to tune. Two ideas carry it: index noise by its standard deviation σ rather than a discrete timestep (the σ-schedule), and precondition the network — rescale its input, output, and per-σ loss weight so it always sees roughly unit-variance signals no matter how much noise is present. The result is a flat, forgiving hyperparameter surface. A useful rule of thumb: if a 2020-era diffusion paper feels obscure, restate it in EDM's language and it usually becomes obvious.
EKF
Extended Kalman Filter — Kalman filter linearized about the current estimate
ELBO
Evidence Lower Bound — a mathematical score used to train generative models like VAEs. It balances two goals: recreating the original input accurately, and keeping the model's internal representation organized. Think of it like packing for a trip: you want to bring everything you need (accurate recreation) but also pack it neatly so the suitcase closes easily (organized representation). The ELBO is the score that measures how well the model balances both tasks.
Elementwise operation
An operation applied independently to each element of a tensor (e.g. add, multiply, ReLU), where output position i depends only on input position i.
Elo
A rating system borrowed from chess that turns a series of head-to-head wins and losses into a single number per player: beat a strong opponent and your rating jumps, lose to a weak one and it drops. LLM arenas use it to rank chat models from pairwise comparisons instead of from a fixed-answer benchmark.
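The update rule is compact. The sketch uses the standard Elo logistic curve with a 400-point scale; the step size `k=32` is an assumed, commonly used value, and real arenas may fit ratings differently.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, a_won, k=32):
    """Return both players' new ratings after one game (k is an assumed step size)."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

even = elo_update(1500, 1500, a_won=True)    # equal ratings: winner gains k / 2
upset = elo_update(1400, 1600, a_won=True)   # underdog wins: a larger jump
```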
EMA weights
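An exponential moving average of the model's weights, kept as a slowly updated shadow copy during training; in diffusion models the EMA copy usually produces better samples than the live weights.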
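The update itself is one line per weight. In this sketch the weights are a plain list, the training trace is made up, and the decay of 0.9 is chosen only so the effect is visible in six steps; real runs typically use a decay much closer to 1.

```python
def ema_update(ema_weights, live_weights, decay=0.999):
    """Blend a little of the live weights into the running average."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema_weights, live_weights)]

# Hypothetical training trace: the live weight jitters around 1.0.
live_trace = [[1.3], [0.7], [1.2], [0.8], [1.1], [0.9]]
ema = [1.0]
for live in live_trace:
    ema = ema_update(ema, live, decay=0.9)
```

The live weight swings by up to 0.3 from step to step, while the averaged copy stays within a few hundredths of 1.0.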
Embedding
A dense vector that represents a token (or other item) so the model can compute over it; each token ID maps to one row of the embedding matrix
Embedding matrix
The lookup table E ∈ ℝ^{V×d} that turns each token ID into a dense vector by selecting its row; growing the vocabulary means adding rows
Embedding space
The shared multi-dimensional space that all embeddings live in, where every item is a point and direction and distance carry meaning — items that mean similar things sit close together and point the same way. Talking about the geometry of the embedding space means asking how those points are arranged: are the true image–caption pairs bunched into tight clusters, spread evenly over the unit sphere, or collapsed into one indistinct blob? Analogy: a city where related shops naturally form neighborhoods — you learn a lot about a model by inspecting the shape of its map, not just whether it gets answers right. In CLIP, a well-chosen temperature pulls matching pairs into tight clusters on the sphere while keeping mismatches pushed apart, so the geometry itself reveals how confidently the model separates right from wrong.
Enormous on paper
Describes a model, such as a Mixture-of-Experts (MoE), that has a massive total number of parameters (e.g., 100 billion) but uses only a small fraction of them (e.g., 10 billion) for any single token. Like a giant university with 5,000 courses listed in its catalog (enormous on paper): no single student takes all 5,000. Each student takes only a few classes at a time, so the cost per student stays low even though the total catalog is huge.
Error budget
The small amount of failure an SLO allows. If your target is 99.9% success, the remaining 0.1% — about 43 minutes a month — is your error budget. Like a monthly data allowance on a phone plan: you can "spend" it on risky deploys and experiments, but once it runs out you stop taking risks until it resets. It turns reliability from a vague goal into a balance you can watch.
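The arithmetic behind the 43-minute figure, as a small sketch. It assumes a 30-day window and measures the budget in minutes of failure; the function names are illustrative.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed failure in the window for a given success target."""
    return (1 - slo_target) * window_days * 24 * 60

def budget_remaining(slo_target, bad_minutes_so_far, window_days=30):
    """Minutes of budget left; a negative value means the SLO is already missed."""
    return error_budget_minutes(slo_target, window_days) - bad_minutes_so_far

budget = error_budget_minutes(0.999)                     # about 43.2 minutes
left = budget_remaining(0.999, bad_minutes_so_far=30)    # about 13.2 minutes
```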
Euler method
The simplest way to numerically solve a differential equation: look at the slope where you currently are, take one straight-line step in that direction, then repeat. It is easy to implement but accumulates error quickly because it ignores how the slope changes during the step, so diffusion samplers built on it need many steps to stay accurate. Like steering a car by only ever looking at the road directly under the bumper. Contrast with Heun's method and DPM-Solver, which correct for the changing slope and so need far fewer steps.
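A short sketch makes the step-count trade-off concrete. It integrates the test equation dy/dt = -y, chosen only because its exact solution is known, and compares the error at two step counts.

```python
import math

def euler_solve(slope, y0, t_end, steps):
    """Integrate dy/dt = slope(t, y) from t = 0 to t_end with fixed-size steps."""
    y, t = y0, 0.0
    dt = t_end / steps
    for _ in range(steps):
        y += dt * slope(t, y)   # straight-line step along the current slope
        t += dt
    return y

def decay(t, y):
    """Test equation dy/dt = -y, whose exact solution is y = e^(-t)."""
    return -y

exact = math.exp(-1.0)
err_coarse = abs(euler_solve(decay, 1.0, 1.0, steps=10) - exact)
err_fine = abs(euler_solve(decay, 1.0, 1.0, steps=1000) - exact)
```

The error shrinks only in proportion to the step size, which is why Euler-based samplers need many steps to stay accurate.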
Evaluation harness
A ready-made framework that runs a model through many benchmarks automatically, with the prompts, answer parsing, and scoring all fixed so every model is tested the exact same way. Like a standardized testing center that hands every candidate the same paper and grades it with the same answer key, instead of each examiner writing their own quiz. For VLMs the two common ones are lmms-eval and VLMEvalKit: you point them at a model and a list of benchmarks (MMBench, MMMU, DocVQA, …) and they return one comparable table of scores. This matters because tiny differences in prompt wording or in how a multiple-choice letter is pulled out of the answer can swing a score by several points, so sharing one harness is what makes two papers' numbers actually comparable. Example: a single command like python -m lmms_eval --model llava --tasks mmbench,mmmu evaluates the model on both and prints an accuracy for each.
ExecuTorch
PyTorch's lightweight runtime for running models on mobile and edge devices, built on the graph captured by torch.export.
Expert
In a Mixture-of-Experts (MoE), one of several parallel MLP sub-networks; a router sends each token to only the top few experts instead of all of them. Like a hospital triage desk that routes each patient to the right specialist rather than making everyone see every doctor — lots of expertise on hand, but only a little used per case.
Expert parallelism (EP)
For MoE models, distributing experts across GPUs with all-to-all token routing
Exponent
The part of a floating-point number that records its scale — how many places to shift the decimal point. In scientific notation like 3.5 × 10¹², the 12 is the exponent (using base 10 instead of base 2). More exponent bits give a wider range of representable magnitudes, from astronomically large to vanishingly small; fewer exponent bits mean values overflow or underflow more easily. This is why FP8 has two flavors: E5M2 (5 exponent bits) for gradients that can swing wildly in size, and E4M3 (4 exponent bits) for activations that stay in a tighter range. See also mantissa.
FCFS
First-Come, First-Served — the simplest scheduling rule: handle requests in the exact order they arrive, like a single queue at a bakery where nobody can skip ahead. It is fair and easy to build, but it has no sense of deadlines, so one slow request at the front can make everyone behind it late.
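The head-of-line effect is easy to see in a tiny simulation. The sketch assumes a single worker and that all requests arrive at time 0 in list order; the durations are made up.

```python
def fcfs_finish_times(service_times):
    """Finish time of each request when served strictly in arrival order.

    All requests are assumed to arrive at time 0, in list order.
    """
    finish, clock = [], 0.0
    for duration in service_times:
        clock += duration
        finish.append(clock)
    return finish

slow_first = fcfs_finish_times([10.0, 1.0, 1.0])   # slow request at the front
slow_last = fcfs_finish_times([1.0, 1.0, 10.0])    # same work, slow request last
```

Total work is identical in both cases, but with the slow request in front the two short requests finish at 11 and 12 instead of 1 and 2.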
FFN
Feed-Forward Network — the small MLP inside each transformer block. Position-wise means it is applied to each token (each position in the sequence) on its own, reusing the same weights at every position — like one cashier serving each customer in line one at a time at the same till, never letting them interact. That is the opposite of the attention sublayer, where tokens do look at each other; the FFN just lets each token "think" by itself.
F/T sensor
Force/Torque sensor — six-axis force and moment at a wrist or fingertip
Farnebäck optical flow
A classical (non-neural) algorithm for computing dense optical flow, named after its inventor Gunnar Farnebäck. It estimates motion by approximating the brightness around each pixel with a small quadratic (a smooth curved surface) in both frames and solving for the shift that lines them up. Analogy: it slides a tiny transparent patch of the first frame around the second until it clicks into place, and records how far it had to move. It is fast and needs no training, but it is less accurate on large or blurry motion than a learned model like RAFT. Example: OpenCV's cv2.calcOpticalFlowFarneback returns a (H, W, 2) array giving each pixel's left–right and up–down movement.
FID
Fréchet Inception Distance — the standard sample-quality metric for image generation. ("Fréchet," after the mathematician Maurice Fréchet, names the Fréchet distance: a way to measure how far apart two probability distributions sit.) It runs both real and generated images through a pretrained Inception network to turn each image into a feature vector, then measures how far apart the two clouds of features sit by comparing their means and covariances (their centers and spreads). A lower FID means the generated images look statistically more like the real ones — picture two overlapping clouds of dots: the more they overlap, the smaller the distance. The real images here are only a yardstick, not an ingredient: your model invents brand-new images from random noise and never copies the real ones — FID simply needs a pile of real photos to compare those inventions against so it can score how convincing they are.
FILM
FILM (Frame Interpolation for Large Motion) is a neural frame-interpolation model from Google that, given two real frames, generates the frames in between — and it is specifically built to cope when objects move a long way between the two shots, the case where older methods smear or tear. It estimates motion at several scales at once (a coarse pass catches big jumps, finer passes catch small ones) and warps both frames toward the middle before blending them. Think of an animation assistant who can fill in the missing "in-between" drawings between two key poses even when the character has leapt clear across the scene. It is a convenient pretrained model for seeing, firsthand, the artifacts that fast motion produces.
Filterbank
A stack of band-pass filters that each measure how much energy a signal carries in one narrow frequency range; together they split a sound into a set of frequency "buckets." A mel filterbank is the specific set used to build a mel spectrogram: commonly 80 filters, each shaped by triangular weights and spaced on the perceptual mel scale, all stored as one fixed matrix. Applying it is a single matrix multiply that collapses the STFT's hundreds of evenly spaced frequency rows down to a few dozen mel bands. Like a row of differently tuned wine glasses, each ringing only for the note near its own pitch: play a chord and you can read off how much of each note is present from how loudly each glass hums. Example: a 1024-point FFT produces 513 frequency values per frame (1024 / 2 + 1); multiplying by an 80×513 mel filterbank matrix turns each time frame into just 80 numbers.
Fine-tuning
Taking a model that was already trained on a huge dataset and training it a little further on a small, specific dataset so it picks up a new skill, subject, or style. The big initial training is expensive and done once; fine-tuning is cheap and reuses all that knowledge — like hiring an experienced cook and teaching them your three house recipes rather than training someone from scratch. In image generation you might fine-tune Stable Diffusion on 20 photos of your pet so it can draw that specific pet. Fine-tuning can update every weight (as in DreamBooth) or just a tiny added piece (as in LoRA); the less you change, the smaller and more shareable the result, at some cost in how much new behavior you can absorb.
FineWeb-Edu
A large, openly released pretraining dataset built by running a quality filter over crawled web pages and keeping only the educational-looking ones — like skimming a huge pile of internet text and saving just the pages that read like a textbook. Models trained on it often beat models trained on far more unfiltered text, making it a go-to example that data quality can matter more than raw quantity.
FK / IK
Forward / Inverse Kinematics — compute end-effector pose from joints or vice versa
Flamingo
DeepMind's 2022 vision-language model that pioneered gated cross-attention: it leaves a big pretrained language model entirely frozen and inserts brand-new cross-attention layers between its blocks so the text can look at image features. The clever part is the gate — a learned multiplier that starts at exactly zero, so on the very first training step the new layers contribute nothing and the model behaves identically to the original language model, then the gate slowly opens as training teaches it how much image information to let in. Like adding a new water line to a working house but keeping its valve shut until you have checked every joint, then easing it open. This "don't break what already works, blend the new capability in gradually" trick is why Flamingo could bolt vision onto a frozen LLM without destabilizing it, and it became a template later VLMs copied. The projector-only approach of LLaVA is the simpler rival design.
FlashAttention
A much faster way to compute attention that never writes the giant token-by-token score table to slow HBM memory. Plain attention builds the full T × T grid of how strongly every token attends to every other token, parks it in HBM, then reads it back — a flood of slow memory traffic. FlashAttention instead works on small tiles inside the chip's fast on-chip memory (SRAM) and keeps a running total, so the huge grid never has to be stored at all. Like adding up a long column of numbers in your head as you go instead of writing every subtotal on paper — same answer, far fewer trips to the slow notebook. Every modern inference engine relies on it.
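The "running total" idea can be sketched for a single query in plain Python. This is only an illustration of the arithmetic, not a GPU kernel; the function names, the scalar values, and the tile size are assumptions.

```python
import math

def attention_one_query_tiled(scores, values, tile=4):
    """Softmax-weighted average of `values`, streamed one tile at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    running_out = 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale everything accumulated so far to the new maximum
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

def attention_one_query_full(scores, values):
    """Reference version that materializes every softmax weight at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

scores = [0.3, 2.0, -1.5, 4.2, 0.0, 3.9, -0.7, 1.1, 5.0]
values = [1.0, -2.0, 0.5, 3.0, 7.0, -1.0, 2.5, 0.0, 4.0]
tiled = attention_one_query_tiled(scores, values)
full = attention_one_query_full(scores, values)
```

Both functions return the same number up to rounding; the tiled one never holds more than one tile of weights at a time.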
FlashDecoding
A version of FlashAttention tuned for the decode step, where there is just one new query token but a long KV cache to read. It splits that long read across many GPU workers so the HBM bandwidth stays fully used instead of one worker plodding through the cache alone — the trick that lets engines like vLLM hit near-peak bandwidth on decode-heavy traffic.
float16
16-bit floating-point format (fp16); saves memory and can be fast on GPUs, but has a limited range (largest finite value 65,504, smallest positive value about 6e-8) that can cause overflow on large values and underflow to zero on very small ones
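Both ends of that range can be observed with the Python standard library, whose `struct` module supports the IEEE half-precision format via the `"e"` code. The helper name `to_fp16` is an assumption for this sketch.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest float16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

largest = to_fp16(65504.0)   # the largest finite float16 survives the round trip
tiny = to_fp16(1e-8)         # too small for float16, rounds to zero

try:
    struct.pack("<e", 70000.0)   # beyond the float16 range
    overflowed = False
except OverflowError:
    overflowed = True
```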
float32
32-bit floating-point format (fp32); the standard default precision for PyTorch tensors — wide enough range and enough precision for most training and inference tasks
FLOPs
Floating-Point Operations — a count of the individual arithmetic steps (additions and multiplications on decimal numbers) a model performs, used as a hardware-independent measure of how much compute one forward pass costs. You estimate it by adding up the work in each layer: a matrix multiply of an M×K matrix by a K×N one, for instance, takes about 2·M·K·N FLOPs (each of the M·N outputs needs K multiplies and K adds). Like counting the total pencil strokes a calculation requires, regardless of how fast the person writing them is. More FLOPs means a slower, costlier model — exactly the price you pay when a ViT uses smaller patches. (Note: "FLOPs" = operations; "FLOP/s" with a slash = operations per second, a speed.)
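The matrix-multiply rule of thumb above is easy to turn into a helper. The layer dimensions below are hypothetical, chosen only to give a concrete number.

```python
def matmul_flops(m, k, n):
    """Approximate FLOPs for an (m x k) by (k x n) matrix multiply."""
    # Each of the m * n outputs needs k multiplies and k adds
    return 2 * m * k * n

# Hypothetical example: one token through a 4096 -> 11008 linear projection
flops = matmul_flops(1, 4096, 11008)
```

Summing this over every matrix multiply in a model gives a first estimate of the cost of one forward pass.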
Flow matching
Training a velocity field — a model that, given a half-noisy image and a time, predicts which direction and how fast to move it toward a clean image — so that following those arrows turns pure noise into data. Concretely, you draw a straight line between a real image x_0 and random noise ε, pick a random point on that line, and train the model to output the line's direction ε - x_0; at generation time you start at noise and repeatedly step along the predicted arrows (solving an ODE) until you arrive at a clean image. It is a simpler, more modern alternative to DDPM: there is no noise schedule to tune, just one clean regression target. Like learning the wind currents over a map so that, dropped anywhere in the fog, you always know which way blows toward home.
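A toy sketch of the two halves, building a training pair and stepping along the arrows, using plain Python lists in place of images. The convention that t = 0 is data and t = 1 is noise, the Euler integrator, and the "oracle" that stands in for a trained model are all assumptions made for illustration.

```python
import random

def make_training_pair(x0, eps, t):
    """Point on the straight line from data x0 (t=0) to noise eps (t=1), plus its velocity."""
    x_t = [(1 - t) * a + t * e for a, e in zip(x0, eps)]
    target_v = [e - a for a, e in zip(x0, eps)]  # the line's direction, eps - x0
    return x_t, target_v

def euler_sample(velocity_fn, eps, steps=10):
    """Start at noise (t=1) and step back toward data (t=0)."""
    x = list(eps)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

x0 = [0.5, -1.0, 2.0]                      # stand-in for a clean image
rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in x0]    # random noise
x_t, v = make_training_pair(x0, eps, t=0.3)

# Stand-in for a trained network: an oracle that returns the true velocity
oracle = lambda x, t: v
recovered = euler_sample(oracle, eps)
```

With a perfect velocity the sampler lands back on `x0`; a real model only approximates the field, so more steps or a better solver may be needed.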
Flux
A family of state-of-the-art open-weight text-to-image models released in 2024 by Black Forest Labs (a team that included original Stable Diffusion researchers). Flux is built on a large MMDiT backbone trained with rectified flow, so text and image tokens share the same attention layers and the model denoises along nearly straight paths — which is why it follows detailed prompts and renders legible text unusually well. It ships in a few flavors: a top-quality "pro" version, an open "dev" version for tinkering, and a distilled "schnell" (German for fast) version that trades a little quality for very few sampling steps. Think of it as the generation of image models that arrived just after SD3 and pushed open-weight quality a notch higher.
Forensics
Working backward from a training failure to the operation that first caused it, instead of chasing the visible symptom. In PyTorch this means turning on autograd anomaly detection to halt at the first NaN or bad gradient.
FP4
4-bit floating point — half the bits of FP8 again, so a weight takes a quarter of the space of bfloat16. With only 4 bits there are just 16 possible values, so it sits near the edge of usable precision and needs careful checking; newer Blackwell GPUs accelerate it in hardware, making it attractive for squeezing huge models onto fewer chips.
Forward hook
A callback registered on an nn.Module that PyTorch calls automatically after the module's forward pass, receiving the input and output tensors; used for capturing activations and debugging
Forward pass
One complete run of an input through the whole network — every layer in order, from the first to the last — to produce an output (for an LLM, the logits for the next token). It means start-to-finish through all the layers, not a single layer. Like running a part down an entire assembly line once to get the finished product. The reverse direction, used in training to compute gradients, is the backward pass.
Fourier transform
A math tool that takes a signal that changes over time — like a sound wave — and reveals which pure frequencies (pitches) it is secretly built from, and how much of each. Like a glass prism splitting white light into its rainbow of colors, the Fourier transform splits a messy sound into the simple sine-wave "tones" hidden inside it. Concrete example: feed it a recording of a piano chord and it answers "this is mostly 262 Hz (middle C) + 330 Hz (E) + 392 Hz (G)". How it works: it slides every candidate frequency past the signal, multiplies the two together point by point, and adds up the products (a dot product); when a test frequency really is present the bumps line up and the sum comes out large, and when it is absent the products cancel to near zero — so a big result means "yes, that pitch is in here." The catch is that it tells you which frequencies are present across the whole clip but not when each one happened, which is exactly why the STFT runs it on short overlapping slices instead. It is named after Joseph Fourier, who showed that any repeating signal can be rebuilt by adding up enough simple sine waves.
FP8
8-bit floating point — half the bits of bfloat16. Comes in two flavors: E4M3 (4 exponent bits + 3 mantissa bits) keeps a bit more precision and is used for weights and the forward activations; E5M2 (5 exponent + 2 mantissa) trades precision for a wider range and is used for gradients, which can be very large or very small. Supported by Hopper and later NVIDIA GPUs, it is rapidly becoming the modern default serving precision.
Fragmentation
Memory wasted in gaps too small to reuse, left behind when each request is given its own contiguous chunk — like a parking lot full of single empty spaces where no bus can fit even though there is plenty of total room. Paged schemes such as PagedAttention avoid it by handing out small fixed-size pages instead of one big block per request.
Frame interpolation
Generating new frames between two existing ones to make motion smoother or a clip slower — turning, say, 24 frames per second into 60. It is sometimes called "video generation lite" because the model only has to invent the short motion between two anchors it can already see, not a whole scene from nothing. The classic analogy is hand-drawn animation: a lead artist draws the key poses and an assistant fills in the "in-between" frames — the industry literally calls this inbetweening. Modern neural versions such as FILM and Super SloMo estimate how each pixel moves between the two frames (closely related to optical flow) and warp the images toward the midpoint.
Frame rate (fps)
How many still frames a video shows per second — "fps" stands for frames per second (e.g. 24, 30, 60). It sets how much real-world time sits between two neighboring frames, so the same motion looks bigger and choppier at low fps and smoother at high fps. Analogy: a flipbook drawn with 12 pages per second looks jerky; the same drawings at 60 pages per second look fluid. Example: sampling 16 frames evenly from a 2-second clip at 8 fps covers the whole clip, but grabbing 16 consecutive frames from a 60-fps clip covers only a quarter-second — so a model must be told which fps it is seeing.
Frontier run
A training run for one of the largest, most capable models at the leading edge of what is currently possible — the kind that ties up thousands of GPUs for weeks and costs millions of dollars. Because the stakes are so high, a loss spike that cannot be recovered cleanly can throw away days of that compute, which is why teams rehearse checkpoint recovery on small models first.
Frozen
A layer or whole sub-network is frozen when its weights are held fixed during training — the optimizer is told to skip them, so no gradients update them — while other parts of the model keep learning. Like renovating one room of a house while the rest stays sealed off and untouched. Freezing is how you reuse an expensive pretrained component (a CLIP image encoder, a big language model) as a fixed feature extractor and train only a small new piece — a projector, an adapter, or a LoRA — on top: it saves memory and compute and protects the pretrained knowledge from being overwritten by a small new dataset. The opposite is leaving a part trainable (or "unfrozen"), as fine-tuning does.
F.scaled_dot_product_attention
PyTorch's built-in fused attention function (in torch.nn.functional) that computes softmax(QKᵀ/√d)·V in a single call, dispatching to an optimized backend such as FlashAttention.
FSDP
Fully Sharded Data Parallel — shard params, grads, and optimizer state across ranks
FSQ
Finite Scalar Quantization — a way to make discrete image tokens without a learned codebook. Instead of looking up the nearest entry in a trained table, it simply rounds each coordinate of the latent to the nearest value on a fixed grid, like snapping every measurement to the nearest tick on a ruler. Because there is nothing to train in the quantizer, it is simpler and sidesteps codebook collapse, yet stays competitive with VQ-VAE.
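A minimal sketch of the rounding step in plain Python. Bounding each coordinate with tanh before rounding and using an odd number of levels are assumptions of this example; the definition above only requires snapping to a fixed grid.

```python
import math

def fsq_quantize(z, levels=5):
    """Snap each coordinate to one of `levels` evenly spaced integer ticks."""
    half = (levels - 1) / 2                      # odd `levels` assumed: 5 gives ticks -2..2
    bounded = [half * math.tanh(v) for v in z]   # squash into (-half, half)
    return [round(b) for b in bounded]           # nearest tick; nothing is learned here

codes = fsq_quantize([-3.0, -0.2, 0.0, 0.4, 3.0])
```

With d coordinates and 5 levels each, the implied codebook has 5**d entries without storing any table.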
Function calling
The mechanism by which a model uses a tool: it emits a structured request (such as JSON naming a function and its arguments), an external program runs that request, and the result is handed back to the model. Also called tool use.
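The round trip can be sketched as follows. The JSON shape with `name` and `arguments` keys, the `get_weather` tool, and the `run_tool_call` helper are hypothetical; actual providers each define their own schema.

```python
import json

def get_weather(city):
    # Hypothetical tool; a real one would call an external service
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}

def run_tool_call(model_output):
    """Parse the model's JSON request, run the named function, return a JSON result."""
    request = json.loads(model_output)
    fn = TOOLS[request["name"]]          # look up the requested function
    result = fn(**request["arguments"])  # run it with the supplied arguments
    return json.dumps(result)            # text that is handed back to the model

model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
tool_result = run_tool_call(model_output)
```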
Fusion (early/middle/late)
Where in the network the information from different modalities is combined. Late fusion encodes each modality fully on its own and only compares the two finished embeddings at the very end (CLIP matching an image vector to a text vector). Middle fusion encodes each separately but then lets one stream attend to the other partway through, usually with cross-attention (a VLM feeding image features into a language model). Early fusion turns every modality into one shared stream of tokens from the very start and runs a single model over the mix (native multimodal models like Chameleon). Think of three ways to combine a recipe's flavors: stir two finished sauces together at the table (late), blend them while each is still simmering (middle), or throw every raw ingredient into one pot from the beginning (early). The earlier the fusion, the more freely the modalities can shape each other — but the more compute and data it takes to train.
Future frame prediction
The task of, given the first few frames of a video, predicting the frames that come next — an early benchmark for whether a model has learned how things move. It is the video cousin of next-word prediction in language: the model is trained to continue a sequence it has only partly seen. The classic toy benchmark is Moving MNIST and the classic baseline architecture is the ConvLSTM. Because the future is genuinely uncertain, a simple model trained with mean squared error tends to hedge by blurring — averaging all the plausible futures into one fuzzy guess rather than committing to a single sharp one.
FVD
Fréchet Video Distance — the standard (and flawed) automatic eval metric for video generation
GAE
Generalized Advantage Estimation — TD(λ) for advantages
GAN inversion
Running a GAN backwards: given a real photo, find the input latent code that makes the generator reproduce it. A trained generator only goes code → image, so inversion recovers the missing code either by optimizing it to lower reconstruction error or by training an encoder to predict it in one shot. It is the step that lets you edit a real image — once you have its code, nudging the code changes the picture.
GANs (Generative Adversarial Networks)
A class of generative models that trains two networks in a contest. A generator turns random noise into fake images, and a discriminator tries to tell those fakes from real ones; each one makes the other better, like a counterfeiter and a detective locked in an arms race. At the end you keep the generator, which by then makes images realistic enough to fool a well-trained critic. GANs produce sharp samples but are famously unstable to train — see mode collapse.
Gated
An operation where one path of a neural network controls how much of another path gets through, by multiplying the two together value-by-value (element-wise multiplication). Picture a row of dimmer switches — or the valves on a bank of faucets. The main path carries the information; the second path produces a "gate" number for each value, and that number turns the corresponding value up or down. A gate near 0 shuts a value off (nothing passes), a gate near 1 lets it through untouched, and anything in between is a partial dribble. Because the multiply happens one number at a time, every feature gets its own private valve, so the network can wave some details through while damping others — all decided on the fly from the input. This is the trick behind LSTM "forget/input gates" (deciding what to keep vs. drop from memory) and modern MLP blocks like SwiGLU, where one half of the layer gates the other. It is closely related to scale-and-shift conditioning, except a pure gate only scales (multiplies) rather than also adding an offset.
GCG
Short for Greedy Coordinate Gradient — a gradient-based attack that finds an adversarial suffix (a short string of seemingly random tokens) which, when appended to a harmful question, causes an aligned LLM to comply anyway. It works by swapping one suffix token at a time for whatever the gradient says raises the probability of an unsafe answer most. Like picking a combination lock by feeling each dial until the click; once one model is unlocked the same suffix often opens other models, which is why GCG is the standard benchmark attack in jailbreak research.
GELU
Gaussian Error Linear Unit — a smooth activation function widely used in transformer MLPs.
GEMM
GEneral Matrix Multiply — the workhorse operation C = A × B on two matrices, and the single most common heavy computation inside a neural network. GPUs are built to do GEMMs fast; nearly every layer's forward pass is one. When one input is very "skinny" (a tiny batch, as in single-token decode) the GPU's Tensor Cores sit half-idle, so that case needs a different kernel from a big, square prefill GEMM.
Generalization
How well a model performs on inputs it has never seen, as opposed to merely repeating its training examples. A model that generalizes has captured the underlying pattern; one that has only memorized has captured the examples — the difference between a student who learned how multiplication works and one who memorized a single times-table and is lost on any new numbers. For a style LoRA, generalization means the learned look transfers to prompts that never appeared in training; its opposite is overfitting. You measure it by checking performance on held-out inputs, not on the training set.
Generator
The half of a GAN that actually makes images: it takes a vector of random noise and maps it to a picture, learning to fool the discriminator into judging its output as real. It never sees the real images directly — it learns only from whether the discriminator was fooled, like a forger who improves purely from a detective's reactions. After training, the generator alone is what you keep and sample from.
GenEval
A benchmark that measures how faithfully a text-to-image model obeys the structured content of a prompt — the right number of objects, the right colors, the right spatial arrangement ("a red cube to the left of a blue sphere"). Instead of asking a person, it runs an object detector on each generated image and checks automatically whether every requested object, count, color, and position is present; the score is the fraction of prompts whose requirements were all satisfied. Like an exam graded against a fixed answer key rather than on handwriting: "two cats? — yes; one is orange? — no, fail." It targets compositional skills (counting, positioning, attribute binding) that beauty metrics like FID completely ignore.
GGUF
A single-file format for storing a quantized model — weights plus all the metadata needed to run it — popularized by llama.cpp. Like a self-contained zip that a laptop or phone can open and run without extra setup, it is the format of choice for edge and on-device inference.
Glow
A well-known normalizing flow model (from OpenAI, 2018) that improved on Real NVP by adding learnable 1×1 convolutions that shuffle and mix the channels between steps, letting it generate sharp, high-resolution faces. It showed that flows could produce convincing images and smoothly morph one face into another, though they were later overtaken by diffusion models on hard, real-world images.
GLU
Gated Linear Unit — a layer that computes two things from the input and multiplies them together element by element: one is the actual content, the other is a "gate" (a non-linearity whose output sits near 0–1) that decides how much of that content to let through. Like a row of dimmer switches, one per wire, that the network learns to turn up or down — rather than a plain on/off. Being able to suppress parts of its own signal makes a GLU more expressive than a single linear layer; SwiGLU is the popular variant that uses Swish for the gate.
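A small sketch of the content-times-gate computation in plain Python, using a sigmoid for the gate. The helper names and the tiny hand-picked weights are assumptions for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(x, weight, bias):
    """Plain matrix-vector product plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def glu(x, w_content, b_content, w_gate, b_gate):
    content = linear(x, w_content, b_content)               # the actual signal
    gate = [sigmoid(g) for g in linear(x, w_gate, b_gate)]  # one dimmer per feature, in (0, 1)
    return [c * g for c, g in zip(content, gate)]           # element-wise multiply

x = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# Gate biases chosen so feature 0 is half open and feature 1 is almost shut
out = glu(x, identity, [0.0, 0.0], zeros, [0.0, -100.0])
```

Swapping the sigmoid for Swish on the gate path gives the SwiGLU variant mentioned above.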
GPTQ
Short for Generative Pre-trained Transformer Quantization: a post-training quantization (PTQ) method that compresses the model one layer at a time, quantizing each weight matrix column by column and using second-order (Hessian) information to adjust the not-yet-quantized weights so that the int8 / int4 values keep the layer's reconstruction error small. Despite the name, GPTQ is not GPT-specific; it works on any transformer.
GQA
Grouped-Query Attention — sharing K/V heads across query heads; primary KV-cache saver at serving time
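The sharing pattern can be sketched as an index mapping. The head counts and the contiguous grouping (neighboring query heads share one K/V head) are assumptions for this example.

```python
def kv_head_for(query_head, n_query_heads=32, n_kv_heads=8):
    """Index of the shared K/V head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into K/V groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

mapping = [kv_head_for(q) for q in range(32)]
# With 8 K/V heads instead of 32, the KV cache is 4 times smaller
cache_shrink = 32 // 8
```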
Gradient accumulation
Summing the gradients from several small batches before calling the optimizer, so the update matches a larger effective batch size without its memory cost.
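A toy check in plain Python that accumulating over micro-batches reproduces the full-batch gradient. The one-parameter model, the data, and the choice to divide each micro-batch gradient by the number of micro-batches (needed when the loss is a mean) are assumptions of this sketch.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over `batch`."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
w = 0.5

full_batch_grad = grad_mse(w, data)

# Two equal-sized micro-batches processed one after another
micro_batches = [data[0:2], data[2:4]]
accumulated = 0.0
for mb in micro_batches:
    accumulated += grad_mse(w, mb) / len(micro_batches)

# Only now would the optimizer take a single step using `accumulated`
```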
Gradients
The vector of partial derivatives of a function with respect to each of its parameters. Think of it as a list of slopes telling you exactly the rate of change of the output with respect to each parameter.
The Intuition
Each slope has a direction (its sign) and a size (its magnitude). This combination of direction and size is exactly what an optimizer follows downhill, much like feeling which way a hillside slopes and how steeply to find the lowest point.
A Concrete Example
Consider a simple model y = w·x + b with weight w = 2, bias b = 1, and input x = 3.
- Prediction: y = 2·3 + 1 = 7
- Target: 10
- Loss: L = (y - target)² = (7 - 10)² = 9
Calculating the Gradient
Using the chain rule, we can determine the exact rate of change of the Loss (L) with respect to w and b. Since the derivative of L with respect to y is 2(y - target), and the partial derivative of y with respect to w is x (while with respect to b it is 1), the gradients are calculated as follows:
- For Weight (w): ∂L/∂w = (∂L/∂y) · (∂y/∂w) = 2(y - target) · x = 2(-3) · 3 = -18
- For Bias (b): ∂L/∂b = (∂L/∂y) · (∂y/∂b) = 2(y - target) · 1 = 2(-3) · 1 = -6
Interpreting the Results
- Direction (The Sign): The negative signs indicate that you need to increase both w and b to reduce the error.
- Magnitude (The Size): "Bigger" here refers to the absolute value (|-18| > |-6|). Even though -18 is a more negative number than -6, its magnitude is larger. This means the weight (w) has a much stronger influence on the loss than the bias (b).
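The worked example can be checked numerically. The sketch below recomputes the two gradients with the chain rule, compares them against finite differences, and takes one small downhill step; the learning rate of 0.01 and the function names are assumptions.

```python
def loss(w, b, x=3.0, target=10.0):
    y = w * x + b
    return (y - target) ** 2

def analytic_grads(w, b, x=3.0, target=10.0):
    dL_dy = 2 * (w * x + b - target)   # derivative of the squared error
    return dL_dy * x, dL_dy * 1.0      # chain rule: dy/dw = x, dy/db = 1

def numeric_grads(w, b, h=1e-6):
    """Central finite differences, used only as a sanity check."""
    dw = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
    db = (loss(w, b + h) - loss(w, b - h)) / (2 * h)
    return dw, db

grad_w, grad_b = analytic_grads(2.0, 1.0)
num_w, num_b = numeric_grads(2.0, 1.0)

# One gradient-descent step: move against the gradient, so both w and b increase
lr = 0.01
new_w, new_b = 2.0 - lr * grad_w, 1.0 - lr * grad_b
```

After the step the loss is lower than the original 9, which is what the negative signs predicted.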
Gradient checkpointing
A memory-saving technique that discards intermediate activations during the forward pass and recomputes them during the backward pass.
Gradient penalty
An extra loss term used in Wasserstein GANs (WGAN-GP) that keeps the critic 1-Lipschitz — meaning its output cannot change faster than its input. It works by measuring the size of the critic's gradient with respect to its input image and pushing that size toward 1. This replaces the original WGAN's blunt trick of clipping weights to a fixed range, which often hurt quality, and is the main reason WGAN-GP trains so stably.
GradScaler
A helper used with float16 mixed-precision training that multiplies the loss before the backward pass, preventing small gradients from rounding to zero (underflow).
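The effect of scaling can be imitated without a GPU by rounding numbers to float16 with the standard library. This models only the arithmetic, not the PyTorch API; the scale factor of 65536 and the helper name `to_fp16` are assumptions.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest float16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8     # a gradient too small for float16
scale = 65536.0      # loss scale; multiplying the loss multiplies every gradient

unscaled = to_fp16(true_grad)           # underflows to 0.0, the update is lost
scaled = to_fp16(true_grad * scale)     # now large enough to be represented
recovered = scaled / scale              # unscale in higher precision before the optimizer step
```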
Graph break
A point where torch.compile cannot trace the code (e.g. a print or a data-dependent branch), forcing it to split the model and fall back to eager mode — a common cause of lost speedup.
Greedy decoding
The simplest decoding rule: at every step, pick the single most likely next token (the argmax of the logits) and never roll the dice. Like always ordering the most popular dish on the menu: boring but predictable. Useful when reproducibility matters, though on a GPU even greedy decoding is not bit-for-bit deterministic across batch sizes, because a different batch size can change the order in which floating-point sums are accumulated, and floating-point addition is not associative.
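In code the rule is a single argmax. The function name and the example logits are made up for illustration.

```python
def greedy_pick(logits):
    """Index of the largest logit (ties go to the earliest index)."""
    return max(range(len(logits)), key=lambda i: logits[i])

next_token = greedy_pick([0.1, 2.5, -1.0, 2.4])
```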
Grounding
Making a VLM point at where something is in an image, not just say that it is there — the model outputs spatial references like a bounding box (a rectangle (x1, y1, x2, y2) around an object) or a single point, instead of only words. The common trick is to add special tokens (e.g. <box>) plus tokens for quantized coordinates to the vocabulary, so a location becomes a few extra tokens the model emits with ordinary next-token prediction — no new architecture needed. In modern native multimodal models, this alignment is leveraged during the decode phase by attending heavily to visual features stored in the KV cache during prefill. Analogy: the difference between a tour guide who says "there's a fountain in this plaza" and one who actually points their finger at it. Example: asked "where is the dog?", a grounded model answers "<box> 0.10 0.20 0.45 0.80", which a viewer can draw as a rectangle on the photo; this is what benchmarks like RefCOCO measure.
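Turning the example answer into a drawable rectangle might look like this. The sketch assumes the four numbers are coordinates normalized to the 0..1 range, as the example suggests; the function name and the 640 x 480 image size are hypothetical.

```python
def parse_box(text, width, height):
    """Turn '<box> x1 y1 x2 y2' (normalized 0..1) into pixel coordinates."""
    tag, *coords = text.split()
    if tag != "<box>" or len(coords) != 4:
        raise ValueError("expected '<box> x1 y1 x2 y2'")
    x1, y1, x2, y2 = (float(c) for c in coords)
    return (round(x1 * width), round(y1 * height),
            round(x2 * width), round(y2 * height))

box = parse_box("<box> 0.10 0.20 0.45 0.80", width=640, height=480)
```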
GRPO
Group Relative Policy Optimization — value-function-free PPO variant; DeepSeek lineage
GSM8K
A benchmark of about 8,000 grade-school math word problems, widely used to test step-by-step reasoning because each problem has a single checkable numeric answer.
GTSAM
Factor-graph SLAM library; the standard back-end for many modern systems
H.264
The most common video codec on the internet, also called AVC (Advanced Video Coding) — the rules used to compress almost every .mp4 you have ever streamed. It compresses well and decodes fast on nearly all hardware, which is why it is the safe default, though newer codecs like AV1 shrink files further. Analogy: it is the JPEG of video — not the smallest or newest, but supported everywhere. Example: a 5-second 720p clip that is 333 MB as raw frames might be only a few megabytes as an H.264 .mp4.
H2O
Short for Heavy-Hitter Oracle, a KV cache eviction method that keeps only the handful of past tokens that have been getting most of the attention — the "heavy hitters" — and throws the rest away. Like skimming a long book and keeping only the few sentences you keep flipping back to: you save shelf space while barely losing the plot, which lets a model serve much longer sequences in the same memory. It always keeps the very first tokens too (the attention sink), since those anchor the model no matter what they say.
Half-rotation
An efficient way to apply RoPE: rather than rotating each adjacent pair of vector components on its own, you split the vector into two halves and combine them in one shot (the rotate_half trick, [x₁, x₂] → [−x₂, x₁]). It turns many tiny 2-D rotations into a couple of whole-vector operations, so it runs fast on a GPU while giving the same result.
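A plain-Python sketch of the trick on a single vector. Pairing component i with component i + half and the base of 10000 follow a common convention but are assumptions here, as are the function names.

```python
import math

def rotate_half(x):
    """[x1, x2] -> [-x2, x1], where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + x1

def apply_rope(x, position, base=10000.0):
    half = len(x) // 2
    # One angle per pair; component i is paired with component i + half
    angles = [position / (base ** (i / half)) for i in range(half)]
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rot = rotate_half(x)
    # Two whole-vector multiplies and one add perform every 2-D rotation at once
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rot, sin)]

x = [1.0, 2.0, 3.0, 4.0]
rotated = apply_rope(x, position=5)
```

Each pair is rotated by its own angle, so the vector's length is unchanged and position 0 leaves it untouched.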
Hallucination
When an LLM states something false with the same confident tone it uses for true things: invented citations, made-up people, fabricated facts. Like a student who didn't read the book but answers the essay question anyway in confident prose; the grammar is fine, the facts are not. Hallucination is built into the next-token prediction objective, which rewards fluent continuation rather than truth, and is mitigated (not solved) by RAG, verifiers, and abstention training.
Hard negatives
In contrastive training, the wrong candidates the model finds hard to reject because they look almost right — as opposed to easy negatives so obviously wrong they teach it nothing. For a photo of a husky, the caption "a wolf in deep snow" is a hard negative (close, but wrong), while "a slice of pizza" is an easy one. Training learns fastest from hard negatives because they sit right on the boundary the model is still getting wrong, so each one delivers a large, informative gradient; mining them means actively searching the data for these near-misses (e.g. the highest-cosine-similarity mismatch) instead of hoping a random batch happens to contain some. Like a chess student who improves quickest by drilling against opponents just above their level, not by beating beginners over and over. Example: to mine hard negatives for a caption, retrieve the images it scores highly against but does not actually describe, and add those as negatives for the next training step.
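Mining by highest cosine similarity can be sketched in a few lines. The three-dimensional vectors below are invented toy numbers, not real embeddings, and the function names are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hardest_negative(image_vec, caption_vecs, positive_id):
    """Return the id of the wrong caption that is most similar to the image."""
    negatives = {cid: v for cid, v in caption_vecs.items() if cid != positive_id}
    return max(negatives, key=lambda cid: cosine(image_vec, negatives[cid]))

husky_image = [0.9, 0.4, 0.1]          # toy embedding of the husky photo
captions = {
    "a husky in the snow": [0.9, 0.5, 0.0],
    "a wolf in deep snow": [0.8, 0.5, 0.2],
    "a slice of pizza": [0.0, 0.1, 0.9],
}
hard = hardest_negative(husky_image, captions, positive_id="a husky in the snow")
```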
HBM
High-Bandwidth Memory — stacked DRAM on a modern GPU; usually the bandwidth bottleneck
Headroom
The safety margin you have left before something breaks. In low-precision training it is the spare range of values a number format can still represent before it overflows or rounds down to zero and triggers numerical issues — like the gap between your head and the ceiling: the less you have, the easier it is to bump into trouble. FP8 packs numbers into far fewer bits than bfloat16, so it has much less headroom and is more likely to destabilize a run.
Heads (attention)
The independent, parallel attention sub-computations in multi-head attention. Each head operates on its own learned projections of queries, keys, and values, so different heads can latch onto different relationships — one might track which word is the grammatical subject while another tracks what rhymes — and the model attends to several representation subspaces at once. They are called heads by analogy to the read/write "heads" of a tape or disk drive: several separate readers scanning the same strip of data in parallel, each pulling out something different. "Multi-head" attention simply runs many such readers side by side and then joins their findings.
Hessian
The matrix of all second partial derivatives of a function — it captures the curvature of a loss landscape, not just its slope. Where the gradient tells you "which way is downhill," the Hessian tells you "and how sharply does it bend." Like the difference between knowing a road slopes down and knowing whether it banks into a tight curve or stretches out almost flat. For real LLMs the full Hessian is too big to store (rows × columns each equal to the parameter count), so methods like GPTQ use cheap approximations of it — typically built from a small batch of calibration activations — to decide which weights matter most when quantizing.
Heun's method
A second-order ODE solver that improves on the Euler method with a predict-then-correct step: it takes a tentative Euler step, measures the slope at that new point too, then moves using the average of the start and end slopes. Averaging the two slopes cancels much of the error Euler makes, so Heun reaches the same accuracy in far fewer steps — which is why it is the default sampler in EDM. Like checking the road both where you are and where you're about to be, then steering down the middle. Named after the German mathematician Karl Heun.
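The predict-then-correct step can be compared with Euler on a simple test equation. The decay ODE dy/dt = -y, the step count, and the function names are choices made for this sketch, not part of any particular sampler.

```python
import math

def euler_step(f, t, y, h):
    return y + h * f(t, y)

def heun_step(f, t, y, h):
    k1 = f(t, y)                     # slope at the start
    y_pred = y + h * k1              # tentative Euler step
    k2 = f(t + h, y_pred)            # slope at the predicted end point
    return y + h * 0.5 * (k1 + k2)   # move along the average of the two slopes

def integrate(step, f, y0, t_end, n_steps):
    t, y, h = 0.0, y0, t_end / n_steps
    for _ in range(n_steps):
        y = step(f, t, y, h)
        t += h
    return y

decay = lambda t, y: -y              # exact solution is exp(-t)
exact = math.exp(-1.0)
euler_err = abs(integrate(euler_step, decay, 1.0, 1.0, 10) - exact)
heun_err = abs(integrate(heun_step, decay, 1.0, 1.0, 10) - exact)
```

With the same ten steps, Heun's error on this problem is more than ten times smaller than Euler's, at the price of two slope evaluations per step.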
Hierarchical VAE
A VAE with several layers of latent variables stacked at different scales instead of just one. Higher levels capture the big picture (overall layout and shape) while lower levels fill in fine detail (texture and edges), much like an artist who first sketches rough shapes and then adds the small touches. Splitting the work across levels lets the model represent complex images far better than a single flat latent space can. NVAE and Very Deep VAE are well-known examples.
Higher-order sampler
In diffusion sampling, the solver takes a series of discrete steps along an ODE path from pure noise toward a finished image. A higher-order sampler estimates the shape of that path more accurately at each step by using extra slope measurements, instead of assuming the path is locally a straight line. A first-order method (Euler) just follows the slope where it currently stands; a second-order method like Heun's method or DPM-Solver++ also peeks ahead and averages the two slopes, cancelling most of the error. Because each step is more accurate, higher-order samplers reach good image quality in far fewer steps — often 15–25 instead of 50+. Picture driving toward a bend: a first-order driver steers only by where the road points right now and drifts wide, while a higher-order driver also notices how the road is curving ahead and corrects, staying on track with fewer adjustments.
Holonomic
A vehicle whose instantaneous motion can be any direction (mecanum, omni)
Hopper
NVIDIA's 2022 GPU architecture (H100, H200) and the workhorse of LLM training and serving in 2023–2024. It was the first generation to ship dedicated FP8 Tensor Cores, which is what made FP8 inference a practical option. Named after Grace Hopper, the computer scientist who invented the compiler.
Hue
The "color name" part of a color — red, orange, green, blue, and so on — separate from how light or dark it is and how vivid it is. It is one axis of the way computers describe color (the H in the HSV color model), and it wraps around in a circle, so the two ends meet and both 0° and 360° are red. Analogy: hue is the label on a paint tube ("blue"), brightness is how much white or black you stirred in, and saturation is how strong the color is. In an optical flow picture, hue is often used to show the direction each pixel moved — each compass direction gets its own color — while brightness shows how fast it moved.
Hybrid retrieval
Retrieving with both dense embedding search (matches meaning) and sparse keyword search (BM25) (matches exact words) and merging the two result lists, so each method covers the other's weaknesses.
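One common way to merge the two lists is reciprocal rank fusion, sketched below. The choice of this merging rule, the constant k = 60, and the document ids are assumptions; the definition above does not prescribe a specific method.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Score each document by the sum of 1 / (k + rank) over every list it appears in."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_hits = ["doc_b", "doc_a", "doc_d"]    # hypothetical embedding-search results
sparse_hits = ["doc_a", "doc_c", "doc_b"]   # hypothetical BM25 results
merged = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that rank well in both lists rise to the top, while a document found by only one method still stays in the running.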
I2V
Image-to-Video: the task of generating a short video clip starting from a single still image, where the model invents plausible motion while keeping the first frame's appearance fixed. It is easier than text-to-video (T2V) because the image already settles what the scene looks like, leaving the model to handle only how it moves — and its training data is essentially free, since any video clip can be split into "first frame = input, the rest = target" with no text caption needed. Stable Video Diffusion is the canonical open I2V model.
Identity function
A function that returns its input unchanged: f(x) = x. In the context of straight-through estimators, gradients are passed through a non-differentiable operation as if it were the identity function.
Ideogram
A text-to-image model (and product) built by a startup of the same name, especially praised for text rendering — drawing legible, correctly-spelled words, logos, and typography inside images, which makes it a favorite for posters and graphic design. Like a sign painter you can trust to spell the shop name right, not just paint pretty letters. It competes with DALL·E 3, Imagen 3, and Flux.
Image embedding
The single dense vector a vision model boils a whole picture down to — a fixed-length list of numbers (say 512 of them) that captures what is in the image rather than its raw pixels. CLIP's image encoder, for instance, reads the pixels and outputs one such vector, placing pictures with similar content near each other in the shared embedding space. Analogy: distilling a whole meal down to a single flavor profile you can quickly compare against other dishes — you lose the individual ingredients but keep the essence needed to say "these two are alike." Because an image and a caption can then be compared just by the cosine similarity of their vectors, image embeddings are what make zero-shot classification and cross-modal retrieval work. Example: in CLIP the photo of a dog and the sentence "a photo of a dog" each become one vector, and the two land close together.
Imagen 3
Google's text-to-image model, known for photorealistic detail and unusually good text rendering — it can spell words inside the picture correctly, long a weak spot for generators. It leans on a strong text encoder and carefully curated training data to follow prompts faithfully. Like a meticulous illustrator who not only paints the scene you describe but gets the lettering on the signs right. It is Google's competitor to DALL·E 3 and Stable Diffusion.
ImageNet
A large benchmark dataset of about 1.2 million photos hand-labeled into 1,000 everyday categories (breeds of dog, kinds of mushroom, vehicles, and so on). For over a decade it has been the standard yardstick for "how well does this model see," so a new image encoder is almost always reported by its ImageNet accuracy. Think of it as the standardized entrance exam of computer vision — not perfect, but common enough that everyone's scores can be compared on one scale. A larger, even more finely labeled version is called ImageNet-21k (≈21,000 categories); see also its much smaller cousin CIFAR-10.
img2img
Generating a new image that is guided by an existing input image instead of starting from pure noise. You partially noise the input — controlled by a denoising strength (0 = keep the original, 1 = ignore it) — then let the diffusion model denoise from there, so the result keeps the rough layout and colors of the input while following the new prompt. Like tracing over a rough sketch: the more you erase first, the more freedom the model has to redraw.
Impedance control
Command a virtual spring-damper between end-effector and reference
IMU
Inertial Measurement Unit — gyroscope + accelerometer (often + magnetometer)
Inception network
A famous image-classification convolutional neural network (the "Inception" / GoogLeNet family) trained on millions of labeled photos. Along the way it learns to boil any image down to a compact feature vector — a list of numbers that captures what is in the picture (fur, wheels, sky) rather than the raw pixels. Because those features are such good summaries of image content, quality metrics like FID reuse a frozen, pretrained Inception network as a fixed yardstick instead of training anything new — like always using the same trusted scale to weigh two bags so the comparison is fair. (It was nicknamed "Inception" after the movie, for its "network inside a network" design.)
Indexing
Mapping a multidimensional index [i, j, …] to a flat storage position via offset + Σ iₖ·strideₖ
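The formula can be sketched in a few lines. This assumes strides counted in elements (not bytes) and a row-major layout for the example; both helper names are illustrative.

```python
def row_major_strides(shape):
    # The last axis is contiguous (stride 1); each earlier axis skips over everything after it.
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

def flat_index(index, strides, offset=0):
    # offset + sum over k of index[k] * strides[k]
    if len(index) != len(strides):
        raise ValueError("index and strides must have the same length")
    return offset + sum(i * s for i, s in zip(index, strides))

shape = (2, 3, 4)
strides = row_major_strides(shape)          # (12, 4, 1)
position = flat_index((1, 2, 3), strides)   # 1*12 + 2*4 + 3*1
print(strides, position)
```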
Inference-time compute
The work a model does while answering a question (not while training) — for reasoning models, mostly the tokens it spends "thinking" before it replies. Giving a fixed model more inference-time compute, like giving a student more time on an exam, can raise its accuracy without changing the model at all.
InfiniBand (IB)
High-speed network with RDMA; standard for AI clusters
InfoNCE
The contrastive loss that CLIP and most dual encoders train with: for each item it pulls the one correct match closer and pushes every other candidate away. How it is computed: take a batch of N image–caption pairs, L2-normalize every vector, and build the N×N grid of cosine-similarity scores (one matmul). Each row is one image scored against all N captions, and the correct caption sits on the diagonal. Apply softmax across the row and ask that the diagonal entry get nearly all the probability, which is exactly cross-entropy with "the right answer is position i." Do this across rows and again across columns and average the two. The name joins Info (the mutual information between the paired items, on which training with this loss maximizes a lower bound) with NCE (Noise-Contrastive Estimation): the off-diagonal pairs are the "noise" the true pair must be told apart from. Analogy: a police lineup where the model must point to the one caption that truly goes with this photo while N−1 decoys stand beside it, scored on how confidently it picks the right one.
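The computation described above can be sketched in plain Python. The temperature divisor is an assumption added here (CLIP-style models scale the similarities before the softmax), and the tiny 2-D vectors are made up for illustration.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(scores, target):
    # -log softmax(scores)[target], computed stably via log-sum-exp.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target]

def info_nce(image_vecs, text_vecs, temperature=0.07):
    imgs = [l2_normalize(v) for v in image_vecs]
    txts = [l2_normalize(v) for v in text_vecs]
    n = len(imgs)
    # N x N grid of scaled cosine similarities; the correct pairs sit on the diagonal.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    rows = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    cols = sum(cross_entropy_row([sim[r][j] for r in range(n)], j) for j in range(n)) / n
    return 0.5 * (rows + cols)              # average the two directions

images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[2.0, 0.1], [0.1, 3.0]]         # caption i points the same way as image i
matched_loss = info_nce(images, captions)
shuffled_loss = info_nce(images, captions[::-1])
print(matched_loss, shuffled_loss)
```

The loss is small when each image lines up with its own caption and large when the pairing is scrambled.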
Inpainting
Filling in a masked-out region of an image so the patch blends seamlessly with the rest. You hand the model the surrounding pixels as fixed context and let it generate only the hole — like a restorer repainting a torn corner of a photo to match the surviving picture. With a diffusion model this is done by re-noising and denoising only inside the mask while pasting the known pixels back on every step.
Instruction tuning
A second training stage that turns a model which merely continues text (or, for a VLM, describes an image) into one that follows requests — by fine-tuning it on many (instruction, response) examples instead of raw documents. For a VLM the examples are conversational (image, question, answer) triples, like the LLaVA-Instruct set whose dialogues a strong language model wrote from image annotations. Analogy: a fluent speaker who can ramble on any topic versus a helpful assistant who answers the exact question you asked — same vocabulary, very different behavior, and the gap is closed purely by showing thousands of question-and-answer demonstrations. Example: before tuning, shown a photo and "What is the dog doing?", the model might just caption "a dog on grass"; after tuning it answers "It is catching a frisbee." The key lesson is that this capability comes from data, not architecture — the network is unchanged; only what it trains on differs.
InstructPix2Pix
An image-editing model that takes a photo and a plain-English instruction ("make it winter," "add sunglasses") and returns the edited photo in a single pass, with no masks and no per-image optimization. Its real trick is the training data: since no one wants to hand-edit thousands of photos, the data is made synthetically. A large language model writes an instruction plus before/after captions, and a text-to-image model (Stable Diffusion) with Prompt-to-Prompt renders a matched image pair that differs only in the described change. A Stable Diffusion model is then fine-tuned on several hundred thousand of these (before image, instruction, after image) triples to become the editor. Like teaching an editor by showing them countless "before, instruction, after" flashcards until they can follow any new instruction.
int8
8-bit integer format; storing weights or activations as int8 uses a quarter of the memory of float32 and can run faster, at some cost in precision.
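A minimal sketch of how floats might be mapped to int8 and back. It assumes symmetric per-tensor scaling with the range clipped to [-127, 127], which is only one of several schemes in use; the function names are illustrative.

```python
def quantize_int8(values):
    # One scale for the whole tensor: the largest magnitude maps to 127.
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is at most half a scale step, which is the "cost in precision": note how the small weight 0.003 rounds all the way to zero.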
Inter-rater agreement
A measure of how often two or more graders give the same scores to the same items — the check you run before trusting one grader to stand in for another. If a cheap LLM-as-judge and a human reviewer rate the same 100 answers and their scores line up, the automatic judge can replace expensive human review; if they disagree a lot, it cannot. It is computed with a statistic such as a correlation (how well two lists of numbers rise and fall together, on a −1-to-+1 scale) or Cohen's kappa (the fraction of agreement beyond what random guessing alone would produce, on a roughly 0-to-1 scale, named after the psychologist Jacob Cohen). Analogy: two teachers marking the same stack of essays — if their grades nearly match you can trust either one alone next time, but if they wildly differ then the rubric (or one of the teachers) is unreliable.
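Cohen's kappa for two graders can be sketched as follows. The label lists are invented for illustration, and the function handles the simple two-rater, categorical case only.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Fraction of items on which the two graders actually agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each grader's own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

human = ["good", "good", "bad", "bad", "good", "bad", "good", "bad"]
judge = ["good", "good", "bad", "good", "good", "bad", "good", "bad"]
kappa = cohens_kappa(human, judge)
print(kappa)
```

Here the graders agree on 7 of 8 items (0.875) while chance alone would give 0.5, so kappa is (0.875 - 0.5) / (1 - 0.5) = 0.75.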
IQL
Implicit Q-Learning — offline RL that never queries Q at OOD actions
Isaac Lab
NVIDIA GPU-parallel robotics simulation platform
iso
A prefix meaning "equal" or "the same" (from the Greek isos). In a phrase like iso-FLOP it marks a group of training runs that all spent the same compute budget, so they can be compared fairly — like rating cars by how far each travels on the same tank of fuel rather than on top speed. Plotting the loss of several iso-FLOP runs is how a Chinchilla-style scaling-law curve is drawn.
ITL / TPOT
Inter-token latency / time per output token — steady-state per-token decode time
Jacobian
Linear map from joint velocities to end-effector spatial velocity
Jailbreak
A prompt (sometimes plain English, sometimes a gradient-found suffix like in GCG, sometimes a long role-play setup or a translation into a low-resource language) that gets a safety-trained model to do what its alignment training was supposed to refuse. Like picking a hotel-room door lock instead of asking for the key. Modern defenses assume any single safety layer can be jailbroken and use defense in depth (input filtering, output filtering, monitoring, refusal classifiers) instead of trusting the model alone.
Joint image-video training
A training recipe that feeds a video model a mix of still images and video clips in the same run, treating each still image as a one-frame "video", so the model keeps its sharp single-image skills while it learns motion. The problem it solves: training on video alone lets a model's still-image quality decay, because video datasets are smaller and more compressed than image datasets, so the rich appearance knowledge an inflated image model started with drifts away. Mixing in a large fraction of images (often the majority of each batch) keeps that knowledge fresh and makes the model far more data-efficient. It needs no architectural change because a still image is simply the T=1 special case of a video: the same layers process both.
Kernel
A single function that runs on the GPU (or CPU) to carry out one operation, such as a matrix multiply or an element-wise add.
Kernel fusion
Combining several small operations into one kernel so the hardware reads and writes memory fewer times and pays fewer launch costs.
KF
Kalman Filter — optimal linear-Gaussian Bayes filter
KL divergence
Short for Kullback-Leibler divergence — a number that measures how far one probability distribution has drifted from another, growing larger the more the two disagree. In RLHF it acts as a leash on the policy being trained: the further its word probabilities wander from the frozen reference model, the bigger the penalty it pays. Like a tether that lets a climber explore but stops them straying somewhere dangerous, it lets the model chase reward without forgetting how to talk sensibly.
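For discrete distributions the number is a short sum. The sketch below assumes the RLHF convention of measuring the policy against the reference, KL(policy ‖ reference); the three-token distributions are invented for illustration.

```python
import math

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.5, 0.3, 0.2]        # frozen reference model's next-token probabilities
close_policy = [0.45, 0.35, 0.20]  # policy that has drifted a little
far_policy = [0.05, 0.05, 0.90]    # policy that has drifted a lot

small_penalty = kl_divergence(close_policy, reference)
large_penalty = kl_divergence(far_policy, reference)
print(small_penalty, large_penalty)
```

Note that KL divergence is not symmetric: swapping the two arguments generally gives a different number.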
KV cache
A scratchpad that stores the attention keys and values already computed for every earlier token in the sequence, so generating the next token only has to compute keys and values for that one new token instead of redoing all the previous ones. Like writing out a long multiplication table once and then looking up products instead of recalculating them — it turns each decode step from "redo the whole prompt" into "do one more token," which is what makes long-context serving fast enough to be usable.
L2 normalization
Rescaling a vector so its length becomes exactly 1 while keeping the direction it points unchanged — done by dividing every element by the vector's own length. It is called L2 because the length it uses is the L2 norm (also called the Euclidean norm — the ordinary straight-line distance you would measure with a ruler). The "2" comes from the p in the general Lp norm formula, which takes the p-th root of the sum of each element's p-th power; set p = 2 and that becomes the square root of the sum of squares — exactly the Pythagorean length √(x₁² + x₂² + …). Worked example: [3, 4] has length √(3² + 4²) = √25 = 5, so its L2-normalized form is [3/5, 4/5] = [0.6, 0.8] — same direction, but now length 1 and sitting on the unit sphere. Analogy: shrinking every arrow on a map to the same one-inch length so you can compare which way they point without the longer arrows drowning out the shorter ones. This is the step that turns a plain dot product into cosine similarity, which is why CLIP L2-normalizes every image and text embedding before scoring matches — so only direction (meaning), not magnitude, decides the score.
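The worked example above, as a short sketch (helper names are illustrative):

```python
import math

def l2_normalize(v):
    length = math.sqrt(sum(x * x for x in v))   # the L2 (Euclidean) norm
    if length == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

unit = l2_normalize([3.0, 4.0])                 # length 5 -> [0.6, 0.8]
# After normalization a plain dot product is the cosine similarity:
cosine = dot(l2_normalize([3.0, 4.0]), l2_normalize([6.0, 8.0]))
print(unit, cosine)
```

The vectors [3, 4] and [6, 8] differ in length but point the same way, so their cosine similarity comes out as 1.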
L2 regularization
A regularization technique that adds a penalty proportional to the squared magnitude of model weights to the loss function, encouraging smaller weights and reducing overfitting. In standard adaptive optimizers such as Adam, this penalty is folded into the gradient and scaled by the adaptive learning rate, which is why AdamW uses decoupled weight decay instead.
LAION
A family of huge, openly released image-text datasets (LAION-400M, LAION-5B — the number counts the image-caption pairs) scraped from the public web by the non-profit LAION (short for Large-scale Artificial Intelligence Open Network). Each entry is just an image URL plus its alt-text caption, kept only if CLIP judged image and caption to roughly match. It is the public fuel that trained Stable Diffusion and many other open models. Like a giant secondhand library assembled by photographing every captioned picture on the open internet — enormous and free, but riddled with mislabeled, duplicated, and low-quality entries, which is why every serious user re-filters and deduplicates it before training. Example: "LAION-2B-en" is the roughly 2-billion-pair English-caption subset.
Langevin dynamics
A way to draw samples from a distribution when you only know its score — the gradient of its log-density. You start from a random point and repeatedly take a small step in the score direction (uphill toward higher probability) while also adding a little random noise each step so you explore rather than collapse onto a single peak. The uphill pull plus the random shake settles the point into high-probability regions in the right proportions, like a ball jiggling around a bumpy bowl and spending most of its time in the deepest dips. It is the sampling method behind the original score-based generative models. Named after the physicist Paul Langevin.
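A toy sketch of the update rule, assuming a 1-D Gaussian target (mean 2, standard deviation 1) whose score is known in closed form. In a real score-based model the `score` function would be a trained network; the step size, burn-in, and seed are arbitrary choices.

```python
import math
import random

def score(x, mean=2.0, std=1.0):
    # Gradient of the log-density of a Gaussian: points back toward the mean.
    return -(x - mean) / (std * std)

def langevin_samples(n_steps, step_size=0.05, burn_in=1000, seed=0):
    rng = random.Random(seed)
    x = rng.uniform(-5.0, 5.0)                  # start from a random point
    samples = []
    for step in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Small uphill step along the score, plus a random shake.
        x = x + step_size * score(x) + math.sqrt(2.0 * step_size) * noise
        if step >= burn_in:                     # discard the early, unsettled part
            samples.append(x)
    return samples

samples = langevin_samples(20000)
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(sample_mean, sample_std)
```

With a finite step size the chain only approximates the target, so the recovered mean and spread are close to, not exactly, 2 and 1.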
Latency
The time it takes to complete a single request, from input to output; distinct from throughput, which counts how many requests finish per second.
Latent space
The compressed set of numbers a model uses to represent its data internally, after stripping away the raw detail. Each point in this space stands for one possible output, and nearby points usually mean similar outputs — so you can smoothly "walk" from one to another and watch the result morph. Think of it as the model's private map of its world: instead of a full 28×28-pixel image, an autoencoder might describe each digit with just 32 numbers, and that 32-number space is the latent space.
Latent video
The compressed form of a video that a 3D VAE produces: instead of the raw (T, H, W, C) pixel tensor, you get a much smaller (T', H', W', C') grid where time, height, and width have all been shrunk and the channel count typically changes from 3 RGB channels to a handful of latent channels (often ~100× fewer numbers overall). Modern video diffusion runs in this latent space rather than on pixels, because denoising a 100×-smaller tensor is what makes high-resolution video generation affordable at all.
LCM
Latent Consistency Model — a consistency model distilled in the latent space of a VAE, giving 1–4-step Stable Diffusion-style sampling. It is the most practical few-step recipe for SD-style stacks, which is what makes near-interactive image generation possible.
LDM
Latent Diffusion Model — a diffusion model that runs in the latent space of a VAE rather than on raw pixels. A VAE first compresses the image (or video) into a much smaller grid of numbers; the diffusion model learns to denoise that small grid, and the VAE decoder turns the finished latent back into pixels. Because the latent is often ~50–100× smaller than the image, every training and sampling step is dramatically cheaper, which is the whole reason high-resolution generation became affordable. Stable Diffusion is the canonical image LDM; modern video models apply the same idea on top of a 3D VAE.
LFQ
Lookup-Free Quantization — a way to turn a continuous latent into a discrete token without a learned codebook. Instead of comparing each latent vector against a trained table of code entries and picking the nearest (the VQ-VAE way), LFQ squashes each latent dimension to a sign — roughly, "is this number positive or negative?" — so the pattern of signs across the dimensions is the integer code. With no table to look up, there is nothing that can go unused, which sidesteps codebook collapse and lets the effective vocabulary grow huge cheaply. It is the quantizer behind MagViT-v2 and a close cousin of FSQ, which snaps each dimension to a small grid of levels rather than just a sign.
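The sign-to-integer step can be sketched as below. The bit order (first dimension as the most significant bit) and the choice to treat exactly zero as negative are assumptions made for the example, not details taken from a specific implementation.

```python
def lfq_quantize(latent):
    # Each dimension collapses to its sign: +1 or -1.
    return [1.0 if x > 0 else -1.0 for x in latent]

def lfq_token(latent):
    # Read the sign pattern as a binary number: positive -> bit 1, otherwise bit 0.
    code = 0
    for x in latent:
        code = (code << 1) | (1 if x > 0 else 0)
    return code

latent = [0.7, -1.2, 0.1, -0.3]
signs = lfq_quantize(latent)       # [1.0, -1.0, 1.0, -1.0]
token = lfq_token(latent)          # bits 1010
vocab_size = 2 ** len(latent)      # every sign pattern is a valid code
print(signs, token, vocab_size)
```

Adding one latent dimension doubles the vocabulary without any table to store, which is why the effective vocabulary can grow so cheaply.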
Leaderboard
A public ranking that lists models by their score on one or more benchmarks, best at the top, like a sports league table for AI models. It makes progress easy to see at a glance, but a single number hides many choices (prompt wording, answer parsing, image resolution), so two groups can report different scores for the same model; a high rank is also suspect if the test questions leaked into training (see contamination). Example: the MMMU leaderboard ranks VLMs by their accuracy on the MMMU exam, and a new model's headline claim is usually "we moved up this board."
Learnable
Refers to parts of an AI model (like weights or parameters) that are not set in stone by the programmer, but are instead adjusted automatically during training to improve performance. Like the knobs on a radio that tune themselves until the station comes in perfectly clear, rather than being glued in place.
Learning rate
The step size an optimizer takes when nudging the weights along the gradient. Too large and training overshoots and diverges; too small and it crawls — like choosing how big a step to take walking downhill in fog. It is usually ramped up during warmup and then decayed over the run.
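A sketch of a warmup-then-decay schedule. The entry does not specify the decay shape, so the cosine curve, the peak value, and the step counts here are assumptions chosen for illustration.

```python
import math

def learning_rate(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=0.0):
    if step < warmup_steps:
        # Linear ramp from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from the peak down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(s) for s in (0, 50, 99, 100, 550, 1000)]
print(schedule)
```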
LiDAR
Light Detection And Ranging — laser range scanner
Linear probe
A small linear classifier trained on the frozen hidden activations of a layer of a neural network to test whether that layer has already encoded some property — for example, "is this sentence true?", "what is the capital of this country?", or "which language is this?" Like sticking a voltmeter into one wire of a circuit to see what signal is flowing past that point; you don't change the circuit, you just read what's already there. The standard first tool in mechanistic interpretability.
Lipschitz constraint
A limit on how fast a function's output can change as its input changes: a 1-Lipschitz function never changes its output by more than the distance you moved the input. Picture a road whose slope is capped so it can never get steeper than 45° — no cliffs allowed. (The name simply honors the 19th-century German mathematician Rudolf Lipschitz, who first wrote down this "bounded-steepness" condition; it is not a description of the rule itself, the way "Celsius" is just a person's name rather than a word about temperature.) Wasserstein GANs require their critic to obey this so the Earth Mover's Distance it estimates stays valid, which is what the gradient penalty enforces.
LLaVA
Large Language and Vision Assistant: an open-source vision-language model that shows how far the simplest possible design can go. Take a pretrained CLIP image encoder (kept frozen), take a pretrained LLM, and connect them with nothing but a lightweight projector (a single linear layer or small MLP) that translates each image patch's feature vector into the LLM's word-embedding space. The LLM then "reads" the image as if it were a sequence of extra words. Training first fits only the projector while the LLM stays frozen, then fine-tunes the projector and the LLM together on instruction data. Think of a United Nations translator who listens to a speech in one language and re-phrases each sentence for a listener who only speaks another: the translator (projector) does not change the content, just the format. Despite having no cross-attention or Q-Former, LLaVA matches or beats far more complex architectures on many visual-question-answering benchmarks, which is why its projector-only design became a widely-copied template. Compare with Flamingo, which uses gated cross-attention instead.
LLM
Large Language Model — a transformer trained on large amounts of text to predict and generate language.
LLM-as-judge
Using a strong LLM to grade or compare other models' answers in place of a human rater — fast, cheap, and surprisingly well-calibrated, though it tends to favor longer answers and ones written in its own style. To catch position bias you usually ask twice with the two answers swapped and trust only an agreeing verdict — like a blind wine tasting where the same two bottles are poured first as "Glass A, Glass B" and then again as "Glass B, Glass A"; you only believe the judge picked the better wine if they pick the same bottle both times, because that rules out them simply liking whichever glass sat on the left.
Load balancing
Spreading incoming requests across several copies of a service so no single one is overwhelmed while others sit idle — like a supermarket opening more checkout lanes and a greeter waving each new customer to the shortest one. The simplest rule is round-robin (hand requests out in turn, 1-2-3-1-2-3…); smarter rules send each request to the least-busy replica or to the one whose cache is already warm. The component that does this is a load balancer.
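The two simplest rules can be sketched in a few lines. The replica names and in-flight counts are made up; a real load balancer would also track health checks and timeouts.

```python
import itertools

def round_robin(replicas):
    # Hand requests out in turn: a, b, c, a, b, c, ...
    return itertools.cycle(replicas)

def least_busy(in_flight):
    # Pick the replica currently handling the fewest requests.
    return min(in_flight, key=in_flight.get)

picker = round_robin(["a", "b", "c"])
rr_picks = [next(picker) for _ in range(5)]
busy_pick = least_busy({"a": 3, "b": 1, "c": 2})
print(rr_picks, busy_pick)
```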
Load shedding
Deliberately dropping or rejecting some requests when a server is overloaded, so the ones it does accept still meet their targets. Returning a fast "try again later" to low-priority traffic is far kinder than letting every request crawl — like a busy restaurant turning new walk-ins away so the diners already seated still get served on time. It usually works hand in hand with admission control and request priority.
Logits
The raw, unnormalized scores a model produces at its output, one per vocabulary entry, before they are turned into probabilities by softmax. Like the points each contestant has scored at the end of a game — bigger means "more likely the next token" — but to read them as percentages you have to normalize. Sampling rules (temperature, top-k, top-p) all reshape the logits before the random draw, and argmax of the logits is what greedy decoding picks.
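A small sketch of turning logits into probabilities, with temperature applied before the softmax. The three-entry logit list is invented for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)                        # plain normalization
sharp = softmax(logits, temperature=0.5)       # lower temperature favors the top logit more
greedy = max(range(len(logits)), key=lambda i: logits[i])   # argmax: the greedy pick
print(probs, sharp, greedy)
```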
LoRA
Low-Rank Adaptation — a cheap way to fine-tune a huge model without rewriting it. Instead of changing the model's billions of frozen weights, you leave them all untouched and bolt on a tiny pair of extra low-rank matrices that nudge the output. Why a pair and not a single matrix? A lone update matrix would have to be the same full size as the weights it is correcting — which defeats the whole point of saving space. The trick is to split that update into two skinny matrices in a row: the first squeezes the big input down to just a handful of numbers, and the second expands those few numbers back out to full size. Picture an hourglass — wide, pinched to a narrow waist, then wide again: it is the narrow waist in the middle (the low rank) that keeps the total number of stored values tiny, and you need both halves of the hourglass to get from one side to the other. Like leaving a printed textbook exactly as it is and slipping in a few sticky notes that change how you read it: the notes are small to store, quick to write, and you can keep a different set of notes for each task and swap them in and out.
Loss function
A mathematical function that measures the difference between a model's prediction and the actual target. The goal of training is to minimize this value using gradients.
Loss masking
Telling the trainer to compute the loss only on the tokens you want the model to learn to produce — in SFT, the assistant's reply — and to ignore the rest, like grading only a student's answers and not the printed questions.
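A minimal sketch of the idea, assuming per-token log-probabilities are already available and a 0/1 mask marks which tokens belong to the assistant's reply. The numbers are invented for illustration.

```python
import math

def masked_nll(token_log_probs, loss_mask):
    # Average negative log-likelihood over the tokens whose mask is 1 only.
    kept = [-lp for lp, m in zip(token_log_probs, loss_mask) if m]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)

# Two prompt tokens followed by two reply tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.8)]
mask = [0, 0, 1, 1]
loss = masked_nll(log_probs, mask)

# Changing how well the prompt tokens are predicted leaves the loss untouched.
other_prompt = [math.log(0.9), math.log(0.9), math.log(0.5), math.log(0.8)]
same_loss = masked_nll(other_prompt, mask)
print(loss, same_loss)
```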
Loss spike
A sudden jump in the training loss, usually from an outlier batch or optimizer instability; small spikes are normal, but a diverging one can ruin a run.
Loss value
The single scalar number produced by evaluating the loss function on a model's predictions. autograd's backward pass computes gradients of this one scalar with respect to every parameter, which is what makes reverse-mode differentiation efficient.
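LoRAX / S-LoRA
Multi-LoRA serving engines: one shared base model stays in GPU memory (HBM) while many small LoRA adapters are loaded on top of it, so a single deployment can serve many fine-tunes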
Low-rank
A way of approximating a big matrix as the product of two much skinnier ones, capturing most of its information with far fewer numbers. A full 1000×1000 weight matrix holds a million entries, but if its real content is "low rank" you can rebuild it well from, say, a 1000×8 matrix times an 8×1000 matrix: 16,000 numbers instead of a million. Like summarizing a thick report with a handful of bullet points that still carry the gist. This is the trick behind LoRA: freeze the giant base weights and learn only a small low-rank update on top.
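The storage arithmetic is easy to check. The sketch below counts the numbers stored for a rank-8 factorization of a 1000×1000 matrix and builds a tiny rank-1 product to show the shapes; all names are illustrative.

```python
def low_rank_params(rows, cols, rank):
    # A (rows x rank) factor plus a (rank x cols) factor.
    return rows * rank + rank * cols

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

full_params = 1000 * 1000
factored_params = low_rank_params(1000, 1000, 8)
savings = full_params / factored_params

# Tiny shape check: a 3x1 factor times a 1x3 factor rebuilds a full 3x3 matrix.
tall = [[1.0], [2.0], [3.0]]
wide = [[4.0, 5.0, 6.0]]
rebuilt = matmul(tall, wide)
print(factored_params, savings, rebuilt)
```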
Low-resource language
A language for which little digital training data exists — few transcribed recordings, books, or web pages — compared with high-resource languages like English or Mandarin that have billions of words online. Models trained mostly on the abundant languages do worst here, simply because they have not seen enough examples to learn the language's sounds and spellings. Like a cook who has made thousands of Italian dishes but tasted Ethiopian food only once — they will be shaky at Ethiopian cooking until they practice it specifically. Example: Whisper transcribes English almost perfectly but makes far more errors on a language like Welsh or Amharic, which is exactly where a few hours of targeted fine-tuning data helps most.
LQR
Linear-Quadratic Regulator — optimal linear feedback for quadratic cost
LSTM
Long Short-Term Memory — a type of recurrent neural network (RNN) cell designed to remember things over long sequences without the information fading away. A plain RNN is like whispering a message down a long line of people — by the end, the message is garbled. An LSTM fixes this with three gates: a forget gate that decides what old information to throw out, an input gate that decides what new information to store, and an output gate that decides what to actually hand to the next step. Together they maintain a "cell state" — a conveyor belt of memory that can carry important facts across hundreds of time steps with minimal loss. LSTMs were the go-to architecture for sequences (language, speech, time series) before transformers took over, and they remain the classic example of gated memory in neural networks.
MagViT-v2
One of the strongest published recipes for discrete video tokenization: turning a clip into a grid of integer tokens that an autoregressive or masked transformer can generate the same way a language model generates text. It builds on the VQ-VAE idea of a discrete latent but replaces the learned codebook with LFQ (lookup-free quantization), which sidesteps codebook collapse and scales to a very large vocabulary cheaply. A single MagViT-v2 tokenizer handles both still images and video (it uses the causal trick of encoding the first frame on its own, so a still image is simply a one-frame clip), and its reconstructions are sharp enough that token-based generators can finally rival diffusion models on quality. Its headline claim is that a good enough tokenizer is what makes language-model-style video generation competitive.
Manifold
The thin, curved surface inside a much larger space where real data actually lives. A 32×32 color image is a point in a space of 3,072 numbers, but almost every random point in that space looks like static — only a vanishingly small, smoothly connected sliver of it looks like a real photo, and that sliver is the manifold. A useful analogy: a sheet of paper is a 2D surface, but if you crumple it and drop it into a room it traces out a thin curved shape floating in 3D space; the paper is the manifold and the room is the full space. Learning to generate images is largely learning the shape of this surface so you only ever land on it.
Manipulability
Scalar measure of how "easy" motion is from a given configuration (e.g. sqrt(det(JJᵀ)))
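As a worked instance of sqrt(det(JJᵀ)), the sketch below assumes a planar two-link arm with unit link lengths, for which the Jacobian has a simple closed form. The arm model and function names are assumptions for illustration.

```python
import math

def planar_2link_jacobian(q1, q2, l1=1.0, l2=1.0):
    # Maps joint velocities (dq1, dq2) to end-effector velocity (dx, dy).
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]

def manipulability(J):
    # sqrt(det(J J^T)) written out for a 2-row Jacobian.
    a = J[0][0] ** 2 + J[0][1] ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2
    return math.sqrt(max(0.0, a * d - b * b))   # clamp tiny negative round-off

bent = manipulability(planar_2link_jacobian(0.3, math.pi / 2))   # elbow at 90 degrees
stretched = manipulability(planar_2link_jacobian(0.3, 0.0))      # arm fully extended
print(bent, stretched)
```

For this arm the measure reduces to |l1 · l2 · sin(q2)|, so it peaks with the elbow at a right angle and drops to zero at the fully stretched singularity.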
Mantissa
The part of a floating-point number that holds the precision digits — the significant figures sitting in front of the scale factor. In 3.5 × 10¹², the 3.5 is the mantissa (also called the significand). More mantissa bits give finer resolution between nearby values; fewer mantissa bits leave larger gaps between representable numbers. FP8's E4M3 format means 4 exponent bits + 3 mantissa bits, so it can only distinguish about 8 distinct values between each consecutive power of two — coarse, but small enough to fit twice as many numbers in the same memory as bfloat16.
Marlin
A specialized GPU kernel for mixed-precision matmul (4-bit weights multiplied by 16-bit activations) built to stay fast even on the skinny, small-batch shapes of decode. It unpacks the 4-bit weights on the fly while keeping the Tensor Cores busy, so a quantized model runs nearly as fast as the math allows. (The name is short for Mixed Auto-Regressive Linear kernel and doubles as a nod to the fast-swimming marlin fish.)
MaskGIT
A way to generate image tokens in parallel instead of one at a time. Starting from a grid where almost every token is hidden ("masked"), a transformer predicts them all at once, keeps only the predictions it is most confident about, and repeats over a handful of rounds until the grid is full. The analogy is filling in a crossword: lock in the answers you are sure of first, and the rest get easier. This makes it much faster than raster-order autoregressive generation, which must fill the grid one token at a time.
matmul
Matrix multiplication — the dominant compute operation in neural networks; written A @ B in PyTorch.
MDP
Markov Decision Process — the tuple (S, A, P, R, γ)
Mechanistic interpretability
The line of research that tries to reverse-engineer what individual pieces of a neural network actually do — which neurons or attention heads detect what, where a fact is stored, why a particular output came out. Like opening up a watch to see which gears turn the hands, instead of only timing how fast the watch runs. Main tools: linear probes, sparse autoencoders, activation patching, and circuit analysis.
Media container
The file format that wraps compressed video (plus audio, subtitles, and metadata) into one file — .mp4, .mov, .webm, and .mkv are containers. The container is the box; the video codec is how the picture inside was compressed, and the two are independent — the same H.264 video can sit in an .mp4 or a .mov. Analogy: a container is like a shipping box labeled on the outside, while the codec is the packing method used for the fragile thing inside. Example: a .webm file is a container that usually holds AV1- or VP9-compressed video, whereas .mp4 most often holds H.264.
Medical-image segmentation
The task of labelling every pixel in a medical scan (MRI, CT, X-ray, microscopy) as belonging to a particular structure — outlining a tumour, an organ, or a cell boundary — rather than just classifying the whole image with one label. The output is a per-pixel mask, like a precise coloring-book page where each region is filled with its own color. It demands very fine spatial accuracy, since a few pixels can be the difference between the edge of a tumour and healthy tissue, which is exactly why the U-Net's skip connections — carrying fine detail straight across the network — were originally designed for it. Think of tracing the exact outline of each country on a map instead of just saying "this is a map of Europe."
Mel bands
The output channels of a mel filterbank — the handful of frequency buckets (commonly 80) that a mel spectrogram keeps after squeezing the STFT's hundreds of fine frequency rows onto the perceptual mel scale. Low-pitch bands are narrow and closely spaced while high-pitch bands are wide, mirroring how human hearing tells low notes apart easily but lumps high ones together. Like sorting a piano's 88 keys into a few labeled bins where each bass key gets its own bin but many treble keys share one. Example: an 80-band mel spectrogram describes each moment of sound with 80 numbers instead of 500+ raw frequency values, small enough for a CNN or transformer to process like an image.
Mel spectrogram
A picture of sound: a 2D map with time along one axis and pitch along the other, where brightness shows how much of each pitch is present at each moment. It is built by sliding a short window across the audio waveform and measuring its frequencies (a Short-Time Fourier Transform), then squashing the frequency axis onto the mel scale — a perceptual spacing that, like human hearing, gives lots of resolution to low pitches and lumps high ones together (the jump from 100 to 200 Hz sounds bigger than the jump from 5,000 to 5,100 Hz). The payoff is that audio becomes an image with, say, 80 frequency rows, so the same CNN or transformer machinery built for vision can process it. Like turning a song into sheet music — a flat diagram you can read at a glance instead of a wiggling waveform.
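The perceptual squashing can be checked with a few lines of arithmetic. This is a minimal sketch that assumes one widely used mel formula (the 2595 and 700 constants are that convention, not something the entry above specifies); `hz_to_mel` is an illustrative name.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels using one common convention (assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The same 100 Hz step, measured low and high on the frequency axis.
low_jump = hz_to_mel(200) - hz_to_mel(100)
high_jump = hz_to_mel(5100) - hz_to_mel(5000)
print(f"100 -> 200 Hz spans {low_jump:.1f} mel; 5000 -> 5100 Hz spans {high_jump:.1f} mel")
```

The low step covers several times more of the mel axis than the high step, which is the "lots of resolution for low pitches" behaviour described above.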
Meta-learning
"Learning to learn" — training a model to adapt quickly to new tasks with few examples. Many meta-learning algorithms, such as MAML, rely on higher-order gradients to optimize across tasks.
Memorization
When an LLM reproduces a chunk of its training data verbatim instead of generalizing from it — give it the right opening prompt and out comes the original passage word for word. Like a student who recites a textbook sentence rather than explaining the idea; useful for trivia, dangerous for copyright, PII, and security. Deduplication at training time and prompt filtering at serving time are the main mitigations.
Memory leak
An unintended increase in memory usage over time, often caused in PyTorch by holding onto references to the loss tensor or other parts of the dynamic computation graph across training iterations (for example, accumulating loss into a running total instead of loss.item()).
Memory mapping
Accessing a file on disk as if it were an in-memory array, reading slices on demand without loading the whole file into RAM (e.g. numpy.memmap).
Memory snapshot
A recording of how much GPU memory is allocated at one moment; comparing snapshots taken across training steps reveals a steadily growing memory leak.
Megatron
NVIDIA's approach to tensor parallelism that splits attention and MLP layers column-wise and row-wise across GPUs with carefully placed AllReduce collectives, allowing efficient intra-layer parallelism.
MFU
Model FLOPs Utilization — the fraction of a GPU's peak arithmetic speed a training run actually uses (e.g. 40% MFU; well-tuned large runs commonly land somewhere between a third and a half). Like a delivery truck's fill rate, it shows how much of the hardware you are paying for is doing useful work instead of waiting on memory or the network.
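A rough sketch of the arithmetic, assuming the common rule of thumb of about 6 FLOPs per parameter per training token for a dense transformer (an assumption, not something stated above). The function name and every number in the example are made up for illustration.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Estimate MFU as achieved training FLOP/s divided by the hardware's peak FLOP/s."""
    achieved = 6 * n_params * tokens_per_second   # ~6 FLOPs per parameter per token (assumed)
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak

# Hypothetical run: a 7B-parameter model on 8 GPUs, each with an assumed peak of 989e12 FLOP/s.
mfu = model_flops_utilization(
    n_params=7e9, tokens_per_second=60_000, n_gpus=8, peak_flops_per_gpu=989e12)
print(f"MFU = {mfu:.1%}")
```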
Micrograd
A tiny, educational autograd engine implemented in basic Python by Andrej Karpathy to illustrate how reverse-mode differentiation works.
MinHash
A hashing technique for estimating how similar two documents are, used to find and remove near-duplicate text at corpus scale (see deduplication).
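A minimal sketch of the idea, assuming word 3-grams as the unit of comparison and seeded BLAKE2 hashes standing in for independent hash functions; production deduplication pipelines add banding (LSH) on top, which is omitted here. All function names are illustrative.

```python
import hashlib

def shingles(text, n=3):
    """Split text into the set of overlapping word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def _hash(seed, item):
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=128):
    """Keep, for each seeded hash function, the smallest hash over all items."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity of the two sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river"
doc_b = "the quick brown fox jumps over the lazy dog near the bridge"
sim = estimated_jaccard(minhash_signature(shingles(doc_a)),
                        minhash_signature(shingles(doc_b)))
print(f"estimated similarity: {sim:.2f}")
```

The signatures have a fixed length no matter how long the documents are, which is what makes comparison cheap at corpus scale.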
MLP
Multi-Layer Perceptron — the simplest kind of neural network (also called a feedforward network): a stack of fully-connected layers with a non-linear activation in between. A fully-connected layer means every input number connects to every output number, each connection carrying its own weight — like a voting panel where every voter influences every result. A non-linear activation (such as ReLU or SwiGLU) is a simple bend applied after each layer; you need one because stacking plain linear layers just collapses back into a single straight line, so the bend is what lets the network learn curved, complicated patterns. In a transformer, the model is a tall stack of identical blocks, and each block has two sublayers in order: an attention sublayer (tokens look at each other) then an MLP sublayer (each token is processed on its own). So going up the stack it really does look like attention → MLP → attention → MLP → … — attention passes notes around the room, the MLP is each person quietly thinking about what they just read.
Attention comes before the MLP because a token should gather context from the others first and only then think for itself — you read the room, then form your own thought. The two are built differently: attention has each token build a weighted blend of every token's values (that blend is how they "look at each other"), while the MLP runs a plain feed-forward on each token's vector alone, with the same weights at every position (that is "on its own"). The bend inside that MLP has grown more capable over time: plain ReLU just clips negatives to 0; a GLU instead multiplies the content by a learned 0-to-1 "gate" so the network can dial parts of it down; and SwiGLU is a GLU whose gate uses the smooth Swish curve — the modern default.
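The three "bends" can be compared on single numbers. This is a simplified sketch: in a real layer `content` and `gate_preact` would each be vectors produced by their own learned linear projection, which is left out here, and the function names are illustrative.

```python
import math

def relu(x):
    """Clip negatives to 0 and pass positives through unchanged."""
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """Smooth curve x * sigmoid(x), also known as SiLU."""
    return x * sigmoid(x)

def glu(content, gate_preact):
    """Scale the content by a 0-to-1 gate computed from a second projection."""
    return content * sigmoid(gate_preact)

def swiglu(content, gate_preact):
    """Same gating structure, with the gate passed through Swish instead of sigmoid."""
    return content * swish(gate_preact)

print(relu(-2.0), glu(4.0, 0.0), swiglu(4.0, 2.0))
```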
MMDiT
Multi-Modal Diffusion Transformer — the DiT variant used in SD3 and Flux where text tokens and image tokens flow through the same attention layers ("joint attention") instead of having the image attend to the text through a separate cross-attention step. Each modality (text vs image) keeps its own normalization and MLP weights, but they see and influence each other inside one shared attention operation, which helps the model get compositional prompts ("a red cube on a blue sphere") right. Like seating writers and illustrators at one table where everyone hears the whole conversation, instead of passing notes between two separate rooms.
MMBench
A multiple-choice benchmark for VLMs that probes many separate abilities — object recognition, spatial relationships, attribute comparison, and more — with each question offering a few labeled answer choices. To stop a model from scoring well by luck or by always favoring one letter, it asks the same question several times with the choices shuffled and counts it correct only if the model picks the right answer every time (a trick its authors call CircularEval). Like re-asking a quiz question with the options reordered to be sure the student actually knows the answer rather than having memorized "it's always C." It is one of the standard general-capability scores reported for any new VLM.
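The all-or-nothing scoring rule can be sketched as follows, assuming the choices are rotated one position at a time. `ask_model` is a hypothetical callable that returns the index of the option the model picked; it is not part of any real benchmark harness.

```python
def circular_eval(ask_model, question, choices, correct_index):
    """Return True only if the model picks the right option under every rotation."""
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        # After rotating by `shift`, the correct option sits at a new position.
        expected = (correct_index - shift) % n
        if ask_model(question, rotated) != expected:
            return False
    return True

choices = ["Berlin", "Paris", "Rome", "Madrid"]
knows_answer = lambda q, opts: opts.index("Paris")   # toy model that really knows
always_c = lambda q, opts: 2                          # toy model that always says "C"
print(circular_eval(knows_answer, "Capital of France?", choices, 1))
print(circular_eval(always_c, "Capital of France?", choices, 1))
```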
MMLU
Massive Multitask Language Understanding — a 57-subject multiple-choice benchmark (history, law, medicine, math, and more) that became the standard quick test of how much general knowledge a model has, like a giant trivia exam spanning many school subjects at once.
MNIST
A classic dataset of 70,000 small 28×28 grayscale images of handwritten digits 0–9. It is the most common "hello world" for image models — tiny, clean, and quick to train on — so a brand-new idea is almost always tried on MNIST first, before anyone risks it on harder, fuller-color data like CIFAR-10.
MMMU
Massive Multi-discipline Multimodal Understanding — a hard benchmark of college-exam questions across many fields (medicine, engineering, art, business), where each question mixes text with an image such as a diagram, chart, or chemical structure. It is built to require real subject reasoning rather than just reading the picture, which is why even strong VLMs still score far below human experts on it. Like a university final that hands you a figure and expects you to apply the course material to it, not merely describe what you see. It is the most-cited measure of frontier multimodal reasoning, and the harder MMMU-Pro variant adds more answer choices and trickier distractors to fight contamination.
MoCoGAN
MoCoGAN (Motion and Content GAN) is an early video GAN whose key idea is to split a video's latent code into two parts: a single content vector that stays fixed for the whole clip (the identity of the person or object) and a sequence of motion vectors that change frame to frame (how it moves). Because content is held fixed while motion varies, the same face can be made to perform different expressions, or one motion can be replayed on different faces. This separation — disentangling what from how — keeps a subject from morphing as it moves; the generator reads a new motion vector each frame (produced by a small recurrent network) on top of the one shared content vector. The same content/motion split keeps reappearing inside later diffusion-based systems, which is why the 2017 model is still worth studying.
Modality
One type or format of data — text, images, audio, video, a depth map, and so on. Each modality has its own structure (text is a sequence of tokens, an image is a grid of pixels, audio is a waveform), so a model usually needs a dedicated encoder for each one before their information can be combined. A model that handles more than one is called multimodal. Think of modalities as the different human senses — sight, hearing, touch — each carrying information about the same world but in a different form, which the brain then has to fuse into one understanding. Cross-attention is one common way to let two modalities exchange information.
Modality balancing
In a single model trained on several modalities at once, the practice of adjusting how much each one contributes to the loss so that no single modality drowns out the others. The problem arises because modalities are rarely equal in size: if 99% of your tokens are text and only 1% are image, the text next-token-prediction loss dominates the gradient and the model barely learns to handle images. It is like a study schedule where, left alone, you would spend every hour on your strongest subject — you have to deliberately reweight so the weaker subjects get their share of attention. Concretely, you either oversample the under-represented modality's data or multiply its loss term by a larger coefficient, tuning until each modality's loss falls at a comparable rate.
Modality gap
The repeated empirical finding that, even in a model like CLIP trained to put matching items in one shared space, the embeddings of one modality (all the images) sit in a different region from those of another (all the captions) — two separate clusters rather than one blended cloud. The pairs are still correctly aligned (a photo is nearer its own caption than to a wrong one), but a constant offset separates the two modalities, a side effect of how contrastive training and the random initial weights shape the geometry. Analogy: two choirs singing the same song in perfect harmony but standing on opposite sides of the stage — in tune with each other, yet never in the same spot. You can see it by encoding a batch of images and captions, reducing them with PCA, and watching the two colors land in separate blobs; it matters because it lowers the cosine-similarity scores of true pairs and can be partly fixed by shifting one modality's vectors toward the other.
Mode collapse
A GAN failure where the generator discovers a few outputs that reliably fool the discriminator and just keeps making those, ignoring the variety in the real data — like a comedian who finds one joke that always lands and tells only that joke. Each sample may look fine on its own, but the model has stopped covering most of the data. It is the defining instability of GAN training; its discrete-latent cousin is codebook collapse.
MoE
Mixture-of-Experts — instead of one big MLP per layer, the model holds many parallel "expert" MLPs and a small router sends each token to only the top few. Like a big company where every question goes to just the two or three relevant specialists rather than the whole staff, the model can hold a huge number of total parameters while doing only a fixed, small amount of compute per token. The serving catch: which experts get used shifts with the workload, so keeping them evenly busy across GPUs (expert parallelism) is the hard part.
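A toy sketch of top-k routing for a single token, assuming the router's scores are renormalized with a softmax over just the selected experts (one common choice; real systems differ in this detail). Experts are plain Python functions standing in for MLPs, and the names are illustrative.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax their scores into mixing weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and blend their outputs by the routing weights."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

experts = [lambda x, s=s: x * s for s in (1.0, 2.0, 3.0, 4.0)]
print(moe_layer(1.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0]))
```

Only two of the four experts are ever called for this token, which is why compute per token stays fixed while total parameters grow.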
Momentum
A technique that accumulates a moving average of past gradients to dampen oscillations and accelerate gradient descent in consistent directions
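One common form of the update, sketched on a single scalar parameter. The convention used here adds the raw gradient to a decayed velocity; other write-ups scale the gradient by (1 - beta) to make it a true moving average. The function name and hyperparameters are illustrative.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.1, beta=0.9):
    """One gradient-descent step with momentum on a scalar parameter."""
    velocity = beta * velocity + grad    # decayed running sum of past gradients
    param = param - lr * velocity        # step along the accumulated direction
    return param, velocity

# Minimize f(x) = x**2, whose gradient is 2 * x.
x, v = 5.0, 0.0
for _ in range(300):
    x, v = sgd_momentum_step(x, 2 * x, v)
print(x)
```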
Monosemantic
A feature inside a neural network that fires for exactly one concept — for example, a direction in activation space that lights up only for "Golden Gate Bridge," or only for "negation in a clause." The opposite is polysemantic: one neuron that activates for several unrelated concepts at once. Like a single word that means just one thing versus a homonym that means several. Recovering monosemantic features is the main goal of SAE-based interpretability.
Motion module
The plug-in component at the heart of AnimateDiff: a stack of time-aware (temporal) layers — mostly attention along the time axis — inserted between the blocks of a frozen image U-Net. The frozen image model still produces each frame's appearance; the motion module's only job is to look across frames and nudge them so the sequence moves smoothly instead of flickering independently. Think of it as a "motion adapter" you clip onto a still-image model — trained once on video, then reused unchanged across many image checkpoints.
Motion score
A single number handed to a video model that says how much motion a clip should contain — low for a near-still "animated photo", high for vigorous movement. During training it is measured from each real clip (commonly from the average optical-flow magnitude between frames — how far pixels travel), so the model learns to associate the number with an amount of movement; at inference you set it by hand to dial motion up or down. Stable Video Diffusion calls its version the motion bucket id, sorting clips into discrete buckets of increasing motion rather than using a continuous value. It is the simplest control surface for video: one knob that separates how much it moves from what is in it.
MoveIt
ROS 2 manipulation-planning framework
Moving MNIST
A simple synthetic video dataset built by taking handwritten digits from MNIST and bouncing two of them around inside a black 64×64 frame, where they drift in straight lines and ricochet off the edges. The motion is perfectly predictable (constant velocity plus bounces), but the digits overlap and pass in front of each other, which is just hard enough to test a future frame prediction model without the cost and decoding pain of real video. It became the standard first benchmark for video-prediction models such as the ConvLSTM.
MPC
Model Predictive Control — re-solved finite-horizon optimization at each step
MPS
Metal Performance Shaders — the GPU backend for Apple Silicon
MQA
Multi-Query Attention — all query heads share a single key/value head; the most aggressive KV-cache saver, at some quality cost
MSE (mean squared error)
The most basic way to score how wrong a prediction is: at each point take the difference between the predicted and true value, square it (so overshoots and undershoots both count as positive, and big misses are punished extra), then average over all points. For images it compares pixel by pixel, so a guess that is a little off everywhere still scores well — which is exactly why training on MSE alone tends to produce blurry results: when the model is unsure, the safest low-MSE answer is to predict the average of all the plausible pixels, and an average of sharp options looks like a smudge. This is the failure a perceptual loss is designed to avoid.
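The blur argument can be seen with one pixel. In this small sketch the pixel is assumed to be equally likely black (0.0) or white (1.0), and three candidate guesses are scored against both possibilities.

```python
def mse(pred, targets):
    """Mean squared error of one predicted value against several plausible targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

plausible = [0.0, 1.0]   # the pixel could be black or white
scores = {guess: mse(guess, plausible) for guess in (0.0, 0.5, 1.0)}
print(scores)
```

The grey guess of 0.5 gets the lowest error even though it matches neither sharp option, which is the smudge described above.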
MT-Bench
A benchmark that scores a chat model's answers to a set of multi-turn questions, often using a strong LLM as the judge; a quick proxy for how helpful an assistant feels.
MuJoCo
Open-source physics engine; the de facto manipulation/locomotion simulator
Multi-head attention
Running several attention operations (heads) in parallel, each with its own learned projections of queries, keys, and values, then concatenating their results. Like having several readers skim the same sentence for different things — one tracks the grammar, another tracks who-did-what — and then pooling what each one noticed.
Multi-LoRA
Serving many LoRA adapters from one shared copy of the base model at the same time. Keeping LoRA's sticky-note picture: one cookbook (the base model) plus a drawer full of sticky-note sets (the adapters), one set per customer. The kitchen keeps a single cookbook and just grabs the right set of notes for each order, instead of buying a whole new cookbook for every customer — so one GPU can serve hundreds of fine-tunes at once.
Multi-tenant
One shared system serving many independent users or customers ("tenants") at the same time, who must not see or slow down one another — like an apartment building where many families live under one roof but each behind their own locked door. A multi-tenant inference service mixes everyone's requests onto the same GPUs, which is why fair scheduling, per-user rate limits, and tricks like shared-prefix cache routing matter so much.
Multi-turn conversation
A chat where the user and the AI take turns talking back and forth, building on what was said earlier, like a natural human conversation. For example, if you ask "What's a good movie?" and then ask "Who stars in it?", the AI remembers the movie from the first turn. Instead of starting from scratch every time, the system keeps the past conversation in its KV cache — like keeping an open notebook on your desk instead of erasing the whiteboard after every question.
NaN
"Not a Number" — a floating-point value representing an undefined or unrepresentable result (e.g., 0/0 or inf - inf). In PyTorch, NaNs often appear when gradients explode or when taking the logarithm of zero/negative numbers.
Native multimodal
A model trained from scratch on all modalities at once over a single shared vocabulary, instead of bolting a vision encoder onto a finished language model. Every modality is turned into tokens — text tokens, image tokens from a VQ-VAE, audio tokens from a neural codec — that all live in one alphabet, and one transformer reads and writes any mix of them with a single next-token objective. This is the early-fusion extreme, used by models like Chameleon and GPT-4o. Analogy: rather than hiring separate translators for each language and patching their notes together, you raise one person bilingual from birth, so switching between "languages" (modalities) is effortless and mid-thought. The payoff is true any-to-any flexibility; the cost is far more data and compute, since nothing is reused from a pretrained backbone.
Negative prompt
A second text prompt describing what you do not want in the image (e.g. "blurry, extra fingers, watermark"). It works through classifier-free guidance: instead of pushing away from a blank unconditional prediction, the model pushes away from the negative prompt's prediction and toward your real prompt — so naming a flaw steers the result away from it. Like telling an artist "paint a beach, and whatever you do, no people."
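The steering step can be sketched as the usual classifier-free guidance combination, with the negative prompt's prediction taking the place of the unconditional one. Short lists of floats stand in for the model's noise predictions, and the guidance scale of 7.5 is only an illustrative default.

```python
def guided_prediction(pred_prompt, pred_negative, guidance_scale=7.5):
    """Start at the negative prompt's prediction and push past the real prompt's."""
    return [n + guidance_scale * (p - n)
            for p, n in zip(pred_prompt, pred_negative)]

pred_prompt = [1.0, 2.0]     # model output given the real prompt
pred_negative = [0.0, 2.0]   # model output given the negative prompt
print(guided_prediction(pred_prompt, pred_negative, guidance_scale=2.0))
```

Where the two predictions agree nothing changes; where they differ, the result moves away from the negative prompt in proportion to the scale.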
Network in Network
A design idea where a tiny neural network is tucked inside a single layer of a bigger one, so that layer can do more thinking than a plain filter could. A normal convolution layer slides a simple filter that just takes a weighted sum of each patch; a network-in-network slides a small multi-step mini-network over each patch instead, letting it recognize more complicated local patterns on the spot. Picture a factory line where, instead of one worker stamping each part, every station hides a little expert team that inspects and shapes the part before passing it on. The idea (from the 2013 Network In Network paper) inspired the Inception network's building blocks — which is why Inception was nicknamed after the movie about a dream inside a dream.
Neural codec
A neural network that learns to compress a signal — audio, an image, or video — into a compact code and then rebuild it, a learned cousin of hand-designed formats like MP3 or JPEG. ("Codec" = coder + decoder.) The encoder squeezes the signal down to a small set of numbers or tokens and the decoder reconstructs it; because the whole thing is trained on real data instead of hand-tuned, it can often pack more quality into fewer bits. A VQ-VAE is one example used for images; for audio, EnCodec and SoundStream are the best-known examples, squeezing a waveform into a short stream of discrete tokens at a chosen bitrate (bits per second) — so a lower bitrate means fewer tokens and rougher sound, a higher one means more tokens and cleaner audio.
Numerical issues
Problems arising from the finite precision of floating-point numbers, such as underflow, overflow, or loss of precision, which can lead to unstable training or NaN values.
Nav2
ROS 2 navigation stack
NCCL
NVIDIA Collective Communications Library — does AllReduce etc. on NVIDIA GPUs
nDCG
Normalized Discounted Cumulative Gain — a ranking-quality score from 0 to 1 that rewards putting the most relevant results near the top of the list; the standard way to check whether a reranker actually improved the ordering.
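A minimal sketch of the computation, assuming the linear-gain variant (each result contributes its relevance divided by log2 of its rank plus one); some implementations use 2^rel - 1 as the gain instead. The input is the list of relevance labels in the order the system ranked them.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: later positions count for less."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (sorted) ordering, giving a 0-to-1 score."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

before_rerank = [1, 3, 2, 0]
after_rerank = [3, 1, 2, 0]
print(ndcg(before_rerank), ndcg(after_rerank))
```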
Needle-in-a-haystack
A long-context test that hides one fact (the "needle") inside a long stretch of irrelevant text (the "haystack") and checks whether the model can find it
Next-token prediction
The training objective of an LLM: given the tokens so far, predict the next one, scored with cross-entropy loss.
N-gram
A run of n tokens (or words) sitting next to each other. Here a gram just means one item — one word or token (the word comes from Greek gramma, "something written") — and the n says how many of them in a row, so "the cat sat" is a 3-gram (three words in a row) and "cat" on its own is a 1-gram. By matching the most recent few tokens against earlier text, you can often guess what comes next from what followed the same phrase before, which is exactly how prompt-lookup speculative decoding builds its drafts for free.
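A small sketch of both ideas: extracting n-grams, and using the most recent n-gram to look up a draft continuation from earlier text. The lookup is a simplified stand-in for prompt-lookup drafting, and the function names and defaults are illustrative.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prompt_lookup_draft(tokens, n=2, max_draft=3):
    """Find the latest earlier occurrence of the last n tokens and copy what followed it."""
    key = tuple(tokens[-n:])
    # Scan backwards through earlier positions, skipping the final n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[start:start + n]) == key:
            return tokens[start + n:start + n + max_draft]
    return []

tokens = "the cat sat on the mat and the cat".split()
print(ngrams("the cat sat".split(), 3))
print(prompt_lookup_draft(tokens))
```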
nn.Module
PyTorch's base class for all neural network components; acts as a registry that automatically tracks sub-modules and parameters assigned as attributes, plus buffers added with register_buffer
Node (distributed)
One physical machine (server) in a distributed job, usually holding several GPUs; multi-node training spreads work across several of them over a network.
Noise schedule
The recipe a diffusion model follows for how much noise to add at each step of its forward (noising) process — and therefore how much the denoiser must remove at each reverse step. A linear schedule raises the noise level by equal amounts every step; a cosine schedule ramps up gently at the start and end, keeping recognizable image structure alive for more of the process, which usually trains better. Think of it as a dimmer switch for how quickly a picture fades to static: turn it down too fast (linear) and most steps see only static, leaving little to learn from. The choice mainly affects training quality and how many sampling steps you need, not the model architecture.
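The two schedules can be compared by tracking how much of the original signal survives at each step (often written as alpha-bar). This sketch assumes 1000 steps, a linear beta range of 1e-4 to 0.02, and a cosine offset of 0.008; these are commonly quoted settings, not values given in the entry above.

```python
import math

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Signal fraction remaining after each step when the per-step noise rises linearly."""
    alpha_bar, out = 1.0, []
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out

def cosine_alpha_bar(T, s=0.008):
    """Signal fraction remaining after each step under a squared-cosine curve."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t + 1) / f(0) for t in range(T)]

lin, cos_ = linear_alpha_bar(1000), cosine_alpha_bar(1000)
print(f"signal left halfway: linear {lin[499]:.3f}, cosine {cos_[499]:.3f}")
```

Halfway through, the linear schedule has already destroyed most of the signal while the cosine schedule still keeps roughly half, matching the dimmer-switch picture.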
non_blocking
The non_blocking=True flag on .to() / .cuda() that lets a host→device copy run asynchronously from pinned memory
Normalization
Rescaling a layer's outputs so they keep a consistent size — typically zero mean and unit variance (LayerNorm) or unit root-mean-square (RMSNorm). Like adjusting every photo to the same brightness before comparing them, it stops numbers from ballooning or vanishing as they flow through a deep network, which is what keeps training stable.
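A bare-bones sketch of the two rescalings on a plain list of numbers. The learned scale and shift parameters that real layers apply afterwards are left out, and the small `eps` guards against division by zero.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root-mean-square only; the mean is left where it is."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(x))
print(rms_norm(x))
```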
Normalizing flow
A generative model that starts from simple random noise (usually a plain Gaussian "bell curve") and pushes it through a chain of reversible steps to reshape it into realistic data — like kneading a smooth ball of dough into a detailed shape, where you can always un-knead it back. Why can you always un-knead it? Because every step is deliberately built to be undoable: it only ever stretches, shifts, or folds the dough in a way that has an exact opposite, and it never merges two blobs into one or throws any dough away. For example, if a step's rule is "double this number and add 3," its reverse is simply "subtract 3, then halve" — feed the output back through and you recover the original number exactly, with nothing lost. (An ordinary neural network is not like this: it mashes information together — like flattening the dough — so there is no way to run it backwards.) Because every step can be run backwards exactly, a flow can also report the precise probability of any data point, which most generative models cannot do. The price for that exactness is that each step must stay reversible, which heavily constrains the architecture; examples include Real NVP and Glow.
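The "double and add 3" step from the entry above makes a complete one-step flow. This sketch shows the exact inverse and the exact log-probability, assuming a standard Gaussian as the starting noise; the correction term is the log of how much the step stretches lengths.

```python
import math

def forward(z):
    """Noise to data: double this number and add 3."""
    return 2.0 * z + 3.0

def inverse(x):
    """Data to noise: subtract 3, then halve."""
    return (x - 3.0) / 2.0

def log_prob(x):
    """Exact log-density of a data point via the change-of-variables rule."""
    z = inverse(x)
    log_base = -0.5 * z * z - 0.5 * math.log(2 * math.pi)   # standard Gaussian at z
    log_stretch = math.log(2.0)   # the forward step stretches lengths by a factor of 2
    return log_base - log_stretch

print(inverse(forward(0.7)), log_prob(3.0))
```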
NVAE
Short for Nouveau VAE — a hierarchical VAE from NVIDIA (2020) that stacks many layers of latent variables through a deep network built from depthwise separable convolutions and residual connections, reaching then state-of-the-art image generation quality. Like a skyscraper where each floor refines the blueprint handed down from above — the top floors sketch the overall shape and the lower floors fill in the fine details. The name "Nouveau" is French for "new," positioning it as a modern reimagining of the classic VAE.
NVLink
NVIDIA's GPU-GPU interconnect; much faster than PCIe
NVSwitch
NVLink switch chip; full-bandwidth all-to-all within a node
Observability
The practice of making a running system's inner state visible from the outside — through metrics, logs, and traces — so you can ask new questions about why it is misbehaving without adding new code. Like the dashboard and warning lights in a car: you can tell what is wrong while still driving, instead of pulling the engine apart. For a serving stack it is the difference between knowing "p99 latency tripled at 9 a.m." and finding out only when users complain.
OCR (Optical Character Recognition)
Reading the text inside an image — turning pixels of letters into actual characters a computer can use — for example pulling the line items off a photographed receipt or the words out of a scanned page. It is the skill that separates a VLM that "sees a document" from one that can answer "what is the total?", and it is hard precisely because the answer often hides in small print that survives only if the image is fed in at high enough resolution (one reason AnyRes tiling helps). Analogy: the difference between glancing at a street sign and actually reading the words on it. Example: given a photo of a price tag, an OCR-capable model returns the string "$19.99" rather than just "a label"; benchmarks like DocVQA and OCRBench score exactly this ability.
ODE (Ordinary Differential Equation)
A mathematical equation describing how a system's current state determines its rate of change (its slope). Rather than giving a fixed value as an answer, solving an ODE yields a full continuous function (a path). In diffusion models, the "probability flow ODE" is a deterministic route that carries pure random noise to a structured image. If the current state is a car's position, the ODE defines its exact velocity at that spot.
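Solving an ODE numerically means following the velocity in small steps. Below is a minimal sketch using Euler's method, the simplest such solver, on the toy equation dx/dt = -x, whose exact solution is known; `euler_solve` is an illustrative name.

```python
import math

def euler_solve(velocity, x0, t0, t1, steps):
    """Follow the velocity field from t0 to t1 in equal small steps."""
    x, t = x0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        x += velocity(x, t) * dt   # move a short distance along the current slope
        t += dt
    return x

approx = euler_solve(lambda x, t: -x, x0=1.0, t0=0.0, t1=1.0, steps=1000)
print(approx, math.exp(-1.0))
```

More steps give a closer match to the true path at higher cost, the same trade-off a diffusion sampler faces when choosing its step count.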
Off-policy
The data comes from a different policy than the one being optimized
Offset
The starting index into the underlying storage where a tensor's data begins (.storage_offset())
OMPL
Open Motion Planning Library — sampling-based planners
Online softmax
An incremental method for computing softmax that maintains a running maximum and a running sum of exponentials, rescaling the sum whenever the maximum changes. This enables single-pass computation over tiled inputs, with no separate first pass to find the maximum and the normalizing sum.
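A minimal sketch of the running statistics, processing the input one tile at a time. Whenever a tile raises the maximum, the sum accumulated so far is rescaled to the new maximum; the function name is illustrative.

```python
import math

def online_softmax_stats(tiles):
    """Return (max, sum of exp(v - max)) over all values, visiting each tile once."""
    running_max, running_sum = float("-inf"), 0.0
    for tile in tiles:
        new_max = max(running_max, max(tile))
        # Rescale the old sum to the new maximum before adding this tile's terms.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + sum(math.exp(v - new_max) for v in tile))
        running_max = new_max
    return running_max, running_sum

tiles = [[1.0, 3.0], [2.0, 10.0], [0.5]]
m, s = online_softmax_stats(tiles)
probs = [math.exp(v - m) / s for tile in tiles for v in tile]
print(sum(probs))
```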
ONNX
Open Neural Network Exchange — a framework-neutral file format that stores a model as a graph of operations, so it can run outside the framework that trained it.
ONNX Runtime
A fast, cross-platform engine that runs models saved in the ONNX format, without needing the original framework like PyTorch.
On-policy
The data comes from the same policy being optimized (PPO, REINFORCE)
Open-ended
A task where many different answers can all be reasonable and there is no single right one to check against — writing a poem, summarizing an article, replying helpfully in a chat. The opposite of a closed-ended task like a multiple-choice question (one correct letter) or arithmetic (one correct number). Like grading a creative-writing assignment versus grading a true/false quiz: with the quiz you just count matches, but with the essay you need a human reader — or an LLM-as-judge — to weigh quality, which is why evaluating open-ended work is the hard part of LLM evals.
Open model
A model whose weights you can download and run yourself — Meta's Llama, Mistral, Qwen, DeepSeek — as opposed to a closed model like GPT-4 or Claude where the weights stay on the provider's servers and you can only call them through an API. Like the difference between buying a recipe book (you have the actual instructions, can modify them, can bake offline) and ordering at a restaurant (you only see the finished dish). Open models are essential for any white-box research that needs the model's internals: methods like GCG optimize against the model's own gradients, and interpretability tools like SAEs read its hidden activations — neither is possible through a closed API.
Optical flow
A per-pixel map of motion between two frames: for every pixel it gives an arrow saying which direction and how far that bit of the image moved. "Dense" optical flow computes an arrow for every pixel, versus "sparse" flow, which tracks only a few chosen points. It is the rawest form of the "motion signal" in video and shows up everywhere — data filtering, frame interpolation, and motion conditioning. Analogy: imagine laying a sheet of thin see-through paper (tracing paper, the kind you can see a drawing through to copy it) over two snapshots taken a moment apart, then drawing a tiny arrow from where each speck — a tiny spot of detail in the picture — sat in the first frame to where it ended up in the second. Example: between two frames of a car driving right, every pixel on the car gets a rightward arrow while the still background gets near-zero arrows. Common ways to compute it are the classical Farnebäck algorithm and the neural RAFT model.
Optimizer
An algorithm that updates model parameters using computed gradients; in PyTorch, a subclass of torch.optim.Optimizer that holds parameter groups and per-parameter state
Optimizer state
The extra per-parameter values an optimizer stores between steps — for example, Adam keeps two (the first- and second-moment estimates) — which adds to training memory.
Orbit
A camera move that circles around a subject while keeping it centered in frame — like walking in a ring around a statue, always looking inward at it. Because the viewpoint travels around the object, you see its different sides in turn, which makes the orbit a demanding test of whether a video model keeps an object's 3D shape consistent as the angle changes. It is one of the paths a model can follow under camera control.
Outcome reward model
A scorer that judges only a solution's final answer as right or wrong, ignoring the steps in between — simpler than a process reward model, which grades each step, but blind to where a wrong answer first went off track.
Outlines
An open-source Python library for constrained generation: you hand it a regular expression, a JSON schema, or a Pydantic model and it patches the LLM's decoder to mask out any next-token choices that would break the structure. Like putting guardrails on a road so the car physically cannot drive off the edge no matter how the driver steers, it makes the model's output structurally valid by construction rather than by hope.
Outpainting
Inpainting applied to the outside of an image: you place the original on a larger blank canvas, mark the new border area as the region to fill, and let the model extend the scene outward so it continues naturally past the original frame. Like a painter adding more landscape beyond the edges of an existing painting.
Overfitting
When a model learns its training examples too literally — memorizing their specific details and noise instead of the general pattern — so it does great on the training set but poorly on anything new. A personalization LoRA trained for too many steps overfits: asked for "the subject on the moon," it just spits back one of its training photos. The classic analogy is a student who memorizes the exact answers to the practice exam and then fails the real test because the questions are worded differently. You spot it when training accuracy keeps improving while held-out performance gets worse, and you fight it with more data, fewer training steps, or regularization. Its opposite — doing well on unseen inputs — is generalization.
Padding
Filling shorter sequences with a placeholder value so that every sample in a batch has the same length.
Pan
A camera move where the camera stays in one spot but rotates left or right — like standing still and turning only your head to sweep your gaze across a room. The viewpoint's position never changes, only the direction it faces, so near and far objects slide across the frame together. It is one of the basic moves a video model learns to follow under camera control.
Parameters
The numbers a model learns during training — its adjustable internal settings. Picture thousands of tiny knobs on a giant mixing board: training nudges each knob a little at a time until the whole board produces good output, and the final knob positions are what the model "knows." They come in two kinds — weights and biases — are stored as tensors, and are adjusted by the optimizer during training. (When people say a "7B model," they mean 7 billion of these knobs.) In PyTorch they are nn.Parameter objects, registered automatically when assigned to an nn.Module.
Partial derivative
How much a function changes when you nudge just one of its inputs and hold all the others still — the derivative taken one input at a time. If a recipe's tastiness depends on both salt and sugar, the partial derivative with respect to salt tells you the effect of adding a pinch more salt while keeping the sugar fixed. A gradient is simply the full list of these one-at-a-time slopes, one per parameter.
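The nudge-one-input idea can be sketched numerically with a central difference. The `tastiness` function and its coefficients are made up for illustration; only the nudge-and-hold-still procedure comes from the definition above.

```python
def partial_derivative(f, point, index, h=1e-6):
    """Central-difference estimate of how f changes when only point[index] moves."""
    up = list(point)
    down = list(point)
    up[index] += h      # nudge one input up...
    down[index] -= h    # ...and down, leaving every other input untouched
    return (f(up) - f(down)) / (2 * h)


def tastiness(inputs):
    # Hypothetical recipe score: salt helps up to a point, sugar helps linearly.
    salt, sugar = inputs
    return 3.0 * salt - salt ** 2 + 2.0 * sugar


point = [1.0, 0.5]  # current salt and sugar amounts
# The gradient is the list of one-at-a-time slopes, one per input.
gradient = [partial_derivative(tastiness, point, i) for i in range(len(point))]
```

Here `gradient[0]` is the slope with respect to salt (sugar held fixed) and `gradient[1]` the slope with respect to sugar (salt held fixed).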
PagedAttention
A way of storing the KV cache for many concurrent requests by splitting each request's cache into small fixed-size "pages" that the engine can scatter freely around GPU memory and look up through a per-request page table — the same idea operating systems use for virtual memory. It removes the wasted space and fragmentation you get when each request needs its own contiguous chunk, which is why vLLM made it the default scheme.
Patch
A small rectangular section of an image. Instead of looking at an entire image at once, models often break it down into a grid of these smaller blocks to process them one by one. Like cutting a jigsaw puzzle into individual pieces and examining each piece separately before seeing how they fit together.
Patchification
Splitting a (latent) tensor into a sequence of small square patches and turning each one into a single token, so a transformer can treat an image like a sentence of words. For example, a 32×32 latent cut into 2×2 patches becomes a sequence of 256 tokens (a 16×16 grid), each a little block projected to the model's hidden width. The patch size is the key knob: smaller patches make more tokens (finer detail but more compute), bigger patches make fewer tokens (cheaper but coarser) — a suffix like "/2" in DiT-S/2 means patch size 2. Like slicing a photo into postage-stamp squares and reading them left-to-right, top-to-bottom. The same idea extends to video by cutting spatiotemporal patches — little 3D boxes that also span a few frames in time, so one sequence of tokens carries both motion and appearance.
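A minimal sketch of the token-count arithmetic, assuming a single-channel grid stored as a list of rows and skipping the projection to the model's hidden width. The helper name `patchify` is illustrative, not a library function.

```python
def patchify(grid, patch):
    """Cut a 2D grid into patch x patch blocks, flattened, in raster order."""
    height, width = len(grid), len(grid[0])
    if height % patch or width % patch:
        raise ValueError("grid size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):          # top-to-bottom
        for left in range(0, width, patch):      # left-to-right
            token = [grid[top + dy][left + dx]
                     for dy in range(patch) for dx in range(patch)]
            tokens.append(token)
    return tokens


# A 32x32 "latent" whose values are just their own flat index.
latent = [[row * 32 + col for col in range(32)] for row in range(32)]
tokens = patchify(latent, patch=2)        # 16 x 16 grid of patches
coarse_tokens = patchify(latent, patch=4)  # bigger patches, fewer tokens
```

With patch size 2 the sequence has 256 tokens of 4 values each; with patch size 4 it drops to 64 tokens of 16 values each, which is the compute-versus-detail trade-off described above.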
PCA (principal component analysis)
A technique that finds the few directions along which data varies the most and uses them to compress many numbers down to a handful, so high-dimensional data can be drawn on a 2D plot. Imagine photographing a 3D object from the angle that reveals its shape best — PCA picks that most-informative "camera angle" automatically. It is a quick, standard first step for seeing the structure in data, such as checking whether real images cluster together while random noise scatters apart.
PCIe
The standard CPU-GPU connection (and slower GPU-GPU when no NVLink)
Perceiver IO
DeepMind's modality-agnostic architecture that handles inputs of any size or type — pixels, audio samples, point clouds — without the cost normally blowing up. Plain attention compares every input element with every other, so a million-pixel image would need a million-by-million grid; Perceiver instead keeps a small fixed set of learned latent vectors (say 256 of them) and lets only those latents cross-attend to the giant input, squeezing it into the small set once, then doing all the heavy processing among just the 256. The "IO" version adds a matching trick on the output side: a set of learned query vectors cross-attends to the processed latents to read out an answer of whatever shape you need. Like a small committee (the latents) that skims a huge pile of documents, takes compact notes, deliberates among themselves, and then answers any question put to them — the committee's workload depends on its own size, not on how tall the pile was. Because nothing in it assumes a grid or a sequence, the same architecture works across modalities with almost no changes, which is its headline selling point. It is closely related to the Q-Former, which uses the same small-set-of-learned-queries idea to distill an image for a language model.
Percentile
A way to describe where a value ranks in a sorted list: the p99 latency is the time that 99% of requests beat, with only the slowest 1% taking longer. Unlike an average, which blends a few very slow requests into a single number that matches nobody's actual experience, percentiles expose the slow tail that users actually feel. It is like reporting "99% of diners were served within 20 minutes" instead of a misleading restaurant-wide average. Serving teams quote p50, p95, and p99 rather than the mean for exactly this reason.
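A small sketch of why the mean and the percentiles tell different stories. It uses the nearest-rank convention, which is one of several common ways to define a percentile, and the latency numbers are invented for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies: 98 fast requests and two slow ones.
latencies_ms = [100] * 98 + [2000, 9000]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
```

The mean lands at 208 ms, a latency no request actually had, while p50 reports the typical 100 ms and p99 surfaces the 2000 ms tail.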
Perceptual loss (LPIPS)
A loss that compares two images by the features a pretrained network sees in them, rather than by their raw pixels. Two photos shifted by a single pixel are nearly identical to a human eye but very different under pixel-by-pixel error; a perceptual loss judges them the way an eye does, rewarding matching textures and shapes. Training with it (LPIPS — Learned Perceptual Image Patch Similarity — is the popular version) gives much sharper results than plain pixel MSE, which tends to blur. It is widely used inside VQ-GAN and VAE training.
permute
Reorders all of a tensor's dimensions by rewriting strides — never copies
Perplexity
A score for how surprised a language model is by a piece of text — roughly, how many words it was effectively choosing between at each step. Lower is better: a perplexity of 1 means the model knew exactly what came next, while a high number means it was guessing wildly. Because it is cheap to compute and rises the moment a model gets worse, it is a common first tripwire in a quality gate after quantization.
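The "how many choices" reading follows from the usual formula: perplexity is the exponential of the average negative log-probability the model assigned to each actual next token. A minimal sketch, with made-up probabilities:

```python
import math


def perplexity(token_probs):
    """Perplexity from the probability the model gave to each true next token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


certain = perplexity([1.0, 1.0, 1.0])       # the model always knew what came next
four_way = perplexity([0.25, 0.25, 0.25])   # like guessing among 4 equally likely tokens
```

A model that is always certain scores 1, and one that spreads its bet evenly over four options scores 4.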
PID
Proportional-Integral-Derivative — the workhorse linear controller
Pinned memory
Page-locked CPU memory that enables faster, asynchronous transfers to the GPU; enabled with pin_memory=True on a DataLoader.
Pinocchio
Fast rigid-body dynamics library (CRBA, RNEA, ABA)
PixelCNN
An autoregressive image model, built from a CNN (Convolutional Neural Network) repurposed for generation, that draws a picture one pixel at a time, predicting each pixel from the pixels already drawn above it and to its left. It is like filling in a coloring grid square by square, always glancing back at what you have already colored to decide the next color. The image quality is strong and it can report an exact probability for any picture, but generating one is slow because the pixels must come out strictly in order, each waiting on the one before it.
Plücker coordinates
A way to describe a single straight line (here, the ray of sight through one pixel) using six numbers instead of a point-plus-direction. The six split into the ray's direction and its moment (a cross product that pins down which parallel line it is), so a line floating anywhere in 3D space gets one compact, position-independent code. Video models use them for camera control: give every pixel of every frame its Plücker ray and the model knows exactly which way the camera is looking, which lets learned camera moves generalize to angles never seen in training — far better than feeding raw camera-position numbers. Named after the 19th-century mathematician Julius Plücker, who introduced this line geometry.
PoC
Proof of Concept — a small, rough build whose only job is to show that an idea can work, before anyone invests in a polished version. Like frying one test pancake to check the batter before making the whole stack: you are not trying to serve it, just to learn whether the approach is sound.
Point cloud
A loose scatter of dots in space, where each dot is one data item placed by its numbers. Turn every image in a batch into a feature vector — a single point — and the whole batch becomes a cloud of such points. Comparing two clouds (say, real images vs. generated ones) is how a metric like FID measures similarity: it is like comparing two swarms of bees and asking whether they are hovering in the same spot and spread out in the same shape.
Policy
In reinforcement learning, the model being trained to choose what to do next — for an LLM, the network that picks the next token. "Improving the policy" just means making those choices earn more reward.
Position bias
A judge's tendency to pick an answer based on where it sits rather than what it says — for example, an LLM-as-judge that quietly prefers whichever response appears first (or last) when shown two side-by-side. Like a job interviewer who can't help favoring the candidate they meet right after lunch, regardless of qualifications. The standard fix is to ask the judge twice with the two answers swapped and accept the verdict only if both runs name the same winner.
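The swap-and-check fix can be sketched as follows. The `judge` callable and its "first"/"second" return values are hypothetical stand-ins for a real LLM-as-judge call.

```python
def debiased_verdict(judge, question, answer_a, answer_b):
    """Ask the judge twice with the answers swapped; accept only a consistent winner."""
    first_run = judge(question, answer_a, answer_b)   # A is shown first
    second_run = judge(question, answer_b, answer_a)  # B is shown first
    winner_1 = "A" if first_run == "first" else "B"
    winner_2 = "B" if second_run == "first" else "A"
    return winner_1 if winner_1 == winner_2 else None  # None means "no reliable verdict"


def always_first(question, left, right):
    # A fully position-biased judge: whatever appears first wins.
    return "first"


def longer_wins(question, left, right):
    # A judge that looks at content (here, crudely, at length), not position.
    return "first" if len(left) >= len(right) else "second"


biased = debiased_verdict(always_first, "q", "long answer", "short")
stable = debiased_verdict(longer_wins, "q", "long answer", "short")
```

The position-biased judge contradicts itself across the two orderings, so its verdict is discarded, while the content-based judge names the same winner both times.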
Position interpolation
Extending a model's context length by linearly rescaling RoPE position indices so longer sequences fall within the trained range
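A minimal sketch of the rescaling step, assuming positions are plain indices that a RoPE implementation would later turn into rotation angles. The function name and the example lengths are illustrative.

```python
def interpolate_positions(seq_len, trained_len):
    """Linearly squeeze position indices so the last one stays inside the trained range."""
    # Only shrink when the sequence is longer than what the model was trained on.
    scale = min(1.0, trained_len / seq_len)
    return [index * scale for index in range(seq_len)]


# Hypothetical: a model trained on 2048 positions, asked to read 8192 tokens.
positions = interpolate_positions(seq_len=8192, trained_len=2048)
```

Every fourth token now lands on a whole-number position the model has seen, and the ones in between fall at fractional positions inside the same range.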
Position vector
A vector that represents the exact location of a specific point in space. You make one by drawing a straight geometric arrow from the origin — the (0, 0, 0) center of the coordinate system — directly to your target point. If a point lives at coordinates (x, y, z), its position vector is simply the vector [x, y, z].
Why do we need this? In geometry, a point is just a fixed location, while a general vector is just a movement (a direction and a length, like "walk 5 steps North") that can float anywhere in space. A position vector bridges the two: by permanently anchoring the tail of the arrow to the origin, the vector perfectly describes that specific location. This is the mathematical trick that lets you plug a fixed point into vector operations like the cross product.
Posterior collapse
A VAE failure where the decoder grows strong enough to reconstruct inputs on its own and simply ignores the latent space. The encoder then stops bothering to encode anything and just outputs the default prior, so the latent variables carry no information about the input — like a student who has memorized the answer key and no longer reads the question. When this happens the KL divergence term drops toward zero and the latent code becomes useless for generation.
Postmortem
A written review done after an incident — an outage, a slowdown — that lays out what happened, how it was detected and fixed, and what will stop it recurring. A good one is blameless: it focuses on the system and the process, not on punishing a person, like an air-crash investigation whose goal is safer future flights rather than someone to fire.
PPO
Proximal Policy Optimization — the workhorse on-policy RL algorithm, used in classic RLHF
Precision and recall
Two numbers that, used together, describe how a yes/no detector is doing — far more honest than a single accuracy figure. Precision asks "when the model says yes, how often is it right?" — of all the times it shouted "dog!", what fraction really had a dog. Recall asks "of all the real yes-cases, how many did it catch?" — of all the images that truly had a dog, how many it found. You compute each as a simple fraction: precision = true positives / (true positives + false positives); recall = true positives / (true positives + false negatives). They trade off against each other — a model that says "yes" to everything has perfect recall but terrible precision — which is exactly why a hallucination probe must report both, not just accuracy. Analogy: a fisherman's net — precision is how much of the catch is the fish you actually wanted (not boots and weeds), and recall is how many of the lake's fish you managed to net at all.
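The two fractions above can be computed directly from a list of yes/no predictions and the true labels. This is a minimal sketch; the toy labels are invented, and returning 0.0 when a denominator is empty is a convention chosen here, not something the definition prescribes.

```python
def precision_recall(predictions, labels):
    """Return (precision, recall) for boolean predictions against boolean labels."""
    pairs = list(zip(predictions, labels))
    true_pos = sum(1 for pred, truth in pairs if pred and truth)
    false_pos = sum(1 for pred, truth in pairs if pred and not truth)
    false_neg = sum(1 for pred, truth in pairs if not pred and truth)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


labels = [True, True, False, False, False]   # two of five images contain a dog
always_yes = [True] * 5                      # a detector that shouts "dog!" every time
precision, recall = precision_recall(always_yes, labels)
```

The always-yes detector catches every dog (recall 1.0) but is right only two times out of five (precision 0.4), which is the trade-off described above.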
Prefill
The first stage of LLM inference: reading the entire prompt at once to fill the KV cache, before any new tokens are generated. Because all the prompt's tokens can be processed together in a single forward pass, prefill is compute-heavy and fast per token — like a reader skimming a whole page at a glance to grasp it before starting to write a reply. It is the opposite of decode, which then produces the answer one token at a time, and prefill time is what sets the time to first token.
Prefix cache
Sharing KV cache across requests that begin with the same tokens (e.g., system prompts)
Pretraining
Self-supervised training on a large unlabeled corpus to predict the next token
Prior-preservation loss
An extra training term used by DreamBooth to stop a model from forgetting a whole class while learning one specific member of it. When you fine-tune on five photos of your dog, the model risks deciding every "dog" now looks like yours — a form of catastrophic forgetting. Prior preservation counters this by mixing in the model's own generic "a photo of a dog" images during training and asking it to keep reproducing them, so the broad concept of "dog" is preserved while the narrow concept of your dog is added on top. Like teaching someone your cousin's face without making them forget what faces in general look like.
PRM
Probabilistic Roadmap — multi-query sampling-based planner
PRM800K
A public dataset of about 800,000 human labels that mark each step of a math solution as right or wrong, released by OpenAI to train process reward models. Rather than only checking whether the final answer was correct, human graders read each worked solution line by line, like a math teacher putting a check or an X next to every step of a student's proof, not just the boxed answer at the bottom. Because the feedback is step-level, a model trained on it learns to spot exactly where the reasoning went off the rails, not merely whether the final answer happened to come out right. It is a standard training set for the step-by-step scorers used in Best-of-N re-ranking.
Probability density
A function that says how likely each possible value is — high where real data points pile up, low in the empty regions where they rarely fall. For a 2D dataset you can picture it as a heatmap: bright ridges over the crowded spots, dark valleys over the bare ones. It must stay non-negative everywhere, and all of it added up (the total volume under the surface) equals exactly 1, since some value always occurs. Most generative models can only draw new samples; a normalizing flow is special because it can also report the exact probability density of any point you hand it.
Probability flow ODE
The deterministic twin of a diffusion model's reverse-time SDE: an ODE with no injected randomness that produces the same distribution of images at every noise level. Determinism buys two things the stochastic sampler can't: the same starting noise always maps to the same image (so you can interpolate between samples and invert a real image back to its noise), and the model's exact log-likelihood of any image — how probable it thinks that image is — becomes computable via the ODE's change-of-variables. It is the basis of fast deterministic samplers like DDIM.
Process reward model
A scorer that grades each individual step of a model's reasoning rather than just the final answer — like a teacher marking every line of a proof, not only the last one — so a mistake can be caught at the exact step it happens. Contrast with an outcome reward model.
Profiler
A tool (torch.profiler) that records how long each operation in a training step takes, used to locate performance bottlenecks.
Projection discriminator
A way to feed a class label into a conditional GAN's discriminator by taking a dot product between the image's features and a learned vector for that class, then adding it to the score — rather than just gluing the label on as an extra input. This matches how the math of conditioning actually factorizes, so it conditions more strongly for almost no extra cost, and it became the standard trick for class-conditional GANs such as BigGAN.
Projector
The small network, often a single linear layer or a two-layer MLP, that maps one modality's feature vectors into the space another model expects. At first glance it acts like a physical adapter cable that reshapes one plug into another (e.g., mapping a 1024-dimensional image vector to a 4096-dimensional vector the size of a word embedding). Crucially, simply matching dimensions is not enough; the projector must undergo alignment training (like installing a software driver for the adapter) to learn the transformation that routes the visual semantics into the LLM's native coordinate space. This is the entire fusion mechanism in LLaVA: in its first, alignment stage the vision encoder and the LLM are both frozen and only the projector is trained; a later instruction-tuning stage also updates the LLM. The catch is that all the image information must squeeze through this one thin bridge, so it can become a bottleneck on detail-heavy tasks.
Prompt injection
An attack in which adversarial text smuggled into something the model reads — a retrieved document, a tool's output, an email, even text inside an image — overrides the original system instructions. Like a customer slipping a fake "manager-approved" note into a server's order pile: the server can't easily tell the planted note from a real one. The hardest unsolved security problem in deployed LLMs, because the model has no built-in way to separate "instructions" from "data" in its input.
Prompt-to-Prompt
A diffusion editing technique that changes what an image shows while keeping its layout intact, by reusing the cross-attention maps from the original generation. Those attention maps record which word controls which region (the word "cat" lights up the cat's pixels); if you swap "cat" for "dog" but force the new run to reuse the old maps, the dog lands in exactly the same pose and place as the cat. Picture keeping a painting's pencil under-drawing fixed and only changing the colors you fill in. It is one of the tools used to build paired before/after data for InstructPix2Pix.
PTQ / QAT
Post-Training Quantization / Quantization-Aware Training
Pydantic
A popular Python library for declaring the shape of your data as a class — you write a class with typed fields (e.g. name: str, age: int) and Pydantic validates that any data you load actually matches, raising a clear error if a value is the wrong type or a required field is missing. Like a customs form for data: anything that does not match the listed fields gets stopped at the border. In LLM work it is the standard way to describe the JSON object you want the model to produce, which tools like Outlines or OpenAI's structured-output mode can then enforce during decoding.
Q-Former
The fusion module from BLIP-2 (a 2023 vision-language model) that shrinks a whole image down to a fixed small number of tokens — typically 32 — that a language model can read. It holds a set of learned query vectors that cross-attend to the frozen image encoder's many patch features, each query pulling out one summary of what it cares about; the 32 outputs are then projected and fed to the LLM as if they were 32 word tokens. Like 32 interviewers who each question a sprawling exhibit and walk away with one concise note, so the language model reads 32 notes instead of touring the whole gallery. The point of the fixed count is cost control: an image becomes a constant, small number of tokens no matter its resolution, instead of hundreds. It shares the small-set-of-learned-queries idea with the Perceiver IO; later VLMs like LLaVA showed a plain projector often matches it with less complexity.
QLoRA
LoRA with the frozen base model stored in 4-bit quantized form, cutting memory so much you can fine-tune a large model on a single consumer GPU.
Quality filter
A classifier that scores each training document and keeps only the high-quality ones (e.g. educational web text), discarding low-value text before pretraining.
Quality gate
An automatic check that a model must pass before it is allowed to serve real traffic — like a bouncer at the door who turns away anyone failing the dress code. It runs a fixed set of evaluations (such as perplexity and capability tests) and blocks the deploy if any score drops too far from the trusted baseline, which is how teams catch silent quantization regressions before users do.
Quantization
Reducing weight / activation precision (FP16, BF16, FP8, INT8, INT4) to save memory and bandwidth
RadixAttention
sglang's KV cache organized as a radix tree keyed on prompt prefixes for automatic sharing
RAFT
RAFT (Recurrent All-Pairs Field Transforms) is a neural network for computing dense optical flow, and on release one of the most accurate. Its core idea is to compare all pairs of pixels between the two frames to build a similarity volume, then iteratively refine a flow estimate with a recurrent update — repeatedly nudging the guess until it stops improving (which is the "recurrent" in its name). Analogy: it is a careful editor who, instead of guessing the motion once, keeps revising the answer over many small passes. Example: feeding RAFT two adjacent video frames returns a (H, W, 2) flow field that is far cleaner on fast motion than the classical Farnebäck method.
RAG
Retrieval-Augmented Generation — give the model an "open-book exam" instead of asking it to answer from memory alone. First a search step fetches the documents most relevant to the question (from a company wiki, a manual, the web), then those documents are pasted into the prompt, and only then does the model write its answer using them as notes. This lets it use fresh or private facts it was never trained on, and makes it easy to check where an answer came from.
rank
The unique integer ID of a process in a distributed job. RANK is the global ID across all machines; LOCAL_RANK is the ID within one machine; WORLD_SIZE is the total number of processes.
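A small sketch of reading these three values from environment variables, which is how launchers such as torchrun hand them to each process. The helper name and the single-process fallback values are assumptions made for illustration.

```python
import os


def read_distributed_env(env=None):
    """Read RANK, LOCAL_RANK and WORLD_SIZE, falling back to a single-process setup."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global ID across all machines
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # ID within this machine
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


# Hypothetical values for the sixth process overall, second on its machine, in an 8-process job.
info = read_distributed_env({"RANK": "5", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})
is_main_process = info["rank"] == 0
```

Checks like `is_main_process` are commonly used so that only one process writes logs or checkpoints.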
Raster order
Walking through a 2D grid of pixels (or image tokens) one row at a time, left to right and top to bottom — the exact path your eyes take reading a page. The name comes from how old CRT TVs and monitors painted the screen: an electron beam swept across in horizontal lines called raster lines (from the Latin rastrum, "rake," because the lines look raked across the glass). An autoregressive image model that generates in raster order produces the top-left pixel first and the bottom-right pixel last.
RDMA
Remote Direct Memory Access — letting one machine read or write another machine's memory directly over the network, without either CPU stopping to copy the data. Like a pneumatic tube that drops a package straight onto a coworker's desk instead of handing it to a courier who walks it over. In disaggregated serving it is how a prefill node ships a multi-gigabyte KV cache to a decode node fast enough to be worth splitting them.
Real NVP
Short for "Real-valued Non-Volume Preserving" — an early, influential normalizing flow design. Its trick at each step: split the numbers into two halves, leave one half completely untouched, and use that untouched half to decide how to stretch and shift the other half. Because the untouched half is still right there, the step is trivially reversible (you can recompute the stretch-and-shift and undo it) and its effect on probability density is cheap to calculate. This made flows practical to train and inspired later models like Glow.
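The split, keep, stretch-and-shift step can be sketched as below. In a real model the scale and shift come from small neural networks; here `scale_fn` and `shift_fn` are toy stand-ins, and the scale is applied as exp(s) so that the log-determinant is just the sum of s.

```python
import math


def coupling_forward(x, scale_fn, shift_fn):
    """One affine coupling step: returns the transformed vector and its log-determinant."""
    half = len(x) // 2
    kept, changed = x[:half], x[half:]
    log_scale = scale_fn(kept)   # computed only from the untouched half
    shift = shift_fn(kept)
    out = [c * math.exp(s) + t for c, s, t in zip(changed, log_scale, shift)]
    return kept + out, sum(log_scale)


def coupling_inverse(y, scale_fn, shift_fn):
    """Undo the step: the kept half is still there, so recompute and reverse."""
    half = len(y) // 2
    kept, changed = y[:half], y[half:]
    log_scale = scale_fn(kept)
    shift = shift_fn(kept)
    restored = [(c - t) * math.exp(-s) for c, s, t in zip(changed, log_scale, shift)]
    return kept + restored


# Toy stand-ins for the learned networks (hypothetical).
toy_scale = lambda kept: [0.1 * k for k in kept]
toy_shift = lambda kept: [k + 1.0 for k in kept]

x = [1.0, 2.0, 3.0, 4.0]
y, log_det = coupling_forward(x, toy_scale, toy_shift)
x_back = coupling_inverse(y, toy_scale, toy_shift)
```

The first half of `y` equals the first half of `x`, `x_back` matches `x` up to rounding, and `log_det` is the cheap density correction the definition mentions.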
Reasoning model
An LLM trained to think out loud at length — writing a long chain of thought before its final answer — to solve harder problems (math, code, logic). Like a student who fills a page of scratch work before writing the answer, it is far more capable on tough questions but also far more expensive to serve, because one hard problem can produce 10× the tokens of a normal chat reply. Managing that swing in output length is the main serving challenge it creates.
ReAct
A simple agent pattern that interleaves Reasoning and Acting: the model writes a thought, takes an action with a tool, reads the observation, then repeats — the loop most basic agents are built on.
Reciprocal rank fusion
A simple, robust way to merge several ranked lists into one: each item scores the sum of 1 / (rank + constant) across the lists, so items ranked highly by more than one retriever rise to the top. Common for combining dense and sparse search in hybrid retrieval.
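A minimal sketch of the scoring rule, assuming ranks start at 1 and using 60 as the constant, a commonly used default rather than something the definition fixes. The document IDs are invented.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists: each item scores the sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda item: scores[item], reverse=True)


dense_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical embedding-search results
sparse_hits = ["doc_b", "doc_d", "doc_a"]   # hypothetical keyword-search results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

`doc_b` and `doc_a` appear in both lists and rise to the top, ahead of the documents that only one retriever found.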
Rectified flow
A flow-matching parameterization whose training paths are straight lines from noise to data, popular in 2024+ models like SD3 and Flux. Straight trajectories are easy to follow in a few big steps, so sampling needs fewer steps than the curvy paths of older diffusion. You can also "re-flow": after training once, use the model to generate (noise, image) pairs and retrain on those straight pairs, which straightens the paths even further and lets you sample in as few as one or two steps. Like replacing a winding mountain road between two towns with a straight highway — same destination, far fewer turns to take.
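The straight-line idea can be sketched in a few lines. This assumes the convention that t = 0 is pure noise and t = 1 is data (some codebases flip it), and it uses an "oracle" velocity that already knows the data point, standing in for a trained network.

```python
def interpolate(noise, data, t):
    """Point on the straight line from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def velocity_target(noise, data):
    """Training target: the constant velocity along that straight line."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(start, velocity_fn, steps):
    """Follow the velocity field from t=0 to t=1 with equal-sized Euler steps."""
    x = list(start)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -1.0]
data = [2.0, 3.0]
# Hypothetical perfect model: always returns the true straight-line velocity.
oracle = lambda x, t: velocity_target(noise, data)

midpoint = interpolate(noise, data, 0.5)
one_step = euler_sample(noise, oracle, steps=1)
```

Because the path is perfectly straight here, a single Euler step already lands on the data point, which is the intuition behind few-step sampling.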
Reference model
A frozen copy of the starting model that RLHF and DPO measure against (through a KL term) so the model being trained does not drift too far from sensible behavior — a "before" photo to compare every change against.
Rejection sampling
A way to draw samples from a target distribution by proposing easy guesses and keeping or throwing away each one with just the right probability, so the survivors are distributed exactly as if they came from the hard distribution directly. The "right probability" of keeping a guess is min(1, p ÷ q), where p is how likely the target model thinks that token is and q is how likely the draft model thought it was. The rule is intuitive: if the target wants the token at least as much as the draft did (p ≥ q), always keep it; if the target wants it only half as much (p is half of q), keep it half the time and otherwise draw a replacement. For example, the draft proposes "cat" with q = 0.6 but the target only gives it p = 0.3, so you keep "cat" with probability 0.3 ÷ 0.6 = 0.5 — a coin flip — which exactly cancels the draft's over-eagerness for that word. In speculative decoding this is the step that lets a draft model's guesses be reused for random sampling without changing the target model's true output distribution.
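The keep-or-replace rule can be sketched as below. The definition does not say how the replacement is drawn; this sketch assumes the usual speculative-sampling choice of resampling from the leftover mass max(0, p - q), renormalized. The two-word distributions are the toy numbers from the example above.

```python
import random


def accept_probability(p, q):
    """Chance of keeping a drafted token: min(1, p / q)."""
    return min(1.0, p / q)


def speculative_accept(token, target_probs, draft_probs, rng):
    """Keep the drafted token with probability min(1, p/q), else draw a replacement."""
    p = target_probs.get(token, 0.0)
    q = draft_probs[token]
    if rng.random() < accept_probability(p, q):
        return token
    # Assumption: replacements come from the residual distribution max(0, p - q).
    candidates = list(target_probs)
    weights = [max(0.0, target_probs[t] - draft_probs.get(t, 0.0)) for t in candidates]
    return rng.choices(candidates, weights=weights)[0]


target = {"cat": 0.3, "dog": 0.7}   # what the target model wants
draft = {"cat": 0.6, "dog": 0.4}    # what the over-eager draft model proposes
keep_cat = accept_probability(target["cat"], draft["cat"])
```

Here `keep_cat` is the coin flip from the example, and running the rule over many drafted tokens yields "cat" about 30% of the time, matching the target model rather than the draft.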
ReLU
Rectified Linear Unit — the most common and simplest activation function: it keeps positive numbers unchanged and turns every negative number into 0 (max(0, x)). Like a one-way valve that lets water through in one direction and blocks it in the other. That single sharp bend is enough to give a network its non-linear power, and because it is so cheap to compute it was the default for years; newer models often swap it for smoother curves like Swish or GELU.
Reparameterization trick
A method to keep the training signal flowing through a random sampling step, enabling models like VAEs to be trained with ordinary backpropagation.
- The Problem: Drawing the latent variable z directly from the encoder's distribution introduces randomness that blocks the flow of gradients.
- The Solution: The trick separates the randomness by drawing plain noise ε from a fixed standard normal distribution (a bell curve). You then compute z = μ + σ · ε.
- Why it Works: The randomness is now isolated in ε (which has no learnable parts). As a result, the network's μ and σ remain on a clean, differentiable path.
- Analogy: It is like rolling one shared die outside the machine and then scaling the result, rather than building the dice into the machine itself.
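A plain-Python sketch of the computation, without an autograd library. The values of `mu` and `sigma` are made up; in a VAE they would come from the encoder. The gradients are written out by hand to show why the path stays differentiable.

```python
import random


def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps, with all randomness confined to eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]   # plain noise, nothing learnable
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps


rng = random.Random(0)
mu = [1.0, -2.0]      # hypothetical encoder means
sigma = [0.5, 2.0]    # hypothetical encoder standard deviations
z, eps = reparameterize(mu, sigma, rng)

# With eps treated as a constant, z is an ordinary function of mu and sigma:
grad_z_wrt_mu = [1.0 for _ in mu]   # dz/dmu = 1
grad_z_wrt_sigma = list(eps)        # dz/dsigma = eps
```

Once ε is drawn, nothing random sits between the encoder outputs and z, so backpropagation can pass through the multiply and the add as usual.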
Reranker
A second-stage model that re-scores the top candidates from a fast first-stage retriever and reorders them by true relevance — usually a cross-encoder. The "retrieve then rerank" two-stage pattern is standard in search and RAG.
reshape
Returns a tensor with a new shape, copying only when a no-copy view isn't possible
Residual connection
A shortcut that adds a block's input straight onto its output — written output = x + f(x), where x is what went in and f(x) is what the block computed. Instead of each block having to rebuild the whole signal from scratch, the original x flows past it on an express lane and the block only contributes a small f(x) correction on top. Think of editing a draft: rather than rewriting the entire essay at every pass, each editor keeps the existing text and just marks up the few changes that improve it.
What does "adds a block's input to its output" actually buy you? Two big things:
- An easy "do nothing" default. If a block has nothing useful to add, it can simply output near-zero, and x + 0 = x passes the input through unchanged. So adding more layers can never make things worse than the layers already learned: a new block starts from "leave it alone" and only departs from that when it finds something helpful. (This is exactly why AdaLN-Zero zero-initializes its gate: each block begins as a clean pass-through.)
- A gradient highway. On the backward pass, the + x term hands every layer a direct path back to the earlier layers, so gradients don't shrink toward zero as they travel through many layers (the vanishing-gradients problem). That direct path is what makes very deep networks trainable at all; before residual connections, stacking 50+ layers usually trained worse, not better. It is the same skip-and-add logic found in convolutional nets, and it is what carries the residual stream through a transformer.
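The output = x + f(x) rule and its "do nothing" default can be sketched with plain lists. The two toy blocks are hypothetical stand-ins for learned layers.

```python
def residual_block(x, f):
    """Add the block's input straight onto what the block computed."""
    return [xi + fi for xi, fi in zip(x, f(x))]


def zero_block(x):
    # A block with nothing useful to add outputs zeros.
    return [0.0 for _ in x]


def small_correction(x):
    # A block that contributes a modest tweak on top of its input.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
passed_through = residual_block(x, zero_block)
corrected = residual_block(x, small_correction)

# Stacking many do-nothing blocks still returns the original signal.
deep = x
for _ in range(50):
    deep = residual_block(deep, zero_block)
```

A zero-output block leaves the signal untouched, even fifty layers deep, while a useful block only adds its correction on top.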
Residual parameterization
A modeling trick used in deep hierarchical VAEs where each layer of latent variables is expressed as a small correction to what the previous layer already predicted, rather than as a full absolute value. Like a GPS giving "turn left in 200 m" instead of stating exact coordinates — each step describes only the gap from where you already are, so no single step has to carry the whole story. Because each latent group only needs to represent a tiny residual change, gradients flow smoothly through many stacked layers and very deep hierarchies become trainable. The idea borrows from residual connections in standard networks, applying the same skip-and-add logic to the latent variable structure itself.
Residual stream
In a transformer, the running activation vector that flows through every layer via residual connections — each attention block and MLP block reads from this stream and adds its update back to it, without erasing what came before. Like a shared bulletin board that every department reads and pins notes to as it passes through the office: by the end of the building, the board carries the combined contribution of every team. Because every layer reads and writes the same vector space, the residual stream is the most natural place to look for interpretable features, which is why sparse autoencoders (SAEs) are usually trained on residual-stream activations.
ResNet
Residual Network: a deep CNN whose layers each learn a small change to add to their input rather than a brand-new output, thanks to residual (skip) connections that route the input straight past each block. The "Res" is short for "residual," the leftover the layer adds on top. Before ResNet, stacking many layers made networks harder to train because the signal degraded on its way through; letting each block default to "pass the input through unchanged, plus a tweak" means extra depth rarely hurts and usually helps. Like a relay of editors who each suggest small edits to a draft instead of rewriting it from scratch, so the original text is never lost. ResNet-50 (50 layers) is still a common, sturdy baseline image encoder.
reverse-mode
The order autograd walks the computation graph when differentiating: the forward pass first, then a single backward pass that propagates gradients from the scalar output back to every input. It is the efficient choice when a model has many parameters but only one loss value.
Reward hacking
A failure mode in which a policy maximizes the reward signal without doing what was intended, typically by exploiting flaws in how the reward is computed
Reward model
A model trained on human preference comparisons to score how good a response is; it stands in for a human rater so RLHF can score millions of answers automatically.
Right-sizing
Choosing the smallest, cheapest model that still clears your quality bar for a task, instead of defaulting to the biggest one available. A well-trained 8B model often passes the same eval as a 70B at a fraction of the cost per million tokens — like hiring a capable specialist instead of an expensive all-rounder for a job that doesn't need one. Most production teams over-serve, so right-sizing is one of the easiest cost wins.
Ring attention
A way to run attention over a very long sequence that is split across several GPUs (context parallelism): each GPU passes its slice of the keys and values to its neighbor around a circle, round after round, until every GPU has seen every other slice. Like people seated around a dinner table passing dishes one seat at a time so everyone eventually tastes every dish. This lets the GPUs handle a sequence far longer than any one of them could hold alone.
RLAIF
Reinforcement Learning from AI Feedback: the same recipe as RLHF, but the preference labels (or grades) are produced by another LLM, often a stronger one, following a written rubric instead of by paid human raters. Like swapping a panel of human judges for a single expert judge who costs far less, never sleeps, and applies the same rules every time. Cheaper and faster than human labeling, often nearly as good on well-defined tasks, and a core ingredient of Constitutional AI.
RLHF
Reinforcement Learning from Human Feedback: preference learning, classically via PPO with a KL penalty that keeps the policy close to the reference model
RLVR
RL with Verifiable Rewards — RL when the reward is a deterministic checker
RMSNorm
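Root-Mean-Square Normalization: LayerNorm without mean-centering (activations are divided by their root-mean-square and multiplied by a learned gain); the default in most modern LLMs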
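A minimal sketch of the idea in plain Python, assuming a single activation vector and a per-element gain; the function name rms_norm and the eps value are illustrative choices, not taken from any particular library.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide a vector by its root-mean-square, then apply a per-element gain.

    Unlike LayerNorm, the mean is NOT subtracted first.
    `eps` (an assumed small constant) guards against division by zero.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]

normalized = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
print(normalized)  # the output vector has a root-mean-square of about 1
```

With a gain of all ones the output keeps the direction of the input and only rescales its length.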
RNEA
Recursive Newton-Euler Algorithm: an inverse-dynamics algorithm that runs in O(n) time in the number of joints
Rollout
One sample of the model actually generating a full response to a prompt, used in RL to see what behavior to reward; producing many rollouts is the expensive part of PPO and GRPO.
Rollout distribution
The spread of responses a model is currently generating when it produces rollouts during RL training — what it tends to say and how varied those answers are. This distribution shifts as training proceeds, which is the whole point; but if it drifts toward weird, repetitive, or gamed outputs, that is a warning sign of reward hacking. Watching how it moves is like checking what a student actually writes on practice tests, not just their final score.
Roofline
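Performance model that bounds attainable throughput as min(peak FLOP/s, memory bandwidth × arithmetic intensity), where arithmetic intensity is the number of FLOPs performed per byte moved to or from memory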
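The bound can be sketched in a few lines. The hardware numbers below are hypothetical round figures chosen for easy arithmetic, not the specification of any real chip.

```python
def attainable_flops(peak_flops_per_s, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline bound: the lower of the compute ceiling and the memory ceiling."""
    return min(peak_flops_per_s, bandwidth_bytes_per_s * flops_per_byte)

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK = 1e14
BANDWIDTH = 1e12

memory_bound = attainable_flops(PEAK, BANDWIDTH, flops_per_byte=2)     # low intensity
compute_bound = attainable_flops(PEAK, BANDWIDTH, flops_per_byte=500)  # high intensity
ridge_point = PEAK / BANDWIDTH  # intensity at which the two ceilings meet
print(memory_bound, compute_bound, ridge_point)
```

Kernels whose arithmetic intensity falls below the ridge point are limited by memory bandwidth; kernels above it are limited by peak compute.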
RoPE
Rotary Position Embedding: a way to tell a transformer where each token sits by rotating its query and key vectors by an angle proportional to the position, so the positional part of the attention dot product between two tokens depends only on how far apart they are. Because the encoding lives in the rotation rather than in an added vector, it captures relative position directly, and with frequency-scaling tricks it can be stretched to sequences longer than those seen in training (out of the box, quality usually degrades past the training length). 2D RoPE extends the trick to images: a patch token is rotated by its row and its column, encoding 2D spatial position. Like giving every seat in a theater a precise angle on a dial, so the model can always work out the spacing between any two seats.
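The relative-position property can be checked numerically. This is a small sketch in plain Python, assuming the common convention of rotating consecutive pairs of dimensions with frequencies base^(-i/d); the helper names rope and dot and the example vectors are made up for illustration.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to `position`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # each pair gets its own frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at two very different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(abs(score_a - score_b))  # differs only by floating-point error
```

Shifting both tokens by the same amount leaves the attention score unchanged, which is the sense in which the score depends only on the distance between them.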
ROS / ROS 2
Robot Operating System — robotics middleware (ROS 2 is the modern version)
Router model
A small, cheap model that sits at the front of a serving stack and decides which model should answer each request — for example, sending an easy question to a fast 1B model and only escalating hard ones to a slow, expensive 70B model. Like a hospital triage nurse who handles simple cases on the spot and forwards the serious ones to a specialist, it saves money because most queries never need the biggest model.
RRT
Rapidly-exploring Random Tree — single-query sampling-based planner
SAC
Soft Actor-Critic: a maximum-entropy, off-policy algorithm for continuous control; a common default choice for that setting
SAE
Sparse Autoencoder — interpretability tool decomposing activations into monosemantic features
Sample
A single example in a dataset or batch — one sentence, one image, one prompt. If a batch is a carton of eggs, a sample is one egg. The word can confuse beginners because sampling in text generation means something else entirely (randomly drawing the next token); here it simply means "one item."
Sampler
The component that decides the order in which a DataLoader visits dataset examples (e.g. random, sequential, or class-weighted).
Sampling
Drawing the next token from the model's predicted probability distribution instead of always taking the most likely one; temperature, top-k, and top-p control how random the choice is.
Sandbox
An isolated, throwaway environment (like a fenced-off playground) where an agent or program can run commands, create files, and make mistakes without affecting your real computer. If the agent breaks something inside the sandbox, you just throw the sandbox away; nothing outside it is touched. Containers (like Docker) and virtual machines are common ways to build one.
Scale-and-shift
A two-step tweak applied to a layer's activations: multiply every value by a learned scale and then add a learned shift, the operation y = scale × x + shift. It is exactly like the brightness and contrast sliders on a photo editor: scale stretches or squashes the range (contrast), and shift nudges everything up or down (brightness). The scale and shift are usually the per-channel weight and bias a normalization layer learns; when they are instead predicted from a condition such as a class label, you get conditioning schemes like AdaGN, AdaIN, and AdaLN.
Scaling laws
The empirical finding that a model's loss drops in a smooth, predictable curve as you add parameters, training data, and compute — like a growth chart that lets you forecast a bigger model's quality from smaller ones before you ever build it.
Scene detection
Automatically finding the "cuts" in a video — the hard jumps where the footage switches from one shot to another — so a long video can be split into clean single-shot clips. It works by watching for a sudden, large change between two adjacent frames, measured by something like the difference in their color histograms (a tally of how many pixels fall into each color bucket) or in deep features. Analogy: flipping through a photo album and starting a new pile every time the picture suddenly looks completely different. Example: a 90-minute movie might be split into roughly 1,500 single-shot clips, each safe to use as a training example because the motion inside it is continuous rather than spanning an editing splice.
Scheduler
The part of an inference server that decides, at every step, which requests to start, which to keep generating, and which to pause when memory runs low — like an air-traffic controller choosing which planes take off, keep flying, or circle, so the runway (the GPU) is always busy but never overloaded. A good scheduler is often worth more real-world throughput than any single clever kernel.
Score
The gradient of the log-probability of the data with respect to the input, written ∇_x log p(x). It points in the direction that makes an image more likely under the data distribution — in plain terms, "which way should I nudge these pixels to make this look more like a real image?" Diffusion models implicitly learn this at every noise level, so generation becomes a matter of repeatedly stepping in the score's direction, from noise toward a realistic sample.
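For a distribution simple enough to write down, the score has a closed form. The sketch below uses a 1-D Gaussian as a stand-in for the data distribution (an assumption made purely for illustration; a diffusion model learns the score with a network instead) and shows that repeatedly stepping along the score moves a point toward the high-probability region.

```python
def gaussian_score(x, mean, std):
    """Score of a 1-D Gaussian: d/dx log p(x) = -(x - mean) / std**2."""
    return -(x - mean) / std ** 2

def follow_score(x, mean, std, step_size=0.1, steps=50):
    """Take small deterministic steps in the direction of the score."""
    for _ in range(steps):
        x = x + step_size * gaussian_score(x, mean, std)
    return x

start = 5.0
end = follow_score(start, mean=2.0, std=1.0)
print(start, "->", round(end, 3))  # the point drifts toward the mean at 2.0
```

Pure score-following like this only climbs to the most likely point. Practical samplers such as Langevin dynamics also inject noise at every step so that the result is a sample from the distribution rather than its peak.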
Score matching
A way to train a generative model by teaching it the score (the gradient of log-density, "which way makes this more likely") instead of the density itself, which avoids ever computing an intractable normalizing constant. The practical version, denoising score matching, sidesteps needing the true score: add a known amount of Gaussian noise to each training example and have the network predict the direction back to the clean point, which, up to a known scale factor, matches the score of the noised data on average. (A relative, sliced score matching, estimates it instead by checking random one-dimensional projections.) Once the score is learned, you generate by following it with Langevin dynamics. This is the lens that reveals diffusion models as score estimators trained at many noise levels.
Scratchpad
A temporary, fast-access workspace where intermediate results are stashed so they don't have to be recomputed later. Like a math student's scratch paper next to an exam: jot the partial sums, look them up later, move on much faster than redoing each calculation. In serving, the KV cache is the model's scratchpad — every key and value it has already computed sits there ready to be reused on the next decode step.
SD3
Stable Diffusion 3 — the 2024 release of Stable Diffusion from Stability AI that switched the architecture to an MMDiT transformer and trained it with rectified flow instead of the older U-Net-plus-DDPM recipe of earlier versions. Letting text and image tokens share the same attention layers, and feeding prompts through both CLIP and a large T5 text encoder, gave it noticeably better prompt-following and spelling than SD1.x/SDXL. Think of it as the bridge release that moved Stable Diffusion from the U-Net era into the modern transformer-and-flow era that Flux then built on.
SDE (stochastic differential equation)
An equation describing how something evolves over time under both a predictable push (the "drift") and continuous random jitter (the "diffusion") — like the path of a pollen grain carried by a current while being constantly buffeted by water molecules. A diffusion model can be written as an SDE that gradually turns an image into noise; reversing that SDE turns noise back into an image. The reverse SDE has a deterministic twin with identical statistics, the probability flow ODE, and the two standard noising conventions are the VP and VE SDE families.
SDF
Signed Distance Field — scalar field giving distance to nearest obstacle (negative inside)
SDXL Turbo
A speed-tuned version of Stable Diffusion XL that produces a usable image in a single step (or just a few), instead of the usual 20–50. It was created with Adversarial Diffusion Distillation (ADD), which trains a fast "student" model under the eye of a GAN-style discriminator that rejects any quick output which doesn't look real — keeping the picture sharp despite the shortcut. Like a chef who learns to plate a dish in seconds because a tough critic tastes every rushed attempt. The trade-off: near-instant generation, with slightly less fine detail and variety than the slow original.
SE(3) / SO(3)
Special Euclidean / Orthogonal group — rigid-body motions / rotations in 3D
Seed
A fixed starting number for a random-number generator; setting the same seed makes random operations (shuffling, initialization, dropout) produce the identical sequence every run.
Segmentation map
A picture that has been divided up so that every pixel is painted a flat color standing for what kind of thing it belongs to — all the "sky" pixels one color, all the "road" pixels another, all the "person" pixels a third. It is like a color-by-numbers outline of a scene: it throws away the photographic detail and keeps only a labeled map of which region is which. (Splitting an image into these labeled regions is called segmentation; the result is the segmentation map, sometimes a segmentation mask.) ControlNet can take one as a conditioning signal so a generated image places each kind of object exactly where its colored region sits — the prompt decides what a "building" looks like, but the map decides where the building goes.
Sentence embedding
A single dense vector that captures the meaning of an entire sentence (or short passage), so two sentences about the same topic end up close together in vector space even if they use completely different words. Think of it as a GPS coordinate for meaning — two sentences that "mean the same thing" land near the same point on the map. Sentence embeddings are the backbone of semantic search in RAG: you embed the user's question and every stored passage, then find the passages whose coordinates are closest.
Self-consistency
Sampling many independent chain-of-thought solutions to the same problem and taking a majority vote on the final answer — like asking several people to solve a puzzle on their own and trusting the answer most of them land on.
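The voting step itself is tiny. A sketch, assuming the final answers have already been extracted from each sampled chain of thought as strings (the extraction step and the function name majority_answer are illustrative, not part of any specific library):

```python
from collections import Counter

def majority_answer(final_answers):
    """Return the most common final answer; ties go to the answer seen first."""
    return Counter(final_answers).most_common(1)[0][0]

# Hypothetical final answers from five independently sampled solutions.
sampled = ["42", "41", "42", "42", "7"]
winner = majority_answer(sampled)
print(winner)
```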
Self-distillation
A twist on distillation where the "teacher" and the "student" are the same model instead of a big teacher and a smaller student — the network learns by trying to match its own output on a slightly different view of the same input. Like checking your work by solving a problem a second way and forcing the two answers to agree: there is no answer key and no smarter tutor, so the network teaches itself just by staying consistent. Concrete example: in DINOv2, a "teacher" copy (which is just a slowly-updated running average of the "student") looks at one crop of a photo while the student looks at a different crop, and the student is trained to reproduce the description the teacher gave — so the model learns features that stay the same when an object is moved or cropped, all with no human labels. Because the teacher is only a smoothed copy of the student, this is a form of self-supervised learning, and the slow averaging is what stops the network from cheating by collapsing to one constant answer for every image.
Self-supervised
Learning from raw, unlabeled data by inventing the labels from the data itself — for example hiding part of an input and asking the model to predict the missing piece, or asking whether two altered views came from the same original. No human annotation is needed, so the model can train on billions of images or sentences nobody had to tag. Like learning a language by covering up words in books you already own and guessing them, instead of paying a tutor to quiz you. This is how DINOv2 learns vision features and how the masked- and next-token objectives behind most LLMs work; contrast it with supervised training, which needs an answer key.
SFT
Supervised Fine-Tuning — train on demonstration data with cross-entropy
SGD
Stochastic Gradient Descent — updates parameters by subtracting a scaled gradient computed on a mini-batch; the simplest optimizer and the basis for more advanced methods
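A minimal sketch of the update rule on a toy problem: fitting a single weight w so that w * x matches y. The dataset, learning rate, and batch size are arbitrary illustrative choices.

```python
import random

def sgd_step(w, batch, lr):
    """One SGD update for the model y ≈ w * x under mean squared error."""
    # Gradient of mean((w*x - y)^2) with respect to w, over the mini-batch only.
    grad = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
    return w - lr * grad  # subtract the scaled gradient

rng = random.Random(0)
data = [(float(x), 2.0 * x) for x in range(1, 9)]  # toy data on the line y = 2x

w = 0.0
for _ in range(200):
    batch = rng.sample(data, 2)  # a random mini-batch of two examples
    w = sgd_step(w, batch, lr=0.01)
print(round(w, 4))  # approaches the true slope of 2
```

Each step sees only two of the eight examples, which is what makes the gradient "stochastic"; on noisy data the estimate would jitter around the best value instead of settling exactly.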
sglang
An open-source LLM serving runtime that pairs fast inference (via RadixAttention prefix sharing) with first-class constrained generation — built-in regex / JSON / grammar constraints applied at decode time. Plays a similar role to vLLM but is the popular pick when reliable structured output (function calls, tool use, schema-conformant JSON) matters most.
Shape
The size of a tensor along each dimension; the tuple returned by .shape
Sharding
Splitting a dataset (or model) into many smaller pieces so they can be stored, loaded, or processed in parallel.
SigLIP
Sigmoid-loss CLIP — a CLIP variant that swaps CLIP's batch-wide softmax contrastive loss for a sigmoid loss, which scores each image–text pair on its own as an independent yes/no match. Judging pairs one at a time means it trains well even with small batches, where CLIP needs very large ones to gather enough negatives to compare against. SigLIP 2 (2025) extends it with better data and multilingual training.
Sigmoid loss
A training loss that scores each example with one simple, independent yes/no question — "should these two things match?" — instead of making examples compete against each other. It runs the model's raw match score through the sigmoid function, an S-shaped curve that squashes any number into a probability between 0 and 1 (very negative → near 0, very positive → near 1, zero → 0.5), then rewards the model when a true pair lands near 1 and a mismatched pair lands near 0. Like grading each true/false exam question on its own merits, instead of ranking every student in the room against one another — which is what softmax-based losses such as CLIP's InfoNCE do. How it is computed: for each pair take the label y (1 if they truly match, else 0) and the predicted probability p = sigmoid(score), then add up the cross-entropy of that single decision, −[y·log p + (1−y)·log(1−p)], independently over every pair. Because each pair is judged alone rather than against a whole batch, training still works with small batches, unlike softmax losses that need many examples per batch to compare against. This is the loss behind SigLIP.
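The computation described above can be sketched directly. This version uses plain Python floats and is not numerically hardened for very large scores; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_pair_loss(scores, labels):
    """Sum of independent binary cross-entropies, one per pair.

    scores: raw match scores, one per pair
    labels: 1 if the pair truly matches, else 0
    """
    total = 0.0
    for score, y in zip(scores, labels):
        p = sigmoid(score)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

labels = [1, 0]  # first pair is a true match, second is a mismatch
good = sigmoid_pair_loss([4.0, -4.0], labels)  # confident and correct
bad = sigmoid_pair_loss([-4.0, 4.0], labels)   # confident and wrong
print(good, bad)
```

Because every pair contributes its own term, the loss for one pair does not change when other pairs are added to or removed from the batch.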
SiLU
Sigmoid Linear Unit — just another name for Swish, the activation x · σ(x). The two words mean the exact same function: you will see "SiLU" in code (PyTorch's nn.SiLU) and "Swish" in papers.
SIMT
Single Instruction Multiple Threads; NVIDIA's execution model
Skip-and-add logic
A design pattern in neural networks where a signal bypasses a layer unchanged and is then added back to the layer's output. Think of it like a chef tasting a soup that already has a good base flavor (the "skip" part, where the main base is kept), and deciding to just stir in a pinch of salt (the "add" part) to improve it, rather than throwing the soup out and cooking a new one from scratch. Because the main signal flows straight through, the layer only has to figure out the small correction (the residual) needed to make it better. This keeps information flowing easily in very deep networks.
SLAM
Simultaneous Localization and Mapping
SLI
Service Level Indicator — the actual measured number for how well a service is doing, such as the real percentage of requests that succeeded or answered within 500 ms. The SLO is the target; the SLI is the measurement you compare against it — like the speedometer reading (SLI) versus the posted speed limit (SLO).
SLO
Service Level Objective — a specific, measurable promise about how a service should perform, such as "p95 TTFT under 500 ms" or "99.9% of requests succeed." It is the target you design toward and get alerted on, like a delivery company promising most parcels arrive within two days. The measured reality you check it against is the SLI, and the slack it allows for failure is the error budget.
SM
Streaming Multiprocessor; the GPU's "core"
Soft gate
A multiplier that scales each value by some continuous amount, classically between fully off (0) and fully on (1), instead of the hard either/or of a switch that is only ever 0 or 1. Picture a dimmer knob rather than a light switch: a hard gate can only block a signal or let it through untouched, but a soft gate can pass 0.3 of it, or 0.8, dialing each value partly up or down. In a SwiGLU layer the gate amounts come from squeezing one projection of the input through a smooth activation function like Swish, whose output slides continuously rather than snapping between two settings (and, unlike a sigmoid gate, is not confined to the 0-to-1 range). That smoothness is what lets the network learn the right gate values from clean gradients.
softmax
The function that turns a vector of scores into a probability distribution — each value squeezed into 0–1, and all of them summing to 1; the core of attention and classification heads. The name means a soft version of max: instead of the hard "winner takes all" of argmax, which hands the single biggest score 100% and the rest nothing, softmax gives most of the weight to the biggest score while still leaving a little for the others. That smoothness — a dimmer switch rather than an on/off toggle — is what lets the model be trained by gradients.
Spatiotemporal attention
An attention pattern for video in which every token attends to every other token across both space and time at once — all positions in all frames mixed in a single shared attention operation. This is the most expressive way to model motion, because it can directly relate any pixel in any frame to any other, but its cost grows quadratically with the total number of tokens T×H×W, so it becomes very expensive as clips get longer or larger. Contrast (2+1)D, which splits spatial and temporal attention into two cheaper separate steps, and windowed attention, which restricts the joint attention to small local 3D windows. Sora-class models can afford full spatiotemporal attention only because a 3D VAE first shrinks T×H×W aggressively before attention ever runs — moving the expense out of attention and into the compressor.
Spatiotemporal patches
The 3D version of image patches: instead of cutting one frame into flat 2D squares, a video is cut into little boxes that span a small image region and a few consecutive frames, so each box (also called a tubelet) captures appearance and motion at once. Each box becomes one token for a transformer, so movement is baked into the input from the start rather than reconstructed later from separate frames; TubeViT is a model built this way. Like cutting a flip-book into small columns that each go down through several pages — one cut shows how that corner of the picture changes over time. Example: a 16-frame clip cut into 2×16×16 patches (2 frames deep, 16×16 pixels wide) becomes a sequence of motion-aware tokens. The trade-off is that 3D boxes multiply the token count fast, raising compute — contrast plain patchification, which slices a single still image.
Special tokens
Reserved vocabulary entries that mark structure rather than text — e.g. <bos>, <eos>, <pad>, and chat-boundary tokens like <|im_start|>
Speculative decoding
A trick to make decode faster without changing the output: a small, fast "draft" model guesses the next few tokens, and the big "target" model checks all of them in a single parallel pass, keeping the guesses up to the first one that differs from what it would have produced and discarding everything after that point. Like an editor who reads a sentence a junior writer drafted and approves the part that is already correct rather than writing every word from scratch. With greedy decoding the answer is identical to what the target alone would say (with sampling, an accept/reject rule preserves the target's distribution), just reached in fewer slow steps. It works because decode is starved for memory bandwidth, so the GPU has spare compute to verify several guesses at once.
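A sketch of the verification step for the greedy case. The callable target_next_token is a hypothetical stand-in for the target model, and the loop calls it once per position for readability; a real system would obtain all of these predictions from one batched forward pass.

```python
def verify_draft(prefix, draft_tokens, target_next_token):
    """Greedy speculative verification.

    Accept draft tokens while they match the target's own choice; at the first
    mismatch, substitute the target's token and stop. If every guess matches,
    the target contributes one extra token.
    """
    context = list(prefix)
    accepted = []
    for guess in draft_tokens:
        expected = target_next_token(context)
        if guess != expected:
            accepted.append(expected)  # target's correction replaces the bad guess
            return accepted
        accepted.append(guess)
        context.append(guess)
    accepted.append(target_next_token(context))  # bonus token after a clean sweep
    return accepted

# Toy "target model": always continues with (last token + 1) mod 10.
def toy_target(context):
    return (context[-1] + 1) % 10

partly_right = verify_draft([1, 2], [3, 4, 9], toy_target)
all_right = verify_draft([1, 2], [3, 4, 5], toy_target)
print(partly_right, all_right)
```

Either way the emitted tokens are exactly what the target would have produced on its own, so the speed-up comes only from how many tokens each target pass yields.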
Stable Diffusion
The best-known open-source diffusion model for turning a text prompt into an image (first released by Stability AI in 2022). Its key trick is to do the slow denoising work in a small compressed space (the latent space of a VAE) rather than on full-size pixels — like sketching a scene as a rough thumbnail first and only blowing it up to full resolution at the very end — which makes it light enough to run on a single consumer GPU. Because the weights were released publicly, it sparked a huge ecosystem of fine-tunes and add-ons such as LoRA, ControlNet, and DreamBooth.
Stable Video Diffusion (SVD)
Stability AI's open-weights image-to-video (I2V) model (2023), the canonical baseline for turning a single still image into a short clip. It is built by temporal inflation: it starts from a pretrained Stable Diffusion image model and adds new time-aware layers that learn motion, so it inherits Stable Diffusion's strong sense of appearance and mainly has to learn how things move. Released in two variants (one tuned to generate 14 frames, one for 25), it conditions on the input image (not text), which makes it the easiest strong model to run for hands-on I2V experiments. It also exposes a motion score input to control how much movement the clip contains.
State dict
A Python OrderedDict that maps every parameter and buffer name to its tensor value; the standard format for saving, loading, and transplanting PyTorch model weights
Static quantization (PTQ)
A quantization method that converts both weights and activations to int8 before serving, using a calibration pass to fix the activation scales in advance.
STFT
Short-Time Fourier Transform — a way to find which frequencies are present and when in a signal by chopping it into many short, overlapping windows (say 25 ms each) and running a Fourier transform on each one separately. A plain Fourier transform tells you the frequencies in a whole clip but loses all sense of when they happened; the STFT trades a little frequency precision for time precision by asking the question over and over on tiny slices. The output is a grid of (time × frequency) magnitudes — the raw material a mel spectrogram then refines. Like tapping out a song's rhythm window by window instead of blending the whole piece into one average chord.
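A deliberately slow, dependency-free sketch of the procedure: slide a window along the signal, taper it, and take a discrete Fourier transform of each slice. The Hann taper and the tiny 16-sample window are assumptions chosen to keep the example readable; real audio code would use an FFT library and far longer windows.

```python
import cmath
import math

def stft_magnitudes(signal, window_size, hop):
    """Return one row of frequency magnitudes per overlapping window."""
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        chunk = signal[start:start + window_size]
        # Hann taper (an assumed choice) softens the window edges.
        tapered = [
            x * 0.5 * (1 - math.cos(2 * math.pi * n / (window_size - 1)))
            for n, x in enumerate(chunk)
        ]
        row = []
        for k in range(window_size // 2 + 1):  # non-negative frequency bins
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / window_size)
                for n, x in enumerate(tapered)
            )
            row.append(abs(coeff))
        frames.append(row)
    return frames

# Toy signal: a low tone for the first half, a higher tone for the second half.
low = [math.sin(2 * math.pi * 2 * n / 16) for n in range(32)]
high = [math.sin(2 * math.pi * 6 * n / 16) for n in range(32, 64)]
grid = stft_magnitudes(low + high, window_size=16, hop=8)

first_peak = grid[0].index(max(grid[0]))
last_peak = grid[-1].index(max(grid[-1]))
print(len(grid), first_peak, last_peak)
```

The strongest bin moves from a low frequency in the early windows to a higher one in the late windows, which is exactly the "which frequencies, and when" information a single whole-clip Fourier transform would lose.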
Stop-string
A user-supplied substring that tells the server "as soon as the generated text contains this, stop." Matched on the decoded text, not the raw token IDs, because the same letters can land in different BPE tokens depending on what came before — so the matcher has to keep a small rolling window of recent output and check for the string at every step.
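A sketch of such a matcher, assuming the server receives already-decoded text pieces one at a time. The function name and the sample pieces are hypothetical, and for simplicity the sketch returns the final text rather than streaming it; a real server would also hold back any trailing characters that could be the start of a stop string before flushing them to the client.

```python
def generate_until_stop(decoded_pieces, stop):
    """Accumulate decoded text and cut it at the first occurrence of `stop`.

    Returns (text_before_stop, stopped_early).
    """
    text = ""
    for piece in decoded_pieces:
        text += piece
        # A new match must overlap the piece just added, so only the tail
        # of the text needs to be searched (the "rolling window").
        window_start = max(0, len(text) - len(piece) - len(stop) + 1)
        index = text.find(stop, window_start)
        if index != -1:
            return text[:index], True
    return text, False

# The stop string "\nUser:" is split across three separate pieces here.
pieces = ["Hello", " wor", "ld\n", "Us", "er:", " hi"]
result = generate_until_stop(pieces, stop="\nUser:")
print(result)
```

Matching on the accumulated text is what lets the stop string be caught even though no single piece contains it.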
Storage
The 1-D buffer that a tensor is a view into
Straight-through estimator
A trick for training through a step that has no usable gradient — such as the nearest-codebook-entry lookup in a VQ-VAE. On the forward pass the hard, non-differentiable operation runs as usual; on the backward pass the model simply pretends that step was the identity and passes the gradient straight through unchanged. It is like sketching along a ruler and then erasing the ruler's marks: the rough step shapes the result, but learning flows as if it were never there.
Streaming
Sending the model's reply to the client one piece at a time as it is generated, instead of waiting for the whole answer and then returning it in a single response. Over HTTP this is usually done with Server-Sent Events (SSE) or chunked transfer encoding; the connection stays open and the server flushes each new token as soon as it is sampled. Like a waiter who brings each course out as it leaves the kitchen rather than holding the whole meal until dessert is ready: the user's wait for the first visible output drops to the TTFT, even though total generation time is the same.
Stride
The number of storage elements to step over for each dimension of a tensor
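How strides connect a multi-dimensional index to a position in the flat storage buffer can be sketched as follows, assuming the usual row-major (contiguous) layout; the helper names are illustrative.

```python
def contiguous_strides(shape):
    """Strides, in elements, of a row-major tensor with the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)  # the last dimension moves one element at a time
        step *= size
    return tuple(reversed(strides))

def storage_index(index, strides):
    """Position in the flat 1-D storage of the element at `index`."""
    return sum(i * s for i, s in zip(index, strides))

strides = contiguous_strides((2, 3, 4))
position = storage_index((1, 2, 3), strides)
print(strides, position)
```

For a (2, 3, 4) tensor, stepping once along the first dimension skips a whole 3 × 4 block of 12 elements, stepping along the second skips a row of 4, and stepping along the last moves to the neighboring element.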
StyleGAN
A family of GANs (StyleGAN, StyleGAN2, StyleGAN3) famous for photorealistic faces; they are the models behind sites like thispersondoesnotexist.com. Instead of feeding the random noise vector straight into the generator (where the fixed Gaussian shape of the noise distribution tangles attributes together), it first passes the noise through a mapping network that "irons out" the warped space into an intermediate W latent space. It then injects this unwarped style code into every generation layer through adaptive instance normalization. This design "disentangles" the latent space, so moving in one direction smoothly changes a single attribute (hair, age, lighting) while leaving the rest largely unchanged.
StyleGAN2
An improved version of StyleGAN that fixes visual artifacts such as water-droplet-like blobs. It does this by replacing adaptive instance normalization (AdaIN) with weight demodulation, which folds the style scaling and its normalization into the convolution weights instead of normalizing the feature maps directly. Think of it as upgrading from a good camera that sometimes leaves dust spots on the lens to one that takes much cleaner photos.
StyleGAN3
The third generation of the StyleGAN family, which focuses on fixing "texture sticking" — a problem where textures like hair or wrinkles would stay glued to the screen coordinates even as the face moved. It achieved this by making the entire network "alias-free," ensuring that when the underlying features move, the generated pixels move perfectly with them, like a seamless video rather than a sequence of loosely connected frames.
Super-resolution
Turning a low-resolution image or video into a higher-resolution one by inventing the missing fine detail — not merely stretching the pixels (which only blurs them) but hallucinating plausible texture and sharp edges that were never in the small version. A diffusion-based super-resolution model is trained by taking sharp images, shrinking them, and learning to reconstruct the originals while conditioned on the small input. Like an artist handed a thumbnail and asked to repaint it at poster size, filling in detail consistent with what the thumbnail implies. It is the upscaling stage in a cascaded diffusion pipeline, and "super" simply means resolving detail finer than the input's resolution seemed to allow.
SWE (Software Engineering)
Short for Software Engineering — the discipline of building, testing, and maintaining software systems. In the AI/LLM context, "SWE" usually appears in compound terms like SWE-bench or "SWE-style agent," meaning an agent that does the kind of work a human software engineer does: reading code, diagnosing bugs, writing fixes, and running tests.
SWE-bench
Short for Software Engineering Benchmark — a benchmark of real GitHub issues paired with the code changes that fixed them; an agent is judged by whether its edits make the project's test suite pass, which makes it the standard test of coding agents.
Sweep
Training the same model many times while changing one setting across a range of values, then comparing results to pick the best — for example trying ten different learning rates and keeping the winner. Like tasting a sauce as you add salt in small steps to find the amount you like, rather than guessing the whole spoonful at once. A sweep is how you turn a hyperparameter hunch into a measured choice.
SwiGLU
The activation used in most modern transformer MLPs: a GLU gate whose non-linearity is Swish, written (xW) · Swish(xV). In plain terms, the input is projected two ways — one path is the content, the other is squeezed through Swish to become a soft gate — and the two are multiplied so the gate dials each value up or down. It replaced plain ReLU feed-forward layers because, for the same size, it tends to learn a little better; it is the default FFN in Llama-style models.
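The formula (xW) · Swish(xV) can be sketched with plain lists. The tiny matrices are illustrative, and the output projection that follows in a full feed-forward block is omitted.

```python
import math

def swish(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, matrix):
    """Compute x @ matrix, where `matrix` is a list of rows (in_dim x out_dim)."""
    return [
        sum(x[i] * matrix[i][j] for i in range(len(x)))
        for j in range(len(matrix[0]))
    ]

def swiglu(x, W, V):
    content = matvec(x, W)                    # the "content" path
    gate = [swish(g) for g in matvec(x, V)]   # the soft gate
    return [c * g for c, g in zip(content, gate)]

x = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

open_gate = swiglu(x, identity, identity)  # gate = Swish(x), content = x
closed_gate = swiglu(x, identity, zeros)   # Swish(0) = 0 shuts everything off
print(open_gate, closed_gate)
```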
Swish
A smooth activation function, x · σ(x), also called SiLU. It does roughly the same job as ReLU — squashing large negatives toward 0 and passing positives through — but with a gentle curve instead of a sharp corner (and it even dips a little below 0 for small negatives before recovering). Think of a soft-closing drawer that eases shut instead of slamming at exactly zero: that smoothness gives the network cleaner gradients to learn from. It is the non-linearity used as the gate inside SwiGLU.
Synthetic captions
Replacing an image's original web alt-text (which is often missing, keyword-spammed, or unrelated to the picture) with a fresh, detailed caption written by a VLM that actually looks at the image and describes it ("a golden retriever catching a frisbee on a beach at sunset"). Also called recaptioning. Training a text-to-image model on these cleaner descriptions substantially improves how faithfully it follows prompts; it is widely credited as a major reason DALL·E 3 became so good at following detailed prompts. Like re-cataloguing a library where half the books were shelved under the wrong title: once every spine is relabeled to match its contents, readers (here, the model) finally learn which words map to which pictures. Example: an image whose alt-text was "IMG_2025.jpg" gets a full descriptive sentence before it is used for training.
SynthID
Google DeepMind's watermarking tool that hides an invisible, detectable signal inside AI-generated content — images first, later audio, text, and video — so it can be identified as machine-made without any visible change to the picture. Rather than stamping the pixels afterward, it can weave the mark into the generation process itself, which helps it survive cropping, resizing, and JPEG compression. Like a secret ink woven into the paper of a banknote: you can't see it, but the right detector lights it up instantly. It is one practical answer to the safety problem of telling real photos from synthetic ones as AI images flood the web.
System prompt
A message placed at the very start of a chat conversation that tells the model how to behave — its role, tone, rules, and the tools it can call — before the user's first turn ever arrives. Like a stage director's note to an actor before the curtain rises: "You're a polite customer-support agent who answers only refund questions." System prompts are usually long and shared across many requests, which is why caching their KV state (see prefix cache) saves so much repeated work.
Systolic array
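A grid of multiply-accumulate units through which operands flow in lockstep from neighbor to neighbor; the data-flow matmul fabric used in TPUs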
T2V
Text-to-Video: generating a video clip from a text prompt alone, with no image to start from — the model must invent both what the scene contains and how it moves. This makes it harder than image-to-video (I2V), where the first frame is given, and it needs paired text–video training data, which is scarce. Sora, Veo, and Kling are well-known T2V systems.
T5
A text transformer (Google's "Text-to-Text Transfer Transformer") that reads a sentence and produces rich embeddings of its meaning. Unlike CLIP's text encoder, which was trained only to match images to short captions, T5 was trained on general language tasks, so it captures long, detailed prompts and word order more faithfully — which is why models like Imagen, SD3, and Flux feed it (often the large "T5-XXL" variant) into cross-attention for better prompt adherence.
Tail latency
The latency of the slowest requests (for example the p95 or p99 percentiles) rather than the median (p50); it is what users notice most.
Target model
In speculative decoding, the big, accurate model whose output you actually want — it checks the small draft model's guesses and has the final say on every token. Like the senior editor who must approve the assistant's draft: slow and expensive to consult, so the trick is to bother it as rarely as possible while still letting it decide the real answer.
TCP
Tool Center Point: the configurable reference point on a robot's tool (for example, the tip of a gripper) whose pose the motion controller tracks
TD error
δ_t = r_t + γV(s_{t+1}) − V(s_t) — the signal that drives every TD update
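A minimal sketch of the formula above, followed by the standard TD(0) value update it drives. The step size `alpha` and the function names are assumptions introduced for illustration.

```python
def td_error(reward: float, gamma: float, v_next: float, v_current: float) -> float:
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_current

def td0_update(v_current: float, delta: float, alpha: float) -> float:
    """Move the current value estimate a small step in the direction of the TD error."""
    return v_current + alpha * delta

# One transition: reward 1.0, discount 0.9, V(s') = 2.0, V(s) = 2.5.
delta = td_error(reward=1.0, gamma=0.9, v_next=2.0, v_current=2.5)
new_value = td0_update(v_current=2.5, delta=delta, alpha=0.1)
print(f"TD error: {delta:.3f}, updated V(s): {new_value:.3f}")
```

A positive error means the step turned out better than the current estimate predicted, so V(s) is nudged upward; a negative error nudges it down.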
TD3
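Twin Delayed DDPG: DDPG plus three stability fixes (twin critics whose smaller estimate is used as the target, delayed policy updates, and target-policy smoothing)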
Temperature
A sampling knob that scales the model's scores before softmax: low temperature (e.g. 0.2) sharpens the distribution so the model plays it safe and repeats the likeliest words, while high temperature (e.g. 1.5) flattens it so rarer, more surprising words can win. Think of it as a creativity dial — turn it down for factual answers, up for brainstorming. The same knob appears in contrastive learning (written τ): there the similarity scores are divided by τ before the softmax, so a small τ (CLIP learns one starting around 0.07) sharpens the contest and forces the model to focus on its hardest negatives, while a large τ softens it — too small destabilizes training, too large and even the true pair is barely preferred.
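The effect of the knob can be seen in a small pure-Python sketch that divides the scores by the temperature before the softmax. The function name and the example scores are made up for illustration.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide the scores by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the low setting nearly all probability lands on the top score, while at the high setting the three candidates move closer together.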
Temporal attention
Attention applied only along the time axis of a video: each spatial position (say, the pixel at row 10, column 20) looks at that same position across all the frames and decides how its value should change from frame to frame. It is the half of a (2+1)D block that handles motion, added on top of ordinary spatial attention (which works within each frame separately), and in temporal inflation it is exactly the new layer dropped into a pretrained image model to teach it movement. Like tracking one fixed spot on a flip-book through every page to see how it animates, while ignoring the rest of each page. It is far cheaper than spatiotemporal attention because each position only compares itself across the T frames, not against every other position as well.
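To make the cost argument concrete, the sketch below counts query-key comparisons for a clip of T frames of H × W positions. The clip size is an arbitrary example, and the counts ignore heads, channels, and constant factors.

```python
def attention_pair_counts(T: int, H: int, W: int) -> dict[str, int]:
    """Count query-key comparisons for three attention layouts over a T x H x W grid."""
    positions = H * W          # spatial positions per frame
    tokens = T * positions     # every position in every frame
    return {
        # each spatial position compares itself across the T frames
        "temporal": positions * T * T,
        # each frame attends within itself only
        "spatial": T * positions * positions,
        # every token compares against every other token
        "spatiotemporal": tokens * tokens,
    }

counts = attention_pair_counts(T=16, H=32, W=32)
for name, pairs in counts.items():
    print(f"{name:>15}: {pairs:,}")
```

For this example the temporal layer needs 262,144 comparisons against 268,435,456 for full spatiotemporal attention, a factor of H × W = 1,024.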
Temporal flicker
The shimmering, pulsing, or boiling look you get when a video is processed one frame at a time with no coordination across frames. Each frame is reconstructed (or generated) slightly differently from its neighbors — tiny independent errors in texture, color, or brightness — and because the real scene barely changed, your eye reads those frame-to-frame differences as unwanted motion. It is the classic failure of running an image VAE per frame, and the main reason video needs compressors and models that span the time axis (a 3D VAE or temporal attention) rather than treating a clip as a stack of unrelated stills.
Temporal inflation
The dominant trick for making a video model out of an existing image model: take a pretrained 2D network, insert new layers that operate along the time axis (temporal convolutions or temporal attention), and usually initialize them as an identity (pass-through) so that, at the start of training, the inflated model behaves exactly like the original image model run frame by frame. You then fine-tune so the new layers gradually learn motion while the spatial layers keep everything they already knew about appearance. "Inflation" captures the picture of taking a flat 2D model and puffing it out into the third (time) dimension. Stable Video Diffusion, AnimateDiff, and Make-A-Video all use variants of this; the 2024+ frontier (Sora-class models) instead trains spatiotemporal models from scratch.
Tensor
A grid of numbers — the basic container deep learning uses for almost everything. A single number is a 0-D tensor, a list of numbers is 1-D (a vector), a table is 2-D (a matrix), and you can keep stacking into 3-D and beyond — for example a color image is a 3-D tensor of height × width × 3 color channels. Under the hood it is a (storage, shape, stride, offset, dtype, device, requires_grad) tuple viewing a 1-D storage buffer, but the everyday idea is simply "an N-dimensional array of numbers the GPU can crunch in parallel."
Tensor Core
Specialized matmul unit in NVIDIA GPUs since Volta
Tensor parallelism (TP)
Splitting each layer's weights across several GPUs so they each do part of the math, then combining their partial results with an all-reduce. The combining happens at two natural seams in every transformer block, once at the end of the attention sublayer and once at the end of the MLP sublayer (the "attention/MLP boundaries"): within a sublayer the GPUs can work independently, but at its edge their pieces must be added back together before the next step can start. Like four cooks each preparing part of a dish and merging everything at two fixed points before it moves on.
TFLOPs
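Tera (10¹²) floating-point operations. As a hardware rating it means trillions of floating-point operations per second, often written TFLOPS or TFLOP/s to distinguish the rate from a raw operation count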
TGI
Short for Text Generation Inference — Hugging Face's open-source LLM serving engine, similar in role to vLLM. It implements continuous batching, PagedAttention, and quantized inference behind a simple HTTP API, and is one of the two engines most commonly used to put an LLM in front of real users.
Thinking budget
A cap on how many tokens a reasoning model is allowed to spend thinking before it must give an answer — like telling a student "you have ten minutes of scratch work, then write your answer." It lets a serving system trade accuracy for cost and latency: a bigger budget usually means better answers on hard problems but slower, pricier responses.
Text encoder
The part of a model that turns a piece of text into numbers — a list of embeddings that capture what the words mean — so the rest of the system can work with language as math. Think of it as a translator that reads your prompt and rewrites it in the only language a neural network understands: vectors. In a text-to-image model the text encoder reads the prompt once and the diffusion model then keeps glancing at those vectors (via cross-attention) to decide what to draw. It is one half of a two-part model like CLIP — the side that reads words, paired with an image encoder that reads pictures — but the term is more general: any model that maps text to embeddings (CLIP's text tower, T5, a BERT-style encoder) is a text encoder, which is why it deserves its own name rather than being conflated with CLIP as a whole.
Text rendering
The ability of an image generator to draw legible, correctly spelled words inside the picture: a shop sign that actually reads "OPEN," not "OPNE" or wavy gibberish. It was for years the field's most visible failure, because a model trained only to match overall image statistics has no spelling checker: it learns the shapes of letters but not that their exact order matters. Modern models (Imagen 3, Flux, Ideogram) largely fixed it with dedicated training data and stronger text encoders. Like a painter who can flawlessly reproduce the look of handwriting in a language they cannot read: beautiful strokes, but misspelled words until they are taught the alphabet itself. Example test: prompt "a neon sign that says 'DIFFUSION'" and check whether all nine letters appear, in order, spelled right.
Text-to-image
A model that turns a written prompt into a brand-new picture — you type "a corgi astronaut floating in space" and it paints one from scratch. Under the hood a text encoder reads your words into numbers, and a generator (usually a diffusion model) uses them to decide what to draw, glancing back at the prompt the whole time via cross-attention. Like a sketch artist who never sees the scene and draws purely from your verbal description — the richer and clearer your words, the closer the result. Famous examples include Stable Diffusion, DALL·E 3, Imagen 3, and Ideogram.
Textual Inversion
A personalization method that teaches a frozen diffusion model a new subject by learning a single new word for it — nothing in the model itself changes. Concretely it optimizes one fresh vector added to the text encoder's embedding matrix (the lookup table of word embeddings) so that this invented "word" makes the model draw your subject. Because only that one vector is trained, the saved file is a few kilobytes — the smallest personalization artifact there is — but its capacity is limited: one vector can pin down a recognizable look yet cannot match the fidelity of LoRA or DreamBooth, since the frozen model can only render what it already knows how to draw. Like adding one new entry to a shared dictionary: you define the word just once, and from then on that single word stands in for your whole subject — but, just like a dictionary, it can only ever be explained using words and ideas the model already understands.
Throughput
How much work is completed per unit of time — for training, the number of examples processed per second.
Tiling
Splitting a large computation into small blocks ("tiles") that fit in fast on-chip memory, so a kernel reads slow memory fewer times.
Token (visual/audio)
A discrete code that stands for a small piece of an image or sound, produced by a VQ-VAE or neural codec. Just as a tokenizer chops text into word-pieces, an image tokenizer turns a picture into a grid of these codes drawn from a fixed vocabulary — so a transformer can model images (or audio) with the same machinery it uses for language.
Tokenizer
The mapping from string to integer IDs; trained, frozen, part of the model contract
Tokens per byte
A measure of tokenizer efficiency: how many tokens it emits per byte of input text; higher means the same text costs more tokens
Tool call
When an LLM, instead of answering directly, emits a structured request to run an external function — search the web, query a database, run code — and then continues once it sees the result. Like a person pausing mid-task to look something up or use a calculator, it is the basic action an agent takes in its loop.
Top-k
A sampling rule that keeps only the k most likely next tokens and draws from those, throwing away the long tail. With k=1 it always takes the single best word (greedy decoding); with k=50 it chooses among the top 50 — like ordering only from a menu's 50 most popular dishes instead of the whole cookbook.
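A minimal sketch of the rule in plain Python: keep the k largest probabilities, renormalize, then draw. The function names and the example distribution are hypothetical; real decoders apply the same idea to vocabulary-sized tensors.

```python
import random

def top_k_filter(probs: list[float], k: int) -> list[float]:
    """Zero out everything except the k most likely tokens and renormalize."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sample(probs: list[float], rng: random.Random) -> int:
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]       # hypothetical next-token distribution
filtered = top_k_filter(probs, k=2)  # only the two likeliest tokens survive
token = sample(filtered, random.Random(0))
print(filtered, token)
```

With k=1 the filtered distribution puts all its mass on the single best token, which reproduces greedy decoding.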
Top-p
Also called nucleus sampling: instead of a fixed count like top-k, it keeps the smallest set of top tokens whose probabilities add up to p (e.g. 0.9), then samples from them. The shortlist automatically grows when the model is unsure and shrinks when it is confident — always keeping just enough candidates to cover 90% of the model's belief.
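The growing and shrinking shortlist can be sketched in plain Python as follows. The function name and both example distributions are made up for illustration.

```python
def top_p_filter(probs: list[float], p: float) -> list[float]:
    """Keep the smallest set of top tokens whose probabilities sum to at least p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:  # the shortlist now covers enough probability mass
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

confident = [0.95, 0.03, 0.01, 0.01]  # model is sure: one token is enough
unsure = [0.3, 0.3, 0.2, 0.2]         # model is unsure: all four are needed
for name, dist in (("confident", confident), ("unsure", unsure)):
    filtered = top_p_filter(dist, p=0.9)
    print(name, sum(1 for q in filtered if q > 0), "candidates kept")
```

Sampling then proceeds from the renormalized shortlist exactly as it would after a top-k filter.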
TOPP
Time-Optimal Path Parameterization — time-parameterize a geometric path under bounds
torch.compile
The PyTorch 2.x API that traces a model into a graph and generates optimized, fused kernels, speeding up eager-mode code with a single call.
torch.export
The modern PyTorch API that captures a model into a standalone graph; the foundation for deployment paths like ExecuTorch and AOTInductor.
torch.multinomial
The PyTorch function that draws a random sample from a probability distribution: hand it a list of probabilities and it rolls a weighted die, returning the index it lands on. A token with probability 0.6 comes up about 60% of the time. It is the "roll the dice" step at the end of sampling — the opposite of argmax, which never gambles. On the GPU each call is its own kernel launch, which is why folding it into the rest of the sampling math can speed up decode.
torchrun
PyTorch's launcher command that starts one process per GPU and sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables those processes need to find each other.
TorchScript
The legacy serialization/IR for PyTorch; superseded by torch.export
Trained or just prompted
A choice in how you create a small helper model (like a router model). You can train it, fine-tuning its weights on thousands of examples (like sending someone to medical school), or just prompt it, taking an existing capable model and giving it a written instruction such as "You are a router, decide if this question is hard or easy" (like handing a smart assistant a checklist). Training takes more effort upfront but is faster and cheaper to run; prompting is quick to set up but costs more per request, since the instructions are processed every time.
Transformer
The decoder-only / encoder-only / encoder-decoder architecture built from attention + MLP blocks
TransformerEngine
NVIDIA's open-source library that automates safe FP8 training and inference on Hopper and Blackwell GPUs — it picks per-tensor scales each step so the low-bit math stays numerically stable. Like a thermostat for low-precision arithmetic: as values drift toward overflow or underflow, it nudges the scale to keep them inside the safe range. Drop-in transformer layers wrap your model and turn FP8 on without the user having to manage the scaling manually.
transpose
Swaps two dimensions by rewriting strides — never copies; the result is usually non-contiguous
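The stride trick can be imitated without any tensor library. The sketch below is a toy model of the idea, not PyTorch's implementation: a flat Python list stands in for the storage, strides are counted in elements, and the helper names are invented here.

```python
storage = [0, 1, 2, 3, 4, 5]      # flat 1-D buffer shared by both views
shape, strides = (2, 3), (3, 1)   # row-major 2 x 3 view of that buffer

def at(strides: tuple[int, int], i: int, j: int) -> int:
    """Read element [i][j] by turning the indices into a storage offset."""
    return storage[i * strides[0] + j * strides[1]]

def transpose(shape: tuple[int, int], strides: tuple[int, int]):
    """Swap the two dimensions by swapping shape and strides; no data moves."""
    return (shape[1], shape[0]), (strides[1], strides[0])

def is_contiguous(shape: tuple[int, ...], strides: tuple[int, ...]) -> bool:
    """True when the strides match a dense row-major layout for this shape."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True

t_shape, t_strides = transpose(shape, strides)
print(t_shape, t_strides)                              # shape (3, 2), strides (1, 3)
print(at(strides, 1, 2), at(t_strides, 2, 1))          # same storage element both ways
print(is_contiguous(shape, strides), is_contiguous(t_shape, t_strides))
```

The transposed view reads the same six numbers in a different order, which is why it no longer passes the contiguity check.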
Tree-of-Thoughts
A reasoning method that explores several partial solutions at once as branches of a tree, scores them, and expands only the promising ones — like working through a maze by trying multiple paths and backing out of dead ends instead of committing to the first turn.
Triage
Sorting cases by what each one needs, borrowed from emergency-room medicine where a nurse classifies arriving patients by severity before any doctor sees them. In LLM evaluation, hallucination triage means sorting model answers into useful buckets — correctly answered, correctly abstained ("I don't know"), confidently wrong (hallucination) — so each rate can be measured separately, instead of collapsing everything into one "accuracy" number that hides which failures are dangerous.
Triangular weights
The triangle-shaped set of multipliers each filter in a mel filterbank uses to blend nearby STFT frequencies into one band. A filter's weight rises linearly from 0 up to 1 at its center frequency and falls back to 0 at its edges, so frequencies near the center count fully and those at the edges barely count — and neighboring triangles overlap so no frequency is dropped. Like a dimmer switch that is brightest in the middle of each band and fades to off at the borders, smoothly handing off to the next band. Example: a triangle centered at 500 Hz might weight 480 Hz at 0.8, 500 Hz at 1.0, and 520 Hz at 0.8, while 400 Hz and 600 Hz get 0; the band's value is the weighted sum of those frequencies' energies.
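The 500 Hz example above can be reproduced with a short sketch. The edges at 400 Hz and 600 Hz come from that example; the function name and the energy values are made up for illustration.

```python
def triangle_weight(freq: float, low: float, center: float, high: float) -> float:
    """Weight that rises linearly from 0 at `low` to 1 at `center`, then falls to 0 at `high`."""
    if freq <= low or freq >= high:
        return 0.0
    if freq <= center:
        return (freq - low) / (center - low)
    return (high - freq) / (high - center)

low, center, high = 400.0, 500.0, 600.0
for f in (400.0, 480.0, 500.0, 520.0, 600.0):
    print(f"{f:.0f} Hz -> {triangle_weight(f, low, center, high):.1f}")

# The band's value is the weighted sum of the energies at those frequencies.
energies = {480.0: 2.0, 500.0: 3.0, 520.0: 1.0}  # hypothetical STFT energies
band = sum(triangle_weight(f, low, center, high) * e for f, e in energies.items())
print(f"band energy: {band:.2f}")
```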
Tripwire
A cheap, fast check whose only job is to sound the alarm the instant something goes wrong — named after the thin wire that, when stepped on, sets off a trap or flare. In model deployment a quick metric like perplexity is used as a tripwire: it won't tell you what broke, but it spikes the moment quality drops, so it catches a bad build before the slower, fuller tests even run.
Triton
A Python-flavored language for writing GPU kernels, developed by OpenAI
Triton Inference Server
NVIDIA's production server for hosting models behind an HTTP/gRPC API, with batching and multi-model support; unrelated to the Triton kernel language despite the shared name.
TTFT
Time To First Token: the elapsed time from when a request arrives at the server until the model returns its first output token, dominated by prefill plus any queue wait. Like a restaurant's "time until your drink arrives": felt separately from the rest of the meal, and the first thing the user actually notices.
U-Net
An encoder-decoder network whose name comes from its U shape: the left arm shrinks the image down to a small, abstract summary while the right arm builds it back up to full size, with skip connections that hand each down-sampling layer's detail straight across to its matching up-sampling layer. Those skips are what let it keep fine pixel detail while still reasoning about the whole image, which is why it became the standard backbone for diffusion models (before DiT brought in transformers). It was originally invented for medical-image segmentation.
UCF-101
A widely used action-recognition video dataset from the University of Central Florida: about 13,000 short YouTube clips spanning 101 human-action categories (playing guitar, applying makeup, bench-pressing, and so on); the "101" is simply the number of action classes. It is small and low-resolution by modern standards, which is exactly why it became a common testbed for early video generation: you can train on it without a data center. It shows up constantly in pre-diffusion video-GAN papers as the dataset everyone reported numbers on.
Underflow
Condition where a floating-point value is too small to be represented and rounds to zero; common with float16 when accumulating very small gradients
URDF / MJCF / USD
Robot and scene description formats: URDF (ROS), MJCF (MuJoCo), and USD (Pixar's Universal Scene Description, used by NVIDIA's Omniverse and Isaac Sim)
User turn
One message a user sends in a chat conversation, paired with the model's reply (the assistant turn). A back-and-forth between user and assistant is a sequence of alternating turns, all under the same opening system prompt. In typical traffic, the system prompt is long and fixed while each user turn is short and varies — which is exactly the pattern a prefix cache exploits.
V2V
Video-to-Video: transforming an existing video into a new one while keeping its motion and timing — for example restyling it into a cartoon, or re-rendering it conditioned on per-frame depth or pose. The hard part is temporal consistency: editing each frame independently makes the result flicker, so V2V methods share information across frames. Contrast with image-to-video (one image in) and text-to-video (text only).
Vanishing gradients
A problem during training where gradients become extremely small, effectively preventing the weights from changing their value and stalling the learning process.
VAE
Variational Autoencoder — an autoencoder whose encoder outputs not a single point but a small cloud of possibility (a mean and a spread) for each input, and whose decoder samples from that cloud to rebuild the image. Training on the ELBO presses those clouds to fit neatly under one standard bell-curve shape, so afterwards you can draw a brand-new point from that shape and decode it into a fresh image the model has never seen. That sampling ability is what makes a VAE a generative model rather than just a compressor.
Validation loss
The loss measured on held-out data the model was not trained on; the honest signal of how well training is generalizing.
Value function
Expected return; V(s) for state-value, Q(s, a) for action-value
Value network
The helper network (the "critic") in some RL algorithms that estimates the value function — its best guess of how much future reward a situation is worth — so the policy can tell whether an action turned out better or worse than expected. PPO trains one alongside the policy, which roughly doubles the networks held in memory; GRPO skips it entirely by comparing each sampled answer to the group's average instead, which is what makes it cheaper.
Vanilla
The plain, unmodified, baseline version of a model or algorithm — no special improvements or extra tricks, just the original idea as first described. Like ordering plain vanilla ice cream with no toppings: it is the default flavor before anyone adds anything extra. In machine learning, "vanilla VAE" means the original VAE from the 2013 Kingma & Welling paper, before later work added hierarchical latents, β controls, or other refinements. Comparing the vanilla version to improved variants is the clearest way to measure what each addition actually buys.
VBench
Comprehensive open evaluation suite for video generation
Verifier
A program that automatically checks whether an answer is correct, for example by running unit tests or comparing to a known math result, giving the exact, hard-to-game reward that RLVR trains on.
Very Deep VAE
A hierarchical VAE (Child, 2021) that scales to dozens of stacked latent variable groups — far more layers than earlier models. Each group only handles a thin slice of the work, with residual-like parameterizations keeping gradients flowing through the depth. Like adding so many floors to a building that no single floor needs to bear much weight, it achieved strong image generation quality, showing that deeper hierarchies can capture richer structure than shallow ones.
Video-CFG
Applying classifier-free guidance (CFG) to a video model that has more than one condition — typically a text prompt and a conditioning image — by giving each condition its own guidance scale instead of one shared dial. You can then push text adherence and image faithfulness independently: strong text guidance to match the prompt, separate image guidance to stay locked to the conditioning frame. The catch unique to video is that turning either scale too high amplifies per-frame detail at the cost of smooth change between frames, so the clip's motion begins to flicker or its colors over-saturate — guidance strength trades against temporal smoothness. This is why production video models expose several guidance knobs rather than the single one image models use.
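One common way to combine two guidance scales, borrowed from InstructPix2Pix-style guidance, is sketched below. It is an assumption that any given video model composes its terms this way; the function name, argument names, and toy numbers are all hypothetical, and plain lists stand in for noise-prediction tensors.

```python
def two_condition_guidance(
    eps_uncond: list[float],  # prediction with neither image nor text
    eps_image: list[float],   # prediction with the conditioning image only
    eps_full: list[float],    # prediction with both image and text
    image_scale: float,
    text_scale: float,
) -> list[float]:
    """Start from the unconditional prediction, then add a separately scaled
    push toward the image condition and toward the text condition."""
    return [
        u + image_scale * (i - u) + text_scale * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]

# Toy two-element "predictions" so the arithmetic is easy to follow.
guided = two_condition_guidance(
    eps_uncond=[0.0, 0.0],
    eps_image=[1.0, 0.0],
    eps_full=[1.0, 2.0],
    image_scale=1.5,
    text_scale=7.5,
)
print(guided)
```

Setting both scales to 1 returns the fully conditioned prediction unchanged, and raising either scale exaggerates only the direction contributed by that condition.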
Video codec
The set of rules for compressing video into a small file and decompressing it back into frames — "codec" is short for coder–decoder, which is literally what it does. Raw video is enormous (a few seconds can be hundreds of megabytes), so almost all real video is stored compressed; codecs exploit the fact that neighboring frames barely change. Analogy: a codec is like shorthand for a movie — instead of writing every frame in full, it writes "same as the last frame, but this corner moved." Examples include H.264 (the universal default) and AV1 (smaller files, slower to decode); the codec lives inside a media container like .mp4.
view
A no-copy alias that shares storage with its source; requires a contiguous-compatible layout
VIO
Visual-Inertial Odometry — fuse camera and IMU for high-rate ego-motion
Video GAN
A GAN adapted to produce short video clips instead of single images: the generator outputs a whole stack of frames at once and the discriminator judges whether the motion, not just each individual frame, looks real. The early family — VGAN, TGAN, MoCoGAN, DVD-GAN, and StyleGAN-V — produced only short, low-resolution clips and suffered badly from mode collapse (the generator falling back on a few safe outputs). Each pushed one idea: MoCoGAN separated content from motion, DVD-GAN was the first to reach plausible quality, and StyleGAN-V applied StyleGAN's latent-space tricks to video. The whole approach was largely abandoned around 2023 once diffusion proved both sharper and far more stable to train at scale.
ViT
Vision Transformer — a transformer that classifies or encodes images by first chopping them into a grid of small square patches (patchification), turning each patch into one token, and then treating the picture exactly like a sentence of words. Because a plain transformer has no built-in notion of "next to" the way a CNN does, a ViT adds a learned positional embedding to each patch (a small vector that says "I am the patch at row 3, column 5") and usually prepends a CLS token whose output becomes the whole-image summary. Like reading a mosaic tile by tile, left to right, instead of taking in the whole wall at once — and, given enough data, this beats CNNs because the model can relate any tile to any other from the very first layer instead of only neighboring pixels. The "B/16" in a name like ViT-B/16 means a Base-size model with 16×16-pixel patches.
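The patch arithmetic is easy to check by hand or in code. The sketch below assumes a square image whose side is a multiple of the patch size; the 224-pixel input is a commonly used resolution picked here as an example, not something fixed by the definition.

```python
def vit_sequence_length(image_size: int, patch_size: int, use_cls_token: bool = True) -> int:
    """Number of tokens a ViT sees for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    return num_patches + (1 if use_cls_token else 0)

# ViT-B/16 on a 224 x 224 image: a 14 x 14 grid of patches plus one CLS token.
print(vit_sequence_length(224, 16))                        # 197
print(vit_sequence_length(224, 16, use_cls_token=False))   # 196
```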
VLA
Vision-Language-Action model — transformer mapping image + instruction → action
vLLM
The reference open-source inference engine with PagedAttention and continuous batching
VLM
Vision-Language Model — a model that takes an image (usually plus a text question) in and produces text out, such as a caption or an answer. The standard build is middle fusion: a pretrained image encoder turns the picture into feature vectors, a small projector maps those into the token space of a pretrained language model, and the language model then "reads" the image alongside the words. LLaVA is the canonical open example; Qwen2-VL and Gemini are larger ones. Analogy: a sighted assistant describing a photo to a brilliant writer who cannot see it — the encoder does the looking, the language model does the talking. Unlike a native multimodal model, a plain VLM only outputs text; it cannot generate images.
Vocabulary
The fixed set of tokens a tokenizer can produce, each with an integer ID; its size trades tokens-per-document against embedding matrix size
Volta
NVIDIA's 2017 GPU architecture (V100) and the first generation to ship Tensor Cores, the dedicated matmul units that made deep-learning training dramatically faster. Subsequent generations — Turing, Ampere, Hopper, Blackwell — kept Tensor Cores and added support for ever-lower-precision formats. Named after the Italian physicist Alessandro Volta.
VP / VE SDE
The two standard ways to define the forward noising process of a diffusion model, each written as an SDE. Variance-Preserving (VP) — the family DDPM uses — shrinks the original signal as it adds noise so the total variance stays around 1 the whole way. Variance-Exploding (VE) — used by the early score-based models — leaves the signal untouched and simply piles on ever-larger noise, so the variance grows without bound. They are mathematically interconvertible and reach similar quality, but differ in numerical conditioning and in which samplers behave well.
VP9
A royalty-free video codec built by Google as a free alternative to the patent-licensed H.264. It compresses noticeably better than H.264 — smaller files at the same quality — and is the codec behind most YouTube streams and many .webm files, though it has since been largely overtaken by the newer, even-smaller AV1. Analogy: VP9 is to H.264 what a tighter, license-free ZIP format is to an older paid one — it squeezes the video smaller with no license fee, at the cost of more work to decode it back into frames. Example: a clip saved as a VP9 .webm is usually a good bit smaller than the same clip as an H.264 .mp4, but slower to unpack into frames during training.
VQA (Visual Question Answering)
The task of answering a natural-language question about an image — "How many people are in this photo?", "What color is the car?" — where the model must read the picture and the words together to respond. It is the classic benchmark for multimodal understanding: unlike captioning, which can lean on generic descriptions, a question pins the model to one specific detail it cannot fake. Think of an open-book exam where the "book" is a photograph and each question forces you to actually look. Most VLMs are evaluated on VQA datasets, and it is the natural small task on which to compare fusion methods like concatenation versus cross-attention.
VQ-GAN
A VQ-VAE trained with two extra signals so its reconstructions look sharp instead of blurry: a perceptual loss that compares images by their high-level features rather than exact pixels, and a patch discriminator — a small critic from the GAN world that scores whether each local region of an image looks real. The combination pushes the decoder to commit to crisp, specific details. This is the recipe Stable Diffusion's VAE descends from.
VQ-VAE
Vector-Quantized VAE — an autoencoder whose latent code is forced to be discrete. Instead of letting the encoder output any continuous numbers, each patch of the image must be described using an entry chosen from a small fixed codebook, like painting only with the colors in a numbered paint set. Turning an image into a grid of these code indices lets you treat it as a sequence of tokens and generate it with the same tools used for language. It is trained with a straight-through estimator so gradients can flow through the non-differentiable lookup.
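The codebook lookup at the heart of the method can be sketched in plain Python. The tiny two-dimensional codebook and the encoder outputs are invented for illustration, and the straight-through estimator is not shown because it needs an autograd framework.

```python
def quantize(vector: list[float], codebook: list[list[float]]) -> tuple[int, list[float]]:
    """Return the index and value of the codebook entry nearest to `vector`."""
    def squared_distance(a: list[float], b: list[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda k: squared_distance(vector, codebook[k]))
    return index, codebook[index]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # four "paint colors"
encoder_outputs = [[0.9, 0.1], [0.2, 0.8], [0.1, -0.1]]      # one vector per patch

# Each continuous vector is replaced by the index of its nearest code.
tokens = [quantize(v, codebook)[0] for v in encoder_outputs]
print(tokens)  # [1, 2, 0]
```

The resulting list of indices is the token sequence a transformer would model, and the decoder sees the looked-up codebook vectors rather than the raw encoder outputs.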
W and W+ latent spaces
The editable latent spaces inside StyleGAN that dictate how images are generated and controlled. W, the "master remote," is the intermediate space the input noise is first mapped into. Because it is far less entangled than the raw noise space, turning a single "dial" in W tends to change one specific attribute (like age) smoothly without altering the rest, which makes it well suited to editing. W+, the "individual room panels," is a relaxed version of W where each layer gets its own independent W code instead of sharing just one. Like abandoning the master remote for detailed control panels in every room, it makes tweaking one simple trait harder, but it can represent and reconstruct a specific, complex image much more precisely. This is the space GAN inversion usually targets when trying to match a real-world photo.
Warmup
The opening phase of training where the learning rate ramps up from near zero to its peak, stabilizing the first noisy updates
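A minimal sketch of the ramp, assuming a linear shape (the definition above only says the rate ramps up). The function name is hypothetical, and the schedule simply holds the peak afterward, whereas real schedules usually decay from there.

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup: climb from near zero to peak_lr over warmup_steps, then hold."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

for step in (0, 24, 49, 99, 500):
    print(step, f"{warmup_lr(step, peak_lr=1e-3, warmup_steps=100):.6f}")
```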
Warp
32 threads scheduled in lockstep on NVIDIA GPUs
Wasserstein GAN (WGAN)
A GAN variant that replaces the original loss with the Earth Mover's Distance between the real and generated image distributions. The original loss gives almost no gradient once the discriminator wins, stalling training; the Earth Mover's Distance stays informative even when the two distributions barely overlap, so the generator keeps learning. It requires the critic to obey a Lipschitz constraint, enforced in the popular WGAN-GP version by a gradient penalty.
Watermarking
Hiding an invisible, machine-detectable signal inside a generated image so software can later confirm "this was made by AI" without changing how the picture looks to a human. The signal can be stamped into the pixels after generation (a faint patterned perturbation) or baked into the model's own sampling — Google's SynthID nudges pixel values in a learned pattern, and Tree-Ring plants a ring-shaped mark in the initial noise that survives diffusion and is recovered by inverting the generation process. A matching detector then reads the mark back out and reports a confidence score. Like the watermark pressed into a banknote: invisible in normal use, obvious under the right lamp, and hard to forge or scrub off. The built-in tension is robustness vs invisibility — a mark strong enough to survive cropping and JPEG compression is harder to keep imperceptible. Example: generate 1,000 images, run the detector, and it should flag nearly all of them while leaving real photos unflagged.
WBC
Whole-Body Control — fast QP solving for joint torques from task-space goals
WebDataset
A library that streams training data directly from sharded .tar archives, avoiding the need to unpack millions of individual files.
Weight decay
A regularization technique that shrinks model parameters toward zero at each update step, discouraging large weights and improving generalization
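The shrinking step can be sketched as a plain SGD update with a decoupled decay term, in the style popularized by AdamW. The function name and the hyperparameter values are assumptions chosen for illustration.

```python
def sgd_step_with_weight_decay(
    weights: list[float],
    grads: list[float],
    lr: float,
    weight_decay: float,
) -> list[float]:
    """One SGD step: follow the gradient, and also shrink each weight toward zero."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]

weights = [1.0, -2.0]
# With zero gradient, the only change is the decay: each weight is
# multiplied by (1 - lr * weight_decay).
decayed = sgd_step_with_weight_decay(weights, grads=[0.0, 0.0], lr=0.1, weight_decay=0.01)
print(decayed)
```

Because the shrinkage is proportional to the weight itself, large weights are pulled back harder than small ones.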
Weights
The main, larger group of learned parameters in a layer — the W in y = xW + b — that decide how strongly each input affects each output. Think of the volume sliders on a soundboard: a big weight turns an input way up, a near-zero weight mutes it, and a negative weight flips it. During training the optimizer keeps nudging these sliders to lower the loss, and they make up the bulk of a model's size.
WGAN-GP
Short for Wasserstein GAN with Gradient Penalty, the most popular and reliable recipe for training a Wasserstein GAN. A Wasserstein GAN only works if its critic obeys a Lipschitz constraint (its output can't change too fast as its input changes). The original WGAN enforced that bluntly, by clipping every critic weight back into a fixed range after each step, a heavy-handed move that often crippled the model's quality. WGAN-GP replaces the clipping with a gentle gradient penalty that nudges the norm of the critic's gradient with respect to its input toward 1, which keeps training far more stable. Like keeping a car at the speed limit with a smooth governor that eases off the gas, instead of a hard rev-cut that jerks the whole engine every time you nudge past it.
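As a rough illustration of the penalty term, consider a linear critic f(x) = w·x + b, whose gradient with respect to the input is simply w at every point. That simplification is an assumption made here so the sketch needs no autograd; a real critic is a neural network whose input gradient is computed by automatic differentiation at points sampled between real and generated examples. The coefficient `lam=10.0` is a commonly used value, not something the entry above specifies.

```python
import math


def gradient_penalty_linear(w, lam=10.0):
    """Gradient penalty for a linear critic f(x) = w.x + b.

    The input gradient of a linear critic is w everywhere, so the penalty
    lam * (||grad|| - 1)^2 depends only on the norm of w.
    """
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    return lam * (grad_norm - 1.0) ** 2


steep = gradient_penalty_linear([3.0, 4.0])    # gradient norm 5, far from 1
on_target = gradient_penalty_linear([0.6, 0.8])  # gradient norm about 1
print(steep, on_target)
```

The penalty is two-sided: it grows whether the gradient norm is above or below 1, which is what pulls the norm toward 1 instead of merely capping it.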
Whisper
OpenAI's open speech-recognition model — an encoder-decoder transformer that turns a mel spectrogram of speech into text, trained on 680,000 hours of multilingual audio scraped from the web. Its encoder digests the audio into rich embeddings and its decoder writes out the words, so one model handles transcription, translation, and language identification. Because that encoder learned such general audio representations, people often reuse just the encoder — freezing it and training a small head on top — as a ready-made audio feature extractor (much like a vision linear probe). The name evokes catching even quiet, whispered speech.
Windowed attention
A cheaper form of attention that lets each token attend only to others inside a small local neighborhood (a "window") rather than to the whole sequence. In video, windowed spatiotemporal attention applies full joint space-and-time attention but only within small 3D boxes of nearby frames and pixels, so each token's cost scales with the window size instead of the full T×H×W token count. It is the middle ground between cheap (2+1)D attention and expensive full spatiotemporal attention: you keep some direct space-time interaction but give up reach across the whole clip. Like reading a document through a small sliding window that shows only a few lines at a time: fast, but you cannot see the whole page at once.
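A quick count of query-key pairs shows the size of the saving. The sketch assumes non-overlapping 3D windows that tile the token grid exactly; the grid and window sizes are made-up examples, and `attention_pairs` is a hypothetical helper.

```python
def attention_pairs(t, h, w, window=None):
    """Count query-key pairs for a T x H x W token grid.

    Full attention: every token attends to every token (N * N pairs).
    Windowed attention: every token attends only to the tokens inside
    its own window (N * window_size pairs).
    """
    n = t * h * w
    if window is None:
        return n * n
    wt, wh, ww = window
    return n * (wt * wh * ww)


full = attention_pairs(16, 32, 32)
windowed = attention_pairs(16, 32, 32, window=(4, 8, 8))
print(full, windowed, full // windowed)
```

For this grid the windowed variant needs 64 times fewer pairs, at the price that tokens in different windows never interact directly within the layer.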
Worker processes
Background subprocesses that a DataLoader spawns to load and preprocess data in parallel with GPU computation.
Workhorse
The dependable, go-to method that does the bulk of the everyday work in a field — not the flashiest, but the one practitioners reach for by default because it reliably gets the job done. Just as a workhorse on a farm pulls the heavy loads day in and day out, PPO earned the title in RLHF and the PID controller earned it in robotics.
World Model
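An action-conditioned generative model of the environment: given what it has observed so far and a chosen action, it predicts what happens next. Roughly, a video model that takes actions as an extra input.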
WSD
Warmup-Stable-Decay — a learning-rate schedule that warms up, holds the rate constant for most of training, then decays sharply at the end.
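The three phases can be written as a single function of the step index. This is a sketch under assumptions: linear warmup, linear decay, and the phase fractions shown are illustrative defaults, since the entry does not fix the decay shape or phase lengths.

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, warmup_frac=0.05,
           decay_frac=0.1, min_lr=0.0):
    """Learning rate at `step` for a Warmup-Stable-Decay schedule."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak rate for most of training.
        return peak_lr
    # Decay: drop linearly from the peak to min_lr over the final steps.
    progress = (step - decay_start) / decay_steps
    return peak_lr + (min_lr - peak_lr) * progress


schedule = [wsd_lr(s, 1000) for s in (0, 49, 500, 900, 950, 999)]
print(schedule)
```

A practical appeal of this shape is that the stable phase does not depend on the total step count, so a run can be extended and only the final decay needs to be rescheduled.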
XLA
Accelerated Linear Algebra: a compiler backend (used, for example, to target TPUs) that PyTorch reaches through torch_xla.
YaRN
Yet another RoPE extensioN method: a context-extension scheme that rescales rotation frequencies unevenly across dimensions to reach long contexts with minimal fine-tuning.
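The uneven rescaling can be sketched as follows. The ramp thresholds `alpha=1` and `beta=32` and the attention scaling formula are values reported for LLaMA-style models and are treated as assumptions here; the function names are hypothetical, and this is a simplified sketch, not a reference implementation.

```python
import math


def yarn_frequencies(head_dim, scale, orig_ctx, base=10000.0,
                     alpha=1.0, beta=32.0):
    """Rescale RoPE frequencies per dimension pair.

    Dimensions that complete many rotations inside the original context
    (high frequency) are left alone; dimensions that complete less than
    one rotation (low frequency) are divided by `scale`; those in between
    are blended linearly.
    """
    freqs = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)
        wavelength = 2 * math.pi / theta
        rotations = orig_ctx / wavelength
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append((1 - gamma) * theta / scale + gamma * theta)
    return freqs


def yarn_attention_scale(scale):
    """Extra factor applied to queries and keys at the extended length."""
    return 0.1 * math.log(scale) + 1.0


freqs = yarn_frequencies(head_dim=128, scale=4.0, orig_ctx=4096)
print(freqs[0], freqs[-1], yarn_attention_scale(4.0))
```

The fastest-rotating pair keeps its original frequency while the slowest is stretched by the full scale factor, which is the "uneven" part of the scheme.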
ZeRO
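Zero Redundancy Optimizer: DeepSpeed's scheme for sharding optimizer state, gradients, and parameters across data-parallel workers instead of replicating them on every one. Comparable to FSDP.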
Zero-conv
A 1×1 convolution whose weights and bias all start at exactly zero, used by ControlNet to bolt a new branch onto a pretrained model without disturbing it. At initialization a zero-conv outputs nothing, so the new branch adds zero to the original network and the model behaves exactly as before — yet because the layer still receives gradients, it can gradually learn how much signal to pass through. Like wiring in a new tap that is turned fully off at first, then opened slowly as training discovers how much to let flow. This is what lets ControlNet train a fresh control signal without damaging the base model's existing quality.
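Both halves of that claim (zero output at initialization, yet nonzero gradients) can be checked at a single pixel, where a 1×1 convolution is just a linear map over channels. The activation values and the upstream gradient below are made-up numbers, and `conv1x1` is a hypothetical helper and not ControlNet's code.

```python
def conv1x1(pixel, weight, bias):
    """Apply a 1x1 convolution at one pixel: a linear map over channels."""
    return [sum(w * x for w, x in zip(row, pixel)) + b
            for row, b in zip(weight, bias)]


control_features = [0.5, -1.0, 2.0]   # hypothetical control-branch activations
base_output = [0.3, 0.7]              # hypothetical frozen-model activations

weight = [[0.0] * 3 for _ in range(2)]  # zero-initialized weights
bias = [0.0, 0.0]                       # zero-initialized bias

branch = conv1x1(control_features, weight, bias)
combined = [b + c for b, c in zip(base_output, branch)]

# The gradient of the loss with respect to weight[o][i] is
# upstream_grad[o] * input[i], which does not depend on the weight's
# current value, so zero weights still receive a nonzero update signal.
upstream_grad = [1.0, -1.0]
weight_grad = [[g * x for x in control_features] for g in upstream_grad]
print(combined, weight_grad)
```

The combined output equals the base output exactly, while every weight already has a nonzero gradient to learn from.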
Zero-shot
Doing a task the model was never explicitly trained for, with zero task-specific examples shown at test time. The classic case is CLIP zero-shot image classification: instead of training a classifier head on labelled images, you write each candidate label as a short sentence — a prompt template such as "a photo of a {label}" — encode every sentence with the text encoder, and assign the image whichever label sentence has the highest cosine similarity to its image embedding. The prompt wording matters: phrasing the label as a natural caption matches the style CLIP saw during training, and averaging several templates (prompt ensembling) lifts accuracy a little more. Like a quiz contestant who never studied your specific exam but has read so widely that, handed the answer choices written out in full, they can pick the best match anyway. Example: deciding whether a photo is a cat or a dog by checking whether "a photo of a cat" or "a photo of a dog" sits closer to the image in CLIP's shared space.
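The matching step can be sketched with toy vectors. The three-dimensional embeddings below are hand-written stand-ins for what CLIP's text and image encoders would produce, and `zero_shot_classify` is a hypothetical helper; only the prompt template and the cosine-similarity rule come from the description above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, text_embs):
    """Return the label whose prompt embedding is closest to the image."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))


labels = ["cat", "dog"]
prompts = {label: f"a photo of a {label}" for label in labels}

# Stand-in embeddings: a real pipeline would encode each prompt with the
# text encoder and the picture with the image encoder.
text_embs = {"cat": [0.9, 0.1, 0.2], "dog": [0.1, 0.9, 0.3]}
image_emb = [0.8, 0.2, 0.1]

prediction = zero_shot_classify(image_emb, text_embs)
print(prompts[prediction], "->", prediction)
```

No classifier head is trained at any point; adding a new class only means writing one more prompt and encoding it.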
ZMP
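Zero-Moment Point: a classical balance criterion for biped robots. It is the point on the ground where the horizontal moments from gravity and inertial forces cancel; the robot stays balanced as long as that point remains inside the support polygon of its feet.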
Zoom
A camera move that magnifies or shrinks the view without the camera physically moving — like using binoculars to pull a faraway sign closer while your feet stay planted. It narrows or widens the lens's field of view, so the whole frame scales in or out at once. Contrast it with a dolly, which changes the picture by actually rolling the camera nearer or farther. It is one of the moves a video model can be steered through with camera control.
β-VAE
A VAE variant that multiplies the KL divergence part of the ELBO by an adjustable knob called β. Turning β up past 1 pressures the model to use its latent space more tidily, often making individual latent dimensions line up with meaningful features (like rotation or thickness) — but push it too far and the model stops reconstructing the input well. It is the simplest way to trade reconstruction quality against a cleaner, more interpretable latent.
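The objective can be sketched as a reconstruction term plus β times the KL term, using the closed-form KL between a diagonal Gaussian and a standard normal. The function names and the choice `beta=4.0` are illustrative assumptions.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))


def beta_vae_loss(recon_error, mu, logvar, beta=4.0):
    """Negative ELBO with the KL term weighted by beta."""
    return recon_error + beta * kl_to_standard_normal(mu, logvar)


kl = kl_to_standard_normal(mu=[1.0, 0.0], logvar=[0.0, 0.0])
plain = beta_vae_loss(2.0, [1.0, 0.0], [0.0, 0.0], beta=1.0)
weighted = beta_vae_loss(2.0, [1.0, 0.0], [0.0, 0.0], beta=4.0)
print(kl, plain, weighted)
```

Setting `beta=1.0` recovers the ordinary VAE objective; larger values charge more for every unit the latent code strays from the prior.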
σ-schedule (Karras)
The EDM convention of describing each noise level by its standard deviation σ (a real number) rather than by a discrete timestep t (an index from 0 to ~1000). Because σ directly measures "how much noise is on the image right now," the math for training and sampling becomes cleaner and sampler step sizes are easier to choose. Like labeling oven settings by their actual temperature instead of an arbitrary dial number from 1 to 10.
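Once noise levels are real-valued σ, a sampler just needs a decreasing list of them. The sketch below uses the spacing proposed in the EDM paper, which interpolates linearly in σ^(1/ρ); the defaults shown (σ from 80 down to 0.002, ρ = 7) are that paper's commonly quoted values and are assumptions as far as this glossary is concerned.

```python
def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Return n noise levels from sigma_max down to sigma_min (n >= 2).

    Interpolating in sigma ** (1 / rho) and raising back to the power rho
    places more steps at low noise, where fine detail is resolved.
    """
    min_inv = sigma_min ** (1.0 / rho)
    max_inv = sigma_max ** (1.0 / rho)
    return [(max_inv + i / (n - 1) * (min_inv - max_inv)) ** rho
            for i in range(n)]


sigmas = karras_sigmas(10)
print([round(s, 4) for s in sigmas])
```

Each entry is directly the standard deviation of the noise at that step, so no lookup from a timestep index is needed.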
License
MIT License. See the LICENSE file (https://github.com/25621/ai-learning-guides/blob/main/LICENSE) for details.