How LLM Inference Works: From Prompt to Generated Tokens
1. Introduction
When you type a question into ChatGPT or Claude and watch the answer appear word by word, you're witnessing LLM inference: the real-time process of running a trained transformer model to generate new text. Unlike training, which builds the model's parameters over weeks on massive GPU clusters, inference is the moment the model actually produces value for end users.
Inference isn't magic. It's a carefully orchestrated pipeline that converts a user prompt into tokens, runs a forward pass through the neural network, and then iteratively generates the next token until a response is complete. Understanding how this pipeline works (how the model manages memory with a KV cache, why the first token takes longer, and how sampling strategies shape creativity) is essential for designing production systems that are fast, cost-effective, and reliable.
In this article, we'll trace the entire inference journey from prompt to output, explain the two critical phases (prefill and decode), and reveal why inference can be both computationally expensive and beautifully efficient.
1. What Happens When You Send a Prompt?
The moment you submit a prompt, the model doesn't "think" like a human. It performs a deterministic (or semi-random, depending on sampling) series of mathematical operations. The complete flow looks like this:
Prompt → Tokenization → Embedding → Transformer forward pass → Logits → Sampling → Next token (appended to the sequence, then repeat)
Figure 1: The inference loop. The process repeats until an end-of-sequence token is generated or the maximum output length is reached.
Each step involves distinct components and resource costs. Let's unpack them one by one.
2. Tokenization Stage
Before the model can process your text, it must be split into tokens: the atomic units the model understands. A token can be a whole word, a subword, or a character, depending on the tokenizer.
Example:
"Hello world" → [15496, 995] (using a typical GPT-style tokenizer).
These token IDs are integers that index into the model's embedding table. The tokenizer typically runs on the CPU (or in a lightweight service), and its overhead is usually negligible compared to the GPU computation that follows. It's critical that inference uses the same tokenizer the model was trained with; otherwise, the IDs map to the wrong embeddings and the model receives garbled input.
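To make the text-to-ID mapping concrete, here is a minimal sketch of a greedy longest-match tokenizer. The three-entry `VOCAB` is a hypothetical stand-in that reuses the IDs from the example above; production tokenizers (such as BPE) use learned vocabularies and merge rules, so treat this as an illustration rather than how any real tokenizer is implemented.

```python
# Toy greedy longest-match tokenizer. VOCAB is a hypothetical three-entry
# vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"Hello": 15496, " world": 995, "!": 0}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def encode(text: str) -> list[int]:
    """Split text into the longest matching vocabulary pieces, left to right."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so longer pieces win over shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry matches text at position {pos}")
    return ids


def decode(ids: list[int]) -> str:
    """Map IDs back to text by concatenating their pieces."""
    return "".join(ID_TO_TEXT[i] for i in ids)


print(encode("Hello world"))          # [15496, 995]
print(decode(encode("Hello world")))  # Hello world
```

Note that the leading space is part of the `" world"` token, which is why decoding is a plain concatenation.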
Deeper on tokens: See our Tokens article for how different tokenizers work and why token count matters.
3. Embedding Layer
The token IDs are then fed into the embedding layer, a large lookup table. Each ID is mapped to a dense vector of floating-point numbers (e.g., 4096 dimensions for Llama 3 8B). These vectors encode semantic meaning learned during training. At this stage, each token is still independent; the embeddings have no context about surrounding words. That context will be injected later by the attention layers.
The embedding layer's weights are part of the model parameters and are loaded into GPU memory during inference.
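The lookup itself is simple enough to sketch in a few lines. The sizes below (`VOCAB_SIZE`, `EMBED_DIM`) and the random values are assumptions chosen to keep the example tiny; a trained model would hold learned values in a much larger table.

```python
import random

VOCAB_SIZE = 8   # hypothetical; real vocabularies are far larger
EMBED_DIM = 4    # hypothetical; the text above cites 4096 for Llama 3 8B

rng = random.Random(0)
# One row per token ID. Trained models learn these values; here they are random.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up one vector per token ID; no context is mixed in at this stage."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([3, 5, 3])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same ID gives the same vector
```

The second print illustrates the point about independence: a repeated token gets an identical vector wherever it appears, until attention adds context.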
How embeddings capture meaning: Read the Embeddings article for a full explanation.
4. Transformer Forward Pass
The embedded token vectors (plus positional encodings) now flow through the stack of transformer layers, the heart of the LLM. Each layer applies:
- Multi-head self-attention: Every token attends to every other token (or previous tokens only, in decoder models), exchanging information.
- Feed-forward network (FFN): A position-wise transformation that enriches each token's representation.
- Residual connections and layer normalization: These stabilize gradients and allow deep stacking.
After the final layer, each position holds a contextualized vector that captures the meaning of that token in the context of the entire sequence seen so far.
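As a rough illustration of the attention step, the sketch below implements single-head scaled dot-product attention with a causal mask in plain Python. It assumes the Q, K and V vectors are given directly; a real layer derives them from the token vectors with learned projections and runs many heads in parallel on the GPU.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position t only sees positions 0..t, as in decoder-only models.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores against the current and earlier positions only.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs


# Toy inputs: 3 positions, 2 dimensions (hypothetical values).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)
print(out[0])  # [1.0, 2.0]: the first position can only attend to itself
```

Later positions produce a blend of the value vectors they are allowed to see, which is how each token's representation picks up context.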
Deep architecture walkthrough: The Transformer Architecture article explains each component in detail.
5. Prefill Phase
When inference begins, the model processes the entire input prompt in one parallel step. This is called the prefill phase. Because the prompt is fully known, the model can compute attention scores for all prompt tokens simultaneously, a massive matrix multiplication that runs efficiently on GPUs.
Key characteristics:
- Parallel processing: All prompt tokens are ingested at once.
- Heavy GPU compute: The full attention matrix is computed for the prompt length, which can be hundreds or thousands of tokens.
- Builds the KV cache: The Keys and Values from attention are stored for later reuse.
- Produces the first token: After prefill, the model has the hidden state for the final prompt token, enabling prediction of the first output token.
The latency from sending the prompt to receiving the first token is called Time To First Token (TTFT). It's dominated by the prefill compute and is particularly sensitive to prompt length.
6. KV Cache (Key-Value Cache)
During autoregressive generation, the model generates one token at a time. If we recomputed attention for the entire (prompt + previously generated tokens) sequence from scratch at each step, we would repeat enormous amounts of computation. The KV cache solves this.
How it works:
- During the first forward pass (prefill), the model computes the Key and Value vectors for every token and stores them in a dedicated cache.
- When generating the next token, only the new token's Query, Key, and Value are computed. The new Q is compared against all cached K vectors (old + new), and the weighted sum uses cached V vectors.
- This avoids re-computing the heavy attention projection for all previous tokens, slashing the per-step compute from O(n²) to O(n) for the attention part.
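The steps above can be sketched with a small cache class for a single attention head. The `KVCache` class and the per-token `(Q, K, V)` triples are hypothetical; the point is only that appending one Key and Value per step gives the same attention output as rebuilding the full Key and Value lists each time.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * value[d] for w, value in zip(weights, values))
        for d in range(len(values[0]))
    ]


class KVCache:
    """Store of Keys and Values for the tokens seen so far (one layer, one head)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        """Append the new token's K and V, then attend over the whole cache."""
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Hypothetical per-token (Q, K, V) triples; a real model derives them from
# each token's hidden state with learned projections.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Reference: rebuild the K and V lists for the whole prefix at every step.
recomputed_outputs = [
    attend(tokens[t][0],
           [k for _, k, _ in tokens[: t + 1]],
           [v for _, _, v in tokens[: t + 1]])
    for t in range(len(tokens))
]

print(cached_outputs == recomputed_outputs)  # True
print(len(cache.keys))                       # 3
```

In this toy the "recompute" path only regathers existing vectors; in a real model it would also rerun the projections and every layer for all earlier tokens, which is the work the cache saves.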
Why it matters:
- Without a KV cache, generating the 100th token would mean re-running the forward pass over all ~100 tokens in the sequence and recomputing their Keys and Values. With a KV cache, only the 1 new token is run through the model, and its Query attends to the Keys and Values already stored.
- The KV cache grows linearly with sequence length and batch size, and with long contexts or many concurrent requests it can consume more GPU memory than the model weights themselves. Memory management of the KV cache is a central challenge in serving systems.
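A back-of-the-envelope calculation shows how quickly the cache grows. The shape used here (32 layers, 8 KV heads, head size 128, FP16) is an assumption modelled on a Llama-3-8B-like configuration, and the formula ignores allocator overhead and padding.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Bytes needed to cache Keys and Values (the leading factor 2) for a batch."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Assumed shape: 32 layers, 8 KV heads, head size 128, FP16 (2 bytes per value).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch_size=1)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# 16 concurrent sequences of 8,192 tokens each.
total = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16)
print(total / 2**30)  # 16.0 (GiB)
```

Under these assumptions the cache for 16 such sequences is about as large as the FP16 weights of an 8B-parameter model (8B × 2 bytes ≈ 16 GB), and it keeps growing with every additional token or request.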
7. Decode Phase (Autoregressive Generation)
After prefill, the model enters the decode phase, where it generates output tokens one at a time in an autoregressive loop:
- The model takes the last generated token (or the final prompt token for the first step) as input.
- It runs a single forward pass (with KV cache) to produce logits.
- A sampling strategy selects the next token.
- The token is appended to the sequence, and the KV cache is updated.
- The loop continues until a special end-of-sequence token (<|eos|>) is produced or a maximum length is reached.
Token₁ → Token₂ → Token₃ → ... → <|eos|>
Each iteration is sequentially dependent on the previous one: you cannot parallelize across time steps. This is why generation speed is measured in tokens per second and why techniques like speculative decoding (using a draft model to guess tokens) aim to accelerate the decode phase.
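The loop described above can be sketched as follows. The `forward` function and its `TOY_LOGITS` table are hypothetical stand-ins for a model: they look only at the latest token, whereas a real model runs a transformer forward pass that attends to the whole sequence through the KV cache. Greedy selection is used in place of a sampling strategy to keep the output deterministic.

```python
EOS = "<|eos|>"

# Hypothetical stand-in for a model: maps the latest token to next-token scores.
TOY_LOGITS = {
    " because": {" it": 2.0, " the": 1.0},
    " it": {" was": 3.0, " is": 1.5},
    " was": {" tired": 2.5, " late": 0.5},
    " tired": {".": 4.0},
    ".": {EOS: 5.0},
}


def forward(last_token: str) -> dict[str, float]:
    """Return logits for the next token given the most recent token."""
    return TOY_LOGITS[last_token]


def generate(prompt_tokens: list[str], max_new_tokens: int = 16) -> list[str]:
    """Greedy autoregressive decoding: one forward pass per generated token."""
    sequence = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        logits = forward(sequence[-1])            # forward pass -> logits
        next_token = max(logits, key=logits.get)  # greedy token selection
        if next_token == EOS:                     # stop on end-of-sequence
            break
        sequence.append(next_token)               # append, then loop again
        generated.append(next_token)
    return generated


print(generate(["The", " cat", " sat", " on", " the", " mat", " because"]))
# [' it', ' was', ' tired', '.']
```

Each pass through the `for` loop depends on the token chosen in the previous pass, which is the sequential dependency that limits decode speed.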
8. Attention During Inference
Attention remains the backbone of context understanding during inference. With the KV cache, each new token can attend to the entire previous context efficiently. This mechanism enables the model to:
- Resolve pronouns ("it" refers back to the "cat" mentioned earlier).
- Maintain long-range dependencies across paragraphs.
- Integrate retrieved context in RAG applications.
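However, long sequences remain a bottleneck: the KV cache grows linearly with sequence length, and a naive implementation also materializes an O(n²) attention score matrix during prefill. FlashAttention avoids storing that full matrix, and paged attention (used in vLLM) reduces the memory wasted by fragmented KV cache allocations, making long-context inference more practical.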
Attention mechanics explained: Visit the Attention Mechanism article to understand Q, K, V, and multi-head attention.
9. Sampling Strategy
After the transformer forward pass, the model outputs logits: raw scores for every token in the vocabulary. How you pick the next token from these scores dramatically affects the output style and quality.
| Strategy | Description | When to Use |
|---|---|---|
| Greedy decoding | Always pick the token with the highest probability. | Deterministic outputs, factual Q&A. |
| Top-k sampling | Consider only the k tokens with the highest probabilities, then sample. | Adds creativity while filtering out low-quality tokens. |
| Top-p (nucleus) sampling | Keep the smallest set of tokens whose cumulative probability exceeds p, then sample. | Dynamic creativity; adapts to certainty. |
| Temperature | Scale the logits before softmax: low temp (< 1) sharpens the distribution (more deterministic), high temp (> 1) flattens it (more random). | Controls randomness; applied together with top-k/top-p. |
In practice, these settings are often combined, e.g., temperature=0.7 with top_p=0.9. Lower temperatures are generally preferred for code or math; higher temperatures for creative writing.
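The strategies in the table can be combined in one function, sketched below under a few assumptions: temperature 0 is treated as greedy decoding, top-k is applied before top-p, and the function name `sample_next_token` and the four-entry `logits` list are made up for illustration. Real serving stacks differ in ordering and edge-case handling.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from raw logits using temperature, top-k and top-p."""
    if temperature == 0:
        # Temperature 0 is treated here as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]

    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break  # smallest prefix whose probability mass reaches top_p
        ranked = kept

    # random.choices treats the weights as relative, so no renormalization is needed.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
print(sample_next_token(logits, temperature=0))     # 0 (greedy)
print(sample_next_token(logits, top_k=1, rng=rng))  # 0 (only one candidate survives)
print(sample_next_token(logits, temperature=0.7, top_p=0.9, rng=rng))
```

With these logits, `temperature=0.7` and `top_p=0.9` keep only the two most likely tokens, so the last call returns index 0 or 1.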
10. Why Inference Is Expensive
Running inference at scale isn't trivial. Several factors drive costs:
- Model size: A 70B parameter model in FP16 requires ~140 GB just for weights. You need multiple high-end GPUs (e.g., A100 80GB or H100) simply to load it.
- Attention complexity: Attention compute in the prefill phase grows as O(n²) with prompt length. For a 128K prompt, that's an enormous amount of FLOPs, even with optimizations.
- Sequential decode bottleneck: Tokens are generated one by one, making latency directly proportional to output length. Throughput (tokens per second) is limited by memory bandwidth.
- KV cache memory: Serving many users with long conversations requires a huge KV cache, straining GPU HBM. Efficient memory management (paging, offloading) is critical.
- Hardware utilization: GPUs are designed for parallel throughput, but autoregressive generation is inherently sequential, causing under-utilization of compute units and shifting the bottleneck to memory bandwidth.
This is why specialized inference frameworks and quantization techniques (reducing precision to INT8 or INT4) are active areas of engineering.
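The arithmetic behind these points is easy to reproduce. The sketch below computes weight memory at different precisions and a rough upper bound on single-sequence decode speed, assuming every weight is read from GPU memory once per generated token. The 2,000 GB/s bandwidth is an assumed round figure rather than a specification for any particular GPU, and the bound ignores KV cache reads, multi-GPU sharding, and batching (which amortizes weight reads across sequences).

```python
def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights."""
    return num_params * bytes_per_param


def decode_tokens_per_second_bound(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-sequence decode speed when memory-bound.

    Each decode step streams every weight from GPU memory once, so steps
    per second cannot exceed bandwidth / model size.
    """
    return bandwidth_bytes_per_s / model_bytes


GB = 1e9
params = 70e9

for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    size = weight_bytes(params, bytes_per_param)
    # 2,000 GB/s is an assumed round figure for memory bandwidth.
    bound = decode_tokens_per_second_bound(size, 2000 * GB)
    print(f"{name}: {size / GB:.0f} GB weights, <= {bound:.1f} tokens/s per sequence")
# FP16: 140 GB weights, <= 14.3 tokens/s per sequence
# INT8: 70 GB weights, <= 28.6 tokens/s per sequence
# INT4: 35 GB weights, <= 57.1 tokens/s per sequence
```

Halving the bytes per parameter halves the memory footprint and doubles this bound, which is one reason quantization is attractive for decode-heavy workloads.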
11. Context Window During Inference
The context window (the maximum number of tokens the model can handle) is a hard cap during inference. If the total tokens (prompt + generation) would exceed this limit, the serving system must reject the request, stop generating, or truncate part of the input (often the earliest tokens), which can cause the model to lose crucial context.
Long contexts increase both prefill compute and KV cache size, directly impacting cost and latency. Many production systems implement strategic context management (summarization, sliding windows) to stay within budget.
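One simple form of context management is a sliding window that keeps the system prompt and the most recent history. The function below is a sketch under assumed conventions (token IDs as plain lists, a fixed reservation for the output); the name `fit_to_context` and the sample IDs are hypothetical.

```python
def fit_to_context(system_tokens, history_tokens, context_window, max_new_tokens):
    """Keep the system prompt plus the most recent history that fits.

    Reserves room for max_new_tokens so prompt + generation stays within
    the context window.
    """
    budget = context_window - max_new_tokens - len(system_tokens)
    if budget < 0:
        raise ValueError("system prompt and output reservation exceed the context window")
    # Drop the oldest history tokens first (a simple sliding window).
    recent = history_tokens[-budget:] if budget > 0 else []
    return system_tokens + recent


system = [1, 2]                   # hypothetical token IDs
history = list(range(100, 110))   # 10 tokens of conversation history
prompt = fit_to_context(system, history, context_window=10, max_new_tokens=4)
print(prompt)  # [1, 2, 106, 107, 108, 109]
```

Dropping the oldest tokens is the cheapest policy; summarizing them instead preserves more information at the cost of an extra model call.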
Managing token limits: Read our Context Window article for optimization techniques.
12. Prefill vs Decode Comparison
| Aspect | Prefill | Decode |
|---|---|---|
| Processing | Parallel (all prompt tokens at once) | Sequential (one token at a time) |
| GPU Compute | Very high (compute-bound) | Lower per step (memory-bound) |
| Latency | Defines Time To First Token (TTFT) | Defines Time Per Output Token (TPOT) |
| KV Cache | Built from scratch | Read in full and extended by one entry per new token |
| Optimization focus | Kernel fusion, FlashAttention | Memory bandwidth, batching |
Understanding this distinction is crucial for capacity planning and latency optimization.
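Both latency metrics in the table can be measured from the client side of any token stream. The helper below is a sketch: `measure_latency` is a made-up name, and `fake_stream` is a hypothetical stand-in for a streaming API that simulates a slow first token followed by faster ones.

```python
import time


def measure_latency(token_stream):
    """Consume a token iterator and report TTFT and mean TPOT in seconds."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        count += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    # TPOT averages the gaps between tokens after the first one.
    tpot = (end - first_token_at) / (count - 1) if count > 1 else 0.0
    return {"tokens": count, "ttft": ttft, "tpot": tpot}


def fake_stream(prefill_s=0.05, per_token_s=0.005, n_tokens=5):
    """Hypothetical stand-in for a streaming API: slow first token, fast rest."""
    time.sleep(prefill_s)
    yield "tok0"
    for i in range(1, n_tokens):
        time.sleep(per_token_s)
        yield f"tok{i}"


stats = measure_latency(fake_stream())
print(stats["tokens"])  # 5
print(f"TTFT {stats['ttft'] * 1000:.1f} ms, TPOT {stats['tpot'] * 1000:.1f} ms")
```

Tracking the two numbers separately matters because they respond to different changes: shorter prompts mainly improve TTFT, while faster decode steps mainly improve TPOT.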
13. Real-World Example
Let's trace the prompt: "The cat sat on the mat because"
- Tokenization: ["The", " cat", " sat", " on", " the", " mat", " because"] → [133, 4721, 318, 19, 278, 6416, 562]
- Embedding: Each ID becomes a vector of 4096 numbers.
- Positional encoding added.
- Prefill phase: All 7 vectors enter the transformer. After 32 layers, each token is now contextualized. The KV cache is filled.
- First decode step: The hidden state for the last token "because" passes through the output projection to produce logits. Sampling (with temp=0.7) selects token " it" (likely high probability).
- Append & repeat: The sequence becomes 8 tokens. The KV cache now holds keys/values for all 8. The new token's Q attends to all cached Ks. Next token " was" is generated.
- Continue until "." or <|eos|> is generated: " it was tired."
On modern GPUs, each iteration of this loop typically takes on the order of milliseconds to tens of milliseconds, depending on model size and hardware.
14. Production Inference Systems
Serving LLMs to thousands of concurrent users requires sophisticated systems. Key examples:
- vLLM: Open-source, implements PagedAttention to manage KV cache memory efficiently and supports continuous batching, dynamically adding/removing requests from a running batch to maximize throughput.
- TensorRT-LLM: NVIDIA's optimized inference library with advanced quantization, kernel fusion, and speculative decoding.
- Hugging Face TGI (Text Generation Inference): A popular open-source server with built-in watermarking, stopping criteria, and batching.
- Continuous batching: Instead of waiting for a whole batch to finish, new requests are inserted as soon as one slot frees up, increasing hardware utilization.
These frameworks handle the heavy lifting of memory management, scheduling, and optimization, allowing developers to focus on application logic.
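The benefit of continuous batching can be seen in a small simulation that counts decode steps. This is a simplified model, not how vLLM or any other framework schedules work: it assumes all requests are queued at the start, ignores the prefill cost of newly admitted requests, and uses hypothetical output lengths.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(output_lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode step produces one token for every active request.
        active = [remaining - 1 for remaining in active]
        # Finished requests leave the batch immediately.
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for six requests, two batch slots.
lengths = [8, 2, 2, 2, 8, 2]
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 14
```

In the static case, short requests leave their slot idle while the longest request in the batch finishes; refilling slots as they free up recovers most of that idle time.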
15. Common Misconceptions
"LLMs generate full sentences at once."
False. Text is generated token by token, autoregressively. A response only appears to arrive all at once when the server buffers the tokens and returns them together; with streaming, tokens are sent as they're produced.
"LLMs 'think' like humans."
Inference is matrix arithmetic: mostly matrix multiplications plus a few simple nonlinear functions. There's no consciousness and no inner monologue, just statistical prediction based on patterns learned during training.
"Inference updates model weights."
Weights are frozen during inference (unless you're doing online learning, which is rare and risky). The model doesn't learn from user interactions.
"More context always means better reasoning."
Even with large windows, models can get "lost in the middle" and struggle to prioritize information. Quality of context placement and retrieval often outweighs sheer volume.
16. Relationship to Other Concepts
Inference ties together all the foundations: tokens provide the language; embeddings give them numerical form; the transformer processes them; attention builds context; the KV cache accelerates generation; and sampling adds controlled creativity.
17. Key Takeaways
- Inference is the runtime generation process: it's what users interact with, unlike batch training.
- The pipeline: Tokenization → Embeddings → Prefill (parallel prompt processing) → Decode (autoregressive token generation).
- KV cache is the critical optimization that reuses Keys and Values from previous tokens, avoiding repeated computation.
- Prefill phase determines Time To First Token (TTFT); decode phase determines tokens-per-second throughput.
- Sampling strategies (temperature, top-p) control the randomness and creativity of outputs.
- Inference is expensive due to model size, quadratic attention, sequential decoding, and memory bandwidth limits, all managed by specialized frameworks like vLLM and TensorRT-LLM.
- Context window size directly affects inference cost and latency; efficient memory management is essential.
- Understanding inference empowers you to design responsive, cost-effective AI applications that deliver real-time value.