How LLM Inference Works: From Prompt to Generated Tokens

Introduction

When you type a question into ChatGPT or Claude and watch the answer appear word by word, you’re witnessing LLM inference—the real‑time process of running a trained transformer model to generate new text. Unlike training, which builds the model’s parameters over weeks on massive GPU clusters, inference is the moment the model actually produces value for end users.

Inference isn’t magic. It’s a carefully orchestrated pipeline that converts a user prompt into tokens, runs a forward pass through the neural network, and then iteratively generates the next token until a response is complete. Understanding how this pipeline works—how the model manages memory with a KV cache, why the first token takes longer, and how sampling strategies shape creativity—is essential for designing production systems that are fast, cost‑effective, and reliable.

In this article, we’ll trace the entire inference journey from prompt to output, explain the two critical phases (prefill and decode), and reveal why inference can be both computationally expensive and beautifully efficient.

1. What Happens When You Send a Prompt?

The moment you submit a prompt, the model doesn’t “think” like a human. It performs a series of mathematical operations that is deterministic under greedy decoding and partly random under sampling. The complete flow looks like this:

Prompt → Tokenization → Embedding → Transformer forward pass → Logits → Sampling → Next token → (appended to the sequence; repeat)

Figure 1: The inference loop. The process repeats until an end‑of‑sequence token is generated or the maximum output length is reached.

Each step involves distinct components and resource costs. Let’s unpack them one by one.

2. Tokenization Stage

Before the model can process your text, it must be split into tokens—the atomic units the model understands. A token can be a whole word, a subword, or a character, depending on the tokenizer.

Example:
"Hello world" → [15496, 995] (using a typical GPT‑style tokenizer).

These token IDs are integers that index into the model’s embedding table. The tokenizer typically runs on the CPU (or in a lightweight service), and its overhead is usually negligible compared to the GPU computation that follows. It’s critical that inference uses the same tokenizer the model was trained with; otherwise the IDs point at the wrong embedding rows and the model receives garbled input.
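To make the text-to-ID mapping concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary is hypothetical apart from the two IDs taken from the "Hello world" example above, and real GPT-style tokenizers apply learned byte-pair-encoding merges rather than this simple rule.

```python
# Toy vocabulary (hypothetical). Only 15496 and 995 come from the example above.
TOY_VOCAB = {"Hello": 15496, " world": 995, "H": 1, "e": 2, "l": 3, "o": 4, " ": 5}


def encode(text, vocab=TOY_VOCAB):
    """Split text into the longest matching vocabulary pieces, left to right."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first so whole words beat single characters.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def decode(ids, vocab=TOY_VOCAB):
    """Map IDs back to their text pieces and join them."""
    inverse = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(inverse[i] for i in ids)


print(encode("Hello world"))
print(decode(encode("Hello world")))
```

Decoding with a different vocabulary would map the same IDs to different pieces, which is why training and inference must share one tokenizer.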

Deeper on tokens: See our Tokens article for how different tokenizers work and why token count matters.

3. Embedding Layer

The token IDs are then fed into the embedding layer—a large lookup table. Each ID is mapped to a dense vector of floating‑point numbers (e.g., 4096 dimensions for Llama 3 8B). These vectors encode semantic meaning learned during training. At this stage, each token is still independent; the embeddings have no context about surrounding words. That context will be injected later by the attention layers.

The embedding layer’s weights are part of the model parameters and are loaded into GPU memory during inference.
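The lookup itself is simple enough to sketch in a few lines. The table below is filled with random numbers and uses tiny, made-up dimensions; in a real model the rows are learned weights and the vector size is in the thousands.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a stand-in embedding table: one row of `dim` floats per token ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Embedding lookup: each ID selects its row, independent of its neighbors."""
    return [table[token_id] for token_id in token_ids]


table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed([3, 7, 3], table)

# The same ID always yields the same vector, wherever it appears in the sequence.
print(vectors[0] == vectors[2])
```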

How embeddings capture meaning: Read the Embeddings article for a full explanation.

4. Transformer Forward Pass

The embedded token vectors (plus positional encodings) now flow through the stack of transformer layers—the heart of the LLM. Each layer applies:

Multi‑head self‑attention: Tokens exchange information. In the decoder‑only models used for text generation, each token attends only to itself and earlier tokens (causal masking).
Feed‑forward network (FFN): A position‑wise transformation that enriches each token’s representation.
Residual connections and layer normalization: These keep activations well scaled from layer to layer, which stabilizes training and makes deep stacks workable.

After the final layer, each position holds a contextualized vector that captures the meaning of that token in the context of the entire sequence seen so far.
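As a rough illustration of the attention step, the sketch below implements single-head causal attention in plain Python. It assumes the Query, Key, and Value vectors are already computed and leaves out multiple heads, the FFN, residual connections, and normalization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_attention(queries, keys, values):
    """Single-head attention where position i sees only positions 0..i."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Score the query against itself and every earlier position.
        weights = softmax([dot(q, k) * scale for k in keys[: i + 1]])
        # The output is the attention-weighted average of the visible values.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values[: i + 1]))
            for d in range(dim)
        ])
    return outputs
```

Position 0 can only see itself, so its output is exactly its own value vector; later positions blend in everything that came before them.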

Deep architecture walkthrough: The Transformer Architecture article explains each component in detail.

5. Prefill Phase

When inference begins, the model processes the entire input prompt in one parallel step. This is called the prefill phase. Because the prompt is fully known, the model can compute attention scores for all prompt tokens simultaneously—a massive matrix multiplication that runs efficiently on GPUs.

Key characteristics:

Parallel processing: All prompt tokens are ingested at once.
Heavy GPU compute: The full attention matrix is computed for the prompt length, which can be hundreds or thousands of tokens.
Builds the KV cache: The Keys and Values from attention are stored for later reuse.
Produces the first token: After prefill, the model has the hidden state for the final prompt token, enabling prediction of the first output token.

The latency from sending the prompt to receiving the first token is called Time To First Token (TTFT). It’s dominated by the prefill compute and is particularly sensitive to prompt length.

6. KV Cache (Key‑Value Cache)

During autoregressive generation, the model generates one token at a time. If we recomputed attention for the entire (prompt + previously generated tokens) sequence from scratch at each step, we would repeat enormous amounts of computation. The KV cache solves this.

How it works:

During the first forward pass (prefill), the model computes the Key and Value vectors for every token and stores them in a dedicated cache.
When generating the next token, only the new token’s Query, Key, and Value are computed. The new Q is compared against all cached K vectors (old + new), and the weighted sum uses cached V vectors.
This avoids recomputing the Key and Value projections for all previous tokens, cutting the per‑step attention cost from O(n²) to O(n) in the sequence length n.
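The steps above can be sketched with a toy single-head cache. The project function is a hypothetical stand-in for learned projection matrices (it only scales the input), and the class name KVCache is illustrative rather than any library's API. The point is that the cached path and the recompute-everything path give the same output.

```python
import math


def attend(query, keys, values):
    """Attention output for one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) * scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum((e / total) * value[d] for e, value in zip(exps, values))
        for d in range(len(values[0]))
    ]


def project(x, factor):
    # Stand-in for a learned projection matrix (hypothetical: a plain scaling).
    return [factor * v for v in x]


class KVCache:
    """Keeps the Keys and Values of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        """Process one new token vector, reusing the cached Keys and Values."""
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        return attend(project(x, 1.5), self.keys, self.values)


def recompute_last(xs):
    """No cache: rebuild every Key and Value, then attend from the last token."""
    keys = [project(x, 0.5) for x in xs]
    values = [project(x, 2.0) for x in xs]
    return attend(project(xs[-1], 1.5), keys, values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(x) for x in tokens]
```

Each call to step does a constant amount of projection work, while recompute_last redoes the projections for the whole prefix every time.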

Why it matters:

Without a KV cache, generating the 100th token would mean pushing all preceding tokens through every layer again and recomputing their Keys and Values. With a KV cache, only the 1 new token passes through the layers, and its Query attends over the Keys and Values already stored.
The KV cache grows linearly with sequence length and batch size, and it can easily consume more GPU memory than the model weights themselves—especially with long contexts. Memory management of the KV cache is a central challenge in serving systems.
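A back-of-the-envelope estimate shows how quickly this adds up. The sketch assumes a Llama-3-8B-like shape (32 layers, 8 key/value heads from grouped-query attention, head dimension 128) stored in FP16; treat these numbers as assumptions and substitute your own model's configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Size of the KV cache in bytes for a given shape and workload."""
    # Factor of 2: one Key vector and one Value vector per token, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed model shape: 32 layers, 8 KV heads, head_dim 128, FP16 (2 bytes per value).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch_size=1)
long_batch = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"16 sequences of 8192 tokens: {long_batch / 2**30:.0f} GiB")
```

Under these assumptions, 16 concurrent sequences of 8,192 tokens need about as much memory as the FP16 weights of an 8B-parameter model, which is why the cache can rival or exceed the weights.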

7. Decode Phase (Autoregressive Generation)

After prefill, the model enters the decode phase, where it generates output tokens one at a time in an autoregressive loop:

The model takes the most recently generated token as input. (The first output token comes from the prefill pass, which ends on the final prompt token.)
It runs a single forward pass (with KV cache) to produce logits.
A sampling strategy selects the next token.
The token is appended to the sequence, and the KV cache is updated.
The loop continues until a special end‑of‑sequence token (<|eos|>) is produced or a maximum length is reached.

Token₁ → Token₂ → Token₃ → ... → <|eos|>

Each iteration is sequentially dependent on the previous one—you cannot parallelize across time steps. This is why generation speed is measured in tokens per second and why techniques like speculative decoding (using a draft model to guess tokens) aim to accelerate the decode phase.
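The loop structure can be sketched as follows. The toy_logits function is a hypothetical stand-in for the model's forward pass, the EOS ID is made up, and greedy selection is used to keep the example deterministic.

```python
EOS = 0  # hypothetical end-of-sequence token ID


def toy_logits(sequence, vocab_size=5):
    """Stand-in for a forward pass: favors (last token + 1), wrapping around to EOS."""
    target = (sequence[-1] + 1) % vocab_size
    return [1.0 if token == target else 0.0 for token in range(vocab_size)]


def generate(prompt_ids, max_new_tokens, logits_fn=toy_logits):
    """Autoregressive loop: predict, pick, append, and stop on EOS or the length cap."""
    sequence = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        # Forward pass; a real engine would reuse its KV cache here.
        logits = logits_fn(sequence)
        # Greedy pick: the index of the largest logit.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        sequence.append(next_token)
        generated.append(next_token)
        if next_token == EOS:
            break
    return generated


print(generate([1], max_new_tokens=10))
```

Each iteration needs the token chosen by the previous one, so the loop body cannot run for several time steps at once.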

8. Attention During Inference

Attention remains the backbone of context understanding during inference. With the KV cache, each new token can attend to the entire previous context efficiently. This mechanism enables the model to:

Resolve pronouns (“it” refers back to the “cat” mentioned earlier).
Maintain long‑range dependencies across paragraphs.
Integrate retrieved context in RAG applications.

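However, long sequences are still costly. A naive implementation materializes an n × n attention score matrix during prefill, and the KV cache grows linearly with every token kept in context. FlashAttention computes exact attention without materializing the full score matrix, and PagedAttention (used in vLLM) stores the KV cache in fixed‑size blocks to reduce memory fragmentation. Together, such techniques make long‑context inference practical.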

Attention mechanics explained: Visit the Attention Mechanism article to understand Q, K, V, and multi‑head attention.

9. Sampling Strategy

After the transformer forward pass, the model outputs logits—raw scores for every token in the vocabulary. How you pick the next token from these scores dramatically affects the output style and quality.

Strategy	Description	When to Use
Greedy decoding	Always pick the token with the highest probability.	Deterministic outputs, factual Q&A.
Top‑k sampling	Consider only the k tokens with the highest probabilities, then sample.	Adds creativity while filtering out low‑quality tokens.
Top‑p (nucleus) sampling	Keep the smallest set of tokens whose cumulative probability exceeds p, then sample.	Dynamic creativity; adapts to certainty.
Temperature	Divide the logits by the temperature T before softmax: T < 1 sharpens the distribution (more deterministic), T > 1 flattens it (more random).	Controls randomness; applied together with top‑k/top‑p.

In practice, these settings are usually combined, e.g., temperature=0.7 and top_p=0.9. Lower temperatures are generally preferred for code or math, higher temperatures for creative writing.
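Here is one way the strategies in the table might be combined in code. The function names and the order of operations (temperature, then top-k, then top-p) are assumptions for illustration; inference libraries differ in the details.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Return renormalized {token_id: prob} after top-k and top-p filtering."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest set whose mass reaches p
                break
        ranked = kept
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick the next token ID; temperature 0 is treated as greedy decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]
```

Passing a seeded random.Random instance as rng makes sampled outputs reproducible, which helps when testing.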

10. Why Inference Is Expensive

Running inference at scale isn’t trivial. Several factors drive costs:

Model size: A 70B parameter model in FP16 requires ~140 GB just for weights. You need multiple high‑end GPUs (e.g., A100 80GB or H100) simply to load it.
Attention complexity: The prefill phase involves O(n²) compute. For a 128K prompt, that’s an enormous amount of FLOPs, even with optimizations.
Sequential decode bottleneck: Tokens are generated one by one, making latency directly proportional to output length. Throughput (tokens per second) is limited by memory bandwidth.
KV cache memory: Serving many users with long conversations requires a huge KV cache, straining GPU HBM. Efficient memory management (paging, offloading) is critical.
Hardware utilization: GPUs are designed for parallel throughput, but autoregressive generation is inherently sequential, causing under‑utilization of compute units and shifting the bottleneck to memory bandwidth.

This is why specialized inference frameworks and quantization techniques (reducing precision to INT8 or INT4) are active areas of engineering.
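The arithmetic behind these costs is easy to reproduce. The sketch below computes weight memory at different precisions and a rough ceiling on single-stream decode speed, assuming every weight is read from memory once per generated token. The 2 TB/s bandwidth figure is a hypothetical round number, and the estimate ignores KV cache reads, compute time, and multi-GPU communication.

```python
def weight_bytes(num_params, bits_per_param):
    """Memory needed to hold the weights at the given precision."""
    return num_params * bits_per_param / 8


def max_decode_tokens_per_second(num_params, bits_per_param, bandwidth_bytes_per_s):
    """Bandwidth-bound ceiling for batch size 1: all weights are read once per token."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, bits_per_param)


PARAMS_70B = 70e9
ASSUMED_BANDWIDTH = 2e12  # hypothetical 2 TB/s of aggregate memory bandwidth

for bits in (16, 8, 4):
    gigabytes = weight_bytes(PARAMS_70B, bits) / 1e9
    ceiling = max_decode_tokens_per_second(PARAMS_70B, bits, ASSUMED_BANDWIDTH)
    print(f"{bits}-bit: {gigabytes:.0f} GB of weights, at most ~{ceiling:.0f} tokens/s")
```

Halving the precision halves the bytes read per token, which is why quantization tends to improve decode speed as well as memory footprint.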

11. Context Window During Inference

The context window (the maximum number of tokens the model can handle) is a hard cap during inference. If the total tokens (prompt + generation) would exceed this limit, the serving system has to reject the request, stop generating, or drop tokens (often the earliest ones), which can cause the model to lose crucial context.

Long contexts increase both prefill compute and KV cache size, directly impacting cost and latency. Many production systems implement strategic context management (summarization, sliding windows) to stay within budget.
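A sliding window is the simplest of these strategies. The sketch below is one possible policy, not a standard API: it always keeps the system prompt, reserves room for the output, and fills the remaining budget with the most recent history. The function and parameter names are illustrative.

```python
def fit_to_context(system_ids, history_ids, context_window, max_new_tokens):
    """Keep the system prompt plus the most recent history that fits the budget."""
    budget = context_window - max_new_tokens - len(system_ids)
    if budget < 0:
        raise ValueError("system prompt and output reservation exceed the context window")
    # Guard against budget == 0, because history_ids[-0:] would return everything.
    recent = history_ids[-budget:] if budget > 0 else []
    return list(system_ids) + list(recent)


prompt = fit_to_context(
    system_ids=[1, 2],
    history_ids=[10, 11, 12, 13, 14],
    context_window=8,
    max_new_tokens=3,
)
print(prompt)
```

Dropping the oldest history is cheap but lossy; summarizing the dropped turns instead trades extra inference calls for better recall.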

Managing token limits: Read our Context Window article for optimization techniques.

12. Prefill vs Decode Comparison

Aspect	Prefill	Decode
Processing	Parallel (all prompt tokens at once)	Sequential (one token at a time)
GPU Compute	Very high (compute‑bound)	Lower per step (memory‑bound)
Latency	Defines Time To First Token (TTFT)	Defines Time Per Output Token (TPOT)
KV Cache	Built from scratch	Read/write on new token
Optimization focus	Kernel fusion, FlashAttention	Memory bandwidth, batching

Understanding this distinction is crucial for capacity planning and latency optimization.
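For capacity planning, the two phases can be folded into a simple latency model. This sketch assumes TTFT scales linearly with prompt length at a fixed prefill throughput and that TPOT is constant; both are simplifications, and the example numbers are made up.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tokens_per_s, tpot_s):
    """Return (ttft, total) in seconds for a single request."""
    # Prefill: the whole prompt is processed before the first token appears.
    ttft = prompt_tokens / prefill_tokens_per_s
    # Decode: every token after the first adds one time-per-output-token.
    total = ttft + tpot_s * max(output_tokens - 1, 0)
    return ttft, total


# Hypothetical workload: 2,000-token prompt, 101 output tokens.
ttft, total = estimate_latency(2000, 101, prefill_tokens_per_s=10000, tpot_s=0.02)
print(f"TTFT: {ttft:.2f} s, end to end: {total:.2f} s")
```

In this example the decode phase dominates the total even though prefill handles twenty times more tokens, which reflects the parallel versus sequential split in the table.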

13. Real‑World Example

Let’s trace the prompt: "The cat sat on the mat because"

Tokenization: ["The", " cat", " sat", " on", " the", " mat", " because"] → [133, 4721, 318, 19, 278, 6416, 562] (illustrative IDs; the actual values depend on the tokenizer)
Embedding: Each ID becomes a vector of 4096 numbers.
Positional encoding added.
Prefill phase: All 7 vectors enter the transformer. After 32 layers, each token is now contextualized. The KV cache is filled.
First output token: The hidden state for the last prompt token "because" passes through the output projection to produce logits. Sampling (with temperature 0.7) selects the token " it", a plausible high‑probability continuation.
Append & repeat: The sequence becomes 8 tokens. The KV cache now holds keys/values for all 8. The new token’s Q attends to all cached Ks. Next token " was" is generated.
Continue until "." or <|eos|> is generated: " it was tired."

Each pass through this loop typically takes milliseconds to tens of milliseconds on modern GPUs, depending on model size and hardware.

14. Production Inference Systems

Serving LLMs to thousands of concurrent users requires sophisticated systems. Key systems and techniques:

vLLM: Open‑source, implements PagedAttention to manage KV cache memory efficiently and supports continuous batching—dynamically adding/removing requests from a running batch to maximize throughput.
TensorRT‑LLM: NVIDIA’s optimized inference library with advanced quantization, kernel fusion, and speculative decoding.
Hugging Face TGI (Text Generation Inference): A popular open‑source server with built‑in watermarking, stopping criteria, and batching.
Continuous batching: Instead of waiting for a whole batch to finish, new requests are inserted as soon as one slot frees up, increasing hardware utilization.

These frameworks handle the heavy lifting of memory management, scheduling, and optimization, allowing developers to focus on application logic.
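To see why continuous batching helps, the following toy simulation counts decode steps under both scheduling policies. It models each request only by its number of output tokens and ignores prefill, memory limits, and scheduling overhead, so it illustrates the idea rather than how any particular framework implements it.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled from the queue before the next decode step."""
    queue = list(lengths)
    running = []  # remaining output tokens for each active request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots.
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        # One decode step produces one token for every active request.
        steps += 1
        # Requests that just produced their last token leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


output_lengths = [8, 2, 2, 2]  # hypothetical output lengths, in tokens
print(static_batching_steps(output_lengths, batch_size=2))
print(continuous_batching_steps(output_lengths, batch_size=2))
```

With static batching, the short request sharing a batch with the 8-token request leaves its slot idle for six steps. Continuous batching fills that slot with the waiting requests, so the whole workload finishes in fewer steps.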

15. Common Misconceptions

“LLMs generate full sentences at once.”

False. Text is generated token by token, autoregressively. With streaming, tokens are sent to the client as they’re produced; a response only appears all at once when the application buffers the tokens and displays them after generation finishes.

“LLMs ‘think’ like humans.”

Inference is arithmetic: mostly matrix multiplications with a few nonlinear functions in between. There’s no consciousness and no inner monologue, just statistical prediction based on patterns learned during training.

“Inference updates model weights.”

Weights are frozen during inference (unless you’re doing online learning, which is rare and risky). The model doesn’t learn from user interactions.

“More context always means better reasoning.”

Even with large windows, models can get “lost in the middle” and struggle to prioritize information. Quality of context placement and retrieval often outweighs sheer volume.

16. Relationship to Other Concepts

Inference ties together all the foundations: tokens provide the language; embeddings give them numerical form; the transformer processes them; attention builds context; the KV cache accelerates generation; and sampling adds controlled creativity.

17. Key Takeaways

Inference is the runtime generation process—it’s what users interact with, unlike batch training.
The pipeline: Tokenization → Embeddings → Prefill (parallel prompt processing) → Decode (autoregressive token generation).
KV cache is the critical optimization that reuses Keys and Values from previous tokens, avoiding repeated computation.
Prefill phase determines Time To First Token (TTFT); decode phase determines tokens‑per‑second throughput.
Sampling strategies (temperature, top‑p) control the randomness and creativity of outputs.
Inference is expensive due to model size, quadratic attention, sequential decoding, and memory bandwidth limits—managed by specialized frameworks like vLLM and TensorRT‑LLM.
Context window size directly affects inference cost and latency; efficient memory management is essential.
Understanding inference empowers you to design responsive, cost‑effective AI applications that deliver real‑time value.
