LLM Tokens Explained: The Building Blocks of Large Language Models
1. Introduction
When you interact with ChatGPT, Claude, Gemini, Llama, or DeepSeek, you quickly encounter language that revolves around a simple but powerful concept: tokens. You see "input tokens" and "output tokens" on billing dashboards, "token limits" in API documentation, and "context windows" described in thousands of tokens. But what exactly is a token? Why do AI models count tokens instead of words, characters, or sentences?
Tokens are the fundamental unit of information inside a large language model. Every piece of text you send or receive is broken down into tokens before the model can process it. Understanding tokens is not just an academic exercise: it directly affects the cost, latency, capacity, and behavior of every AI-powered application you build.
In this article, you'll learn:
- What tokens are, with concrete examples in English, code, and other languages.
- Why tokens are necessary and how they differ from words.
- How tokenization works under the hood, including popular algorithms like BPE and SentencePiece.
- Why token count determines context window size and API pricing.
- How to estimate tokens, optimize usage, and avoid common misconceptions.
By the end, you'll have a solid, production-oriented grasp of tokens that will help you design better prompts, manage costs, and architect AI systems that respect token limits.
What Is a Token?
A token is the smallest unit of text that an LLM can process in one step. It can be a whole word, a subword fragment, a character, or even a special symbol. The exact form depends on the tokenizer used by the model.
Word-Level Examples
In a simple word-level tokenizer, each word might be a token:
| Input Text | Tokens |
|---|---|
| "Hello world" | ["Hello", " world"] |
| "AI" | ["AI"] |
| "machine learning" | ["machine", " learning"] |
Subword Examples
Modern LLMs typically use subword tokenization, which splits rare or complex words into smaller pieces:
| Input Text | Tokens (BPE style) |
|---|---|
| "unbelievable" | ["un", "believ", "able"] |
| "running" | ["runn", "ing"] |
| "tokenization" | ["token", "ization"] |
| "LLM" | ["L", "L", "M"] |
Character-Level Examples
Some languages, such as Chinese and Japanese, are written without spaces between words, so tokenizers often fall back to character-level or byte-level splits:
| Input Text | Possible Tokens |
|---|---|
| "人工智能" | ["人", "工", "智能"] or ["人工智能"] (depends on tokenizer) |
| "日本語" | ["日", "本", "語"] or subword units |
The key takeaway: a token is not necessarily a word, a character, or a syllable. It is whatever the model's vocabulary defines as a single processing unit.
Why LLMs Use Tokens Instead of Words
If words are natural to humans, why not treat each word as a single unit? There are several practical problems:
- Vocabulary explosion: English alone has hundreds of thousands of words, and new words are invented constantly (slang, technical terms, brand names). A model that uses whole words would need a vocabulary in the millions, which makes the embedding and output layers impractically large.
- Rare and unknown words: A word-level model would have no way to handle misspellings, rare proper names, or out-of-vocabulary terms like "GPT-4o" or "Llama 3.1". Subword tokenization can represent any string by falling back to characters, bytes, or character n-grams.
- Multilingual support: Chinese, Japanese, Thai, and many other languages don't put spaces between words. A word-based approach would require language-specific segmentation, while subword tokenizers can treat text as a stream of bytes or characters and learn multilingual patterns.
- Code and structured data: Programming languages, JSON, and markdown have a completely different "word" structure. Tokenizers like those used in code LLMs can handle indentation, camelCase splitting, and operators gracefully.
Tokenization gives models a fixed-size vocabulary (typically 32,000–256,000 tokens) while maintaining the ability to represent any input text without unknown symbols.
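To see why a byte-based fallback means no input is ever "unknown", here is a minimal sketch that uses raw UTF-8 bytes as the base vocabulary. The function names are illustrative, and the sketch only shows the fallback layer: real tokenizers merge these bytes into larger learned tokens.

```python
def byte_fallback_ids(text: str) -> list[int]:
    """Map any string to IDs in the range 0-255, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Invert byte_fallback_ids."""
    return bytes(ids).decode("utf-8")


# ASCII letters take one byte each; the CJK character takes three, the emoji four.
for sample in ["cat", "人", "🙂"]:
    ids = byte_fallback_ids(sample)
    assert decode_bytes(ids) == sample  # the round trip is lossless
    print(ascii(sample), len(ids), ids)
```

With only 256 base symbols every string is representable, at the price of longer sequences for scripts whose characters need several bytes.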
How Text Becomes Tokens
The journey from raw text to model input is a pipeline:
- Raw Text: The user submits a prompt like "Hello, world!"
- Tokenizer: The text is split into tokens according to a predefined vocabulary and merging rules.
- Token Sequence: The output is a list of token strings, e.g., ["Hello", ",", " world", "!"].
- Token IDs: Each token is mapped to a unique integer from the vocabulary.
- Embeddings: These IDs are then converted into dense vectors by the embedding layer, which the Transformer can process.
The same tokenizer must be used consistently for both training and inference; otherwise the model receives completely mismatched input.
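The pipeline above can be sketched end to end with a toy tokenizer. The vocabulary, the IDs, and the 4-dimensional embedding values below are all hypothetical, and greedy longest-match splitting is a simplification standing in for a real tokenizer's merge rules.

```python
import random

# Hypothetical toy vocabulary: token string -> token ID.
VOCAB = {"Hello": 0, ",": 1, " world": 2, "!": 3, "<unk>": 4}
EMBED_DIM = 4

# Hypothetical embedding table: one fixed vector per token ID.
_rng = random.Random(0)
EMBEDDINGS = [[_rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]


def tokenize(text: str) -> list[str]:
    """Greedy longest-match split against VOCAB (a stand-in for real merge rules)."""
    pieces = sorted((t for t in VOCAB if t != "<unk>"), key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        match = next((p for p in pieces if text.startswith(p, i)), None)
        if match is None:
            tokens.append("<unk>")  # toy fallback; real tokenizers fall back to bytes
            i += 1
        else:
            tokens.append(match)
            i += len(match)
    return tokens


def encode(text: str) -> list[int]:
    """Token strings -> token IDs."""
    return [VOCAB[t] for t in tokenize(text)]


def embed(ids: list[int]) -> list[list[float]]:
    """Token IDs -> embedding vectors (a plain table lookup)."""
    return [EMBEDDINGS[i] for i in ids]


tokens = tokenize("Hello, world!")
ids = encode("Hello, world!")
vectors = embed(ids)
print(tokens)                         # ['Hello', ',', ' world', '!']
print(ids)                            # [0, 1, 2, 3]
print(len(vectors), len(vectors[0]))  # 4 4
```

Swapping in a different `VOCAB` changes every ID, which is why a model only makes sense with the tokenizer it was trained with.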
Tokenization vs Tokens
It's easy to confuse the two terms, but they refer to different things:
- Token: The output unit, a substring or symbol in the model's vocabulary. It's what you count, pay for, and measure context limits with.
- Tokenization: The process of splitting text into those units. Different tokenizers produce different token sequences from the same text.
A single word can become one token or several, depending on the tokenizer and its vocabulary. For instance, "unhappiness" might be tokenized as ["un", "happiness"] or ["un", "happi", "ness"], changing the token count.
Examples of Tokenization
To build intuition, let's see how the same text can be tokenized differently based on model and language.
English Sentence
Text: "The cat sat on the mat."
Common BPE tokenization might yield:
["The", " cat", " sat", " on", " the", " mat", "."] → 7 tokens.
Some tokenizers treat the leading space as part of the token (the space before "cat" is attached to "cat" itself), while others keep spaces separate.
Technical Text
Text: "Transformer architecture"
Tokens: ["Transformer", " architecture"] or ["Transform", "er", " architecture"] depending on the tokenizer. A tokenizer such as GPT-4's might split "Transformer" as ["Transform", "er"] if "Transform" is frequent on its own in its training data, for example in code and math contexts.
Programming Code
Python:
def hello():
    print("Hello, world!")
Tokens could be:
["def", " hello", "(", ")", ":", "\n", "    ", "print", "(", '"', "Hello", ",", " world", "!", '"', ")"]
Notice that the indentation ("    ") and the newline ("\n") are tokens too. For code models, whitespace tokens matter for understanding structure.
Chinese Text
Text: "人工智能是未来" (meaning "AI is the future")
Depending on the tokenizer, it might become:
- ["人工智能", "是", "未来"] if the tokenizer has learned common Chinese multi-character words.
- ["人", "工", "智", "能", "是", "未", "来"] if it falls back to characters.
The difference in token count can be dramatic, directly affecting cost and context efficiency for non-English content.
Token IDs
Models never see the token strings "cat" or "人工智能" directly. They operate on token IDs: integers that index into the vocabulary. A typical vocabulary might map:
| Token String | Token ID |
|---|---|
| "the" | 133 |
| "cat" | 4721 |
| "人工智能" | 29876 |
| "<\|endoftext\|>" | 50256 |
These IDs are what the embedding layer and subsequent Transformer layers consume. Special tokens like end-of-text, beginning-of-sequence, or padding markers also have fixed IDs and are critical for structuring prompts and responses.
Because IDs are just numbers, the same model can handle any language or script that its tokenizer can encode, without linguistic knowledge. This is why models can read mixed Chinese, English, code, and emojis in a single prompt.
From Tokens to Embeddings
Once text is converted to token IDs, each ID is fed into an embedding layer, which is a large lookup table. That table returns a dense vector of (for example) 4096 floating-point numbers for each ID. At this stage the vector is a fixed, context-independent representation learned during training; it does not yet reflect the surrounding words.
The embedding vectors are then combined with positional encodings and passed into the Transformer stack. The contextual meaning of "cat" in "cat sat" vs "cat scan" will be refined by attention layers, but the token ID and its initial embedding are the starting point.
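As a rough sketch of the lookup-then-add-position step, the code below assumes the classic sinusoidal positional encoding from the original Transformer design (many current models use other schemes, such as rotary embeddings) and a hypothetical 8-dimensional table with random values.

```python
import math
import random

VOCAB_SIZE, DIM = 10, 8  # hypothetical sizes; real models use e.g. 4096 dimensions
rng = random.Random(0)
embedding_table = [[rng.gauss(0, 0.02) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]


def positional_encoding(pos: int, dim: int) -> list[float]:
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    enc = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up each ID in the table, then add the encoding for its position."""
    out = []
    for pos, tid in enumerate(token_ids):
        pe = positional_encoding(pos, DIM)
        out.append([e + p for e, p in zip(embedding_table[tid], pe)])
    return out


same_token_twice = embed([3, 3])
# Same ID and same table row, but different positions give different input vectors.
print(same_token_twice[0] != same_token_twice[1])  # True
```

The table row for ID 3 never changes; only the positional term, and later the attention layers, make the two occurrences differ.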
Deep dive: Our Embeddings article explains how these vectors enable semantic search, similarity, and cross-lingual understanding.
Different Tokenization Strategies
There are three broad families of tokenization, with trade-offs:
| Strategy | How it works | Advantages | Disadvantages |
|---|---|---|---|
| Character / byte | Splits into individual characters or bytes | Tiny vocabulary (e.g., 256 for bytes), handles all text | Very long sequences, loses word boundaries |
| Word | Splits on spaces/punctuation | Intuitive, short sequences | Huge vocabulary, cannot represent out-of-vocabulary words |
| Subword | Splits into frequently occurring character n-grams | Fixed, manageable vocabulary, handles any text | Some arbitrary splits, language-dependent efficiency |
Modern LLMs overwhelmingly use subword tokenization. It balances vocabulary size and sequence length, and it gracefully handles rare terms by decomposing them into known fragments.
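The trade-off between sequence length and vocabulary size is easy to see on a single sentence. The two helper functions below are simplified illustrations (a regex-based word splitter is an assumption, not how any particular model works); subword tokenizers land between the two extremes.

```python
import re


def char_tokenize(text: str) -> list[str]:
    """Character-level: every character is a token."""
    return list(text)


def word_tokenize(text: str) -> list[str]:
    """Word-level: runs of word characters, plus each punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "Tokenization balances vocabulary size and sequence length."
chars, words = char_tokenize(text), word_tokenize(text)
print(len(chars), "character tokens,", len(set(chars)), "distinct symbols")
print(len(words), "word tokens,", len(set(words)), "distinct symbols")
```

On a large corpus the set of distinct characters stays small while the set of distinct words keeps growing, which is the vocabulary problem that subword methods address.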
Byte Pair Encoding (BPE)
BPE is the most common subword algorithm, used by GPT, Llama, and many others. It starts with a vocabulary of all individual characters (or bytes). Then it repeatedly merges the most frequent adjacent pair of tokens in the training corpus until the desired vocabulary size is reached.
Simplified example:
Training text: "low lower lowest"
- Start: l o w _ l o w e r _ l o w e s t
- Most frequent pair: l o → merge into lo
- Now: lo w _ lo w e r _ lo w e s t
- Next frequent pair: lo w → low
- Eventually: low, er, est, and spaces become tokens.
With that vocabulary, "lowest" is represented as ["low", "est"], and an unseen word such as "slower" can still be encoded from known pieces as ["s", "low", "er"]. This keeps the vocabulary small while minimizing sequence length.
BPE tends to work well across languages, but it can produce unintuitive splits in languages with long compound words, and it heavily favors patterns that are common in the training data.
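The merge loop can be written in a few lines. This is a toy sketch under stated assumptions: words are pre-split on whitespace (so spaces are not symbols here), ties go to the first pair encountered, and the function names are illustrative. Two merges reproduce the lo and low steps; on a corpus this small, later merges are decided by tiny counts and will not necessarily produce er and est.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merges from whitespace-separated words."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def apply_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low lower lowest", num_merges=2)
print(merges)                       # [('l', 'o'), ('lo', 'w')]
print(apply_bpe("lowest", merges))  # ['low', 'e', 's', 't']
print(apply_bpe("slower", merges))  # ['s', 'low', 'e', 'r']
```

Note that "slower" never appeared in the training text, yet it is still encoded without any unknown symbol.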
SentencePiece
SentencePiece is another subword tokenizer, used by many open-source models such as Llama 2, T5, and several Google models. Unlike conventional BPE implementations, which assume the text has already been pre-tokenized into words (usually on spaces), SentencePiece treats the input as a raw stream of Unicode characters and handles spaces as normal characters.
This has important advantages:
- Language agnostic: No reliance on language-specific word boundaries. It works identically for Chinese, Finnish, or English.
- Lossless tokenization: The normalized text can be perfectly reconstructed from the token sequence because the tokenizer models whitespace explicitly.
- Unigram language model: SentencePiece supports both BPE and a unigram language model; the unigram approach can yield more natural splits for some languages.
SentencePiece is the reason models like Llama 2 can handle code and multilingual text without a separate pre-tokenization step. Its vocabulary is trained directly on raw text, and with byte fallback enabled even unseen Unicode symbols get a token representation.
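The lossless property comes from treating whitespace as an ordinary symbol. The sketch below imitates that idea with the "▁" (U+2581) marker that SentencePiece uses for spaces. The splitting rule is a deliberately naive stand-in and the function names are hypothetical; this is not SentencePiece's actual algorithm or API, and it skips the normalization step.

```python
SPACE_MARK = "\u2581"  # "▁", a visible stand-in for a space


def to_pieces(text: str) -> list[str]:
    """Replace spaces with the marker, then cut before each marker (naive split)."""
    marked = text.replace(" ", SPACE_MARK)
    pieces, current = [], ""
    for ch in marked:
        if ch == SPACE_MARK and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces


def from_pieces(pieces: list[str]) -> str:
    """Detokenize: concatenate the pieces, then turn markers back into spaces."""
    return "".join(pieces).replace(SPACE_MARK, " ")


text = "Hello world again"
pieces = to_pieces(text)
print(ascii(pieces))  # markers shown as \u2581 escapes
assert from_pieces(pieces) == text  # in this sketch the round trip is exact
```

Because the space travels inside the piece that follows it, no separate rule is needed to decide where spaces go when decoding.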
Why Token Counts Matter
Everything the model does is measured in tokens. Understanding token counts is essential for:
- Processing: The model's forward pass must handle every token in the input. More tokens = more compute.
- Memory: The KV cache grows linearly with the number of tokens in the sequence, consuming precious GPU memory.
- Cost: API providers charge per token. Reducing token count directly reduces your bill.
- Performance: Longer sequences (more tokens) can degrade model attention quality and increase latency.
In production systems, a 10% reduction in tokens can mean roughly 10% lower cost and noticeably lower latency, so token efficiency is a concrete engineering goal.
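As a back-of-the-envelope illustration of the memory point, a common estimate for KV cache size is 2 (one key and one value tensor) × layers × KV heads × head dimension × bytes per value × tokens. The configuration below is a hypothetical one chosen for round numbers, not any specific model's.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size for one sequence (2 = one key + one value tensor)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical config: 32 layers, 8 KV heads of dimension 128, 16-bit values.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(1, **config)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Memory grows linearly with the number of tokens in the sequence.
for n in (1_000, 10_000, 100_000):
    print(n, "tokens:", round(kv_cache_bytes(n, **config) / 2**30, 2), "GiB")
```

Under these assumptions, every token trimmed from a prompt frees the same fixed amount of accelerator memory for that sequence.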
Tokens and Context Windows
The context window is the maximum number of tokens a model can process in one go. It includes everything: the system prompt, conversation history, retrieved documents, and the model's output so far.
| Model | Typical Context Window (tokens) |
|---|---|
| GPT-4o mini | 128K |
| Claude 3.5 Sonnet | 200K |
| Gemini 1.5 Pro | 1M–2M |
| Llama 3.1 8B | 128K |
[Figure: the components that fill the context window (system prompt, conversation history, retrieved documents, model output)]
If the total exceeds the window, the request is typically rejected by the API, or the application has to drop content first (usually the oldest tokens), potentially losing crucial instructions or facts. That's why context management (chunking, summarization, sliding windows) is a core skill in AI engineering.
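One common mitigation, a sliding window over conversation history, might look like the sketch below. The `count_tokens` helper is a hypothetical stand-in (a crude whitespace count) that should be replaced by the target model's real tokenizer, and representing messages as plain strings is likewise an assumption.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def fit_to_window(system_prompt: str, history: list[str], context_window: int,
                  reserve_for_output: int) -> list[str]:
    """Keep the system prompt, then as many of the newest messages as fit."""
    budget = context_window - reserve_for_output - count_tokens(system_prompt)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        budget -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order


history = ["one two three", "four five", "six seven eight nine", "ten"]
prompt = fit_to_window("be brief", history, context_window=12, reserve_for_output=4)
print(prompt)  # ['be brief', 'six seven eight nine', 'ten']
```

The system prompt is budgeted first so that instructions survive even when older turns are dropped, and `reserve_for_output` keeps room for the reply.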
More context techniques: Read our Context Window article for deep dives on long-context architectures and mitigation strategies.
Tokens and AI Pricing
Commercial LLM APIs bill you per token, often with different rates for input and output tokens (output usually costs more because output tokens are generated one at a time, while input tokens can be processed in parallel).
Example pricing scenario (hypothetical but representative):
- Input: $0.00015 per 1,000 tokens
- Output: $0.00060 per 1,000 tokens
A single customer support chat turn that sends 2,000 tokens of conversation and receives a 500-token answer costs:
- Input: 2 × $0.00015 = $0.0003
- Output: 0.5 × $0.00060 = $0.0003
- Total per turn: $0.0006
That looks negligible, but every turn resends the conversation history, and larger models can cost one or two orders of magnitude more per token. At scale (millions of conversations per month), token count optimization can save thousands of dollars or more. That's why techniques like prompt compression, few-shot example pruning, and output length limits are engineering imperatives, not niceties.
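This kind of arithmetic can be wrapped in a small helper. In the sketch below, rates are passed in as dollars per million tokens, which is how many providers quote them. The rates and traffic figures are placeholders for a separate illustration, not any provider's real prices, so substitute the numbers from a current price sheet.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Cost in dollars, with rates quoted per one million tokens."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000


# Placeholder rates (dollars per 1M tokens); not any provider's real prices.
IN_RATE, OUT_RATE = 1.00, 4.00

per_turn = cost_usd(3_000, 400, IN_RATE, OUT_RATE)
print(f"${per_turn:.4f} per turn")  # $0.0046 per turn

turns_per_month = 1_000_000
monthly = cost_usd(3_000 * turns_per_month, 400 * turns_per_month, IN_RATE, OUT_RATE)
print(f"${monthly:,.2f} per month")  # $4,600.00 per month

# Trimming 20% of the input tokens (3,000 -> 2,400 per turn) at the same volume.
trimmed = cost_usd(2_400 * turns_per_month, 400 * turns_per_month, IN_RATE, OUT_RATE)
print(f"${monthly - trimmed:,.2f} saved per month")  # $600.00 saved per month
```

Keeping input and output rates as separate parameters makes it easy to see which side of the bill a given optimization actually touches.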
Tokens and Model Performance
Tokenization influences how well a model performs, often in subtle ways:
- Efficiency: The same idea expressed with fewer tokens is faster and cheaper. For example, a Chinese prompt that is tokenized into many characters may cost 3× more than an English equivalent.
- Accuracy: Some tokenizers break numbers arbitrarily (e.g., "1234" → ["1", "23", "4"]). This can harm arithmetic and code understanding. Some newer tokenizers are designed with "number-aware" splitting.
- Multilingual support: Models with predominantly English training data may tokenize other scripts very inefficiently, leading to inflated costs and reduced accuracy.
- Compression: A tokenizer that learns frequent words or phrases as single tokens (e.g., "thank you" → one token) reduces sequence length.
When evaluating or fine-tuning a model, it's worth understanding the tokenizer's language biases, because they directly affect the end-user experience.
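To make the number-splitting point concrete: one simple form of number-aware splitting is to cap every run of digits at a fixed length during pre-tokenization, so that numbers are always cut the same way. The sketch below assumes groups of up to three digits, read left to right. The function name and the exact rule are illustrative, and real tokenizers differ in the details.

```python
import re


def split_digits(text: str, max_digits: int = 3) -> list[str]:
    """Split text so that no piece contains more than max_digits digits in a row."""
    pattern = rf"\d{{1,{max_digits}}}|\D+"
    return re.findall(pattern, text)


print(split_digits("Order 1234567 shipped"))
# ['Order ', '123', '456', '7', ' shipped']
```

A predictable rule like this means the same number is always split into the same pieces, instead of depending on which digit sequences happened to be frequent in the training corpus.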
Tokens in Programming and RAG Systems
Tokens are not just an academic concept; they are a daily constraint when building:
- AI Chatbots: You must carefully fit the system prompt, personality instructions, recent conversation turns, and the new user message within the context window, while leaving room for the model's reply.
- AI Agents: Agents often call tools, ingest tool outputs, and carry long conversation histories. Every tool response consumes tokens, so you need to be judicious about what you feed back into the model.
- RAG Systems: Retrieved documents are tokenized and inserted into the prompt. To maximize the useful information within the window, you need to count tokens, split documents intelligently, and rank by relevance rather than dump everything in.
- Coding Assistants: The entire open file, plus imports and relevant snippets, must be tokenized and fit into the context. Optimizing which code to include (and how) directly affects the quality of completions.
In all these cases, a token counter becomes as essential as a line profiler; you can't manage what you don't measure.
Common Token Misconceptions
"One word equals one token."
False. While many common short English words are a single token, longer or rarer words almost always become multiple tokens. "Internationalization" is often split into pieces such as ["Intern", "ational", "ization"].
"One character equals one token."
Not in modern subword tokenizers. English characters frequently merge into larger tokens, while some Unicode characters may span multiple tokens (byte-level fallback).
"Token count is always predictable."
It's not. The same text can produce different token counts across models, even if both use BPE. Vocabulary size and training corpus influence the splits. Always use the specific model's tokenizer to count tokens.
"Larger context means unlimited memory."
A bigger window only increases the capacity; the model can still "forget" information buried in the middle or fail to use the full window effectively. Effective context utilization is a research challenge, not a solved problem.
Estimating Token Counts
While you should always use a precise tokenizer (like tiktoken for OpenAI models, or the tokenizer files shipped with open-weight models such as Llama), here are rough heuristics:
| Content Type | Approximate Tokens |
|---|---|
| 1 English word (common) | 1–1.3 tokens |
| 1 Chinese character | 1.5–3 tokens |
| 1 line of Python code | 5–15 tokens |
| 1 paragraph (150 words) | 180–250 tokens |
| 1 blog article (1000 words) | 1300–1500 tokens |
| 1 documentation page | 3000–6000 tokens |
| 1 average PDF page | 500–1000 tokens (text only) |
| 1 JSON object (small) | 20–80 tokens |
A reliable rule of thumb: For English, 1 token ≈ 0.75 words, or 750 words ≈ 1000 tokens. For code, it's highly variable; indentation and syntax add many tokens.
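The rule of thumb translates directly into a quick estimator. This is only a budgeting aid for English prose, under the assumption that words are separated by whitespace; the function name is illustrative, and actual counts should come from the model's own tokenizer.

```python
import math


def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough English-only estimate from the 1 token ≈ 0.75 words heuristic."""
    return math.ceil(len(text.split()) / words_per_token)


print(estimate_tokens("The cat sat"))   # 4
print(estimate_tokens("word " * 750))   # 1000
```

Rounding up errs on the side of overestimating, which is usually the safer direction when checking against a context limit.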
How Developers Optimize Token Usage
Production systems actively manage tokens to stay within limits and reduce cost:
- Prompt Compression: Use summarization models or heuristic trimming to shorten long instructions while preserving intent.
- Context Management: Implement sliding windows, conversation summarization, or periodic state resets so that only the most relevant history remains.
- Chunking: In RAG, split documents into overlapping chunks of ~500 tokens, rank them, and insert only the top-k into the prompt.
- Summarization: Ask the model itself to compress a long document before inserting it into the prompt, reducing token count.
- RAG Retrieval Optimization: Use better embeddings or hybrid search to retrieve fewer, more relevant chunks, minimizing wasted tokens.
- Output Limit Control: Set max_tokens and stop sequences appropriately to avoid verbose, costly generations.
- Token-aware Prompt Design: Experiment with instruction phrasing that yields concise answers, and prune excessive few-shot examples.
These techniques shift token usage from a black-box budget to a controllable dial.
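As one example from the list, overlapping chunking operates on the token sequence rather than on characters. The sketch below uses small numbers and placeholder integer IDs for readability; in practice the input would be real token IDs and the chunk size would be closer to the ~500 tokens mentioned above. The function name and signature are illustrative.

```python
def chunk_tokens(tokens: list, chunk_size: int, overlap: int) -> list[list]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = list(range(12))  # stand-in for real token IDs
for chunk in chunk_tokens(tokens, chunk_size=5, overlap=2):
    print(chunk)
# [0, 1, 2, 3, 4]
# [3, 4, 5, 6, 7]
# [6, 7, 8, 9, 10]
# [9, 10, 11]
```

The overlap keeps a sentence that straddles a boundary fully visible in at least one chunk, at the cost of storing some tokens twice.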
Relationship to Other LLM Concepts
Tokens are the thread that ties the entire LLM stack together:
- Tokenization: Defines how text becomes tokens.
- Embeddings: Token IDs are mapped to vectors.
- Transformer: Processes sequences of token embeddings.
- Context Window: Measured in tokens, bounding capacity.
- Inference: Generates output tokens one by one.
- RAG: Carefully manages token budget for retrieved documents.
- Prompt Engineering: Shapes the token sequence to elicit desired behavior.
Grasping tokens means you understand the unit of currency in the LLM economy, and you're ready to design systems that handle them with precision.
Key Takeaways
- Tokens are the fundamental processing units of LLMs, not words or characters. They can be whole words, subword pieces, or individual characters.
- Tokenization is the critical first step that converts text into token IDs; the choice of tokenizer impacts multilingual performance, cost, and accuracy.
- Context windows are measured in tokens: every prompt, history, and generated word consumes a limited budget that must be managed.
- AI pricing revolves around tokens; optimizing token count has a direct, measurable impact on your operational costs.
- Token count estimation is essential for engineering AI applications. Use model-specific tokenizers and build token-aware pipelines.
- Developer techniques like prompt compression, chunking, and context summarization keep applications within token budgets without sacrificing quality.
- Tokens connect every LLM component, from embeddings and attention to inference and RAG. Mastering tokens is one of the first steps to building production-grade AI systems.