LLM Tokens Explained: The Building Blocks of Large Language Models

Introduction

When you interact with ChatGPT, Claude, Gemini, Llama, or DeepSeek, you quickly encounter language that revolves around a simple but powerful concept: tokens. You see “input tokens” and “output tokens” on billing dashboards, “token limits” in API documentation, and “context windows” described in thousands of tokens. But what exactly is a token? Why do AI models count tokens instead of words, characters, or sentences?

Tokens are the fundamental unit of information inside a large language model. Every piece of text you send or receive is broken down into tokens before the model can process it. Understanding tokens is not just an academic exercise—it directly affects the cost, latency, capacity, and behavior of every AI‑powered application you build.

In this article, you’ll learn:

  • What tokens are, with concrete examples in English, code, and other languages.
  • Why tokens are necessary and how they differ from words.
  • How tokenization works under the hood, including popular algorithms like BPE and SentencePiece.
  • Why token count determines context window size and API pricing.
  • How to estimate tokens, optimize usage, and avoid common misconceptions.

By the end, you’ll have a solid, production‑oriented grasp of tokens that will help you design better prompts, manage costs, and architect AI systems that respect token limits.

What Is a Token?​

A token is the smallest unit of text that an LLM can process in one step. It can be a whole word, a sub‑word fragment, a character, or even a special symbol. The exact form depends on the tokenizer used by the model.

Word‑Level Examples​

In a simple word‑level tokenizer, each word might be a token:

| Input Text | Tokens |
| --- | --- |
| "Hello world" | ["Hello", " world"] |
| "AI" | ["AI"] |
| "machine learning" | ["machine", " learning"] |

Subword Examples​

Modern LLMs typically use subword tokenization, which splits rare or complex words into smaller pieces:

| Input Text | Tokens (BPE style, illustrative) |
| --- | --- |
| "unbelievable" | ["un", "believ", "able"] |
| "running" | ["runn", "ing"] |
| "tokenization" | ["token", "ization"] |
| "LLM" | ["L", "L", "M"] |

Character‑Level Examples​

Some languages, such as Chinese and Japanese, do not separate words with spaces, so tokenizers often split them at the character or sub-character level:

| Input Text | Possible Tokens |
| --- | --- |
| "äșșć·„æ™ș胜" | ["äșș", "ć·„", "æ™ș胜"] or ["äșșć·„æ™ș胜"] (depends on tokenizer) |
| "æ—„æœŹèȘž" | ["æ—„", "æœŹ", "èȘž"] or subword units |

The key takeaway: a token is not necessarily a word, a character, or a syllable—it’s whatever the model’s vocabulary defines as a single processing unit.

Why LLMs Use Tokens Instead of Words​

If words are natural to humans, why not treat each word as a single unit? There are several practical problems:

  ‱ Vocabulary explosion: English alone has hundreds of thousands of words, and new words are invented constantly (slang, technical terms, brand names). A whole-word model covering many languages would need a vocabulary in the millions, which bloats the embedding and output layers and leaves most entries too rare to learn well.
  • Rare and unknown words: A word‑level model would have no way to handle misspellings, rare proper names, or out‑of‑vocabulary terms like “GPT‑4o” or “Llama 3.1”. Subword tokenization can represent any string by falling back to characters or character n‑grams.
  • Multilingual support: Chinese, Japanese, Korean, and many other languages don’t use spaces. A word‑based approach would require language‑specific segmentation, while subword tokenizers can treat text as a stream of bytes or characters and learn multilingual patterns.
  • Code and structured data: Programming languages, JSON, and markdown have a completely different “word” structure. Tokenizers like those used in code‑LLMs can handle indentation, camelCase splitting, and operators gracefully.

Tokenization gives models a fixed‑size vocabulary (typically 32,000–256,000 tokens) while maintaining the ability to represent any input text without unknown symbols.

How Text Becomes Tokens​

The journey from raw text to model input is a pipeline:

  1. Raw Text: The user submits a prompt like “Hello, world!”
  2. Tokenizer: The text is split into tokens according to a predefined vocabulary and merging rules.
  3. Token Sequence: The output is a list of token strings, e.g., ["Hello", ",", " world", "!"].
  4. Token IDs: Each token is mapped to a unique integer from the vocabulary.
  5. Embeddings: These IDs are then converted into dense vectors by the embedding layer, which the Transformer can process.

The same tokenizer must be used consistently for both training and inference, otherwise the model receives completely mismatched input.
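The pipeline above can be sketched in a few lines. This is a toy illustration, not a real tokenizer: the vocabulary, its IDs, and the `tokenize` and `to_ids` helpers are all hypothetical, and the greedy longest-match split stands in for the merge rules a trained tokenizer would apply.

```python
# Toy vocabulary: hypothetical strings and IDs, not taken from any real model.
VOCAB = {"Hello": 0, ",": 1, " world": 2, "!": 3, "<unk>": 4}


def tokenize(text: str, vocab: dict) -> list:
    """Greedy longest-match split of `text` into vocabulary entries."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so " world" beats shorter pieces.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit a placeholder token.
            tokens.append("<unk>")
            i += 1
    return tokens


def to_ids(tokens: list, vocab: dict) -> list:
    """Map token strings to their integer IDs."""
    return [vocab[t] for t in tokens]


tokens = tokenize("Hello, world!", VOCAB)
ids = to_ids(tokens, VOCAB)
print(tokens)
print(ids)
```

The IDs, not the strings, are what the embedding layer receives in step 5.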

Tokenization vs Tokens​

It’s easy to confuse the two terms, but they refer to different things:

  • Token: The output unit—a substring or symbol in the model’s vocabulary. It’s what you count, pay for, and measure context limits with.
  • Tokenization: The process of splitting text into those units. Different tokenizers produce different token sequences from the same text.

A single word can become one token or several, depending on the tokenizer and its vocabulary. For instance, “unhappiness” might be tokenized as ["un", "happiness"] or ["un", "happy", "ness"], changing the token count.

Examples of Tokenization​

To build intuition, let’s see how the same text can be tokenized differently based on model and language.

English Sentence​

Text: "The cat sat on the mat."

Common BPE tokenization might yield: ["The", " cat", " sat", " on", " the", " mat", "."] → 7 tokens.

Some tokenizers treat the leading space as part of the token (the space before “cat” is attached to “cat” itself), while others keep spaces separate.

Technical Text​

Text: "Transformer architecture"

Tokens: ["Transformer", " architecture"] or ["Transform", "er", " architecture"], depending on the tokenizer. A tokenizer might split “Transformer” as ["Transform", "er"] if “Transform” was much more frequent in its training data, for example because it is common in code and math contexts.

Programming Code​

Python:

def hello():
    print("Hello, world!")

Tokens could be: ["def", " hello", "(", ")", ":", "\n", "    ", "print", "(", '"', "Hello", ",", " world", "!", '"', ")"]

Notice that the indentation ("    ") and the newline ("\n") are tokens. For code models, whitespace tokens carry structural information.

Chinese Text​

Text: "äșșć·„æ™șèƒœæ˜ŻæœȘ杄" (meaning “AI is the future”)

Depending on the tokenizer, it might become:

  ‱ ["äșșć·„æ™ș胜", "æ˜Ż", "æœȘ杄"] if the tokenizer's vocabulary contains common multi-character Chinese words.
  ‱ ["äșș", "ć·„", "æ™ș", "胜", "æ˜Ż", "æœȘ", "杄"] if it falls back to characters.

The difference in token count can be dramatic, directly affecting cost and context efficiency for non‑English content.

Token IDs​

Models never see the token strings "cat" or "。" directly. They operate on token IDs: integers that index into the vocabulary. A vocabulary might map tokens like this (the IDs are illustrative):

| Token String | Token ID |
| --- | --- |
| "the" | 133 |
| "cat" | 4721 |
| "äșșć·„æ™ș胜" | 29876 |
| "<|endoftext|>" | 50256 |

These IDs are what the embedding layer and subsequent Transformer layers consume. Special tokens like end‑of‑text, beginning‑of‑sequence, or padding markers also have fixed IDs and are critical for structuring prompts and responses.

Because IDs are just numbers, the same model can handle any language or script that its tokenizer can encode, without linguistic knowledge. This is why models can read mixed Chinese, English, code, and emojis in a single prompt.

From Tokens to Embeddings​

Once text is converted to token IDs, each ID is fed into an embedding layer, which is a large lookup table. That table returns a dense vector of (for example) 4096 floating-point numbers for each ID. At this stage the vector is a fixed, context-independent representation learned during training: the same ID always maps to the same vector, whatever surrounds it.

The embedding vectors are then combined with positional encodings and passed into the Transformer stack. The contextual meaning of "cat" in “cat sat” vs “cat scan” will be refined by attention layers, but the token ID and its initial embedding are the starting point.
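To make the lookup concrete, here is a minimal sketch of an embedding table in plain Python. The sizes, the random values, and the token IDs assigned to "cat", " sat", and " scan" are all made up for illustration; a real model learns these values during training and uses far larger dimensions.

```python
import random

EMBED_DIM = 4    # real models use thousands of dimensions
VOCAB_SIZE = 10  # hypothetical tiny vocabulary

rng = random.Random(0)
# One row per token ID. Random numbers stand in for learned weights.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids: list) -> list:
    """Look up one vector per token ID; no context is involved."""
    return [embedding_table[i] for i in token_ids]


# Pretend 7 = "cat", 2 = " sat", 5 = " scan" (hypothetical IDs).
cat_sat = embed([7, 2])
cat_scan = embed([7, 5])

# "cat" starts from the same vector in both phrases; only the later
# attention layers make its representation context-dependent.
print(cat_sat[0] == cat_scan[0])
```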

Deep dive: Our Embeddings article explains how these vectors enable semantic search, similarity, and cross‑lingual understanding.

Different Tokenization Strategies​

There are three broad families of tokenization, with trade‑offs:

| Strategy | How it works | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Character | Splits into individual characters (or bytes) | Tiny vocabulary (e.g., 256 for bytes), handles all text | Very long sequences, loses word boundaries |
| Word | Splits on spaces/punctuation | Intuitive, short sequences | Huge vocabulary, cannot represent out-of-vocabulary words |
| Subword | Splits into frequently occurring character sequences | Fixed, manageable vocabulary, handles any text | Some arbitrary splits, language-dependent efficiency |

Modern LLMs overwhelmingly use subword tokenization. It balances vocabulary size and sequence length, and it gracefully handles rare terms by decomposing them into known fragments.
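The character and word trade-offs can be seen on a tiny example. This is a sketch under simplifying assumptions: both "tokenizers" are hypothetical helpers built from a single training sentence, and words are split on whitespace only.

```python
train_text = "the cat sat on the mat"

# Vocabularies built from the training sentence only.
word_vocab = set(train_text.split())
char_vocab = set(train_text)


def word_tokenize(text: str) -> list:
    """Word-level: unseen words collapse to a single <unk> token."""
    return [w if w in word_vocab else "<unk>" for w in text.split()]


def char_tokenize(text: str) -> list:
    """Character-level: covers new words, but sequences get long."""
    return [c if c in char_vocab else "<unk>" for c in text]


new_text = "the cat sat on the hat"  # "hat" never appeared in training

word_tokens = word_tokenize(new_text)
char_tokens = char_tokenize(new_text)

print(word_tokens)       # the last word is lost as <unk>
print(len(word_tokens))  # short sequence
print(len(char_tokens))  # much longer sequence, but nothing is lost
```

Subword tokenization sits between these two extremes: it keeps sequences closer to word-level length while retaining the character-level ability to spell out unseen words.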

Byte Pair Encoding (BPE)​

BPE is the most common subword algorithm, used by GPT, Llama, and many others. It starts with a vocabulary of all individual characters (or bytes). Then it repeatedly merges the most frequent adjacent pair of tokens in the training corpus until the desired vocabulary size is reached.

Simplified example:

Training text: "low lower lowest"

  • Start: l o w _ l o w e r _ l o w e s t
  • Most frequent pair: l o → merge into lo
  • Now: lo w _ lo w e r _ lo w e s t
  • Next frequent pair: lo w → low
  • Eventually: low, er, est, and spaces become tokens.

The resulting vocabulary can represent unseen words from known pieces, for example “slower” as ["s", "low", "er"]. This keeps the vocabulary small while keeping sequences reasonably short.

BPE tends to work well across languages, but it can produce unintuitive splits in languages with heavy compounding or rich morphology, and it heavily favors patterns that are common in the training data.
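The merge loop can be sketched as follows. This is a minimal teaching version, not a production trainer: `train_bpe`, `merge_pair`, and `encode` are hypothetical helper names, words are pre-split on whitespace (so the space symbol is not merged), and ties between equally frequent pairs go to the pair seen first. Only the first two merges of the example are reproduced here; which merges come later depends on the corpus counts.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: str, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        merges.append(best)
        words = Counter({merge_pair(s, best): f for s, f in words.items()})
    return merges


def encode(word: str, merges: list) -> list:
    """Apply learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low lower lowest", num_merges=2)
print(merges)
print(encode("slow", merges))
```

Because encoding replays the merges in training order, a word never seen during training still decomposes into known symbols.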

SentencePiece​

SentencePiece is a widely used open-source tokenizer library that implements both BPE and a unigram language model. It is used by T5, Llama 1 and 2, and several Google models. Unlike BPE implementations that first split text into words on whitespace, SentencePiece treats the input as a raw stream of Unicode characters and encodes spaces as an ordinary symbol (shown as "▁").

This has important advantages:

  ‱ Language agnostic: No reliance on language-specific word boundaries. It works identically for Chinese, Finnish, or English.
  ‱ Lossless tokenization: The (normalized) text can be reconstructed from the token sequence because the tokenizer models whitespace explicitly.
  ‱ Unigram language model: Besides BPE, SentencePiece offers a unigram language model algorithm, which can yield more natural splits for some languages.

SentencePiece is why Llama 1 and 2 can handle code and multilingual text without a separate pre-tokenization step. The vocabulary is trained on raw text, and when byte fallback is enabled, characters missing from the vocabulary are decomposed into UTF-8 byte tokens, so even unseen Unicode symbols get a token representation.
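The idea of treating whitespace as a regular symbol can be illustrated without the library itself. The sketch below is a naive stand-in, not the SentencePiece algorithm: `to_pieces` and `from_pieces` are hypothetical helpers that mark spaces with "▁" and cut before each marker, where a trained model would choose the pieces. It also assumes the input text does not already contain the marker character.

```python
META = "\u2581"  # "▁", the visible marker used for spaces


def to_pieces(text: str) -> list:
    """Mark every space (plus a leading one), then cut before each marker."""
    marked = META + text.replace(" ", META)
    pieces = []
    current = ""
    for ch in marked:
        if ch == META and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces


def from_pieces(pieces: list) -> str:
    """Concatenate pieces, turn markers back into spaces, drop the prefix."""
    return "".join(pieces).replace(META, " ")[1:]


pieces = to_pieces("Hello world")
print(pieces)
print(from_pieces(pieces) == "Hello world")
```

Because the spaces live inside the pieces, decoding is plain concatenation, with no language-specific rules for where spaces belong.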

Why Token Counts Matter​

Everything the model does is measured in tokens. Understanding token counts is essential for:

  • Processing: The model’s forward pass must handle every token in the input. More tokens = more compute.
  • Memory: The KV cache grows linearly with the number of tokens in the sequence, consuming precious GPU memory.
  • Cost: API providers charge per token. Reducing token count directly reduces your bill.
  • Performance: Longer sequences (more tokens) can degrade model attention quality and increase latency.

In production systems, a 10% reduction in tokens translates into roughly 10% lower per-token cost and usually lower latency as well, so token efficiency is a concrete engineering goal.
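The linear growth of the KV cache is easy to estimate. The sketch below uses the standard accounting (one key and one value vector per layer, per KV head, per token), but the model configuration is hypothetical and the function name `kv_cache_bytes` is made up for this example.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes assumes 16-bit floats
) -> int:
    """Approximate KV cache size for one sequence."""
    # Factor 2: one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical configuration, chosen only to make the arithmetic round.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **config)
full_context = kv_cache_bytes(8192, **config)

print(per_token / 1024, "KiB per token")
print(full_context / 2**30, "GiB for 8192 tokens")
```

Under these assumptions each extra token costs a fixed amount of memory, so halving the prompt halves the cache.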

Tokens and Context Windows​

The context window is the maximum number of tokens a model can process in one go. It includes everything: the system prompt, conversation history, retrieved documents, and the model’s output so far.

| Model | Typical Context Window (tokens) |
| --- | --- |
| GPT-4o mini | 128K |
| Claude 3.5 Sonnet | 200K |
| Gemini 1.5 Pro | 1M–2M |
| Llama 3.1 8B | 128K |

[Figure: the components that fill the context window (system prompt, conversation history, retrieved documents, and model output)]

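If the total exceeds the window, the request cannot be processed as is: most APIs reject it with an error, while applications and serving frameworks that truncate automatically usually drop the oldest tokens, potentially losing crucial instructions or facts. That is why context management (chunking, summarization, sliding windows) is a core skill in AI engineering.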
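One common way to stay inside the window is a sliding window over the conversation history that always keeps the system prompt. The sketch below is a simplified illustration: `fit_to_window` is a hypothetical helper, and `count_words` is only a stand-in for the model's real tokenizer.

```python
def count_words(text: str) -> int:
    """Stand-in counter; swap in the model's own tokenizer in practice."""
    return len(text.split())


def fit_to_window(system_prompt, history, max_tokens, reserve_for_output,
                  count_tokens=count_words):
    """Keep the system prompt plus as many of the newest messages as fit."""
    budget = max_tokens - reserve_for_output - count_tokens(system_prompt)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # everything older than this is dropped too
        kept.append(message)
        budget -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order


history = ["a b c", "d e", "f g h i"]
window = fit_to_window("sys prompt", history, max_tokens=10, reserve_for_output=2)
print(window)
```

Reserving room for the output up front matters because the reply is counted against the same window.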

More context techniques: Read our Context Window article for deep dives on long‑context architectures and mitigation strategies.

Tokens and AI Pricing​

Commercial LLM APIs bill you per token, often with different rates for input and output tokens. Output usually costs more because output tokens are generated one at a time, whereas input tokens can be processed in parallel.

Example pricing scenario (hypothetical, in the range of small hosted models):

  ‱ Input: $0.00015 per 1,000 tokens ($0.15 per million)
  ‱ Output: $0.00060 per 1,000 tokens ($0.60 per million)

A single customer support chat turn that sends 2,000 tokens of conversation and receives a 500-token answer costs:

  ‱ Input: 2 × $0.00015 = $0.0003
  ‱ Output: 0.5 × $0.00060 = $0.0003
  ‱ Total per turn: $0.0006

At scale, these fractions of a cent add up: 10 million such turns per month cost about $6,000, and larger models can be priced tens of times higher per token. That is why techniques like prompt compression, few-shot example pruning, and output length limits are engineering imperatives, not niceties.
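The same arithmetic is worth wrapping in a small helper so it can be rerun with real rates. The prices below are assumptions for illustration, quoted per million tokens as many providers do; `request_cost` is a hypothetical function name.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost in dollars of one request under per-million-token pricing."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical rates: $0.15 per million input tokens, $0.60 per million output.
cost_per_turn = request_cost(2000, 500, 0.15, 0.60)
monthly_cost = cost_per_turn * 10_000_000  # 10 million turns per month

print(f"per turn: ${cost_per_turn:.4f}")
print(f"per month: ${monthly_cost:,.0f}")
```

Keeping input and output prices as separate parameters makes it easy to see which side of the request dominates the bill.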

Tokens and Model Performance​

Tokenization directly influences how well a model performs in subtle ways:

  ‱ Efficiency: The same idea expressed with fewer tokens is faster and cheaper. For example, a Chinese prompt that is tokenized character by character, or into bytes, may cost several times more than an English equivalent.
  ‱ Accuracy: Some tokenizers break numbers arbitrarily (e.g., “1234” → ["1", "23", "4"]), which can harm arithmetic and code understanding. Some newer tokenizers split numbers into single digits or fixed-size digit groups to avoid this.
  ‱ Multilingual support: Models with predominantly English training data may tokenize other scripts very inefficiently, leading to inflated costs and reduced accuracy.
  ‱ Compression: A tokenizer that learns frequent phrases as single tokens (e.g., “thank you” → one token) reduces sequence length, so more content fits in the same context window.

When evaluating or fine‑tuning a model, it’s worth understanding the tokenizer’s language biases—they directly affect the end‑user experience.

Tokens in Programming and RAG Systems​

Tokens are not just an academic concept; they are a daily constraint when building:

  • AI Chatbots: You must carefully fit the system prompt, personality instructions, recent conversation turns, and the new user message within the context window—while leaving room for the model’s reply.
  • AI Agents: Agents often call tools, ingest tool outputs, and carry long conversation histories. Every tool response consumes tokens, so you need to be judicious about what you feed back into the model.
  • RAG Systems: Retrieved documents are tokenized and inserted into the prompt. To maximize the useful information within the window, you need to count tokens, split documents intelligently, and rank by relevance—not just dump everything.
  • Coding Assistants: The entire open file, plus imports and relevant snippets, must be tokenized and fit into the context. Optimizing which code to include (and how) directly affects the quality of completions.

In all these cases, a token counter becomes as essential as a line profiler; you can’t manage what you don’t measure.
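For the RAG case, a token budget can be enforced with a simple greedy selection. This is a sketch under stated assumptions: `select_chunks` is a hypothetical helper, the relevance scores are assumed to come from a retriever, and the word counter stands in for the model's tokenizer.

```python
def count_words(text: str) -> int:
    """Stand-in counter; use the target model's tokenizer in practice."""
    return len(text.split())


def select_chunks(chunks, budget, count_tokens=count_words):
    """Greedily pick the most relevant chunks that still fit the budget.

    `chunks` is a list of (relevance_score, text) pairs.
    """
    selected = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return selected, used


chunks = [
    (0.9, "alpha beta gamma"),
    (0.5, "delta epsilon"),
    (0.8, "zeta eta theta iota"),
]
selected, used = select_chunks(chunks, budget=5)
print(selected, used)
```

Note that the second-best chunk is skipped here because it does not fit, while a smaller, less relevant one does; whether that trade is acceptable depends on the application.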

Common Token Misconceptions​

“One word equals one token.”​

False. While many common short English words are a single token, longer or rarer words almost always become multiple tokens. “Internationalization” is often ["Intern", "ational", "ization"].

“One character equals one token.”​

Not in modern subword tokenizers. English characters frequently merge into larger tokens, while some Unicode characters may span multiple tokens (byte‑level fallback).
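The byte-level fallback is easier to picture by looking at UTF-8 lengths. In a byte-level tokenizer, a character with no merged vocabulary entry can cost up to one token per byte; the helper name `utf8_length` is made up for this sketch.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "a": "a",                        # ASCII letter
    "e-acute": "\u00e9",             # precomposed Latin letter with accent
    "CJK 'person'": "\u4eba",        # common Chinese character
    "slight smile": "\U0001F642",    # emoji
}

for name, ch in samples.items():
    print(name, utf8_length(ch))
```

This is the worst case: frequent characters usually have their own vocabulary entries and cost a single token.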

“Token count is always predictable.”​

It’s not. The same text can produce different token counts across models, even if both use BPE. Vocabulary size and training corpus influence the splits. Always use the specific model’s tokenizer to count tokens.

“Larger context means unlimited memory.”​

A bigger window only increases the capacity, but the model can still “forget” information buried in the middle or fail to use the full window effectively. Effective context utilization is a research challenge, not a solved problem.

Estimating Token Counts​

While you should always use the model's own tokenizer for exact counts (such as tiktoken for OpenAI models or the tokenizer files shipped with Llama), here are rough heuristics:

| Content Type | Approximate Tokens |
| --- | --- |
| 1 English word (common) | 1–1.3 tokens |
| 1 Chinese character | roughly 1–3 tokens, depending on the tokenizer |
| 1 line of Python code | 5–15 tokens |
| 1 paragraph (150 words) | 180–250 tokens |
| 1 blog article (1000 words) | 1300–1500 tokens |
| 1 documentation page | 3000–6000 tokens |
| 1 average PDF page | 500–1000 tokens (text only) |
| 1 JSON object (small) | 20–80 tokens |

A reliable rule of thumb: For English, 1 token ≈ 0.75 words, or 750 words ≈ 1000 tokens. For code, it’s highly variable; indentation and syntax add many tokens.
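The rule of thumb translates directly into a quick estimator. This sketch is only a rough budget check for English prose, not a replacement for the model's tokenizer; `estimate_tokens` is a hypothetical helper and the 0.75 ratio is the heuristic stated above.

```python
import math


def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate for English prose from a whitespace word count."""
    word_count = len(text.split())
    return math.ceil(word_count / words_per_token)


article = "word " * 750  # stand-in for a 750-word article
print(estimate_tokens(article))
print(estimate_tokens("The cat sat on the mat."))
```

For billing, hard context limits, or non-English text, count with the actual tokenizer instead.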

How Developers Optimize Token Usage​

Production systems actively manage tokens to stay within limits and reduce cost:

  • Prompt Compression: Use summarization models or heuristic trimming to shorten long instructions while preserving intent.
  • Context Management: Implement sliding windows, conversation summarization, or periodic state resets so that only the most relevant history remains.
  • Chunking: In RAG, split documents into overlapping chunks of ~500 tokens, rank them, and insert only the top‑k into the prompt.
  ‱ Summarization: Ask a model to condense a long document before it is inserted into the prompt, reducing token count.
  • RAG Retrieval Optimization: Use better embeddings or hybrid search to retrieve fewer, more relevant chunks, minimizing wasted tokens.
  • Output Limit Control: Set max_tokens and stop sequences appropriately to avoid verbose, costly generations.
  • Token‑aware Prompt Design: Experiment with instruction phrasing that yields concise answers, and prune excessive few‑shot examples.

These techniques shift token usage from a black‑box budget to a controllable dial.
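As one example of these techniques, overlapping chunking can be sketched over a list of token IDs. The function name `chunk_tokens` and the sizes used here are illustrative; in a real pipeline the tokens would come from the target model's tokenizer and the chunk size would be closer to the ~500 tokens mentioned above.

```python
def chunk_tokens(tokens: list, chunk_size: int, overlap: int) -> list:
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks


token_ids = list(range(10))  # stand-in for real token IDs
for chunk in chunk_tokens(token_ids, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a boundary appears intact in at least one chunk, at the cost of storing some tokens twice.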

Relationship to Other LLM Concepts​

Tokens are the thread that ties the entire LLM stack together:

  • Tokenization: Defines how text becomes tokens.
  ‱ Embeddings: Token IDs are mapped to vectors.
  • Transformer: Processes sequences of token embeddings.
  • Context Window: Measured in tokens, bounding capacity.
  • Inference: Generates output tokens one by one.
  • RAG: Carefully manages token budget for retrieved documents.
  • Prompt Engineering: Shapes the token sequence to elicit desired behavior.

Grasping tokens means you understand the unit of currency in the LLM economy—and you’re ready to design systems that handle them with precision.

Key Takeaways​

  • Tokens are the fundamental processing units of LLMs, not words or characters. They can be whole words, subword pieces, or individual characters.
  • Tokenization is the critical first step that converts text into token IDs; the choice of tokenizer impacts multilingual performance, cost, and accuracy.
  • Context windows are measured in tokens—every prompt, history, and generated word consumes a limited budget that must be managed.
  • AI pricing revolves around tokens; optimizing token count has a direct, measurable impact on your operational costs.
  • Token count estimation is essential for engineering AI applications. Use model‑specific tokenizers and build token‑aware pipelines.
  • Developer techniques like prompt compression, chunking, and context summarization keep applications within token budgets without sacrificing quality.
  • Tokens connect every LLM component—from embeddings and attention to inference and RAG. Mastering tokens is one of the first steps to building production‑grade AI systems.