Explore how vocabulary-based tokenization works in LLMs and how it impacts prompt design, context length, and overall system performance.
If you’ve worked with large language models (LLMs), you’ve probably asked:
“Why am I paying for tokens and not words?”
At first, it sounds odd. But once you understand how these models work, it becomes clear that tokens aren’t just a billing unit. Tokens are the fundamental building blocks LLMs actually use to read, understand, and generate text.
Understanding tokenization is essential for optimizing prompts, managing costs, and avoiding truncation issues in your LLM applications.
- Tokens, not words, are the currency of LLMs — text must be converted into numeric token IDs before an LLM can process it, making tokenization the foundational step in all LLM workflows.
- Subword tokenization (BPE) strikes the balance — full-word vocabularies would require 500K+ entries; character-level models lose semantic context; BPE with 32K–100K tokens efficiently handles multilingual, technical, and messy real-world text.
- Vocabulary size directly affects token count and cost — GPT-4’s 100K vocabulary encodes more per token than Llama 2’s 32K vocabulary, meaning the same prompt can cost meaningfully different amounts across models.
- Token count ≠ word count — technical jargon, code snippets, and non-Latin scripts generate significantly more tokens per word, making rare or complex language disproportionately expensive in API billing.
- Token management affects more than cost: exceeding context window limits causes truncation in RAG pipelines and multi-turn chats, and more tokens mean more embedding vectors per request, higher memory usage, and slower inference.
Why LLMs Use Tokens Instead of Words
Tokens exist because machines don’t understand words the way humans do. LLMs don’t process text the way we read it; they only work with numbers. So before text enters the model, it must be converted into a form the model can handle: tokens.
Full words are messy for machines. Tokens solve three major problems:
1. Languages are different
- English uses spaces, Chinese doesn’t, and German often glues words together.
- Words can’t be consistently separated across all languages.
2. Word vocabulary explodes
- A single root spawns many variations: run, runs, ran, running, runner.
- Treating each word as a separate unit would require a massive vocabulary. Tokens let the model reuse subword pieces.
3. Real-world text is messy
- Usernames, URLs, typos, and code identifiers aren’t standard words but still need to be processed.
Example
unhappiness = un + happi + ness (3 tokens)
xK93LmQp_auth = x + K93 + Lm + Qp + _ + auth (6 tokens)
Key reasons LLMs use subword tokens instead of full words:
- Full-word vocabularies would require 500K+ entries to cover English alone
- Character-level models lose semantic context and create extremely long sequences
- Subword tokenization (BPE) balances coverage and efficiency with 32K–100K tokens
What Exactly is a Token?
Tokens are the atomic units LLMs understand. They can be:
- Whole words: “cat”, “house”
- Subwords: “running” = “run” + “ning”
- Punctuation or spaces: “!” or “ ”
| Text Unit | Example | Approximate Tokens |
|---|---|---|
| Common short word | “the”, “is”, “run” | 1 token |
| Average English word | “important”, “system” | 1–2 tokens |
| Technical/rare word | “tokenization”, “embeddings” | 2–3 tokens |
| Code snippet | def calculate(): | ~4–6 tokens |
| Non-Latin script (Chinese/Japanese) | 3-character phrase | 3–6 tokens |
Why split like this? LLMs have a fixed-size vocabulary, typically somewhere between about 32K and 256K tokens depending on the model. Each token has an embedding in the model, so the embedding table grows with the vocabulary. If the model tried to treat every word as a separate entry, the vocabulary would become unmanageably large.
How Every LLM Vocabulary Works
Every LLM comes with a vocabulary file, essentially a lookup table mapping strings to token IDs. When text is input:
- The tokenizer scans the text
- Splits it into pieces that exist in the vocabulary, generally preferring the longest available matches
- Maps each piece to its token ID
Without a vocabulary, the model wouldn’t know how to represent text internally, and token counting would be impossible.
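The lookup steps above can be sketched as a greedy longest-match tokenizer. This is a minimal illustration, not how any production tokenizer is implemented: the vocabulary, the token IDs, and the `<unk>` fallback are all hypothetical, and real BPE tokenizers arrive at their pieces by applying learned merge rules rather than by a direct dictionary scan.

```python
def tokenize_longest_match(text, vocab):
    """Greedy longest-match tokenization against a toy vocabulary.

    Returns a list of (piece, token_id) pairs. Characters that no vocabulary
    entry covers fall back to the <unk> token (an assumption of this sketch).
    """
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append((piece, vocab[piece]))
                i += length
                break
        else:
            # No entry matches at this position: emit <unk> for one character.
            tokens.append((text[i], vocab["<unk>"]))
            i += 1
    return tokens


# Hypothetical toy vocabulary; real ones hold tens of thousands of entries.
TOY_VOCAB = {"<unk>": 0, "I": 1, " love": 2, " program": 3, "ming": 4, "!": 5}

pairs = tokenize_longest_match("I love programming!", TOY_VOCAB)
print([piece for piece, _ in pairs])        # the string pieces
print([token_id for _, token_id in pairs])  # the IDs the model actually sees
```

The token count for a prompt is simply the length of the returned list, which is why the count depends entirely on which vocabulary is used.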
| Model | Tokenizer | Vocabulary Size | Notes |
|---|---|---|---|
| GPT-4 | cl100k_base (tiktoken) | ~100,000 | Larger vocab = fewer tokens per prompt |
| GPT-4o | o200k_base (tiktoken) | ~200,000 | Larger vocabulary than GPT-4 |
| GPT-3.5 | cl100k_base | ~100,000 | Same tokenizer as GPT-4 |
| Claude 3 / Claude 3.5 / Claude 4 | Anthropic (BPE-based) | ~100,000+ | Exact size undisclosed; context window up to 200K tokens |
| Llama 2 | SentencePiece | 32,000 | Smaller vocab; more tokens per equivalent text |
| Llama 3 / Llama 3.1 | tiktoken-based | 128,000 | Significantly larger than Llama 2 |
| Gemini 1.5 / 2.0 | SentencePiece | ~256,000 | Optimized for multilingual coverage |
| Mistral 7B | SentencePiece | 32,000 | Efficient for low-resource deployment |
Did you know?
The same text can produce different token counts depending on which model you use. GPT-4’s 100K vocabulary encodes more information per token than Llama 2’s 32K vocabulary. Always use the model-specific tokenizer when counting tokens for cost estimation.
The Role of Vocabulary in Token Counting
1. Vocabulary is the Token Dictionary
The vocabulary is the lookup table that maps strings to token IDs: the tokenizer matches the input against it and emits one ID per matched piece.
Example:
Text: "I love programming!"
Tokens: ["I", " love", " program", "ming", "!"] (illustrative; the exact split depends on the tokenizer)
2. Vocabulary Determines Token Count
Token count doesn’t equal word count. It depends on how the tokenizer matches text to its vocabulary:
- Common words – usually 1 token
- Rare words – split into multiple tokens
- Spaces/punctuation – separate tokens
Example:
Text: "I love AI!" = ["I", " love", " AI", "!"] (4 tokens)
Text: "Antidisestablishmentarianism" = ["Anti", "dis", "establish", "ment", "arian", "ism"] (6 tokens)
The vocabulary design affects tokenization:
- Larger vocabulary – fewer tokens per rare word, larger embedding tables
- Smaller vocabulary – more tokens per rare word, smaller embedding tables
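The embedding-table side of this trade-off is simple arithmetic: one vector per vocabulary entry. The sketch below assumes a hidden size of 4096 purely for illustration; it is not a figure from any specific model in this article.

```python
def embedding_table_params(vocab_size, d_model):
    """Number of parameters in a vocab_size x d_model embedding table."""
    return vocab_size * d_model


D_MODEL = 4096  # assumed hidden size, for illustration only

for label, vocab_size in [
    ("32K subword vocabulary", 32_000),
    ("100K subword vocabulary", 100_000),
    ("500K word-level vocabulary", 500_000),
]:
    params = embedding_table_params(vocab_size, D_MODEL)
    print(f"{label}: {params / 1e6:.0f}M embedding parameters")
```

Under these assumptions, moving from a 32K to a 100K vocabulary roughly triples the embedding table, which is the price paid for shorter token sequences.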
Why Token Counting Matters
1. Cost Control
Most LLM APIs (OpenAI, Azure, Anthropic) charge per token, not per word. That means:
- A short prompt with rare words can cost more than a longer prompt with common words.
- Miscounting tokens can blow up your bill, especially in batch processing or embeddings.
Example:
Prompt A: "I love AI!" (4 tokens)
Prompt B: "Antidisestablishmentarianism rocks" (7 tokens)
Even though Prompt B has fewer words, its rare word splits into multiple tokens, nearly doubling the cost.
2. Prompt Length Management
Every LLM has a maximum token limit (context window). If your prompt + completion exceeds this:
- Input may be truncated
- Outputs can be cut off unexpectedly
Accurate token counting helps you:
- Split long text into chunks to fit the context window
- Design prompts to ensure the model sees all necessary information
This is especially critical for RAG pipelines, summarization, and multi-turn chat applications.
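A minimal sketch of both checks follows: verifying that a prompt plus its reserved completion budget fits the window, and splitting a long token sequence into overlapping chunks. The function names, the 8,192-token window, and the overlap value are assumptions chosen for illustration; the token IDs would come from your model's own tokenizer.

```python
def fits_context(prompt_tokens, max_completion_tokens, context_window):
    """True if the prompt plus the reserved completion budget fits the window."""
    return prompt_tokens + max_completion_tokens <= context_window


def chunk_tokens(token_ids, chunk_size, overlap=0):
    """Split a list of token IDs into chunks of at most chunk_size tokens.

    Consecutive chunks share `overlap` tokens so that context is not lost
    at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last chunk already reaches the end of the input
    return chunks


CONTEXT_WINDOW = 8_192   # assumed limit, for illustration
MAX_COMPLETION = 1_024   # tokens reserved for the model's answer
document = list(range(20_000))  # stand-in for real token IDs

budget = CONTEXT_WINDOW - MAX_COMPLETION
chunks = chunk_tokens(document, chunk_size=budget, overlap=200)
print(len(chunks), "chunks")
print(all(fits_context(len(c), MAX_COMPLETION, CONTEXT_WINDOW) for c in chunks))
```

Chunking on token IDs rather than characters or words is what guarantees each chunk actually fits; a word-based split can overshoot when the text is heavy on rare terms or code.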
3. Embedding Size & Memory
Each token corresponds to an embedding vector in the model. That means:
- More tokens – more vectors – higher memory usage
- Large token sequences can slow inference or even cause out-of-memory errors
Example: Generating embeddings for a 10k-token document vs. a 2k-token document can take roughly 5x the memory, because memory scales with token count rather than word count.
Efficient token management ensures:
- Faster inference
- Lower memory footprint
- Predictable performance
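As a rough back-of-the-envelope check, the memory for one vector per token grows linearly with sequence length. The hidden size (4096) and 2-byte precision below are assumptions for illustration, and the estimate covers a single set of per-token vectors only; real inference memory also includes model weights, attention caches, and intermediate activations.

```python
def per_token_vector_bytes(num_tokens, d_model, bytes_per_value=2):
    """Bytes needed to hold one d_model-sized vector per token."""
    return num_tokens * d_model * bytes_per_value


D_MODEL = 4096  # assumed hidden size, for illustration only

short_doc = per_token_vector_bytes(2_000, D_MODEL)
long_doc = per_token_vector_bytes(10_000, D_MODEL)

print(f"2k tokens:  {short_doc / 2**20:.1f} MiB")
print(f"10k tokens: {long_doc / 2**20:.1f} MiB")
print(f"ratio: {long_doc / short_doc:.1f}x")
```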
How Token Count Affects API Costs
Because billing follows tokens rather than words, inputs of similar length can cost noticeably different amounts depending on their content:
| Scenario | Input Length | Token Count | Estimated Input Cost (GPT-4o, $2.50/M input tokens) |
|---|---|---|---|
| Simple English prompt | 100 words | ~133 tokens | ~$0.0003 |
| Technical prompt with jargon | 100 words | ~150–200 tokens | ~$0.0004–0.0005 |
| Code-heavy prompt | 100 words | ~200–300 tokens | ~$0.0005–0.0008 |
| Chinese/Japanese text | 100 characters | ~150–300 tokens | ~$0.0004–0.0008 |
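The figures in the table follow from one multiplication: token count times the per-million-token price. The sketch below reproduces them using the token estimates and the $2.50 rate quoted above; the monthly request volume is a hypothetical number added to show how small per-request differences compound.

```python
def estimate_input_cost(token_count, usd_per_million_tokens):
    """Input cost in USD for a given token count and per-million-token price."""
    return token_count * usd_per_million_tokens / 1_000_000


PRICE_PER_MILLION = 2.50  # USD per 1M input tokens, the rate quoted in the table

# (low, high) token estimates per scenario, copied from the table above.
SCENARIOS = {
    "Simple English prompt": (133, 133),
    "Technical prompt with jargon": (150, 200),
    "Code-heavy prompt": (200, 300),
    "Chinese/Japanese text": (150, 300),
}

for name, (low, high) in SCENARIOS.items():
    low_cost = estimate_input_cost(low, PRICE_PER_MILLION)
    high_cost = estimate_input_cost(high, PRICE_PER_MILLION)
    print(f"{name}: ${low_cost:.6f} to ${high_cost:.6f} per request")

# Per-request costs are tiny; volume is what makes the gap visible.
REQUESTS_PER_MONTH = 1_000_000  # hypothetical workload
monthly_gap = REQUESTS_PER_MONTH * (
    estimate_input_cost(300, PRICE_PER_MILLION)
    - estimate_input_cost(133, PRICE_PER_MILLION)
)
print(f"Monthly gap between 133 and 300 tokens per request: ${monthly_gap:.2f}")
```

Output tokens are usually priced separately and are not included here.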
Conclusion
Tokens aren’t just a billing detail; they directly impact cost, performance, and context management. For developers building LLM applications, thinking in tokens instead of words is essential. Understanding your model’s vocabulary and tokenization behavior allows you to:
- Optimize prompt length
- Avoid unexpected truncation
- Control costs effectively
- Ensure predictable memory usage
By mastering tokens, you can build efficient, reliable, and cost-effective LLM-powered applications.
FAQs
What role does vocabulary size play in token counting?
LLMs use fixed vocabularies (roughly 30k to 256k tokens), typically built via Byte-Pair Encoding (BPE), where frequent adjacent symbol pairs are merged into tokens while the tokenizer is trained. Larger vocabularies reduce subword splits, lowering token counts (e.g., GPT-4’s 100k vocab packs more info per token than a 32k vocab), but they increase embedding matrix size and compute. The optimal size balances coverage and efficiency.
How does BPE tokenization convert text to tokens?
BPE starts with characters (or bytes) and iteratively merges the most frequent adjacent pair in the training corpus until the target vocabulary size is reached. At encoding time, the learned merges are replayed in order, so out-of-vocabulary words split into known subwords instead of failing. This handles morphology efficiently and, as shown in Sennrich et al. (2016), outperforms word-level tokenization.
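The merge loop can be sketched in a few dozen lines. This is a simplified character-level toy, assuming a tiny hand-made corpus and no end-of-word markers or byte-level handling; production tokenizers add both, along with pre-tokenization rules. All function names are illustrative.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across a {symbol_tuple: frequency} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge rules from a {word: frequency} dict."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        words = merge_pair(words, best)
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = {tuple(word): 1}
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(next(iter(symbols)))


# Hypothetical toy corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(corpus, num_merges=4)
print(merges)
print(encode("lowest", merges))  # a word that never appeared in the corpus
```

The word "lowest" is not in the training corpus, yet it still encodes into pieces learned from "low" and "newest"/"widest", which is the out-of-vocabulary behavior described above.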
Why do LLMs struggle with exact letter counting?
Tokenization packs characters into vocabulary units, so the model never sees individual letters or their positions directly. After tokenization, transformers process embeddings rather than raw strings, which leads to failures in letter-counting and reversal tasks. Research shows mid-layer FFNs attempt detokenization via prefix aggregation, peaking at layer 7 before degradation.
What is the difference between WordPiece and BPE for token counts?
WordPiece (BERT) maximizes likelihood by merging high-probability symbol pairs; BPE (GPT/Claude) greedily merges raw frequencies. BPE yields compact counts for morphologically rich languages; WordPiece better handles rare terms. Both achieve 75% word coverage at 30k vocab, but BPE dominates modern LLMs for OOV robustness.
How do LLMs internally reconstruct words from subword tokens?
Models aggregate prefix token representations into the final token’s hidden state, then FFN layers retrieve full-word concepts. Retrieval peaks in the middle layers, indicating emergent de-tokenization beyond the explicit tokenization step.
How can my team reduce LLM token costs in a production AI application?
The fastest wins are prompt caching, retrieval optimization, and model tiering. Prompt caching (available via Anthropic and OpenAI) eliminates repeated processing of static system prompts — reducing input costs by up to 90% on cached portions. In RAG pipelines, reducing top-K retrieval from 10 chunks to 3–5 chunks typically cuts per-query token usage by 50–70% with minimal accuracy loss. Model tiering routes simpler queries to cheaper models (GPT-4o Mini, Claude Haiku) and reserves frontier models for complex reasoning. Structured data compression formats like TOON can reduce token usage by 40–50% for structured reference data in prompts. Together, these strategies can reduce total token costs by 60–80% in high-volume enterprise deployments.
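The effect of reducing top-K can be estimated with simple arithmetic, since each retrieved chunk adds a roughly fixed number of input tokens. The system prompt, query, and chunk sizes below are hypothetical; actual savings depend on your chunk sizes and on how much of the prompt is fixed overhead.

```python
def rag_prompt_tokens(system_tokens, query_tokens, chunk_tokens, top_k):
    """Approximate input tokens for one RAG query: fixed parts plus k chunks."""
    return system_tokens + query_tokens + top_k * chunk_tokens


def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after` tokens."""
    return 100 * (before - after) / before


# Hypothetical sizes, for illustration only.
SYSTEM_TOKENS = 200
QUERY_TOKENS = 50
CHUNK_TOKENS = 500

baseline = rag_prompt_tokens(SYSTEM_TOKENS, QUERY_TOKENS, CHUNK_TOKENS, top_k=10)
for k in (5, 3):
    reduced = rag_prompt_tokens(SYSTEM_TOKENS, QUERY_TOKENS, CHUNK_TOKENS, top_k=k)
    print(f"top_k={k}: {reduced} tokens, {reduction_pct(baseline, reduced):.1f}% fewer than top_k=10")
```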
Does Intuz help businesses optimize LLM token usage and reduce AI infrastructure costs?
Yes. Intuz works with B2B teams to audit and redesign LLM pipelines — including tokenization strategy, RAG architecture, prompt compression, and model selection. If your AI system is consuming more tokens than expected, or if you’re scaling an LLM application and want to control costs before they compound, our team can review your highest-cost workflows and identify immediate optimization opportunities. Most enterprise teams see 40–60% token cost reduction after a structured pipeline review. Start with a consultation to benchmark your current token consumption against best-practice architecture patterns.