What Are Tokens?
Tokens are the atomic units LLMs read and write. Learn why tokenisation matters for cost, context limits, and model behaviour.
In the last lesson we said LLMs predict the next token. But we skipped over what a token actually is. This lesson fixes that - because tokens are not words, they are not characters, and the distinction causes real, practical problems if you do not understand it.
A chef reads a recipe word by word and understands each ingredient as a whole concept. A tokeniser works more like a printing press with pre-cut letter blocks - it breaks text into reusable chunks drawn from a fixed vocabulary, typically tens to hundreds of thousands of pieces depending on the model (about 50,000 for GPT-2, about 200,000 for GPT-4o). Common words like 'the' or 'run' get their own block. Rare or long words get chopped into multiple smaller blocks. The word 'Transformers' might be ['Trans', 'formers']. The word 'Unbelievable' might be ['Un', 'believ', 'able']. An emoji like 🚀 might consume 3 tokens. A date like '2026-06-06' might become ['2026', '-', '06', '-', '06'], or split the year further - five or six tokens for what you think of as one value.
'Unbelievable' → 3 tokens: ['Un', 'believ', 'able']
'ChatGPT' → 3 tokens: ['Chat', 'G', 'PT']
'192.168.1.1' → 7 tokens (each number and dot is separate)
100 English words → ~130–150 tokens on average
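To make the letter-block picture concrete, here is a toy sketch that splits a word by greedy longest-match against a tiny hand-picked vocabulary. The names `TOY_VOCAB` and `toy_tokenise` are made up for illustration, and greedy matching is a simplification: real tokenisers learn their vocabulary from data (for example with byte-pair encoding), so their splits will differ.

```python
# Toy illustration only: greedy longest-match against a hand-picked vocabulary.
# Real tokenisers learn their vocabulary from data (e.g. byte-pair encoding).
TOY_VOCAB = {"Un", "believ", "able", "Trans", "formers", "the", "run"}


def toy_tokenise(word: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matches here: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces


print(toy_tokenise("Unbelievable"))  # ['Un', 'believ', 'able']
print(toy_tokenise("Transformers"))  # ['Trans', 'formers']
print(toy_tokenise("the"))           # ['the']
```

A word with no matching pieces falls back to one block per character, which is the toy version of a rare word costing many tokens.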
Rule of thumb: 1 token ≈ 4 characters of English text. Non-English languages are often less efficient - the same sentence in Hindi or Arabic can cost 2–3× more tokens than English.
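When you only need a ballpark figure and do not want to load a tokeniser, the two rules of thumb above can be turned into a quick estimator. This is a rough sketch for English text only; the function names and default ratios are assumptions taken from the numbers in this lesson, not part of any library.

```python
def estimate_tokens_from_chars(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate using the '1 token ≈ 4 characters' rule (English only)."""
    return round(len(text) / chars_per_token)


def estimate_tokens_from_words(word_count: int) -> tuple[int, int]:
    """Low and high estimates using ~1.3 to ~1.5 tokens per English word."""
    return round(word_count * 1.3), round(word_count * 1.5)


sentence = "Tokens are the atomic units LLMs read and write."
print(estimate_tokens_from_chars(sentence))  # 12 (48 characters / 4)
print(estimate_tokens_from_words(100))       # (130, 150)
```

For anything that affects billing or context limits, count with the model's real tokeniser instead; these estimates can be far off for code, numbers, and non-English text.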
Why this hits your wallet. Every API call to a language model is billed per token, with input tokens and output tokens priced separately. At the time of writing, GPT-4o charges roughly $2.50 per million input tokens and $10 per million output tokens. If you send a 50-page PDF (about 25,000 words ≈ 33,000 tokens) to the model in every request, that is about $0.08 of input per request, or roughly $80 per thousand requests, spent on context that may be 90% irrelevant to the question being asked. This is exactly why RAG exists - instead of dumping the whole document, you retrieve only the 3–5 most relevant chunks.
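The arithmetic is easy to check yourself. The sketch below uses the prices quoted above as constants; the 2,000-token figure for retrieved chunks and the 500-token response are hypothetical numbers chosen for the comparison, and `request_cost` is an illustrative helper, not an API.

```python
# Prices quoted in this lesson (USD per million tokens). Treat them as
# assumptions and check current pricing before relying on them.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, billing input and output separately."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Whole 50-page PDF in every request vs. a few retrieved chunks
# (2,000 input tokens and a 500-token answer are hypothetical).
full_pdf = request_cost(input_tokens=33_000, output_tokens=500)
rag = request_cost(input_tokens=2_000, output_tokens=500)

print(f"Whole PDF per request:        ${full_pdf:.4f}")  # $0.0875
print(f"Retrieved chunks per request: ${rag:.4f}")       # $0.0100
print(f"Per 1,000 requests: ${full_pdf * 1000:.2f} vs ${rag * 1000:.2f}")  # $87.50 vs $10.00
```

Under these assumptions the retrieval approach is almost nine times cheaper per request, and the gap widens as the document grows.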
import tiktoken  # pip install tiktoken

# Load the tokeniser that matches the target model.
enc = tiktoken.encoding_for_model("gpt-4o")

texts = [
    "hello world",
    "Retrieval-Augmented Generation",
    "2026-06-06",
    "192.168.1.1",
]

for text in texts:
    tokens = enc.encode(text)
    # Decode each token ID on its own to see the text piece it stands for.
    decoded = [enc.decode([t]) for t in tokens]
    print(f"{text!r:40} → {len(tokens)} tokens: {decoded}")
# Example output (exact splits depend on the encoding and tiktoken version;
# GPT-4o's encoding groups digits in runs of at most three):
# 'hello world' → 2 tokens: ['hello', ' world']
# 'Retrieval-Augmented Generation' → 4 tokens: ['Retrieval', '-Aug', 'mented', ' Generation']
# '2026-06-06' → 6 tokens: ['202', '6', '-', '06', '-', '06']
# '192.168.1.1' → 7 tokens: ['192', '.', '168', '.', '1', '.', '1']

The context window is a token budget, not a word count. When a model advertises a 128K context window, that means 128,000 tokens of combined space for your system prompt, conversation history, retrieved documents, and the response. Spend 80K on an irrelevant document and you have 48K left for everything else. Understanding tokens helps you make deliberate decisions about what to include and what to leave out.
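The budget arithmetic can be written down as a small helper. This is a minimal sketch: `remaining_budget` and the component names are hypothetical, and the token counts in the second call are made-up figures for illustration.

```python
CONTEXT_WINDOW = 128_000  # total token budget from the 128K example


def remaining_budget(context_window: int, **used: int) -> int:
    """Tokens left after subtracting every named component of the prompt."""
    spent = sum(used.values())
    if spent > context_window:
        raise ValueError(f"over budget by {spent - context_window} tokens")
    return context_window - spent


# The example from the text: one 80K document leaves 48K for everything else.
print(remaining_budget(CONTEXT_WINDOW, document=80_000))  # 48000

# A leaner prompt built from a few retrieved chunks (hypothetical sizes).
print(remaining_budget(
    CONTEXT_WINDOW,
    system_prompt=1_000,
    history=6_000,
    retrieved_chunks=4_000,
    reserved_for_response=2_000,
))  # 115000
```

Reserving space for the response up front is a deliberate choice here: the output shares the same window, so a prompt that fills it leaves the model no room to answer.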