Foundations 3 min

What Are Tokens?

Tokens are the atomic units LLMs read and write. Learn why tokenisation matters for cost, context limits, and model behaviour.

In the last lesson we said LLMs predict the next token. But we skipped over what a token actually is. This lesson fixes that - because tokens are not words, they are not characters, and the distinction causes real, practical problems if you do not understand it.

How a chef reads a recipe vs how a model reads text

A chef reads a recipe word by word and understands each ingredient as a whole concept. A tokeniser works more like a printing press with pre-cut letter blocks: it breaks text into the most efficient reusable chunks from a fixed vocabulary. The vocabulary size depends on the model, from roughly 50,000 pieces in older tokenisers to around 200,000 in newer ones. Common words like 'the' or 'run' get their own block. Rare or long words get chopped into multiple smaller blocks. The word 'Transformers' might be ['Trans', 'formers']. The word 'Unbelievable' might be ['Un', 'believ', 'able']. An emoji like 🚀 might consume 3 tokens. A date like '2026-06-06' might become ['2026', '-', '06', '-', '06'] - five tokens for what you think of as one value.
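To make the printing-press picture concrete, here is a toy sketch that splits words by greedy longest match against a tiny, made-up vocabulary. This is not how production tokenisers work (they typically use byte-pair encoding with learned merge rules), and `TOY_VOCAB` and `toy_tokenize` are hypothetical names invented for this illustration. It only shows the core idea: known chunks stay whole, everything else falls apart into smaller pieces.

```python
# Toy vocabulary: a hypothetical handful of pieces, nothing like a real model's.
TOY_VOCAB = {"Un", "believ", "able", "Trans", "formers", "the", "run"}


def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unmatched characters become one-character pieces."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary piece matches.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                pieces.append(text[i:end])
                i = end
                break
        else:
            # No vocabulary piece starts here: fall back to a single character.
            pieces.append(text[i])
            i += 1
    return pieces


print(toy_tokenize("Unbelievable", TOY_VOCAB))  # ['Un', 'believ', 'able']
print(toy_tokenize("Transformers", TOY_VOCAB))  # ['Trans', 'formers']
print(toy_tokenize("rerun", TOY_VOCAB))         # ['r', 'e', 'run']
```

Notice how 'rerun' costs three pieces even though 'run' is in the vocabulary: anything the vocabulary does not cover gets expensive quickly.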

Real tokenisation - what the model actually sees (exact counts vary by tokeniser)
'hello world' → 2 tokens
'Unbelievable' → 3 tokens: ['Un', 'believ', 'able']
'ChatGPT' → 3 tokens: ['Chat', 'G', 'PT']
'192.168.1.1' → 7 tokens (each number and dot is separate)
100 English words → ~130–150 tokens on average

Rule of thumb: 1 token ≈ 4 characters of English text. Non-English languages are often less efficient - the same sentence in Hindi or Arabic can cost 2–3× more tokens than English.
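When you do not have a tokeniser to hand, the two rules of thumb above are enough for a back-of-the-envelope estimate. The sketch below is only that: `estimate_tokens` is a hypothetical helper, it assumes English text, and a real tokeniser will give different numbers.

```python
def estimate_tokens(text: str) -> dict[str, int]:
    """Rough English-only estimates based on the rules of thumb above."""
    chars = len(text)
    words = len(text.split())
    return {
        "by_chars": round(chars / 4),         # 1 token ≈ 4 characters
        "by_words_low": round(words * 1.3),   # 100 words ≈ 130 tokens
        "by_words_high": round(words * 1.5),  # 100 words ≈ 150 tokens
    }


sample = "Tokens are the atomic units LLMs read and write."
print(estimate_tokens(sample))
# {'by_chars': 12, 'by_words_low': 12, 'by_words_high': 14}
```

For non-English text, treat these numbers as a lower bound, since the same sentence can cost noticeably more tokens.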

Interactive tokeniser demo
Input text: "Hey, I know it's boring to learn AI. We were promised it would make our lives easier, but now we're forced to study how it tokenizes words. Honestly, why are we learning this instead of chilling? Especially since it's going to take our jobs anyway!"
Characters: 248 | Tokens: 136 | Efficiency: 1.8 chars/token
Token types highlighted: space, word, subword, byte. The demo counts every space as a separate token and splits most words into short subword pieces, for example 'know' → ['kno', 'w'] and 'tokenizes' → ['token', 'ize', 's'].

Why this hits your wallet. Hosted language model APIs bill per token, with input tokens and output tokens priced separately. At the time of writing, GPT-4o charges roughly $2.50 per million input tokens and $10 per million output tokens. If you send a 50-page PDF (about 25,000 words ≈ 33,000 tokens) to the model in every request, you are spending real money on context that may be 90% irrelevant to the question being asked. This is exactly why RAG exists - instead of dumping the whole document, you retrieve only the 3–5 most relevant chunks.
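The arithmetic is worth doing once. The sketch below uses the GPT-4o prices quoted above (check current pricing before relying on them) and compares sending the whole PDF against sending a few retrieved chunks. The chunk size of 400 tokens, the 300-token answer, and the 10,000-request volume are assumptions made up for the example, as is the `request_cost` helper.

```python
# Prices quoted in the text, in USD per million tokens (may be out of date).
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, with input and output billed separately."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Whole 50-page PDF (~33,000 tokens) vs 5 retrieved chunks of an assumed 400 tokens each.
# Both requests get an assumed 300-token answer.
full_doc = request_cost(33_000, 300)
rag = request_cost(5 * 400, 300)

print(f"full document:    ${full_doc:.4f} per request")  # $0.0855
print(f"retrieved chunks: ${rag:.4f} per request")       # $0.0080
print(f"over 10,000 requests: ${full_doc * 10_000:,.2f} vs ${rag * 10_000:,.2f}")
# over 10,000 requests: $855.00 vs $80.00
```

A fraction of a cent per request looks harmless until you multiply it by your traffic.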

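```python
import tiktoken  # pip install tiktoken

# Load the encoding tiktoken associates with GPT-4o (needs a recent tiktoken release).
enc = tiktoken.encoding_for_model("gpt-4o")

texts = [
    "hello world",
    "Retrieval-Augmented Generation",
    "2026-06-06",
    "192.168.1.1",
]

for text in texts:
    tokens = enc.encode(text)                    # list of integer token IDs
    decoded = [enc.decode([t]) for t in tokens]  # turn each ID back into its text piece
    print(f"{text!r:40} → {len(tokens)} tokens: {decoded}")

# Example output (illustrative; exact splits depend on the encoding and tiktoken
# version, e.g. encodings that cap digit runs at three characters split '2026' in two):
# 'hello world'                            → 2 tokens: ['hello', ' world']
# 'Retrieval-Augmented Generation'         → 4 tokens: ['Retrieval', '-Aug', 'mented', ' Generation']
# '2026-06-06'                             → 5 tokens: ['2026', '-', '06', '-', '06']
# '192.168.1.1'                            → 7 tokens: ['192', '.', '168', '.', '1', '.', '1']
```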

The context window is a token budget, not a word count. When a model says it has a 128K context window, that means 128,000 tokens of combined space for your system prompt, conversation history, retrieved documents, and the response. Spend 80K on an irrelevant document and you have 48K left. Understanding tokens helps you make deliberate decisions about what to include and what to leave out.
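Treating the context window as a budget can be as simple as adding up what you plan to send and checking what is left for the response. A minimal sketch, assuming you already have token counts for each part; `remaining_budget` and the counts in `parts` are hypothetical.

```python
CONTEXT_WINDOW = 128_000  # tokens, the 128K figure from the text


def remaining_budget(parts: dict[str, int], window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the response once every other part is counted."""
    used = sum(parts.values())
    if used > window:
        raise ValueError(f"over budget by {used - window} tokens")
    return window - used


# The example from the text: one 80K-token document leaves 48K.
print(remaining_budget({"document": 80_000}))  # 48000

# A fuller, made-up request: the system prompt and history eat into the budget too.
parts = {"system_prompt": 1_000, "history": 7_000, "retrieved_docs": 80_000}
print(remaining_budget(parts))  # 40000
```

Running a check like this before each request makes it obvious which part to trim when the budget gets tight.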

What's next
Now that you know tokens are the unit of input, the natural question is: what does the model do with all those tokens it received during training? And what is the difference between that massive training process and the quick response you get when you ask a question? That is the Training vs Inference lesson.