How Language Models Actually Work
Demystify LLMs from tokens to next-word prediction, using simple analogies you will actually remember.
Let us play a quick game. Complete this sentence:
'Once upon a...'
Your brain instantly predicted the word 'time'. You did not search a dictionary or look up the history of storytelling. Instead, you matched a pattern you have heard hundreds of times in your life. That is the fundamental concept behind Large Language Models.
Your phone keyboard's autocomplete guesses the next word based on the last few words you typed. It is a very simple pattern matcher. Now imagine that same autocomplete, except it has been trained on a large share of the public books, papers, and code repositories available online, and has billions of adjustable settings (parameters) that encode the patterns it found. That is an LLM. It is not an 'oracle' that retrieves pre-written answers from a database; it is a supercharged next-word predictor.
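To make the autocomplete analogy concrete, here is a minimal sketch of the simplest possible next-word predictor: it counts which word follows which in a tiny training text. The corpus and the function names (train_bigram, predict_next) are made up for illustration, and a real LLM learns far richer patterns with a neural network rather than a lookup table of counts.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words followed it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus; a real model trains on vastly more text.
corpus = (
    "once upon a time there was a king . "
    "once upon a time there was a queen . "
    "once upon a hill stood a castle ."
)
model = train_bigram(corpus)

print(predict_next(model, "a"))       # most common word after "a" in the corpus
print(predict_next(model, "dragon"))  # never seen, so there is no prediction
```

Notice that nothing is looked up in a database of answers: the prediction comes entirely from patterns counted during training.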
How does it connect words? (The Transformer)
Before 2017, most language models (recurrent networks) read sentences one word at a time, like a person reading left-to-right through a tiny straw. They often forgot the beginning of a long sentence by the time they reached the end. This changed with the Transformer architecture, introduced in the 2017 paper 'Attention Is All You Need'.
Transformers process the entire sentence at once, using a mechanism called Self-Attention to link related words together, no matter how far apart they are.
Take the sentence 'We sat on the bank of the river.' How do you know 'bank' means a riverbank and not a financial institution? You instantly connected it to the word 'river'. Self-Attention computes numerical weights between words so the model can pick up context in a loosely similar way.
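The weighting idea can be sketched in a few lines. The example below computes scaled dot-product attention weights for the word 'bank' over a three-word context. The 2-D word vectors are hand-picked for illustration, and the word vectors are used directly as queries and keys; a real Transformer learns its vectors and applies separate learned projections for queries, keys, and values.

```python
import math

def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention: how strongly `query` attends to each key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical hand-picked vectors, not learned embeddings.
tokens = ["the", "river", "bank"]
vectors = {
    "the": [0.1, 0.1],
    "river": [1.0, 0.0],
    "bank": [0.8, 0.6],
}

weights = attention_weights(vectors["bank"], [vectors[t] for t in tokens])
for token, weight in zip(tokens, weights):
    print(f"bank -> {token}: {weight:.2f}")
```

With these toy vectors, 'bank' gives more weight to 'river' than to 'the', which is the kind of link the paragraph above describes.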
Temperature: Setting the Creativity
When predicting the next word, the model generates a list of possibilities with probability scores. The Temperature setting controls how we choose from this list:
- Temperature = 0 (Predictable): The model always picks the single highest-scoring word. Excellent for writing code, solving math, or structured JSON output where repeatable, consistent results are preferred.
- Temperature = 0.7 (Balanced): The model samples according to the probabilities, so it usually picks a likely word but sometimes a less likely one, adding variety. Great for general assistants and essays.
- Temperature = 1.2+ (Creative): The model takes wilder guesses. Fun for brainstorming, but can quickly lead to nonsensical sentences.
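The three settings above can be sketched as follows. The candidate words and their raw scores are invented for illustration, and treating temperature 0 as 'always take the top word' is a common convention assumed here, not a statement about any particular API.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy, all weight on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_word(words, logits, temperature, rng=random):
    """Pick one word at random, weighted by its temperature-adjusted probability."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(words, weights=probs, k=1)[0]

# Hypothetical candidates for 'Once upon a...' with made-up raw scores.
words = ["time", "hill", "mattress"]
logits = [3.0, 1.5, 0.2]

for t in (0, 0.7, 1.2):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 2) for p in probs], "->", sample_word(words, logits, t))
```

Raising the temperature flattens the distribution, so unlikely words such as 'mattress' get picked more often; lowering it concentrates the probability on 'time'.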
Why LLMs are Stateless
By default, a raw model is completely stateless: it does not remember your past messages, cannot browse the web, and has no idea what today's date is. Every request starts from scratch. To build modern apps that seem to have memory or access to live web search, we wrap the model in systems (such as RAG pipelines or databases) that feed the necessary context into its prompt with every new request. The base model itself remains a static pattern-matcher.
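Here is a minimal sketch of such a wrapper. The fake_model function is a hypothetical stand-in for a real LLM call, and ChatSession, its prompt layout, and the date value are assumptions made for illustration. The point is that the apparent memory lives in the wrapper, which resends the whole conversation and the current date on every request.

```python
def fake_model(prompt):
    """Hypothetical stand-in for an LLM call: it sees only `prompt`, nothing else."""
    return f"(reply based on {len(prompt.splitlines())} lines of context)"

class ChatSession:
    """Keeps the history outside the model and replays it with each request."""

    def __init__(self, model, today):
        self.model = model
        self.today = today  # the model cannot know the date unless we tell it
        self.history = []   # list of (role, text) pairs

    def send(self, user_message):
        self.history.append(("User", user_message))
        # Rebuild the full prompt every time: date first, then all past turns.
        lines = [f"Today's date: {self.today}"]
        lines += [f"{role}: {text}" for role, text in self.history]
        reply = self.model("\n".join(lines))
        self.history.append(("Assistant", reply))
        return reply

chat = ChatSession(fake_model, today="2024-01-01")  # placeholder date
first = chat.send("My name is Ada.")
second = chat.send("What is my name?")
print(first)
print(second)
```

The second request carries more context than the first only because the wrapper included the earlier turns; the model function itself stores nothing between calls.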