How Transformers Work

The architecture behind every modern LLM. Understand Self-Attention, multi-head attention, and why Transformers changed everything.

Last updated July 3, 2026

Every major language model (GPT-4, Gemini, Claude, LLaMA) is built on one architecture: the Transformer. Introduced in 2017 in the paper "Attention Is All You Need", it replaced the old approach of reading text one word at a time with a radically different idea: process every word in parallel, and let the model decide which words matter most to each other. That parallelism, combined with later efficiency improvements, is a large part of why some LLMs can now handle context windows of hundreds of thousands or even millions of tokens.

The dinner table vs the telephone game

Older architectures (RNNs) worked like a telephone game: each person whispers to the next, and information degrades as it travels down the chain. By the time you reach person 20, the original message is garbled. Transformers work like a dinner table conversation: everyone hears everyone at once. Person 1 can hear person 20 directly, with no degradation and no lag. This is essentially how Transformers process tokens: every token in the sequence can attend directly to every other token, no matter how far apart they are. (In decoder-style models such as GPT, each token attends only to the tokens before it, but the connection is still direct rather than relayed.)

Self-Attention: The Core Mechanism

For each token in the input, Self-Attention asks: which other tokens in this sequence should I pay attention to, and how much?

Each token is transformed into three vectors:
- Query (Q): 'What am I looking for?'
- Key (K): 'What do I contain?'
- Value (V): 'What information do I provide?'

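The attention score between two tokens is the dot product of one token's Q with the other token's K, divided by the square root of the key dimension to keep the values in a stable range. A high score means a strong relationship. For each token, its scores against all tokens are softmax-normalised into weights that sum to 1, and those weights are used to take a weighted sum of the V vectors. The result is that token's attended representation.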
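In matrix form, the computation above is usually written as follows. The symbol d_k (the dimension of the key vectors) and the stacking of per-token vectors into matrices Q, K and V follow the notation of the original paper rather than anything defined in this lesson:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Each row of the softmax output holds one token's attention weights over all tokens, so each row sums to 1.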

The 'bank' example: why context matters

Consider: 'The bank by the river flooded.'

How do you know 'bank' means a riverbank and not a financial institution? You instantly connect it to 'river'. Self-Attention does this mathematically: when processing the word 'bank', a well-trained model produces a Query vector that scores highly against the Key vector of 'river', giving 'river' a high attention weight. The weighted sum over Value vectors then pulls in the river context, resolving the ambiguity.

This happens for every pair of tokens in parallel.
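The 'bank' example can be sketched in a few lines of plain Python. This is a toy illustration, not a real model: the two-dimensional vectors are hand-picked rather than learned, the Key vectors are reused as Values to keep it small, and the names `softmax`, `attend` and `bank_query` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query token."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

tokens = ["The", "bank", "by", "the", "river", "flooded"]

# Hand-picked 2-d Key vectors (illustrative assumption, not learned):
# dimension 0 ~ "water-related", dimension 1 ~ "function word".
keys = [
    [0.0, 1.0],  # The
    [1.0, 0.0],  # bank
    [0.0, 1.0],  # by
    [0.0, 1.0],  # the
    [3.0, 0.0],  # river
    [2.0, 0.0],  # flooded
]
values = keys  # assumption: reuse Keys as Values to keep the toy small

# Assumed Query for "bank": it is looking for water-related context.
bank_query = [2.0, 0.0]

weights, output = attend(bank_query, keys, values)
best = max(range(len(tokens)), key=lambda i: weights[i])
print(tokens[best])  # river
```

In a real Transformer the Q, K and V vectors come from learned projection matrices, and the same computation runs for every token at once as a matrix multiplication.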

Multi-Head Attention: Multiple Perspectives

A single attention head captures one type of relationship. Real language has many: grammatical subject-verb agreement, coreference (linking 'it' to what it refers to), semantic meaning, positional proximity.

Transformers run several attention heads in parallel, from 8 in the original base model to dozens or more in large models. Each head has its own Query, Key and Value projections, so each is free to learn a different kind of relationship. Their outputs are concatenated and projected into the final representation. This is a large part of what makes Transformers so expressive: a single layer captures several relationship types at once.
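A minimal sketch of the split, attend, concatenate and project steps, in plain Python. The projection matrices here are random stand-ins for learned weights, and the function names and sizes (`multi_head_attention`, `d_model = 8`, two heads) are assumptions chosen for illustration.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(rows, matrix):
    """Multiply each row vector by a matrix given as a list of rows."""
    return [
        [sum(r[k] * matrix[k][j] for k in range(len(matrix)))
         for j in range(len(matrix[0]))]
        for r in rows
    ]

def attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs

def multi_head_attention(x, heads, w_out):
    """heads: one (w_q, w_k, w_v) triple of projection matrices per head."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the per-head outputs token by token.
    concatenated = [
        [value for head in head_outputs for value in head[t]]
        for t in range(len(x))
    ]
    # Final projection mixes information across heads.
    return matmul(concatenated, w_out)

def random_matrix(rows, cols):
    """Random stand-in for a learned weight matrix."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

x = random_matrix(seq_len, d_model)  # stand-in token embeddings
heads = [
    (random_matrix(d_model, d_head),
     random_matrix(d_model, d_head),
     random_matrix(d_model, d_head))
    for _ in range(num_heads)
]
w_out = random_matrix(d_model, d_model)

out = multi_head_attention(x, heads, w_out)
print(len(out), len(out[0]))  # 4 8
```

Each head works in a smaller space (`d_head = d_model // num_heads`), so running several heads costs roughly the same as one full-width head.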

Positional Encoding is added to give the model a sense of word order (since attention itself has no inherent sense of position).
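One concrete scheme is the sinusoidal encoding from the original paper, sketched below in plain Python. This is only one option and is shown as an assumption about which scheme is meant: many current models use learned or rotary position embeddings instead. The function names are hypothetical.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of position vectors (d_model must be even)."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            # Each pair of dimensions uses a different wavelength.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position vector for each slot to the token embedding in that slot."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]

# Two identical token embeddings at positions 0 and 1 ...
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_positions = add_positions(same_token_twice)

# ... are no longer identical once position is added.
print(with_positions[0] != with_positions[1])  # True
```

Without this step, attention would treat 'dog bites man' and 'man bites dog' as the same bag of tokens.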

Why this matters for context windows and RAG

The context window is the attention window: the Transformer literally runs self-attention across all tokens in the input. Larger context windows mean more token pairs to attend over, and the number of pairs grows with the square of the sequence length. This is why longer contexts are quadratically, not linearly, more expensive to compute.
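With n tokens, full self-attention computes one score per ordered pair of tokens, so n × n scores per head per layer. A minimal sketch of that count (the function name is hypothetical, and it ignores optimisations such as sparse or windowed attention):

```python
def attention_score_count(num_tokens):
    """Query-key scores in one full self-attention pass (per head, per layer)."""
    return num_tokens * num_tokens

for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))
# 1000 1000000
# 2000 4000000
# 4000 16000000
```

Doubling the number of tokens quadruples the number of scores.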

In RAG, when you inject retrieved chunks into the prompt, the Transformer's self-attention is what allows the model to connect information in, say, chunk 3 back to the user's question elsewhere in the same prompt. The quality of that connection depends heavily on how relevant the retrieved chunks are.
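As a rough sketch of what "injecting chunks into the prompt" means in practice: the question and the retrieved chunks end up in one token sequence, which is what lets self-attention relate them. The function `build_rag_prompt` and its layout are hypothetical, not any particular library's API, and the chunk texts are made up for illustration.

```python
def build_rag_prompt(question, chunks):
    """Join a question and retrieved chunks into one prompt string (hypothetical layout)."""
    parts = [f"Question: {question}", ""]
    for i, chunk in enumerate(chunks, start=1):
        parts.append(f"[Chunk {i}] {chunk}")
    parts.append("")
    parts.append("Answer using only the chunks above.")
    return "\n".join(parts)

prompt = build_rag_prompt(
    "Why did the bank flood?",
    [
        "The town sits in a valley.",
        "Heavy rain fell for three days.",
        "The river rose above its bank on the third day.",
    ],
)
print(prompt)
```

Because everything sits in a single sequence, tokens in the question can attend directly to tokens in any chunk, and irrelevant chunks add tokens that compete for that attention.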

What's next

Now that you understand how the Transformer processes tokens, the natural question is: what exactly are tokens? Not words, not characters, something in between. The What Are Tokens? lesson covers this in depth, including why tokenisation directly affects cost and model behaviour.