How Transformers Work
The architecture behind every modern LLM. Understand Self-Attention, multi-head attention, and why Transformers changed everything.
Every major language model (GPT-4, Gemini, Claude, LLaMA) is built on one architecture: the Transformer. Introduced in 2017 in the paper "Attention Is All You Need", it replaced the old approach of reading text one word at a time with a radically different idea: process every word in parallel, and let the model decide which words matter most to each other. That parallelism made it practical to train on huge datasets. Because any token can attend directly to any other, it is also the foundation for today's long context windows, which in some models reach a million tokens or more.
Older architectures (RNNs) worked like a telephone game: each person whispers to the next, and information degrades as it travels down the chain. By the time you reach person 20, the original message is garbled. Transformers work like a dinner table conversation: everyone hears everyone at once. Person 1 can hear person 20 directly, no degradation, no lag. This is how Transformers process tokens: every token in the sequence can attend directly to any other token, no matter how far apart they are. (In GPT-style models, which generate text left to right, a token attends only to itself and the tokens before it, but the direct connection is the same.)
Self-Attention: The Core Mechanism
For each token in the input, Self-Attention asks: which other tokens in this sequence should I pay attention to, and how much?
Each token is transformed into three vectors:
- Query (Q): 'What am I looking for?'
- Key (K): 'What do I contain?'
- Value (V): 'What information do I provide?'
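The attention score between two tokens is the dot product of one token's Query with the other token's Key, divided by the square root of the Key dimension so the values stay in a stable range. High score = strong relationship. For each token, its scores against all tokens are softmax-normalised into weights that sum to 1, and those weights are used to take a weighted sum of the Value vectors. That weighted sum is the token's attended representation.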
Take a sentence like 'We sat on the bank of the river.' How do you know 'bank' means a riverbank and not a financial institution? You instantly connected it to 'river'. Self-Attention does this mathematically: when processing the word 'bank', its Query vector produces a high dot product with the Key vector of 'river', giving 'river' a high attention weight. The weighted sum of Values then pulls in the river context, resolving the ambiguity.
This happens for every pair of tokens, simultaneously, in parallel.
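To make the Q, K, V steps concrete, here is a minimal sketch of a single attention head in plain Python. It is illustrative only: the sizes, the random projection matrices, and the helper names (`matmul`, `random_matrix`, `self_attention`) are assumptions made for the example. In a real model the projections are learned and the arithmetic runs as batched matrix operations in a tensor library.

```python
import math
import random

def matmul(A, B):
    # Multiply two matrices stored as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    # Subtract the maximum first for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X is a (seq_len x d_model) list of token embeddings. The W_* matrices
    are (d_model x d_k) projections: random here, learned in a real model.
    """
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scale = math.sqrt(len(Q[0]))  # square root of the key dimension
    # scores[i][j]: how strongly token i's Query matches token j's Key.
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K]
              for q_row in Q]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    # Each output row is a weighted sum of all the Value vectors.
    return matmul(weights, V), weights

def random_matrix(rows, cols, rng):
    # Hypothetical helper: stands in for embeddings and learned weights.
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, d_k = 4, 8, 4
X = random_matrix(seq_len, d_model, rng)
W_q, W_k, W_v = (random_matrix(d_model, d_k, rng) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(len(output), len(output[0]))  # one d_k-sized vector per token
```

Row `i` of `weights` is what an attention visualisation draws: how much token `i` attends to each token in the sequence.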
Multi-Head Attention: Multiple Perspectives
A single attention head captures one type of relationship. Real language has many: grammatical subject-verb agreement, coreference (linking 'it' to what it refers to), semantic meaning, positional proximity.
Transformers run several attention heads in parallel: 8 in the original paper, and often dozens in today's larger models. Each head has its own Q, K, and V projections, so each can learn to capture a different kind of relationship. Their outputs are concatenated and projected into the final representation. This is a large part of why Transformers are so powerful: a single layer captures several relationship types at once, where an RNN had to squeeze everything through one hidden state passed along step by step.
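The concatenate-and-project step can be sketched as follows, again in plain Python. Treat it as a toy under stated assumptions: two heads, random weights, and hypothetical helper names (`attention_head`, `multi_head_attention`, `random_matrix`). Splitting `d_model` evenly across heads follows the convention of the original paper.

```python
import math
import random

def matmul(A, B):
    # Multiply two matrices stored as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_q, W_k, W_v):
    # One head: project to Q, K, V, score every pair, take weighted sums of V.
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scale = math.sqrt(len(Q[0]))
    weights = [softmax([sum(q * k for q, k in zip(q_row, k_row)) / scale
                        for k_row in K]) for q_row in Q]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_o):
    # Every head sees the same input but uses its own projections.
    head_outputs = [attention_head(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    # Concatenate the heads' outputs token by token along the feature axis.
    concat = [[x for out in head_outputs for x in out[i]] for i in range(len(X))]
    # The output projection mixes the heads back into one d_model-sized vector.
    return matmul(concat, W_o)

def random_matrix(rows, cols, rng):
    # Hypothetical helper: stands in for embeddings and learned weights.
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads  # each head works in a smaller subspace

X = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
W_o = random_matrix(d_model, d_model, rng)

result = multi_head_attention(X, heads, W_o)
print(len(result), len(result[0]))  # seq_len rows, d_model features each
```

Because each head works in a `d_head`-sized subspace, running several heads costs roughly the same as one full-width head, while letting each head specialise.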
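Positional Encoding is added to the token embeddings to give the model a sense of word order. Attention itself has no inherent sense of position, so without it 'dog bites man' and 'man bites dog' would look like the same set of tokens.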
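One concrete scheme is the fixed sinusoidal encoding from the original paper, sketched below. This is only one option: many current models use learned or rotary position embeddings instead. The function name and the toy sizes are assumptions for the example.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len x d_model) table of position vectors.

    Even feature indices use sine and odd indices use cosine, with
    wavelengths that grow geometrically across the feature dimension.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

seq_len, d_model = 6, 8
pe = sinusoidal_positional_encoding(seq_len, d_model)

# Toy embeddings: the same vector at every position, as if one word repeated.
embeddings = [[0.5] * d_model for _ in range(seq_len)]

# Adding the encoding makes otherwise identical tokens distinguishable by position.
inputs = [[e + p for e, p in zip(emb_row, pe_row)]
          for emb_row, pe_row in zip(embeddings, pe)]
print(inputs[0] != inputs[1])
```

The sum `inputs` is what the first attention layer would receive, so two occurrences of the same word no longer look identical to the model.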
In RAG, when you inject retrieved chunks into the prompt, the Transformer's self-attention is what allows the model to connect information in, say, chunk 3 back to the user's question elsewhere in the prompt. The quality of that connection depends directly on how relevant the retrieved chunks are.