The Context Window Explained
Why LLMs forget, what context limits mean in practice, and how to design around them.
You now know that models process tokens and that the context window is measured in them. But what exactly is the context window? It is the model's working memory - the total space available for everything in a single request: your instructions, the conversation history, any documents you have attached, and the response being generated. Once you exceed it, information is lost: typically the application cuts the oldest content to make the request fit, or the API rejects the request outright. When content is cut, it is not cut gracefully or intelligently. It just drops.
Imagine a brilliant surgeon who, at the start of every operation, is handed a clipboard with all the patient's information - allergies, medical history, the procedure plan, notes from earlier that day. The clipboard holds exactly 20 pages. If the stack of notes is 25 pages thick, the assistant who prepares the clipboard has to leave 5 pages behind. Which 5? The oldest ones. The surgeon performs the operation with whatever is on the clipboard, unaware that anything is missing. This is how a truncated context window behaves. The model reasons over whatever fits. It does not know what was cut. It will not ask for it. It will silently work with incomplete information and still sound confident.
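To make the budget concrete, here is a minimal sketch of checking whether a request fits before you send it. The names `estimate_tokens` and `remaining_budget` are hypothetical, and the token count is a rough word-based heuristic (about 4/3 tokens per word, matching the ~0.75 words-per-token ratio used in this lesson), not a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic only: ~4/3 tokens per word, rounded up.
    # A real system would count with the model's own tokenizer.
    words = len(text.split())
    return (words * 4 + 2) // 3


def remaining_budget(window: int, parts: dict[str, str], reserved_output: int) -> int:
    """Tokens left after the input parts and the space reserved for the response.

    A negative result means the request does not fit in the window.
    """
    used = sum(estimate_tokens(text) for text in parts.values())
    return window - used - reserved_output


# Everything in the request shares one window, including the response.
parts = {
    "instructions": "Answer in one short paragraph.",
    "history": "user: hello assistant: hi there",
    "documents": "word " * 300,
}
left = remaining_budget(window=1000, parts=parts, reserved_output=500)
print(left)  # 86 under this heuristic
```

Note that the response space is reserved up front: a prompt that fills the window completely leaves the model no room to answer.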
How context degrades in long conversations. Most chat applications resend the full conversation history with every new message, so the model has context for what was said earlier. But as the conversation grows, the history stops fitting and older messages are dropped to make room. The model might no longer see an instruction you gave in message 3 by the time you reach message 40. This is why long-running conversations with complex instructions sometimes go wrong - not because the model got dumber, but because it can no longer see the early context that shaped its behaviour.
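The sketch below shows one naive way an application might trim history: keep the newest messages that fit and drop everything older. This is an assumed policy for illustration (real applications vary), `truncate_oldest_first` is a hypothetical name, and the token count is again a rough word-based heuristic.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4/3 tokens per word), not a real tokenizer.
    return (len(text.split()) * 4 + 2) // 3


def truncate_oldest_first(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit within `budget` tokens."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [f"message {i}: some filler text here" for i in range(1, 41)]
history[2] = "message 3: always answer in French"

visible = truncate_oldest_first(history, budget=100)
print(len(visible))           # 12 messages survive
print(history[2] in visible)  # False: the instruction is gone
```

Nothing in `visible` tells the model that 28 messages were removed, which is why the failure is silent.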
Modern Context Window Sizes (2026). GPT-4o: 128K tokens (~96,000 words). Claude 3.5/4: 200K tokens (~150,000 words). Gemini 2.0 Pro: 2 million tokens (~1.5 million words - a shelf of full-length books). These numbers are massive, but larger windows do not eliminate the 'lost-in-the-middle' problem, and filling them costs significantly more per request. A well-designed retrieval system that sends 3,000 relevant tokens almost always beats dumping 100,000 tokens of vaguely related content.
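The cost gap is easy to put a number on. The sketch below assumes input is billed linearly per token, and the price is a made-up placeholder, not any provider's real rate. The ratio does not depend on the price you plug in.

```python
# Hypothetical price, chosen only to make the arithmetic concrete.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD, an assumption


def input_cost(tokens: int) -> float:
    """Input cost in USD, assuming simple linear per-token pricing."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


retrieved = input_cost(3_000)   # targeted retrieval
dumped = input_cost(100_000)    # dumping vaguely related content
ratio = dumped / retrieved

print(f"retrieved: ${retrieved:.4f}  dumped: ${dumped:.4f}  ratio: {ratio:.1f}x")
```

Under linear pricing the dumped request costs about 33 times as much on input, on every single request.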
Move a piece of key information (like a passcode or rule) through a long prompt and the model's focus on it tends to drop when it sits in the middle. Information buried there is at high risk of being ignored or missed. This is an empirical pattern rather than a hard rule: models tend to use content near the start and end of a long context more reliably than content in the middle.
2. Retrieve, don't dump - send the 3–5 most relevant chunks, not the whole document.
3. Summarise long conversations - every 10–15 turns, compress earlier history into a short summary and keep that instead of the raw transcript.
4. Monitor token usage - log how many tokens each request consumes. Budget overruns are a common cause of otherwise mysterious model behaviour.
5. Put critical context first - never bury the most important instruction in the middle of a long prompt.
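The strategies above can be combined in a few lines. What follows is a sketch under stated assumptions, not a production design: every function name is hypothetical, the retrieval score is a toy word-overlap count (a real system would more likely use embeddings), the token count is a rough word-based heuristic, and `summarize` is a stand-in for whatever summariser you use (often a model call).

```python
import logging
from collections.abc import Callable

logger = logging.getLogger("context_budget")


def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4/3 tokens per word), not a real tokenizer.
    return (len(text.split()) * 4 + 2) // 3


def top_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Retrieve, don't dump: keep the k chunks sharing most words with the query."""
    query_words = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    # sorted() is stable, so tied chunks keep their original order.
    return sorted(chunks, key=score, reverse=True)[:k]


def compress_history(
    turns: list[str], keep_recent: int, summarize: Callable[[list[str]], str]
) -> list[str]:
    """Summarise long conversations: replace older turns with one summary line."""
    split = len(turns) - keep_recent
    if split <= 0:
        return list(turns)
    summary = "Summary of earlier conversation: " + summarize(turns[:split])
    return [summary] + turns[split:]


def build_prompt(critical: str, history: list[str], chunks: list[str], question: str) -> str:
    """Put critical context first, then history, retrieved chunks, and the question."""
    prompt = "\n\n".join([critical, *history, *chunks, question])
    # Monitor token usage: log the estimated size of every request.
    logger.info("prompt uses ~%d tokens", estimate_tokens(prompt))
    return prompt


chunks = [
    "Refunds are processed within 5 days.",
    "Our office is closed on public holidays.",
    "Refunds for annual plans are prorated.",
    "Shipping takes 2 days.",
]
picked = top_chunks("how are refunds handled for annual plans", chunks, k=2)

turns = [f"turn {i}" for i in range(1, 13)]
# Stand-in summariser for the demo; it only counts the turns it replaced.
compact = compress_history(turns, keep_recent=4, summarize=lambda old: f"{len(old)} earlier turns")

prompt = build_prompt("Always answer in French.", compact, picked, "How are refunds handled?")
```

The critical instruction is passed separately from the history on purpose: it is re-attached at the top of every request, so trimming or summarising old turns can never remove it.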