Foundations 3 min

Training vs Inference

Two completely different operations under one hood. Know the difference to understand cost, speed, and why models can't just 'learn' your data on the fly.

Last updated July 3, 2026

The LLM you query every day went through two completely separate phases of its life: training and inference. They happen at different times, on different hardware, at radically different scales, and with fundamentally different purposes. Confusing the two is one of the most common reasons people ask the wrong question - 'Can I just train the model on my data?' - when the right question is something else entirely.

Medical school vs the operating room

Training is medical school. The student reads tens of thousands of case studies, dissects anatomy, runs simulations, gets corrected by supervisors, and slowly builds a deep mental model of medicine over seven years. Every mistake, every correction, every repeated pattern reshapes their neural pathways. When they graduate, that knowledge is *in them* - embedded in their brain structure.

Inference is performing surgery. The doctor applies their fixed knowledge to a specific patient. They do not re-learn surgery mid-operation. They do not update their medical school degree based on what they see on the table. Their knowledge is what it is - and the quality of the operation depends on the quality of the training that came before it. You cannot teach a surgeon new techniques by whispering facts to them mid-operation. In the same way, mentioning a fact in a prompt does not teach an LLM anything lasting: the model can use that fact for the current response, but its weights do not change.

What training actually does

During training, the model processes trillions of tokens from the internet, books, and code. For each sequence, it predicts the next token, compares its prediction to the actual next token, and calculates the error (loss). Backpropagation then works out how much each of the model's billions of numerical weights contributed to that error, and an optimizer (a variant of gradient descent) adjusts each weight slightly to reduce it. For frontier models, this process runs for weeks on thousands of GPUs at a cost of tens of millions of dollars. When it is done, the weights are frozen. The model is shipped.
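The predict, compare, adjust loop can be sketched at toy scale. The example below is only an illustration under heavy assumptions: the "model" is a single table of scores (one row per previous token) rather than a transformer, the corpus is eight words, and the gradient is written out by hand instead of being computed by backpropagation through many layers. The names `train_bigram`, `vocab`, and `corpus` are made up for this sketch.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, vocab_size, epochs=200, lr=0.5):
    """Toy next-token predictor: weights[prev][next] scores `next` following `prev`."""
    weights = [[0.0] * vocab_size for _ in range(vocab_size)]
    losses = []
    for _ in range(epochs):                       # multiple passes over the data
        total_loss = 0.0
        for prev, target in zip(tokens, tokens[1:]):
            probs = softmax(weights[prev])        # 1. predict the next token
            total_loss += -math.log(probs[target])  # 2. loss against the actual token
            for j in range(vocab_size):
                # 3. gradient of the loss w.r.t. each score is probs - one_hot(target)
                grad = probs[j] - (1.0 if j == target else 0.0)
                weights[prev][j] -= lr * grad     # 4. adjust the weight to reduce the error
        losses.append(total_loss / (len(tokens) - 1))
    return weights, losses

vocab = ["the", "cat", "sat", "mat"]
corpus = "the cat sat the cat sat the mat".split()
token_ids = [vocab.index(word) for word in corpus]

weights, losses = train_bigram(token_ids, len(vocab))
after_cat = weights[vocab.index("cat")]
predicted = vocab[max(range(len(vocab)), key=after_cat.__getitem__)]
```

After training, `losses[-1]` is lower than `losses[0]`, and the highest-scoring token after "cat" is "sat", because that is the only continuation the toy corpus contains. A real training run follows the same loop shape, scaled up to billions of weights and trillions of tokens.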

When you run inference - send a prompt and get a response - those frozen weights process your tokens through the transformer layers and generate a response. The weights do not change. The model is not learning. It is performing.
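To make the "weights do not change" point concrete, here is a minimal sketch of inference over a frozen table of scores. Everything in it is hypothetical: `FROZEN_LOGITS` stands in for trained weights, the four-word vocabulary is invented, and greedy decoding is just one simple way to pick the next token.

```python
import copy

# Hypothetical frozen weights: FROZEN_LOGITS[prev][next] scores `next` following `prev`.
VOCAB = ["the", "cat", "sat", "<end>"]
FROZEN_LOGITS = [
    [0.0, 2.0, 0.1, 0.0],  # after "the", "cat" scores highest
    [0.0, 0.0, 2.0, 0.1],  # after "cat", "sat" scores highest
    [0.1, 0.0, 0.0, 2.0],  # after "sat", "<end>" scores highest
    [0.0, 0.0, 0.0, 0.0],  # after "<end>" (never consulted)
]

def generate(prompt_ids, logits, max_new_tokens=5):
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    output = list(prompt_ids)
    end_id = VOCAB.index("<end>")
    for _ in range(max_new_tokens):
        row = logits[output[-1]]  # read-only lookup: no loss, no gradient, no update
        next_id = max(range(len(row)), key=row.__getitem__)
        output.append(next_id)
        if next_id == end_id:
            break
    return output

snapshot = copy.deepcopy(FROZEN_LOGITS)  # copy of the weights before the query
result = generate([VOCAB.index("the")], FROZEN_LOGITS)
completion = " ".join(VOCAB[i] for i in result)
weights_unchanged = FROZEN_LOGITS == snapshot
```

Here `completion` is "the cat sat <end>" and `weights_unchanged` is `True`: generation only reads the weights. Running the same prompt a thousand times would leave the table exactly as it was.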

Why 'just tell it your data' does not work

Showing information in a prompt puts it in the context window - temporary, session-scoped, forgotten the moment the session ends. It is not training. The model does not update its weights. Nothing is retained.

Actual training requires:
- The data formatted correctly for the training pipeline
- Gradient descent running across the entire dataset (often in multiple passes)
- Hours to weeks of GPU compute
- Evaluation, fine-tuning adjustments, safety alignment

This is a completely separate engineering operation from 'ask the model a question'.

The three legitimate ways to give a model new knowledge

1. Context injection - paste the information directly into the prompt. Instant, no training required, and effective for small documents. Forgotten after the session.

2. Retrieval-Augmented Generation (RAG) - at query time, retrieve the most relevant chunks from your knowledge base and inject them into the prompt. Scales to millions of documents. Stays as current as the knowledge base itself. No retraining required. This is the right tool for knowledge.

3. Fine-tuning - continue training the model on your domain data to change its behaviour, tone, or output format, not its factual knowledge. Fine-tuning is unreliable for injecting precise facts: the model can still hallucinate the specifics. Use fine-tuning to teach style, not substance.
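The first two options share one mechanism: text is placed in the prompt before the model sees the question. A minimal sketch of that flow follows, with loud caveats. The word-overlap scoring is a crude stand-in for the embedding similarity and vector index a real RAG system would typically use, and `retrieve`, `build_prompt`, and the sample `knowledge_base` are all invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, chunks, k=2):
    """Rank chunks by shared words with the query (stand-in for embedding similarity)."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks, k=2):
    """Context injection: put the retrieved chunks in the prompt ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

knowledge_base = [
    "Refunds are processed within 14 days of the return request.",
    "The office is closed on public holidays.",
    "Refunds require the original order number.",
]

question = "How long do refunds take to be processed?"
top_chunks = retrieve(question, knowledge_base)
prompt = build_prompt(question, knowledge_base)
```

The two refund-related chunks end up in `prompt` and the unrelated one does not. The model's weights never see the knowledge base; only the prompt changes, which is why updating a document takes effect on the next query with no retraining.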

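Inference parameters you actually control

At inference time, the weights are fixed, but you can shape the model's output through sampling parameters:
- Temperature: Randomness of token selection (0 = always pick the most likely token, 1 = sample from the model's unmodified probability distribution; higher values flatten the distribution)
- Top-p (nucleus sampling): Only sample from the smallest set of most likely tokens whose combined probability reaches p (for example, p = 0.9 covers 90% of the probability mass)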
- Max output tokens: Hard cap on response length
- Stop sequences: Tell the model to stop generating when it hits a specific string

These parameters do not change what the model knows. They shape how it expresses what it knows.
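A minimal sketch of how temperature and top-p could act on a model's raw scores (logits) is shown below. It is a simplified illustration, not any provider's actual implementation: the function name `sample_next_token` and the example logits are invented, and production systems apply these steps to vocabularies of many thousands of tokens.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick a token index from raw logits using temperature and top-p sampling."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=logits.__getitem__)

    # Temperature rescales the logits: below 1 sharpens, above 1 flattens.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest set of most likely tokens whose
    # cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # random.choices treats the weights as relative, so the kept
    # probabilities are effectively renormalised.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for a four-token vocabulary
greedy = sample_next_token(logits, temperature=0)

rng = random.Random(0)  # seeded so the run is repeatable
nucleus = {sample_next_token(logits, temperature=1.0, top_p=0.8, rng=rng) for _ in range(200)}
```

With these logits, the two most likely tokens already cover more than 80% of the probability mass, so with `top_p=0.8` only indices 0 and 1 can ever be sampled. The logits themselves, which come from the frozen weights, are identical in every call; only the selection rule differs.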

What's next

You now understand that training is a one-time event and inference is the runtime. The natural follow-up: since knowledge is frozen at training time, and the world keeps moving, how do we give models access to current, private, or domain-specific information? That is the Prompt Engineering lesson - where we start learning to shape what goes into that context window deliberately.