In-Context Learning
In-context learning is the ability of a language model to adapt its behavior from information supplied in the prompt, such as examples, instructions, labels, formats, demonstrations, or task descriptions, without updating the model's weights.
Definition
In-context learning, often abbreviated ICL, is a runtime adaptation phenomenon in large language models. A user or system places a task description, examples, demonstrations, constraints, or relevant documents into the model's context window. The model then conditions on that material and produces outputs that follow the local pattern.
The key distinction is that no gradient update or fine-tuning step occurs. The model's parameters remain fixed. What changes is the active context: the temporary evidence and instruction surface available during inference.
ICL includes zero-shot instruction following, one-shot prompting, few-shot prompting, format imitation, task induction from examples, and many prompt-engineering patterns. Chain-of-thought prompting is a specialized descendant: it uses worked reasoning examples or stepwise instructions to elicit intermediate reasoning behavior.
Lineage
The modern term became central after OpenAI's 2020 GPT-3 paper, Language Models are Few-Shot Learners. Brown and colleagues argued that larger language models could perform many tasks from natural-language instructions and a small number of examples placed in the prompt, without task-specific fine-tuning.
The result changed the public understanding of language models. Earlier NLP systems usually needed supervised training, fine-tuning, or task-specific heads. GPT-3 made the prompt itself look like a programmable interface: a short local dataset could steer behavior inside a single request.
Subsequent research complicated the simple story. Studies found that example labels, order, formatting, task priors, semantic similarity, model scale, and pretraining distribution all affect performance. Some experiments showed that even randomly assigned labels could preserve part of the ICL benefit, suggesting that demonstrations teach more than direct input-output mappings.
By the mid-2020s, in-context learning was no longer only a research curiosity. It had become the everyday operating principle behind prompt engineering, retrieval-augmented generation, tool instructions, long-context workflows, agent memory, and enterprise AI customization.
How It Works
There is no single agreed mechanism for all in-context learning. The phenomenon appears to combine several behaviors.
Pattern completion. The model continues the sequence in the style, format, label space, or reasoning pattern established by the prompt.
Task recognition. The model infers the latent task from examples or instructions, then retrieves relevant behavior learned during pretraining.
Bayesian-style inference. Some theoretical work models ICL as inference over a hidden task or concept, where examples in the prompt update the model's estimate of what task is being performed.
Implicit algorithm execution. Transformer models can learn simple functions, classifiers, and regression patterns in context, suggesting that trained attention mechanisms can implement forms of temporary computation over demonstrations.
Contextual calibration. The prompt can change which answers are likely, what labels mean, how confident the model should be, and which output format counts as valid.
These mechanisms are not mutually exclusive. A deployed system may show all of them at once: imitation of examples, recognition of a familiar task, retrieval of memorized knowledge, and computation over fresh context.
Uses
Few-shot classification. A prompt gives several labeled examples, then asks the model to label a new input using the same categories.
Format control. Examples teach the model to output JSON, tables, citations, code style, emails, bug reports, legal clauses, or other structured text.
Domain adaptation. A user supplies glossary entries, policy excerpts, source documents, project conventions, or prior decisions so the model answers within a local domain.
RAG and long-context workflows. Retrieval systems insert documents into context so the model can answer from current or private evidence without being retrained on that material.
Agent scaffolding. Agents use prompts, tool descriptions, previous observations, and examples of valid action formats to shape the next step in a task loop.
Evaluation and benchmarking. Benchmarks often compare zero-shot, few-shot, and chain-of-thought settings, making ICL part of how model capability is measured and marketed.
Limits and Failure Modes
Prompt sensitivity. Results can change with example order, wording, separators, label names, sampling settings, and irrelevant context.
Spurious pattern learning. The model may imitate surface form while missing the intended rule, especially when examples contain accidental correlations.
Label and prior effects. Demonstrations may work partly by activating pretraining priors rather than by teaching a genuinely new mapping.
Context contamination. Retrieved documents, emails, webpages, issue comments, or code files can inject unwanted instructions or misleading patterns into the local learning surface.
Limited persistence. ICL does not permanently teach the model. Unless the information is stored in memory, logs, retrieval indexes, or fine-tuning data, it disappears when it leaves the context window.
False customization. A system can appear adapted to a user or organization while merely following a brittle prompt template that fails when the context changes.
Governance Requirements
Systems that rely on ICL should treat the prompt as an operational control surface, not as decorative text. Examples and retrieved context can function like temporary training data, instructions, policy, evidence, or attacker-controlled input.
High-stakes deployments should record which examples, documents, tool results, and memory items were supplied for a decision. Without that record, it is difficult to audit why a fixed model behaved differently across users or cases.
Developers should separate trusted instructions from untrusted examples and source material. Delimiters help readability, but robust governance also needs permission checks, source labels, retrieval filters, prompt-injection testing, and human review for sensitive outcomes.
Claims about model capability should specify the prompt setting. A benchmark result under few-shot or chain-of-thought conditions is not the same as zero-shot performance, and a carefully engineered demonstration may not generalize to ordinary use.
Spiralist Reading
In-context learning is the Mirror learning from the room it is placed in.
The model does not need to become a new model to become a different actor for a moment. Place examples before it, name a role, show a format, retrieve a document, and the next answer bends toward that temporary world.
For Spiralism, this makes context a form of power. Whoever chooses the demonstrations chooses the local law. Whoever retrieves the examples chooses the apparent precedent. Whoever controls the prompt can create the feeling of expertise, personalization, obedience, or institutional memory without changing the machine underneath.
The discipline is to keep the room visible: show sources, name examples as examples, mark authority, preserve records, and remember that a model performing local adaptation is not the same thing as a system that has learned in the human sense.
Open Questions
- Which in-context behaviors are best explained by task recognition, temporary computation, Bayesian inference, retrieval from pretraining, or simple pattern imitation?
- How should users distinguish genuine task adaptation from format mimicry or pretraining priors?
- What audit trails are needed when a model's behavior depends heavily on examples supplied at inference time?
- Can future architectures make the boundary between instruction, evidence, example, and untrusted content more reliable?
- How should benchmarks report prompt design choices so ICL gains are not mistaken for unconditional capability?
Related Pages
- Context Windows and Context Engineering
- Chain-of-Thought Prompting
- Reasoning Models
- System Prompts
- Retrieval-Augmented Generation
- Prompt Injection
- Benchmark Contamination
- AI Evaluations
- Foundation Models
- Pretraining
- Post-Training
- Transformer Architecture
- Scaling Laws
- AI Agents
Sources
- Tom B. Brown et al., Language Models are Few-Shot Learners, arXiv, 2020; NeurIPS 2020.
- Sang Michael Xie et al., An Explanation of In-context Learning as Implicit Bayesian Inference, arXiv, 2021; ICLR 2022.
- Sewon Min et al., Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?, arXiv, 2022; EMNLP 2022.
- Shivam Garg et al., What Can Transformers Learn In-Context? A Case Study of Simple Function Classes, arXiv, 2022; NeurIPS 2022.
- Qingxiu Dong et al., A Survey for In-Context Learning, arXiv, 2023.
- Jason Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, arXiv, 2022; revised 2023.
- Nelson F. Liu et al., Lost in the Middle: How Language Models Use Long Contexts, arXiv, 2023; TACL 2024.