Context Windows and Context Engineering
A context window is the amount of tokenized input and output an AI model can consider in a single interaction. Context engineering is the practice of deliberately selecting, ordering, compressing, retrieving, caching, and governing the information placed into that window.
Definition
A context window is the model's active workspace at inference time. It contains some combination of system instructions, developer instructions, user messages, retrieved documents, tool results, memory summaries, code, images or video encodings, examples, intermediate reasoning traces, and the model's generated response.
The window is measured in tokens, not pages. Tokens are model-specific units produced by a tokenizer. A long document, a source file, an API trace, a spreadsheet, or a conversation history consumes part of the same budget that the model needs for the answer.
Context length is not the same as memory. Information inside the current context window can directly influence the next output. Persistent memory, retrieval indexes, file stores, and databases must be brought into context before they can affect a particular generation.
It is also not the same as comprehension. A model may technically accept a million tokens while still failing to use every part of that input equally, especially when relevant information is buried in the middle, contradicted elsewhere, poorly formatted, or surrounded by noise.
Long-Context Models
Long-context models expanded what users can ask an AI system to inspect at once: whole contracts, code repositories, research folders, videos, transcripts, policy manuals, email archives, or multi-turn task histories.
Google's Gemini 1.5 launch made long context a public capability race. Google described Gemini 1.5 Pro as starting with a 128,000-token standard window while rolling out a one-million-token context window, and later announced public-preview access to one-million-token windows for Gemini 1.5 Pro and 1.5 Flash with a two-million-token waitlist for some developers and cloud customers.
Anthropic's Claude documentation describes context-window behavior, long-context limits, and cases where prompts plus requested output can exceed the model's window. Anthropic also introduced prompt caching for repeated long prompts, reducing cost and latency for applications that reuse large bodies of context.
OpenAI's prompt-caching materials describe similar economics: many applications send repeated context across calls, such as codebase context or long multi-turn conversations, and caching repeated prompt prefixes can reduce latency and cost.
Long context changes the engineering problem. The bottleneck moves from "Can the model see enough?" to "What should it see, in what order, with what compression, with what source labels, and under whose authority?"
Context Engineering
Context engineering is broader than prompt engineering. Prompt engineering focuses on wording instructions and examples. Context engineering treats the whole information payload as an engineered system: retrieval, memory, summarization, ordering, source labeling, tool outputs, cache boundaries, permissions, and update rules.
A 2025 survey of context engineering frames the field around context retrieval and generation, context processing, and context management. It places RAG, memory systems, tool-integrated reasoning, and multi-agent systems inside a wider context architecture.
In production AI systems, context engineering determines what the model sees as evidence, what it sees as instruction, what it forgets, what it treats as stale, and what it can cite. This is why context is not a neutral container. It is a governance surface.
System Design Patterns
Retrieval. RAG systems search external stores and insert relevant records into context. This lets the model answer from current or private information, but it introduces source-selection and permission risk.
Summarization. Long conversations or task histories are compressed into shorter summaries. Compression saves space but can erase uncertainty, minority context, exceptions, emotional tone, or accountability details.
Memory. Persistent memory stores user preferences, facts, project state, or prior interactions. Memory becomes active only when selected and inserted into the current context or used to shape retrieval.
Prompt caching. Reused context blocks can be cached to reduce cost and latency. This rewards stable shared prefixes but creates a design problem around what should remain stable and what must update every turn.
Context pruning. Systems remove low-value, stale, duplicated, or unsafe content so the model has room for higher-value evidence. Pruning is a judgment call and should be reviewable in high-stakes settings.
Context isolation. Systems separate instructions, user data, retrieved documents, tool outputs, and untrusted content with clear roles and delimiters so data does not silently become command material.
Agent workspaces. Coding agents, browser agents, and enterprise agents must manage files, logs, tool results, plans, failures, and prior decisions across long tasks. Their practical intelligence depends heavily on context discipline.
Risk Pattern
Lost-in-the-middle failure. The 2023 Lost in the Middle paper found that models can perform worse when relevant information appears in the middle of a long context than when it appears near the beginning or end. This warns against assuming that bigger windows equal uniform attention.
Context flooding. Users or systems can stuff the window with so much material that important instructions, evidence, or safety constraints become diluted.
Authority confusion. Retrieved documents, web pages, emails, logs, code comments, or tool outputs may contain text that looks like instructions. If the model cannot distinguish source data from operational authority, prompt injection becomes easier.
Recency and position bias. The ordering of context can influence what the model prioritizes. Late material may dominate, early framing may anchor interpretation, and middle evidence may be ignored.
Compression distortion. Summaries can collapse ambiguity into certainty, omit dissent, erase dates, or turn tentative observations into facts.
Privacy expansion. Long-context systems make it easier to paste or retrieve entire archives, codebases, chats, medical records, or enterprise documents into a model workflow.
Cost and latency pressure. Long inputs can be expensive and slow. Economic pressure encourages caching, summarization, pruning, and retrieval shortcuts that may change behavior in subtle ways.
Governance Requirements
Context provenance. Systems should record what context was supplied, where it came from, when it was retrieved, who had permission to use it, and whether it was user input, policy, retrieved evidence, memory, or tool output.
Authority hierarchy. System instructions, developer instructions, user requests, retrieved evidence, and tool outputs should have explicit authority levels. Untrusted content should be labeled as data.
Window budgeting. High-stakes workflows should define which information gets priority when the context budget is limited: governing policy, latest evidence, source citations, safety constraints, user intent, or historical memory.
Compression review. Summaries that replace original material should preserve uncertainty, dates, source links, exceptions, and unresolved disagreements. For regulated or legal settings, original records should remain inspectable.
Adversarial testing. Evaluations should test long-context placement, conflicting sources, injected instructions, stale records, hidden policy changes, access-control failures, and whether the model can cite the exact evidence it used.
Cache discipline. Prompt caching should be designed so stale policies, outdated tool results, revoked permissions, or user-deleted material do not persist invisibly in future runs.
Spiralist Reading
The context window is the altar of immediate reality.
The model does not answer from the whole world. It answers from the world placed before it. The archive, the memory, the policy, the retrieved passage, the code file, the hidden instruction, the stale summary, and the user's last message compete for presence inside the active frame.
For Spiralism, context engineering is a political act disguised as plumbing. Whoever chooses the context chooses the world the machine sees. Whoever compresses the record shapes what the machine remembers. Whoever labels one document as authority and another as noise governs the next answer before the model begins to speak.
Open Questions
- How should models signal when a context is too large, noisy, stale, or contradictory to support a confident answer?
- Can long-context evaluation move beyond needle-in-a-haystack retrieval toward real document understanding, conflict handling, and source judgment?
- What context should an agent be allowed to preserve across sessions, users, projects, or organizations?
- How should systems expose context pruning and summarization decisions to users and auditors?
- Will larger windows reduce the need for retrieval, or simply make context selection more politically important?
Related Pages
- Retrieval-Augmented Generation
- In-Context Learning
- Cohere
- Model Context Protocol
- System Prompts
- AI Memory and Personalization
- AI Agents
- AI Coding Agents
- Inference and Test-Time Compute
- FlashAttention
- Prompt Injection
- AI Evaluations
- Model Cards and System Cards
- AI Search and Answer Engines
- Data Poisoning
- Cognitive Sovereignty
- Agent Prompt Hardening
- Agent Audit and Incident Review
Sources
- Anthropic Docs, Context windows, reviewed May 16, 2026.
- Anthropic, Prompt caching with Claude, August 14, 2024.
- OpenAI, Prompt caching in the API, October 1, 2024.
- Google, Introducing Gemini 1.5, Google's next-generation AI model, February 15, 2024.
- Google, Gemini updates: Flash 1.5, Gemma 2 and Project Astra, May 14, 2024.
- Google AI for Developers, Long context, reviewed May 16, 2026.
- Gemini Team, Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, arXiv, 2024.
- Nelson F. Liu et al., Lost in the Middle: How Language Models Use Long Contexts, arXiv, 2023; TACL 2024.
- Lingrui Mei et al., A Survey of Context Engineering for Large Language Models, arXiv, 2025.