Context Engineering
The discipline of designing, curating, and managing the information an AI model receives, so that every token in its context window contributes to correct, reliable output.
Overview
Context engineering is the practice of controlling what an AI model sees, when it sees it, and in what format, across every step of a multi-turn interaction. It extends far beyond writing a single prompt: it encompasses memory management, retrieval strategies, compression, and agent orchestration.
Two identical models using the same API can produce radically different results depending on the context they receive. A scheduling assistant with access to calendar data, contact preferences, and availability tools will outperform one that only receives the user's raw text, even though the underlying model is the same. The context determines the output quality.
Origin: The term was popularized in mid-2025 by Shopify CEO Tobi Lütke, who defined it as "the art of providing all the context for a task to be plausibly solvable by the LLM." Andrej Karpathy (ex-OpenAI, ex-Tesla AI) reinforced the concept by describing how modern AI systems require managing agents, memory, tools, and permissions, all feeding into the context window.
Key Takeaway
Context engineering shifts the focus from "how do I phrase this prompt?" to "what does the model need to see right now?" It's the difference between crafting instructions and designing information systems.
The Context Window
A context window is the fixed-size buffer that holds everything an LLM can consider during a single call. Instructions, conversation history, retrieved documents, tool descriptions, and the user's current request must all fit within this limit. LLMs have no persistent memory between calls. If information is not in the context window, the model cannot access it.
A useful analogy: the context window functions like RAM in a traditional computer. The LLM itself is the CPU, powerful but stateless. All relevant data must be loaded into the window (RAM) before each operation. Insufficient or incorrect data leads directly to incorrect output.
Token Mechanics
Context windows are measured in tokens, subword units produced by a tokenizer. Tokens do not map directly to words: common words like "the" are a single token, while longer or rarer words are split into multiple tokens (e.g., "unhappiness" → "un" + "happiness"). Source code is particularly token-dense. A single line of Python typically consumes 15–25 tokens due to symbols, indentation, and variable names.
Rule of Thumb: 1 token ≈ ¾ of an English word. 100 tokens ≈ 75 words. 1,000 tokens ≈ 1.5 pages of text. Use tiktoken (OpenAI) or the Anthropic Token Count API for precise measurements.
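The rule of thumb above can be coded directly for quick budgeting. This is a rough heuristic only; use tiktoken or the Token Count API when precision matters. `estimate_tokens` is an illustrative helper, not part of either library:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the 1 token ~ 0.75 English words heuristic."""
    words = len(text.split())
    return round(words / 0.75)  # ~4/3 tokens per word

# 75 words should land near the 100-token mark
sample = " ".join(["word"] * 75)
print(estimate_tokens(sample))  # → 100
```

Expect real tokenizer counts to diverge noticeably on code, non-English text, and unusual punctuation.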
Modern Context Window Sizes
| Model | Context Window | Approx. Words |
|---|---|---|
| GPT-4.1 | 1,000,000 tokens | ~750,000 words |
| Claude Opus 4.6 / Sonnet 4.5 | 1,000,000 tokens (beta) | ~750,000 words |
| Gemini 2 | 2,000,000 tokens | ~1.5M words |
| o3 / o4-mini | 200,000 tokens | ~150,000 words |
| GPT-4o | 128,000 tokens | ~96,000 words |
Larger Windows Do Not Eliminate the Problem: Increasing context window size does not remove the need for context engineering. Larger windows introduce higher inference costs, greater attention dilution, and more surface area for irrelevant or contradictory information. The core challenge is not fitting enough information. It is fitting the right information.
Key Takeaway
The context window is finite. Everything competes for the same budget: system prompts, conversation history, retrieved documents, tool definitions, and output instructions. Context engineering is the discipline of deciding what gets in and what stays out.
Anatomy of Context
A context window typically contains seven distinct component types. All are measured in the same unit (tokens) and all draw from the same fixed budget:
| # | Component | Purpose | Typical Size |
|---|---|---|---|
| 1 | System Prompt | Sets behavior, role, constraints | 500–4,000 tokens |
| 2 | User Input | Current request from the user | 50–2,000 tokens |
| 3 | Short-Term Memory | Conversation history | 5K–60K tokens |
| 4 | Long-Term Memory | Persistent facts across sessions | 1K–10K tokens |
| 5 | RAG Documents | Retrieved knowledge | 5K–80K tokens |
| 6 | Tool Definitions | Available function descriptions | 2K–20K tokens |
| 7 | Output Instructions | Format constraints (JSON, schema) | 200–2,000 tokens |
Key Insight: These seven components compete for the same finite budget. A verbose conversation history can crowd out critical RAG documents. Unnecessary tool definitions reduce space available for retrieved knowledge. Context engineering is the discipline of deciding what enters the window, what stays out, and in what form.
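The budget competition can be made concrete with a small audit helper. The component names follow the table above; the sizes and the 200K window are illustrative:

```python
WINDOW = 200_000  # e.g. a 200K-token model (illustrative)

components = {
    "system_prompt":       2_000,
    "user_input":            500,
    "short_term_memory":  40_000,
    "long_term_memory":    5_000,
    "rag_documents":      60_000,
    "tool_definitions":   10_000,
    "output_instructions": 1_000,
}

used = sum(components.values())
print(f"{used} / {WINDOW} tokens ({used / WINDOW:.0%} full)")
# Largest consumers first — the usual candidates for Select/Compress
for name, size in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"  {name:<20} {size:>7} ({size / used:.0%} of usage)")
```

Running an audit like this per turn is often the first step toward seeing where the budget actually goes.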
Interactive parameter tuner — requires JavaScript to display
Key Takeaway
Seven component types compete for space. The system prompt and current user input are non-negotiable. Everything else — history, memory, RAG documents, tool definitions — must be actively managed to fit within budget.
Context Rot
The progressive degradation of AI performance as context grows. More context means more noise, attention dilution, and increased likelihood of errors.
Context rot refers to the degradation of model performance as context length increases. This is an inherent property of transformer-based architectures: the attention mechanism computes pairwise relationships between all tokens. As the token count grows, each individual relationship receives proportionally less weight, diluting the model's ability to attend to critical information.
Lost in the Middle (Liu et al., Stanford, 2023): Retrieval accuracy follows a U-shaped curve: models recall information best when it appears at the beginning or end of the context, but accuracy drops sharply for information placed in the middle. In a 20-document setting, GPT-3.5-Turbo dropped to 45.6% accuracy for middle-positioned facts, below its own closed-book baseline. Anthropic's internal testing found that quality degrades non-linearly once context reaches 70–80% capacity.
This degradation is not model-specific. It has been observed across GPT, Claude, Gemini, and Llama model families. Practical effective context is often significantly smaller than the advertised maximum window size.
Key Takeaway
Context rot is universal. Every major LLM family exhibits it. The solution isn't bigger context windows — it's smarter context management. Fill 50% of capacity with high-quality, relevant information rather than 90% with everything you have.
The Four Context Failures
Beyond general context rot, four distinct failure modes commonly occur in production AI systems — including hallucination-driven poisoning. Each has a different root cause and requires a different mitigation strategy:
Interactive diagram — requires JavaScript to display
| Failure Mode | Cause | Example |
|---|---|---|
| Poisoning | A hallucination enters context and gets treated as ground truth | By turn 15, assistant references meetings that never existed |
| Distraction | Too much historical information accumulates | Model copies last time's approach even though the situation changed |
| Confusion | Irrelevant tools or documents crowd the window | send_email and send_slack have similar descriptions; model picks wrong one |
| Clash | Contradictory information in the same context | Memory says 'prefers mornings' but email says 'let's avoid mornings' |
Scaling Does Not Fix These Failures: Larger context windows do not eliminate these failure modes. They amplify them. More available space increases the surface area for stale data, contradictory signals, and irrelevant tool descriptions to accumulate.
Key Takeaway
The four WSCI strategies (next lessons) are direct responses to these four failure modes: Write to prevent poisoning, Select to prevent distraction and confusion, Compress to manage history, and Isolate to prevent cross-contamination.
Strategy: Write
The Write strategy persists information outside the context window in external storage (files, databases, key-value stores) so the model can retrieve it later without consuming active context space.
Scratchpads
A scratchpad stores intermediate results during a task: partial decisions, constraints discovered mid-conversation, or intermediate calculations. These notes persist in structured state rather than in the raw message history, preventing important details from being lost during compression or truncation.
Long-Term Memory
Long-term memory systems store persistent user facts across sessions: preferences, past decisions, organizational context, and learned constraints. Products like ChatGPT, Cursor, and Claude implement memory features that write facts to external storage and selectively load them into future sessions. Anthropic's recommended pattern is structured note-taking through files (e.g., NOTES.md, todo lists) that survive context resets and compaction.
Design Principle: Treat the context window as working memory, not permanent storage. Anything that needs to persist beyond the current turn should be written to an external store and loaded on demand.
Key Takeaway
Write prevents context poisoning by externalizing state. Instead of trusting the model's memory of earlier turns, you write facts to structured storage and reload them when needed.
Strategy: Select
The Select strategy controls which information enters the context window for a given turn. Rather than loading everything available, the system retrieves only what is relevant to the current task.
RAG (Retrieval-Augmented Generation)
RAG systems embed documents into a vector database and retrieve the most semantically similar chunks at query time. Only the top-ranked results enter the context window, keeping the knowledge base accessible without consuming the entire budget.
Dynamic Tool Selection
When an AI system has access to many tools, exposing all of them simultaneously increases context confusion. Dynamic tool selection analyzes the user's query and exposes only the 3–5 most relevant tool definitions. Research demonstrates this improves tool selection accuracy by approximately 3x compared to loading all tools (arXiv:2411.15399).
Just-in-Time (JIT) Context Loading
Instead of pre-loading large datasets, JIT systems maintain lightweight references (file paths, query strings, URLs) and retrieve full content only when needed. Claude Code uses this pattern to navigate large codebases. It stores file references and loads specific files incrementally using search tools, rather than ingesting the entire repository.
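A minimal sketch of the reference-then-load pattern. The class and its behavior are illustrative, not Claude Code's actual implementation:

```python
from pathlib import Path

class JITContext:
    """Hold lightweight file references; read content only on demand."""

    def __init__(self, root: str):
        # Cheap up front: store paths, not file contents
        self.refs = sorted(Path(root).rglob("*.py"))
        self._cache: dict[Path, str] = {}

    def load(self, ref: Path) -> str:
        # Read each file at most once, and only when actually needed
        if ref not in self._cache:
            self._cache[ref] = ref.read_text()
        return self._cache[ref]
```

The context cost of a thousand unread files is near zero; only the files the agent actually opens spend tokens.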
The goal of Select is maximum relevance per token. Every token in the context window should contribute to the current task.
Key Takeaway
Select prevents context distraction and confusion. Load only what's relevant: top-K RAG results instead of all documents, 3–5 matched tools instead of all tools, and JIT file loading instead of pre-loading everything.
Strategy: Compress
The Compress strategy reduces token count while preserving the essential signal within the context. This applies when relevant information exists but consumes too many tokens in its raw form.
Trimming
Trimming applies rule-based removal: drop messages older than N turns, remove tool call results after they have been processed, or truncate content beyond a token threshold. This is the simplest approach and is effective for preventing unbounded context growth in long-running sessions.
Summarization
Summarization uses the model itself (or a smaller model) to distill a long conversation or document into its key decisions, constraints, and unresolved items. The full history is then replaced with this condensed summary. Claude Code implements this as "auto-compact": triggered automatically at 95% context capacity, it summarizes the entire conversation trajectory while preserving architectural decisions and unresolved issues.
Compression Pitfalls:
- Brevity bias: Optimization pressure drives summaries toward shorter, more generic text, dropping domain-specific constraints that are critical for correct behavior.
- Context collapse: Repeated summarization across many iterations erodes information cumulatively. Each pass loses subtle details until the context becomes functionally useless.
Prompt Caching
Prompt caching does not reduce token count, but achieves an equivalent economic effect. Repeated prompt segments (system instructions, tool definitions) are cached server-side and reused at a fraction of the cost. Anthropic charges 0.1x the base input price for cache reads, reducing a $5/M-token system prompt to $0.50/M on subsequent calls. Cache write costs are 1.25x (5-minute TTL) or 2x (1-hour TTL).
| Cache Behavior | Cost Multiplier | TTL |
|---|---|---|
| Cache read (hit) | 0.1x base price | — |
| Cache write (5-min TTL) | 1.25x base price | 5 minutes |
| Cache write (1-hour TTL) | 2x base price | 1 hour |
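The break-even arrives on the second call. A sketch of the arithmetic using the multipliers above, the $5/M base price from the example, and an assumed 4,000-token cached prompt:

```python
BASE = 5.00 / 1_000_000   # $5 per million input tokens (example price above)
PROMPT = 4_000            # cached system prompt size in tokens (illustrative)

def uncached_cost(calls: int) -> float:
    return calls * PROMPT * BASE

def cached_cost(calls: int) -> float:
    # First call writes the cache at 1.25x; later calls read at 0.1x
    return PROMPT * BASE * 1.25 + (calls - 1) * PROMPT * BASE * 0.1

for calls in (1, 2, 10, 100):
    print(f"{calls:>3} calls: uncached ${uncached_cost(calls):.3f}, "
          f"cached ${cached_cost(calls):.3f}")
```

A single call is slightly more expensive with caching (the 1.25x write premium); every call after that reads at one-tenth price, so caching wins from the second call onward.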
Key Takeaway
Compress manages context growth. Trimming is crude but reliable. Summarization is powerful but risky (brevity bias, context collapse). Prompt caching saves money on repeated content. Use all three together.
Strategy: Isolate
The Isolate strategy splits work across separate context-window boundaries, ensuring that no single agent's window becomes overloaded with unrelated information.
Sub-Agents
Sub-agents are specialized AI instances, each with its own clean context window, spawned to handle specific subtasks. A complex task such as "refactor the codebase and update documentation" can be decomposed into two sub-agents: one receives only code files and style guides, the other receives only documentation and API change logs. Each operates with a focused context. Upon completion, sub-agents return compressed summaries to the parent agent, not their full execution trace.
Sandboxing
Sandboxing isolates execution environments from the context window. Code agents execute code in a separate runtime, store results as variables, and return only the relevant output (return values, error messages). The full execution trace, intermediate print statements, and stack frames remain outside the context, preserving budget for higher-value information.
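A minimal sandbox sketch: run code in a separate interpreter process and return only a trimmed tail of its output. The function name and the 500-character cap are illustrative choices, not a specific product's API:

```python
import subprocess
import sys

def run_sandboxed(code: str, max_output_chars: int = 500) -> str:
    """Execute code in a separate Python process; return only trimmed output."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=10,
    )
    out = result.stdout if result.returncode == 0 else result.stderr
    # Only the tail enters the context; the full trace stays outside
    return out[-max_output_chars:]

print(run_sandboxed("print(sum(range(100)))"))  # → 4950
```

A production sandbox would add real isolation (containers, resource limits, no network); the context-engineering point is the return value: a short string, never the full execution trace.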
Interactive diagram — requires JavaScript to display
Key Takeaway
Isolate prevents cross-contamination between tasks. Sub-agents get clean context windows focused on their specific subtask. Sandboxes keep execution details out of the context. Return summaries, not traces.
Prompt Engineering vs. Context Engineering
Prompt engineering focuses on crafting the instruction text itself: techniques like chain-of-thought reasoning, role assignment, and few-shot examples. These techniques remain important but represent only one component of a larger system.
Context engineering encompasses the entire information architecture around the model: what data is loaded, when it is loaded, how it is formatted, and how it is managed across multi-turn interactions. The central question shifts from "how should I phrase this prompt?" to "what does the model need to see right now to succeed at this specific task?"
| Aspect | Prompt Engineering | Context Engineering |
|---|---|---|
| Scope | The instruction text | The entire information ecosystem |
| Question | "How do I phrase this?" | "What should the model see right now?" |
| Focus | Single-turn quality | Multi-turn agent reliability |
| Tools | Prompt templates | Memory, RAG, tool routing, sub-agents, compression |
| Failure mode | Bad wording | Context rot, poisoning, distraction, confusion, clash |
Example: Same Model, Different Context: Without context engineering: User says "Move my Tuesday meeting." Model responds "What time works for you?", requiring multiple follow-up turns.
With context engineering: Before inference, the system loads the user's calendar, relevant email threads, contact preferences, and scheduling tools. The model produces a complete, actionable response in a single turn: "Thursday 2pm is open. I've sent Jim an invite. Let me know if that works."
Key Takeaway
Prompt engineering is a subset of context engineering. It handles the instruction text. Context engineering handles everything else: memory, retrieval, compression, tool routing, and agent isolation. Both matter, but context engineering is where production systems succeed or fail.
Production Patterns
Production AI systems typically combine all four strategies. The table below maps how three widely-used tools implement each strategy:
| Tool | Write | Select | Compress | Isolate |
|---|---|---|---|---|
| Claude Code | CLAUDE.md memory files, todo lists | JIT file loading, codebase navigation | Auto-compact at 95% capacity | Sub-agents for parallel tasks |
| Cursor | .cursorrules, project context | Semantic code search, @file references | Conversation summarization | Background indexing agents |
| ChatGPT | Memory feature across sessions | Web search, file retrieval | Conversation truncation | Canvas for isolated editing |
Most context-related failures do not require multi-agent orchestration to solve. Start with the simplest effective strategy: audit what enters the context window, remove what does not contribute, and add structure to what remains. Treat context as a budget. Every token should justify its inclusion.
Context Engineering Exercises
Test your understanding of context windows, failure modes, and the WSCI strategies.
Warm-Up
1. Order Context Components by Size
Interactive exercise — requires JavaScript to display
2. Match Failure Mode to Scenario
Interactive quiz — requires JavaScript to display
Core
3. Complete the WSCI Strategies
Interactive exercise — requires JavaScript to display
4. Token Budget Challenge
Interactive parameter tuner — requires JavaScript to display
5. Visual Context Trimmer
Drag the token budget slider to see how a sliding window trimmer works. The system message is always pinned. Oldest non-system messages get trimmed first.
Interactive parameter tuner — requires JavaScript to display
Stretch
6. Diagnose the Failure Mode
Interactive quiz — requires JavaScript to display
7. Visual Tool Router
Type a user query and click Route to see which tools get selected and why. Keywords in the query light up and connect to matching tools. Unmatched tools stay dim, saving context window space.
Interactive exercise — requires JavaScript to display
Build a Context-Managed AI Assistant
From a naive chatbot to a production-grade context-engineered assistant, implementing all four WSCI strategies.
Tutorial progress tracker — requires JavaScript to display
Step 1: Environment Setup
Install the Anthropic SDK and set up a project structure for our context-managed assistant.
Interactive code editor — requires JavaScript
Step 2: System Prompt Design
A well-designed system prompt is the foundation. Keep it focused. Every token matters.
Interactive code editor — requires JavaScript
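In place of the interactive editor, here is a sketch of what this step might produce. The prompt text and the 4-characters-per-token size heuristic are both illustrative:

```python
SYSTEM_PROMPT = """\
You are a scheduling assistant.
- Propose concrete times; never reply with "what time works for you?"
- Respect stored user preferences (e.g. no-morning-meeting constraints).
- Output JSON: {"action": ..., "time": ..., "attendees": [...]}.
"""

# Rough size check: ~4 characters per token in English prose
approx_tokens = len(SYSTEM_PROMPT) // 4
assert approx_tokens < 500, "keep the system prompt lean"
print(f"system prompt ≈ {approx_tokens} tokens")
```

Every instruction here earns its tokens: role, behavioral constraint, memory constraint, and output format, with nothing else.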
Step 3: Conversation History (Sliding Window)
Implement a sliding window that keeps recent messages within a token budget.
Interactive code editor — requires JavaScript
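A minimal sliding-window sketch, matching the behavior described later in the visual trimmer exercise: the system message is pinned, and the oldest non-system messages are dropped first. The default token estimator is a crude stand-in for a real tokenizer:

```python
def sliding_window(messages, budget, estimate=lambda m: len(m["content"]) // 4):
    """Keep the system message pinned; drop oldest others until under budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(map(estimate, system + rest)) > budget:
        rest.pop(0)  # oldest non-system message goes first
    return system + rest
```

Note the trimmer never touches the system prompt, even if the budget is impossibly tight: instructions outrank history.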
Step 4: Memory System
Implement a Write strategy: persistent memory that survives across sessions.
Interactive code editor — requires JavaScript
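A sketch of file-backed memory. The JSON layout and class API are illustrative choices, assuming facts are simple key-value strings:

```python
import json
from pathlib import Path

class Memory:
    """Write strategy: persist facts to disk so they survive context resets."""

    def __init__(self, path: str = "memory.json"):
        self.path = Path(path)
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else {}

    def write(self, key: str, value: str) -> None:
        self.facts[key] = value
        self.path.write_text(json.dumps(self.facts, indent=2))

    def load_for_context(self) -> str:
        """Render stored facts as a compact block for the system prompt."""
        return "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
```

A new `Memory` instance in a later session reloads the same facts from disk, so "prefers no morning meetings" survives even after the conversation history is gone.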
Step 5: RAG Context Selection
Implement a Select strategy: retrieve only relevant emails/documents.
Interactive code editor — requires JavaScript
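A toy top-K selector. Real systems use learned embeddings and a vector store; here bag-of-words cosine similarity stands in for semantic similarity so the selection logic is visible:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def select_top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Select strategy: only the k most relevant documents enter the context."""
    q = Counter(query.lower().split())
    return sorted(docs,
                  key=lambda d: cosine(q, Counter(d.lower().split())),
                  reverse=True)[:k]
```

The shape is the same as a production RAG pipeline: score everything cheaply, admit only the top K into the window.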
Step 6: Dynamic Tool Routing
Only expose tools relevant to the current query. Fewer tools = better selection accuracy.
Interactive code editor — requires JavaScript
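A keyword-based router in the spirit of the visual tool router exercise. The tool names and keyword sets are illustrative; production routers typically score with embeddings rather than exact word overlap:

```python
TOOLS = {
    "send_email":    {"email", "mail", "send"},
    "send_slack":    {"slack", "message", "dm"},
    "get_calendar":  {"calendar", "meeting", "schedule", "tuesday"},
    "create_invite": {"invite", "meeting", "attendee"},
}

def route_tools(query: str, max_tools: int = 3) -> list[str]:
    """Expose only tools whose keywords overlap the query, best match first."""
    words = set(query.lower().split())
    scored = [(len(kw & words), name) for name, kw in TOOLS.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0][:max_tools]
```

Unmatched tools never enter the context at all, which is exactly what keeps `send_email` and `send_slack` from confusing the model when neither is relevant.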
Step 7: Context Compression
Implement auto-compact: when context reaches a threshold, summarize the conversation.
Interactive code editor — requires JavaScript
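A sketch of the auto-compact trigger. The `summarize` callable would normally be an LLM call; it is injected here so the threshold logic can be tested in isolation. The 95% threshold mirrors Claude Code's described behavior; the token estimator is a crude heuristic:

```python
def estimate(messages) -> int:
    return sum(len(m["content"]) // 4 for m in messages)  # rough heuristic

def auto_compact(messages, window, summarize, threshold=0.95):
    """Compress strategy: replace history with a summary past the threshold."""
    if estimate(messages) < threshold * window:
        return messages  # still under budget; leave history intact
    system = [m for m in messages if m["role"] == "system"]
    summary = {"role": "user",
               "content": "[Conversation summary] " + summarize(messages)}
    return system + [summary]
```

Whatever prompt drives `summarize` should explicitly ask for constraints and unresolved tasks; that is the guard against the brevity bias warned about below.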
Always preserve: (1) system prompt, (2) user constraints like "no morning meetings", (3) unresolved tasks. Brevity bias will drop these if you're not careful.
Step 8: Sub-Agent Orchestration
Isolate complex tasks into focused sub-agents, each with their own clean context.
Interactive code editor — requires JavaScript
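A sketch of the orchestration shape. The `llm` callable stands in for a real model call so the isolation pattern is testable; the 200-character summary cap is an illustrative choice:

```python
def run_subagent(task: str, context: list[str], llm) -> str:
    """Each sub-agent sees only its own focused context, never the parent's."""
    result = llm(task, context)
    return result[:200]  # return a compressed summary, not the full trace

def orchestrate(subtasks: dict[str, list[str]], llm) -> dict[str, str]:
    """Isolate strategy: parent collects summaries from clean-context workers."""
    return {task: run_subagent(task, ctx, llm) for task, ctx in subtasks.items()}
```

The parent's context grows only by one short summary per subtask, no matter how many tokens each sub-agent burned internally.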
Production Checklist:
- Token budget monitoring on every API call
- Memory persistence with versioning
- RAG retrieval with relevance scoring
- Dynamic tool routing (never expose all tools)
- Auto-compact with critical info preservation
- Sub-agent isolation for multi-step tasks