Context Engineering
The discipline of designing, curating, and managing the information an AI model receives, so that every token in its context window contributes to correct, reliable output.
Overview
Context engineering is the practice of controlling what an AI model sees, when it sees it, and in what format, across every step of a multi-turn interaction. It extends far beyond writing a single prompt: it encompasses memory management, retrieval strategies, compression, and agent orchestration.
Two identical models using the same API can produce radically different results depending on the context they receive. A scheduling assistant with access to calendar data, contact preferences, and availability tools will outperform one that only receives the user's raw text, even though the underlying model is the same. The context determines the output quality.
Origin: The term was popularized in mid-2025 by Shopify CEO Tobi Lütke, who defined it as "the art of providing all the context for a task to be plausibly solvable by the LLM." Andrej Karpathy (ex-OpenAI, ex-Tesla AI) reinforced the concept by describing how modern AI systems require managing agents, memory, tools, and permissions, all feeding into the context window.
Key Takeaway
Context engineering shifts the focus from "how do I phrase this prompt?" to "what does the model need to see right now?" It's the difference between crafting instructions and designing information systems.
The Context Window
A context window is the fixed-size buffer that holds everything an LLM can consider during a single call. Instructions, conversation history, retrieved documents, tool descriptions, and the user's current request must all fit within this limit. LLMs have no persistent memory between calls. If information is not in the context window, the model cannot access it.
A useful analogy: the context window functions like RAM in a traditional computer. The LLM itself is the CPU, powerful but stateless. All relevant data must be loaded into the window (RAM) before each operation. Insufficient or incorrect data leads directly to incorrect output.
Token Mechanics
Context windows are measured in tokens, subword units produced by a tokenizer. Tokens do not map directly to words: common words like "the" are a single token, while longer or rarer words are split into multiple tokens (e.g., "unhappiness" → "un" + "happiness"). Source code is particularly token-dense. A single line of Python typically consumes 15–25 tokens due to symbols, indentation, and variable names.
Rule of Thumb: 1 token ≈ ¾ of an English word. 100 tokens ≈ 75 words. 1,000 tokens ≈ 1.5 pages of text. Use tiktoken (OpenAI) or the Anthropic Token Count API for precise measurements.
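The rule of thumb above can be coded directly for quick budgeting. This is a rough heuristic only; use tiktoken or the Token Count API when precision matters. `estimate_tokens` is an illustrative helper, not part of either library:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the 1 token ~ 0.75 English words heuristic."""
    words = len(text.split())
    return round(words / 0.75)  # ~4/3 tokens per word

# 75 words should land near the 100-token mark
sample = " ".join(["word"] * 75)
print(estimate_tokens(sample))  # → 100
```

Expect real tokenizer counts to diverge noticeably on code, non-English text, and unusual punctuation.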
Modern Context Window Sizes
| Model | Context Window | Approx. Words |
|---|---|---|
| GPT-4.1 | 1,000,000 tokens | ~750,000 words |
| Claude Opus 4.6 / Sonnet 4.5 | 1,000,000 tokens (beta) | ~750,000 words |
| Gemini 2 | 2,000,000 tokens | ~1.5M words |
| o3 / o4-mini | 200,000 tokens | ~150,000 words |
| GPT-4o | 128,000 tokens | ~96,000 words |
Larger Windows Do Not Eliminate the Problem: Increasing context window size does not remove the need for context engineering. Larger windows introduce higher inference costs, greater attention dilution, and more surface area for irrelevant or contradictory information. The core challenge is not fitting enough information. It is fitting the right information.
Key Takeaway
The context window is finite. Everything competes for the same budget: system prompts, conversation history, retrieved documents, tool definitions, and output instructions. Context engineering is the discipline of deciding what gets in and what stays out.
Anatomy of Context
A context window typically contains seven distinct component types. All are measured in the same unit (tokens) and all draw from the same fixed budget:
| # | Component | Purpose | Typical Size |
|---|---|---|---|
| 1 | System Prompt | Sets behavior, role, constraints | 500–4,000 tokens |
| 2 | User Input | Current request from the user | 50–2,000 tokens |
| 3 | Short-Term Memory | Conversation history | 5K–60K tokens |
| 4 | Long-Term Memory | Persistent facts across sessions | 1K–10K tokens |
| 5 | RAG Documents | Retrieved knowledge | 5K–80K tokens |
| 6 | Tool Definitions | Available function descriptions | 2K–20K tokens |
| 7 | Output Instructions | Format constraints (JSON, schema) | 200–2,000 tokens |
Key Insight: These seven components compete for the same finite budget. A verbose conversation history can crowd out critical RAG documents. Unnecessary tool definitions reduce space available for retrieved knowledge. Context engineering is the discipline of deciding what enters the window, what stays out, and in what form.
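The budget competition can be made concrete with a small audit helper. The component names follow the table above; the sizes and the 200K window are illustrative:

```python
WINDOW = 200_000  # e.g. a 200K-token model (illustrative)

components = {
    "system_prompt":       2_000,
    "user_input":            500,
    "short_term_memory":  40_000,
    "long_term_memory":    5_000,
    "rag_documents":      60_000,
    "tool_definitions":   10_000,
    "output_instructions": 1_000,
}

used = sum(components.values())
print(f"{used} / {WINDOW} tokens ({used / WINDOW:.0%} full)")
# Largest consumers first — the usual candidates for Select/Compress
for name, size in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"  {name:<20} {size:>7} ({size / used:.0%} of usage)")
```

Running an audit like this per turn is often the first step toward seeing where the budget actually goes.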
Interactive parameter tuner — requires JavaScript to display
Key Takeaway
Seven component types compete for space. The system prompt and current user input are non-negotiable. Everything else — history, memory, RAG documents, tool definitions — must be actively managed to fit within budget.
Context Rot
The progressive degradation of AI performance as context grows. More context means more noise, attention dilution, and increased likelihood of errors.
Context rot refers to the degradation of model performance as context length increases. This is an inherent property of transformer-based architectures: the attention mechanism computes pairwise relationships between all tokens. As the token count grows, each individual relationship receives proportionally less weight, diluting the model's ability to attend to critical information.
Lost in the Middle (Liu et al., Stanford, 2023): Retrieval accuracy follows a U-shaped curve: models recall information best when it appears at the beginning or end of the context, but accuracy drops sharply for information placed in the middle. In a 20-document setting, GPT-3.5-Turbo dropped to 45.6% accuracy for middle-positioned facts, below its own closed-book baseline. Anthropic's internal testing found that quality degrades non-linearly once context reaches 70–80% capacity.
This degradation is not model-specific. It has been observed across GPT, Claude, Gemini, and Llama model families. Practical effective context is often significantly smaller than the advertised maximum window size.
Key Takeaway
Context rot is universal. Every major LLM family exhibits it. The solution isn't bigger context windows — it's smarter context management. Fill 50% of capacity with high-quality, relevant information rather than 90% with everything you have.
The Four Context Failures
Beyond general context rot, four distinct failure modes commonly occur in production AI systems — including hallucination-driven poisoning. Each has a different root cause and requires a different mitigation strategy:
Interactive diagram — requires JavaScript to display
| Failure Mode | Cause | Example |
|---|---|---|
| Poisoning | A hallucination enters context and gets treated as ground truth | By turn 15, assistant references meetings that never existed |
| Distraction | Too much historical information accumulates | Model copies last time's approach even though the situation changed |
| Confusion | Irrelevant tools or documents crowd the window | send_email and send_slack have similar descriptions; model picks wrong one |
| Clash | Contradictory information in the same context | Memory says 'prefers mornings' but email says 'let's avoid mornings' |
Scaling Does Not Fix These Failures: Larger context windows do not eliminate these failure modes. They amplify them. More available space increases the surface area for stale data, contradictory signals, and irrelevant tool descriptions to accumulate.
Key Takeaway
The four WSCI strategies (next lessons) are direct responses to these four failure modes: Write to prevent poisoning, Select to prevent distraction and confusion, Compress to manage history, and Isolate to prevent cross-contamination.
Strategy: Write
The Write strategy persists information outside the context window in external storage (files, databases, key-value stores) so the model can retrieve it later without consuming active context space.
Scratchpads
A scratchpad stores intermediate results during a task: partial decisions, constraints discovered mid-conversation, or intermediate calculations. These notes persist in structured state rather than in the raw message history, preventing important details from being lost during compression or truncation.
Long-Term Memory
Long-term memory systems store persistent user facts across sessions: preferences, past decisions, organizational context, and learned constraints. Products like ChatGPT, Cursor, and Claude implement memory features that write facts to external storage and selectively load them into future sessions. Anthropic's recommended pattern is structured note-taking through files (e.g., NOTES.md, todo lists) that survive context resets and compaction.
Design Principle: Treat the context window as working memory, not permanent storage. Anything that needs to persist beyond the current turn should be written to an external store and loaded on demand.
Key Takeaway
Write prevents context poisoning by externalizing state. Instead of trusting the model's memory of earlier turns, you write facts to structured storage and reload them when needed.
Strategy: Select
The Select strategy controls which information enters the context window for a given turn. Rather than loading everything available, the system retrieves only what is relevant to the current task.
RAG (Retrieval-Augmented Generation)
RAG systems embed documents into a vector database and retrieve the most semantically similar chunks at query time. Only the top-ranked results enter the context window, keeping the knowledge base accessible without consuming the entire budget.
Dynamic Tool Selection
When an AI system has access to many tools, exposing all of them simultaneously increases context confusion. Dynamic tool selection analyzes the user's query and exposes only the 3–5 most relevant tool definitions. Research demonstrates this improves tool selection accuracy by approximately 3x compared to loading all tools (arXiv:2411.15399).
Just-in-Time (JIT) Context Loading
Instead of pre-loading large datasets, JIT systems maintain lightweight references (file paths, query strings, URLs) and retrieve full content only when needed. Claude Code uses this pattern to navigate large codebases. It stores file references and loads specific files incrementally using search tools, rather than ingesting the entire repository.
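A minimal sketch of the reference-then-load pattern. The class and its behavior are illustrative, not Claude Code's actual implementation:

```python
from pathlib import Path

class JITContext:
    """Hold lightweight file references; read content only on demand."""

    def __init__(self, root: str):
        # Cheap up front: store paths, not file contents
        self.refs = sorted(Path(root).rglob("*.py"))
        self._cache: dict[Path, str] = {}

    def load(self, ref: Path) -> str:
        # Read each file at most once, and only when actually needed
        if ref not in self._cache:
            self._cache[ref] = ref.read_text()
        return self._cache[ref]
```

The context cost of a thousand unread files is near zero; only the files the agent actually opens spend tokens.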
The goal of Select is maximum relevance per token. Every token in the context window should contribute to the current task.
Key Takeaway
Select prevents context distraction and confusion. Load only what's relevant: top-K RAG results instead of all documents, 3–5 matched tools instead of all tools, and JIT file loading instead of pre-loading everything.
Strategy: Compress
The Compress strategy reduces token count while preserving the essential signal within the context. This applies when relevant information exists but consumes too many tokens in its raw form.
Trimming
Trimming applies rule-based removal: drop messages older than N turns, remove tool call results after they have been processed, or truncate content beyond a token threshold. This is the simplest approach and is effective for preventing unbounded context growth in long-running sessions.
Summarization
Summarization uses the model itself (or a smaller model) to distill a long conversation or document into its key decisions, constraints, and unresolved items. The full history is then replaced with this condensed summary. Claude Code implements this as "auto-compact": triggered automatically at 95% context capacity, it summarizes the entire conversation trajectory while preserving architectural decisions and unresolved issues.
Compression Pitfalls:
- Brevity bias: Optimization pressure drives summaries toward shorter, more generic text, dropping domain-specific constraints that are critical for correct behavior.
- Context collapse: Repeated summarization across many iterations erodes information cumulatively. Each pass loses subtle details until the context becomes functionally useless.
Prompt Caching
Prompt caching does not reduce token count, but achieves an equivalent economic effect. Repeated prompt segments (system instructions, tool definitions) are cached server-side and reused at a fraction of the cost. Anthropic charges 0.1x the base input price for cache reads, reducing a $5/M-token system prompt to $0.50/M on subsequent calls. Cache write costs are 1.25x (5-minute TTL) or 2x (1-hour TTL).
| Cache Behavior | Cost Multiplier | TTL |
|---|---|---|
| Cache read (hit) | 0.1x base price | — |
| Cache write (5-min TTL) | 1.25x base price | 5 minutes |
| Cache write (1-hour TTL) | 2x base price | 1 hour |
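The break-even arrives on the second call. A sketch of the arithmetic using the multipliers above, the $5/M base price from the example, and an assumed 4,000-token cached prompt:

```python
BASE = 5.00 / 1_000_000   # $5 per million input tokens (example price above)
PROMPT = 4_000            # cached system prompt size in tokens (illustrative)

def uncached_cost(calls: int) -> float:
    return calls * PROMPT * BASE

def cached_cost(calls: int) -> float:
    # First call writes the cache at 1.25x; later calls read at 0.1x
    return PROMPT * BASE * 1.25 + (calls - 1) * PROMPT * BASE * 0.1

for calls in (1, 2, 10, 100):
    print(f"{calls:>3} calls: uncached ${uncached_cost(calls):.3f}, "
          f"cached ${cached_cost(calls):.3f}")
```

A single call is slightly more expensive with caching (the 1.25x write premium); every call after that reads at one-tenth price, so caching wins from the second call onward.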
Key Takeaway
Compress manages context growth. Trimming is crude but reliable. Summarization is powerful but risky (brevity bias, context collapse). Prompt caching saves money on repeated content. Use all three together.
Strategy: Isolate
The Isolate strategy splits work across separate context-window boundaries, ensuring that no single agent's window becomes overloaded with unrelated information.
Sub-Agents
Sub-agents are specialized AI instances, each with its own clean context window, spawned to handle specific subtasks. A complex task such as "refactor the codebase and update documentation" can be decomposed into two sub-agents: one receives only code files and style guides, the other receives only documentation and API change logs. Each operates with a focused context. Upon completion, sub-agents return compressed summaries to the parent agent, not their full execution trace.
Sandboxing
Sandboxing isolates execution environments from the context window. Code agents execute code in a separate runtime, store results as variables, and return only the relevant output (return values, error messages). The full execution trace, intermediate print statements, and stack frames remain outside the context, preserving budget for higher-value information.
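A minimal sandbox sketch: run code in a separate interpreter process and return only a trimmed tail of its output. The function name and the 500-character cap are illustrative choices, not a specific product's API:

```python
import subprocess
import sys

def run_sandboxed(code: str, max_output_chars: int = 500) -> str:
    """Execute code in a separate Python process; return only trimmed output."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=10,
    )
    out = result.stdout if result.returncode == 0 else result.stderr
    # Only the tail enters the context; the full trace stays outside
    return out[-max_output_chars:]

print(run_sandboxed("print(sum(range(100)))"))  # → 4950
```

A production sandbox would add real isolation (containers, resource limits, no network); the context-engineering point is the return value: a short string, never the full execution trace.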
Interactive diagram — requires JavaScript to display
Key Takeaway
Isolate prevents cross-contamination between tasks. Sub-agents get clean context windows focused on their specific subtask. Sandboxes keep execution details out of the context. Return summaries, not traces.
Prompt Engineering vs. Context Engineering
Prompt engineering focuses on crafting the instruction text itself: techniques like chain-of-thought reasoning, role assignment, and few-shot examples. These techniques remain important but represent only one component of a larger system.
Context engineering encompasses the entire information architecture around the model: what data is loaded, when it is loaded, how it is formatted, and how it is managed across multi-turn interactions. The central question shifts from "how should I phrase this prompt?" to "what does the model need to see right now to succeed at this specific task?"
| Aspect | Prompt Engineering | Context Engineering |
|---|---|---|
| Scope | The instruction text | The entire information ecosystem |
| Question | "How do I phrase this?" | "What should the model see right now?" |
| Focus | Single-turn quality | Multi-turn agent reliability |
| Tools | Prompt templates | Memory, RAG, tool routing, sub-agents, compression |
| Failure mode | Bad wording | Context rot, poisoning, distraction, confusion, clash |
Example: Same Model, Different Context: Without context engineering: User says "Move my Tuesday meeting." Model responds "What time works for you?", requiring multiple follow-up turns.
With context engineering: Before inference, the system loads the user's calendar, relevant email threads, contact preferences, and scheduling tools. The model produces a complete, actionable response in a single turn: "Thursday 2pm is open. I've sent Jim an invite. Let me know if that works."
Key Takeaway
Prompt engineering is a subset of context engineering. It handles the instruction text. Context engineering handles everything else: memory, retrieval, compression, tool routing, and agent isolation. Both matter, but context engineering is where production systems succeed or fail.
Production Patterns
Production AI systems typically combine all four strategies. The table below maps how three widely-used tools implement each strategy:
| Tool | Write | Select | Compress | Isolate |
|---|---|---|---|---|
| Claude Code | CLAUDE.md memory files, todo lists | JIT file loading, codebase navigation | Auto-compact at 95% capacity | Sub-agents for parallel tasks |
| Cursor | .cursorrules, project context | Semantic code search, @file references | Conversation summarization | Background indexing agents |
| ChatGPT | Memory feature across sessions | Web search, file retrieval | Conversation truncation | Canvas for isolated editing |
Most context-related failures do not require multi-agent orchestration to solve. Start with the simplest effective strategy: audit what enters the context window, remove what does not contribute, and add structure to what remains. Treat context as a budget. Every token should justify its inclusion.
Context Engineering Exercises
Test your understanding of context windows, failure modes, and the WSCI strategies.
Warm-Up
1. Order Context Components by Size
Interactive exercise — requires JavaScript to display
2. Match Failure Mode to Scenario
Interactive quiz — requires JavaScript to display
Core
3. Complete the WSCI Strategies
Interactive exercise — requires JavaScript to display
4. Token Budget Challenge
Interactive parameter tuner — requires JavaScript to display
5. Visual Context Trimmer
Drag the token budget slider to see how a sliding window trimmer works. The system message is always pinned. Oldest non-system messages get trimmed first.
Interactive parameter tuner — requires JavaScript to display
Stretch
6. Diagnose the Failure Mode
Interactive quiz — requires JavaScript to display
7. Visual Tool Router
Type a user query and click Route to see which tools get selected and why. Keywords in the query light up and connect to matching tools. Unmatched tools stay dim, saving context window space.
Interactive exercise — requires JavaScript to display
Build a Context-Managed AI Assistant
From a naive chatbot to a production-grade context-engineered assistant, implementing all four WSCI strategies.
Tutorial progress tracker — requires JavaScript to display
Step 1: Environment Setup
Install the Anthropic SDK and set up a project structure for our context-managed assistant.
Interactive code editor — requires JavaScript
Step 2: System Prompt Design
A well-designed system prompt is the foundation. Keep it focused. Every token matters.
Interactive code editor — requires JavaScript
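In place of the interactive editor, here is a sketch of what this step might produce. The prompt text and the 4-characters-per-token size heuristic are both illustrative:

```python
SYSTEM_PROMPT = """\
You are a scheduling assistant.
- Propose concrete times; never reply with "what time works for you?"
- Respect stored user preferences (e.g. no-morning-meeting constraints).
- Output JSON: {"action": ..., "time": ..., "attendees": [...]}.
"""

# Rough size check: ~4 characters per token in English prose
approx_tokens = len(SYSTEM_PROMPT) // 4
assert approx_tokens < 500, "keep the system prompt lean"
print(f"system prompt ≈ {approx_tokens} tokens")
```

Every instruction here earns its tokens: role, behavioral constraint, memory constraint, and output format, with nothing else.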
Step 3: Conversation History (Sliding Window)
Implement a sliding window that keeps recent messages within a token budget.
Interactive code editor — requires JavaScript
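A minimal sliding-window sketch, matching the behavior described later in the visual trimmer exercise: the system message is pinned, and the oldest non-system messages are dropped first. The default token estimator is a crude stand-in for a real tokenizer:

```python
def sliding_window(messages, budget, estimate=lambda m: len(m["content"]) // 4):
    """Keep the system message pinned; drop oldest others until under budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(map(estimate, system + rest)) > budget:
        rest.pop(0)  # oldest non-system message goes first
    return system + rest
```

Note the trimmer never touches the system prompt, even if the budget is impossibly tight: instructions outrank history.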
Step 4: Memory System
Implement a Write strategy: persistent memory that survives across sessions.
Interactive code editor — requires JavaScript
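A sketch of file-backed memory. The JSON layout and class API are illustrative choices, assuming facts are simple key-value strings:

```python
import json
from pathlib import Path

class Memory:
    """Write strategy: persist facts to disk so they survive context resets."""

    def __init__(self, path: str = "memory.json"):
        self.path = Path(path)
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else {}

    def write(self, key: str, value: str) -> None:
        self.facts[key] = value
        self.path.write_text(json.dumps(self.facts, indent=2))

    def load_for_context(self) -> str:
        """Render stored facts as a compact block for the system prompt."""
        return "\n".join(f"- {k}: {v}" for k, v in self.facts.items())
```

A new `Memory` instance in a later session reloads the same facts from disk, so "prefers no morning meetings" survives even after the conversation history is gone.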
Step 5: RAG Context Selection
Implement a Select strategy: retrieve only relevant emails/documents.
Interactive code editor — requires JavaScript
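A toy top-K selector. Real systems use learned embeddings and a vector store; here bag-of-words cosine similarity stands in for semantic similarity so the selection logic is visible:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def select_top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Select strategy: only the k most relevant documents enter the context."""
    q = Counter(query.lower().split())
    return sorted(docs,
                  key=lambda d: cosine(q, Counter(d.lower().split())),
                  reverse=True)[:k]
```

The shape is the same as a production RAG pipeline: score everything cheaply, admit only the top K into the window.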
Step 6: Dynamic Tool Routing
Only expose tools relevant to the current query. Fewer tools = better selection accuracy.
Interactive code editor — requires JavaScript
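A keyword-based router in the spirit of the visual tool router exercise. The tool names and keyword sets are illustrative; production routers typically score with embeddings rather than exact word overlap:

```python
TOOLS = {
    "send_email":    {"email", "mail", "send"},
    "send_slack":    {"slack", "message", "dm"},
    "get_calendar":  {"calendar", "meeting", "schedule", "tuesday"},
    "create_invite": {"invite", "meeting", "attendee"},
}

def route_tools(query: str, max_tools: int = 3) -> list[str]:
    """Expose only tools whose keywords overlap the query, best match first."""
    words = set(query.lower().split())
    scored = [(len(kw & words), name) for name, kw in TOOLS.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0][:max_tools]
```

Unmatched tools never enter the context at all, which is exactly what keeps `send_email` and `send_slack` from confusing the model when neither is relevant.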
Step 7: Context Compression
Implement auto-compact: when context reaches a threshold, summarize the conversation.
Interactive code editor — requires JavaScript
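A sketch of the auto-compact trigger. The `summarize` callable would normally be an LLM call; it is injected here so the threshold logic can be tested in isolation. The 95% threshold mirrors Claude Code's described behavior; the token estimator is a crude heuristic:

```python
def estimate(messages) -> int:
    return sum(len(m["content"]) // 4 for m in messages)  # rough heuristic

def auto_compact(messages, window, summarize, threshold=0.95):
    """Compress strategy: replace history with a summary past the threshold."""
    if estimate(messages) < threshold * window:
        return messages  # still under budget; leave history intact
    system = [m for m in messages if m["role"] == "system"]
    summary = {"role": "user",
               "content": "[Conversation summary] " + summarize(messages)}
    return system + [summary]
```

Whatever prompt drives `summarize` should explicitly ask for constraints and unresolved tasks; that is the guard against the brevity bias warned about below.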
Always preserve: (1) system prompt, (2) user constraints like "no morning meetings", (3) unresolved tasks. Brevity bias will drop these if you're not careful.
Step 8: Sub-Agent Orchestration
Isolate complex tasks into focused sub-agents, each with their own clean context.
Interactive code editor — requires JavaScript
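A sketch of the orchestration shape. The `llm` callable stands in for a real model call so the isolation pattern is testable; the 200-character summary cap is an illustrative choice:

```python
def run_subagent(task: str, context: list[str], llm) -> str:
    """Each sub-agent sees only its own focused context, never the parent's."""
    result = llm(task, context)
    return result[:200]  # return a compressed summary, not the full trace

def orchestrate(subtasks: dict[str, list[str]], llm) -> dict[str, str]:
    """Isolate strategy: parent collects summaries from clean-context workers."""
    return {task: run_subagent(task, ctx, llm) for task, ctx in subtasks.items()}
```

The parent's context grows only by one short summary per subtask, no matter how many tokens each sub-agent burned internally.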
Production Checklist:
- Token budget monitoring on every API call
- Memory persistence with versioning
- RAG retrieval with relevance scoring
- Dynamic tool routing (never expose all tools)
- Auto-compact with critical info preservation
- Sub-agent isolation for multi-step tasks