Context management for coding agents
Coding Agents Need Deep Context
For coding agents to work effectively, they need extensive context about the codebase they're working with. This goes well beyond just seeing the file being edited.
Consider what happens when you ask an agent to refactor a function. It needs to:
- See all files it needs to edit - Not just the function definition, but potentially multiple implementation files if the logic is spread across modules.
- Know where all the call sites are - If you're changing a function signature or behavior, every place that calls this function needs to be updated. Missing even one call site means broken code.
- Access tests for any code being modified - Tests define the expected behavior. Without seeing them, an agent might "fix" code in ways that break the intended functionality.
- Understand the architectural and design patterns - Is this codebase using dependency injection? Repository patterns? Event-driven architecture? Without recognizing these patterns, the agent will write code that feels foreign to the rest of the codebase.
Without this context, even the most capable models will produce code that technically works in isolation but breaks when integrated with the larger system. This is one of the most common failure modes for coding agents today. And even if the code doesn't actually break anything, this myopic view quickly leads to a very messy and bloated codebase.
The Full Codebase Approach: Powerful but Limited
The most straightforward solution is to give the agent the entire codebase as context. When this works, it works brilliantly—the agent can see all the interconnections, understand the patterns, and make changes that respect the existing architecture.
But there are significant limitations:
- Cost: At current pricing, processing hundreds of thousands of tokens per request becomes very expensive very quickly, even with substantial caching discounts.
- Latency: More context means longer processing times.
- Hard limits: Even with expanding context windows, large enterprise codebases with millions of lines of code simply won't fit.
That said, this approach is becoming more and more viable as models improve. Leading models can now handle 200-300k tokens well, which, at a typical 10 or so tokens per line, translates to roughly 20-30k lines of code. Gemini 2.5 Pro was the first model that could really reason over that much code effectively—not just fit it in the context window, but actually maintain coherence and draw connections across such a large span of text.
For many projects—microservices, libraries, small to medium applications—this is actually sufficient to include most or all of the codebase. But we still need solutions for when it's not.
Why Semantic Search Falls Short
When full codebase context isn't viable, semantic search is often the go-to solution. Embed the codebase, embed the query, find the most similar code.
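In code, that baseline looks roughly like the sketch below. The hashed bag-of-words `embed` is a toy stand-in for a real embedding model, and the function names are illustrative:

```python
# Minimal sketch of embedding-based code retrieval: embed every file, embed
# the query, rank by cosine similarity. A real system would use a learned
# embedding model instead of this toy hashed bag-of-words.
import re
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: hash each token into a fixed-size vector and normalize."""
    vec = np.zeros(dim)
    for token in re.findall(r"\w+", text.lower()):
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def top_k_files(query: str, files: dict[str, str], k: int = 5) -> list[str]:
    """files maps path -> source text; returns the k paths most similar to the query."""
    q = embed(query)
    scored = sorted(files.items(), key=lambda kv: -float(q @ embed(kv[1])))
    return [path for path, _ in scored[:k]]
```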
This doesn't work nearly as well as one might expect. The problem is that codebase context management isn't fundamentally a semantic similarity problem.
Consider these examples of relevant code that semantic search would likely miss:
- You're modifying a data model class. The migration files that create the database schema are highly relevant, but they might share very few semantic similarities with the model definition
- You're implementing a new API endpoint. The authentication middleware that will process every request to that endpoint is critical context, but semantically it's about authentication, not about your specific business logic
- You're fixing a bug in a payment processor. The error handling code three layers up in the call stack that swallows exceptions might be the real problem, but it's semantically unrelated to payments
The code that's "relevant" to a task isn't necessarily semantically similar to what you're working on. Deciding what's relevant requires understanding causality, control flow, data flow, and architectural relationships—it requires intelligence, not just similarity matching. Semantic similarity only captures a subset of what makes code relevant, and often not the most important subset.
Tool-Based Approach
Claude Code demonstrated that tool-based context management is now a viable alternative. Instead of pre-selecting context through embeddings or including everything, the agent actively explores the codebase using tools—searching for files, grepping for patterns, and reading specific files.
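To make that concrete, here is a minimal sketch of the kind of tool surface such an agent works with. The tool names and implementations are illustrative, not Claude Code's actual tools:

```python
# Illustrative tool definitions for a tool-based coding agent. The agent loop
# repeatedly picks one of these, inspects the result, and decides what to look
# at next, building up context incrementally.
import subprocess
from pathlib import Path

def list_files(root: str, glob: str = "**/*.py") -> list[str]:
    """List files under root matching a glob pattern."""
    return [str(p) for p in Path(root).glob(glob) if p.is_file()]

def grep(pattern: str, root: str) -> str:
    """Search file contents for a regex and return matching lines with line numbers."""
    result = subprocess.run(
        ["grep", "-rn", pattern, root], capture_output=True, text=True
    )
    return result.stdout

def read_file(path: str) -> str:
    """Return the full contents of a single file."""
    return Path(path).read_text()

TOOLS = {"list_files": list_files, "grep": grep, "read_file": read_file}
```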
This approach is more human-like. When a developer joins a new project, they don't read every file—they navigate through the codebase, following imports, searching for keywords, tracing through call stacks. They build up a mental model incrementally.
The tool-based approach works very well in practice, much better than semantic search. But there's a catch: it's slow. It feels like you're starting from scratch with each interaction. The agent doesn't "know" your codebase. Every time you ask it to make a change, it needs to rediscover the structure, re-learn the patterns, re-find the relevant files. This makes it frustratingly slow at times, and it can miss things—especially when it comes to matching higher-level architectural and design patterns that would be obvious to someone familiar with the codebase.
There are two specific challenges agents run into with the tool-based approach:
The "You Don't Know What You Don't Know" Problem
When an agent is searching for context, it can only search for things it knows to look for. If you ask it to modify a user registration flow, it will search for files containing "register" or "user" or "signup". But what about:
- The rate limiting middleware configured in a completely different part of the codebase?
- The email verification service that gets triggered by an event?
- The analytics tracking that happens through a decorator pattern?
This is a fundamental challenge in many RAG contexts, not just code search. List and grep tools partially solve this by allowing broader exploration—the agent can list all files in certain directories or grep for patterns to discover unexpected connections. But it still requires the agent to think to look for these things.
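A small, hypothetical example makes the failure mode concrete. Everything below is invented for illustration: a keyword search for the registration flow finds the handler, but not the other pieces that shape its behavior.

```python
# Grepping for "register", "user", or "signup" surfaces register_user, but not
# the decorator-based analytics or the rate-limit table that also govern it.
_EVENTS: list[str] = []
RATE_LIMITS = {"/register": "10/minute"}   # configured in app setup, far from the handler

def track(event_name):
    """Analytics decorator, defined in a separate module in a real codebase."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            _EVENTS.append(event_name)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@track("user_created")
def register_user(email: str) -> dict:
    # The email verification service would be triggered by the recorded event,
    # not by anything visible in this function body.
    return {"email": email, "status": "pending_verification"}
```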
The "Can't See the Forest for the Trees" Problem
Even when an agent successfully retrieves relevant files, viewing a few individual files isn't enough to recognize higher-level architectural and design patterns.
For example, if every service in your codebase follows a pattern of:
- Interface definition
- Implementation class
- Factory for dependency injection
- Unit tests in a parallel directory structure
- Integration tests in a separate test suite
The agent needs to see enough examples to recognize this pattern and follow it. Looking at just the one service being modified won't reveal that this is a consistent pattern that should be maintained. The agent needs a way to understand not just the specific code being changed, but the broader conventions and structures of the codebase.
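As a sketch of what that repeated structure might look like, using hypothetical names (a real codebase would spread these across files rather than one module):

```python
# Every service follows the same interface / implementation / factory shape,
# with tests mirrored in a parallel directory.
from typing import Protocol

class PaymentService(Protocol):            # e.g. services/payments/interface.py
    def charge(self, user_id: str, amount_cents: int) -> str: ...

class StripePaymentService:                # e.g. services/payments/stripe_impl.py
    def charge(self, user_id: str, amount_cents: int) -> str:
        return f"charged {user_id} {amount_cents}"

def payment_service_factory() -> PaymentService:   # e.g. services/payments/factory.py
    # Dependency injection point: implementations are swapped here, not at call sites.
    return StripePaymentService()

# tests/services/payments/test_payment_service.py would mirror the source layout:
# unit tests in the parallel directory, integration tests in a separate suite.
```

Seeing one service in isolation doesn't reveal any of this; seeing several does.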
How Runner Approaches Context Management
In Runner, we use a tool-based approach with several key modifications that address the challenges previously described:
- Dedicated sub-agent: When an agent has too many different responsibilities, it tends to skip steps or take shortcuts. By using a specialized sub-agent whose only job is context gathering, we ensure this critical step gets the attention it deserves. This sub-agent runs on every single user message, ensuring the main agent has the context it needs and can immediately begin addressing the user's request without making half a dozen tool calls first.
- Full files only: When context is retrieved, we always include complete files rather than snippets. This ensures the agent sees imports, class definitions, and full context rather than fragments that might miss critical details. We even allow the agent to open entire directories when required.
- Interfaces for initial context: We extract all interface details (imports, function signatures, etc.) from every file in your repository and include that as context for both the context gathering sub-agent and the planning agent. This gives the agents important high-level context about how the code files depend on each other, and it also provides the agents with important information about architectural and design patterns. A rough sketch of this kind of extraction follows this list.
- Manual overrides: Users can view and explicitly specify additional context when the automated approach isn't sufficient. Sometimes you know exactly which files are relevant, and being able to tell the agent directly saves time and improves results. This is rarely necessary, because the sub-agent does such a good job, but it's a nice fallback.
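As an illustration of the interface-extraction idea, here is a rough sketch using Python's standard ast module; Runner's actual implementation may differ:

```python
# Pull imports, function signatures, and class names from a source file so the
# agents get a compact, high-level map of the codebase.
import ast

def extract_interface(source: str) -> list[str]:
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            lines.append(ast.unparse(node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
    return lines
```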
This hybrid approach aims to balance the thoroughness of full-codebase context with the efficiency of selective retrieval, while giving users control when needed. It's not perfect, but it works very well in practice.