How AI agents find code in your repo

Agents read too much code. An agent asked to modify one function in a 3,000-line file loads the entire file. An agent exploring an unfamiliar repo reads dozens of files before it understands the structure. On a large codebase, a single exploratory task can burn 200,000 tokens before any real work starts.

The jCodeMunch post measured what that costs: finding a single function with raw file reads runs to roughly 40,000 tokens, and understanding a module’s API runs to 15,000. Five strategies have emerged for reducing that. They differ in what gets precomputed, how the agent queries the result, and how stale the answers go between edits.

Just search the repo

Claude Code, Cline, and Aider ship with no index. The model runs read_file, list_directory, and shell-level search (grep, rg, find), iterates on what comes back, and reasons over the results inside its native context window.

Amazon Science reported in February that keyword search via agentic tool use reaches over 90% of RAG-level accuracy on code retrieval benchmarks. Vadim’s writeup traces the Claude Code tool calls: file reads, exact-match search, and the 200K context as the working set.

There is no precompute step, and the answers are always current. The tradeoff is that every session pays the search cost again. Nothing carries over between runs the way a prebuilt embedding index does, so on a large codebase the exploration alone can burn more tokens than the embedding-based tools spend per query.
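A minimal sketch of the two primitives this loop leans on, assuming a Python-only repo on local disk. The names search_repo and read_window are made up for illustration and are not the tool names any of these agents expose. The point is the shape: search first, then read a small window around each hit rather than the whole file.

```python
import re
from pathlib import Path


def search_repo(root, pattern, suffixes=(".py",), max_hits=20):
    """Return (path, line_number, stripped_line) for lines matching a regex."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path), number, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits


def read_window(path, line_number, radius=5):
    """Return only the lines around a hit instead of the whole file."""
    lines = Path(path).read_text(encoding="utf-8", errors="ignore").splitlines()
    start = max(0, line_number - 1 - radius)
    end = min(len(lines), line_number + radius)
    return "\n".join(lines[start:end])
```

An agent would call search_repo, look at the hits, and decide which windows to read next. Every one of those calls happens again in the next session, which is where the repeated cost comes from.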

Embed and retrieve

Cursor’s setup is the canonical embedding pipeline. Cursor’s own writeup and a Towards Data Science teardown describe the moving parts: tree-sitter splits files at function and class boundaries, a Merkle tree of file hashes detects what changed, embeddings (OpenAI’s API or a custom embedder) go to Turbopuffer, and queries pull the top-k chunks at runtime. File paths are encrypted before they leave the machine.

Kilo Code does the same locally. Ollama serves nomic-embed-text or mxbai-embed-large, with LanceDB or Qdrant as the vector store. Nothing leaves the machine.

The structural weakness is the chunk boundary. A function call sits in one chunk and its definition sits in another. Cline’s team has argued explicitly that this tears apart the logic of the code. Stale indexes are the other failure mode. Cursor re-indexes every ten minutes by default, so a query issued shortly after a refactor can return results from the old code.
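A rough sketch of the two indexing steps, under loud assumptions: Python's built-in ast module stands in for tree-sitter, a flat dictionary of per-file SHA-256 hashes stands in for a real Merkle tree, and the embedding and vector-store calls are left out entirely. The function names are hypothetical.

```python
import ast
import hashlib


def chunk_source(source):
    """Split Python source into one chunk per top-level function or class."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno and end_lineno are 1-based and inclusive.
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({
                "name": node.name,
                "start": node.lineno,
                "end": node.end_lineno,
                "text": text,
            })
    return chunks


def hash_files(files):
    """Map each path to the SHA-256 of its content; `files` is {path: source}."""
    return {
        path: hashlib.sha256(src.encode("utf-8")).hexdigest()
        for path, src in files.items()
    }


def changed_paths(old_hashes, new_hashes):
    """Paths that are new or whose hash differs; only these need re-embedding."""
    return sorted(p for p, h in new_hashes.items() if old_hashes.get(p) != h)
```

Chunking a file where a class method calls a top-level function shows the boundary problem directly: the call lands in the class chunk and the definition in a separate chunk, and nothing in either chunk links the two.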

Index the symbols

The AST symbol index sits between raw search and full graphs. jCodeMunch parses with tree-sitter, extracts every function, class, type, and constant, and stores them in a flat JSON index. Agents request a symbol by stable ID and the tool seeks to the byte offset in the source file.

SymDex layers semantic embeddings on top of the same idea so agents can search by intent rather than by exact name. It claims roughly 97% token reduction per lookup.

Symbols sit in isolation in this design. The index records where each function is defined; it does not record what calls what.
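To make the byte-offset idea concrete, here is a small sketch, again with Python's ast module standing in for tree-sitter. The record layout and the path::name ID scheme are assumptions for illustration, not jCodeMunch's actual format, and two symbols with the same name in one file would collide in this version.

```python
import ast
import json


def build_index(path, source_bytes):
    """Return a flat list of symbol records with byte ranges into the file."""
    # Byte offset at which each line starts (index 0 is line 1).
    line_starts = [0]
    for line in source_bytes.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    records = []
    for node in ast.walk(ast.parse(source_bytes)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # col_offset and end_col_offset are UTF-8 byte offsets within the line.
            start = line_starts[node.lineno - 1] + node.col_offset
            end = line_starts[node.end_lineno - 1] + node.end_col_offset
            records.append({
                "id": f"{path}::{node.name}",
                "kind": type(node).__name__,
                "start": start,
                "end": end,
            })
    return records


def read_symbol(record, file_path):
    """Seek straight to the symbol and read only its bytes."""
    with open(file_path, "rb") as f:
        f.seek(record["start"])
        return f.read(record["end"] - record["start"]).decode("utf-8")


def save_index(records, index_path):
    """Persist the index as flat JSON."""
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```

A lookup costs only the bytes of the symbol itself. The records carry no edges, so the index has no way to say who calls a given function.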

Build a code graph

Code graphs add the relationships. Prowl, CodeGraphContext, and Memgraph’s GraphRAG demo index code into a graph database where nodes are symbols and edges are calls, imports, and inheritance. An agent can ask for the callers of a function, the transitive dependencies of a module, or the type hierarchy of a class, without reading the source.

The Prowl team measured 90.5% token reduction on full-project understanding tasks. My earlier comparison goes deeper into the tradeoffs between Prowl, CodeGraphContext, and architectural-diagram tools like CodeBoarding.

These tools approximate the call graph through static analysis, which is incomplete on any language with dynamic dispatch, monkey patching, or computed imports.

Let the model write Python

The newest entry rebuilds retrieval as a programming task. The Recursive Language Models paper from Zhang et al. frames the input prompt as data inside a sandboxed Python REPL. The model writes code to peek into the variable, decompose it, search slices of it, and recursively call sub-LLMs over the pieces. Alex Zhang’s writeup shows the technique scaling to roughly two orders of magnitude beyond the model’s native context, and DSPy ships the pattern as a module.

Mitko Vasilev posted in May about replacing his last RAG pipeline with a DSPy-backed RLM against a CUDA codebase he maintains. He reported the agent moving from 15% to 40% on a 100-task long-horizon benchmark, with 80% on the logic subset and zero failures. His framing is the part I find sticky: “An RLM stores data in variables and passes references. Like a real programmer.”

This is the agentic-search strategy with the search tools swapped out for a full Python interpreter, and the repo loaded into memory as data.
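The sketch below shows the kind of helper a model might write for itself inside such a REPL. It is an illustration of the decompose-and-recurse pattern only, not the paper's code and not the DSPy module's API. The llm argument is a hypothetical callable taking (question, context) and returning a string, and budget is measured in characters for simplicity.

```python
def recursive_query(question, context, llm, budget=2000):
    """Answer `question` over `context`, splitting until each piece fits `budget`.

    `llm` is a hypothetical callable: llm(question, context) -> str.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    if len(context) <= budget:
        return llm(question, context)

    # Prefer a line boundary near the midpoint; fall back to the midpoint.
    mid = len(context) // 2
    cut = context.rfind("\n", 0, mid) + 1 or mid

    partials = [
        recursive_query(question, part, llm, budget)
        for part in (context[:cut], context[cut:])
    ]
    # Assumes sub-answers are short; a fuller version would recurse here
    # too if the combined partials exceeded the budget.
    return llm(question, "\n".join(partials))
```

The context stays in a variable the whole time. Each sub-call sees only a slice of it, and the parent sees only the sub-answers, which is how the approach reaches past the native context window.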

Where each fits

Strategy	Precompute	Freshness	Best for
Agentic search	None	Always current	Small to mid repos, fast-moving code
Vector embeddings	Heavy	Re-index on a timer	Semantic queries over large repos
AST symbol index	Medium	File-watch updates	Surgical lookup by name
Code graph	Heavy	File-watch updates	Structural queries across files
RLM	None	Always current	Very large contexts and aggregation

What I use

I run Claude Code daily on the agentic-search end of the spectrum, with jCodeMunch as the supplement when I’m digging into an unfamiliar codebase. The vector-embedding tools never stuck for me because the freshness problem on a fast-moving repo costs more than the queries save.

RLM is the one I’m watching. The DSPy module is published, so trying it is cheap. The claim worth checking is whether the model holds up on multi-million-line codebases where every other strategy is already strained.