Code Is the Ultimate Tool

In previous articles we saw agents that call pre-defined functions: search the web, query a database, look up the weather. Those tools are powerful, but they are fixed. Someone had to anticipate each capability the agent might need and write a tool for it. A weather agent cannot suddenly decide to parse a CSV file, because nobody gave it a CSV-parsing tool. A code agent breaks through this limitation entirely: instead of calling pre-built tools, it writes code to solve problems. Code can express arbitrary computation, which means a code agent can construct any tool it needs on the fly, as long as it can write the program to do it.

But code agents do more than generate standalone scripts. The most capable ones operate on existing codebases: they read source files, understand project structure, write changes across multiple files, run tests, interpret error messages, and iterate until the code works. This is fundamentally different from asking a model to write a function in a chat window. A code agent is embedded in a real development environment with access to the filesystem, the terminal, version control, and the full context of a software project.

Why is this the most powerful form of agency? Because every other type of agent action — calling an API, querying a database, transforming data, running a computation — can be expressed as code. A code agent with access to a terminal can do anything a software engineer can do from their laptop. It can install packages, write scripts, call APIs, process data, build and deploy applications. The ceiling of what it can accomplish is limited only by its ability to write correct programs and its access to the right environment.

The landscape of code agents has grown rapidly. Claude Code (Anthropic) is a terminal-based agent that operates directly on the local filesystem. Codex CLI (OpenAI) takes a similar approach with sandboxed execution. Cursor Agent embeds the agent inside an IDE. Devin (Cognition) runs in a full virtual environment with a browser, terminal, and editor. Despite different interfaces and models, all of these share the same core pattern: the edit-test loop.

The Edit-Test Loop

How does a code agent actually work? If you've read the ReAct article in this track, you already know the core pattern: think, act, observe, repeat. A code agent applies that exact loop to software engineering. The cycle has five phases, and understanding each one is crucial to understanding why code agents succeed (and where they fail).

  • 1. Read: the agent searches the codebase to find relevant files. It uses tools like grep, glob (file pattern matching), and file reading to explore the project structure. A real codebase might have thousands of files — the agent cannot read all of them. It has to figure out which files matter for the task at hand, often starting with a broad search and narrowing down. This is where most of the context challenge lives: finding the right 5 files out of 5,000.
  • 2. Plan: after reading the relevant code, the agent decides what changes to make. Which files need modification? What approach should it take? Does it need to create new files or only edit existing ones? This planning step often involves reasoning about the architecture of the project — how modules depend on each other, what the conventions are, where tests live.
  • 3. Edit: the agent writes or modifies code. This could be a targeted one-line bug fix, a new function, or changes across multiple files. The edit itself is the action step of the ReAct loop.
  • 4. Test: the agent runs the code or the project's test suite to check whether its changes work. This is the observation step: the agent gets concrete feedback from the real world (not from the model's own reasoning) about whether the code is correct. A passing test suite is strong evidence that the change works. A failing test is a precise error message that points to what went wrong.
  • 5. Fix: if tests fail, the agent reads the error output, reasons about the cause of the failure, and goes back to step 3. This iterative self-correction is what distinguishes a code agent from a one-shot code generator. A model that generates code in a chat window gives you one attempt — if it's wrong, you have to manually debug it. A code agent debugs its own mistakes, often across several iterations, until the tests pass.
def code_agent_loop(task, codebase, max_iterations=10):
    """The core loop every code agent runs."""

    for i in range(max_iterations):
        # 1. READ — find relevant files
        relevant_files = search_codebase(codebase, task)

        # 2. PLAN — decide what to change
        plan = reason_about_changes(task, relevant_files)

        # 3. EDIT — write or modify code
        apply_changes(plan, codebase)

        # 4. TEST — run the test suite
        test_result = run_tests(codebase)

        if test_result.all_passed:
            return "Task complete"  # done!

        # 5. FIX — read the errors, loop back to step 3
        task = f"""Tests failed with:
{test_result.error_output}

Fix the failing tests."""

    return "Max iterations reached"

The tools a code agent needs are surprisingly simple. Most code agents work with just four core capabilities: file read/write (to view and modify source code), search (grep, glob, or some form of codebase search to find relevant files without reading everything), bash execution (to run tests, install packages, and execute arbitrary commands), and git operations (to commit changes, create branches, and manage version control). That's it. With those four tools, a code agent can do virtually anything a human developer does from the terminal.
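
To make that concrete, here is a minimal sketch of such a tool set in Python. The names and signatures (read_file, write_file, search, bash) are illustrative rather than taken from any particular agent, and git operations are simply shell commands run through the bash tool; a real implementation would add path sandboxing, output truncation, and more careful error handling.

import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    """View a source file so the agent can reason about it."""
    return Path(path).read_text()

def write_file(path: str, content: str) -> None:
    """Create a new file or overwrite an existing one."""
    Path(path).write_text(content)

def search(pattern: str, root: str = ".") -> list[str]:
    """Find files whose contents contain a pattern (a crude grep)."""
    return [
        str(p) for p in Path(root).rglob("*.py")
        if pattern in p.read_text(errors="ignore")
    ]

def bash(command: str, timeout: int = 120) -> str:
    """Run a shell command (tests, git, package installs) and capture its output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr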

The hardest part of the loop is step 1: finding the relevant code. A human developer who has worked on a project for months has a mental model of where everything lives. A code agent starts fresh every time. It has to navigate a codebase it has never seen before, figure out the project structure, and locate the exact files and functions that need to change. This is why codebase search tools are so critical — and why agents that can effectively explore a project tend to dramatically outperform those that can't.
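
One plausible way to do that narrowing, sketched below under the assumption of a Python-only project (the function name find_relevant_files is hypothetical): extract keywords from the task, score every file by how often those keywords appear, and read only the top few. Real agents use much better heuristics (regex search, directory structure, import graphs), but the shape of the problem is the same.

from pathlib import Path

def find_relevant_files(task: str, root: str = ".", top_n: int = 5) -> list[str]:
    """Crude READ-phase heuristic: rank files by keyword overlap with the task."""
    keywords = [w.lower() for w in task.split() if len(w) > 3]
    scores = {}
    for path in Path(root).rglob("*.py"):
        try:
            text = path.read_text(errors="ignore").lower()
        except OSError:
            continue
        score = sum(text.count(k) for k in keywords)
        if score:
            scores[str(path)] = score
    # Read only the highest-scoring files instead of the whole repository
    return sorted(scores, key=scores.get, reverse=True)[:top_n]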

💡 The edit-test loop is the key reason code agents outperform one-shot code generation. A 2024 study on SWE-bench showed that agents which ran tests and iterated on failures solved roughly 3-4x more issues than agents that generated a single patch without testing. The feedback from real execution is irreplaceable: it catches errors that no amount of reasoning can predict.

Claude Code

Claude Code is Anthropic's CLI agent for software engineering. It runs in the terminal with direct access to the local filesystem, which means it operates on the same codebase you're working on, in the same environment where you build and test your code. This is not a web-based sandbox or a chat window — it's an agent embedded in your development workflow.

The architecture is straightforward: a Claude model connected to a set of tools. The core tool set includes:

  • Read: read any file on the filesystem. Supports reading specific line ranges for large files, and can even view images and PDFs.
  • Edit: make targeted edits to existing files by specifying the exact text to replace. This is a surgical operation — the agent doesn't rewrite the entire file, it replaces specific sections. A minimal sketch of this kind of edit appears after this list.
  • Write: create new files or completely rewrite existing ones.
  • Grep: search file contents using regex patterns. This is how the agent finds relevant code across a large project without reading every file.
  • Glob: search for files by name pattern (e.g., all *.test.js files). Useful for understanding project structure.
  • Bash: execute any shell command. This is the most powerful tool — it lets the agent run tests, install packages, use git, start servers, and execute arbitrary programs.
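
To illustrate the surgical nature of the Edit tool, here is a hypothetical implementation of an exact-replace edit (not Claude Code's actual code). The error checks capture the key constraint: the agent has to quote enough surrounding context to make the replacement unique, otherwise the edit is rejected and the agent must re-read the file.

from pathlib import Path

def edit_file(path: str, old_text: str, new_text: str) -> None:
    """Surgically replace exactly one occurrence of old_text with new_text."""
    source = Path(path).read_text()
    count = source.count(old_text)
    if count == 0:
        raise ValueError("old_text not found; re-read the file and try again")
    if count > 1:
        raise ValueError("old_text is ambiguous; include more surrounding context")
    Path(path).write_text(source.replace(old_text, new_text, 1))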

A typical interaction looks like this: you give Claude Code a task ("fix the failing test in auth.test.js" or "add pagination to the /users API endpoint"). The agent then enters the edit-test loop. It searches the codebase to understand the project structure, reads the relevant files, plans its approach, makes changes, runs the tests, and iterates until the task is done. Often this involves multiple rounds of editing and testing, especially for non-trivial tasks.

User: Fix the bug in the login endpoint — users are getting 500 errors.

Claude Code:
  [Grep] Searching for "login" across the codebase...
  → Found: src/routes/auth.js, src/middleware/session.js, tests/auth.test.js

  [Read] Reading src/routes/auth.js...
  → The login handler calls db.findUser(email) but doesn't handle null

  [Read] Reading tests/auth.test.js...
  → Test "should return 401 for unknown user" is failing with 500

  [Edit] Adding null check in src/routes/auth.js:
    if (!user) return res.status(401).json({ error: "Invalid credentials" });

  [Bash] Running: npm test
  → 23 passed, 0 failed

  Done. The bug was a missing null check — when a user wasn't found in
  the database, the code tried to access user.passwordHash, which threw
  a TypeError. Added a null check that returns 401 instead.

Several capabilities make Claude Code effective for real-world software tasks:

  • Multi-file edits: real software changes rarely touch a single file. Adding a feature might require modifying a route handler, updating a database schema, writing a migration, adding tests, and updating documentation. Claude Code can coordinate changes across all of these in a single session.
  • Iterative test fixing: when tests fail, the agent reads the error output and makes targeted fixes. It might go through 3-5 rounds of test-fix cycles for a complex change, much like a human developer would.
  • Git operations: the agent can create commits, branches, and pull requests. This means it can deliver changes that are ready for code review, not just raw patches.
  • Project understanding via CLAUDE.md: a CLAUDE.md file at the root of a repository gives the agent persistent context about the project: build commands, code conventions, architecture notes, and anything else that a new developer would need to know. The agent reads this file at the start of a session, much like a new team member reading an onboarding guide.
  • Extended thinking: for complex problems, the model can reason through the approach before taking action. This is especially valuable for tasks that require understanding the relationships between multiple components or making architectural decisions.
  • MCP integration: Claude Code can connect to additional tools via MCP servers, extending its capabilities beyond the built-in tool set. For example, a Jira MCP server lets it read tickets and update their status, and a database MCP server lets it query production data to diagnose issues.

Codex CLI and Other Code Agents

Claude Code isn't the only code agent. The same edit-test loop pattern appears across a growing ecosystem of tools, each with different models, environments, and design trade-offs. Understanding the landscape helps clarify what's common to all code agents versus what's specific to any one product.

Codex CLI (OpenAI) is architecturally similar to Claude Code: a terminal-based agent that reads files, writes code, and runs commands. The key difference is its emphasis on sandboxed execution. Rather than running directly on the host filesystem, Codex CLI executes commands in isolated containers. This is a safety trade-off: sandboxing limits the damage from a misbehaving agent (it can't accidentally delete your home directory), but it also means the agent has less direct access to the full development environment. It uses OpenAI models and supports multi-file editing and test execution within its sandbox.

Cursor Agent takes a different approach by embedding the agent inside the Cursor IDE. The advantage is context: the IDE knows which files are open, what the project structure looks like, where the terminal is, and what errors the linter is reporting. The agent can see all of this and use it to make better decisions. When Cursor Agent makes changes, they appear directly in the editor — you can see the diff in real time, accept or reject individual edits, and provide feedback without switching to a terminal. The trade-off is that the agent is tied to the IDE environment; it can't run in CI pipelines or headless automation scenarios the way a CLI agent can.

Devin (Cognition, 2024) was introduced as "the first AI software engineer" and represents the most autonomous end of the spectrum. Devin runs in a full virtual environment with a web browser, terminal, and code editor. It can plan and execute multi-step software tasks autonomously: reading issue descriptions, exploring documentation in a browser, writing and testing code, and submitting pull requests. Its environment gives it capabilities that terminal-only agents lack — for example, it can visually verify a UI change by opening a browser and checking the rendered page.

Despite their differences, every code agent follows the same fundamental pattern. All of them execute the edit-test loop: read code, understand the problem, make changes, run tests, fix errors, and repeat. The differences are in the model (Claude, GPT, etc.), the environment (terminal, IDE, virtual machine), the tool set (file editing, browser, search), and the level of autonomy (human-in-the-loop vs. fully autonomous). The loop itself is universal.

Agent          Environment     Execution      Autonomy
─────────────  ──────────────  ─────────────  ──────────────────
Claude Code    Terminal/CLI    Local (host)   Human-in-the-loop
Codex CLI      Terminal/CLI    Sandboxed      Human-in-the-loop
Cursor Agent   IDE (Cursor)    Local (IDE)    Human-in-the-loop
Devin          Full VM         Sandboxed      Fully autonomous

SWE-bench: Measuring Code Agent Performance

How do we know whether a code agent is actually good? Generating plausible-looking code is easy — generating code that actually works on real projects is a much harder test. SWE-bench (Jimenez et al., 2024) provides exactly this test. It's a benchmark of real GitHub issues from popular open-source Python repositories like Django, Flask, scikit-learn, matplotlib, and sympy. Each task gives the agent a repository at a specific commit and an issue description (written by a real developer). The agent must produce a patch — actual code changes — that resolves the issue. The patch is evaluated by running the repository's own test suite: if the tests that were failing now pass (and the tests that were passing still pass), the issue is considered resolved.

This evaluation approach is what makes SWE-bench uniquely rigorous. There is no ambiguity about whether the agent "got the right answer" — the test suite is the judge. The agent can't game the metric by producing plausible-looking diffs that don't actually work. Either the tests pass or they don't.
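
In pseudocode, the resolution check looks roughly like the sketch below. The repo object and the fail_to_pass / pass_to_pass lists are illustrative stand-ins for the benchmark's harness and per-task test lists, not the actual SWE-bench API.

def is_resolved(repo, patch, fail_to_pass, pass_to_pass) -> bool:
    """Simplified SWE-bench-style check: the patch must fix the originally
    failing tests without breaking any previously passing ones."""
    repo.checkout_base_commit()
    repo.apply(patch)                      # the agent's proposed code change
    results = repo.run_tests(fail_to_pass + pass_to_pass)
    newly_fixed = all(results[t] == "PASS" for t in fail_to_pass)
    nothing_broken = all(results[t] == "PASS" for t in pass_to_pass)
    return newly_fixed and nothing_broken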

The benchmark comes in several variants:

  • SWE-bench (full): 2,294 tasks drawn from 12 Python repositories. Some are ambiguous or underspecified, making the full set noisy.
  • SWE-bench Lite: a curated subset of 300 tasks selected for clarity and self-containment. Each task has a clear issue description and a test that unambiguously validates the fix.
  • SWE-bench Verified: a human-verified subset of 500 tasks where annotators confirmed that the issue description is clear, the test is correct, and the task is solvable from the information given. This is the most widely reported variant as of 2025.

As of early 2025, the state of the art on SWE-bench Verified sits in the 50-70% range, with the top performing agents resolving roughly half to two-thirds of verified issues. To put this in perspective: human software engineers who are given these same issues without prior context about the repository resolve them at roughly comparable rates. These are not trivial tasks — they require understanding real codebases, navigating dependencies, and writing correct patches.

What do the results reveal about where code agents are strong and where they struggle?

  • Strong: localised bug fixes (a null check, an off-by-one error, a missing import), straightforward feature additions where the pattern is clear from existing code, and tasks where the failing test provides a clear signal for what needs to change.
  • Weak: tasks requiring deep architectural understanding ("refactor the authentication system to support OAuth2"), multi-file refactors where the changes have cascading effects, and tasks where the issue description is vague or requires domain-specific knowledge that isn't in the codebase.

Beyond SWE-bench, several other benchmarks measure different aspects of code capability. HumanEval (Chen et al., 2021) tests standalone function generation (164 Python problems with unit tests). MBPP (Mostly Basic Python Problems; Austin et al., 2021) covers 974 basic programming tasks. LiveCodeBench (Jain et al., 2024) uses competitive programming problems posted after model training cutoffs, preventing data contamination. These benchmarks test code generation (writing a function from a description), whereas SWE-bench tests code agency (navigating a real codebase and producing a working patch). The distinction matters: a model can score perfectly on HumanEval while failing badly on SWE-bench, because the two tasks require fundamentally different capabilities.

💡 The gap between code generation benchmarks and code agency benchmarks reveals something important. Writing correct code in isolation is a solved problem for frontier models (HumanEval scores above 95%). The hard part is everything around the code: understanding a large codebase, finding the right files, making changes that are consistent with existing patterns, and verifying correctness through tests. That's what SWE-bench measures, and why it remains challenging.

The Future of Code Agents

Code agents have improved dramatically in a short time — from resolving less than 5% of SWE-bench tasks in early 2024 to over 50% by early 2025. But significant limitations remain, and understanding them is essential for using code agents productively today.

The biggest limitation is context. Even with context windows exceeding one million tokens, an agent cannot hold an entire large codebase in memory at once. A medium-sized project might have 500,000 lines of code across thousands of files. The agent must work with a partial view, reading files as needed, and that partial view means it can miss important dependencies, conventions, or side effects that a human developer who has worked on the project for months would know about. This is why code agents are better at localised fixes than architectural changes: a local fix only requires understanding a few files, while an architectural change requires understanding the whole system.

A related limitation is test coverage. The edit-test loop only works if the test suite actually catches the bugs the agent introduces. If a project has spotty test coverage, the agent can make changes that pass all existing tests but break untested functionality. The agent doesn't know what it doesn't know: it can't write tests for edge cases it hasn't considered, and it can't verify behaviour that isn't tested. In practice, code agents work best on well-tested codebases, and struggle on codebases where tests are sparse or unreliable.

There is also the problem of long-horizon tasks. Current code agents excel at tasks that take a human 10 minutes to an hour: a bug fix, a small feature, a refactor of a single module. Tasks that take days — implementing a complex feature from scratch, migrating a codebase from one framework to another, debugging a subtle concurrency issue — remain largely out of reach. The agent's context degrades over long sessions, it loses track of earlier decisions, and the probability of making a compounding error increases with each step.

Where is this heading? Several directions are clear. Agents will get better at maintaining long-running development sessions as context windows grow and memory mechanisms improve. They will learn to write tests for their own changes rather than relying solely on existing test suites, closing the test coverage gap. They will develop better repository understanding by building persistent mental models of project architecture (analogous to how a human developer builds familiarity with a codebase over time). And they will improve at coordinating with humans — asking clarifying questions, presenting options for architectural decisions, and flagging when a task is beyond their confidence level rather than silently producing a bad solution.

But perhaps the most important insight about code agents today is about workflow. The most productive setup is not "agent alone" but "human + agent". The human provides direction (what to build), makes ambiguous decisions (which trade-off to pick), reviews critical code (security, correctness at the boundaries), and handles tasks that require deep contextual knowledge (production debugging, system design). The agent handles the mechanical work: finding the right files, writing the boilerplate, running the tests, fixing the typos, managing version control. This division of labour lets the human focus on what they're uniquely good at — judgment, creativity, and high-level reasoning — while the agent handles what it's uniquely good at: tireless, systematic, and fast execution of well-specified tasks.

The comparison to LoRA fine-tuning is instructive here. Fine-tuning teaches a model how to do something (a new skill, a new output format). Code agency lets a model do things with the skills it already has. You don't need to fine-tune a model to fix bugs — you need to give it the right tools and the right feedback loop. That's what code agents provide: not new knowledge, but a way to apply existing knowledge to real-world software engineering tasks.

Quiz

Test your understanding of code agents, the edit-test loop, and how agent performance is measured.

What makes a code agent fundamentally more powerful than an agent with a fixed set of pre-defined tools?

In the edit-test loop, what is the primary role of the 'test' step?

How does SWE-bench evaluate whether a code agent has successfully resolved a GitHub issue?

Why do current code agents perform better on localised bug fixes than on large architectural refactors?