From Chatbot to Agent: What's the Difference?
A chatbot generates text. You type a question, the model produces an answer, and the interaction is over. That's all a chatbot does: it maps an input string to an output string, however impressively. An agent, by contrast, takes actions. It doesn't just tell you the answer — it goes and does things. It searches the web, runs code, reads files, calls APIs, and makes decisions based on the results. The shift from chatbot to agent is the shift from a model that talks about the world to a model that acts in it.
The key architectural distinction is that an agent has a loop. A chatbot interaction is a single pass: user asks, model responds, done. An agent interaction is iterative: the user asks a question, the model thinks about what to do, calls a tool, reads the result, thinks again, maybe calls another tool, reads that result, and continues until the task is complete. Consider the difference between asking "What's the weather in Tokyo?" to a chatbot (which can only guess based on its training data, and that data is months old) versus asking the same question to an agent (which calls a weather API, reads the JSON response, and gives you the actual current temperature). The chatbot generates text about the weather; the agent checks the weather.
This loop is what makes agents both powerful and dangerous. Because the agent can chain together arbitrary sequences of actions — search, then read, then compute, then write, then verify — it can accomplish tasks that no single model call could handle. But it also means the agent can take unexpected actions, compound errors across steps, or spiral into unproductive loops. A chatbot can hallucinate a wrong answer; an agent can hallucinate a wrong answer and then act on it. We'll address safety and evaluation in the final article of this track, but it's worth flagging the stakes early: agency amplifies both capability and risk.
The Agent Loop
Every agent, no matter how sophisticated, runs the same fundamental loop. Whether it's a coding assistant that edits files, a research agent that searches the web, or a customer-service bot that queries databases and issues refunds, the pattern is identical:
- Observe: receive input. This could be the user's original query, the result of a tool call, an error message, or any new information from the environment.
- Reason: decide what to do next. The LLM examines the current state — what has been asked, what has been tried, what has been learned — and chooses an action. This might be calling a specific tool with specific arguments, or it might be deciding that enough information has been gathered and it's time to respond to the user.
- Act: execute the chosen action. Call a function, run a code snippet, send an API request, click a button in a browser, or write to a file.
- Observe again: see the result of the action. Did the API return data? Did the code throw an error? Did the file write succeed?
- Repeat: loop back to the reasoning step with the new observation, and continue until the task is complete or a stopping condition is met (such as a maximum number of iterations, to prevent infinite loops).
If this reminds you of the observe-decide-act loop from reinforcement learning and robotics, that's no coincidence. The structure is the same — an entity perceives its environment, makes a decision, takes an action, and observes the outcome — except that here the "brain" is an LLM rather than a policy network, and the "environment" is the digital world (APIs, files, databases, websites) rather than a physical one. The LLM's strength is that it can handle the reasoning step in natural language, making it flexible enough to work across wildly different tasks without task-specific training.
Crucially, the agent has access to a set of tools — functions it can call. Each tool has a name, a description, and a schema defining its parameters. The LLM decides which tool to call and with what arguments based on the current state of the conversation. The tools are what connect the model to the outside world. Without tools, the LLM is just a text generator; with tools, it becomes an agent. The pseudocode below shows the core loop:
def agent_loop(user_query, tools, llm, max_steps=10):
    """The fundamental agent loop."""
    messages = [{"role": "user", "content": user_query}]
    for step in range(max_steps):
        # REASON: the LLM reads the conversation so far and decides what to do next
        response = llm.generate(messages, tools=tools)

        # If the LLM responds with plain text (no tool call), the task is done
        if response.type == "text":
            return response.content

        # ACT: execute the tool the LLM chose, with the arguments it supplied
        tool_name = response.tool_call.name
        tool_args = response.tool_call.arguments
        tool_result = tools[tool_name].execute(**tool_args)

        # OBSERVE: record the tool call and its result in the conversation,
        # so the next REASON step can see what happened
        messages.append({"role": "assistant", "tool_call": response.tool_call})
        messages.append({"role": "tool", "content": tool_result})
        # Loop back to REASON with the new observation

    return "Max steps reached without a final answer."
Notice how simple this is. The entire intelligence lives in the LLM's ability to look at the conversation history (which now includes tool results) and decide what to do next. The loop itself is trivial plumbing. This simplicity is deceptive, though: the hard problems are all in the details. How does the LLM know which tool to call? How do you define tools so the LLM can use them reliably? What happens when a tool call fails? How do you prevent the agent from going off the rails? These are the questions the rest of this track will answer.
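To make "define tools" concrete, here is a minimal sketch of what an entry in the tools dictionary from the pseudocode might look like. The Tool class, the get_weather stub, and the schema layout are illustrative assumptions built around this article's abstract interface, not any vendor's actual API; real providers each define their own JSON-Schema-based formats, which article 2 covers in detail:

class Tool:
    """A callable function plus the schema the LLM sees. (Illustrative, not a vendor API.)"""
    def __init__(self, name, description, parameters, fn):
        self.name = name                # the identifier the LLM writes in its tool call
        self.description = description  # tells the LLM when this tool is appropriate
        self.parameters = parameters    # JSON-Schema-style argument specification
        self.fn = fn

    def execute(self, **kwargs):
        return self.fn(**kwargs)

def get_weather(city):
    # Stub for illustration; a real tool would call a weather API here.
    return f"Weather in {city}: 18°C, clear"

weather_tool = Tool(
    name="get_weather",
    description="Get the current weather for a city. Use for any question about current weather.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name, e.g. Tokyo"}},
        "required": ["city"],
    },
    fn=get_weather,
)

tools = {weather_tool.name: weather_tool}  # the dict passed to agent_loop

The description and parameter schema are the only things the LLM ever sees; writing them clearly is most of the work of making a tool usable.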
What Can Agents Do That Chatbots Can't?
The agent loop unlocks entire categories of capability that are simply impossible for a pure text-generation model. Understanding these categories helps clarify why agents are worth the added complexity.
Access real-time information. A chatbot's knowledge is frozen at training time (as we discussed in the RAG track). An agent can search the web, query a database, or call an API to get information that is seconds old. "What's the current price of NVIDIA stock?" is a question a chatbot can only answer with stale data, but an agent can call a stock API and return the live price.
Take actions in the world. This is the most consequential difference. A chatbot can describe how to send an email; an agent can send the email. A chatbot can explain how to fix a bug; an agent can open the file, edit the code, run the tests, and commit the fix. The gap between "here's what you could do" and "it's done" is enormous in practice, and agents close it.
Use specialised tools. LLMs are bad at arithmetic, struggle with precise date calculations, and can't natively generate images. But an agent with access to a calculator tool, a calendar API, and an image generation model can handle all of these correctly. The LLM doesn't need to be good at everything — it just needs to be good at deciding which tool to use and how to use it. This is a form of cognitive division of labour: the LLM handles reasoning and language, while specialised tools handle domains where precision matters.
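As a sketch of that division of labour, here is one way a calculator tool's implementation might look, so the agent gets exact arithmetic instead of the LLM's approximations. Using Python's ast module for safe expression evaluation is an illustrative choice, not the only approach:

import ast
import operator

# Map AST operator nodes to exact Python arithmetic
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def calculator(expression):
    """Evaluate an arithmetic expression exactly, without eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

# The LLM never does the arithmetic itself; it emits a tool call like
# calculator("1234 * 5678") and simply reads back the exact result.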
Multi-step reasoning with error recovery. Real tasks rarely complete in one step. "Book me a flight from NYC to London under $500" requires searching for flights, comparing prices, filtering by budget, selecting the best option, and completing a booking — with potential failures at each step (no flights available, price changed, payment declined). An agent can break this into subtasks, handle each one, and recover from errors by trying alternative approaches. A chatbot can only give you a single response and hope it covers everything.
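In the loop pseudocode above, this recovery behaviour falls out naturally if tool failures are fed back as observations instead of crashing the loop. A minimal sketch of the ACT step with that change, assuming the same abstract interface:

# ACT, with failures turned into observations the LLM can react to
try:
    tool_result = tools[tool_name].execute(**tool_args)
except Exception as e:
    # Report the failure so the next REASON step can retry with
    # different arguments or switch to another tool.
    tool_result = f"ERROR: {tool_name} failed with: {e}"
messages.append({"role": "tool", "content": tool_result})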
Maintain state across actions. Because the agent loop accumulates tool results in the conversation history, the agent remembers what it has done, what it has learned, and what remains. It can avoid repeating work, build on previous results, and track progress toward a goal. "Find all the bugs in this codebase and fix them" requires reading files, identifying issues, editing code, running tests, checking results, and iterating — a stateful process that can span dozens of steps.
The Three Eras of LLM Capability
To understand why agents are emerging now and not five years ago, it helps to look at how LLM capabilities have evolved in three distinct eras. Each era built on the previous one, and the agent era specifically required capabilities that didn't exist before 2023.
Era 1: Text generation (GPT-3, 2020). The first large language models could generate remarkably coherent text, complete prompts, and even perform few-shot learning. But they had no mechanism for interacting with the outside world. They could write a convincing paragraph about the weather, but they couldn't check the weather. They could produce plausible-looking code, but they couldn't run it. The model's entire world was the text it had been trained on, and all it could do was produce more text in the same distribution.
Era 2: Instruction following (ChatGPT, 2022). With RLHF and instruction tuning, models learned to follow user instructions, maintain coherent multi-turn conversations, refuse harmful requests, and format outputs in specific ways. This was a massive usability improvement — it turned raw text generators into useful assistants — but the fundamental limitation remained: the model could only produce text. It could explain how to do things, but it couldn't do them.
Era 3: Tool use and agency (2023–present). This is the era we're in now. Models can call functions, use tools, and take multi-step actions in the real world. A user says "find the cheapest flight to London next Tuesday" and the model doesn't just describe what you'd need to do — it calls a flight search API, parses the results, compares prices, and tells you the answer (or books it, if given permission). This isn't a minor upgrade; it's a qualitative shift in what LLMs can be used for.
What enabled Era 3? Several developments had to converge:
- Function calling training: models had to learn to produce structured tool calls (not just free-form text) during post-training; the sketch after this list shows what such a call looks like. This required new training data formats, new loss functions, and careful fine-tuning to teach models when and how to invoke tools.
- Better reasoning: agents need to plan multi-step actions, which requires chain-of-thought reasoning and extended thinking capabilities. Early models couldn't reliably decide which tool to call or what arguments to pass; newer models can reason through complex sequences of actions before executing them.
- Infrastructure: agents need sandboxed execution environments, permission systems, and standardised protocols for connecting to tools. The Model Context Protocol (MCP), which we'll cover in article 4, is one such protocol that lets any model connect to any tool through a standardised interface.
- Longer context windows: agents accumulate information over the course of a task — the original query, each tool call and result, reasoning traces, error messages. A 10-step agent interaction might consume tens of thousands of tokens. Early models with 2K–4K context windows couldn't sustain multi-step agent loops; modern models with 128K–1M token windows can.
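To illustrate the first point, here is roughly what a structured tool call looks like when a model emits one. The field names are a generic sketch modelled on common function-calling formats rather than any one vendor's schema, and the flight-search tool and its argument values are invented for illustration:

# Instead of free-form text, the model emits machine-parseable output like this.
tool_call = {
    "type": "tool_call",
    "name": "search_flights",          # hypothetical tool
    "arguments": {
        "origin": "NYC",
        "destination": "LON",
        "date": "2025-06-10",          # hypothetical values
        "max_price_usd": 500,
    },
}
# The runtime validates the arguments against the tool's schema, executes
# the function, and appends the result to the conversation as an observation.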
Two foundational papers made the case for tool-using LLMs. ReAct (Yao et al., 2023) showed that interleaving reasoning traces ("I need to search for X because...") with actions ("Search[X]") dramatically improved performance on multi-step tasks compared to reasoning or acting alone. The key insight was that reasoning helps the model plan and interpret results, while acting grounds the reasoning in real information. Toolformer (Schick et al., 2023) demonstrated that language models can learn to use tools (calculators, search engines, translators) in a self-supervised way: the model decides when and where to insert tool calls into its own text, trained on examples where tool use improved prediction accuracy. Together, these papers established that LLMs are not just text generators — they can be tool-using reasoners.
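To see what ReAct-style interleaving looks like in practice, here is a short trace in the paper's Thought/Action/Observation format. The question and the observation text are invented for illustration, though Search[...] and Finish[...] are action types the paper actually used:

Thought: The question asks for the height of the highest mountain in Japan.
         I should search for it rather than rely on memory.
Action: Search[highest mountain in Japan]
Observation: Mount Fuji is the highest mountain in Japan, at 3,776 metres.
Thought: The observation answers the question directly.
Action: Finish[Mount Fuji, 3,776 metres]

The reasoning steps plan and interpret; the action steps ground the answer in retrieved information rather than the model's memory.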
The Rest of This Track
This track covers the full landscape of LLM agents, from the low-level mechanism that makes tool use possible to the high-level architectures that coordinate multiple agents on complex tasks. Here's what's ahead:
- Article 2: Function calling. The mechanism that lets LLMs use tools. How models learn to output structured tool calls, how you define tool schemas, and how the API plumbing works end to end.
- Article 3: ReAct and reasoning loops. How agents decide what to do. The ReAct framework, chain-of-thought planning, and the patterns that make agent reasoning reliable (or unreliable).
- Article 4: MCP (Model Context Protocol). The protocol that connects agents to tools at scale. How MCP standardises the interface between models and tools, enabling a plug-and-play ecosystem.
- Article 5: Computer use. Agents that interact with graphical user interfaces — clicking buttons, filling forms, navigating websites. How vision-language models enable agents to use software the same way humans do.
- Article 6: Code agents. Claude Code, Codex CLI, Devin, and the new breed of agents that write, debug, and deploy software. How they work, what they can do, and where they fail.
- Article 7: Multi-agent systems. When one agent isn't enough. Architectures for coordinating multiple specialised agents, including supervisor patterns, debate, and swarm approaches.
- Article 8: Safety, evaluation, and the frontier. How to evaluate agents, prevent them from going off the rails, and what the cutting edge looks like.
Each article builds on the previous ones, but they're also designed to be readable independently if you're interested in a specific topic. If you want to understand how function calling works under the hood, you can jump straight to article 2. If you're building a multi-agent system and want to understand coordination patterns, article 7 will be most relevant.
Quiz
Test your understanding of what makes agents different from chatbots.
What is the key architectural difference between a chatbot and an agent?
In the agent loop pseudocode, what determines whether the agent should call a tool or return a final answer to the user?
Which development was NOT necessary for the emergence of tool-using agents (Era 3)?
What did the ReAct paper (Yao et al., 2023) demonstrate about combining reasoning and acting?