LLMs vs AI Agents: A Practical Mental Model for Developers

Mar 15, 2026

#AI #LLM #AI agents #Coding agents

It is becoming increasingly clear that coding with AI is a real game changer.

At this point, it is difficult for sceptics to argue that this is just another short-lived trend like the crypto/blockchain/NFT hype cycles. I do not believe that is the case here.

Because of that shift, many engineers who were previously sceptical have recently started exploring AI agents. In my opinion, the turning point came around December last year, when Anthropic and OpenAI released extremely capable coding LLMs. AI agents are now slowly becoming a mainstream reality, and companies that adopt AI-first workflows stand to reap the benefits. One open question remains: will LLM inference costs be sustainable in the long run?

As more and more developers jump onto the AI train, we are seeing a new type of confusion emerge. People hear terms like LLM, AI agent, or coding agent, but often do not clearly understand what these things actually are.

In this post, I want to explain, in simple terms:

  • what an LLM actually is
  • what an AI agent is
  • why two agents using the same LLM can behave very differently

This article is for engineers who want a clearer mental model of LLMs, AI agents, and modern AI coding tools. It is not a deep research paper, but a practical framework for understanding how these systems work and why tools built on similar models can behave very differently in practice.

If you already work with tools like Claude Code, Codex, Cursor, or OpenCode, this post should help you build a clearer mental model of what is happening under the hood and why these tools often feel so different in practice.

What is an LLM?

Before we talk about AI agents, we need to understand the component that powers them: the Large Language Model, usually abbreviated as LLM.

A useful way to think about an LLM is this:

An LLM is a neural network trained to predict the next token in a sequence of text or code.

At first glance this may sound trivial. However, when a model is trained on enormous datasets containing books, websites, documentation, and source code, the ability to predict the next token becomes surprisingly powerful.

For example, if the model sees a prompt like:

"The capital of France is"

the most statistically likely continuation is Paris.

In programming contexts the behaviour is similar. If the model sees something like:

def factorial(n):

it has likely encountered thousands of implementations of factorial functions during training. As a result, it can generate a correct implementation even if it has never seen this exact prompt before.
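A plausible completion looks like the implementation the model has seen most often in its training data:

def factorial(n):
    # The continuation most models produce: the classic recursive pattern
    if n <= 1:
        return 1
    return n * factorial(n - 1)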

This ability emerges because the model has learned patterns that exist in human language and software. It recognises how ideas are usually expressed, how functions are structured, and what pieces of code typically follow one another.

Modern LLMs are usually built using a neural network architecture called the Transformer, which was introduced in 2017. Transformers rely on a mechanism known as self-attention, allowing the model to understand relationships between tokens across long sequences of text. Instead of processing text word by word like older recurrent neural networks, a Transformer can analyse many tokens at once and determine which parts of the input are most relevant to each other. This makes it possible to handle long documents, large code files, and complex reasoning tasks more effectively.

A base model is simply good at continuing text, not necessarily at solving problems or helping users. If you asked a raw model a question, it might respond with something unrelated or produce a confusing continuation. The model understands language patterns, but it has not yet been trained to behave like a helpful collaborator.

To transform a base language model into something useful for developers, companies apply several additional training stages. These stages guide the model toward producing answers that are helpful, structured, and aligned with human expectations.

How an LLM is trained to become a good assistant

Training a modern LLM is a multi-stage process. Each stage builds on the previous one, gradually transforming a raw statistical model into a capable assistant that can answer questions, generate code, and reason about complex tasks.

In practice, this pipeline usually consists of three major stages:

  1. Pre-training – the model learns the structure of language and code from massive datasets.
  2. Instruction tuning – the model is trained to follow instructions and produce useful answers.
  3. Alignment and safety training – the model is further optimised to produce responses that humans prefer.

The diagram below shows a simplified mental model of this training pipeline.

graph LR
    %% Class Definitions for Visual Hierarchy
    classDef data fill:#e1f5fe,stroke:#01579b,stroke-width:1px;
    classDef process fill:#fff3e0,stroke:#e65100,stroke-width:2px,stroke-dasharray: 5 5;
    classDef model fill:#f3e5f5,stroke:#4a148c,stroke-width:2.5px,font-weight:bold;
    classDef feedback fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px;

    %% STAGE 1: PRE-TRAINING
    subgraph ST1 [1. Pre-training]
        D1[(Massive Web<br/>& Code Corpus)]:::data --> P1[[Self-Supervised<br/>Learning]]:::process
        P1 --> M1(Base Model):::model
    end

    %% STAGE 2: SFT
    subgraph ST2 [2. Instruction Tuning]
        M1 --> P2[[Supervised<br/>Fine-Tuning]]:::process
        D2[(High-Quality<br/>Q&A Pairs)]:::data --> P2
        P2 --> M2(Assistant Model):::model
    end

    %% STAGE 3: ALIGNMENT
    subgraph ST3 [3. Alignment & Safety]
        M2 --> P3{Alignment Strategy}
        
        %% The "How" - Feedback Sources
        F1[Human Feedback<br/>'RLHF']:::feedback -.-> P3
        F2[AI Constitution<br/>'Constitutional AI']:::feedback -.-> P3
        
        %% The "Engine" - Optimization
        P3 --> DPO[[Direct Preference<br/>Optimization]]:::process
        
        DPO --> M3(Final Aligned<br/>Production Model):::model
    end

    %% Legend-style box styling
    style ST1 fill:#fafafa,stroke:#ccc
    style ST2 fill:#fafafa,stroke:#ccc
    style ST3 fill:#f5f5f5,stroke:#999

Pre-training

The first stage is called pre-training, and it is by far the most computationally expensive.

During this stage the model is trained on extremely large datasets collected from many sources. These typically include web pages, source code repositories, books, research papers, and technical documentation. In practice, these datasets are combined into a single massive training corpus before training begins.

graph TD
    %% Class Definitions
    classDef data fill:#e1f5fe,stroke:#01579b,stroke-width:1px;
    classDef process fill:#fff3e0,stroke:#e65100,stroke-width:2px,stroke-dasharray: 5 5;
    classDef model fill:#f3e5f5,stroke:#4a148c,stroke-width:2.5px,font-weight:bold;

    D1[(Massive Web Corpus)]:::data --> D_ALL[Combined Dataset]
    D2[(GitHub/Code)]:::data --> D_ALL
    D3[(Books & Papers)]:::data --> D_ALL

    D_ALL --> SSL[[Self-Supervised Learning]]:::process
    
    subgraph Mechanics [Internal Process]
        SSL --> NTP[Next Token Prediction]
        NTP --> ATN[Attention Mechanism Tuning]
    end

    Mechanics --> M1(Base Model):::model

    style Mechanics fill:#fafafa,stroke:#ccc,stroke-dasharray: 5 5

The training objective is very simple:

Given a sequence of tokens, predict the next token.

A token can be a word, part of a word, punctuation, or a piece of code. For example, the phrase:

Machine learning models

might be represented internally as several tokens rather than individual words.
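You can inspect this yourself with a tokenizer library such as OpenAI's tiktoken. A minimal sketch; the exact splits depend on which encoding the model uses:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Machine learning models")

print(tokens)                              # a list of integer token IDs
print([enc.decode([t]) for t in tokens])   # the text fragment behind each ID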

By repeating this prediction task billions or trillions of times, the model gradually learns statistical relationships between tokens. It begins to recognise grammatical structures, common reasoning patterns, and programming idioms. Over time the model develops an internal representation of language that allows it to generate coherent text and code.

However, a pretrained model is still not very useful for interactive tasks. It can generate text that looks realistic, but it does not yet know how to respond appropriately to user instructions.

Instruction Tuning

To make the model more useful, developers perform a second stage called supervised fine-tuning (SFT).

The training data typically includes carefully curated question–answer pairs, multi-turn conversations, and structured reasoning tasks. These examples demonstrate how a helpful assistant should respond to different types of user requests.

graph TD
    classDef data fill:#e1f5fe,stroke:#01579b,stroke-width:1px;
    classDef process fill:#fff3e0,stroke:#e65100,stroke-width:2px;
    classDef model fill:#f3e5f5,stroke:#4a148c,stroke-width:2.5px;

    M1(Base Model):::model --> SFT_PROC[[Supervised Fine-Tuning]]:::process
    
    subgraph Data_Inputs [Instruction Data]
        D_QA[(Golden Q&A Pairs)]:::data
        D_INST[(Multi-turn Dialogues)]:::data
        D_TASK[(Reasoning Tasks)]:::data
    end

    Data_Inputs --> SFT_PROC
    SFT_PROC --> M2(Assistant Model):::model

    style Data_Inputs fill:#fafafa,stroke:#ccc

Each example pairs a prompt with a high-quality response written or vetted by humans, so the model learns from curated demonstrations rather than raw internet text.

For example, the dataset might contain examples such as:

Prompt:

Explain what a REST API is.

Response:

A REST API is an interface that allows systems to communicate
over HTTP using standard methods such as GET, POST, PUT, and DELETE.
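In a real SFT dataset, such a pair is typically stored as a structured chat record. A simplified sketch of what one record might look like (field names vary between providers):

sft_record = {
    "messages": [
        {"role": "user",
         "content": "Explain what a REST API is."},
        {"role": "assistant",
         "content": "A REST API is an interface that allows systems to "
                    "communicate over HTTP using standard methods such as "
                    "GET, POST, PUT, and DELETE."},
    ]
}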

In coding contexts, the examples might include bug fixes, explanations of algorithms, refactoring suggestions, or implementations of common programming tasks.

This stage teaches the model several important behaviours:

  • how to follow instructions
  • how to produce structured answers
  • how to provide explanations
  • how to solve typical developer problems

After supervised fine-tuning, the model starts behaving more like a helpful assistant rather than a random text generator.

Alignment & Safety

Even after supervised fine-tuning, models can still produce responses that are technically correct but not necessarily helpful, safe, or aligned with human expectations.

For example, a model might generate an answer that is factually plausible but misleading, overly verbose, or structured in a way that is difficult for users to understand. In other cases, the model may confidently produce incorrect information. These issues arise because the model has learned statistical patterns of language, not necessarily what humans consider a good answer.

To address this problem, modern LLMs undergo an additional training phase known as alignment. The goal of alignment is to shape the model’s behaviour so that it produces responses that better match human preferences — such as being helpful, clear, honest, and safe.

Instead of training the model purely on raw text, alignment training introduces preference signals that indicate which answers are better than others.

These signals can come from several sources. The most widely known approach is Reinforcement Learning from Human Feedback (RLHF), where human reviewers evaluate and rank different model responses. Another approach is AI-based feedback, such as Anthropic’s Constitutional AI, where models critique and improve their own responses according to predefined principles.

The high-level structure of this process is shown below.

graph TD
    classDef process fill:#fff3e0,stroke:#e65100,stroke-width:2px;
    classDef model fill:#f3e5f5,stroke:#4a148c,stroke-width:2.5px;
    classDef feedback fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px;

    M2(Assistant Model):::model --> OPT[[Optimization Engine]]:::process

    subgraph Feedback_Loops [Preference Evaluation]
        HF[Human Feedback / RLHF]:::feedback
        CAI[Constitutional AI / RLAIF]:::feedback
        HF & CAI --> RM[Reward Model]
    end

    RM --> OPT
    
    subgraph Strategy [Alignment Methods]
        OPT --> DPO[Direct Preference Optimization]
        OPT --> PPO[Proximal Policy Optimization]
    end

    Strategy --> M3(Final Production Model):::model

    style Feedback_Loops fill:#f1f8e9,stroke:#2e7d32
    style Strategy fill:#fafafa,stroke:#999

In simplified terms, the alignment process typically works like this:

  1. The model generates multiple possible answers to the same prompt.
  2. Human reviewers rank those answers from best to worst based on quality, usefulness, and safety.
  3. A separate model, often called a reward model, is trained to predict those human preferences.
  4. The language model is then further optimised so that it produces responses that score highly according to that reward model.

This optimisation step can be performed using techniques such as Proximal Policy Optimisation (PPO) or newer approaches like Direct Preference Optimisation (DPO).
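To make DPO less abstract, here is a minimal sketch of its loss function in PyTorch. The variable names are illustrative, and real training code also handles batching, masking, and the forward passes that produce these log-probabilities:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward: how much more the policy favours an answer
    # than the frozen reference model does
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximise the gap between preferred and rejected responses
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()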

Over time, this process nudges the model toward producing responses that humans consider better. It improves qualities such as helpfulness, clarity, reasoning structure, and safety.

Some organisations also experiment with alternative approaches. For example, Anthropic's Constitutional AI introduces a set of guiding principles that the model uses to critique and revise its own responses. Instead of relying entirely on human reviewers, the model learns to evaluate its answers against a written “constitution” of rules and improve them automatically.

Popular LLM families used today

After these stages — pre-training, instruction tuning, and alignment — the result is the type of model we interact with today: an aligned assistant capable of answering questions, generating code, and reasoning through complex problems.

Today, several companies develop large language models, each with its own model family and design philosophy. While the underlying training techniques are broadly similar, the models can differ in reasoning ability, context length, cost, and optimisation for specific tasks such as coding.

Some of the most widely used LLM families include:

OpenAI GPT models

OpenAI’s GPT series is one of the most widely adopted model families. These models power tools such as ChatGPT and many developer APIs. Some variants are specifically optimised for reasoning and software engineering tasks, such as the Codex models used in OpenAI’s coding tools.

Modern GPT models are particularly strong at multi-step reasoning, code generation, and tool usage within agent systems.

Anthropic Claude models

Anthropic develops the Claude model family, including variants such as Haiku (fast and inexpensive), Sonnet (balanced performance and cost), and Opus (the most capable but also the most expensive).

Claude models are widely used in developer tools and coding agents because of their strong instruction-following behaviour and large context windows, which allow them to process large codebases in a single prompt.

Google Gemini models

Google’s Gemini models are designed as multimodal systems capable of working with text, images, code, and other types of input. They are integrated into Google’s ecosystem and are often optimised for large-scale reasoning tasks and enterprise workloads.

Gemini models are also popular because they support multimodal inputs (text, images, and code) and offer competitive pricing through Google’s developer APIs.

Open-weight models

In addition to proprietary systems, there are also open or open-weight models such as Llama, Mistral, and others. These models can be run locally or hosted by third-party providers, making them attractive for teams that require more control over infrastructure or data privacy.

Despite these differences, it is useful to remember that all of these systems share the same core idea: they are large neural networks trained to predict tokens and refined through several training stages to behave like helpful assistants.

However, even the most advanced LLM is still fundamentally a reasoning engine that generates text. It does not have the ability to read files, run commands, or interact with external systems on its own.

To build systems that can perform real tasks in a development environment, we need to combine LLMs with tools and orchestration logic.

This is where AI agents enter the picture.

What is an AI agent?

At its core, an AI agent is not just a large language model chatting with you.

A useful mental model looks like this:

graph TD
    classDef core fill:#f3e5f5,stroke:#4a148c,stroke-width:2px,font-weight:bold;
    classDef component fill:#e1f5fe,stroke:#01579b,stroke-width:1px;
    classDef process fill:#fff3e0,stroke:#e65100,stroke-width:2px;

    USER[User Request] --> LLM[LLM<br/>Reasoning Engine]:::core

    LLM --> LOOP[Reasoning Loop<br/>Plan → Act → Observe]:::process

    LOOP --> TOOLS[Tools<br/>File system<br/>Terminal<br/>APIs<br/>Search]:::component
    LOOP --> ENV[Environment<br/>Codebase<br/>Internet<br/>Runtime]:::component

    TOOLS --> LOOP
    ENV --> LOOP

    LOOP --> RESULT[Final Response / Action]

Each piece plays a different role.

The LLM: the reasoning engine

The LLM is the reasoning engine that understands instructions, interprets context, and generates plans. Large language models are very good at analysing problems, summarising information, and proposing possible solutions.

However, on their own they can only produce text. A raw model can explain how to fix a bug or describe how to call an API, but it cannot actually perform those actions on its own. This is an important distinction, because many people casually use the words LLM and agent interchangeably, even though they are not the same thing.

Tools: giving the model hands

That is where tools come in. Tools are the capabilities that allow an agent to interact with the outside world.

These might include reading and editing files, running terminal commands, querying a database, calling APIs, controlling a browser, or executing a test suite. When an agent decides to use a tool, it generates a structured request that the surrounding system executes, and the result is then returned to the model.

In effect, tools give the agent “hands” to accompany its reasoning. Without tools, the model can only talk about actions. With tools, it can start performing them.
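In application code, a tool is often little more than a named function the host exposes to the model. A minimal sketch (the function names here are illustrative, not any real agent's API):

import subprocess

def read_file(path: str) -> str:
    # File access, mediated by the host application rather than the model
    with open(path) as f:
        return f.read()

def run_tests() -> str:
    # Run the project's test suite and hand the output back as plain text
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.stdout + proc.stderr

# The registry the agent runtime consults when the model requests a tool
TOOLS = {"read_file": read_file, "run_tests": run_tests}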

The environment: the world the agent works in

The environment is the world in which the agent operates. In software engineering this might be a code repository, a local development environment, a CI pipeline, or a running application.

The environment provides state and feedback. When the agent modifies a file, runs a command, or calls an API, the environment changes. Those changes then become new information that the agent must reason about.

This is what makes agent behaviour dynamic. The agent is not working in a vacuum. It is operating inside a system that reacts to its actions.

The reasoning loop: what makes it agentic

flowchart LR
    classDef step fill:#fff3e0,stroke:#e65100,stroke-width:2px;
    classDef action fill:#e1f5fe,stroke:#01579b,stroke-width:1px;
    classDef result fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px;

    R[Reason<br/>Analyse problem<br/>Plan next step]:::step
    A[Act<br/>Use tool<br/>Call function<br/>Run command]:::action
    O[Observe<br/>Read output<br/>Update context]:::result

    R --> A --> O --> R

Many modern agent systems follow a pattern known as the ReAct loop (Reason + Act).

Instead of producing a single answer, the system repeatedly goes through three steps:

  1. Reason – analyse the current state and decide what to do next
  2. Act – execute an action using a tool (for example reading a file or running a command)
  3. Observe – analyse the result and update the plan

The loop continues until the task is completed or the agent determines that it needs human input.

This iterative reasoning process is what allows agents to handle complex tasks that require multiple steps and feedback from the environment.
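Stripped to its essentials, the loop is small. A sketch assuming a hypothetical llm client and the TOOLS registry from the previous section; real agents add permissions, step budgets, and error handling:

def run_agent(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm.generate(messages, tools=TOOLS)    # Reason: plan the next step
        if reply.tool_call is None:
            return reply.text                          # No tool needed: task complete
        tool = TOOLS[reply.tool_call.name]             # Act: host executes the tool
        result = tool(**reply.tool_call.args)
        messages.append({"role": "tool",
                         "content": str(result)})      # Observe: feed result back
    return "Step limit reached - handing control back to the human."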

Realistic coding-agent loop (tool execution)

The diagram below shows what actually happens inside coding agents such as Cursor, Codex, and Claude Code.

flowchart TD
    classDef model fill:#f3e5f5,stroke:#4a148c,stroke-width:2px;
    classDef tool fill:#e1f5fe,stroke:#01579b,stroke-width:1px;
    classDef system fill:#fff3e0,stroke:#e65100,stroke-width:2px;

    USER[User Task] --> LLM[LLM Reasoning Engine]:::model

    LLM --> PLAN[Plan next action]:::system

    PLAN --> TOOLCALL[Generate Tool Call<br/>JSON Function Request]:::system

    TOOLCALL --> TOOL[Execute Tool<br/>read_file / run_tests / search_repo]:::tool

    TOOL --> RESULT[Tool Result / Output]

    RESULT --> LLM

    LLM --> DONE{Task complete?}

    DONE -->|No| PLAN
    DONE -->|Yes| FINAL[Return final solution]

A simple example: fixing a bug

A coding agent fixing a bug might start by searching the repository for the failing function. It could then read the relevant files to understand the implementation, generate a patch that addresses the issue, and run the test suite to verify the change.

If the tests still fail, the agent analyses the new output, adjusts the patch, and tries again. The process continues until the problem is resolved or the agent determines that it needs human input.

Instead of answering once, the agent continuously reasons and acts until the task is complete.

This simple combination of reasoning, tools, environment, and iteration is what makes modern AI agents feel fundamentally different from traditional chat-based interactions.

How Function Calling Powers AI Agents

One of the most important ideas to understand in modern AI agents is function calling.

This is the point where many developers start feeling that the system is doing something magical. You ask the agent to investigate a codebase, spawn a subagent, search the web, or run a command, and somehow it appears to “decide on its own” to call the right tool.

But under the hood, the mechanism is much less mysterious.

The model does not literally execute a tool by itself. It does not reach into your machine, invoke a shell, or directly contact an external service. What it actually does is generate a structured request saying, in effect:

Given the tools you told me are available, this is the one that should be run next, and these are the arguments it should receive.

That request is then executed by the surrounding application, and the result is returned back to the model as fresh context. This is the same general pattern described in Claude’s tool-use documentation: the developer supplies tool definitions, Claude decides whether a tool is useful, emits a properly formatted tool-use request, and the client executes it before returning the result.

The key demystification

A useful way to say this very plainly is:

The LLM does not “call a function” in the traditional software sense. It generates tokens that represent a request for the host application to call a function.

From the model’s point of view, tool invocation is still just token generation. The same neural network that writes prose or code is also capable of producing a structured object that matches a tool schema. Claude’s tooling docs make this explicit: tools are defined with names, descriptions, and input schemas, and the model decides when and how to use them.

Why the model knows how to do this

This also explains why function calling should not be treated as some mysterious emergent superpower. The model can do this because it was trained to do it.

Earlier in this article, we already covered pre-training, instruction tuning, and alignment. Function calling fits naturally into that story. During post-training, model creators teach the assistant that when certain kinds of tasks appear and tools are available, it should emit a structured request in the required format instead of answering only with plain text. In other words, the model learns that generating a JSON-like tool request is often the preferred continuation. Claude even supports strict schema conformance for tool inputs, which shows how strongly modern systems optimise for predictable, machine-readable tool calls.

So when a model “reads” a function description and appears to understand when to use it, what is really happening is:

  1. The model receives tool definitions in the prompt or system context
  2. The model recognises that one of those tools could help solve the task
  3. The model generates a structured request matching that tool’s schema
  4. The host application executes the request
  5. The result is sent back to the model
  6. The loop continues until the task is done

A mental model: function calling as structured next-token prediction

The mental model is simple:

Function calling is still next-token prediction, except the next tokens happen to form a structured tool request rather than ordinary prose.

That is why it feels both impressive and surprisingly mundane at the same time. The impressive part is that the model has learned when a tool is appropriate. The mundane part is that, mechanically, it is still just producing the next most likely tokens.

What the flow looks like in practice

In other words, an AI agent is an application controlled by an LLM that can invoke tools, execute actions, and iteratively respond to results.

flowchart TD
    U[User request] --> M[LLM]
    M --> T[Structured tool request]
    T --> H[Host application executes tool]
    H --> R[Tool result]
    R --> M
    M --> F[Final answer or next tool request]

And here is the same idea in a more explicit agent loop:

flowchart LR
    A[Reason] --> B[Choose tool]
    B --> C[Generate JSON request]
    C --> D[Client executes tool]
    D --> E[Return result to model]
    E --> A

This is the basic engine behind coding agents, browsing agents, IDE assistants, and many “AI agent” products. The surrounding system may be sophisticated, but the core loop is still: reason, emit request, execute externally, observe result, repeat.

A concrete example of tool calling

Suppose the application exposes a tool like this:

{
  "name": "search_codebase",
  "description": "Search the repository for files or symbols relevant to the user's task",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": { "type": "string" },
      "path": { "type": "string" }
    },
    "required": ["query"]
  }
}

If the user says:

Find where authentication tokens are validated.

the model might respond not with prose, but with a tool request like this:

{
  "type": "tool_use",
  "name": "search_codebase",
  "input": {
    "query": "authentication token validation",
    "path": "."
  }
}

The important thing to notice is that the model has not searched the codebase itself. It has only generated the request. The client or agent runtime performs the actual search, then sends the result back. Claude’s tool-use docs describe exactly this separation between model decision and client execution.
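That separation is visible in client code. Here is a simplified sketch using the Anthropic Python SDK, where SEARCH_CODEBASE_TOOL holds the JSON definition above and run_search is a hypothetical local implementation:

import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user",
             "content": "Find where authentication tokens are validated."}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # any tool-capable Claude model
    max_tokens=1024,
    tools=[SEARCH_CODEBASE_TOOL],       # the schema shown above
    messages=messages,
)

for block in response.content:
    if block.type == "tool_use":
        result = run_search(**block.input)   # the client does the real work
        messages.append({"role": "assistant", "content": response.content})
        messages.append({
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": block.id,     # ties the result to the request
                "content": result,
            }],
        })
        # The next messages.create() call lets the model continue with the result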

Subagents in Claude Code: function calling at a higher level

Claude Code provides a particularly good real-world example because it exposes the same pattern at a higher level of abstraction.

In Claude Code, a subagent is effectively a specialised worker with its own prompt, model choice, tool permissions, and optional MCP servers. Anthropic’s documentation says subagents are useful for preserving context, enforcing constraints, specialising behaviour, and controlling costs. Claude Code includes built-in subagents such as Explore, Plan, and General-purpose. The Explore subagent is read-only and optimised for codebase discovery; Plan is used during plan mode to gather context; and subagents cannot recursively spawn other subagents.

That makes subagents a perfect example for explaining function calling, because from the main agent’s perspective, delegating to a subagent is just another tool-like action.

The parent agent is not “becoming” multiple minds. Instead, it emits a structured request that says, in effect:

Start another agent instance with this role, this model, these tool restrictions, and this prompt. Then give me the result.

Example: a subagent definition in Claude Code

Claude Code subagents are defined in Markdown files with YAML frontmatter. Anthropic documents fields such as description, tools, model, and mcpServers, plus allowlists for which subagents another agent may spawn.

A simplified example might look like this:

---
name: api-researcher
description: Investigate API usage patterns and return a concise summary
tools: Read, Grep, Glob, Bash
disallowedTools: Write, Edit
model: haiku
---
You are a read-only research agent.
Search the codebase for API client usage, authentication flow, and retry logic.
Return only the findings that matter for implementation.

The crucial point is that this definition becomes part of the environment the main agent can work with. Claude then uses the subagent’s description to decide when delegation is appropriate.

What a subagent tool call conceptually looks like

Claude Code does not show users raw API JSON in the interface, but conceptually the parent model may generate a tool-use request like this:

{
  "type": "tool_use",
  "name": "Agent",
  "input": {
    "agent_type": "api-researcher",
    "prompt": "Inspect the repository and explain how external API retries are implemented.",
    "model": "haiku"
  }
}

It is important to be precise about what happens next. The model does not execute this JSON itself. It only generates a structured request, token by token, in the same way it generates ordinary text. The surrounding runtime detects that this output is a valid tool invocation, parses it, and executes the actual action on the model’s behalf.

In this case, the runtime starts a subagent configured with the requested role, prompt, and model. That subagent then runs its own reasoning loop, can use its own allowed tools, and eventually returns a result to the parent agent. The parent does not directly “become” the subagent. Instead, it delegates work through a structured tool request and then continues reasoning once the result comes back.

This is why subagents are such a useful mental model for understanding function calling. They show that a “function” does not have to be a simple utility like get_weather() or search_web(). A tool can represent a much higher-level capability, including launching another specialised agent instance.

How to talk about and assess an AI agent

One of the easiest ways to have confused conversations about AI coding tools is to treat the agent and the model as if they were the same thing.

They are not.

An AI agent depends on an LLM to function, because the LLM is the part that interprets context, decides what to do next, and generates the tool-use requests that drive the agent loop. But the agent is still a larger system than the model alone. It also includes the surrounding runtime, tool access, permissions, prompts, memory, and execution logic. That means when we evaluate an agent, we are usually evaluating both the orchestration layer and the model being used underneath it.

Because of that, it is much more useful to say:

“This agent works well for us with this model configuration for these tasks”

than to say only:

“This agent is good” or “this agent is bad”.

That extra detail matters because the same agent can behave very differently depending on which model is driving it.

Always mention the model when discussing an agent

If we want discussions about agents to be meaningful, we should get into the habit of reporting which LLM we used when making a judgement.

For example, if someone says that Claude Code works very well for their workflow, the useful follow-up is not just "which tool?" but "which model, for which kind of task?"

A much better evaluation sounds like this:

  • we use Opus for planning and harder architectural reasoning
  • we use Sonnet for most implementation work
  • we use Haiku for lightweight exploration, searching for references, or scanning a repository

That kind of description is far more informative than a generic opinion about the tool, because it separates the workflow design from the branding.

The same principle applies to Codex-style tools

The same issue appears in tools built around OpenAI models. OpenAI’s current documentation for GPT-5-family reasoning models and Codex models shows configurable reasoning effort levels such as low, medium, high, and in some cases xhigh, specifically for coding and agentic tasks. OpenAI’s own Codex prompting guide recommends medium as a good default interactive coding setting, while high or xhigh are positioned for harder tasks and longer autonomous work.

So if someone says that Codex works brilliantly, it is much more useful if they specify something like:

  • we use GPT-5.4 high for planning
  • we use GPT-5.4 medium for implementation
  • we use GPT-5.4 low for lightweight iterations

That makes the claim testable and much easier to compare with someone else’s setup.

Why agent evaluation is so difficult

This is also why evaluating products such as Cursor, Copilot, Claude Code, Codex, or OpenCode is much harder than it first appears.

In many of these systems, the “agent” is really a flexible shell around model choice and runtime behaviour. If users can switch between multiple underlying models, or even bring their own provider, then two people claiming to use “the same agent” may in reality be using very different systems. Claude Code, for example, supports multiple model selections and subagent-specific model choices. OpenAI’s coding stack also exposes configurable reasoning levels that materially change behaviour.

That means broad statements such as:

  • “Cursor is better than Copilot”
  • “Codex is better than Claude Code”
  • “this agent is amazing”
  • “that agent is unreliable”

are often missing the most important variable: which model, with which reasoning or autonomy settings, for which task? Without that information, many agent comparisons are only partially meaningful.

TL;DR: The Developer’s Mental Model

1. The LLM is a Reasoning Engine: It is a statistical model trained to predict the next token. It doesn't "know" things; it predicts how a helpful assistant would respond based on its training.

2. Tools are the "Hands": An LLM can only output text. An Agent becomes "agentic" when the host application gives it tools (filesystem access, terminal, APIs) and a way to describe how to use them via function calling.

3. The Loop is the Secret Sauce: Agents don't just answer once. They follow a Reason → Act → Observe loop. They make a plan, try a command, see it fail, and then try a different approach.

4. Function Calling is just JSON: When an agent "runs a command," it is actually just generating a structured JSON string. The runtime (your IDE or CLI) is what actually does the heavy lifting of executing that command.

If this was useful, get the next one by email.

Continue reading

A few adjacent pieces worth reading next.

  • Building RAG with Gemini File Search (Nov 09, 2025)
  • Testing LLM-based Systems (Nov 01, 2025)
  • Playwright CLI, Skills and Isolated Agentic Testing (Mar 02, 2026)
