The Current State of Agentic Software Development

Feb 14, 2026

AI # Agentic Coding # Productivity # Testing

Recently I migrated my blog awesome-testing.com to a more modern stack, moving from static Jekyll pages to Next.js with Convex as the backend. I rebuilt everything from scratch, made the site look more modern (at least in my opinion, hopefully yours too), and let Codex handle the vast majority of implementation work. I did the same for my wife’s website gosiaradzyminska.pl. Both projects were completed in a matter of hours. That still sounds absurd to me, but it is true.

At work I primarily use Cursor and Codex. At home, I experiment more freely. After more than three years of hands-on use across professional and personal contexts, I feel ready to describe what this moment actually looks like.

"Something" has undeniably shifted. But the shift is more complex than the hype cycle suggests.

The Breakthrough We Can No Longer Deny

The inflection point happened in December 2025, when models such as Claude Opus 4.5 became affordable enough to use seriously and GPT-5.2 Codex matured into a credible coding agent. The cost barrier dropped (Opus 4.0 was extremely expensive) while capability rose, which suddenly lowered the entry barrier for serious experimentation. Arguably, the rather complex workflow described in AI Vibe Coding Notes from the Basement (gitingestion; planning with the most capable model via the web to cut costs; phased implementation) is no longer necessary. It still helps, but we can often achieve the same results simply by stating what we want and collaboratively working out the best possible implementation plan with the AI agent.

Since December 2025, Opus 4.6 and GPT-5.3 Codex have become available. While they are only slightly better than their predecessors, they show that the quality of code generation is still on the rise. I've seen a few people claiming that improvement has stalled, but the evidence is not there.

The first time you allow an agent to meaningfully traverse your codebase, propose architectural adjustments, refactor across modules, and write tests without constant micromanagement, you realise this is no longer autocomplete. It is collaborative pair programming. I'm under the impression that LLMs can generate higher-quality code than I would have been able to write myself. Note that the capability to do so doesn't automatically mean that they will do so all the time. Collaborative planning, intelligent feedback loops, and other techniques are still required to achieve high-quality outcomes.

If you don't believe me, try these prompts on your website (connect Chrome DevTools MCP first):

Using Chrome DevTools MCP, perform a performance audit of awesome-testing.com. List all issues and suggest improvements. Implement the top 3 most impactful suggestions.

Using Chrome DevTools MCP, perform an SEO audit of awesome-testing.com. List all issues and suggest improvements. Implement the top 3 most impactful suggestions.

Using Chrome DevTools MCP, perform an accessibility audit of awesome-testing.com. List all issues and suggest improvements. Implement the top 3 most impactful suggestions.

Codex can also push back on changes it believes are not good.

Codex pushing back on changes

At this point, you may ask yourself: "If this is a revolution, why can't I see it in my daily work?" The reality is more nuanced.

Between Doom and Hype

It is worth maintaining a sense of proportion. Alongside the genuine progress, we are once again seeing a wave of breathless AI hype. One of the most visible voices recently has been Matt Schumer, whose post “Something Big is Happening” reignited the familiar narrative that everything is about to change overnight. Matt is currently wearing the FOMO crown, although I don't expect him to wear it for long. The competition in this space is fierce.

To be fair, the AI discussion has always oscillated between two extremes:

  • Dream sellers, who declare “it’s over” every time a new model is released (someone really should make a meme of this).
  • Hardened sceptics, who insist nothing meaningful has changed.

Interestingly, many of the latter group have grown quieter. Some of them, I suspect, simply switched to modern agents from older tools like GitHub Copilot (which, to be honest, struggles a lot; the only real improvements in Copilot come from the more capable models powering it). The contrast can produce a genuine “wow” effect.

However, we need to acknowledge an important structural bias: Twitter (or X) is dominated by early adopters. These are people who enjoy bleeding-edge tools, thrive on novelty, and often benefit (financially or reputationally) from rapid technological shifts. When technology accelerates, they have more to write about, more to demonstrate, more to monetise.

I include myself here. Recently I have posted more simply because there has been more to explore.

But beyond novelty lies daily reality.

YOLO mode vs corporate mode

A helpful way to understand the gap between online demos and corporate reality is through what some casually call “YOLO mode”: granting an AI agent full access to your machine and letting it operate with minimal friction.

For freelancers, indie hackers, and hobbyists building projects on their own laptops, this can be perfectly rational. The blast radius is limited. If something breaks, you fix it. If a credential leaks, you rotate it. If the agent deletes a directory, the consequences are annoying rather than catastrophic. In that setting, wide permissions translate directly into speed. The agent can install dependencies, run migrations, modify infrastructure, and refactor freely. Velocity increases dramatically because there is almost no procedural resistance.

Corporate environments operate under a completely different logic.

In an enterprise setting, an agent cannot simply execute whatever it proposes. Commands require explicit approval. Execution environments must be isolated. Sandboxes need to be provisioned and maintained. Permissions must be carefully scoped and whitelisted. Credentials must be stored and accessed securely, often through dedicated secret managers. Rotation procedures are formalised and sometimes audited. Regulatory and compliance requirements add further constraints.

Even something as mundane as rotating an API key illustrates the difference. In a personal side project, you can rotate a key instantly and move on. In a production system used by multiple teams or customers, key rotation can involve coordination across services, downtime considerations, and formal change management processes. What is trivial at home becomes procedural in the enterprise.

This is where the first serious friction appears.

Stakeholders consume the same online narratives as engineers. They see claims that “it’s over” and that productivity has multiplied overnight. Expectations rise accordingly. However, most public benchmarks and viral demonstrations are performed under idealised conditions: full agent autonomy, unrestricted access to credentials, no approval loops, and minimal governance constraints.

Enterprise reality is different. Humans remain firmly in the loop. Engineers must read every suggested command, inspect generated scripts (Codex tends to generate JavaScript or Python scripts), understand their impact, and sometimes even validate them with another model. Security implications cannot be waved away. Compliance teams cannot be bypassed. Risk cannot be ignored.

The romantic story of “start the task and go for coffee” rarely survives contact with corporate risk management.

This does not mean agentic tools lack impact in companies. It means the impact is moderated. The larger the organisation and the greater the risk exposure, the more friction is introduced into the workflow. As a result, the visible productivity boost is often smaller than what we observe in personal experiments.

An interesting dynamic emerges: smaller companies and early-stage startups can extract disproportionate advantage from these tools, while large enterprises proceed more cautiously under heavier procedural weight. This asymmetry is rarely emphasised in online discussions, yet it fundamentally shapes how the revolution unfolds in practice.

The Emerging Productivity Divide

At the same time, a different kind of shift is becoming increasingly visible: a widening productivity gap in the labour market.

There is now a tangible difference between engineers who deeply integrate state-of-the-art agents (Claude Code, Codex, Cursor) into their daily workflow, and those who do not use them at all. In some cases, the latter group is restricted by company policy. In others, it is a matter of disinterest or scepticism.

It is important to recognise that many of us operate within a bubble. In certain circles, advanced AI tooling feels ubiquitous. In reality, a significant portion of professionals either lacks access or has not yet engaged seriously with these systems.

I have previously argued in Learning AI that engineers should cultivate three complementary competencies: a conceptual understanding of how AI systems function, practical fluency with top-tier agents, and foundational knowledge of how to build AI-powered products. These are not optional curiosities; they are becoming structural components of professional competence.

This is not about fear of missing out. It is about long-term positioning and employability.

As early as 2023, I suggested that employers would begin to expect AI literacy. That expectation is not receding; if anything, it is strengthening. Ignoring this domain, or dismissing it as another overhyped technological fad, seems strategically risky. The downside of disengagement is asymmetric: those who adapt gain leverage, while those who abstain may find themselves at a disadvantage.

Importantly, the barrier to entry is not insurmountable. Many tools, like KiloCode or OpenCode, provide generous free tiers. Others are inexpensive. In my own case, private access to Codex costs roughly $20 per month. For an IT professional, this is not an extraordinary expense; it is often less than common entertainment subscriptions.

Serious experimentation does not require reckless investment. It requires curiosity, discipline, and time.

For those who cannot yet use these tools at work, the appropriate reaction is not panic. It is reflection. The market is shifting. Remaining deliberately detached from that shift may be comfortable in the short term, but over time it risks becoming a strategic liability.

Balanced scepticism remains healthy. Complete disengagement does not. It's in your best interest to be seen as an AI pioneer, not a sceptic.

The Illusion of Engineer Replacement

Every wave of technological acceleration revives the same question: are engineers about to be replaced?

Recently, the narrative has intensified. We see demonstrations of non-technical founders shipping products in days. Designers deploy full-stack applications (via tools like v0). Solo builders operate what previously required small teams. From the outside, it can look as though deep technical expertise has suddenly become optional.

There is some truth in this perception. The floor has risen dramatically. A motivated non-engineer can now scaffold an application, deploy it, connect a database, integrate payments, and iterate, all with the assistance of capable agents. That level of speed and completeness was not realistically achievable a few years ago.

What often goes unnoticed in viral examples is that we are evaluating artefacts, not systems.

Most of these impressive demonstrations showcase a working application, a polished UI, perhaps even a deployed URL. What they rarely include is the accompanying codebase. We do not see the repository structure, the dependency graph, the configuration files, the test coverage, or the CI pipeline (if one exists at all). We are invited to admire the outcome, but we are not given the materials to inspect the foundations.

Without access to the underlying code, we have no meaningful way to evaluate non-functional qualities. Is the system maintainable, or is it a dense accumulation of generated fragments that only “happen” to work together? Are abstractions coherent and intentional, or accidental side effects of iterative prompting? Is the security model robust, or dependent on defaults that may not hold under scrutiny? How does the system behave under load? Are there hidden performance cliffs? Can it be extended safely, or will each new feature increase fragility?

Generated systems often perform impressively along the happy path. The real test emerges under load, partial failure, concurrency issues, dependency outages, or changing compliance requirements. These are precisely the conditions where deep engineering experience matters most. AI can raise walls quickly; it does not automatically guarantee that the pillars are load-bearing.

This is not an argument against agentic development. It is an argument for architectural literacy.

Architecture still matters. System boundaries still matter. Data modelling still matters. Cost trade-offs still matter. Deployment strategy, caching layers, observability, failure modes, and security posture have not vanished. AI can propose solutions, but it does not assume responsibility for their long-term consequences.

Strong architectural thinking remains a differentiator. Knowing why to choose one pattern over another, when to introduce a queue, when to split a service, when to resist premature abstraction, and how to keep operational costs under control are not trivial capabilities. They are developed through experience, exposure to failure, and sustained engagement with real systems under pressure.

Product Engineering

Architecture does not exist in isolation.

The moment you ask why a system should be built a certain way, you are no longer operating purely at the technical level. You are operating at the intersection of business, user needs, and long-term strategy. This is where product engineering becomes central.

The engineers who will thrive are those who understand not only how to implement a feature, but why it should exist at all. They participate in discovery. They challenge requirements. They consider user impact, cost impact, operational burden, and long-term maintainability. They think in terms of trade-offs rather than tickets.

Writing code has never been the hardest part of software development. Deciding what code should be written (and what should not) has always been harder. That distinction is now becoming economically visible.

Engineers whose role is limited to executing well-described tasks may feel increasing pressure. If the job consists purely of translating detailed specifications into syntax, agentic systems are remarkably capable of performing exactly that function. The narrower and more mechanical the responsibility, the easier it is to automate.

Several years ago, many organisations had developers who preferred precisely defined tasks and avoided business context. They wanted clean tickets, clear acceptance criteria, and minimal ambiguity. That model is becoming fragile. The comparative advantage of humans is shifting away from mechanical implementation and towards judgement, synthesis, and ambiguity management.

This does not mean senior engineers are safe by default. It means that depth combined with breadth (often described as T-shaped skills) becomes even more valuable. Broad system awareness paired with deep expertise in at least one domain creates resilience.

Fear-driven narratives miss the point. AI does not eliminate the need for engineers. It changes what engineering must encompass. The craft expands upwards: towards architecture, towards product thinking, towards cost-awareness, towards quality strategy and security discipline. At the same time, it compresses the lower layers of purely mechanical work.

If your professional identity is tightly coupled to typing code, the shift may feel threatening. If it is tied to designing systems, solving business problems, evaluating trade-offs, and safeguarding quality under constraints, the shift looks more like leverage.

There is no need for panic. But there is a need for clarity.

The era of being paid primarily to produce syntax is fading. The era of being paid to make sound technical and product decisions, while critically evaluating the foundations beneath rapidly generated systems, is intensifying.

Patterns that actually stuck: plan first, create feedback loops

By February 2026, a lot of “AI coding advice” has already gone stale. But two patterns survived the churn and have broken into something close to mainstream practice:

  • Planning before implementation
  • Deliberate feedback loops for the agent

What’s interesting is that both patterns increase the value of engineering judgement. They don’t remove it.

Plan first

The most consistent mistake I see in agentic workflows is not “bad prompting”, but premature execution. Jumping straight into implementation still works for small changes, yet it breaks down the moment the codebase becomes non-trivial, the domain is subtle, or the blast radius matters. Mature agentic development treats planning as an actual engineering phase: a short burst of research and design that narrows ambiguity before any files are touched.

Anthropic’s own guidance for Claude Code effectively frames this as giving the model clear success criteria and a way to verify its work, but the prerequisite is that the task is well-shaped in the first place: what “done” means must be explicit, and the constraints must be stated up front. Otherwise the agent will happily fill the vacuum with assumptions that look plausible, compile in its head, and fail in yours.

OpenAI’s Codex prompting guide points in the same direction, even at the level of mechanics: read first, form a plan, then act. When you force the agent to explore and outline before it writes, you’re not slowing it down, you’re preventing it from spending your budget on the wrong branch of the solution tree. This is the real meaning of “agentic development punishes ambiguity”. It’s not that models cannot handle fuzzy inputs. It’s that fuzziness explodes the space of reasonable implementations, and the agent will often commit to one confidently. Planning is how you compress that space into a smaller, relevant slice that matches your intent.

In practice, I increasingly prompt agents as designers rather than typists. I ask them to propose a couple of approaches, name trade-offs, and choose one with justification; or to map which files they would touch and what tests they would add, while explicitly forbidding code changes until the plan is agreed. This “go slow to go fast” pattern shows up repeatedly in the culture around Claude Code: explore, plan, code, then commit, with the claim that skipping the first two steps is exactly what causes agents to jump into brittle edits and local maxima.
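
For example, a planning prompt (my own wording, adjust it to your repository and task) can be as simple as:

Before writing any code, explore the repository and propose two implementation approaches for this feature. For each approach, list the files you would touch, the tests you would add, and the main trade-offs. Do not modify any files until I approve one of the plans.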

Planning also forces a conversation: stack choice is no longer only about human ergonomics. It is also about agent ergonomics. Tools like Cursor explicitly recommend shaping tasks so the agent can iterate against tests and keep changes coherent; the unstated implication is that repos with strong conventions, predictable tooling, and fast local workflows are simply easier for agents to operate in. A TypeScript/Next.js codebase with a well-lit path (typechecking, linting, a stable test runner, a reproducible dev environment) behaves like guardrails. A messy repo behaves like a swamp. When an agent is doing most of the keystrokes, that difference is not aesthetic; it is throughput.

And this is the quiet rebuttal to the “non-technical people will replace engineers” story. The more capable agents become, the more valuable it is to have someone who can articulate goals, constraints, and trade-offs clearly. Someone who can smell the future maintenance costs, performance cliffs, security edges, and operational drag before they ship. Agents can generate options, but they do not own the consequences. That ownership still sits with the engineer who planned the work.

Deliberate feedback loops for the agent

If planning is how you aim the agent, feedback loops are how you keep it honest.

Modern coding agents are, in a very particular way, optimistic: they produce output that often looks coherent and well-intentioned, yet subtly disagrees with the realities of your repository. They might invent an API, misread a type, assume a script exists, or implement the happy path while ignoring the edge that your system falls off in production. The cure is not more eloquent prose in the prompt; it is instrumentation. You give the agent fast, local signals—type checks, linters, unit tests, build steps, a small number of high-value integration checks—and you force it to iterate until those signals turn green.

This is not just folk wisdom. Claude Code’s own best-practices documentation says the highest-leverage thing you can do is give it a way to verify its work: run tests, validate outputs, compare screenshots, confirm behaviour instead of merely describing it. Without that, you become the only feedback loop, and every mistake demands your attention. Addy Osmani makes the same point from the “self-improving agents” angle: passing tests become the proxy for “done”, and the loop is what turns naive generation into an engineering workflow with reliability properties.

Cursor’s agent guidance is strikingly blunt here: ask the agent to write code that passes the tests, tell it not to modify the tests, and keep iterating until everything passes. It sounds almost too simple, but it captures the key dynamic: prompting narrows the solution space once, and feedback narrows it again with objective constraints. Once you make “must typecheck” and “must pass tests” non-negotiable, you are no longer selecting for eloquence, you are selecting for correctness within your project’s rules. The agent stops being a prose engine and becomes a loop-driven optimiser. This correctness can be further enhanced by custom AGENTS.md steering.
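
To make this concrete, here is a minimal sketch of such a loop in TypeScript. It assumes a repository where tsc, ESLint, and Vitest are already installed (those tool choices are my assumption, not a requirement); the point is simply that the agent gets one command to run, and keeps iterating until it exits cleanly.

  // verify.ts: a minimal local feedback harness (a sketch; adjust the commands
  // to whatever your repository actually uses).
  import { execSync } from "node:child_process";

  // Each step is a cheap, deterministic signal the agent must turn green
  // before a change counts as done.
  const steps: Array<[string, string]> = [
    ["typecheck", "npx tsc --noEmit"],
    ["lint", "npx eslint ."],
    ["unit tests", "npx vitest run"],
  ];

  for (const [name, command] of steps) {
    console.log(`\nRunning ${name}: ${command}`);
    try {
      // Inherit stdio so the agent (or you) sees the raw failure output.
      execSync(command, { stdio: "inherit" });
    } catch {
      console.error(`${name} failed: fix it and re-run before touching anything else.`);
      process.exit(1);
    }
  }

  console.log("\nAll signals green.");

Telling the agent "run this script and keep fixing the code, without modifying the tests, until it passes" is usually all the steering the loop needs.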

This is where test engineers (and anyone with a quality mindset) suddenly feel “future-proof” in a very concrete way. The instincts you build through years of testing (designing cheap regression detection, prioritising a minimal but high-signal suite, making feedback fast and deterministic, choosing what belongs locally versus what belongs in CI) map almost perfectly onto agentic development. Agents do not get tired of repetition; they do not resent rerunning a typecheck ten times. They will happily grind through a loop as long as the loop exists and the signals are clear. That is why the shape of your test and tooling strategy matters more than it did when humans were the bottleneck.
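
As an illustration of what "high-signal" means here, below is a small, hypothetical Vitest example. The function under test, calculateInvoiceTotal, is invented purely for the sketch; what matters are the properties: fast, deterministic, no network, no real clock, so the agent can rerun it dozens of times without the loop becoming slow or flaky.

  // invoice.test.ts: a hypothetical example of a cheap, deterministic regression test.
  import { describe, expect, it } from "vitest";

  // Invented for illustration; in a real repo this would live in production code.
  function calculateInvoiceTotal(input: {
    lines: { net: number }[];
    discountRate: number;
    taxRate: number;
  }): number {
    const net = input.lines.reduce((sum, line) => sum + line.net, 0);
    return net * (1 - input.discountRate) * (1 + input.taxRate);
  }

  describe("calculateInvoiceTotal", () => {
    it("applies the discount before tax", () => {
      const total = calculateInvoiceTotal({
        lines: [{ net: 100 }, { net: 50 }],
        discountRate: 0.1, // 10% off the combined net amount
        taxRate: 0.23, // tax applied to the discounted amount
      });
      expect(total).toBeCloseTo(166.05); // (150 * 0.9) * 1.23
    });

    it("returns 0 for an empty invoice", () => {
      expect(
        calculateInvoiceTotal({ lines: [], discountRate: 0, taxRate: 0.23 })
      ).toBe(0);
    });
  });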

The loop is also where corporate reality bites hardest, because the very conditions that make feedback loops powerful are often the first things enterprises restrict. In an idealised demo, the agent can iterate freely: run lint, run type checks, run tests, tweak the patch, repeat, essentially brute-forcing its way to green signals. In many real organisations, that autonomy is deliberately throttled. Terminal execution is gated behind manual approvals, and “computer-use” behaviours are treated as an attack surface rather than a productivity feature. Cursor, for example, documents that it requires user approval before executing terminal commands by default, explicitly to protect against destructive operations.

Put together, planning and feedback loops quietly redefine what “being good at coding” means in 2026. The act of producing syntax is cheaper than ever; the expensive parts are judgement, validation, and risk management. The winners are not the people who can coax the prettiest diff out of an agent, but the people who can design the work before implementation, build fast verification harnesses, keep changes reviewable, and enforce quality gates that scale. In other words: use agents to generate code, but use engineering discipline to decide what should exist, where it should live, and how it proves it works.

Review and processes

Once you start shipping with agents, you quickly discover an uncomfortable truth: we didn’t remove work from software delivery, we moved it. The keystrokes got cheaper, the diffs got larger, and the thing that now costs real time (and real money) is not “writing code”, but deciding whether that code deserves to exist in your main branch.

This is why review becomes the central discipline of agentic development. When an agent can refactor a directory, touch ten files, and confidently tell you it “improved architecture”, the team’s ability to evaluate change becomes the safety rail. Not because engineers suddenly mistrust tools, but because the economics have changed: code production is abundant, while attention, responsibility, and risk tolerance are scarce.

The most useful review advice I’ve seen is also the least glamorous, and it predates this whole wave by years. Google’s internal review practices put it plainly: keep changes small, and separate refactorings from feature work, because reviewers can only really understand intent when the change is cohesive. They explicitly recommend splitting refactors into a separate changelist from behavioural changes for precisely that reason. In an agentic world, that guidance turns from “nice to have” into survival.

Agents are very good at doing two kinds of work at once. If you ask for a new feature, you often get a feature plus “clean-ups”, naming changes, dependency upgrades, and micro-refactors sprinkled throughout. On a personal project, that can feel delightful. In a team setting, it’s a trap: it makes review slower, it makes reasoning about risk harder, and it makes rollback less surgical. The discipline is to insist on shape. A business change should read like a business change. A refactor should read like a refactor. That separation is not bureaucracy; it is kindness to the future you who will debug production at 2 a.m.

This is also where the story “AI will write all the code, so engineers become optional” quietly falls apart. The hard part is not producing a patch; it is understanding what the patch means in context: what behaviour changed, what assumptions were introduced, what risk moved around, what security surface expanded, what future maintenance burden was created. OpenAI’s Codex guidance even frames “review” as a posture: prioritise bugs, risks, regressions, and missing tests over summaries and vibes. That’s exactly right. The most dangerous output from an agent is not obviously broken code. It’s a clean-looking diff that subtly changes behaviour and feels “reasonable” enough to slip through.

In practice, this pushes teams towards stronger process hygiene, not weaker. Not in the sense of heavyweight ceremonies, but in the sense of sharper constraints: clearer definitions of “done”, stronger expectations around commit/PR structure, and a more intentional division of labour between generation and judgement. In many organisations, every engineer is expected to participate in review as a core responsibility, not a peripheral activity, because review is where quality and ownership actually live. When agents accelerate output, that expectation stops being cultural polish and becomes throughput-critical.

There is, however, an additional corporate twist that doesn’t show up in most online demos: process is often shaped by policy, not preference. Enterprises routinely enforce controls that make “agent productivity” look different to what you see on X. Tools may be limited to certain environments, or require explicit approvals to use in production contexts. Even large tech organisations can end up in internal debates about how much autonomy to grant, precisely because governance and reputational risk scale differently than they do for hobby projects. The practical consequence is that review isn’t just about code quality; it’s also where compliance reality shows up: who can run what, where the code is allowed to go, what data the tool is permitted to see, and what audit trail must exist.

All of this changes what “good process” feels like day-to-day. It is no longer enough to have a CI pipeline and a vague hope that reviewers will catch issues. You need review to be designed for an environment where diffs can be generated faster than humans can comfortably absorb them. That means making diffs easier to read (small, cohesive, purpose-driven), making intent explicit (why this change exists, what it deliberately avoids), and keeping a high bar for merging while resisting the temptation to treat generated code as inherently trustworthy. I believe companies should also invest in AI code review tools.

Agentic development, paradoxically, makes us more conservative in the places that matter. Not conservative about experimenting (most of us are experimenting more than ever), but conservative about what we merge, how we slice work, and how we communicate intent. The tooling is new. The fundamentals are not. And the teams that will look “most productive” in 2026 won’t be the ones producing the most code; they’ll be the ones whose review and delivery processes can absorb the flood without losing their ability to reason, to recover, and to stay safe.

Closing thoughts

My closing advice is simple and deliberately unglamorous. Treat agents as leverage, not magic. Use them aggressively in the parts of the workflow where speed is cheap, and be conservative where mistakes are expensive. Optimise for clarity: in planning, in code boundaries, and especially in review. The teams who thrive in 2026 won’t be the ones who “let the agent do everything”. They’ll be the ones who design a way of working where agents can do a lot without the organisation losing its ability to reason about what it is shipping.

And if you are reading this from the “late mover” side, don’t panic... but don’t opt out either. The distance between “occasionally uses AI” and “fully agentic” is large, yet the first step is small: make the tools part of your daily habits, even if only at home, and learn what good looks like. The hype cycle will continue. The doom cycle will continue. But beneath both, the direction is clear: software development is evolving towards humans focusing on what is worth building, while AI increasingly handles tactical implementation.

Strategy remains in human hands.
