Sustainable AI in Software Engineering

Jun 19, 2026 · 2850 words · 14 minute read

I’ve been writing software professionally for over 25 years, mastering platform engineering for 15 years, and I’m very good at what I do. I can say with confidence that during that time there has been no more revolutionary change to the practice of software engineering than LLM-based agentic AI. Anyone who tells you otherwise has their head in the sand.

For those 25+ years, Python has been my primary programming language. It has been among the handful of most popular programming languages every year for the last decade-plus. My expectation is that by the end of this decade, the two most common programming languages will be English and Chinese.

Agentic AI using LLMs is naturally adept at software engineering applications because, especially compared to natural language expression, the syntax & grammar of programming languages are astonishingly simplistic, the problem domains are limited in scope, and the logical reasoning is transparent & often deterministic. AI agents can generate, refactor, and debug code while comprehending the intent and context, and that code can be tested & mathematically verified to produce accurate results. This makes them naturally suited to the iterative, problem-solving nature of software engineering where tasks involve turning well-structured requirements into solutions, adapting based on feedback.

At the same time, I admit I have actively resisted drinking the Kool-Aid around AI. In the frenzy of hype around any new technology, I seek instead to separate the wheat from the chaff, which, depending on the level of hype, can be particularly tricky to do. And there has been so much hype around LLMs.

It did not help that the elites and leaders of the largest tech companies most loudly championing LLMs are, to put it quite charitably, not people whose judgment I trust or whose success I'm eager to contribute to. But it takes less than an hour of dabbling with Claude Code to quickly conclude that, regardless of the quality of the people foisting AI upon the world, unlike [many](https://theconversation.com/are-nfts-really-dead-and-buried-all-signs-point-to-yes-214145) [previous](https://www.bbc.com/news/world-us-canada-63685131) [hype](https://mitsloan.mit.edu/teaching-resources-library/sam-bankman-frieds-ftx) [cycles](https://newrepublic.com/article/160299/wework-book-review-billion-dollar-loser-rise-fall-adam-neumann), this one will create lasting seismic changes.

Whether those changes will impact software engineering for the better or for the worse is up to us, not them.

Looking past the hype 🔗

The hype around agentic AI in software focuses on the rate and volume of output that are made possible compared to the cost when it’s Claude or Cursor or Copilot generating the syntax. LinkedIn and Substack are peppered with promises for how agentic AI can upend roadmaps, obliterate backlogs, and all but eliminate time-to-market — with no strings attached and with minimal headcount. These testimonials, often from VC partners, tech executives, and startup founders who are highly invested in the success of the hype, promise that agentic AI can eliminate your backlog or that “the cost of software production is trending towards zero” because agentic AI is “like a senior software engineer whom you can just tell what to do, and it’ll do it.”

Underneath these pieces are an implied threat: go all-in on agentic AI or go out of business. Highly visible RIF events from companies like Microsoft, Salesforce, or Block/Square, claiming that the reductions were due to AI agents creating greater efficiencies (and they may not have been), add gravity and imminence to that threat.

To me this only underscores the need for sober analysis of the technology: what does it actually change, where can that be useful, what are the risks intrinsic to it, and how might those risks be practically contained. In my estimation, in addition to the industry revolutionizing and extreme business growth catalyzing potential of agentic AI for software engineering, there are three areas of risk which, left unmanaged, are so unsustainable as to actually threaten business continuity.

Risk #1: Technical debt 🔗

Code is a form of capital. Every line of code in a codebase is both asset and liability, and each line’s value as an asset and cost as a liability can vary wildly, as well as the dividends it can offer and technical debt interest it can accrue.

Agentic AI for software engineering fundamentally changes the economics. Code is now cheap to generate, whereas code generation had previously been the bottleneck. With Agentic AI, the bottleneck in engineering then shifts toward the burden of review, where human-in-the-loop is still a critical control to manage risk.

That risk quotient varies depending on where in an application stack the code lives.

Platform engineering, fundamentally, is the practice of organizing code & systems within an enterprise to manage risk: the more foundational code and infrastructure that underpin the reliability of the whole system must move slower and more deliberately, so that other portions of the stack can “move fast and break things” while containing the blast radius of failure and limiting risk exposure.

In the view of “code as capital”, platform layers built well are likely to be a long-lived asset, paying higher dividends. With AI, now more than ever, wise organizations must choose where to emphasize quality & craft, and where to create tolerance for liability in order to maximize velocity and experimentation. Otherwise the lowered cost of code generation will create more technical debt than marketable asset.

Agentic AI coding tools have well-documented shortcomings. An LLM by its very nature is non-deterministic, and its capabilities are constrained by a limited context window. The most common approach to fencing in that non-deterministic behavior involves adding more to the agent’s context to try to give it direction, but “context stuffing” has rapidly diminishing returns: the more directives you inject into the context, the less likely any of them will generate enough pressure on the model to comply. The limited context window gets filled with noise, not rigid controls.

Code review has been made even more challenging by the generation of code by agentic AI systems, due to these shortcomings. The psychological pitfall of vigilance decrement — “deterioration in the ability to remain vigilant for critical signals with time” — plagues such review, as most of the time the code looks “right”. Reviewers who have trained their whole career to review code generated by humans with known failure profiles are having to re-calibrate for the failure profiles of robots.

Those failure profiles can undermine the security posture of the system in subtle and not-so-subtle ways. Hallucinated package names introduce supply-chain poisoning vectors through typosquatting or dependency confusion. While strong practices and defenses exist in frameworks to treat user-submitted values as untrustworthy and to sanitize them appropriately, equivalent practices are not in place for prompt-injection attacks when agents react to content pulled from websites and MCP output. Even beyond hallucinations or prompt injection attacks, LLMs display a well-documented tendency to produce code that passes surface review but contains subtle logic vulnerabilities, and automation bias — the companion psychological pitfall to vigilence decrement — leaves reviewers over-confident in the quality of AI authored code and thus under-prepared to identify vulnerabilities before shipping them.

Automated review suffers the same limitations as automated code authoring: a limited context window and non-deterministic behavior means that the AI reviewer suffers cognitive deficit about the full context of the code it’s reviewing and its non-deterministic nature means review outcome will vary wildly based simply on chance. This variability is so substantial that the best-practice in combatting it is to employ an ensemble of LLM judges to propose findings, review among them, and come to verdicts from consensus.

Successful deployments of agentic AI for software engineering will confront the limitations of these systems and control for their risks — in enforceable, deterministic ways, such that the variability is lower, the outcomes are more consistent, and compliance with architectural, security, quality, stylistic, and performance guidelines is baked in before the pull request is even opened for review. Such deployments bind their AI agents within a harness that provides enforced, deterministic guardrails. Greater design planning as inputs to a coding agent and more thorough review of its outputs as a quality control can be focused on more foundational layers of the stack, in order to maintain the stability and security that underpin all of the layers above it.

Risk #2: Cognitive debt 🔗

When the role of developer shifts from software engineer to prompt engineer, their relationship to the code changes substantially. Code that they’ve themselves written, refactored, tested, and worked closely with gets encoded in a correspondingly rich mental map of the software system. Having an agent write the code limits the building of that mental map, resulting in cognitive debt.

With cognitive debt, a developer may be unfamiliar with how the code works, how to operate it, how to debug or troubleshoot it, or what risk trade-offs went into its construction. When things go wrong, in a pre-agentic-AI era, the debugging process and root cause analysis for an incident were learning experiences that fortified that mental model. Today, more often than not, when an incident occurs, developers are firing up the agents again to call up the Datadog MCP and do the triage & remediation for them, perhaps with little understanding of whether the agent is correct or why the proposed remedy would fix it. There become fewer opportunities for developers to develop intimate understanding of the systems they build and steward.

This cognitive debt can compound with interest over time, most visibly by atrophying in the company’s professional development pipeline. While it’s true that most software companies have had little in the way of formalized internal professional development, it often occurred anyway osmotically through working closer with more senior engineers, who provided deeper design review, answered questions, and unblocked the less-experienced developers, as they built their own mental models of the code and their skill in their craft. With agentic AI making those design decisions and drafting solutions, a more junior developer may commit them not knowing why that approach is correct or even if it is.

The most promising risk management strategy for cognitive debt is a culture of mindfulness practice, at the individual and company level. Ironically, this requires us to slow down, in the face of a technology whose core value proposition is letting us move faster.

The “flow state” is a well-known ideal mode of cognition for software engineers. Almost 20 years ago, Paul Graham opined on the essential quality of mindful presence in the flow state: “Your code is your understanding of the problem you’re exploring. So it’s only when you have your code in your head that you really understand the problem.” The practice of writing code in a programming language forces a developer to intimately understand that code, its structure, its control flow, and its operation. Using natural language to construct a prompt for a coding agent, doesn’t automatically lend itself to the same presence of mind, but to prevent cognitive debt, it must. The prompt may be in English, but the best prompting practice requires directing the coding agent toward not just what the code should do but how it should be shaped to do it, with a clear picture of what the final artifact should look like.

This strategy only works if mindfulness operates at the company cultural level as well. Though many purveyors of agentic AI market themselves as a technology that commoditizes knowledge workers, the mental model that fuses code, systems, processes, and the business developed by those humans in the course of their work vastly outsizes what can be fit into an agent session context window, in ways no RAG or skills database can yet approach.

The most important source of lasting competitive advantage any company builds is in the relationships forged between its team members.

The two hardest problems in Computer Science are

Human communication

Getting people in tech to believe that human communication is important
— Hazel Weakly (@hazelweakly.me) December 24, 2025 at 7:42 AM

When AI agents are developing more of our code, teams must invest more deeply in ceremony, around reviews, around incidents, and around planning & design. Whether an agent co-authored the code or not, every pull request must be fully comprehended by the engineer who opened it. They should write (by hand!) the full description of the pull request, to document what changed and more importantly why. Incidents deserve meaningful post-mortems. Planning & design become opportunities to build the blueprint of what becomes the mental model of the software, and LLMs should not be the ones driving the drafting process of a technical specification.

These ceremonies and their outputs can also become the input feeds for cultivating databases of learnings, skills, and rules for incorporation into the agentic AI development harness and automated code reviewers too! But those artifacts flow best from your team mindfully reflecting upon their work together, distilling what was learned, and supporting each other with accountability and mentorship.

Risk #3: Financial debt 🔗

If code is cheap to generate, and not all code is a net-asset, then the worst case scenario is that, recklessly employed, a coding agent is a tool that burns tokens to generate tech debt as fast as possible.

Nevertheless, many enterprises openly encourage it. If the simplest definition of your company culture is “the set of behaviors that get you ahead”, when the leadership of a company encourages profligate spending on tokens, integrating LLM agents into every business process, employees will burn tokens without care. What a company’s leadership rewards, they get more of.

OpenAI and Anthropic are grossly underpricing their models. One recent analysis found underpricing by 20x or more compared to true cost. The output value for a $99/month Claude Max plan is an extreme value, easy to perceive at present. What happens when it costs $2000/mo? There’s rampant speculation about the solvency of OpenAI, including speculation that their CFO doubts their financial picture would make for a successful IPO. The Silicon Valley start-up model prescribes underpricing a service out of the gate in order to gain marketshare and create lock-in, and marketshare grows to become the market leader with entrenched customers, re-price according to actual costs.

Wise organizations are already planning for ways to reduce their dependency and expenditure on major frontier models, even beyond encouraging employees to pick the right-sized model for the task they’re working on. Many of the open-source models — some only trailing the frontier models in quality by mere months, according to some estimates — which can be leveraged for exceptional cost savings and AI independence.

Are you issuing your developers Macbook hardware with Apple Silicon processors? LM Studio can run models specially scaled and optimized to take advantage of that architecture and its GPU, like the MLX-optimized Qwen3.5 model I run on my Macbook Pro, and wiring an LM Studio hosted model into locally running Claude Code lets developers leverage their local hardware for everyday inference tasks for zero incremental cost.

Other organizations are running highly configurable proxies like LiteLLM that translate to the OpenAI and Anthropic LLM protocols, offering cost tracking, guardrails, and model direction to run agents like Claude Code against full scale open-source models running in Amazon Bedrock or other cloud providers. Compared to per-million-token costs for Anthropic’s Opus 4 of $5.00 input / $25.00 output, using Qwen3 Coder instead costs $0.45 input / $1.80 output (over 90% cheaper!).

Even more simply, token spend can be contained through agentic harness controls and judicious context management practices. The agentic harness can use hooks to steer the agent toward more efficient tool use, to trap it when it gets stuck cycling on a problem, and to enforce policies at the earliest possible stage. For example, my Claude Code harness yields better efficiency exploring the codebase by steering the agent toward ast-grep, rg, fd, and LSP servers instead of grep, find, or other inefficient code discovery mechanisms. Furthermore, AGENTS.md files, skills, and other non-deterministic controls injected into the context should be experimentally validated; do they actually guide the output in routinely, measurably better ways, or do they pollute the context without lending much value? Just as your code has automated testing to validate performance, so too should your prompting artifacts.

Plus ça change, plus c’est la même chose 🔗

As much as agentic AI continues to revolutionize the practice of software engineering and the industries that rely upon it, what strikes me the most between what constitutes a sustainable deployment of the technology and an unsustainable disaster are the same fundamental principles that differentiated great developer teams and cultures before agentic AI.

Well-constituted platform engineering practices to isolate domains, minimize cognitive load, and empower rapid iteration while containing change risk
Strong, human-first cultural norms that balance the near-term & long-term goods, focus on relationships over deadlines, and encourage learning & growth
Judicious use of tooling to balance utility and cost

My anticipation over the next twelve to eighteen months is that the organizations that allowed themselves to become out-of-balance around these principles of positive engineering will find themselves with running code they can’t confidently change or operate with stability; with underdeveloped teams who lack the skills and understanding of their running code to remedy that; and maxed out budgets precluding the token expense it would take to remedy.

Now is the time to ensure you’re calibrated toward sustainability.