In February 2025, Andrej Karpathy coined “vibe coding”: describe what you want, let AI write the code, forget the code exists. It caught fire. Everyone wanted to believe coding had become as easy as talking.
One year later, Karpathy renamed it. The new term: “agentic engineering.” His explanation was pointed. “‘Engineering’ to emphasize that there is an art and science and expertise to it.” He’d gone from 80% manual coding to 80% agent coding in weeks, and discovered the hard way that models are “jagged” — brilliant at hard problems, then tripping over the obvious.
The data backs him up. GitClear’s 2025 code quality study found that AI-coauthored pull requests have 1.7x more issues than human-only PRs. Copy-pasted code lines rose from 8.3% to 12.3% between 2021 and 2024. Meanwhile, AI now writes 41% of all code on GitHub, with 4.7 million paid Copilot subscribers. — Read More
Tag Archives: DevOps
What Is the Best Local LLM for Coding in 2026?
We’ve all gone through the process of trying to run a multi-billion parameter model on our local machines. You spend the time downloading the weights and loading them into memory, only to have your machine freeze up completely when you actually try to prompt it. It usually ends with some broken output, and the realization that it’s just easier to stick to API keys.
I think the best local coding model is not the one with the highest math score. It is the one your machine can actually run without freezing. It is the tool that fits your specific daily workflow and respects your exact tolerance for latency. — Read More
Macro Evals for Agentic Systems
When an agentic system fails, the problem is often larger than a single bad response. A handoff may happen too late, a specialist agent may miss the same signal across many runs, or a review process may trigger for the wrong class of cases. To improve the system, teams need to see recurring behavior across the whole population of traces.
This cookbook walks through a macro-eval workflow for a multi-agent system. We use a synthetic EV order workflow where specialist agents handle pricing, compliance, supply, factory routing, scheduling, and release decisions while market and operational conditions change.
The notebook uses precomputed synthetic traces and saved lower-level eval labels, so you can run the full workflow without an OpenAI API key. — Read More
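To make "recurring behavior across the whole population of traces" concrete, here is a minimal sketch of the aggregation step. It is not the cookbook's code: the record layout (`trace_id`, `agent`, `passed`) and the sample labels are hypothetical, invented only for illustration. It rolls saved per-step eval labels up into a failure rate per specialist agent.

```python
from collections import defaultdict

# Hypothetical saved eval labels, one record per agent step per trace.
# Field names and values are illustrative assumptions, not the cookbook's schema.
labels = [
    {"trace_id": "t1", "agent": "pricing", "passed": True},
    {"trace_id": "t1", "agent": "compliance", "passed": False},
    {"trace_id": "t2", "agent": "pricing", "passed": True},
    {"trace_id": "t2", "agent": "compliance", "passed": False},
    {"trace_id": "t3", "agent": "compliance", "passed": True},
    {"trace_id": "t3", "agent": "supply", "passed": True},
]


def failure_rate_by_agent(records):
    """Return the fraction of evaluated steps that failed, keyed by agent."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        totals[record["agent"]] += 1
        if not record["passed"]:
            failures[record["agent"]] += 1
    return {agent: failures[agent] / totals[agent] for agent in totals}


rates = failure_rate_by_agent(labels)

# List the agents with the highest failure rate first.
for agent, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{agent}: {rate:.0%} of evaluated steps failed")
```

A single failed compliance check looks like noise in one trace. The same check failing in most traces is the kind of population-level pattern a macro eval is meant to surface.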
What’s Easy Now? What’s Hard Now?
This is the fourth in a series about how AI is changing software development, after “It’s time to be right.”, “What about juniors?”, and “My heuristics are wrong. What now?” It stands alone, but if you found this interesting you may also enjoy those.
I’ve been spending a lot of time thinking about the shape of the capabilities of coding agents. What they’re good at now, what they’re going to be good at. What they’re bad at now, how much of that is inherent and how much is transient. This is worth thinking about, because it’s the most important question shaping the future of software, and of software engineering. I don’t pretend to have an answer, but am coming to a conclusion that may be deeply counter-intuitive.
Coding agents are becoming very good indeed, and can build meaningful and correct software very quickly and at transformatively low cost. They have super-human abilities on some coding tasks. Of course, computer systems have had super-human abilities for at least 85 years. I think we’re going to find, as we have over those nine decades, that this new technology we’re building is vastly super-human in some areas, and not nearly as capable as humans in others. — Read More
Terraform Enterprise 2.0: Evolving infrastructure operations for scale
At the core of Terraform Enterprise 2.0 is support for Stacks, a new infrastructure orchestration capability that allows teams to manage collections of infrastructure as a single unit. Terraform Stacks are available on all plans based on resources under management.
As organizations scale, infrastructure evolves from isolated configurations into systems of interconnected components. Stacks reflect this shift by introducing a configuration layer that enables teams to define and manage infrastructure across environments, regions, and accounts in a consistent, repeatable way. — Read More
Spec-Driven Development Isn’t Broken. It will collapse.
… “prompting has split into four skills” — Context, Intent, Specification, Prompt. Every one matched a tension one of us had brought into the room. And once they had names, something else clicked: the four crafts mapped cleanly onto P-CAM — Perception, Cognition, Agency, Manifestation.
…For the last eight months, the argument has been spec versus vibe. Structure versus flow. Waterfall versus emergence.
…Every standard critique of SDD, and every standard critique of vibe, traces back to the same thing. Not two sets of failures. One failure, surfacing on both sides of the debate. The three-layer collapse.
…Vibe coding collapsed because it had no contract. Spec-driven development is collapsing because it has three contracts pretending to be one. What rises from the fusion isn’t a new brand. It isn’t a better tool. It’s a separation of concerns — the oldest principle in software engineering — applied one layer up, to the documents we use to instruct the machines that write the documents. — Read More
Beyond the Coding Assistant: A Series on AI-Assisted Software Engineering
This is the first article of Beyond the Coding Assistant, a multi-part series on AI-assisted software engineering at enterprise scale. The full series is available here.
The last few years of AI-assisted development have been remarkable. Coding assistants have crossed real quality bars. Engineers can now produce working code, in unfamiliar languages, against unfamiliar systems, at speeds that would have looked like science fiction in 2022. There are real productivity gains, real new affordances, and a real shift in what an individual developer can do in an afternoon.
And yet — when the conversation turns to the team and the organization — the picture is more complicated. The dramatic gains many leaders were promised haven’t shown up on every team. Some teams ship more. Some teams ship the same. Some teams have actually gotten slower, with the AI helping at the keystroke while the wider delivery metrics regress.
That gap, between what’s possible at the keystroke and what’s actually showing up in delivery, is what this series is about. The question I want to ask, and try to answer over the next several articles, is simple: what has changed, and what changes could take us so much farther than where current AI coding assistants have brought us? — Read More
How Claude Code works in large codebases: Best practices and where to start
Claude Code is running in production across multi-million-line monorepos, decades-old legacy systems, distributed architectures spanning dozens of repositories, and at organizations with thousands of developers. These environments present challenges that smaller, simpler codebases don’t, whether that’s build commands that differ across every subdirectory or legacy code spread across folders with no shared root.
This article covers the patterns we’ve observed that have led to successful adoption of Claude Code at scale. We use “large codebase” to refer to a wide range of deployments: monorepos with millions of lines, legacy systems built over decades, dozens of microservices across separate repositories, or any combination of the above. That also includes codebases running on languages that teams don’t always associate with AI coding tools, such as C, C++, C#, Java, PHP. (Claude Code performs better than most teams expect it to in those cases, particularly as of recent model releases.) While every large codebase deployment is shaped by its specific version control, team structure, and accumulated conventions, the patterns here generalize across them and are a good starting point for teams considering adopting Claude Code. — Read More
Multi-Agent Systems: When 2 Agents Beat 1 (and When They Don’t)
You see the word “multi-agent” everywhere right now. People build systems with five different AI personas talking to each other in a simulated chat room just to scrape a website and write a blog post. They give them names like Researcher, Writer, and Editor and watch the terminal output scroll by as the agents debate with each other. It all looks impressive, but it is not the right way to build software.
Adding more agents to a system does not automatically make it smarter. It actually multiplies your failure rate. Think about the basic math of probability. If you have a single agent that executes its task correctly 90% of the time, your naive system reliability is 0.90.
If you chain three of those agents together, you multiply those probabilities: 0.90 × 0.90 × 0.90 ≈ 0.73. Your baseline reliability just dropped to about 73%. You doubled your latency, tripled your API cost, and made the final output no better.
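The arithmetic is easy to check. The sketch below assumes each agent succeeds independently and that the chain only succeeds when every step does. The function name `chain_reliability` is ours, chosen for illustration.

```python
from math import prod


def chain_reliability(step_success_rates):
    """Probability that every step in a sequential chain succeeds.

    Assumes the steps fail independently of one another, so the
    per-step success probabilities simply multiply.
    """
    return prod(step_success_rates)


single = chain_reliability([0.90])
chained = chain_reliability([0.90, 0.90, 0.90])

print(f"one agent:    {single:.3f}")   # 0.900
print(f"three agents: {chained:.3f}")  # 0.729
```

Real agents rarely fail independently, so treat the product as a rough baseline, not a measurement.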
… We will see exactly why the single agent misses a critical billing logic flaw, and why the two-agent system catches it. — Read More
I’m going back to writing code by hand
Here is k10s: https://github.com/shvbsle/k10s/tree/archive/go-v0.4.0
234 commits. ~30 weekends. Built entirely on vibe-coded sessions with Claude, whenever my tokens lasted long enough to ship something.
I’m archiving my TUI tool and rewriting it from scratch.
…I built it in Go with Bubble Tea [1] and it worked.
For a while… 😦
[What] I learned over these 7 months is worth more than the 1690 lines of model.go I’m throwing away.
…AI writes features, not architecture. The longer you let it drive without constraints, the worse the wreckage gets. The velocity makes you think you’re winning right up until the moment everything collapses simultaneously. — Read More