Harness engineering for coding agent users

The term harness has emerged as a shorthand to mean everything in an AI agent except the model itself – Agent = Model + Harness. That is a very wide definition, and therefore worth narrowing down for common categories of agents. I want to take the liberty here of defining its meaning in the bounded context of using a coding agent. In coding agents, part of the harness is already built in (e.g. via the system prompt, or the chosen code retrieval mechanism, or even a sophisticated orchestration system). But coding agents also provide us, their users, with many features to build an outer harness specifically for our use case and system.

A well-built outer harness serves two goals: it increases the probability that the agent gets it right in the first place, and it provides a feedback loop that self-corrects as many issues as possible before they even reach human eyes. Ultimately it should reduce the review toil and increase the system quality, all with the added benefit of fewer wasted tokens along the way. — Read More

#devops

What Is Claw Code? The Claude Code Rewrite Explained

… On March 31, 2026, security researcher Chaofan Shou noticed something odd in the npm registry. Version 2.1.88 of @anthropic-ai/claude-code had shipped with a 59.8 MB JavaScript source map file attached.

… Within hours of the exposure, mirrored repositories appeared on GitHub. Anthropic began issuing DMCA takedowns. The internet did not wait.

Sigrid Jin (@instructkr), a Korean developer who had attended Claude Code’s first birthday party in San Francisco in February, published what became claw-code. The repo reached 50,000 stars in two hours, one of the fastest accumulation rates GitHub has recorded.

The important distinction: claw-code is not an archive of the leaked TypeScript. It’s a clean-room Python rewrite, built from scratch by reading the original harness structure and reimplementing the architectural patterns without copying Anthropic’s proprietary source. Jin built it overnight using oh-my-codex, an orchestration layer on top of OpenAI’s Codex, with parallel code review and persistent execution loops.

… The real value here — for builders — isn’t the drama. It’s what the exposed architecture tells us about how production-grade agentic coding systems are actually structured. — Read More

#architecture, #devops

When agents hit the walls

For decades, structural engineers and IT teams have shared the same testing logic: apply controlled pressure, find where things give way, and fix them. In IT, that means a server that buckles at scale, a query that times out under load, or a process that degrades when pushed past its limits.

Agentic AI could upend the way we approach testing. When an agent stops, there is no bug to fix, no threshold to raise. The agent is at a dead end: a system it can’t reach, an approval with no interface, a data handoff that lived in someone’s morning routine instead of in the architecture. The problem is not a flaw in what was built, but a gap in what wasn’t.

Humans filled those gaps without anyone noticing until now. An agent can’t. And every place it stops is a precise record of where the enterprise assumed a connection that was never made. These gaps were always load-bearing, patched up and held up by hand. Now you have a blueprint. — Read More

#devops

The Feedback Loop Is All You Need

So Claude Code added CRON a few days ago. Recurring tasks, native, built right in. The thing we’ve been dreaming about since the first AI coding demos — schedule an agent, go to sleep, wake up to merged PRs. An engineer that works while you don’t.

And I’m sitting here like… I can’t even use this. Not on the real codebase. Not at work.

The old loop: write or review code, spot smells by experience, leave comments explaining intent, promise to fix things “later” — which usually meant never.

The new loop: encode rules once, let agents iterate against them, observe what fails, tighten the constraints. Less “remember this next time,” more “this literally cannot happen.”
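As a minimal sketch of that new loop, the Python below encodes two rules as plain functions and lets an agent retry against their output. Everything named here (run_loop, no_todo, no_bare_except, stub_agent) is hypothetical and only illustrates the idea; a real setup would call an actual coding agent and real linters or tests instead of string checks.

```python
from typing import Callable, Optional

# A rule inspects generated code and returns a violation message, or None if it passes.
Rule = Callable[[str], Optional[str]]
# An agent takes a task plus the previous round's violations and returns new code.
Agent = Callable[[str, list[str]], str]


def no_todo(code: str) -> Optional[str]:
    return "TODO left in code" if "TODO" in code else None


def no_bare_except(code: str) -> Optional[str]:
    return "bare except clause" if "except:" in code else None


def run_loop(agent: Agent, task: str, rules: list[Rule],
             max_rounds: int = 3) -> tuple[str, list[str]]:
    """Let the agent iterate until every rule passes or the rounds run out."""
    feedback: list[str] = []
    code = ""
    for _ in range(max_rounds):
        code = agent(task, feedback)
        # Collect the message of every rule that the new code violates.
        feedback = [msg for rule in rules if (msg := rule(code)) is not None]
        if not feedback:
            break
    return code, feedback


def stub_agent(task: str, feedback: list[str]) -> str:
    """Stand-in for a real coding agent: fixes the TODO once it is told about it."""
    if "TODO left in code" in feedback:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    # TODO implement\n    return 0\n"


code, remaining = run_loop(stub_agent, "add two numbers", [no_todo, no_bare_except])
print(remaining)
```

Tightening the constraints then means adding another function to the rules list rather than leaving another review comment.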

Agents break the old loop completely. When code can be produced nonstop, manual review becomes the weakest link. — Read More

#devops

Meta-Harness: End-to-End Optimization of Model Harnesses

The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering. — Read More

#devops

Building an AI-Powered Prompt Optimizer Using LLMs

Have you ever asked an AI a question and received a disappointing answer? Often the problem isn’t that the AI wasn’t smart enough, but that your question wasn’t precise enough. You’re not alone.

The quality of answers we get from Large Language Models (LLMs) depends heavily on how we ask our questions.

Today, we’re going to build something interesting: An AI system that automatically improves your questions before answering them.

Think of it as having a smart assistant who rephrases your questions to help you get better answers. Read More
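The rewrite-then-answer idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the linked article's implementation: the LLM type, REWRITE_TEMPLATE, optimize_and_answer and fake_llm are all hypothetical names, and fake_llm is a canned stand-in for a real model call.

```python
from typing import Callable

# Hypothetical interface: any function that takes a prompt and returns text.
LLM = Callable[[str], str]

REWRITE_TEMPLATE = (
    "Rewrite the question below so it is specific and unambiguous. "
    "Return only the rewritten question.\n\nQuestion: {question}"
)


def optimize_and_answer(question: str, llm: LLM) -> dict[str, str]:
    """First ask the model to improve the question, then answer the improved one."""
    improved = llm(REWRITE_TEMPLATE.format(question=question)).strip()
    # Fall back to the original question if the rewrite comes back empty.
    if not improved:
        improved = question
    answer = llm(improved)
    return {"original": question, "improved": improved, "answer": answer}


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model, used only to make the sketch runnable."""
    if prompt.startswith("Rewrite the question"):
        return "What are the main causes of memory leaks in Python programs?"
    return f"(answer to: {prompt})"


result = optimize_and_answer("why python memory bad", fake_llm)
print(result["improved"])
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider's client library.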

#devops

How Agentic RAG Works?

The main problem with standard RAG systems isn’t the retrieval or the generation. It’s that nothing sits in the middle deciding whether the retrieval was actually good enough before the generation happens.

Standard RAG is a pipeline where information flows in one direction, from query to retrieval to response, with no checkpoint and no second chance. This works fine for simple questions with obvious answers.

However, the moment a query gets ambiguous, or the answer is spread across multiple documents, or the first retrieval pulls back something that looks good but isn’t, RAG starts losing value.

Agentic RAG attempts to fix this problem. It is based on a single question: what if the system could pause and think before answering? — Read More
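The checkpoint described above can be sketched as a loop that grades each retrieval before generating. This is a toy illustration under assumptions: agentic_answer, the keyword-based retrieve, the grade check and the rewrite step are hypothetical placeholders for what would normally be a vector search and LLM calls.

```python
from typing import Callable

# Tiny in-memory corpus standing in for a document index.
CORPUS = {
    "refund": "Refunds are issued within 14 days of a return.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}


def retrieve(query: str) -> list[str]:
    """Toy keyword retrieval: return documents whose key appears in the query."""
    return [text for key, text in CORPUS.items() if key in query.lower()]


def grade(query: str, docs: list[str]) -> bool:
    """Toy relevance check; a real system might ask an LLM to judge the documents."""
    return len(docs) > 0


def rewrite(query: str) -> str:
    """Toy reformulation; stands in for an LLM-driven query rewrite."""
    return query + " refund"


def generate(query: str, docs: list[str]) -> str:
    """Toy generation: just join the supporting documents."""
    return " ".join(docs)


def agentic_answer(query: str,
                   retrieve: Callable[[str], list[str]],
                   grade: Callable[[str, list[str]], bool],
                   rewrite: Callable[[str], str],
                   generate: Callable[[str, list[str]], str],
                   max_attempts: int = 3) -> str:
    current = query
    for _ in range(max_attempts):
        docs = retrieve(current)
        # The checkpoint: only generate once the retrieval is judged good enough.
        if grade(query, docs):
            return generate(query, docs)
        current = rewrite(current)
    return "I could not find enough evidence to answer."


print(agentic_answer("How long until I get my money back?",
                     retrieve, grade, rewrite, generate))
```

A one-directional pipeline would have called generate on the empty first retrieval; here the failed grade triggers a second attempt instead.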

#devops

The App Store Won’t Survive the Age of Agents

When Steve Jobs launched the iPhone in 2007, there was no App Store. His plan was for developers to build web apps accessed through Safari. That lasted about a year. Developers demanded native access, and in 2008 Apple launched the App Store — bundling discovery, distribution, trust, and payment into a single controlled layer.

That bundle has generated hundreds of billions of dollars. But it was built for humans who browse, tap, and swipe. AI agents don’t do any of that. And this mismatch is about to reshape the platform economy. — Read More

#devops

Designing Agentic AI Systems

How do you build an agentic system that works? And how do you spot potential problems during development that can snowball into massive headaches for future you once the system goes into production?

To answer these questions, you need to break agentic systems into three parts: tools, reasoning, and action. Each layer comes with its own challenges. Mistakes in one layer can ripple through the others, causing failures in unexpected ways. Retrieval functions might pull irrelevant data. Poor reasoning can lead to incomplete or circular workflows. Actions might misfire in production.

An agentic system is only as strong as its weakest link, and this guide will show you how to design systems that avoid these pitfalls. The goal: build agents that are reliable, predictable, and resilient when it matters most. Read More:

Part 1 – Architecture
Part 2 – Modularity
Part 3 – Agent 2 Agent Interactions
Part 4 – Data & RAG
Part 5 – Vectorize MCP

#devops

Agent Memory: Why Your AI Has Amnesia and How to Fix It

Today’s AI agents forget everything between conversations. Every interaction starts from zero, with no recall of who you are or what you’ve discussed before.

Agent memory isn’t about bigger context windows. It’s about a persistent, evolving state that works across sessions.

The field has converged on four memory types (working, procedural, semantic, episodic) that map directly to how human memory works.

Building agent memory at enterprise scale is fundamentally a database problem. You need vectors, graphs, relational data, and ACID transactions working together. — Read More
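To make the database framing concrete, here is a minimal sketch that stores the four memory types in one SQLite table using Python's standard sqlite3 module. The MemoryStore class and its remember and recall methods are hypothetical names chosen for illustration. The sketch covers only the relational and transactional part; vectors and graphs are left out, and working memory is persisted here purely for simplicity.

```python
import sqlite3

MEMORY_TYPES = ("working", "procedural", "semantic", "episodic")


class MemoryStore:
    """Hypothetical store; pass a file path instead of ':memory:' to survive restarts."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory ("
            "id INTEGER PRIMARY KEY, kind TEXT NOT NULL, content TEXT NOT NULL)"
        )

    def remember(self, kind: str, content: str) -> None:
        if kind not in MEMORY_TYPES:
            raise ValueError(f"unknown memory type: {kind}")
        # Using the connection as a context manager commits on success
        # and rolls back if the insert raises.
        with self.db:
            self.db.execute(
                "INSERT INTO memory (kind, content) VALUES (?, ?)", (kind, content)
            )

    def recall(self, kind: str) -> list[str]:
        rows = self.db.execute(
            "SELECT content FROM memory WHERE kind = ? ORDER BY id", (kind,)
        )
        return [row[0] for row in rows]


store = MemoryStore()
store.remember("semantic", "User prefers Python")
store.remember("episodic", "Discussed deployment options last session")
print(store.recall("semantic"))
```

Opening the store with a file path rather than the in-memory default is what lets a later session recall what an earlier one wrote.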

#devops