Generalization Dynamics of LM Pre-training

People typically assume that LMs stably mature from pattern-matching parrots to generalizable intelligence during pre-training. We build a toy eval suite and show this mental model is wrong: throughout pre-training, LMs frequently and suddenly hop between parrot-like and intelligence-like modes, i.e. distinct algorithms implemented by distinct circuits. We call this mode-hopping. Across our suite, LMs can suddenly latch onto memorized or in-context patterns instead of in-context learning, use System 1 instead of System 2 thinking, pick up what sounds true instead of what is true, fail at multi-hop persona QA, out-of-context reasoning, and emergent misalignment — then just as suddenly revert and generalize. Mode-hopping is not explained by standard optimization dynamics: it is locally stable and cannot be fixed by checkpoint averaging. We instead think of it as a capacity allocation problem: in a capacity-bounded model, generalizable circuits must compete with the shallow ones learned early in training, and the data in each pre-training window decides which circuits win. Our suite provides a cheap set of pre-training monitors and a new lens on generalization. Building upon our insights, we demonstrate three applications: (i) select intermediate pre-training checkpoints that strongly generalize reasoning and alignment, better than the final pre- or mid-training checkpoints, (ii) select pre-training data that controls and stabilizes generalization dynamics, and (iii) test prior generalization predictors, falsifying the monolithic belief that “simpler solutions generalize better”. — Read More

#training

Agent Evaluation: A Detailed Guide

Evaluation is one of the most important research areas for large language models (LLMs). Recently, patterns in LLM usage and evaluation have drastically changed. Whereas we previously evaluated LLMs using benchmarks composed of static questions or short conversations, we now have agent systems that operate over long time horizons and interact with the environment. Agents are difficult to properly evaluate due to their complexity and autonomy. To accurately measure the capabilities of an agent system, we must build harnesses that are realistic and capable of testing agents similarly to how they are used in practice. Building such evaluation capabilities is now more important than ever due to the growing adoption of agents in high-stakes applications like coding and medicine.

This overview will provide a detailed guide to how current agent systems are evaluated. We will begin by developing an understanding of agents in general, covering everything from basic concepts to multi-agent systems. We will then provide a clear framework for the agent evaluation process based upon common patterns observed in practice. Building upon this knowledge, we will end with several case studies of recent agent benchmarks and provide a roadmap that outlines how to build our own agent evaluation by applying similar concepts. Although evaluation is time-consuming and difficult, learning how to properly evaluate agents is incredibly valuable. By rigorously measuring performance and not relying on anecdotal checks, we can rapidly improve agent capabilities. — Read More

#accuracy

Claude Code as a Data Analyst: From Zero to First Report

As data analysts, we’ve all been there: the dreaded request for the monthly/yearly [insert topic] report, an essential task that’s also a massive time sink.

My thoughts for the last week? “Can’t AI just… do this?” Surely, it can whip up a simple data analysis report. Right? — Read More

#data-science

Spec-Driven Development Isn’t Broken. It will collapse.

“prompting has split into four skills” — Context, Intent, Specification, Prompt. Everyone matched a tension one of us had brought into the room. And once they had names, something else clicked: the four crafts mapped cleanly onto P-CAM — Perception, Cognition, Agency, Manifestation.

…For the last eight months, the argument has been spec versus vibe. Structure versus flow. Waterfall versus emergence.

…Every standard critique of SDD, and every standard critique of vibe, traces back to the same thing. Not two sets of failures. One failure, surfacing on both sides of the debate. The three-layer collapse.

…Vibe coding collapsed because it had no contract. Spec-driven development is collapsing because it has three contracts pretending to be one. What rises from the fusion isn’t a new brand. It isn’t a better tool. It’s a separation of concerns — the oldest principle in software engineering — applied one layer up, to the documents we use to instruct the machines that write the documents. — Read More

#devops

If You Had To Read Only 5 AI Papers, This Should Be It.

The five papers that shape how every working AI engineer in 2026 thinks — what each one actually said, why it still matters, and what to read once you’ve read it.

… Five papers and one essay. Read them in this order, and the rest of the field becomes legible. — Read More

#artificial-intelligence

Beyond the Coding Assistant: A Series on AI-Assisted Software Engineering

This is the first article of Beyond the Coding Assistant, a multi-part series on AI-assisted software engineering at enterprise scale. The full series is available here.

The last few years of AI-assisted development have been remarkable. Coding assistants have crossed real quality bars. Engineers can now produce working code, in unfamiliar languages, against unfamiliar systems, at speeds that would have looked like science fiction in 2022. There are real productivity gains, real new affordances, and a real shift in what an individual developer can do in an afternoon.

And yet — when the conversation turns to the team and the organization — the picture is more complicated. The dramatic gains many leaders were promised haven’t shown up on every team. Some teams ship more. Some teams ship the same. Some teams have actually gotten slower, with the AI helping at the keystroke while the wider delivery metrics regress.

That gap, between what’s possible at the keystroke and what’s actually showing up in delivery, is what this series is about. The question I want to ask, and try to answer over the next several articles, is simple: what has changed, and what changes could take us so much farther than where current AI coding assistants have brought us? — Read More

#devops

The Modern Data Stack is Overcomplicated

… This series is the guide I wish someone had handed me at the start.

Over the next nine posts, I’m going to walk you through every layer of the Modern Data Stack. Not just which tool does what – you can read their docs for that. I want to talk about the decisions: why you’d choose one approach over another, what the real trade-offs are once you’re six months down the line, and where “best-practice” advice falls apart in the real world.

Here’s the series at a glance:

1. Architecture Overview: You are here
2. Data Ingestion: Connectors, event streams, custom pipelines
3. Data Warehousing: Where your data lives and why it matters more than you think
4. Transformation: dbt and beyond
5. Orchestration: Keeping everything running without losing your mind
6. Infrastructure as Code: The upfront cost that pays for itself (eventually)
7. Data Quality & Testing: What actually catches problems in production
8. Access Control & Governance: The boring stuff that will bite you if you ignore it
9. AI & ML Readiness: What “AI-ready” actually means from an engineering perspective
10. Lessons Learned: What I’d do differently if I started again tomorrow

Read More

Read the Series

#architecture

How Claude Code works in large codebases: Best practices and where to start

Claude Code is running in production across multi-million-line monorepos, decades-old legacy systems, distributed architectures spanning dozens of repositories, and at organizations with thousands of developers. These environments present challenges that smaller, simpler codebases don’t, whether that’s build commands that differ across every subdirectory or legacy code spread across folders with no shared root.

This article covers the patterns we’ve observed that have led to successful adoption of Claude Code at scale. We use “large codebase” to refer to a wide range of deployments: monorepos with millions of lines, legacy systems built over decades, dozens of microservices across separate repositories, or any combination of the above. That also includes codebases running on languages that teams don’t always associate with AI coding tools, such as C, C++, C#, Java, and PHP. (Claude Code performs better than most teams expect it to in those cases, particularly as of recent model releases.) While every large codebase deployment is shaped by its specific version control, team structure, and accumulated conventions, the patterns here generalize across them and are a good starting point for teams considering adopting Claude Code. — Read More

#devops