The Token Economy pt2: The Intelligence Company Gets Built

Some companies are rebuilding themselves around AI. Everyone else is waiting for a lab, vendor, owner, or competitor to do it for them.

Token Economy Part 1 said tokens don’t create productivity. The operating model does.

This week shows what happens next: if you can’t build that operating model yourself, someone will install it for you. — Read More

#strategy

How Non-Technical PMs Are Building Products Without Engineers

Read More


#videos

Polymarket launches private company trading so investors can speculate on Anthropic, OpenAI

Polymarket is moving deeper into private markets — and this time, the contracts are tied to companies most investors can talk about, but still cannot actually buy.

The company is launching prediction markets tied to private company milestones, including valuations, IPO timing and secondary-market activity for names like OpenAI and Anthropic.

Nasdaq Private Market will serve as the exclusive resolution data provider, supplying the information that determines whether these contracts pay out. — Read More

#investing

OpenAI co-founder Andrej Karpathy joins Anthropic’s pre-training team

Andrej Karpathy, the AI researcher who co-founded OpenAI and previously led AI at Tesla, has joined Anthropic.

“I’ve joined Anthropic,” Karpathy posted on X Tuesday. “I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D.” — Read More

#strategy

Generalization Dynamics of LM Pre-training

People typically assume that LMs stably mature from pattern-matching parrots to generalizable intelligence during pre-training. We build a toy eval suite and show this mental model is wrong: throughout pre-training, LMs frequently and suddenly hop between parrot-like and intelligence-like modes, i.e. distinct algorithms implemented by distinct circuits. We call this mode-hopping. Across our suite, LMs can suddenly latch onto memorized or in-context patterns instead of in-context learning, use System 1 instead of System 2 thinking, pick up what sounds true instead of what is true, fail at multi-hop persona QA, out-of-context reasoning, and emergent misalignment — then just as suddenly revert and generalize. Mode-hopping is not explained by standard optimization dynamics: it is locally stable and cannot be fixed by checkpoint averaging. We instead think of it as a capacity allocation problem: in a capacity-bounded model, generalizable circuits must compete with the shallow ones learned early in training, and the data in each pre-training window decides which circuits win. Our suite provides a cheap set of pre-training monitors and a new lens on generalization. Building upon our insights, we demonstrate three applications: (i) select intermediate pre-training checkpoints that strongly generalize reasoning and alignment, better than the final pre- or mid-training checkpoints, (ii) select pre-training data that controls and stabilizes generalization dynamics, and (iii) test prior generalization predictors, falsifying the monolithic belief that “simpler solutions generalize better”. — Read More
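To make the idea of a pre-training monitor concrete, here is a minimal Python sketch. It is not the paper's method: the function names, the jump threshold, and the score history are all hypothetical, and the numbers are made up purely for illustration. It flags checkpoints where a monitor score jumps sharply relative to the previous checkpoint, and picks the best-scoring checkpoint rather than assuming the last one is best.

```python
from typing import List, Tuple

# (training step, monitor score) pairs, one per saved checkpoint.
Score = Tuple[int, float]


def find_hops(scores: List[Score], threshold: float = 0.2) -> List[int]:
    """Return steps whose score differs from the previous checkpoint by more than `threshold`."""
    hops = []
    for (_, prev), (step, cur) in zip(scores, scores[1:]):
        if abs(cur - prev) > threshold:
            hops.append(step)
    return hops


def best_checkpoint(scores: List[Score]) -> int:
    """Return the step with the highest monitor score (first one on ties)."""
    return max(scores, key=lambda pair: pair[1])[0]


# Made-up illustrative history, not real results.
history = [(1000, 0.30), (2000, 0.75), (3000, 0.35), (4000, 0.80), (5000, 0.65)]

hops = find_hops(history)
best = best_checkpoint(history)
print(hops, best)
```

In this toy history the final checkpoint is not the best one, which is the situation the abstract's first application is about.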

#training

Agent Evaluation: A Detailed Guide

Evaluation is one of the most important research areas for large language models (LLMs). Recently, patterns in LLM usage and evaluation have drastically changed. Whereas we previously evaluated LLMs using benchmarks composed of static questions or short conversations, we now have agent systems that operate over long time horizons and interact with the environment. Agents are difficult to properly evaluate due to their complexity and autonomy. To accurately measure the capabilities of an agent system, we must build harnesses that are realistic and capable of testing agents similarly to how they are used in practice. Building such evaluation capabilities is now more important than ever due to the growing adoption of agents in high-stakes applications like coding and medicine.

This overview will provide a detailed guide to how current agent systems are evaluated. We will begin by developing an understanding of agents in general, covering everything from basic concepts to multi-agent systems. We will then provide a clear framework for the agent evaluation process based upon common patterns observed in practice. Building upon this knowledge, we will end with several case studies of recent agent benchmarks and provide a roadmap that outlines how to build our own agent evaluation by applying similar concepts. Although evaluation is time-consuming and difficult, learning how to properly evaluate agents is incredibly valuable. By rigorously measuring performance and not relying on anecdotal checks, we can rapidly improve agent capabilities. — Read More
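The contrast between rigorous measurement and anecdotal checks can be illustrated with a minimal Python sketch. It is not taken from the guide: `Task`, `evaluate`, and `toy_agent` are hypothetical names, and the agent is a trivial stand-in. Each task carries a programmatic success check, every task is run several times, and the harness reports a pass rate per task instead of a one-off impression.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """One evaluation task: a prompt plus a programmatic success check."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def evaluate(agent: Callable[[str], str],
             tasks: List[Task],
             trials: int = 5) -> Dict[str, float]:
    """Run every task `trials` times and return the pass rate per task.

    Repeated trials matter for real agents, whose outputs vary between runs.
    """
    results = {}
    for task in tasks:
        passes = sum(task.check(agent(task.prompt)) for _ in range(trials))
        results[task.name] = passes / trials
    return results


def toy_agent(prompt: str) -> str:
    """Hypothetical stand-in agent: handles 'a + b' prompts and nothing else."""
    a, op, b = prompt.split()
    return str(int(a) + int(b)) if op == "+" else "unknown"


tasks = [
    Task("add", "2 + 3", lambda out: out == "5"),
    Task("multiply", "2 * 3", lambda out: out == "6"),
]

report = evaluate(toy_agent, tasks)
print(report)
```

A real harness would replace `toy_agent` with a call into the agent system and replace the string checks with environment-level checks, such as running a test suite.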

#accuracy

Claude Code as a Data Analyst: From Zero to First Report

As data analysts, we’ve all been there: the dreaded request for the monthly/yearly [insert topic] report, an essential task that’s also a massive time sink.

My thoughts for the last week? “Can’t AI just… do this?” Surely, it can whip up a simple data analysis report. Right? — Read More

#data-science

Spec-Driven Development Isn’t Broken. It will collapse.

“prompting has split into four skills”: Context, Intent, Specification, Prompt. Every one matched a tension one of us had brought into the room. And once they had names, something else clicked: the four crafts mapped cleanly onto P-CAM (Perception, Cognition, Agency, Manifestation).

…For the last eight months, the argument has been spec versus vibe. Structure versus flow. Waterfall versus emergence.

…Every standard critique of SDD, and every standard critique of vibe, traces back to the same thing. Not two sets of failures. One failure, surfacing on both sides of the debate. The three-layer collapse.

…Vibe coding collapsed because it had no contract. Spec-driven development is collapsing because it has three contracts pretending to be one. What rises from the fusion isn’t a new brand. It isn’t a better tool. It’s a separation of concerns — the oldest principle in software engineering — applied one layer up, to the documents we use to instruct the machines that write the documents. — Read More

#devops

If You Had To Read Only 5 AI Papers, This Should Be It.

The five papers that shape how every working AI engineer in 2026 thinks — what each one actually said, why it still matters, and what to read once you’ve read it.

… Five papers and one essay. Read them in this order, and the rest of the field becomes legible. — Read More

#artificial-intelligence

Beyond the Coding Assistant: A Series on AI-Assisted Software Engineering

This is the first article of Beyond the Coding Assistant, a multi-part series on AI-assisted software engineering at enterprise scale. The full series is available here.

The last few years of AI-assisted development have been remarkable. Coding assistants have crossed real quality bars. Engineers can now produce working code, in unfamiliar languages, against unfamiliar systems, at speeds that would have looked like science fiction in 2022. There are real productivity gains, real new affordances, and a real shift in what an individual developer can do in an afternoon.

And yet — when the conversation turns to the team and the organization — the picture is more complicated. The dramatic gains many leaders were promised haven’t shown up on every team. Some teams ship more. Some teams ship the same. Some teams have actually gotten slower, with the AI helping at the keystroke while the wider delivery metrics regress.

That gap, between what’s possible at the keystroke and what’s actually showing up in delivery, is what this series is about. The question I want to ask, and try to answer over the next several articles, is simple: what has changed, and what changes could take us so much farther than where current AI coding assistants have brought us? — Read More

#devops