Measuring LLMs’ ability to develop exploits

Claude Mythos Preview’s ability to develop exploits is a step-change over previous frontier models. This was one of our primary motivations for rolling out the model carefully through Project Glasswing rather than through a general release. Mythos Preview is capable of finding complex vulnerabilities, but what concerned us most in our internal testing was that Mythos Preview could both turn vulnerabilities into exploit primitives, and combine those primitives together into complete end-to-end attack chains.

When we published our Mythos Preview results, we measured its capabilities by having it search for novel zero-days and then build exploits for them. Qualitative evaluations like this are helpful for showcasing a model’s capabilities—but ideally, we would have high-quality quantitative benchmarks that let us measure them precisely. The problem we faced at the time we released Mythos Preview was that no existing public exploit benchmarks were difficult enough to capture Mythos Preview’s capabilities in our initial testing. — Read More

#cyber

Cisco: AI traffic is radically reshaping WANs

… Cisco’s new study, AI Impact on Wide Area Networks 2026, finds AI and agentic AI will not only increase traffic volume but also, “they will change traffic shape, symmetry, duration, and criticality,” the study reports. “AI inference paths will become strategic network assets, requiring high levels of resilience, observability, and differentiated treatment, for example, Quality of Service (QoS) and path security.” — Read More

#cyber

AI’s Plummeting Prices Are a Software Story, Not a Hardware One

Why is model inference getting cheaper? How did I drop a soon-to-be $2,000+/month bill for AI agents to next to nothing? And why are local models on commodity hardware potentially “good enough” for most people?

There are two macro trends here that feed directly into each other.

… costs are dropping for the same capacity (same model, same query), and we’re constantly ramping up what we use (bigger model, more expensive query). — Read More

#strategy

Alibaba’s Metis Agent Reduces AI Tool Calls and Enhances Accuracy

Alibaba’s Metis agent represents a significant advancement in AI operational efficiency by utilizing a novel reinforcement learning framework called Hierarchical Decoupled Policy Optimization (HDPO). This innovative approach has drastically cut down redundant tool invocations from 98% to just 2%, while simultaneously improving reasoning accuracy across industry benchmarks. This breakthrough not only tackles operational bottlenecks but also challenges traditional AI training methods, promoting a more cost-effective and responsive AI deployment in business applications. — Read More

#performance

No Security Meter for AI

Let’s say you wanted to make sure that your AI is secure. Can you just maximize the security and privacy benchmark and call it a day? Nope, because benchmarks don’t actually work for measuring AI capabilities (even when they are NOT emergent systemic properties like security). So let’s take a step back: how do you measure security in the first place? Good question. Over the last 30 years, security engineering for software evolved from black box penetration testing, through whitebox code analysis and architectural risk analysis to de facto process-driven standards like the Building Security In Maturity Model (BSIMM). Software had a very deep impact on business operations, and it appears that AI is going to have an even deeper impact. Will a software security-like measurement move work for AI? Probably. In the meantime we can make real progress in AI security by cleaning up our WHAT piles and managing risk by identifying and applying good assurance processes. (Spoiler alert: no matter what we do, we still don’t get a security meter for AI, so we need to be extra vigilant about security.) — Read More

#cyber

Cheap AI could derail OpenAI and Anthropic’s IPOs

This earnings season, the cost of AI started showing up in the numbers. Meta, Shopify, Spotify, and Pinterest all flagged rising AI and inference costs as a drag on margins. Shopify said economies of scale were “partially offset by increased LLM costs.”

This is the bill coming due for the pricing model that underpins OpenAI’s and Anthropic’s expected IPO valuations, both projected north of $800 billion. Those numbers assume OpenAI and Anthropic will hold their market share and pricing power — that competitors can’t easily catch up, and that enterprise customers will keep paying a premium because there’s no real alternative.

But increasingly the data is pointing the other way. Cutting-edge AI is becoming abundant and cheap. Chinese labs are charging a fraction of what American labs do for comparable work, while a wave of Western challengers — Nvidia, Cohere, Reflection, Mistral — are building cheaper, smaller, more efficient alternatives for enterprises that won’t touch a Chinese model. By the time OpenAI and Anthropic file their prospectuses, with OpenAI’s confidential filing coming as soon as this week, the central premise of their valuations may already be gone. — Read More

#china-vs-us

Stanford’s 2026 AI Index Report

At Stanford HAI, we believe AI is poised to be the most transformative technology of the 21st century. But its benefits won’t be evenly distributed unless we guide its development thoughtfully. The AI Index offers one of the most comprehensive, data-driven views of artificial intelligence. Recognized as a trusted resource by global media, governments, and leading companies, the AI Index equips policymakers, business leaders, and the public with rigorous, objective insights into AI’s technical progress, economic influence, and societal impact. — Read More

#strategy

What’s Easy Now? What’s Hard Now?

This is the fourth in a series about how AI is changing software development, after “It’s time to be right.”, “What about juniors?”, and “My heuristics are wrong. What now?”. It stands alone, but if you found this interesting you may also find those interesting.

I’ve been spending a lot of time thinking about the shape of the capabilities of coding agents. What they’re good at now, what they’re going to be good at. What they’re bad at now, how much of that is inherent and how much is transient. This is worth thinking about, because it’s the most important question shaping the future of software, and of software engineering. I don’t pretend to have an answer, but am coming to a conclusion that may be deeply counter-intuitive.

Coding agents are becoming very good indeed, and can build meaningful and correct software very quickly and at transformatively low cost. They have super-human abilities on some coding tasks. Of course, computer systems have had super-human abilities for at least 85 years. I think we’re going to find, as we have over those nine decades, that this new technology we’re building is vastly super-human in some areas, and not nearly as capable as humans in others. — Read More

#devops

Accelerating scientific discovery with Co-Scientist

Scientific discovery is driven by scientists generating novel hypotheses for complex problems that undergo rigorous experimental validation. To augment this process, we introduce Co-Scientist, a multi-agent AI system built on Gemini for structured scientific thinking and hypothesis generation. Co-Scientist aims to help scientists discover new original knowledge. Conditioned on their research objectives and prior scientific evidence, it formulates demonstrably novel research hypotheses for experimental verification. The system’s design involves agents continuously generating, critiquing and refining hypotheses accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypothesis generation. Automated evaluations show continued benefits of test-time compute scaling, improving hypothesis quality over time. While general purpose, we focus the validation in three biomedical applications: drug repurposing, novel target discovery, and explaining mechanisms of anti-microbial resistance. Specifically, Co-Scientist helped identify new drug repurposing candidates and synergistic combination therapies for acute myeloid leukemia, which were validated through in vitro experiments. These real-world validations demonstrate the potential of Co-Scientist to accelerate scientific discovery and usher in an era of AI-empowered scientists. — Read More

#big7

Geometric AI does not need attention

I got the idea for this post when I had a virtual coffee with an engineer who builds AI models for one of the big airplane builders. And he hasn’t built a model that writes your emails or hallucinates your legal documents, but his model does something different. It looks at, say, a winglet — that’s the little upturned fin at the tip of every commercial aircraft wing — and with it he is able to predict the turbulence it will generate with 98% accuracy.

Let that sit for a moment.

… I walked away from that coffee thinking about wave interference. Because turbulence is, at its core, a wave problem. Pressure waves, superimposed, creating chaotic but geometrically structured patterns. And if a model can learn those patterns in aerodynamics, the obvious question is, where else do superimposed wave systems produce instability that we desperately need to control? — Read More

#performance