We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs. — Read More
The Rise of the AI Database: Powering Real-Time AI Applications
As AI rapidly evolves, organizations are racing to build and deploy high-performance gen AI apps that deliver real-time insights and seamless user experiences. Central to this transformation is the emergence of the generative AI database, a new category of data platform optimized for vector search, semantic indexing, and full-text retrieval. These systems are designed to address challenges like data silos, data quality, and integration for AI and analytics. As the name suggests, a gen AI database is purpose-built to power generative AI models and applications, enabling developers to store, query, and analyze both structured and unstructured data at scale; the data stored in these platforms plays a crucial role in supporting advanced analytics and machine learning. — Read More
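The vector search at the heart of these platforms can be sketched in a few lines: documents are stored as embedding vectors, and queries are answered by ranking documents by cosine similarity. The documents, 3-d embeddings, and function names below are illustrative inventions; production systems use learned embeddings with hundreds of dimensions and approximate nearest-neighbor indexes.

```python
import math

# Toy "gen AI database": documents with hypothetical 3-d embeddings.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "api reference": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query_vec, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d]), reverse=True)
    return ranked[:k]

# A query vector "near" the refund document ranks it first.
print(vector_search([0.85, 0.15, 0.05]))  # → ['refund policy', 'shipping times']
```

Semantic indexing and full-text retrieval layer additional structures (inverted indexes, metadata filters) on top of this same similarity-ranking core.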
Context Engineering: 2025’s #1 Skill in AI
Let’s get one thing straight: if you’re still only talking about “prompt engineering,” you’re behind the curve. In the early days of Large Language Models (LLMs), crafting the perfect prompt was the name of the game.
For simple chatbots in 2022, it was enough. Then came Retrieval-Augmented Generation (RAG) in 2023, where we started feeding models domain-specific knowledge. Now, we have tool-using, memory-enabled agents that need to build relationships and maintain state over time. The single-interaction focus of prompt engineering just doesn’t cut it anymore. — Read More
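The shift from prompt engineering to context engineering can be made concrete: instead of hand-crafting one prompt, each model call is assembled from instructions, persistent memory, retrieved documents, and tool results. The function and section labels below are a hypothetical sketch, not any particular framework's API.

```python
# Illustrative sketch of "context engineering" for one agent turn: the model's
# input is assembled from several sources, not a single hand-written prompt.
def build_context(system, memory, retrieved_docs, tool_results, user_msg):
    """Assemble one model input from instructions, state, retrieval, and tools."""
    parts = [f"[SYSTEM]\n{system}"]
    if memory:
        parts.append("[MEMORY]\n" + "\n".join(memory))        # persistent state
    if retrieved_docs:
        parts.append("[RETRIEVED]\n" + "\n".join(retrieved_docs))  # RAG
    if tool_results:
        parts.append("[TOOLS]\n" + "\n".join(tool_results))   # tool outputs
    parts.append(f"[USER]\n{user_msg}")
    return "\n\n".join(parts)

ctx = build_context(
    system="You are a support agent.",
    memory=["User prefers email replies."],
    retrieved_docs=["Refunds take 5-7 days."],
    tool_results=["order #123: shipped"],
    user_msg="Where is my refund?",
)
print(ctx)
```

The engineering work lies in deciding what enters each section, in what order, and within what token budget — exactly the state-management problem single-shot prompting never had to solve.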
Experts react: What Trump’s new AI Action Plan means for tech, energy, the economy, and more
“An industrial revolution, an information revolution, and a renaissance—all at once.” That’s how the Trump administration describes artificial intelligence (AI) in its new “AI Action Plan.” Released on Wednesday, the plan calls for cutting regulations to spur AI innovation and adoption, speeding up the buildout of AI data centers, exporting AI “full technology stacks” to US allies and partners, and ridding AI systems of what the White House calls “ideological bias.” How does the plan’s approach to AI policy differ from past US policy? What impacts will it have on the US AI industry and global AI governance? What are the implications for energy and the global economy? Our experts share their human-generated responses to these burning AI questions below. — Read More
America’s AI Action Plan
America is in a race to achieve global dominance in artificial intelligence (AI). Winning this race will usher in a new era of human flourishing, economic competitiveness, and national security for the American people. Recognizing this, President Trump directed the creation of an AI Action Plan in the early days of his second term in office. Based on the three pillars of accelerating innovation, building AI infrastructure, and leading in international diplomacy and security, this Action Plan is America’s roadmap to win the race. — Read More
Surprising no one, new research says AI Overviews cause massive drop in search clicks
Google’s search results have undergone a seismic shift over the past year as AI fever has continued to escalate among the tech giants. Nowhere is this change more apparent than right at the top of Google’s storied results page, which is now home to AI Overviews. Google contends these Gemini-based answers don’t take traffic away from websites, but a new analysis from the Pew Research Center says otherwise. Its analysis shows that searches with AI summaries reduce clicks, and their prevalence is increasing.
Google began testing AI Overviews as the “search generative experience” in May 2023, and just a year later, they were an official part of the search engine results page (SERP). Many sites (including this one) have noticed changes to their traffic in the wake of this move, but Google has brushed off concerns about how this could affect the sites from which it collects all that data.
SEO experts have disagreed with Google’s stance on how AI affects web traffic, and the newly released Pew study backs them up. — Read More
‘Another DeepSeek moment’: Chinese AI model Kimi K2 stirs excitement
Excitement is growing among researchers about another powerful artificial intelligence (AI) model to emerge from China, after DeepSeek shocked the world with its launch of R1 in January.
The performance of Kimi K2, launched on 11 July by Beijing-based company Moonshot AI, matches or surpasses that of Western rivals, as well as some DeepSeek models, across various benchmarks, according to the firm. In particular, it seems to excel at coding, scoring highly in tests such as LiveCodeBench. — Read More
ARTIFICIAL GENERAL INTELLIGENCE AND THE FOURTH OFFSET
The recent strides toward artificial general intelligence (AGI)—AI systems surpassing human abilities across most cognitive tasks—have come from scaling “foundation models.” Their performance across tasks follows clear “scaling laws,” improving as a power law with model size, dataset size, and the amount of compute used to train the model.1 Continued investment in training compute and algorithmic innovations has driven a predictable rise in model capabilities.
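The scaling laws referenced above are commonly written in power-law form. One standard parameterization (the constants and exponents are empirically fitted; the notation below is a schematic assumption, not taken from the source) expresses loss separately in terms of parameters N, dataset size D, and training compute C:

```latex
% Loss as a power law in parameters N, data D, and compute C.
% N_c, D_c, C_c and the exponents \alpha are empirical fitted constants.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Because each term improves smoothly as its resource grows, capability gains from further scaling are predictable in aggregate, which is what underwrites the "critical scale" extrapolation that follows.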
In the manner that the architects of the atomic bomb postulated a "critical mass"—the amount of fissile material needed to sustain a chain reaction—we could conceive of a "critical scale" in AGI development: the point at which a foundation model automates its own research and development. A model at this scale would produce research and development output equivalent to that of hundreds of millions of scientists and engineers—10,000 Manhattan Projects.2
This would amount to a "fourth offset," a lead in the development of AGI-derived weapons, tactics, and operational methods. Applications would include unlimited cyber and information operations and potentially decisive left-of-launch capabilities, from tracking and targeting ballistic missile submarines to—at the high end—developing impenetrable missile defense capable of negating nuclear weapons, providing the first nation to develop AGI with unprecedented national security policy options.
Preventing the proliferation of foundation models at the critical scale would therefore also mean preventing the spread of AGI-derived novel weapons. This supposition raises the stakes for counter-proliferation of the next stages of AGI components. AGI could also be used to support counter-proliferation strategy, providing the means needed to ensure models at this scale do not proliferate. This would cement the first-mover advantage in AGI development and, over time, compound that advantage into a fourth offset. — Read More
Context Engineering for AI Agents: Lessons from Building Manus
At the very beginning of the Manus project, my team and I faced a key decision: should we train an end-to-end agentic model using open-source foundations, or build an agent on top of the in-context learning abilities of frontier models?
Back in my first decade in NLP, we didn’t have the luxury of that choice. In the distant days of BERT (yes, it’s been seven years), models had to be fine-tuned—and evaluated—before they could transfer to a new task. That process often took weeks per iteration, even though the models were tiny compared to today’s LLMs. For fast-moving applications, especially pre–PMF, such slow feedback loops are a deal-breaker. That was a bitter lesson from my last startup, where I trained models from scratch for open information extraction and semantic search. Then came GPT-3 and Flan-T5, and my in-house models became irrelevant overnight. Ironically, those same models marked the beginning of in-context learning—and a whole new path forward.
That hard-earned lesson made the choice clear: Manus would bet on context engineering. This lets us ship improvements in hours instead of weeks and keeps our product orthogonal to the underlying models: if model progress is the rising tide, we want Manus to be the boat, not the pillar stuck to the seabed. — Read More
The “Bubble” of Risk: Improving Assessments for Offensive Cybersecurity Agents
Most frontier models today undergo some form of safety testing, including whether they can help adversaries launch costly cyberattacks. But many of these assessments overlook a critical factor: adversaries can adapt and modify models in ways that expand the risk far beyond the perceived safety profile that static evaluations capture. At Princeton’s POLARIS Lab, we’ve previously studied how easily open-source or fine-tunable models can be manipulated to bypass safeguards. See, e.g., Wei et al. (2024), Qi et al. (2024), Qi et al. (2025), He et al. (2024). This flexibility means that model safety isn’t fixed: there is a “bubble” of risk defined by the degrees of freedom an adversary has to improve an agent. If a model provider offers fine-tuning APIs or allows repeated queries, it dramatically increases the attack surface. This is especially true when evaluating AI systems for risks related to their use in offensive cybersecurity attacks. In our recent research, Dynamic Risk Assessments for Offensive Cybersecurity Agents, we show that the risk “bubble” is larger, cheaper, and more dynamic than many expect. For instance, using only 8 H100 GPU-hours of compute—about $36—an adversary could improve an agent’s success rate on InterCode-CTF by over 40% using relatively simple methods. — Read More