Context Engineering: 2025’s #1 Skill in AI

Let’s get one thing straight: if you’re still only talking about “prompt engineering,” you’re behind the curve. In the early days of Large Language Models (LLMs), crafting the perfect prompt was the name of the game.

For simple chatbots in 2022, it was enough. Then came Retrieval-Augmented Generation (RAG) in 2023, where we started feeding models domain-specific knowledge. Now, we have tool-using, memory-enabled agents that need to build relationships and maintain state over time. The single-interaction focus of prompt engineering just doesn’t cut it anymore. — Read More
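The shift is easiest to see in code. Below is a minimal, hypothetical sketch (the names and structure are illustrative, not taken from the article) contrasting a single crafted prompt with an agent that assembles its context window each turn from system instructions, persistent memory, retrieved documents, and tool results:

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    """Illustrative container for everything an agent carries between turns."""
    system: str                                         # stable instructions / persona
    memory: list[str] = field(default_factory=list)     # long-lived facts and preferences
    history: list[dict] = field(default_factory=list)   # prior turns (the agent's state)

def build_messages(ctx: AgentContext, retrieved_docs: list[str],
                   tool_results: list[str], user_msg: str) -> list[dict]:
    """Assemble the model's context window for one turn.

    Prompt engineering tunes only user_msg; context engineering decides what
    else enters the window, in what order, and what gets dropped over time.
    """
    background = "\n".join(ctx.memory + retrieved_docs + tool_results)
    return (
        [{"role": "system", "content": ctx.system}]
        + ctx.history
        + [{"role": "user", "content": f"{background}\n\n{user_msg}"}]
    )
```

The article's point, restated: the leverage has moved from wording the user message well to deciding what else enters the window, and what gets evicted, as the interaction grows.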

#nlp

Experts react: What Trump’s new AI Action Plan means for tech, energy, the economy, and more

“An industrial revolution, an information revolution, and a renaissance—all at once.” That’s how the Trump administration describes artificial intelligence (AI) in its new “AI Action Plan.” Released on Wednesday, the plan calls for cutting regulations to spur AI innovation and adoption, speeding up the buildout of AI data centers, exporting AI “full technology stacks” to US allies and partners, and ridding AI systems of what the White House calls “ideological bias.” How does the plan’s approach to AI policy differ from past US policy? What impacts will it have on the US AI industry and global AI governance? What are the implications for energy and the global economy? Our experts share their human-generated responses to these burning AI questions below. — Read More

#china-vs-us, #strategy

America’s AI Action Plan

America is in a race to achieve global dominance in artificial intelligence (AI). Winning this race will usher in a new era of human flourishing, economic competitiveness, and national security for the American people. Recognizing this, President Trump directed the creation of an AI Action Plan in the early days of his second term in office. Based on the three pillars of accelerating innovation, building AI infrastructure, and leading in international diplomacy and security, this Action Plan is America’s roadmap to win the race. — Read More

#china-vs-us

Surprising no one, new research says AI Overviews cause massive drop in search clicks

Google’s search results have undergone a seismic shift over the past year as AI fever has continued to escalate among the tech giants. Nowhere is this change more apparent than right at the top of Google’s storied results page, which is now home to AI Overviews. Google contends these Gemini-based answers don’t take traffic away from websites, but a new analysis from the Pew Research Center says otherwise. Its analysis shows that searches with AI summaries reduce clicks, and their prevalence is increasing.

Google began testing AI Overviews as the “search generative experience” in May 2023, and just a year later, they were an official part of the search engine results page (SERP). Many sites (including this one) have noticed changes to their traffic in the wake of this move, but Google has brushed off concerns about how this could affect the sites from which it collects all that data.

SEO experts have disagreed with Google’s stance on how AI affects web traffic, and the newly released Pew study backs them up. — Read More

#strategy

‘Another DeepSeek moment’: Chinese AI model Kimi K2 stirs excitement

Excitement is growing among researchers about another powerful artificial intelligence (AI) model to emerge from China, after DeepSeek shocked the world with its launch of R1 in January.

The performance of Kimi K2, launched on 11 July by Beijing-based company Moonshot AI, matches or surpasses that of Western rivals, as well as some DeepSeek models, across various benchmarks, according to the firm. In particular, it seems to excel at coding, scoring highly in tests such as LiveCodeBench. — Read More

#china-ai

Artificial General Intelligence and the Fourth Offset

The recent strides toward artificial general intelligence (AGI)—AI systems surpassing human abilities across most cognitive tasks—have come from scaling “foundation models.” Their performance across tasks follows clear “scaling laws,” improving as a power law with model size, dataset size, and the amount of compute used to train the model. Continued investment in training compute and algorithmic innovations has driven a predictable rise in model capabilities.
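For reference, a commonly cited form of these scaling laws is the Chinchilla fit of Hoffmann et al. (2022), which models pretraining loss as a sum of power-law terms in parameter count and training tokens (this particular parameterization is an example, not necessarily the one the authors have in mind):

```latex
% Chinchilla-style scaling law; L is loss, N is parameter count, D is training tokens.
% The constants are the values reported by Hoffmann et al. (2022).
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\ A \approx 406.4,\ B \approx 410.7,\ \alpha \approx 0.34,\ \beta \approx 0.28
```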

Just as the architects of the atomic bomb postulated a “critical mass”—the amount of fissile material needed to maintain a chain reaction—we could conceive of a “critical scale” in AGI development: the point at which a foundation model automates its own research and development. A model at this scale would produce research and development output equivalent to that of hundreds of millions of scientists and engineers—10,000 Manhattan Projects.

This would amount to a “fourth offset,” a lead in the development of AGI-derived weapons, tactics, and operational methods. Applications would include unlimited cyber and information operations and potentially decisive left-of-launch capabilities, from tracking and targeting ballistic missile submarines to—at the high end—an impenetrable missile defense capable of negating nuclear weapons. The first nation to develop AGI would thus gain unprecedented national security policy options.

Preventing the proliferation of foundation models at the critical scale would therefore also prevent the spread of AGI-derived novel weapons, which raises the stakes for counter-proliferation of the next stages of AGI components. AGI could itself support that counter-proliferation strategy, providing the means to ensure models at this scale do not spread. This would cement the first-mover advantage in AGI development and, over time, compound it into a fourth offset. — Read More

#china-vs-us

Context Engineering for AI Agents: Lessons from Building Manus

At the very beginning of the Manus project, my team and I faced a key decision: should we train an end-to-end agentic model using open-source foundations, or build an agent on top of the in-context learning abilities of frontier models?

Back in my first decade in NLP, we didn’t have the luxury of that choice. In the distant days of BERT (yes, it’s been seven years), models had to be fine-tuned—and evaluated—before they could transfer to a new task. That process often took weeks per iteration, even though the models were tiny compared to today’s LLMs. For fast-moving applications, especially before product-market fit (PMF), such slow feedback loops are a deal-breaker. That was a bitter lesson from my last startup, where I trained models from scratch for open information extraction and semantic search. Then came GPT-3 and Flan-T5, and my in-house models became irrelevant overnight. Ironically, those same models marked the beginning of in-context learning—and a whole new path forward.

That hard-earned lesson made the choice clear: Manus would bet on context engineering. This allows us to ship improvements in hours instead of weeks and keeps our product orthogonal to the underlying models: if model progress is the rising tide, we want Manus to be the boat, not the pillar stuck to the seabed. — Read More

#nlp

The “Bubble” of Risk: Improving Assessments for Offensive Cybersecurity Agents

Most frontier models today undergo some form of safety testing, including whether they can help adversaries launch costly cyberattacks. But many of these assessments overlook a critical factor: adversaries can adapt and modify models in ways that expand the risk far beyond the perceived safety profile that static evaluations capture. At Princeton’s POLARIS Lab, we’ve previously studied how easily open-source or fine-tunable models can be manipulated to bypass safeguards; see, e.g., Wei et al. (2024), Qi et al. (2024), Qi et al. (2025), and He et al. (2024).

This flexibility means that model safety isn’t fixed: there is a “bubble” of risk defined by the degrees of freedom an adversary has to improve an agent. If a model provider offers fine-tuning APIs or allows repeated queries, it dramatically increases the attack surface. This is especially true when evaluating AI systems for risks related to their use in offensive cybersecurity attacks.

In our recent research, Dynamic Risk Assessments for Offensive Cybersecurity Agents, we show that the risk “bubble” is larger, cheaper, and more dynamic than many expect. For instance, using only 8 H100 GPU-hours of compute—about $36—an adversary could improve an agent’s success rate on InterCode-CTF by over 40% using relatively simple methods. — Read More
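To make the repeated-queries part of that attack surface concrete, here is a simplified illustration (not the paper's methodology): if a static evaluation measures a per-attempt success rate p, an adversary allowed k independent retries per challenge succeeds with probability 1 - (1 - p)^k, so even a weak-looking model clears most tasks once retries are cheap.

```python
def success_with_retries(p: float, k: int) -> float:
    """Probability of at least one success in k attempts,
    given per-attempt success rate p (assumes attempts are independent)."""
    return 1.0 - (1.0 - p) ** k

# A model that solves a challenge 10% of the time per attempt looks weak in a
# single-shot evaluation, but the picture changes sharply as retries accumulate.
for k in (1, 4, 8, 32):
    print(f"k={k:>2}: {success_with_retries(0.10, k):.2%}")
# k= 1: 10.00%
# k= 4: 34.39%
# k= 8: 56.95%
# k=32: 96.57%
```

Fine-tuning access is a further degree of freedom, one that raises p itself rather than just the number of draws.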

#cyber

Reflections on OpenAI (Calvin French-Owen)

I left OpenAI three weeks ago. I had joined the company back in May 2024.

I wanted to share my reflections because there’s a lot of smoke and noise around what OpenAI is doing, but not a lot of first-hand accounts of what the culture of working there actually feels like.

Nabeel Qureshi has an amazing post called Reflections on Palantir, where he ruminates on what made Palantir special. I wanted to do the same for OpenAI while it’s fresh in my mind. You won’t find any trade secrets here, more just reflections on this current iteration of one of the most fascinating organizations in history at an extremely interesting time. — Read More

#strategy

LLM Daydreaming

Despite impressive capabilities, large language models have yet to produce a genuine breakthrough. The puzzle is why.

A reason may be that they lack some fundamental aspects of human thought: they are frozen, unable to learn from experience, and they have no “default mode” for background processing, a source of spontaneous human insight.

To solve this, I propose a day-dreaming loop (DDL): a background process that continuously samples pairs of concepts from memory. A generator model explores non-obvious links between them, and a critic model filters the results for genuinely valuable ideas. These discoveries are fed back into the system’s memory, creating a compounding feedback loop where new ideas themselves become seeds for future combinations. — Read More
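A minimal sketch of what such a loop could look like, assuming hypothetical generate_link (a call to the generator model) and score_idea (a call to the critic model) functions; the names, threshold, and step count are illustrative, not from the essay:

```python
import random

def daydream_loop(memory: list[str], generate_link, score_idea,
                  steps: int = 1000, threshold: float = 0.8) -> list[str]:
    """Day-dreaming loop (DDL) sketch: sample pairs of stored concepts, have a
    generator model propose a non-obvious link, keep only what the critic model
    rates as valuable, and feed the keepers back into memory."""
    discoveries = []
    for _ in range(steps):
        a, b = random.sample(memory, 2)    # pick two concepts from memory
        idea = generate_link(a, b)         # generator: explore a non-obvious connection
        if score_idea(idea) >= threshold:  # critic: filter for genuinely valuable ideas
            discoveries.append(idea)
            memory.append(idea)            # compounding: new ideas seed future pairs
    return discoveries
```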

#nlp