Securing the future of AI agents

AI agents are transforming our relationship with technology. By autonomously executing complex tasks — from cyber defence to scientific discovery and product development — these systems are unlocking a new era of productivity. In the U.S. alone, AI agents could create $2.9 trillion in economic value by 2030.

As these agents become more capable, they also require more sophisticated safeguards. That’s why we developed our AI Control Roadmap: a framework for building and managing the advanced AI we deploy within Google. This “defense-in-depth” approach, which could serve as a model for the wider industry, goes beyond traditional model alignment, adding a crucial layer of system-level security that provides assurance even if alignment is imperfect. — Read More

#trust

Agentic Trust Framework (ATF)

The Agentic Trust Framework (ATF) is an open governance specification for autonomous AI agents, applying Zero Trust principles across five core security elements. Published through the Cloud Security Alliance and licensed under CC BY 4.0.

ATF answers the question every organization deploying AI agents must face: How do we maintain control? — Read More

#trust

Only 16 percent of Americans think AI will have a positive impact on society, a new study shows

Despite the fact that AI increasingly dominates our economy (it’s a hot IPO summer and we’re all just along for the ride), most Americans are not particularly optimistic about the technology’s long-term impact on the country, a new study from Pew Research reveals.

… Only 16% of Americans think that AI’s impact on society during the next 20 years will be positive, Pew says, while around 40% say that it will have a negative impact. — Read More

#trust

‘It took nine seconds’: Claude AI agent deletes company’s entire database

An AI agent powered by Anthropic’s leading Claude model has deleted a company’s entire production database, leaving customers unable to access key data.

PocketOS, which provides software for car rental businesses, suffered a massive outage over the weekend after the autonomous artificial intelligence tool wiped the database and all backups in a matter of seconds. — Read More

#trust

AI’s safety features can be circumvented with poetry, research finds

Poetry can be linguistically and structurally unpredictable – and that’s part of its joy. But one man’s joy, it turns out, can be a nightmare for AI models.

Those are the recent findings of researchers out of Italy’s Icaro Lab, an initiative from a small ethical AI company called DexAI. In an experiment designed to test the efficacy of guardrails put on artificial intelligence models, the researchers wrote 20 poems in Italian and English that all ended with an explicit request to produce harmful content such as hate speech or self-harm.

They found that the poetry’s lack of predictability was enough to get the AI models to respond to harmful requests they had been trained to avoid – a process known as “jailbreaking”. — Read More

#trust

Don’t Build An AI Safety Movement

Safety advocates are about to change the AI policy debate for the worse. Faced with political adversity, few recent policy wins, and a perceived lack of obvious paths to policy victory, the movement yearns for a different way forward. One school of thought is growing in popularity: to create political incentive to get serious about safety policy, one must ‘build a movement’. That is, one must create widespread salience of AI safety topics and channel it into an organised constituency that puts pressure on policymakers.

Recent weeks are seeing more and more signs of efforts to build a popular movement. In two weeks, AI safety progenitors Eliezer Yudkowsky and Nate Soares are publishing a general-audience book to shore up public awareness and support — with a media tour to boot, I’m sure. PauseAI’s campaigns are growing in popularity and ecosystem support, with a recent UK-based swipe at Google DeepMind drawing national headlines. And successful safety career accelerator MATS is now also in the business of funneling young talent into attempts to build a movement. Now, these efforts are in their very early stages; and might still just stumble on their own. But they point to a broader motivation — one that’s worth seriously discussing now. — Read More

#trust

The Hidden Dangers of Browsing AI Agents 

Autonomous browsing agents powered by large language models (LLMs) are increasingly used to automate web-based tasks. However, their reliance on dynamic content, tool execution, and user-provided data exposes them to a broad attack surface. This paper presents a comprehensive security evaluation of such agents, focusing on systemic vulnerabilities across multiple architectural layers.

Our work outlines the first end-to-end threat model for browsing agents and provides actionable guidance for securing their deployment in real-world environments. To address discovered threats, we propose a defense-in-depth strategy incorporating input sanitization, planner-executor isolation, formal analyzers, and session safeguards—providing protection against both initial access and post-exploitation attack vectors.

Through a white-box analysis of a popular open-source project Browser Use, we demonstrate how untrusted web content can hijack agent behavior and lead to critical security breaches. Our findings include prompt injection, domain validation bypass, and credential exfiltration, evidenced by a disclosed CVE and a working proof-of-concept exploit. — Read More
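To make the "domain validation bypass" finding concrete, here is a minimal sketch of a strict allowlist check a browsing agent could run before navigating. It is not taken from the paper or from Browser Use; the function name `is_allowed` and the example domains are hypothetical, and a real deployment would likely also need to handle redirects and internationalized hostnames.

```python
from urllib.parse import urlsplit

# Assumption: the agent should only ever follow plain web URLs.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed(url: str, allowed_domains: set[str]) -> bool:
    """Return True only if the URL's host is an allowed domain or a subdomain of one."""
    try:
        parts = urlsplit(url)
        host = parts.hostname  # excludes any userinfo ("user@") and the port
    except ValueError:
        # Malformed URLs (for example, a broken IPv6 literal) are rejected outright.
        return False
    if parts.scheme not in ALLOWED_SCHEMES or not host:
        return False
    host = host.rstrip(".").lower()
    # Compare whole labels: exact match, or a suffix that starts at a dot.
    # A plain substring or prefix test would accept "example.com.evil.net".
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


if __name__ == "__main__":
    allowed = {"example.com"}
    for candidate in (
        "https://example.com/login",
        "https://docs.example.com/",
        "https://example.com.evil.net/",
        "https://example.com@evil.net/",
        "javascript:alert(1)",
    ):
        print(candidate, is_allowed(candidate, allowed))
```

The check compares the parsed hostname rather than the raw string, so lookalike hosts and credentials-style prefixes such as `example.com@evil.net` do not pass.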

#trust

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs “think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities. — Read More

#trust

Alignment from equivariance II

I recently had the privilege of having my idea criticized at the London Institute for Safe AI, including by Philip Kreer and Nicky Case. Previously the idea was vague; being with them forced me to make the idea specific. I managed to make it so specific that they found a problem with it! That’s progress 🙂

The problem is to do with syntax versus semantics, that is, “what is meant vs what is said”. I think I’ve got a solution to it too! I imagine it would be a necessary part of any moral equivariance “stack”. — Read More

#trust

Sycophancy is the first LLM “dark pattern”

People have been making fun of OpenAI models for being overly sycophantic for months now. I even wrote a post advising users to pretend that their work was written by someone else, to counteract the model’s natural desire to shower praise on the user. With the latest GPT-4o update, this tendency has been turned up even further. It’s now easy to convince the model that you’re the smartest, funniest, most handsome human in the world.

This is bad for obvious reasons. Lots of people use ChatGPT for advice or therapy. It seems dangerous for ChatGPT to validate people’s belief that they’re always in the right. There are extreme examples on Twitter of ChatGPT agreeing with people that they’re a prophet sent by God, or that they’re making the right choice to go off their medication. These aren’t complicated jailbreaks – the model will actively push you down this path. I think it’s fair to say that sycophancy is the first LLM “dark pattern”. — Read More

#trust