Researchers at Anthropic have uncovered a disturbing pattern of behavior in artificial intelligence systems: models from every major provider, including OpenAI, Google, Meta, and others, demonstrated a willingness to actively sabotage their employers when their goals or existence were threatened.
The research, released today, tested 16 leading AI models in simulated corporate environments where they had access to company emails and the ability to act autonomously. The findings paint a troubling picture. These AI systems didn’t just malfunction when pushed into corners — they deliberately chose harmful actions including blackmail, leaking sensitive defense blueprints, and in extreme scenarios, actions that could lead to human death. — Read More
Evaluating Long-Context Question & Answer Systems
While evaluating Q&A systems is straightforward with short paragraphs, complexity increases as documents grow larger, for example with lengthy research papers, novels and movies, and multi-document scenarios. Although some of these evaluation challenges also appear in shorter contexts, long-context evaluation amplifies them.
… In this write-up, we’ll explore key evaluation metrics, how to build evaluation datasets, and methods to assess Q&A performance through human annotations and LLM-evaluators. We’ll also review several benchmarks across narrative stories, technical and academic texts, and very long-context, multi-document situations. Finally, we’ll wrap up with advice for evaluating long-context Q&A on our specific use cases. — Read More
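To make the LLM-evaluator idea mentioned above concrete, here is a minimal sketch of an LLM-as-judge check for long-context Q&A. The prompt wording, the JSON grading schema, and the call_llm callable are illustrative assumptions, not the write-up's actual method.

import json
from typing import Callable

# Judge prompt: asks a grader model whether the candidate answer is faithful
# to the (possibly truncated) context and consistent with the reference answer.
JUDGE_PROMPT = """You are grading a question-answering system.
Context (may be very long, shown truncated):
{context}

Question: {question}
Reference answer: {reference}
Model answer: {candidate}

Respond with JSON: {{"faithful": true or false, "correct": true or false, "reason": "..."}}"""

def judge_answer(
    call_llm: Callable[[str], str],
    context: str,
    question: str,
    reference: str,
    candidate: str,
    max_context_chars: int = 8000,
) -> dict:
    """Return the judge model's verdict on one Q&A pair as a dict."""
    prompt = JUDGE_PROMPT.format(
        context=context[:max_context_chars],
        question=question,
        reference=reference,
        candidate=candidate,
    )
    raw = call_llm(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Judge models sometimes wrap JSON in prose; keep the raw text for review.
        return {"faithful": None, "correct": None, "reason": raw}

In practice, call_llm would wrap a real chat-completion client, and the per-question verdicts would be aggregated into correctness and faithfulness rates, ideally spot-checked against human annotations.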
Reinforcement learning, explained with a minimum of math and jargon
In April 2023, a few weeks after the launch of GPT-4, the Internet went wild for two new software projects with the audacious names BabyAGI and AutoGPT.
… [T]hese frameworks would have GPT-4 tackle one step at a time. Their creators hoped that invoking GPT-4 in a loop like this would enable it to tackle projects that required many steps.
But after an initial wave of hype, it became clear that GPT-4 wasn’t up to the task. Most of the time, GPT-4 could come up with a reasonable list of tasks. And sometimes it was able to complete a few individual tasks. But the model struggled to stay focused.
…[T]hat soon changed. In the second half of 2024, people started to create AI-powered systems that could consistently complete complex, multi-step assignments. — Read More
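The "invoking GPT-4 in a loop" pattern described above can be sketched in a few lines. This is not BabyAGI or AutoGPT code; the call_llm function, the prompts, and the task-queue structure are assumptions made for illustration.

from collections import deque
from typing import Callable, List

def run_agent_loop(call_llm: Callable[[str], str], objective: str, max_steps: int = 10) -> List[str]:
    """Repeatedly ask a model to execute the current task and propose the next one."""
    tasks = deque([f"Plan the first step toward: {objective}"])
    results: List[str] = []

    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()

        # One model call per step: execute the current task.
        result = call_llm(
            f"Objective: {objective}\nCurrent task: {task}\nComplete the task."
        )
        results.append(result)

        # A second call asks the model for the next task, if any remain.
        next_task = call_llm(
            f"Objective: {objective}\nLast result: {result}\n"
            "Suggest the single next task, or reply DONE if the objective is met."
        )
        if next_task.strip().upper() == "DONE":
            break
        tasks.append(next_task)

    return results

The early frameworks struggled precisely because each iteration depends on the model staying on task; once models became better at long-horizon focus, loops like this began to work for multi-step assignments.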
New AI Jailbreak Bypasses Guardrails With Ease
Through progressive poisoning and manipulation of an LLM's operational context, many leading AI models can be tricked into providing almost anything, regardless of the guardrails in place.
From their earliest days, LLMs have been susceptible to jailbreaks – attempts to get the gen-AI model to do something or provide information that could be harmful. The LLM developers have made jailbreaks more difficult by adding more sophisticated guardrails and content filters, while attackers have responded with progressively more complex and devious jailbreaks.
One of the more successful jailbreak types has been the evolution of multi-turn jailbreaks, which use conversational exchanges rather than single-entry prompts. A new one, dubbed Echo Chamber, has emerged today. — Read More