A sophisticated new artificial intelligence (AI) platform tailored for offensive cyber operations, named Xanthorox AI, has been identified by cybersecurity firm SlashNext. First appearing in late Q1 2025, Xanthorox AI is reportedly circulating within cybercrime communities on darknet forums and encrypted channels.

According to SlashNext’s investigation, shared with Hackread.com ahead of its publication on Monday, Xanthorox stands out from previous malicious AI tools like WormGPT, FraudGPT and EvilGPT due to its independent, multi-model framework. The system is based on five distinct AI models optimized for specific cyber operations.
These models are hosted on private servers under the seller’s control rather than public cloud infrastructure or openly accessible APIs. This unique setup sets Xanthorox AI apart from previous malicious tools that often relied on existing large language models (LLMs). — Read More
Amazon Nova Reel 1.1: Featuring up to 2-minute multi-shot videos
At re:Invent 2024, we announced Amazon Nova models, a new generation of foundation models (FMs), including Amazon Nova Reel, a video generation model that creates short videos from text descriptions and optional reference images (together, the “prompt”).
Today, we introduce Amazon Nova Reel 1.1, which provides quality and latency improvements in 6-second single-shot video generation, compared to Amazon Nova Reel 1.0. This update lets you generate multi-shot videos up to 2 minutes in length with consistent style across shots. You can either provide a single prompt for up to a 2-minute video composed of 6-second shots, or design each shot individually with custom prompts. This gives you new ways to create video content through Amazon Bedrock. — Read More
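For readers who want to try this, here is a minimal sketch of kicking off a Nova Reel job through Bedrock's asynchronous inference API with boto3. The start_async_invoke / get_async_invoke calls are the standard Bedrock async flow; the exact payload field names for multi-shot generation (taskType, multiShotAutomatedParams, durationSeconds) are assumptions here and should be checked against the Nova Reel documentation.

```python
# Sketch: asynchronous Nova Reel video generation on Amazon Bedrock.
# Payload field names for multi-shot mode are assumptions -- verify against
# the Amazon Nova Reel docs before use.
import time
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

job = client.start_async_invoke(
    modelId="amazon.nova-reel-v1:1",          # assumed model ID for Reel 1.1
    modelInput={
        "taskType": "MULTI_SHOT_AUTOMATED",   # one prompt, model plans the shots
        "multiShotAutomatedParams": {
            "text": "A drone tour of a coastal town at golden hour",
        },
        "videoGenerationConfig": {
            "durationSeconds": 120,           # up to 2 minutes of 6-second shots
            "fps": 24,
            "dimension": "1280x720",
        },
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-video-bucket/nova-reel/"}
    },
)

# Poll until the video has been rendered and written to S3.
while True:
    status = client.get_async_invoke(invocationArn=job["invocationArn"])
    if status["status"] != "InProgress":
        print("Finished with status:", status["status"])
        break
    time.sleep(30)
```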
The day I taught AI to think like a Senior Developer
Is it just me, or are the code generation AIs we’re all using fundamentally broken?
For months, I’ve watched developers praise AI coding tools while silently cleaning up their messes, afraid to admit how much babysitting they actually need.
I realized that AI IDEs don’t actually understand codebases — they’re just sophisticated autocomplete tools with good marketing. The emperor has no clothes, and I’m tired of pretending otherwise.
After two years of frustration watching my AI assistants constantly “forget” where files were located, create duplicates, and use completely incorrect patterns, I finally built what the big AI companies couldn’t (or wouldn’t).
I decided to find out: What if I could make AI actually understand how my codebase works? — Read More
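The article does not show its implementation, but one simple way to attack the “AI forgets where files live” problem is to hand the model a map of the repository up front. The sketch below (not the author’s actual tool) indexes a Python codebase with the standard ast module and renders file paths plus their top-level symbols into prompt context; the function names and prompt wording are illustrative.

```python
# Sketch: build a lightweight map of a Python repo -- file paths plus their
# top-level classes and functions -- and prepend it to a code-generation
# prompt so the model knows what already exists. Not the article's tool.
import ast
from pathlib import Path

def index_repo(root: str) -> dict[str, list[str]]:
    """Map each .py file to the names of its top-level classes and functions."""
    index: dict[str, list[str]] = {}
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that don't parse
        index[str(path.relative_to(root))] = [
            node.name
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
    return index

def build_context(index: dict[str, list[str]], task: str) -> str:
    """Render the index as prompt context ahead of the actual request."""
    lines = ["You are editing an existing codebase. Files and their symbols:"]
    for file, names in sorted(index.items()):
        lines.append(f"- {file}: {', '.join(names) or '(no top-level defs)'}")
    lines.append(f"\nTask: {task}\nReuse existing modules; do not create duplicates.")
    return "\n".join(lines)

if __name__ == "__main__":
    repo_index = index_repo(".")
    print(build_context(repo_index, "Add a retry wrapper around the HTTP client"))
```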
Google announces Sec-Gemini v1, a new experimental cybersecurity model
[D]efenders face the daunting task of securing against all cyber threats, while attackers need to successfully find and exploit only a single vulnerability. This fundamental asymmetry has made securing systems extremely difficult, time consuming and error prone. AI-powered cybersecurity workflows have the potential to help shift the balance back to the defenders by force multiplying cybersecurity professionals like never before.
Effectively powering SecOps workflows requires state-of-the-art reasoning capabilities and extensive current cybersecurity knowledge. Sec-Gemini v1 achieves this by combining Gemini’s advanced capabilities with near real-time cybersecurity knowledge and tooling. This combination allows it to achieve superior performance on key cybersecurity workflows, including incident root cause analysis, threat analysis, and vulnerability impact understanding. — Read More
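Sec-Gemini itself is not publicly available, but the pattern the post describes, pairing a model with near real-time vulnerability data, is easy to illustrate. The sketch below pulls CVE metadata from NIST’s public NVD 2.0 API and hands it to a placeholder model call; ask_model and the prompt shape are assumptions, not Google’s implementation.

```python
# Sketch of the "LLM + near real-time vulnerability data" pattern -- not
# Sec-Gemini itself. CVE details come from NIST's public NVD 2.0 API;
# ask_model() is a placeholder for whatever LLM client you use.
import requests

NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def fetch_cve(cve_id: str) -> dict:
    """Pull current metadata for one CVE from the NVD."""
    resp = requests.get(NVD_URL, params={"cveId": cve_id}, timeout=30)
    resp.raise_for_status()
    return resp.json()["vulnerabilities"][0]["cve"]

def ask_model(prompt: str) -> str:
    """Placeholder: swap in your actual LLM client here."""
    raise NotImplementedError

def assess_impact(cve_id: str, environment: str) -> str:
    cve = fetch_cve(cve_id)
    description = " ".join(
        d["value"] for d in cve["descriptions"] if d["lang"] == "en"
    )
    prompt = (
        f"Vulnerability {cve_id}: {description}\n"
        f"Our environment: {environment}\n"
        "Assess the likely impact and suggest prioritized mitigations."
    )
    return ask_model(prompt)

# Example: assess_impact("CVE-2021-44228", "Java services behind an internal load balancer")
```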
How to evaluate an LLM system
Evaluating large language model (LLM) based applications is inherently challenging due to the unique nature of these systems. Unlike traditional software applications, where outputs are deterministic and predictable, LLMs generate outputs that can vary each time they are run, even with the same input. This variability arises from the probabilistic nature of these models, which means there is no single correct output for any given input. Consequently, testing LLM-based applications requires specialized evaluation techniques — known today as ‘evals’ — to ensure they meet performance and reliability standards. — Read More
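Because outputs vary run to run, a common eval pattern is to sample each test case several times and track a pass rate rather than a single pass/fail. The sketch below shows that pattern in its simplest form; the generate function is a placeholder for whatever model client you use, and the example check is deliberately trivial.

```python
# Sketch of a minimal "eval" harness: each case is sampled several times and
# scored against a checker, and we report the pass rate per case.
# generate() is a placeholder for your LLM client.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]   # returns True if the output is acceptable

def generate(prompt: str) -> str:
    """Placeholder: call your LLM here (OpenAI, Bedrock, local model, ...)."""
    raise NotImplementedError

def run_eval(cases: list[EvalCase], samples: int = 5) -> dict[str, float]:
    """Sample each case several times and report the per-case pass rate."""
    results: dict[str, float] = {}
    for case in cases:
        passes = sum(case.check(generate(case.prompt)) for _ in range(samples))
        results[case.prompt] = passes / samples
    return results

cases = [
    EvalCase(
        prompt="Extract the year from: 'The treaty was signed in 1648.' Reply with the year only.",
        check=lambda out: out.strip() == "1648",
    ),
]
# print(run_eval(cases))   # e.g. {'Extract the year ...': 0.8}
```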
So You Uploaded Your Brain… Now What?
Taking a responsible path to AGI
Artificial general intelligence (AGI), AI that’s at least as capable as humans at most cognitive tasks, could be here within the coming years.
Integrated with agentic capabilities, AGI could supercharge AI to understand, reason, plan, and execute actions autonomously. Such technological advancement will provide society with invaluable tools to address critical global challenges, including drug discovery, economic growth and climate change.
This means we can expect tangible benefits for billions of people. For instance, by enabling faster, more accurate medical diagnoses, it could revolutionize healthcare. By offering personalized learning experiences, it could make education more accessible and engaging. By enhancing information processing, AGI could help lower barriers to innovation and creativity. By democratising access to advanced tools and knowledge, it could enable a small organization to tackle complex challenges previously only addressable by large, well-funded institutions. — Read More
AI 2027
We predict that the impact of superhuman AI over the next decade will be enormous, exceeding that of the Industrial Revolution.
We wrote a scenario that represents our best guess about what that might look like. It’s informed by trend extrapolations, wargames, expert feedback, experience at OpenAI, and previous forecasting successes. — Read More
GSM8K-Platinum: Revealing Performance Gaps in Frontier LLMs
Recently, we introduced Platinum Benchmarks as a step toward quantifying the reliability of large language models (LLMs). In that work, we revised older benchmarks to minimize label noise, such as ambiguous or mislabeled examples, and showed that frontier LLMs still make genuine errors on simple questions. For example, as part of that work we revised a 300-problem subset of GSM8K, a dataset of grade school math word problems, and found that all LLMs we tested made at least one genuine error. If certifying the precision of just a subset of the dataset can highlight new failures across models, what if we scale to all of GSM8K?
Today, we’re releasing GSM8K-Platinum, a revised version of the full GSM8K test set. Our comparative evaluation of several frontier LLMs on both the original and revised datasets demonstrates that GSM8K-Platinum provides a more accurate assessment of mathematical reasoning capabilities, revealing differences in performance that were previously hidden. — Read More
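A comparison like the one described could be run with a short script that scores the same model on the original GSM8K test split and on the revised set. In the sketch below, the Hugging Face dataset ID "madrylab/gsm8k-platinum", its split/answer format, and the solve function are assumptions; check the release for the actual identifiers.

```python
# Sketch: score one model on original GSM8K and on the revised set, then
# compare. Dataset ID and the assumption that the revised set keeps GSM8K's
# "#### <answer>" format are unverified; solve() is a placeholder model call.
import re
from datasets import load_dataset

def final_answer(answer_field: str) -> str:
    """GSM8K stores the gold answer after '####'."""
    return answer_field.split("####")[-1].strip().replace(",", "")

def solve(question: str) -> str:
    """Placeholder: prompt your model and return its final numeric answer."""
    raise NotImplementedError

def accuracy(split) -> float:
    correct = 0
    for row in split:
        pred = re.sub(r"[^\d.-]", "", solve(row["question"]))
        correct += pred == final_answer(row["answer"])
    return correct / len(split)

original = load_dataset("openai/gsm8k", "main", split="test")
platinum = load_dataset("madrylab/gsm8k-platinum", split="test")  # assumed ID

print("original GSM8K:", accuracy(original))
print("GSM8K-Platinum:", accuracy(platinum))
```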
Do Large Language Model Benchmarks Test Reliability?
When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs’ growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior.
Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at this https URL — Read More