Artificial general intelligence (AGI), AI that’s at least as capable as humans at most cognitive tasks, could be here within the coming years.
Integrated with agentic capabilities, AGI could supercharge AI's ability to understand, reason, plan, and execute actions autonomously. Such technological advancement would provide society with invaluable tools to address critical global challenges, including drug discovery, economic growth, and climate change.
This means we can expect tangible benefits for billions of people. For instance, by enabling faster, more accurate medical diagnoses, it could revolutionize healthcare. By offering personalized learning experiences, it could make education more accessible and engaging. By enhancing information processing, AGI could help lower barriers to innovation and creativity. By democratizing access to advanced tools and knowledge, it could enable a small organization to tackle complex challenges previously only addressable by large, well-funded institutions. — Read More
AI 2027
We predict that the impact of superhuman AI over the next decade will be enormous, exceeding that of the Industrial Revolution.
We wrote a scenario that represents our best guess about what that might look like. It’s informed by trend extrapolations, wargames, expert feedback, experience at OpenAI, and previous forecasting successes. — Read More
GSM8K-Platinum: Revealing Performance Gaps in Frontier LLMs
Recently, we introduced Platinum Benchmarks as a step toward quantifying the reliability of large language models (LLMs). In that work, we revised older benchmarks to minimize label noise, such as ambiguous or mislabeled examples, and showed that frontier LLMs still make genuine errors on simple questions. For example, as part of that work we revised a 300-problem subset of GSM8K, a dataset of grade school math word problems, and found that every LLM we tested made at least one genuine error. If revising just a subset of the dataset can surface new failures across models, what happens when we scale to all of GSM8K?
Today, we’re releasing GSM8K-Platinum, a revised version of the full GSM8K test set. Our comparative evaluation of several frontier LLMs on both the original and revised datasets demonstrates that GSM8K-Platinum provides a more accurate assessment of mathematical reasoning capabilities, revealing differences in performance that were previously hidden. — Read More
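To make the comparison concrete, here is a minimal sketch of such a side-by-side evaluation, assuming both test sets are published on the Hugging Face Hub. The `madrylab/gsm8k-platinum` dataset id, its config name, and the stub `answer_fn` are illustrative assumptions, not details from the post:

```python
# Sketch: evaluate the same answering function on the original GSM8K test set
# and on the revised GSM8K-Platinum set, then compare accuracies.
import re
from datasets import load_dataset

def final_number(answer: str) -> str:
    """GSM8K solutions end with '#### <number>'; extract that final number."""
    match = re.search(r"####\s*([-\d,.]+)", answer)
    return match.group(1).replace(",", "") if match else answer.strip()

def accuracy(dataset, answer_fn) -> float:
    """Fraction of problems where the model's final number matches the label."""
    correct = 0
    for example in dataset:
        predicted = answer_fn(example["question"])  # model's full solution text
        if final_number(predicted) == final_number(example["answer"]):
            correct += 1
    return correct / len(dataset)

def answer_fn(question: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "#### 42"

original = load_dataset("openai/gsm8k", "main", split="test")
platinum = load_dataset("madrylab/gsm8k-platinum", "main", split="test")  # assumed id

print(f"GSM8K:          {accuracy(original, answer_fn):.1%}")
print(f"GSM8K-Platinum: {accuracy(platinum, answer_fn):.1%}")
```

Under a setup like this, residual errors on the platinum split are more likely to be genuine reasoning failures than label noise, which is what makes the revised set a sharper measuring stick.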
Do Large Language Model Benchmarks Test Reliability?
When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior.
Motivated by this gap in the evaluation of reliability, we then propose platinum benchmarks: benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at this https URL — Read More
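As a rough illustration of the curation idea (not the authors' actual pipeline), one way to surface candidate label errors is to flag examples where several strong models agree with one another but disagree with the official label, then route those examples to manual review:

```python
# Sketch: flag examples where a consensus of models contradicts the label,
# since those are likely mislabeled or ambiguous. Illustrative only.
from collections import Counter

def flag_for_review(examples, model_answers, min_consensus=3):
    """examples: list of dicts with a 'label' field.
    model_answers: dict mapping model name -> list of answers (same order).
    Returns indices whose label conflicts with a model consensus."""
    flagged = []
    for i, example in enumerate(examples):
        votes = Counter(answers[i] for answers in model_answers.values())
        top_answer, count = votes.most_common(1)[0]
        if count >= min_consensus and top_answer != example["label"]:
            flagged.append(i)  # candidate label error or ambiguous question
    return flagged

examples = [{"label": "8"}, {"label": "12"}, {"label": "5"}]
model_answers = {
    "model_a": ["8", "10", "5"],
    "model_b": ["8", "10", "5"],
    "model_c": ["8", "10", "6"],
}
print(flag_for_review(examples, model_answers, min_consensus=2))  # -> [1]
```

Flagged examples would then be inspected by hand and either corrected or removed, leaving a "platinum" subset where remaining model errors can be attributed to the models rather than the labels.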