GSM8K-Platinum: Revealing Performance Gaps in Frontier LLMs

Recently, we introduced Platinum Benchmarks as a step toward quantifying the reliability of large language models (LLMs). In that work, we revised older benchmarks to minimize label noise, such as ambiguous or mislabeled examples, and showed that frontier LLMs still make genuine errors on simple questions. For example, as part of that work we revised a 300-problem subset of GSM8K, a dataset of grade school math word problems, and found that all LLMs we tested made at least one genuine error. If certifying the precision of just a subset of the dataset can highlight new failures across models, what if we scale to all of GSM8K?

Today, we’re releasing GSM8K-Platinum, a revised version of the full GSM8K test set. Our comparative evaluation of several frontier LLMs on both the original and revised datasets demonstrates that GSM8K-Platinum provides a more accurate assessment of mathematical reasoning capabilities, revealing differences in performance that were previously hidden. — Read More

#performance

Do Large Language Model Benchmarks Test Reliability?

When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs’ growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior.

Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at this https URL. – Read More

#performance

PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. Through empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In the cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitation. Our experiments show that the per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative to tensor parallelism (TP), offering up to a 19% acceleration with even lower memory consumption. The implementation is open-sourced at this https URL. – Read More

#performance

“It’s a lemon”—OpenAI’s largest AI model ever arrives to mixed reviews

The verdict is in: OpenAI’s newest and most capable traditional AI model, GPT-4.5, is big, expensive, and slow, providing marginally better performance than GPT-4o at 30x the cost for input and 15x the cost for output. The new model seems to prove that longstanding rumors of diminishing returns in training unsupervised-learning LLMs were correct and that the so-called “scaling laws” cited by many for years have possibly met their natural end. — Read More

#performance

Evolving Deeper LLM Thinking

We explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine candidate responses. The proposed approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver. — Read More
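The generate, recombine, and refine cycle described above can be illustrated with a generic evolutionary loop. This is a minimal sketch under stated assumptions, not the Mind Evolution algorithm itself: the function `evolve`, its parameters, and the toy string-matching task are all hypothetical, and the random string operations stand in for the language-model calls that the paper uses at each step. The only thing the loop needs from the task is a solution evaluator, which mirrors the abstract's point that no formal problem statement is required.

```python
import random

# Toy task (an assumption for illustration): recover a hidden target string.
TARGET = "plan"
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def generate(rng):
    """Stand-in for asking a language model for a fresh candidate."""
    return "".join(rng.choice(ALPHABET) for _ in TARGET)


def recombine(a, b, rng):
    """Stand-in for merging two candidates: one-point crossover."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def refine(candidate, rng):
    """Stand-in for a revision step: change one character."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def evaluate(candidate):
    """Solution evaluator: number of positions that match the target."""
    return sum(x == y for x, y in zip(candidate, TARGET))


def evolve(population_size=8, generations=30, seed=0):
    """Keep the better half each generation and refill with refined children."""
    rng = random.Random(seed)
    population = [generate(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=evaluate, reverse=True)
        parents = population[: population_size // 2]  # survivors are kept as-is
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(refine(recombine(a, b, rng), rng))
        population = parents + children
    return max(population, key=evaluate)
```

Because the surviving parents are carried over unchanged, the best score in the population never decreases from one generation to the next; whether it reaches the maximum depends on the budget and the seed.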

#performance

#training

Microsoft’s smaller AI model beats the big guys: Meet Phi-4, the efficiency king

Microsoft launched a new artificial intelligence model today that achieves remarkable mathematical reasoning capabilities while using far fewer computational resources than its larger competitors. The 14-billion-parameter Phi-4 frequently outperforms much larger models like Google’s Gemini Pro 1.5, marking a significant shift in how tech companies might approach AI development.

The breakthrough directly challenges the AI industry’s “bigger is better” philosophy, where companies have raced to build increasingly massive models. While competitors like OpenAI’s GPT-4o and Google’s Gemini Ultra operate with hundreds of billions or possibly trillions of parameters, Phi-4’s streamlined architecture delivers superior performance in complex mathematical reasoning. — Read More

#performance

Scaling and evaluating sparse autoencoders

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer. — Read More
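The idea of controlling sparsity directly can be shown with a top-k activation: instead of tuning a sparsity penalty, the encoder keeps only the k largest latent pre-activations and zeroes the rest. The sketch below is a plain-Python illustration, not the released training code; the function name `topk_activation` and the tie-breaking rule are assumptions.

```python
def topk_activation(pre_activations, k):
    """Keep the k largest pre-activations and set all others to zero.

    Ties are broken in favor of the lower index (an arbitrary choice
    made for this sketch).
    """
    if not 0 <= k <= len(pre_activations):
        raise ValueError("k must be between 0 and the number of latents")
    # Indices sorted by value, largest first; sorted() is stable, so equal
    # values keep their original order.
    order = sorted(
        range(len(pre_activations)),
        key=lambda i: pre_activations[i],
        reverse=True,
    )
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


def count_active(latents):
    """Number of non-zero latents, i.e. the sparsity level actually used."""
    return sum(1 for v in latents if v != 0.0)
```

For example, `topk_activation([0.2, -1.0, 3.5, 0.7], 2)` returns `[0.0, 0.0, 3.5, 0.7]`. The number of active latents is fixed at k by construction (assuming the kept values are non-zero), which is why no separate sparsity coefficient has to be balanced against reconstruction error.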

#performance

Ai2’s Molmo shows open source can meet, and beat, closed multimodal models

The common wisdom is that companies like Google, OpenAI, and Anthropic, with bottomless cash reserves and hundreds of top-tier researchers, are the only ones that can make a state-of-the-art foundation model. But as one among them famously noted, they “have no moat” — and Ai2 showed that today with the release of Molmo, a multimodal AI model that matches their best while also being small, free, and truly open source.

… Molmo (coming in 72B, 7B, and 1B-parameter variants), like other multimodal models, is capable of identifying and answering questions about almost any everyday situation or object. How do you work this coffee maker? How many dogs in this picture have their tongues out? Which options on this menu are vegan? What are the variables in this diagram? It’s the kind of visual understanding task we’ve seen demonstrated with varying levels of success and latency for years.

What’s different is not necessarily Molmo’s capabilities (which you can see in the demo below, or test here), but how it achieves them. — Read More

#performance

Reflection 70B model maker breaks silence amid fraud accusations

Matt Shumer, co-founder and CEO of OthersideAI, also known by the name of its signature AI writing assistant HyperWrite, has broken nearly two days of silence after being accused of fraud when third-party researchers were unable to replicate the supposed top performance of a new large language model (LLM) he released on Thursday, September 5.

On his account on the social network X, Shumer apologized and claimed he “Got ahead of himself,” adding “I know that many of you are excited about the potential for this and are now skeptical.” — Read More

#performance

OpenAI’s Strawberry and Orion: The Next Leap in AI Evolution

In the ever-evolving landscape of artificial intelligence, OpenAI continues to push the boundaries of what’s possible. Their latest endeavors, the Strawberry and Orion AI models, are poised to redefine our expectations of machine intelligence. Let’s dive into what makes these models tick and why they matter.

… What sets Strawberry apart is its ability to think — and I mean really think. It’s not just pattern matching or regurgitating training data. This AI is solving problems it’s never seen before, like a mathematician encountering a novel proof. — Read More

#performance