Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale “serial” test-time compute by increasing the number of iterations per trajectory and “parallel” test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys/. — Read More
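To make the parallel-scaling-plus-selection idea concrete, here is a minimal sketch of sampling many candidate edits and voting with model-generated tests. The helpers `generate_edit`, `generate_test`, and `run_test` are hypothetical stand-ins for LLM calls and a sandboxed test runner, not the paper's actual implementation.

```python
# Sketch of parallel test-time scaling with test-based voting, in the spirit
# of CodeMonkeys. generate_edit, generate_test, and run_test are hypothetical
# placeholders for model calls and a sandboxed test runner.
from collections import defaultdict

def solve_issue(issue, num_trajectories=8, num_tests=4):
    # Parallel scaling: sample many independent candidate edits.
    candidates = [generate_edit(issue) for _ in range(num_trajectories)]
    # Model-generated tests act as weak verifiers.
    tests = [generate_test(issue) for _ in range(num_tests)]

    # Vote: each candidate scores one point per generated test it passes.
    votes = defaultdict(int)
    for edit in candidates:
        for test in tests:
            if run_test(edit, test):
                votes[edit] += 1

    # Pick the candidate with the most passing tests; in the paper, ties are
    # further resolved by a dedicated multi-turn selection trajectory.
    return max(candidates, key=lambda e: votes[e])
```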
Optimizing LLM Performance with LM Cache: Architectures, Strategies, and Real-World Applications
This article offers an in-depth, research-minded technical view of how LM Cache operates and how its caching machinery improves the efficiency and scalability of Large Language Model (LLM) deployments while reducing their cost. We examine different caching architectures and mechanisms, how they can be integrated with modern AI infrastructure, and how they are evaluated for performance. Examples from the field detail how some of our largest customers are deploying LM Caches in practice and what they have learned along the way. Finally, we conclude by highlighting some challenges, limitations, and future directions in this fast-evolving field. — Read More
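As a point of reference for the simplest end of the design space the article surveys, the sketch below shows an exact-match response cache keyed on the prompt and sampling parameters. It is a generic illustration, not the LM Cache product's API; `llm_generate` is a hypothetical placeholder for a real model call.

```python
# Illustrative exact-match response cache for LLM calls. This is a generic
# sketch, not LM Cache's actual API; llm_generate is a hypothetical placeholder.
import hashlib

class ResponseCache:
    def __init__(self):
        self._store = {}

    def _key(self, prompt: str, params: dict) -> str:
        payload = prompt + "|" + repr(sorted(params.items()))
        return hashlib.sha256(payload.encode()).hexdigest()

    def generate(self, prompt: str, **params) -> str:
        key = self._key(prompt, params)
        if key in self._store:          # cache hit: skip the expensive model call
            return self._store[key]
        response = llm_generate(prompt, **params)   # hypothetical model call
        self._store[key] = response     # populate on miss
        return response
```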
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost. — Read More
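The core mechanism is easier to see in code. Below is a heavily simplified sketch: one shared block is applied repeatedly, and a lightweight router assigns each token a recursion depth. The hyperparameters and routing rule are illustrative, and this toy version still runs the block on all tokens, so it does not capture the paper's actual compute and KV-cache savings.

```python
# Simplified sketch of the Mixture-of-Recursions idea: a shared layer stack is
# reused across recursion steps, and a per-token router picks the recursion depth.
# Illustrative only; not the paper's exact architecture or training recipe.
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    def __init__(self, d_model=64, max_depth=3):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, nhead=4, batch_first=True)       # parameters reused at every depth
        self.router = nn.Linear(d_model, max_depth)   # scores a recursion depth per token
        self.max_depth = max_depth

    def forward(self, x):                              # x: (batch, seq, d_model)
        depths = self.router(x).argmax(dim=-1) + 1     # per-token recursion depth in [1, max_depth]
        for step in range(self.max_depth):
            active = (depths > step).unsqueeze(-1)     # tokens still recursing at this step
            updated = self.shared_block(x)
            x = torch.where(active, updated, x)        # finished tokens pass through unchanged
        return x

x = torch.randn(2, 16, 64)
print(MoRSketch()(x).shape)   # torch.Size([2, 16, 64])
```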
Inverse Scaling in Test-Time Compute
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs. — Read More
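The paper's final recommendation, evaluating across diverse reasoning lengths, amounts to a simple sweep. The sketch below shows that protocol in outline; `query_model` and `is_correct` are hypothetical placeholders for an API call with a reasoning-budget knob and a task-specific grader.

```python
# Sketch of the evaluation protocol the paper motivates: measure accuracy at
# several reasoning-length budgets and check whether it degrades. query_model
# and is_correct are hypothetical placeholders.
def accuracy_vs_reasoning_length(tasks, budgets=(512, 2048, 8192)):
    results = {}
    for budget in budgets:
        correct = 0
        for task in tasks:
            answer = query_model(task.prompt, max_reasoning_tokens=budget)
            correct += is_correct(task, answer)
        results[budget] = correct / len(tasks)
    return results   # inverse scaling shows up as accuracy falling as the budget grows
```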
Life of an inference request (vLLM V1): How LLMs are served efficiently at scale
vLLM is an open-source inference engine that serves large language models. We deploy multiple vLLM instances across GPUs and load open weight models like Llama 4 into them. We then load balance traffic across vLLM instances, run health checks, and do upgrades. Our customers consume our managed service by sending their prompts to our API endpoints. This endpoint also determines the vLLM instance that serves their prompt.
vLLM sits at the intersection of AI and systems programming, so we thought that diving into its details might interest some of our readers. In this blog post, we describe how an inference request travels through vLLM’s OpenAI-compatible API server and core engine. We also provide key code pointers.
We assume readers are already familiar with the transformer architecture and large language models. If you’re not, we highly recommend this video by OpenAI co-founder Andrej Karpathy. We will focus on the new V1 architecture of vLLM and how it achieves state-of-the-art text generation performance. If you’re looking for the V0 behavior or multi-modal inference, please refer to other vLLM documentation. — Read More
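For readers who want to follow along, the life of a request begins with a client hitting vLLM's OpenAI-compatible chat completions endpoint. A minimal example is below; the host, port, and model name are assumptions for a local deployment.

```python
# Minimal client request to a locally running vLLM OpenAI-compatible server.
# The URL and model name are assumptions; substitute your own deployment's values.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",   # whichever model the server loaded
        "messages": [{"role": "user", "content": "Explain KV caching in one sentence."}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```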
How far can reasoning models scale?
Reasoning models like OpenAI’s o3 are less than a year old, but they’ve already seen rapid improvements in capabilities, and OpenAI researchers are very optimistic that this progress will continue. But it’s not clear how much further the techniques used to train reasoning models can scale.
After looking into the question, I think there is room to scale reasoning training further, but it’s unlikely that OpenAI or other frontier AI developers can scale by many orders of magnitude.
If reasoning training continues to scale at 10× every few months, in line with the jump from o1 to o3, it will reach the frontier of total training compute before long, perhaps within a year. At that point, the scaling rate will slow and converge with the overall growth rate in training compute of ~4× per year. Progress in reasoning models may slow down after this point as well. — Read More
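A rough back-of-the-envelope version of that argument is below: reasoning training compute growing ~10× every few months closes the gap to total frontier training compute (itself growing ~4× per year) in roughly a year. The starting gap of 1000× is an illustrative assumption, not a figure from the post.

```python
# Back-of-the-envelope check of the catch-up timeline. The 1000x starting gap
# between reasoning training compute and total training compute is assumed.
import math

starting_gap = 1000.0
reasoning_growth_per_year = 10 ** 3   # 10x every ~4 months -> ~1000x per year
total_growth_per_year = 4             # overall training compute growth from the post
relative_growth = reasoning_growth_per_year / total_growth_per_year

years_to_catch_up = math.log(starting_gap) / math.log(relative_growth)
print(round(years_to_catch_up, 2))    # ~1.25 years under these assumptions
```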
Evaluating Long-Context Question & Answer Systems
While evaluating Q&A systems over short paragraphs is straightforward, complexity increases as documents grow larger: think lengthy research papers, novels and movies, or multi-document scenarios. Although some of these evaluation challenges also appear in shorter contexts, long-context evaluation amplifies them.
… In this write-up, we’ll explore key evaluation metrics, how to build evaluation datasets, and methods to assess Q&A performance through human annotations and LLM-evaluators. We’ll also review several benchmarks across narrative stories, technical and academic texts, and very long-context, multi-document situations. Finally, we’ll wrap up with advice for evaluating long-context Q&A on our specific use cases. — Read More
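One of the methods mentioned, LLM-evaluators, boils down to grading a system answer against a reference with a judge model. A minimal sketch follows; the prompt wording is illustrative and `judge_llm` is a hypothetical placeholder for a real model call.

```python
# Sketch of the LLM-evaluator (LLM-as-judge) pattern for Q&A grading.
# judge_llm is a hypothetical placeholder; the prompt is illustrative.
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
System answer: {answer}
Reply with a single word: CORRECT or INCORRECT."""

def llm_judge_accuracy(examples):
    correct = 0
    for ex in examples:   # each example has .question, .reference, .answer
        verdict = judge_llm(JUDGE_PROMPT.format(
            question=ex.question, reference=ex.reference, answer=ex.answer))
        correct += verdict.strip().upper().startswith("CORRECT")
    return correct / len(examples)
```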
With test-time scaling, SLMs can beat large language models in reasoning tasks
A new study by Shanghai AI Laboratory shows that with test-time scaling (TTS) techniques, a small language model (SLM) with 1 billion parameters can outperform a 405B LLM on the complex MATH and AIME benchmarks.
Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use “internal TTS,” which means they are trained to “think” slowly by generating a long string of chain-of-thought (CoT) tokens. — Read More
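By contrast, “external” TTS spends the extra compute outside the model. One common strategy is best-of-N sampling scored by a reward model, sketched below with hypothetical placeholder functions; this is illustrative and not the study's exact setup.

```python
# Sketch of an "external" test-time scaling strategy: best-of-N sampling with a
# verifier. sample_solution and reward_model are hypothetical placeholders.
def best_of_n(problem, n=64):
    candidates = [sample_solution(problem) for _ in range(n)]   # extra compute at test time
    return max(candidates, key=lambda c: reward_model(problem, c))
```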
Inference Economics of Language Models
As the capabilities of AI models have expanded, and as the recent paradigm of test-time compute scaling has taken off, the demand for AI inference has grown enormously. Inference revenue at major AI companies such as OpenAI and Anthropic has been growing at a rate of 3x per year or more, even as their models continue to become smaller and cheaper compared to 2023.
A few years ago, the benchmark for whether a language model was fast enough was “human reading speed”: if a model could generate 10 tokens per second when responding to a user, that was good enough. Now, as models are asked to reason at length about complex problems and are placed inside elaborate agentic loops, this benchmark has become obsolete. The benefits to serving models faster for inference are greater than ever before. Despite this, there has been little work investigating how language models can be served quickly at scale and how much we can increase their speed at the expense of paying a higher price per token.
Today, we’re releasing a model of LLM inference economics which helps answer these questions. Working with the model reveals many important facts about inference at scale that are not widely appreciated. — Read More
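To give a feel for the speed/price trade-off such a model captures: for a fixed GPU-hour price, cost per token is inversely proportional to aggregate throughput, and pushing per-user speed up (for example, with smaller batches) usually lowers aggregate throughput. The numbers below are made-up assumptions, not outputs of the released model.

```python
# Toy illustration of the throughput vs. cost-per-token trade-off.
# All numbers are assumed for illustration.
GPU_COST_PER_HOUR = 3.00   # assumed price of one GPU-hour, in USD

def cost_per_million_tokens(tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return GPU_COST_PER_HOUR / tokens_per_hour * 1e6

print(cost_per_million_tokens(2000))   # large batch, slower per user -> ~$0.42 per M tokens
print(cost_per_million_tokens(250))    # small batch, faster per user -> ~$3.33 per M tokens
```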
Building a Distributed Cache for S3
We’ve built a distributed cache for cloud object storage: a shared, low-latency layer that gives all compute nodes fast access to hot data.
This post looks under the hood: how hot data caching worked before, why object storage made it hard, and how the new architecture fixes it. Benchmarks included. — Read More
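At its core the architecture is a read-through layer in front of object storage. The single-node sketch below shows that flow only; the system described in the post is a shared, distributed cache, and the in-memory dict here is just a stand-in for the low-latency tier.

```python
# Single-node sketch of a read-through cache in front of S3: check the fast
# local store first, fall back to S3 on a miss, and populate the cache on the
# way back. A simplification of the distributed design described in the post.
import boto3

class ReadThroughS3Cache:
    def __init__(self, bucket: str):
        self._s3 = boto3.client("s3")
        self._bucket = bucket
        self._cache: dict[str, bytes] = {}   # stand-in for the shared low-latency tier

    def get(self, key: str) -> bytes:
        if key in self._cache:               # hot data: served without touching S3
            return self._cache[key]
        obj = self._s3.get_object(Bucket=self._bucket, Key=key)
        data = obj["Body"].read()
        self._cache[key] = data              # populate on miss (read-through)
        return data
```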