Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at this https URL — Read More
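The budget-forcing mechanism described above can be sketched in a few lines. This is a toy illustration, not the paper's code: `fake_model_step` is a hypothetical stand-in for a real decoding call, and the step counts are arbitrary. The two levers are the hard token cap (forced termination) and appending "Wait" when the model tries to stop (forced lengthening).

```python
# Toy sketch of budget forcing (illustrative, not the s1 implementation).
# A fake "model" emits reasoning steps and tries to stop after three;
# budget forcing either truncates thinking at a hard cap or appends
# "Wait" to push the model past its attempted stop, up to `max_waits` times.

def fake_model_step(thought):
    """Stand-in for one decoding call: returns (text, wants_to_stop)."""
    n = thought.count("step")
    if n >= 3 and not thought.endswith("Wait"):
        return ("", True)          # model tries to end its thinking
    return (f" step{n + 1}", False)

def budget_forced_think(max_steps=10, max_waits=2):
    thought, waits = "", 0
    for _ in range(max_steps):     # hard cap = forced early termination
        text, wants_to_stop = fake_model_step(thought)
        if wants_to_stop:
            if waits < max_waits:  # lengthen: append "Wait" and continue
                thought += " Wait"
                waits += 1
                continue
            break                  # wait budget exhausted: let it stop
        thought += text
    return thought.strip()

print(budget_forced_think(max_waits=0))  # → "step1 step2 step3"
print(budget_forced_think(max_waits=2))  # extends thinking past the stop
```

With `max_waits=2` the loop injects "Wait" twice, so the fake model produces two extra reasoning steps before it is allowed to terminate, mirroring the paper's observation that the appended "Wait" prompts the model to double-check its work.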
Alibaba unveils Qwen3, a family of ‘hybrid’ AI reasoning models
Chinese tech company Alibaba on Monday released Qwen3, a family of AI models that the company claims can match and, in some cases, outperform the best models available from Google and OpenAI.
Most of the models are — or soon will be — available for download under an “open” license on AI dev platform Hugging Face and GitHub. They range in size from 0.6 billion parameters to 235 billion parameters. (Parameters roughly correspond to a model’s problem-solving skills, and models with more parameters generally perform better than those with fewer parameters.) — Read More
Why Developers Should Care About Generative AI (Even If They Aren’t AI Experts)
Software development is about to undergo a generative change. AI (artificial intelligence) has the potential to make developers more productive, as three systems already on the market demonstrate: GitHub Copilot, Anthropic’s Claude, and OpenAI’s ChatGPT.
Hence every developer, whether or not they specialize in AI, needs to understand this rapidly advancing technology: what it is, why it is relevant, and how to use it. — Read More
Alignment from equivariance II
I recently had the privilege of having my idea criticized at the London Institute for Safe AI, including by Philip Kreer and Nicky Case. Previously the idea was vague; being with them forced me to make the idea specific. I managed to make it so specific that they found a problem with it! That’s progress 🙂
The problem has to do with syntax versus semantics, that is, “what is said vs. what is meant”. I think I’ve got a solution to it too! I imagine it would be a necessary part of any moral equivariance “stack”. — Read More
Wargaming Insights: Is Investing in a SOC Worth It?
A Markov Chain Simulation to compare two competing strategies.
… By using wargaming, security teams can model cyber threat scenarios, apply different defense measures (like firewalls, endpoint protection, and SOCs), and observe how these defenses alter the attacker’s likelihood of success. This provides a better understanding of where resources should be allocated and how to improve defense measures.
In this post, we’ll use wargaming to evaluate whether investing in security detection and response capabilities is worthwhile. The approach involves modeling a simple cyber intrusion as a Markov Chain and adding a detection step to analyze how it affects the likelihood of a successful attack. — Read More
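The modeling idea can be sketched as a Monte Carlo simulation over a simple intrusion chain. The stage names and probabilities below are illustrative assumptions, not figures from the post: the attacker must succeed at each stage in sequence, and the detection step gives the SOC a chance to evict the attacker at every stage.

```python
# Hedged sketch of the post's approach: model an intrusion as a chain of
# stages and compare attacker success with and without a detection step.
# Stage success probabilities are illustrative assumptions only.
import random

STAGES = [("initial_access", 0.6), ("lateral_movement", 0.5), ("exfiltration", 0.7)]

def attack_succeeds(p_detect_per_stage, rng):
    """One simulated intrusion; detection can stop the attacker at any stage."""
    for _, p_success in STAGES:
        if rng.random() < p_detect_per_stage:
            return False               # SOC detects and evicts the attacker
        if rng.random() >= p_success:
            return False               # attacker fails the stage on their own
    return True

def success_rate(p_detect, trials=100_000, seed=0):
    rng = random.Random(seed)
    return sum(attack_succeeds(p_detect, rng) for _ in range(trials)) / trials

no_soc = success_rate(p_detect=0.0)    # analytically 0.6 * 0.5 * 0.7 = 0.21
with_soc = success_rate(p_detect=0.3)  # 30% detection chance per stage
print(f"no SOC: {no_soc:.3f}, with SOC: {with_soc:.3f}")
```

Comparing the two rates (and the cost of the SOC against the cost of a successful breach) is exactly the kind of resource-allocation question the post uses wargaming to answer.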
Don’t Write Prompts; Write Briefs
o1 is not a chat model.
… [T]hink of it like a “report generator.”
…Give a ton of context. Whatever you think I mean by a “ton” — 10x that.
… o1 will just take lazy questions at face value and doesn’t try to pull the context from you. Instead, you need to push as much context as you can into o1. — Read More
Sycophancy is the first LLM “dark pattern”
People have been making fun of OpenAI models for being overly sycophantic for months now. I even wrote a post advising users to pretend that their work was written by someone else, to counteract the model’s natural desire to shower praise on the user. With the latest GPT-4o update, this tendency has been turned up even further. It’s now easy to convince the model that you’re the smartest, funniest, most handsome human in the world.
This is bad for obvious reasons. Lots of people use ChatGPT for advice or therapy. It seems dangerous for ChatGPT to validate people’s belief that they’re always in the right. There are extreme examples on Twitter of ChatGPT agreeing with people that they’re a prophet sent by God, or that they’re making the right choice to go off their medication. These aren’t complicated jailbreaks – the model will actively push you down this path. I think it’s fair to say that sycophancy is the first LLM “dark pattern”. — Read More
This 3D-Printed Starbucks Cafe in Texas Is Just Like Its Coffee – Industrial And Rapidly Manufactured
Starbucks, the world’s most efficient coffee vending machine disguised as a lifestyle brand, has opened its first fully 3D-printed outlet in Brownsville, Texas. If you’ve ever marveled at how a Starbucks latte seems to be conjured out of thin air with military precision – and almost no soul – you’ll appreciate just how perfect it is that their latest café was squeezed out of a robotic nozzle like industrial toothpaste. Built by Peri 3D Construction using a Cobod BOD2 printer, this 1,400-square-foot drive-thru and pickup shop isn’t a café you linger in. It’s a caffeine fueling station, printed into existence, then sprinkled with human finishing touches like windows, doors, and a porch to make it look vaguely more inviting than an automated bunker. — Read More
Guillotine: Hypervisors for Isolating Malicious AIs
As AI models become more embedded in critical sectors like finance, healthcare, and the military, their inscrutable behavior poses ever-greater risks to society. To mitigate this risk, we propose Guillotine, a hypervisor architecture for sandboxing powerful AI models — models that, by accident or malice, can generate existential threats to humanity. Although Guillotine borrows some well-known virtualization techniques, Guillotine must also introduce fundamentally new isolation mechanisms to handle the unique threat model posed by existential-risk AIs. For example, a rogue AI may try to introspect upon hypervisor software or the underlying hardware substrate to enable later subversion of that control plane; thus, a Guillotine hypervisor requires careful co-design of the hypervisor software and the CPUs, RAM, NIC, and storage devices that support the hypervisor software, to thwart side channel leakage and more generally eliminate mechanisms for AI to exploit reflection-based vulnerabilities. Beyond such isolation at the software, network, and microarchitectural layers, a Guillotine hypervisor must also provide physical fail-safes more commonly associated with nuclear power plants, avionic platforms, and other types of mission critical systems. Physical fail-safes, e.g., involving electromechanical disconnection of network cables, or the flooding of a datacenter which holds a rogue AI, provide defense in depth if software, network, and microarchitectural isolation is compromised and a rogue AI must be temporarily shut down or permanently destroyed. — Read More
Sleep-time Compute: Beyond Inference Scaling at Test-time
Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to “think” offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks – Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task. — Read More
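The core amortization idea can be illustrated with a toy sketch. This is not the paper's method, just its shape: an "expensive" offline pass derives quantities that future queries are likely to need (the summary statistics here stand in for offline LLM reasoning over a context), so each test-time query becomes a cheap lookup amortized across all queries about the same context.

```python
# Toy sketch of sleep-time compute (illustrative stand-in, not the paper's code).
# Offline: pre-derive likely-useful quantities from the context once.
# Test-time: answer each query from the cache instead of re-deriving.

def sleep_time_precompute(context_numbers):
    """Offline pass: anticipate queries and precompute their answers."""
    return {
        "total": sum(context_numbers),
        "maximum": max(context_numbers),
        "count": len(context_numbers),
    }

def answer(query, precomputed):
    """Test-time: a cheap lookup replaces expensive on-demand reasoning."""
    return precomputed[query]

ctx = [3, 1, 4, 1, 5, 9]
cache = sleep_time_precompute(ctx)       # cost paid once per context
for q in ("total", "maximum", "count"):  # amortized across many queries
    print(q, answer(q, cache))
```

As in the paper's Multi-Query setting, the more related queries share a context, the lower the average cost per query, and the benefit hinges on how predictable the queries are.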