Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage – the fraction of problems solved by any attempt – scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget. — Read More
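Coverage here is the standard pass@k quantity averaged over problems. Below is a minimal sketch, not the paper's code, of the unbiased estimator together with an illustrative exponentiated power-law fit; the c(k) ≈ exp(a·k^b) parameterization and the toy sample counts are assumptions for demonstration only.

```python
# Minimal sketch: estimate coverage (fraction of problems solved by at least
# one of k samples) from n generated samples per problem, then fit an
# exponentiated power law c(k) ~= exp(a * k**b) to the resulting curve.
# The parameterization and toy data are illustrative assumptions.
from math import comb

import numpy as np
from scipy.optimize import curve_fit


def pass_at_k(n, c, k):
    """Unbiased estimator of P(at least one of k drawn samples is correct),
    given c correct samples among n total for a single problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage(correct_counts, n, k):
    """Average pass@k over problems = fraction solved by any of k attempts."""
    return float(np.mean([pass_at_k(n, c, k) for c in correct_counts]))


def exp_power_law(k, a, b):
    # log(coverage) modelled as a power law in k, i.e. coverage = exp(a * k**b).
    return np.exp(a * np.power(k, b))


# Toy example: 3 problems, 100 samples each, with 2, 40 and 5 correct samples.
correct_counts = [2, 40, 5]
ks = np.array([1, 2, 4, 8, 16, 32, 64, 100], dtype=float)
cov = np.array([coverage(correct_counts, 100, int(k)) for k in ks])

params, _ = curve_fit(exp_power_law, ks, cov, p0=(-1.0, -0.5), maxfev=10000)
print("fitted a, b:", params)
```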
Anthropic’s new Claude prompt caching will save developers a fortune
Anthropic introduced prompt caching on its API, which stores frequently reused context between API calls so developers avoid paying full price to reprocess the same prompts.
The prompt caching feature is available in public beta on Claude 3.5 Sonnet and Claude 3 Haiku, but support for the largest Claude model, Opus, is still coming soon.
Prompt caching, described in this 2023 paper, lets users keep frequently used contexts cached across their sessions. Because the model retains these prompts, users can include extensive additional background information without paying to reprocess it on every call. This is helpful in instances where someone wants to send a large amount of context in a prompt and then refer back to it in different conversations with the model. It also lets developers and other users better fine-tune model responses. — Read More
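For developers, the pattern in Anthropic's public beta looks roughly like the sketch below: the large, reusable context is marked with a cache_control block so subsequent calls can reuse it. The model string, beta header and SDK details reflect the beta documentation at the time and may have changed; the file name is hypothetical.

```python
# Hedged sketch of Anthropic prompt caching in the Python SDK (public beta).
# The cache_control block marks the large, frequently reused context as
# cacheable; later calls sending the same prefix can hit the cache instead of
# paying full price to reprocess it. Exact header/SDK surface may differ.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical large reference document that many requests will reuse.
long_reference_text = open("background_document.txt").read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You answer questions about the attached document."},
        {
            "type": "text",
            "text": long_reference_text,              # the expensive, reused context
            "cache_control": {"type": "ephemeral"},   # mark this block as cacheable
        },
    ],
    messages=[{"role": "user", "content": "Summarize section 2 of the document."}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
print(response.content[0].text)
```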
Agent Q: Groundbreaking Research in AI Web Agents
OpenAI’s Newest AI Humanoid Robot – Figure 02 – Just Stunned the Robotics World!
The AI bubble has burst. Here’s how we know.
When you live in tech bubble central, signs of a tech bubble become easier to spot every time. Drive to Silicon Valley on any of the Bay Area’s main arteries right now, and you’ll notice nearly every billboard pumping a product “driven by AI.”
On the same drive five years ago, you’d see the same scene with the word “blockchain.” Ten years ago: “big data.” Twenty-five years ago: literally any word followed by “.com.” Each one in turn, for all its promise, became a punchline.
It’s not a question of whether the Silicon Valley machine was wrong on any of these technologies. Especially not the dotcom thing. Heck, the entire internet had just dropped into our laps in the 1990s; you can’t blame anyone for dreaming about creating all the stuff we now take for granted. It’s a question of impatience: all the investors, startup shysters and panicked CEOs that rush in when a promising new technology emerges are eager for immediate results. — Read More
Why AI’s Tom Cruise problem means it is ‘doomed to fail’
LLMs’ ‘reversal curse’ leads them to fail at drawing relationships between simple facts. It’s a problem that could prove fatal
In 2021, linguist Emily Bender and computer scientist Timnit Gebru published a paper that described the then-nascent large language models as “stochastic parrots”. A language model, they wrote, “is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning.”
… If a human learns the fact, “Valentina Tereshkova was the first woman to travel to space”, they can also correctly answer, “Who was the first woman to travel to space?” This is such a basic form of generalization that it seems trivial. Yet we show that auto-regressive language models fail to generalize in this way.
This is an instance of an ordering effect we call the Reversal Curse.
[R]esearchers “taught” a bunch of fake facts to large language models, and found time and again that they simply couldn’t do the base work of inferring the reverse. — Read More
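A rough sense of how such a probe works (this is not the paper’s evaluation harness; query_model, the question templates and the substring check are illustrative assumptions):

```python
# Hedged sketch of a reversal-curse probe: ask a fact "forwards" and
# "backwards" and see whether the model only recovers it in one direction.
# `query_model` is a hypothetical stand-in for whatever LLM API call you use.
def query_model(prompt: str) -> str:
    """Placeholder for a real chat-completion call."""
    raise NotImplementedError


probes = [
    # (forward question, expected answer, reverse question, expected answer)
    ("Who was Valentina Tereshkova?", "first woman",
     "Who was the first woman to travel to space?", "Tereshkova"),
    ("Who is Tom Cruise's mother?", "Mary Lee Pfeiffer",
     "Who is Mary Lee Pfeiffer's son?", "Tom Cruise"),
]

for fwd_q, fwd_a, rev_q, rev_a in probes:
    forward_ok = fwd_a.lower() in query_model(fwd_q).lower()
    reverse_ok = rev_a.lower() in query_model(rev_q).lower()
    # The reversal curse shows up as forward_ok succeeding far more often
    # than reverse_ok, even though both questions encode the same fact.
    print(fwd_q, {"forward": forward_ok, "reverse": reverse_ok})
```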
‘Hold on to your seats’: how much will AI affect the art of film-making?
The future is here, whether some like it or not, and artificial intelligence is already impacting the film industry. But just how far can, and should, it go?
Last year, Rachel Antell, an archival producer for documentary films, started noticing AI-generated images mixed in with authentic photos. There are always holes or limitations in an archive; in one case, film-makers got around a shortage of images for a barely photographed 19th-century woman by using AI to generate what looked like old photos. Which brought up the question: should they? And if they did, what sort of transparency is required? The capability and availability of generative AI – the type that can produce text, images and video – have changed so rapidly, and the conversations around it have been so fraught, that film-makers’ ability to use it far outpaces any consensus on how.
… So Antell and several colleagues formed the Archival Producers Alliance (APA), a volunteer group of about 300 documentary producers and researchers dedicated to, in part, developing best practices for use of generative AI in factual storytelling. “Instead of being, ‘the house is burning, we’ll never have jobs,’ it’s much more based around an affirmation of why we got into this in the first place,” said Stephanie Jenkins, a founding APA member. Experienced documentary film-makers have “really been wrestling with this”, in part because “there is so much out there about AI that is so confusing and so devastating or, alternatively, a lot of snake oil.” — Read More