Monthly Archives: September 2024
LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench
The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities. PlanBench, an extensible benchmark we developed in 2022, soon after the release of GPT-3, has remained an important tool for evaluating the planning abilities of LLMs. Despite the slew of new private and open-source LLMs since GPT-3, progress on this benchmark has been surprisingly slow. OpenAI claims that its recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs, making it a new kind of model: a Large Reasoning Model (LRM). Using this development as a catalyst, this paper takes a comprehensive look at how well current LLMs and new LRMs do on PlanBench. As we shall see, while o1’s performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it. This improvement also brings to the fore questions about accuracy, efficiency, and guarantees that must be considered before deploying such systems. — Read More
Ai2’s Molmo shows open source can meet, and beat, closed multimodal models
The common wisdom is that companies like Google, OpenAI, and Anthropic, with bottomless cash reserves and hundreds of top-tier researchers, are the only ones that can make a state-of-the-art foundation model. But as one among them famously noted, they “have no moat” — and Ai2 showed that today with the release of Molmo, a multimodal AI model that matches their best while also being small, free, and truly open source.
… Molmo (coming in 72B-, 7B-, and 1B-parameter variants), like other multimodal models, is capable of identifying and answering questions about almost any everyday situation or object. How do you work this coffee maker? How many dogs in this picture have their tongues out? Which options on this menu are vegan? What are the variables in this diagram? It’s the kind of visual understanding task we’ve seen demonstrated with varying levels of success and latency for years.
What’s different is not necessarily Molmo’s capabilities, but how it achieves them. — Read More
Lionsgate Inks Deal With AI Firm to Mine Its Massive Film and TV Library
The deal will see Runway train a new AI model on Lionsgate’s film and TV library as the entertainment company uses the tech “to develop cutting-edge, capital-efficient content creation opportunities.”
In a significant move, Lionsgate and the video-focused artificial intelligence research firm Runway have inked a deal that will see Runway train a new generative AI model on Lionsgate content, and will see the entertainment company use the tech as it produces future film and TV projects.
While details are scarce, the companies say that the new model will be “customized to Lionsgate’s proprietary portfolio of film and television content,” and exclusive to the studio. The purpose will be to “help Lionsgate Studios, its filmmakers, directors and other creative talent augment their work.” — Read More
The Bitter Lesson
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers’ belated learning of this bitter lesson, and it is instructive to review some of the most prominent. — Read More
Big tech fails transparency test: Gary Marcus on what we should demand of AI
Transparency — being clear about what you’ve done and what the impact is. It sounds wonky, but matters enormously. Companies like Microsoft often pay lip service to “transparency,” but provide precious little actual transparency into how their systems work, how they are trained, or how they are tested internally, let alone what trouble they may have caused.
We need to know what goes into [AI] systems, so we can understand their biases (political and social), their reliance on purloined works, and how to mitigate their many risks. We need to know how they are tested, so we can know whether they are safe.
Companies don’t really want to share, which doesn’t mean they don’t pretend otherwise. … [Y]ou can’t find out enough about what they were trained on to do good science (e.g., in order to figure out how well the models are reasoning versus whether they simply regurgitate what they are trained on). — Read More
Be like a Goldfish, Don’t Memorize! Mitigating Memorization in Generative LLMs
Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens is excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks. — Read More
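The objective change is small enough to sketch directly. Below is a minimal, hypothetical PyTorch illustration of the idea using the simplest static variant, in which every k-th target token is dropped from the loss; the function name and the default k are ours, and the paper also describes a hash-based mask whose drop decisions depend on the preceding context rather than absolute position.

```python
import torch
import torch.nn.functional as F

def goldfish_loss(logits: torch.Tensor, labels: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Next-token cross-entropy with every k-th target token dropped.

    Dropped positions receive no gradient, so the model is never directly
    supervised on them and cannot reproduce the full training sequence
    verbatim. A sketch of the static variant only; details are assumptions.

    logits: (batch, seq_len, vocab_size), labels: (batch, seq_len)
    """
    # Standard causal-LM shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    # Deterministic mask: drop every k-th target position, so the same
    # tokens are withheld on every pass over the data.
    pos = torch.arange(shift_labels.size(1), device=shift_labels.device)
    drop = (pos % k) == (k - 1)  # shape: (seq_len - 1,)

    # ignore_index makes cross_entropy skip the dropped tokens entirely.
    masked = shift_labels.masked_fill(drop.unsqueeze(0), -100)
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        masked.view(-1),
        ignore_index=-100,
    )
```

Because the mask is a fixed function of position, repeated epochs never leak the withheld tokens; the paper's hashed variant additionally keeps the mask consistent when the same passage recurs at different offsets across documents.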
SambaNova challenges OpenAI’s o1 model with Llama 3.1-powered demo on HuggingFace
SambaNova Systems has just unveiled a new demo on Hugging Face, offering a high-speed, open-source alternative to OpenAI’s o1 model.
The demo, powered by Meta’s Llama 3.1 Instruct model, is a direct challenge to OpenAI’s recently released o1 model and represents a significant step forward in the race to dominate enterprise AI infrastructure. — Read More
Why OpenAI’s new model is such a big deal
Last weekend, I got married at a summer camp, and during the day our guests competed in a series of games inspired by the show Survivor that my now-wife and I orchestrated. When we were planning the games in August, we wanted one station to be a memory challenge, where our friends and family would have to memorize part of a poem and then relay it to their teammates so they could re-create it with a set of wooden tiles.
I thought OpenAI’s GPT-4o, its leading model at the time, would be perfectly suited to help. I asked it to create a short wedding-themed poem, with the constraint that each letter could only appear a certain number of times so we could make sure teams would be able to reproduce it with the provided set of tiles. GPT-4o failed miserably. The model repeatedly insisted that its poem worked within the constraints, even though it didn’t. It would correctly count the letters only after the fact, while continuing to deliver poems that didn’t fit the prompt. Without the time to meticulously craft the verses by hand, we ditched the poem idea and instead challenged guests to memorize a series of shapes made from colored tiles. (That ended up being a total hit with our friends and family, who also competed in dodgeball, egg tosses, and capture the flag.)
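Part of what makes this failure striking is that checking the constraint is trivial even though generating under it is hard. A hypothetical Python sketch of the verification step (the tile inventory, example strings, and function name are ours; the article doesn't give the actual counts):

```python
from collections import Counter

def fits_tiles(poem: str, tiles: dict[str, int]) -> bool:
    """Return True if the poem can be spelled with the available letter
    tiles (case-insensitive; spaces and punctuation are ignored)."""
    needed = Counter(ch for ch in poem.lower() if ch.isalpha())
    return all(tiles.get(letter, 0) >= count for letter, count in needed.items())

# Hypothetical inventory: two of each vowel, one of each consonant.
tiles = {ch: (2 if ch in "aeiou" else 1) for ch in "abcdefghijklmnopqrstuvwxyz"}
print(fits_tiles("I do", tiles))           # True: one i, one d, one o
print(fits_tiles("wedding bells", tiles))  # False: needs two d's and two l's
```

Running a check like this on each candidate would have flagged the violations immediately, which is consistent with the article's observation that the model could count the letters correctly after the fact yet kept producing poems that broke the constraint.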
However, last week OpenAI released a new model called o1 (previously referred to under the code name “Strawberry” and, before that, Q*) that blows GPT-4o out of the water on exactly this kind of task. — Read More
Inside Google’s 7-Year Mission to Give AI a Robot Body
It was early January 2016, and I had just joined Google X, Alphabet’s secret innovation lab. My job: help figure out what to do with the employees and technology left over from nine robot companies that Google had acquired. People were confused. Andy “the father of Android” Rubin, who had previously been in charge, had suddenly left. Larry Page and Sergey Brin kept trying to offer guidance and direction during occasional flybys in their “spare time.” Astro Teller, the head of Google X, had agreed a few months earlier to bring all the robot people into the lab, affectionately referred to as the moonshot factory.
I signed up because Astro had convinced me that Google X—or simply X, as we would come to call it—would be different from other corporate innovation labs. The founders were committed to thinking exceptionally big, and they had the so-called “patient capital” to make things happen. After a career of starting and selling several tech companies, this felt right to me. X seemed like the kind of thing that Google ought to be doing. I knew from firsthand experience how hard it was to build a company that, in Steve Jobs’ famous words, could put a dent in the universe, and I believed that Google was the right place to make certain big bets. AI-powered robots, the ones that will live and work alongside us one day, was one such audacious bet.
Eight and a half years later—and 18 months after Google decided to discontinue its largest bet in robotics and AI—it seems as if a new robotics startup pops up every week. I am more convinced than ever that the robots need to come. Yet I doubt that Silicon Valley, with its focus on “minimum viable products” and VCs’ general aversion to investing in hardware, will be patient enough to win the global race to give AI a robot body. And much of the money that is being invested is focused on the wrong things. Here is why. — Read More
OpenAI’s new “reasoning” AI models are here: o1-preview and o1-mini
OpenAI finally unveiled its rumored “Strawberry” AI language model on Thursday, claiming significant improvements in what it calls “reasoning” and problem-solving capabilities over previous large language models (LLMs). Formally named “OpenAI o1,” the model family will initially launch in two forms, o1-preview and o1-mini, available today for ChatGPT Plus and certain API users.
OpenAI claims that o1-preview outperforms its predecessor, GPT-4o, on multiple benchmarks, including competitive programming, mathematics, and “scientific reasoning.” However, people who have used the model say it does not yet outclass GPT-4o in every metric. Other users have criticized the delay in receiving a response from the model, owing to the multi-step processing occurring behind the scenes before answering a query. — Read More