LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench

The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities. PlanBench, an extensible benchmark we developed in 2022, soon after the release of GPT-3, has remained an important tool for evaluating the planning abilities of LLMs. Despite the slew of new private and open-source LLMs since GPT-3, progress on this benchmark has been surprisingly slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs, making it a new kind of model: a Large Reasoning Model (LRM). Using this development as a catalyst, this paper takes a comprehensive look at how well current LLMs and new LRMs do on PlanBench. As we shall see, while o1’s performance is a quantum improvement on the benchmark, outpacing the competition, it is still far from saturating it. This improvement also brings to the fore questions about accuracy, efficiency, and guarantees that must be considered before deploying such systems. — Read More

#strategy

The Bitter Lesson

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers’ belated learning of this bitter lesson, and it is instructive to review some of the most prominent. — Read More

#strategy

Time100/AI

…Our purpose in creating the TIME100 AI is to put leaders like [Sundar] Pichai and [Meredith] Whittaker in dialogue and to open up their views to TIME’s readers. That is why we are excited to share with you the second edition of the TIME100 AI. We built this program in the spirit of the TIME100, the world’s most influential community. TIME’s knowledgeable editors and correspondents, led by Emma Barker and Ayesha Javed, interviewed their sources and consulted members of last year’s list to find the best new additions to our community of AI leaders. Ninety-one of the members of the 2024 list were not on last year’s, an indication of just how quickly this field is changing. They span dozens of companies, regions, and perspectives, including 15-year-old Francesca Mani, who advocates across the U.S. for protections for victims of deepfakes, and 77-year-old Andrew Yao, one of China’s most prominent computer scientists, who called last fall for an international regulatory body for AI. — Read More

#strategy

Superhuman Automated Forecasting (FiveThirtyNine)

In a recent appearance on Conversations with Tyler, famed political forecaster Nate Silver expressed skepticism about AIs replacing human forecasters in the near future. When asked how long it might take for AIs to reach superhuman forecasting abilities, Silver replied: “15 or 20 [years].”

In light of this, we are excited to announce “FiveThirtyNine,” a superhuman AI forecasting bot. Our bot, built on GPT-4o, provides probabilities for any user-entered query, including “Will Trump win the 2024 presidential election?” and “Will China invade Taiwan by 2030?” Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters; since crowds are for the most part superhuman, so is FiveThirtyNine. — Read More
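The announcement doesn’t detail the pipeline, but the core pattern it describes, prompting GPT-4o with a user-entered question and returning a probability, can be sketched in a few lines with the OpenAI Python client. The sketch below is an illustration of that general approach under stated assumptions, not FiveThirtyNine’s actual system; the model name, prompt wording, and the `forecast` helper are hypothetical.

```python
# Minimal sketch of a GPT-4o-based probability forecaster.
# NOT the FiveThirtyNine pipeline; prompts, model choice, and parsing
# here are illustrative assumptions only.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def forecast(question: str) -> float:
    """Ask the model to reason about a question, then return a probability in [0, 1]."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a careful forecaster. Reason step by step, "
                    "then end with a final line 'Probability: X%' where X is 0-100."
                ),
            },
            {"role": "user", "content": question},
        ],
        temperature=0.2,
    )
    text = response.choices[0].message.content
    match = re.search(r"Probability:\s*([\d.]+)\s*%", text)
    if match is None:
        raise ValueError(f"No probability found in model output:\n{text}")
    return float(match.group(1)) / 100.0


if __name__ == "__main__":
    print(forecast("Will China invade Taiwan by 2030?"))
```

A production forecasting bot would presumably also need retrieval of up-to-date news and calibration against resolved questions to approach the performance claimed above.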

#strategy

AI worse than humans in every way at summarising information, government trial finds

Artificial intelligence is worse than humans in every way at summarising documents and might actually create additional work for people, a government trial of the technology has found.

Amazon conducted the test earlier this year for Australia’s corporate regulator, the Australian Securities and Investments Commission (ASIC), using submissions made to an inquiry. The outcome of the trial was revealed in an answer to a question on notice at the Senate Select Committee on Adopting Artificial Intelligence.

… [R]eviewers overwhelmingly found that the human summaries beat out their AI competitors on every criterion and on every submission, scoring 81% on an internal rubric compared with the machine’s 47%. — Read More

#strategy

Galileo LLM Hallucination Index

Many enterprise teams have already successfully deployed LLMs in production, and many others have committed to deploying Generative AI products in 2024. However, for enterprise AI teams, the biggest hurdle to deploying production-ready Generative AI products remains the fear of model hallucinations – a catch-all phrase for when the model generates text that is incorrect or fabricated. There can be several reasons for this, such as a lack of the model’s capacity to memorize all of the information it was fed, training data errors, and outdated training data. — Read More

The Index

#strategy, #accuracy

Reimagining cloud strategy for AI-first enterprises

The rise of generative artificial intelligence (AI), natural language processing, and computer vision has sparked lofty predictions: AI will revolutionize business operations, transform the nature of knowledge work, and boost companies’ bottom lines and the larger global economy by trillions of dollars.

Executives and technology leaders are eager to see these promises realized, and many are enjoying impressive results of early AI investments. Balakrishna D.R. (Bali), executive vice president, global services head, AI and industry verticals at Infosys, says that generative AI is already proving game-changing for tasks such as knowledge management, search and summarization, software development, and customer service across sectors such as financial services, retail, health care, and automotive. — Read More

#strategy

Andrew Ng’s new model lets you play around with solar geoengineering to see what would happen

AI pioneer Andrew Ng has released a simple online tool that allows anyone to tinker with the dials of a solar geoengineering model, exploring what might happen if nations attempt to counteract climate change by spraying reflective particles into the atmosphere.

The concept of solar geoengineering was born from the realization that the planet has cooled in the months following massive volcanic eruptions, including one that occurred in 1991, when Mt. Pinatubo blasted some 20 million tons of sulfur dioxide into the stratosphere. But critics fear that deliberately releasing such materials could harm certain regions of the world, discourage efforts to cut greenhouse-gas emissions, or spark conflicts between nations, among other counterproductive consequences.

The goal of Ng’s emulator, called Planet Parasol, is to invite more people to think about solar geoengineering, explore the potential trade-offs involved in such interventions, and use the results to discuss and debate our options for climate action. The tool, developed in partnership with researchers at Cornell, the University of California, San Diego, and other institutions, also highlights how AI could help advance our understanding of solar geoengineering.  — Read More

Try the Model

#strategy

A16Z: The Top 100 Gen AI Consumer Apps

Keeping up with the ever-expanding universe of consumer gen AI products is a dynamic, fast-moving job, whether we’re building time-saving new workflows, exploring real-world uses, or experimenting with new creative stacks. But amid the relentless onslaught of product launches, investment announcements, and hyped-up features, it’s worth asking: Which of these gen AI apps are people actually using? Which behaviors and categories are gaining traction among consumers? And which AI apps are people returning to, versus dabbling and dropping?

Welcome to the third installment of the Top 100 Gen AI Consumer Apps.

Every six months, we take a deeper dive into the data to rank the top 50 AI-first web products (by unique monthly visits) and top 50 AI-first mobile apps (by monthly active users). This time, nearly 30% of the companies were new compared to our previous report, from March 2024. — Read More

#strategy

The AI summer

Hundreds of millions of people have tried ChatGPT, but most of them haven’t been back. Every big company has done a pilot, but far fewer are in deployment. Some of this is just a matter of time. But LLMs might also be a trap: they look like products and they look like magic, but they aren’t. Maybe we have to go through the slow, boring hunt for product-market fit after all.

My old boss Marc Andreessen liked to say that every failed idea from the Dotcom bubble would work now. It just took time – it took years to build out broadband, consumers had to buy PCs, retailers and big companies needed to build e-commerce infrastructure, a whole online ad business had to evolve and grow, and more fundamentally, consumers and businesses had to change their behaviour. The future can take a while – it took more than 20 years for 20% of US retail to move online.

… For consumers, ChatGPT is just a website or an app, and (to begin with) it could ride on all of the infrastructure we’ve built over the last 25 years. So a huge number of people went off to try it last year. The problem is that most of them haven’t been back. — Read More

#strategy