From System 1 to System 2: A Survey of Reasoning Large Language Models

Achieving human-level intelligence requires refining the transition from the fast, intuitive System 1 to the slower, more deliberate System 2 reasoning. While System 1 excels in quick, heuristic decisions, System 2 relies on logical reasoning for more accurate judgments and reduced biases. Foundational Large Language Models (LLMs) excel at fast decision-making but lack the depth for complex reasoning, as they have not yet fully embraced the step-by-step analysis characteristic of true System 2 thinking. Recently, reasoning LLMs like OpenAI’s o1/o3 and DeepSeek’s R1 have demonstrated expert-level performance in fields such as mathematics and coding, closely mimicking the deliberate reasoning of System 2 and showcasing human-like cognitive abilities. This survey begins with a brief overview of the progress in foundational LLMs and the early development of System 2 technologies, exploring how their combination has paved the way for reasoning LLMs. Next, we discuss how to construct reasoning LLMs, analyzing their features, the core methods enabling advanced reasoning, and the evolution of various reasoning LLMs. Additionally, we provide an overview of reasoning benchmarks, offering an in-depth comparison of the performance of representative reasoning LLMs. Finally, we explore promising directions for advancing reasoning LLMs and maintain a real-time GitHub Repository to track the latest developments. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this rapidly evolving field. — Read More

#nlp

AI Promise and Chip Precariousness

Yesterday Anthropic released Claude Sonnet 3.7; Dylan Patel had the joke of the day about Anthropic’s seeming aversion to the number “4”, which means “die” in Chinese.

Jokes aside, the correction on this post by Ethan Mollick suggests that Anthropic did not increment the main version number because Sonnet 3.7 is still in the GPT-4 class of models as far as compute is concerned. I love Mollick’s work, but reject his neutral naming scheme: whoever gets to a generation first deserves the honor of the name. In other words, if Gen2 models are GPT-4 class, then Gen3 models are Grok 3 class.

And, whereas Sonnet 3.7 is an evolution of Sonnet 3.5’s fascinating mixture of personality and coding prowess, likely a result of some Anthropic special sauce in post-training, Grok 3 feels like a model that is the result of a step-order increase in compute capacity, with a much lighter layer of reinforcement learning with human feedback (RLHF). Its answers are far more in-depth and detailed (model good!), but frequently become too verbose (RLHF lacking); it gets math problems right (model good!), but its explanations are harder to follow (RLHF lacking). It is also much more willing to generate forbidden content, from erotica to bomb recipes, while having on the surface the political sensibilities of Tumblr, with something more akin to 4chan under the surface if you prod. Grok 3, more than any model yet, feels like the distilled Internet; it’s my favorite so far. — Read More

#nvidia

Brain-to-Text Decoding: A Non-invasive Approach via Typing

Modern neuroprostheses can now restore communication in patients who have lost the ability to speak or move. However, these invasive devices entail risks inherent to neurosurgery. Here, we introduce a non-invasive method to decode the production of sentences from brain activity and demonstrate its efficacy in a cohort of 35 healthy volunteers. For this, we present Brain2Qwerty, a new deep learning architecture trained to decode sentences from either electro- (EEG) or magneto-encephalography (MEG), while participants typed briefly memorized sentences on a QWERTY keyboard. With MEG, Brain2Qwerty reaches, on average, a character-error-rate (CER) of 32% and substantially outperforms EEG (CER: 67%). For the best participants, the model achieves a CER of 19%, and can perfectly decode a variety of sentences outside of the training set. While error analyses suggest that decoding depends on motor processes, the analysis of typographical errors suggests that it also involves higher-level cognitive factors. Overall, these results narrow the gap between invasive and non-invasive methods and thus open the path for developing safe brain-computer interfaces for non-communicating patients. — Read More
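
For readers unfamiliar with the metric: character error rate is commonly computed as the character-level edit distance between the decoded sentence and the sentence that was actually typed, divided by the length of the typed sentence. The Python sketch below assumes that standard Levenshtein-based definition. The excerpt does not describe the paper's exact scoring procedure, and the function names here are illustrative, not taken from the Brain2Qwerty work.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the first i-1 reference
    # characters and the first j hypothesis characters.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character out of four gives a CER of 25%.
print(character_error_rate("abcd", "abxd"))  # 0.25
```

Under this definition, a CER of 32% means roughly one character in three needs to be corrected to recover the typed sentence, and the value can exceed 100% when the decoded output is much longer than the reference.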

#human

The Voice Stack

The Voice Stack is improving rapidly. Systems that interact with users via speaking and listening will drive many new applications. Over the past year, I’ve been working closely with DeepLearning.AI, AI Fund, and several collaborators on voice-based applications, and I will share best practices I’ve learned in this and future letters.

Foundation models that are trained to directly input, and often also directly generate, audio have contributed to this growth, but they are only part of the story. OpenAI’s RealTime API makes it easy for developers to write prompts to develop systems that deliver voice-in, voice-out experiences. This is great for building quick-and-dirty prototypes, and it also works well for low-stakes conversations where making an occasional mistake is okay. I encourage you to try it! — Read More

#voice

Jewish celebrities offer a f**k you to Kanye West in viral AI video

Read More

#videos

Meta Appears to Have Invented a Device Allowing You to Type With Your Brain

Mark Zuckerberg’s Meta says it’s created a device that lets you produce text simply by thinking what you want to say.

As detailed in a pair of studies released by Meta last week, researchers used a state-of-the-art brain scanner and a deep learning AI model to interpret the neural signals of people while they typed, guessing what keys they were hitting with an accuracy high enough to allow them to reconstruct entire sentences. — Read More

#human