We explore language models that recursively call themselves or other LLMs before providing a final answer. Our goal is to enable the processing of essentially unbounded input context and output lengths and to mitigate the degradation known as “context rot”.
We propose Recursive Language Models, or RLMs, a general inference strategy in which language models can decompose and recursively interact with their input context as a variable. We design a specific instantiation of this in which GPT-5 or GPT-5-mini is queried in a Python REPL environment that stores the user’s prompt in a variable. — Read More
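As a rough illustration of the idea (not the authors’ implementation), a recursive call might look like the sketch below. `call_llm` is a hypothetical stand-in for a single GPT-5 or GPT-5-mini query, and the chunk-and-summarize strategy is an assumption chosen for brevity:

```python
# Minimal sketch of the RLM idea (illustrative only; not the paper's implementation).
# The prompt lives in a variable that the model can inspect and decompose with code,
# recursively calling a sub-model on manageable pieces instead of reading it all at once.

def call_llm(prompt: str) -> str:
    """Hypothetical helper wrapping one ordinary LLM call (e.g., GPT-5 or GPT-5-mini)."""
    raise NotImplementedError

def recursive_lm(prompt: str, chunk_size: int = 8_000) -> str:
    # Base case: the context is small enough to answer directly.
    if len(prompt) <= chunk_size:
        return call_llm(prompt)

    # Recursive case: split the stored prompt variable into chunks and
    # process each chunk with a sub-call...
    chunks = [prompt[i:i + chunk_size] for i in range(0, len(prompt), chunk_size)]
    partials = [call_llm(f"Extract what is relevant to the task:\n{c}") for c in chunks]

    # ...then recurse on the much shorter combined partial results.
    return recursive_lm("\n".join(partials), chunk_size)
```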
State of LLMs in Late 2025
By October 2025, the AI landscape has evolved from “one model does everything” to a hyper-specialized ecosystem where each LLM has distinct strengths.
Training compute is doubling every five months, datasets are expanding every eight months, and models keep setting new benchmark records. Yet challenges are emerging: diminishing returns from scaling, massive energy consumption, and the rise of smaller, specialized language models (SLMs) are reshaping the field.
The question isn’t “Which AI is smartest?” It’s “Which AI is the right tool for this job?”
This guide explains the technical foundations that make each model different and helps choose the right one for specific tasks. — Read More
Why Can’t Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls
Language models are increasingly capable, yet they still fail at the seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and report three findings: (1) Evidence of long-range structure: logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph that “caches” and “retrieves” pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and represents digits in a Fourier basis; both are intuitive and efficient representations that a standard fine-tuned model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum lacking the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the “running sum” via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model, we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the right inductive bias can address it. — Read More
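For intuition, the auxiliary objective can be sketched roughly as follows. The probed layer, the construction of the running-sum targets, and the loss weight are illustrative assumptions, not the paper’s exact setup:

```python
import torch
import torch.nn as nn

# Illustrative sketch of an auxiliary "running sum" regression probe added to fine-tuning.

class RunningSumProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.probe = nn.Linear(hidden_dim, 1)  # linear regression probe

    def forward(self, hidden_states, running_sum_targets):
        # hidden_states: (batch, seq_len, hidden_dim) from some transformer layer
        # running_sum_targets: (batch, seq_len) partial sums of the product at each position
        pred = self.probe(hidden_states).squeeze(-1)
        return nn.functional.mse_loss(pred, running_sum_targets)

# During fine-tuning, the probe loss is added to the usual next-token loss, e.g.:
#   total_loss = lm_loss + aux_weight * probe(hidden_states, running_sum_targets)
# nudging intermediate representations toward the long-range structure the task requires.
```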
Effective context engineering for AI agents
After a few years of prompt engineering being the focus of attention in applied AI, a new term has come to prominence: context engineering. Building with language models is becoming less about finding the right words and phrases for your prompts, and more about answering the broader question of “what configuration of context is most likely to generate our model’s desired behavior?”
Context refers to the set of tokens included when sampling from a large language model (LLM). The engineering problem at hand is optimizing the utility of those tokens against the inherent constraints of LLMs in order to consistently achieve a desired outcome. Effectively wrangling LLMs often requires thinking in context; in other words, considering the holistic state available to the LLM at any given time and what potential behaviors that state might yield.
In this post, we’ll explore the emerging art of context engineering and offer a refined mental model for building steerable, effective agents. — Read More
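As a rough sketch of what “configuring context” can mean in practice, the snippet below assembles an agent’s context under a fixed token budget. The components, priorities, and token counter are illustrative assumptions, not any particular framework’s API:

```python
# Illustrative context assembly: fixed instructions first, recent history next,
# retrieved documents with whatever budget remains.

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def build_context(system_prompt, tool_specs, retrieved_docs, history, budget=8_000):
    # High-priority, always-present components.
    parts = [system_prompt, *tool_specs]
    used = sum(count_tokens(p) for p in parts)

    # Keep the most recent conversation turns that fit in roughly 70% of the budget.
    kept_turns = []
    for turn in reversed(history):          # newest first
        cost = count_tokens(turn)
        if used + cost > budget * 0.7:
            break
        kept_turns.append(turn)
        used += cost
    parts.extend(reversed(kept_turns))      # restore chronological order

    # Spend whatever remains on retrieved documents, best-scoring first.
    for doc in retrieved_docs:
        cost = count_tokens(doc)
        if used + cost > budget:
            break
        parts.append(doc)
        used += cost

    return "\n\n".join(parts)
```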
Why language models hallucinate
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such “hallucinations” persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious — they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded — language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This “epidemic” of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems. — Read More
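The incentive argument is easy to make concrete: under accuracy-only grading, a guess with any nonzero chance of being right has a higher expected score than abstaining, and only a penalty for wrong answers flips that. The penalty value in this sketch is purely illustrative, not the paper’s proposal:

```python
# Toy illustration of why accuracy-only grading rewards guessing over abstaining.

def expected_score(p_correct: float, wrong_penalty: float) -> dict:
    """Expected benchmark score for a question the model believes it can answer
    correctly with probability p_correct."""
    guess = p_correct * 1.0 + (1 - p_correct) * (-wrong_penalty)
    abstain = 0.0  # "I don't know" earns nothing either way
    return {"guess": guess, "abstain": abstain}

# Accuracy-only grading (no penalty): guessing dominates even at 10% confidence.
print(expected_score(p_correct=0.1, wrong_penalty=0.0))   # {'guess': 0.1, 'abstain': 0.0}

# Grading that penalizes confident errors: abstaining wins below the break-even
# confidence, so honest uncertainty is no longer punished.
print(expected_score(p_correct=0.1, wrong_penalty=1.0))   # {'guess': -0.8, 'abstain': 0.0}
```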
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents. — Read More
GitHub Repo
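The contrast the survey draws can be written out in standard notation; this is a textbook formulation, not the survey’s exact definitions:

```latex
% Conventional LLM-RL (e.g., preference tuning) as a degenerate single-step MDP:
% the state is the prompt x, the single action is the full response y, reward arrives once, and the episode ends.
s = x, \qquad a = y \sim \pi_\theta(\cdot \mid x), \qquad r = R(x, y)

% Agentic RL as a temporally extended, partially observable MDP (POMDP):
\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \Omega, \gamma \rangle, \qquad
o_t \sim \Omega(\cdot \mid s_t), \quad
a_t \sim \pi_\theta(\cdot \mid o_{\le t}, a_{<t}), \quad
s_{t+1} \sim T(\cdot \mid s_t, a_t), \quad
\text{return} = \textstyle\sum_t \gamma^{\,t} R(s_t, a_t)
```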
Understanding LLMs: Insights from Mechanistic Interpretability
Since the release of ChatGPT in 2022, large language models (LLMs) built on the transformer architecture, such as ChatGPT, Gemini, and Claude, have transformed the world with their ability to produce high-quality, human-like text and, more recently, images and videos. Yet behind this incredible capability lies a profound mystery: we don’t understand how these models work.
The reason is that LLMs aren’t built like traditional software. A traditional program is designed by human programmers and written in explicit, human-readable code. LLMs are different: instead of being programmed, they are automatically trained to predict the next word on vast amounts of internet text, growing a complex network of trillions of connections that enables them to understand language and perform tasks. This training process creates emergent knowledge and abilities, but the resulting model is usually messy, complex, and incomprehensible, because training optimizes the model for performance, not for interpretability or ease of understanding.
The field of mechanistic interpretability aims to study LLMs and reverse-engineer the knowledge and algorithms they use to perform tasks, a process that is more like biology or neuroscience than computer science.
The goal of this post is to provide insights into how LLMs work using findings from the field of mechanistic interpretability. — Read More
“RAG is Dead, Context Engineering is King” — with Jeff Huber of Chroma
In December 2023, we first covered The Four Wars of AI and the RAG/Ops War. After tens of millions of dollars poured into vector databases and ups and downs in the hype cycle, we finally have Jeff Huber from Chroma joining us today for the new hot take: “RAG” is dead…
and as context lengths increase and more and more AI workloads shift from simple chatbots to impactful agents, new work from thought leaders like Lance Martin and Dex Horthy is making genuine, substantive contributions to the previously underrated question of what actually goes into the context window. — Read More
The math and logic behind ChatGPT. This paper is all you need.
This paper explains everything there is to know about Large Language Models in simple and understandable terms.
We’ve all heard of ChatGPT and DeepSeek, which are Large Language Models (LLMs). These Large Language Models are powered by a technology called transformers or transformer neural networks.
What makes them so special? They can capture the context between words in a sentence and predict the next word in an output sentence. That is why ChatGPT and other LLMs generate words sequentially: the neural network predicts the next word (strictly, the next token) step by step, based on the input sentence and everything it has generated so far.
For example, if I input a sentence like ‘Thank you’, the LLM will likely respond with ‘You are welcome’. It predicts the first word, ‘You’, then the next, ‘are’, and finally ‘welcome’. I’m going to show you how this works in detail, so weigh anchor and prepare to set sail! — Read More
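That word-by-word loop is easy to sketch in code. In the snippet below, GPT-2 stands in for ChatGPT, since ChatGPT’s weights aren’t public, and greedy decoding (always picking the most likely token) is used for simplicity:

```python
# Minimal token-by-token generation loop, the same pattern ChatGPT uses at inference time.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("Thank you", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):  # generate five tokens, one at a time
        logits = model(input_ids).logits                             # scores for every vocabulary token
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)       # append it and repeat

print(tokenizer.decode(input_ids[0]))
```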
Is GPT-5 a “phenomenal” success or an “underwhelming” failure?
It was inevitable that people would be disappointed with last week’s release of GPT-5. That’s not because OpenAI did a poor job, and it’s not even because OpenAI did anything in particular to hype up the new version. The problem was simply that OpenAI’s previous “major” model releases—GPT-2, GPT-3, and GPT-4—have been so consequential.
… So of course people had high expectations for GPT-5. And OpenAI seems to have worked hard to meet those expectations.
… OpenAI probably should have given the GPT-5 name to o1, the reasoning model OpenAI announced last September. That model really did deliver a dramatic performance improvement over previous models. It was followed by o3, which pushed this paradigm—based on reinforcement learning and long chains of thought—to new heights. But we haven’t seen another big jump in performance over the last six months, suggesting that the reasoning paradigm may also be reaching a point of diminishing returns (though it’s hard to know for certain).
Regardless, OpenAI found itself in a tough spot in early 2025. It needed to release something it could call GPT-5, but it didn’t have anything that could meet the sky-high expectations that had developed around that name. So rather than using the GPT-5 name for a dramatically better model, it decided to use it to signal a reboot of ChatGPT as a product.
… The reality is that GPT-5 is a solid model (or, technically, a suite of models—we’ll get to that) that performs as well as or better than anything else on the market today. In my own testing over the last week, I found GPT-5 to be the most capable model I’ve ever used. But it’s not the kind of dramatic breakthrough people expected from the GPT-5 name. And it has some rough edges that OpenAI is still working to sand down. — Read More