When OpenAI released o1 in 2024 and called it a “reasoning model,” the industry celebrated a breakthrough. Finally, AI that could think step-by-step, solve complex problems, handle graduate-level mathematics.
But look closer at what’s actually happening under the hood. When you ask o1 to multiply two large numbers, it doesn’t calculate. It generates Python code, executes it in a sandbox, and returns the result. Unlike GPT-3, which at least attempted arithmetic internally (and often failed), o1 explicitly delegates computation to external tools.
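To make that delegation concrete, here is a minimal sketch of the pattern, with the model call stubbed out (`model_generate` and the sandbox setup are illustrative assumptions, not OpenAI's actual pipeline):

```python
# Sketch of tool-delegated arithmetic: the "model" emits Python source,
# and an executor runs it in a restricted namespace instead of the model
# computing the product itself. model_generate is a stand-in stub.

def model_generate(prompt: str) -> str:
    # A real system would call the model here; we return canned code.
    return "result = 123456789 * 987654321"

def run_in_sandbox(code: str) -> int:
    namespace = {}
    exec(code, {"__builtins__": {}}, namespace)  # confine side effects
    return namespace["result"]

answer = run_in_sandbox(model_generate("Multiply 123456789 by 987654321"))
print(answer)  # 121932631112635269
```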
This pattern extends everywhere. The autonomy in agentic AI? Chained tool calls like web searches, API invocations, database queries. The breakthrough isn’t in the model’s intelligence. It’s in the orchestration layer coordinating external systems. Everything from reasoning to agentic AI is just a sophisticated application of code generation. These are not model improvements. They’re engineering workarounds for models that stopped improving.
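The orchestration layer itself is not exotic. A toy version of the loop, with stub functions standing in for real search and database backends (all names here are hypothetical):

```python
# Minimal agent loop: the model proposes tool calls, the orchestrator
# dispatches them and feeds results back until the plan is exhausted.
# Tool bodies are stubs; a real system would hit live services.

def web_search(query: str) -> str:
    return f"top results for {query!r}"

def db_query(sql: str) -> str:
    return f"rows for {sql!r}"

TOOLS = {"web_search": web_search, "db_query": db_query}

def orchestrate(plan: list[tuple[str, str]]) -> list[str]:
    # 'plan' stands in for the model's sequence of tool-call requests.
    transcript = []
    for tool_name, argument in plan:
        result = TOOLS[tool_name](argument)  # dispatch to external system
        transcript.append(result)            # fed back as context
    return transcript

print(orchestrate([("web_search", "AI GDP projections"),
                   ("db_query", "SELECT 1")]))
```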
This matters because the entire AI industry (from unicorn valuations to trillion-dollar GDP projections) depends on continued model improvement. What we’re getting instead is increasingly elaborate plumbing for fundamentally stagnant foundations. — Read More
Machine Learning and Design Thinking are “basically” the same
When you hear backpropagation, you probably think of machine learning, neural networks, and intimidating math. But even if the concept is new to you, there’s no reason to worry. Because if we look closely, backpropagation isn’t just a computer science algorithm for machine learning.
No, backpropagation embodies a philosophy of learning through feedback, and so it has a lot in common with design thinking.
In this article, I compare design thinking to machine learning to make complex concepts from computer science more graspable. I translate the logic of backprop (backpropagation) into design thinking language, and I illustrate how both follow the same idea: iterative improvement through feedback loops. In the latter half of the article I explain more machine learning concepts: “bias”, the “cost function”, “overfitting” and “underfitting”, as well as “activation functions”. And what seems incredibly complicated or simply unknown to you now will be a little clearer and more relatable by the end of this article. — Read More
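To preview the mapping, here is a toy, single-weight version of that feedback loop, with the design-thinking reading of each step in the comments (the numbers are arbitrary):

```python
# Toy illustration of the shared feedback loop: one weight, one data point.
# Design-thinking reading: predict = prototype, cost = user test,
# gradient = feedback, update = the next iteration.

w = 0.0                      # initial guess (first prototype)
x, target = 2.0, 10.0        # input and desired outcome
lr = 0.05                    # how boldly we act on feedback

for step in range(20):
    prediction = w * x                         # build the prototype
    cost = (prediction - target) ** 2          # test it against the goal
    gradient = 2 * (prediction - target) * x   # feedback: which way to move
    w -= lr * gradient                         # incorporate feedback, iterate

print(round(w, 3))  # approaches 5.0, since 5.0 * 2.0 == 10.0
```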
The Continual Learning Problem
If we want to move towards a world where models are “always training” and continually learning from experience over time, we need to address a basic challenge: how do we keep updating the parameters of a model without breaking it? In this post, I’ll motivate memory layers as a natural architecture for this paradigm: high-capacity, but sparse (few active parameters) on each forward pass. In our recent paper, we found that finetuning memory layers enables learning without forgetting much more effectively than LoRA. When learning TriviaQA facts, NaturalQuestions performance drops by 89% with full finetuning and 71% with LoRA, but only 11% with memory layers. Along the way, I’ll also discuss the challenges of the continual learning problem broadly.
Read More
Check out the paper here: Continual Learning via Sparse Memory Finetuning
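A rough NumPy sketch of why memory-layer updates are sparse: each query activates only its top-k slots, so a finetuning step touches k rows of the memory and nothing else. This is an illustration of the idea, not the paper's implementation:

```python
import numpy as np

# Illustrative memory layer: queries select top-k slots by key similarity,
# and a learning step updates only those k value rows. Every other
# parameter is untouched, which is what limits forgetting.

rng = np.random.default_rng(0)
n_slots, dim, k = 1024, 64, 4
keys = rng.normal(size=(n_slots, dim))   # frozen in this sketch
values = np.zeros((n_slots, dim))        # the trainable memory

def memory_step(query, target, lr=0.3):
    scores = keys @ query
    top = np.argsort(scores)[-k:]        # k active slots on this pass
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                         # softmax over the active slots
    output = w @ values[top]
    values[top] -= lr * np.outer(w, output - target)  # only k rows change
    return output

query, fact = rng.normal(size=dim), rng.normal(size=dim)
for _ in range(100):
    memory_step(query, fact)
print(np.linalg.norm(memory_step(query, fact) - fact))  # near zero: fact stored
```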
Claude Code is unreasonably good at building MVPs
The most valuable code I’ve written in the past six months is code I fully intend to throw away. This isn’t some zen programming philosophy or agile methodology talking point. It’s the practical reality of using Claude Code to build prototypes and MVPs.
Here’s what’s fundamentally changed: the time from “what if we built X” to “here’s a working version of X” has collapsed from weeks or months down to hours or days. That compression doesn’t just make development faster. It changes what kinds of ideas are worth exploring in the first place. — Read More
Andrej Karpathy — AGI is still a decade away
Data Modeling for the Agentic Era: Semantics, Speed, and Stewardship
In data analytics, we’re facing a paradox. AI agents can theoretically analyze anything, but without the right foundations, they’re as likely to hallucinate a metric as to calculate it correctly. They can write SQL in seconds, but will it answer the right business question? They promise autonomous insights, but at what cost to trust and accuracy?
These days, everyone is embedding AI chat in their product. But to what end? Does it actually help, or would users rather turn to tools like Claude Code when they need real work done? The real questions are: how can we model our data for agents to reliably consume, and how can we use agents to develop better data models?
After spending the last year exploring where LLMs have genuine leverage in analytics (see my writing on GenBI and Self-Serve BI), I’ve identified three essential pillars that make agentic data modeling actually work: semantics as the shared language both humans and AI need to understand metrics, speed through sub-second analytics that lets you verify numbers before they become decisions, and stewardship with guardrails that guide without constraining. The TL;DR? AI needs structure to understand, humans need speed to verify, and both need boundaries to stay productive. — Read More
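As a concrete (hypothetical) example of the semantics pillar: a metric defined once with its guardrails, so the agent fills in constrained degrees of freedom instead of free-writing SQL. The schema below is an illustration, not any particular semantic-layer product:

```python
# Hypothetical semantic-layer entry: the definition carries the metric's
# meaning (semantics) and its guardrails (stewardship); an agent chooses
# among sanctioned options rather than inventing the query from scratch.

NET_REVENUE = {
    "name": "net_revenue",
    "description": "Gross bookings minus refunds, in USD.",
    "expression": "SUM(bookings.amount_usd) - SUM(refunds.amount_usd)",
    "valid_grains": ["day", "region"],   # the only allowed group-bys
}

def compile_metric_query(metric: dict, grain: str) -> str:
    if grain not in metric["valid_grains"]:   # guardrail, not a suggestion
        raise ValueError(f"{grain!r} is not a valid grain for {metric['name']}")
    return (f"SELECT {grain}, {metric['expression']} AS {metric['name']}\n"
            f"FROM bookings LEFT JOIN refunds USING (order_id)\n"
            f"GROUP BY {grain}")

print(compile_metric_query(NET_REVENUE, "region"))
```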
Advanced RAG Techniques for High-Performance LLM Applications
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by combining retrieval with generation to ground outputs in your own data rather than relying solely on pretraining. In practice, RAG systems retrieve relevant information from a knowledge source and integrate it into the prompt, enabling responses that are more accurate, contextual, and trustworthy.
RAG is now a widely used architecture for LLM applications, powering everything from question-answering services that leverage web search, to internal chat tools that index enterprise content, to complex QA pipelines. Its appeal is simple: by augmenting generation with retrieval, teams can deliver LLM experiences that meet today’s expectations for relevance and reliability.
But shipping a RAG system isn’t the finish line. Anyone who’s moved beyond a prototype knows the symptoms: hallucinations creep back in, long queries bog down performance, or answers miss the mark despite the right documents being retrieved. That’s where advanced RAG techniques come in. This guide walks through the strategies that help teams improve relevance, accuracy, and efficiency, so your system not only works, but works at scale. — Read More
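One of the most common such techniques, retrieve-then-rerank, fits in a few lines. The sketch below uses token overlap as a stand-in for both the cheap retriever and the stronger reranker (real systems would use embeddings and a cross-encoder):

```python
# Retrieve-then-rerank sketch: over-retrieve with a cheap scorer, then
# reorder the candidates with a (notionally) stronger one before the top
# results are packed into the prompt. Both scorers here are toy
# token-overlap functions standing in for real models.

DOCS = [
    "RAG grounds LLM outputs in retrieved documents from your own data.",
    "Reranking reorders retrieved candidates with a stronger relevance model.",
    "Long queries can bog down retrieval performance at scale.",
    "Chunking strategy affects which passages the retriever can find.",
]

def overlap(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def retrieve_then_rerank(query: str, k_retrieve: int = 3, k_final: int = 2):
    ranked = sorted(DOCS, key=lambda d: overlap(query, d), reverse=True)
    candidates = ranked[:k_retrieve]          # cheap first pass, wide net
    reranked = sorted(candidates, key=lambda d: overlap(query, d), reverse=True)
    return reranked[:k_final]                 # precise second pass, narrow cut

context = "\n".join(retrieve_then_rerank("why does reranking help retrieval?"))
print("Answer using only this context:\n" + context)
```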
Emerging Architectures for Modern Data Infrastructure
The growth of the data infrastructure industry has continued unabated since we published a set of reference architectures in late 2020. Nearly all key industry metrics hit record highs during the past year, and new product categories appeared faster than most data teams could reasonably track. Even the benchmark wars and billboard battles returned.
To help data teams stay on top of the changes happening in the industry, we’re publishing an updated set of data infrastructure architectures in this post. They show the current best-in-class stack across both analytic and operational systems, as gathered from numerous operators we spoke with over the last year. Each architectural blueprint includes a summary of what’s changed since the prior version.
We’ll also attempt to explain why these changes are taking place. We argue that core data processing systems have remained relatively stable over the past year, while supporting tools and applications have proliferated rapidly. We explore the hypothesis that platforms are beginning to emerge in the data ecosystem, and that this helps explain the particular patterns we’re seeing in the evolution of the data stack. — Read More
Lexicon: How China talks about ‘agentic AI’
Three months after the Chinese AI company DeepSeek shocked global markets with a highly capable reasoning model, another China-linked company made a splash with a capable agentic AI system. Did Manus, released in March 2025, portend Chinese leadership in AI systems that go beyond chatbots to take action on the user’s behalf? Victor Mustar, head of product at Hugging Face, described Manus’ capabilities as “mind-blowing, redefining what’s possible.” A journalist’s comparison with ChatGPT’s Deep Research found that Manus provided better results, despite speed and stability issues.
Manus had been released by a Singapore-based firm but developed by a startup in Wuhan with backing from the Chinese tech giant Tencent. It wasn’t China’s only foray into the emerging field. The same month, the Beijing-based firm Zhipu AI launched AutoGLM-Rumination, an open-source agentic system the company said achieved “state-of-the-art” scores on benchmarks such as AgentBench. (Zhipu also announced an “international alliance” for autonomous AI models, to include 10 countries associated with the Belt and Road Initiative and ASEAN.) Earlier, in January, Alibaba released the Qwen-Agent framework for building agentic systems with its Qwen models. ByteDance followed with its Coze Studio platform in July. Last month, Tencent open-sourced its Youtu-Agent agentic framework, which was reportedly built atop a DeepSeek model.
With so much action this year in Chinese “agentic” AI efforts, it’s worth pausing to ask what Chinese developers mean when they talk about agentic AI. Moreover, what does the proliferation of such systems in China mean for AI safety and governance in the country? — Read More
Stanford RNA 3D Folding: 1st Place Solution
My approach was clear from the outset. Without GPUs, training a model from scratch or fine-tuning was not viable. My early research – drawing on CASP results, literature, and conference talks, including one by host @rhijudas – showed that Template-Based Modeling approaches consistently dominated. Based on this, I committed to TBM from day one and spent the next 90 days refining my method.
Next, I focused on the evaluation metric, since understanding it determines the exploration path. TM-score has two key properties: it is normalized by structure length (so 50nt and 200nt RNAs are compared on the same 0-1 scale), and it is robust to local errors – a small number of misplaced nucleotides does not disproportionately lower the score. This insight allowed me to prioritize getting the overall fold correct over achieving atomic-level precision. — Read More
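For reference, TM-score is just a per-residue average, which is where both properties come from. A short sketch (the normalization constant d0 is shown as an illustrative value; real evaluators derive it from chain length):

```python
# TM-score sketch: mean of 1 / (1 + (d_i/d0)^2) over aligned positions.
# Length normalization comes from dividing by L_target; robustness to
# local errors comes from each bad residue contributing at most 1/L.
# d0 = 4.0 is illustrative, not the exact RNA formula.

def tm_score(distances, L_target, d0=4.0):
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / L_target

good_fold = [1.0] * 45 + [15.0] * 5   # 5 badly placed nucleotides out of 50
print(round(tm_score(good_fold, 50), 3))  # ~0.85: local errors barely dent it
```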