Backpropagation is one of those terms that gets thrown around so much in AI that people assume everyone already understands it.
But most explanations stop at “the network adjusts its weights using gradients” and leave you nodding along without actually knowing what is being computed or why.
In this blog, I’m going to fix that.
We’ll start from scratch and work all the way to a complete, clean idea of every gradient you need. — Read More
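To ground the idea before the full post, here is a minimal sketch of what "every gradient you need" means for a one-hidden-unit network with squared-error loss. This is our illustration, not the post's code; the parameter names (`w1`, `b1`, `w2`, `b2`) are invented for the example.

```python
import math

def forward(x, w1, b1, w2, b2, y):
    """One hidden tanh unit, linear output, squared-error loss."""
    z = w1 * x + b1              # pre-activation
    h = math.tanh(z)             # hidden activation
    yhat = w2 * h + b2           # network output
    loss = 0.5 * (yhat - y) ** 2
    return z, h, yhat, loss

def backward(x, w1, b1, w2, b2, y):
    """Backpropagation: apply the chain rule from the loss backwards."""
    z, h, yhat, loss = forward(x, w1, b1, w2, b2, y)
    dyhat = yhat - y             # dL/dyhat
    dw2 = dyhat * h              # dL/dw2 = dL/dyhat * dyhat/dw2
    db2 = dyhat
    dh = dyhat * w2              # propagate into the hidden layer
    dz = dh * (1 - h ** 2)       # tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * x
    db1 = dz
    return loss, (dw1, db1, dw2, db2)
```

A useful sanity check is comparing each gradient against a finite-difference estimate of the loss; the two should agree to several decimal places.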
Tag Archives: Training
This World Model Learns Physics by Watching Videos
Yann LeCun’s team just taught an AI to imagine the future from raw video. On one GPU. With a model smaller than most apps on your phone.
You know how you can close your eyes and imagine what happens when you push a coffee cup off the edge of a table? You don’t need to actually do it. Your brain just… knows. Gravity. Impact. Shattered ceramic. Coffee everywhere.
That is a world model: an internal simulation of how reality works. AI researchers have been trying to build the same thing for machines. Not by programming physics rules manually, but by letting the AI watch videos and figure it out on its own. If a robot can imagine the consequences of its actions before taking them, it can plan. It can reason. It can avoid stupid mistakes.

The problem? Building these things has been an absolute nightmare. — Read More
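As a toy analogy for what "learning physics from observation" means (our illustration, nothing like the actual model), here is a one-dimensional dynamics model that recovers the effect of gravity purely from observed velocity transitions, via gradient descent. The constants and data are made up.

```python
# Ground truth the "videos" obey: v_next = v - g*dt (constant gravity).
g, dt = 9.8, 0.1
data = [(v, v - g * dt) for v in [0.0, 1.0, -2.0, 5.0, 3.3]]

# Learned transition model: v_next ≈ a*v + b. The model never sees
# g or dt directly; it only sees successive observations.
a, b = 0.0, 0.0
lr = 0.01
for _ in range(20000):
    da = db = 0.0
    for v, v_next in data:
        err = (a * v + b) - v_next   # prediction error on one transition
        da += err * v
        db += err
    a -= lr * da / len(data)         # average-gradient descent step
    b -= lr * db / len(data)
```

After training, `a` is close to 1 and `b` is close to `-g*dt`: the model has "discovered" gravity from data alone, which is the intuition behind a world model, scaled down to one scalar.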
The Death of model.fit(): What Data Scientists Actually Do in the Age of AI Agents
A few months ago, I joined a team building two AI-agent products.
My first week, I opened a Jupyter notebook out of habit. Then I closed it. There was no training set, no features to engineer, no model.fit(X_train, y_train) waiting to be called. The agents orchestrated foundation models. The “intelligence” came from a model someone else trained. The entire codebase was TypeScript. No notebooks, no model, no Python. The toolbox I’d spent years filling was, on its surface, irrelevant.
So what, exactly, was I supposed to do?
The answer turned out to be hiding in a simple framework.
Every AI agent has three layers. The foundation model provides raw intelligence. The engineering provides the body: tools, APIs, orchestration, and product surfaces. But the behavior of the agent – what it actually does when a user shows up – is shaped by the context, prompts, policies, schemas, and guardrails that surround the model. That’s the brain of the system. Not the neural network itself, but the cognitive architecture built on top of it.
Someone needs to own the quality of that brain: to make it legible, understand its failure modes, measure its consistency, map its weaknesses, and create the feedback loops that systematically make it smarter. That someone, it turns out, is the data scientist. Not as a model trainer, but as the team’s methodologist. — Read More
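What "owning the brain" can look like in code: a tiny eval harness that runs an agent over a prompt set and tallies pass rates for simple policy checks. This is our sketch, not the team's system; `fake_agent` and the check names are placeholders for a real model call and real guardrails.

```python
def fake_agent(prompt):
    # Placeholder: a real system would call a foundation model here.
    return {"answer": "42", "cites_source": "42" not in prompt}

# Each check inspects one structured reply and returns pass/fail.
POLICY_CHECKS = {
    "has_answer": lambda r: bool(r.get("answer")),
    "cites_source": lambda r: r.get("cites_source", False),
}

def evaluate(prompts):
    """Run the agent over a prompt set; report per-check pass rates."""
    tallies = {name: 0 for name in POLICY_CHECKS}
    for p in prompts:
        reply = fake_agent(p)
        for name, check in POLICY_CHECKS.items():
            tallies[name] += check(reply)
    return {name: n / len(prompts) for name, n in tallies.items()}
```

The point is the shape of the loop, not the checks themselves: a per-failure-mode scorecard is what makes an agent's behavior legible and improvable.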
Lossy self-improvement
Fast takeoff, the singularity, and recursive self-improvement (RSI) are all top of mind in AI circles these days. There are elements of truth to them in what’s happening in the AI industry. Two, maybe three, labs are consolidating as an oligopoly with access to the best AI models (and the resources to build the next ones). The AI tools of today are abruptly transforming engineering and research jobs.
AI research is becoming much easier in many ways. The technical problems that must be solved to scale the training of large language models even further are formidable, but superhuman coding assistants are making them approachable, overturning many earlier assumptions about what building these systems entails. Together, this is setting us up for a year (or more) of rapid progress at the cutting edge of AI.
We’re also at a time when language models are already extremely good. They’re in fact good enough for plenty of extremely valuable knowledge-work tasks. It’s hard to imagine language models taking another big step: it’s unclear which tasks they’re going to master this year outside of code and CLI-based computer use. There will be some new ones! These capabilities unlock new styles of working that’ll send more ripples through the economy.
These dramatic changes almost make it seem like a foregone conclusion that language models can then just keep accelerating progress on their own. The popular language for this is a recursive self-improvement loop. — Read More
Open Weights isn’t Open Training
When I was in college, my data structures professor told a story. It went something like this:
“When I was your age, I received an assignment, and encountered an inexplicable bug. I debugged and debugged and found that adding a print statement resolved the bug. I was young like all of you, and I was certain I’d found a bug in the C compiler. Turns out the problem was me.”
The takeaway was clear: if you have a bug, it’s your fault.
This is a good heuristic for most cases, but with open source ML infrastructure, you need to throw this advice out the window. There might be features that appear to be supported but are not. If you’re suspicious about an operation or stage that’s taking a long time, it may be implemented in a way that’s efficient enough…for an 8B model, not a 1T+ one. HuggingFace is good, but it’s not always correct. Libraries have dependencies, and problems can hide several layers down the stack. Even PyTorch isn’t ground truth.
Over the past couple of months, I worked on developing infrastructure to post-train and serve models cheaply. Ultimately, my team decided to develop a custom training codebase, but only after I spent a few days attempting to use existing open-source options. The following is an account of my successes and failures and what it means for open-weights models. — Read More
Teaching LLMs to reason like Bayesians
AI systems based on large language models (LLMs) are increasingly used as agents that interact with users and the world. To do this successfully, LLMs need to construct internal representations of the world and estimate the probability that each of these representations is accurate. Take personalized recommendations, for example: the LLM needs to gradually infer the user’s preferences from their choices over the course of multiple interactions.
Bayesian inference defines the optimal way to perform such updates. By implementing this strategy, LLMs could optimize user interactions by updating their estimates of the user’s preferences as new info about the user arrives. But without specific training, LLMs often default to simple heuristics — like assuming everyone wants the cheapest option — instead of inferring a specific user’s unique preferences.
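The update described above can be sketched concretely. The following is our own minimal example, not the paper's setup: a posterior over two user types ("price-sensitive" vs. "quality-seeking") updated after each observed choice. The likelihood numbers are invented for the illustration.

```python
# P(user picks the cheap option | user type) -- invented likelihoods.
likelihood = {"price": 0.9, "quality": 0.2}

def update(prior_price, choice):
    """Bayes' rule: posterior P(price-sensitive) after one choice."""
    p_cheap_price = likelihood["price"]
    p_cheap_quality = likelihood["quality"]
    if choice == "cheap":
        num = prior_price * p_cheap_price
        den = num + (1 - prior_price) * p_cheap_quality
    else:  # user chose the premium option
        num = prior_price * (1 - p_cheap_price)
        den = num + (1 - prior_price) * (1 - p_cheap_quality)
    return num / den

# Start agnostic, then watch the user pick premium twice, cheap once.
belief = 0.5
for choice in ["premium", "premium", "cheap"]:
    belief = update(belief, choice)
```

After two premium choices the belief in "price-sensitive" collapses well below the heuristic default of "everyone wants the cheapest option", which is exactly the behavior untrained LLMs tend to miss.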
In “Bayesian teaching enables probabilistic reasoning in large language models”, we teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of the Bayesian model, which defines the optimal way to reason about probabilities. We find that this approach not only significantly improves the LLM’s performance on the particular recommendation task on which it is trained, but also enables generalization to other tasks. This suggests that this method teaches the LLM to better approximate Bayesian reasoning. More generally, our results indicate that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains. — Read More
Beyond Language Modeling: An Exploration of Multimodal Pretraining
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models. — Read More
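The Transfusion-style objective the abstract references combines a next-token loss for text with a denoising loss for vision. A minimal numeric sketch of that combination, with invented inputs and a hypothetical balancing weight `lam` (not the paper's code or hyperparameters):

```python
import math

def next_token_ce(probs, targets):
    """Average -log p(correct next token) over a text sequence."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def diffusion_mse(pred_noise, true_noise):
    """Denoising objective: MSE between predicted and true noise."""
    n = len(true_noise)
    return sum((a - b) ** 2 for a, b in zip(pred_noise, true_noise)) / n

def combined_loss(probs, targets, pred_noise, true_noise, lam=1.0):
    """Text cross-entropy plus lam-weighted image diffusion loss."""
    return next_token_ce(probs, targets) + lam * diffusion_mse(pred_noise, true_noise)
```

One model, two losses: each training token contributes whichever term matches its modality, which is what lets a single backbone do both understanding and generation.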
Andrej Karpathy Just Built an Entire GPT in 243 Lines of Python
I’ve read many transformer implementations during my PhD. Dense codebases. Thousands of files. Dependencies stacked on top of dependencies. You open a repo, run pip install -r requirements.txt, and watch 400 packages download before you can even see your model train (then come the errors, the dependency conflicts, and so on).
Then on February 11, 2026, Andrej Karpathy dropped a single Python file that trains and runs a GPT from scratch. 243 lines. Zero dependencies. — Read More
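To give a feel for why a GPT fits in so few lines, here is the core routine any such implementation boils down to: causal (masked) self-attention in plain Python. This sketch is ours, not Karpathy's code.

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def causal_attention(q, k, v):
    """q, k, v: lists of per-token vectors. Token t attends only to
    tokens 0..t -- the causal mask that makes the model generative."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # scaled dot-product scores against current and earlier tokens
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        out.append([sum(w[s] * v[s][i] for s in range(t + 1))
                    for i in range(d)])
    return out
```

Stack this with token embeddings, an MLP, layer norm, and a sampling loop, and there simply isn't much left, which is how a full GPT squeezes into a couple hundred lines with zero dependencies.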
Master Any Skill Faster With an AI Learning System
You can learn almost anything online.
So why does it still feel slow?
Most “learning” is simply the collection of information. Tabs. Notes. Videos. Highlights.
But skill only grows when you do three things again and again:
Try → Get feedback → Try again.
AI can make that loop faster — if you use it like a system, not a chat. — Read More
Top 10 YouTube Channels for Learning AI in 2026
Around 2.5 billion people used YouTube in January 2025, and a decent chunk of them are trying to figure out this whole AI thing. The platform has quietly become the best place to learn artificial intelligence without spending thousands on courses or going back to school. You can find everything from mathematical breakdowns to practical coding tutorials, and most of it is actually free.
The problem is not finding content but finding good content. YouTube is full of channels that either oversimplify to the point of being useless or overcomplicate to the point where you need a PhD to follow along. After watching dozens of hours of AI tutorials and checking what people are actually recommending in 2026, I put together this list of ten channels that genuinely teach you something useful. — Read More