When I was in college, my data structures professor told a story. It went something like this:
“When I was your age, I received an assignment, and encountered an inexplicable bug. I debugged and debugged and found that adding a print statement resolved the bug. I was young like all of you, and I was certain I’d found a bug in the C compiler. Turns out the problem was me.”
The takeaway was clear: if you have a bug, it’s your fault.
This is a good heuristic for most cases, but with open source ML infrastructure, you need to throw this advice out the window. There might be features that appear to be supported but are not. If you’re suspicious about an operation or stage that’s taking a long time, it may be implemented in a way that’s efficient enough…for an 8B model, not a 1T+ one. HuggingFace is good, but it’s not always correct. Libraries have dependencies, and problems can hide several layers down the stack. Even PyTorch isn’t ground truth.
Over the past couple of months, I worked on developing infrastructure to post-train and serve models cheaply. Ultimately, my team decided to develop a custom training codebase, but only after I spent a few days attempting to use existing open-source options. The following is an account of my successes and failures and what it means for open-weights models. — Read More
Teaching LLMs to reason like Bayesians
AI systems based on large language models (LLMs) are increasingly used as agents that interact with users and the world. To do this successfully, LLMs need to construct internal representations of the world and estimate the probability that each of these representations is accurate. Take personalized recommendations, for example: the LLM needs to gradually infer the user’s preferences from their choices over the course of multiple interactions.
Bayesian inference defines the optimal way to perform such updates. By implementing this strategy, LLMs could optimize user interactions by updating their estimates of the user’s preferences as new info about the user arrives. But without specific training, LLMs often default to simple heuristics — like assuming everyone wants the cheapest option — instead of inferring a specific user’s unique preferences.
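The Bayesian updating described above can be sketched concretely. The sketch below is illustrative only: the two preference hypotheses and the likelihood numbers are assumptions for the example, not values from the paper. It shows how a belief over user types shifts as choices arrive.

```python
# Minimal sketch of Bayesian preference inference: maintain a discrete
# prior over hypothetical user types and update it with each observed
# choice. Hypothesis names and likelihoods are illustrative assumptions.
from typing import Dict

def bayes_update(prior: Dict[str, float],
                 likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior ∝ prior × likelihood, renormalized to sum to 1."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Start from a uniform prior over two preference hypotheses.
belief = {"price_sensitive": 0.5, "quality_seeking": 0.5}

# P(user chooses the premium option | hypothesis) — assumed numbers.
choose_premium = {"price_sensitive": 0.2, "quality_seeking": 0.8}

# The user picks the premium option twice in a row.
for _ in range(2):
    belief = bayes_update(belief, choose_premium)

print(belief)  # belief in "quality_seeking" rises above 0.9
```

Two consistent choices are enough to move the posterior from 50/50 to roughly 94% "quality_seeking", which is exactly the kind of update an LLM falling back on "everyone wants the cheapest option" fails to make.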
In “Bayesian teaching enables probabilistic reasoning in large language models”, we teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of the Bayesian model, which defines the optimal way to reason about probabilities. We find that this approach not only significantly improves the LLM’s performance on the particular recommendation task on which it is trained, but also enables generalization to other tasks. This suggests that this method teaches the LLM to better approximate Bayesian reasoning. More generally, our results indicate that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains. — Read More
Beyond Language Modeling: An Exploration of Multimodal Pretraining
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models. — Read More
Andrej Karpathy Just Built an Entire GPT in 243 Lines of Python
I’ve read many transformer implementations during my PhD. Dense codebases. Thousands of files. Dependencies stacked on top of dependencies. You open a repo, run pip install -r requirements.txt, and watch 400 packages download before you can even see your model train (then come the errors, dependency conflicts, and so on).
Then on February 11, 2026, Andrej Karpathy dropped a single Python file that trains and runs a GPT from scratch. 243 lines. Zero dependencies. — Read More
Master Any Skill Faster With an AI Learning System
You can learn almost anything online.
So why does it still feel slow?
Most “learning” is simply the collection of information. Tabs. Notes. Videos. Highlights.
But skill only grows when you do three things again and again:
Try → Get feedback → Try again.
AI can make that loop faster — if you use it like a system, not a chat. — Read More
Top 10 YouTube Channels for Learning AI in 2026
Around 2.5 billion people used YouTube in January 2025, and a decent chunk of them are trying to figure out this whole AI thing. The platform has quietly become the best place to learn artificial intelligence without spending thousands on courses or going back to school. You can find everything from mathematical breakdowns to practical coding tutorials, and most of it is actually free.
The problem is not finding content but finding good content. YouTube is full of channels that either oversimplify to the point of being useless or overcomplicate to the point where you need a PhD to follow along. After watching dozens of hours of AI tutorials and checking what people are actually recommending in 2026, I put together this list of ten channels that actually teach you something useful. — Read More
Synthetic pretraining
Pretraining data infrastructure used to be the most conservative part of a fast-moving AI world. Since GPT-3 we have been mostly scaling the usual mix of web crawls peppered with a few more select sources (including, controversially, digitized books). This is finally changing.
In 2025, several major releases used extensive synthetic datasets ahead of mid-training: Minimax, Trinity, K2/K2.5, Nemotron-3 and, more speculatively, GPT-OSS. At Pleias we even experimented with fully synthetic training, with Baguettotron/Monad trained exclusively on a generalist synthetic environment, SYNTH.
At this point, a few clarifications are needed. To what extent does synthetic pretraining contrast with the already common use of synthetic methods in mid- and post-training? And what do we even mean by synthetic? Is it just another data source? Or a much more significant shift in the way we envision data, model design and training infrastructure?
Overall, this post isn’t an introduction. It’s rather an attempt to bind together scattered strands of research and practice around synthetic pretraining—an area that is both fragmented in the open and secretive in frontier labs. I’ll strive to anchor definitions in the operational realities of building and scaling synthetic pipelines, then later move on to more speculative extrapolations. — Read More
The private cloud returns for AI workloads
A North American manufacturer spent most of 2024 and early 2025 doing what many innovative enterprises did: aggressively standardizing on the public cloud by using data lakes, analytics, CI/CD, and even a good chunk of ERP integration. The board liked the narrative because it sounded like simplification, and simplification sounded like savings. Then generative AI arrived, not as a lab toy but as a mandate. “Put copilots everywhere,” leadership said. “Start with maintenance, then procurement, then the call center, then engineering change orders.”
… The most valuable AI use cases were those closest to people who build and fix things. Those people lived near manufacturing plants with strict network boundaries, latency constraints, and operational rhythms that don’t tolerate “the provider is investigating.” Within six months, the company began shifting its AI inference and retrieval workloads to a private cloud located near its factories, while keeping model training bursts in the public cloud when it made sense. It wasn’t a retreat. It was a rebalancing. — Read More
Stop Tuning Hyperparameters. You’re Just Procrastinating.
You Spent 3 Weeks Tuning. Your Colleague Beat Your Score in 2 Hours With Better Data.
You: “I’m optimizing learning rate, batch size, dropout, layers…”
Your colleague: “I cleaned the data and added 2 features.”
Results:
Your model after 3 weeks: 87.3% accuracy
Their model with defaults: 91.2% accuracy
— Read More
On-Policy Distillation
LLMs are capable of expert performance in focused domains, a result of several capabilities stacked together: perception of input, knowledge retrieval, plan selection, and reliable execution. This requires a stack of training approaches.
… Smaller models with stronger training often outperform larger, generalist models in their trained domains of expertise. There are many benefits to using smaller models: they can be deployed locally for privacy or security considerations, can continuously train and get updated more easily, and save on inference costs. Taking advantage of these requires picking the right approach for the later stages of training.
Approaches to post-training a “student” model can be divided into two kinds:
Off-policy training relies on target outputs from some external source that the student learns to imitate.
On-policy training samples rollouts from the student model itself, and assigns them some reward.
We can do on-policy training via reinforcement learning, by grading each student rollout on whether it solves the question. This grading can be done by a human, or by a “teacher” model that reliably gets the correct answer. — Read More
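The on-policy loop above can be illustrated with a toy example: sample a rollout from the student's own distribution, grade it with a checker standing in for the human or teacher model, and reinforce graded rollouts. Everything below (the tabular policy, the arithmetic task, the learning rate) is an assumption for illustration, not the excerpt's actual setup, which would use an LLM student and a teacher model.

```python
# Toy on-policy training loop: REINFORCE-style updates on a tabular
# "student" policy, with a binary grader standing in for the teacher.
import math
import random

random.seed(0)

ANSWERS = ["4", "5", "6"]                  # candidate completions for "2 + 3 = ?"
logits = {a: 0.0 for a in ANSWERS}         # the student's tabular policy

def policy() -> dict:
    """Softmax over the student's logits."""
    exps = {a: math.exp(l) for a, l in logits.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def sample() -> str:
    """On-policy rollout: draw from the student's own distribution."""
    r, probs = random.random(), policy()
    for a, p in probs.items():
        r -= p
        if r <= 0:
            return a
    return a                               # floating-point guard

def grade(answer: str) -> float:
    """Stand-in for the human or teacher-model grader."""
    return 1.0 if answer == "5" else 0.0

LR = 0.5
for _ in range(300):
    a = sample()
    if grade(a) == 0.0:
        continue                           # binary reward: zero-reward rollouts contribute nothing
    probs = policy()
    for x in ANSWERS:                      # ∇ log π(a) = 1{x == a} − π(x)
        logits[x] += LR * ((1.0 if x == a else 0.0) - probs[x])

print(policy())  # mass concentrates on the graded-correct answer "5"
```

The key on-policy property is that gradients flow only through samples the student itself produced, so training targets the mistakes the student actually makes rather than imitating an external demonstration distribution.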