Andrej Karpathy Just Built an Entire GPT in 243 Lines of Python

I’ve read many transformer implementations during my PhD. Dense codebases. Thousands of files. Dependencies stacked on top of dependencies. You open a repo, run pip install -r requirements.txt, and watch 400 packages download before you can even see your model train. Then come the errors and the dependency conflicts.

Then on February 11, 2026, Andrej Karpathy dropped a single Python file that trains and runs a GPT from scratch. 243 lines. Zero dependencies. — Read More

#training

Master Any Skill Faster With an AI Learning System

You can learn almost anything online.

So why does it still feel slow?

Most “learning” is simply the collection of information. Tabs. Notes. Videos. Highlights.

But skill only grows when you do three things again and again:

Try → Get feedback → Try again.

AI can make that loop faster — if you use it like a system, not a chat. — Read More
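The try → feedback → try-again loop can be treated as a small program rather than an open-ended chat. In this sketch, `get_ai_feedback` is a stand-in I invented for whatever model call and rubric you would actually wire in:

```python
# Toy version of the learning loop: try -> get feedback -> try again.
# `get_ai_feedback` is a placeholder, not a real API; in practice you
# would call an LLM with a rubric-based prompt and parse its critique.

def get_ai_feedback(attempt: str) -> str:
    # Placeholder rubric: flag overly long attempts, accept short ones.
    return "shorter sentences" if len(attempt) > 40 else "good"

attempt = "My first overly long and rambling draft of the skill I practice"
feedback_log = []

for _ in range(3):                          # bounded practice rounds
    feedback = get_ai_feedback(attempt)     # get feedback
    feedback_log.append(feedback)
    if feedback == "good":
        break
    attempt = attempt[: len(attempt) // 2]  # revise using the feedback, try again
```

The point is the structure: the loop, the log, and the stopping condition live in the system, not in your willpower mid-chat.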

#training

Top 10 YouTube Channels for Learning AI in 2026

Around 2.5 billion people used YouTube in January 2025, and a decent chunk of them are trying to figure out this whole AI thing. The platform has quietly become the best place to learn artificial intelligence without spending thousands on courses or going back to school. You can find everything from mathematical breakdowns to practical coding tutorials, and most of it is actually free.

The problem is not finding content but finding good content. YouTube is full of channels that either oversimplify to the point of being useless or overcomplicate to the point where you need a PhD to follow along. After watching dozens of hours of AI tutorials and checking what people are actually recommending in 2026, I put together this list of ten channels that actually teach you something useful. — Read More

#training

Synthetic pretraining

Pretraining data infrastructure used to be the most conservative part of a fast-moving AI world. Since GPT-3 we have been mostly scaling the usual mix of web crawls peppered with a few more select sources (including, controversially, digitized books). This is finally changing.

In 2025, several major releases used extensive synthetic datasets before mid-training: MiniMax Trinity, K2/K2.5, Nemotron-3 and, more speculatively, GPT-OSS. At Pleias we even experimented with full synthetic training, with Baguettotron/Monad exclusively trained on a generalist synthetic environment, SYNTH.

At this point, a few clarifications are needed. To what extent does synthetic pretraining contrast with the already common use of synthetic methods in mid- and post-training? And what do we even mean by synthetic? Is it just another data source, or a much more significant shift in the way we envision data, model design, and training infrastructure?

Overall, this post isn’t an introduction. It’s rather an attempt to bind together scattered strands of research and practice around synthetic pretraining—an area that is both fragmented in the open and secretive in frontier labs. I’ll strive to anchor definitions in the operational realities of building and scaling synthetic pipelines, then later move on to more speculative extrapolations. — Read More
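To make "synthetic" concrete at its most minimal, here is a toy template-based generator. The facts and templates below are invented for illustration; a real synthetic environment such as the SYNTH one mentioned above is vastly richer, but the basic move — programmatically expanding structured knowledge into pretraining text — is the same:

```python
import random

# Toy synthetic-pretraining generator: expand (subject, verb, object)
# facts through text templates into training documents.
# FACTS and TEMPLATES are invented examples, not any real pipeline's data.

FACTS = [
    ("water", "boils at", "100 °C"),
    ("light", "travels at", "299,792 km/s"),
]
TEMPLATES = [
    "Q: What does {s} do? A: {s} {v} {o}.",
    "It is known that {s} {v} {o}.",
]

def generate(n: int, seed: int = 0) -> list[str]:
    """Sample n synthetic documents from the fact/template grid."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n):
        s, v, o = rng.choice(FACTS)
        docs.append(rng.choice(TEMPLATES).format(s=s, v=v, o=o))
    return docs

corpus = generate(4)
```

Even this caricature shows why synthetic data is more than "another source": the designer controls coverage, phrasing diversity, and difficulty directly, rather than inheriting them from a crawl.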

#training

The private cloud returns for AI workloads

A North American manufacturer spent most of 2024 and early 2025 doing what many innovative enterprises did: aggressively standardizing on the public cloud for data lakes, analytics, CI/CD, and even a good chunk of ERP integration. The board liked the narrative because it sounded like simplification, and simplification sounded like savings. Then generative AI arrived, not as a lab toy but as a mandate. “Put copilots everywhere,” leadership said. “Start with maintenance, then procurement, then the call center, then engineering change orders.”

… The most valuable AI use cases were those closest to people who build and fix things. Those people lived near manufacturing plants with strict network boundaries, latency constraints, and operational rhythms that don’t tolerate “the provider is investigating.” Within six months, the company began shifting its AI inference and retrieval workloads to a private cloud located near its factories, while keeping model training bursts in the public cloud when it made sense. It wasn’t a retreat. It was a rebalancing. — Read More

#training

Stop Tuning Hyperparameters. You’re Just Procrastinating.

You Spent 3 Weeks Tuning. Your Colleague Beat Your Score in 2 Hours With Better Data.

You: “I’m optimizing learning rate, batch size, dropout, layers…”
Your colleague: “I cleaned the data and added 2 features.”

Results:

Your model after 3 weeks: 87.3% accuracy
Their model with defaults: 91.2% accuracy

Read More
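The article's point reproduces in a toy experiment: the same untuned model — here ordinary least squares, with no knobs at all — scored on a noisy feature versus a cleaned one. The data is synthetic and the numbers are mine, not the article's:

```python
import numpy as np

# Toy demonstration: better data beats hyperparameter effort.
# We fit the SAME default model twice; only the feature quality differs.

rng = np.random.default_rng(0)
n = 500
signal = rng.standard_normal(n)
y = 3.0 * signal + 0.1 * rng.standard_normal(n)   # target depends on signal

noisy_x = signal + 2.0 * rng.standard_normal(n)   # "dirty" feature
clean_x = signal                                   # feature after "cleaning"

def fit_and_score(x: np.ndarray) -> float:
    """Ordinary least squares with defaults, scored by R^2. No tuning."""
    w = (x @ y) / (x @ x)               # closed-form 1-D least squares
    residual = y - w * x
    return 1.0 - residual.var() / y.var()

r2_dirty = fit_and_score(noisy_x)
r2_clean = fit_and_score(clean_x)
```

No learning-rate sweep closes the gap here, because the ceiling is set by how much signal the feature carries in the first place.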

#training

On-Policy Distillation

LLMs are capable of expert performance in focused domains, a result of several capabilities stacked together: perception of input, knowledge retrieval, plan selection, and reliable execution. This requires a stack of training approaches.

… Smaller models with stronger training often outperform larger, generalist models in their trained domains of expertise. There are many benefits to using smaller models: they can be deployed locally for privacy or security considerations, can be continuously trained and updated more easily, and save on inference costs. Taking advantage of these benefits requires picking the right approach for the later stages of training.

Approaches to post-training a “student” model can be divided into two kinds:

Off-policy training relies on target outputs from some external source that the student learns to imitate.
On-policy training samples rollouts from the student model itself and assigns them some reward.

We can do on-policy training via reinforcement learning, by grading each student rollout on whether it solves the question. This grading can be done by a human, or by a “teacher” model that reliably gets the correct answer. — Read More
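The off-/on-policy distinction can be sketched in a deliberately tiny setting. The string task, `student_policy`, and `teacher_grade` below are all invented for illustration, not from the post:

```python
import random

# Toy task: the correct answer to "the question" is the string "42".
TARGET = "42"

def teacher_grade(rollout: str) -> float:
    """Reward 1.0 if the student's rollout solves the question, else 0.0."""
    return 1.0 if rollout.strip() == TARGET else 0.0

def off_policy_step(teacher_output: str) -> str:
    # Off-policy: the training target comes from an EXTERNAL source;
    # the student learns to imitate it, regardless of what the student
    # itself would have produced.
    return teacher_output

def student_policy(rng: random.Random) -> str:
    # Stand-in for sampling from the student model.
    return rng.choice(["42", "41", "I don't know"])

def on_policy_step(rng: random.Random) -> tuple[str, float]:
    # On-policy: sample a rollout FROM THE STUDENT ITSELF, then grade it.
    rollout = student_policy(rng)
    return rollout, teacher_grade(rollout)

rng = random.Random(0)
rollout, reward = on_policy_step(rng)
```

The key difference is visible in the data flow: off-policy consumes someone else's outputs, while on-policy only ever trains on trajectories the student actually produces, paired with a reward.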

#training

Machine Learning and Design Thinking are “basically” the same

When you hear backpropagation, you probably think of machine learning, neural networks, and intimidating math. But even if the concept is new to you, there’s no reason to worry: looked at closely, backpropagation isn’t just a computer science algorithm for machine learning. It embodies a philosophy of learning through feedback, and thereby has a lot in common with design thinking.

In this article, I compare design thinking to machine learning to make complex concepts from computer science more graspable. I translate the logic of backprop (backpropagation) into design thinking language, and I illustrate how both follow the same idea: iterative improvement through feedback loops. In the latter half of the article I explain more machine learning concepts: bias, the cost function, overfitting and underfitting, and activation functions. What seems incredibly complicated or simply unknown to you now will be a little clearer and more relatable by the end of this article. — Read More
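The feedback-loop idea shows up even in the smallest possible setting: one weight, one target, and the cycle of try, get feedback, adjust. This toy is my own illustration of the principle, not code from the article:

```python
# Smallest possible "backprop as feedback loop": one weight, one target.
# try  -> produce a prediction (the prototype)
# feel -> measure the error (the user feedback / cost)
# fix  -> nudge the weight against the gradient of the cost

target = 10.0
w = 0.0          # initial guess: the first rough prototype
lr = 0.1         # how strongly we react to each round of feedback

history = []
for step in range(100):
    prediction = w * 1.0              # try: produce an output
    error = prediction - target       # feedback: gradient of (error**2) is 2*error
    w -= lr * 2 * error               # fix: adjust toward lower cost
    history.append(abs(error))
```

Each pass through the loop is one design iteration: the error plays the role of user feedback, and the update rule is the designer deciding what to change for the next prototype.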

#training

The Continual Learning Problem

If we want to move towards a world where models are “always training” and continually learning from experience over time, we need to address a basic challenge: how do we keep updating the parameters of a model without breaking it? In this post, I’ll motivate memory layers as a natural architecture for this paradigm: high-capacity, but sparse (few active parameters) on each forward pass. In our recent paper, we found that finetuning memory layers enables learning without forgetting much more effectively than LoRA. When learning TriviaQA facts, NaturalQuestions performance drops by 89% with full finetuning and 71% with LoRA, but only 11% with memory layers. Along the way, I’ll also discuss the challenges of the continual learning problem broadly.

Read More

Check out the paper here: Continual Learning via Sparse Memory Finetuning
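A rough intuition for why sparsity limits forgetting: in a memory layer, each query activates only a few of many key/value slots, so a gradient update touches only those slots. The shapes and the update rule below are my own toy, not the paper's architecture:

```python
import numpy as np

# Toy memory layer: a large table of key/value slots where only the
# top-k best-matching slots are active on each forward pass.
# All dimensions and names here are illustrative assumptions.

rng = np.random.default_rng(0)
n_slots, d = 1024, 16   # high capacity overall...
k = 4                   # ...but few active parameters per query

keys = rng.standard_normal((n_slots, d))
values = rng.standard_normal((n_slots, d))

def memory_forward(query: np.ndarray) -> np.ndarray:
    scores = keys @ query                      # match query against all keys
    topk = np.argpartition(scores, -k)[-k:]    # indices of the k best slots
    weights = np.exp(scores[topk] - scores[topk].max())
    weights /= weights.sum()                   # softmax over active slots only
    return weights @ values[topk]              # weighted sum of k value rows

out = memory_forward(rng.standard_normal(d))
```

Because only `values[topk]` receives gradient for a given input, new facts can be written into a handful of slots while the other 1020 slots, and whatever they encode, stay untouched.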

#training

Less is More: Recursive Reasoning with Tiny Networks

Hierarchical Reasoning Model (HRM) is a novel approach that uses two small neural networks recursing at different frequencies. This biologically inspired method beats large language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while being trained with small models (27M parameters) on small data (around 1,000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., DeepSeek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters. — Read More
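The recursive idea can be caricatured in a few lines: one tiny shared network refines a latent answer over many steps, trading depth of recursion for parameter count. The dimensions and update rule below are illustrative guesses, not TRM's actual design:

```python
import numpy as np

# Caricature of recursive reasoning: apply ONE small shared network
# repeatedly to refine a latent state, instead of stacking many
# distinct layers. Sizes and the update rule are invented for this sketch.

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) * 0.1   # the single shared "tiny network"

def step(x: np.ndarray, z: np.ndarray) -> np.ndarray:
    """One refinement: update latent answer z from input x and current z."""
    return np.tanh(x + z @ W)

x = rng.standard_normal(d)   # embedded puzzle input
z = np.zeros(d)              # initial latent answer
for _ in range(16):          # recursion depth substitutes for layer count
    z = step(x, z)
```

The same weights `W` are reused at every step, which is why a 7M-parameter model can apply far more effective compute than its parameter count suggests.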

#training