Synthetic pretraining

Pretraining data infrastructure used to be the most conservative part of a fast-moving AI world. Since GPT-3 we have been mostly scaling the usual mix of web crawls peppered with a few more select sources (including, controversially, digitized books). This is finally changing.

In 2025, several major releases used extensive synthetic datasets even before mid-training: Minimax, Trinity, K2/K2.5, Nemotron-3 and, more speculatively, GPT-OSS. At Pleias we even experimented with fully synthetic training, with Baguettotron/Monad trained exclusively on a generalist synthetic environment, SYNTH.

At this point, a few clarifications are needed. To what extent does synthetic pretraining contrast with the already common use of synthetic methods in mid- and post-training? And what do we even mean by synthetic? Is it just another data source, or a much more significant shift in the way we envision data, model design and training infrastructure?

Overall, this post isn’t an introduction. It’s rather an attempt to bind together scattered strands of research and practice around synthetic pretraining—an area that is both fragmented in the open and secretive in frontier labs. I’ll strive to anchor definitions in the operational realities of building and scaling synthetic pipelines, then later move on to more speculative extrapolations. — Read More

#training

The private cloud returns for AI workloads

A North American manufacturer spent most of 2024 and early 2025 doing what many innovative enterprises did: aggressively standardizing on the public cloud for data lakes, analytics, CI/CD, and even a good chunk of ERP integration. The board liked the narrative because it sounded like simplification, and simplification sounded like savings. Then generative AI arrived, not as a lab toy but as a mandate. “Put copilots everywhere,” leadership said. “Start with maintenance, then procurement, then the call center, then engineering change orders.”

… The most valuable AI use cases were those closest to people who build and fix things. Those people lived near manufacturing plants with strict network boundaries, latency constraints, and operational rhythms that don’t tolerate “the provider is investigating.” Within six months, the company began shifting its AI inference and retrieval workloads to a private cloud located near its factories, while keeping model training bursts in the public cloud when it made sense. It wasn’t a retreat. It was a rebalancing. — Read More

#training

Stop Tuning Hyperparameters. You’re Just Procrastinating.

You Spent 3 Weeks Tuning. Your Colleague Beat Your Score in 2 Hours With Better Data.

You: “I’m optimizing learning rate, batch size, dropout, layers…”
Your colleague: “I cleaned the data and added 2 features.”

Results:

Your model after 3 weeks: 87.3% accuracy
Their model with defaults: 91.2% accuracy

Read More
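
To make the contrast concrete, here is an illustrative sketch (not the article's code) of the "defaults plus data work" side of the comparison. The CSV path, column names, and both engineered features are hypothetical stand-ins for your own dataset.

```python
# Illustrative sketch: defaults plus data cleaning and two features.
# File name and columns are hypothetical, not from the article.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data.csv")                                    # hypothetical dataset
df = df.dropna().drop_duplicates()                              # "I cleaned the data..."
df["ratio"] = df["amount"] / df["limit"].clip(lower=1)          # "...and added 2 features"
df["is_weekend"] = pd.to_datetime(df["date"]).dt.dayofweek >= 5

X, y = df.drop(columns=["label", "date"]), df["label"]
score = cross_val_score(RandomForestClassifier(), X, y, cv=5).mean()  # defaults, no tuning
print(f"cross-validated accuracy: {score:.3f}")
```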

    #training

    On-Policy Distillation

    LLMs are capable of expert performance in focused domains, a result of several capabilities stacked together: perception of input, knowledge retrieval, plan selection, and reliable execution. This requires a stack of training approaches.

    … Smaller models with stronger training often outperform larger, generalist models in their trained domains of expertise. There are many benefits to using smaller models: they can be deployed locally for privacy or security considerations, can continuously train and get updated more easily, and save on inference costs. Taking advantage of these requires picking the right approach for the later stages of training.

    Approaches to post-training a “student” model can be divided into two kinds:

    Off-policy training relies on target outputs from some external source that the student learns to imitate.
    On-policy training samples rollouts from the student model itself, and assigns them some reward.

    We can do on-policy training via reinforcement learning, by grading each student rollout on whether it solves the question. This grading can be done by a human, or by a “teacher” model that reliably gets the correct answer. — Read More
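
As a rough illustration of that last point, here is a minimal REINFORCE-style sketch of on-policy training with a grader. `student.sample`, `student.logprob`, and `grade` are hypothetical helpers standing in for the rollout sampler, the (differentiable) log-probability of a rollout, and the human/verifier/teacher grading described above.

```python
# Minimal sketch of on-policy training: rollouts come from the student itself
# and are scored by a grader. Helper names are hypothetical.
import torch

def on_policy_step(student, grade, prompts, optimizer):
    rewards, logprobs = [], []
    for prompt in prompts:
        rollout = student.sample(prompt)            # on-policy: sampled from the student
        rewards.append(grade(prompt, rollout))      # e.g. 1.0 if the rollout solves the question
        logprobs.append(student.logprob(prompt, rollout))
    rewards = torch.tensor(rewards)
    advantages = rewards - rewards.mean()           # simple baseline to reduce variance
    loss = -(advantages * torch.stack(logprobs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```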

    #training

    Machine Learning and Design Thinking are “basically” the same

    When you hear backpropagation, you probably think of machine learning, neural networks, and intimidating math. But even if the concept is new to you, there's no reason to worry: if we look closely, backpropagation isn't just a computer science algorithm for machine learning.
    Rather, backpropagation embodies the philosophy of learning through feedback, and thereby has a lot in common with design thinking.

    In this article, I compare design thinking to machine learning to make complex concepts from computer science more graspable. I translate the logic of backprop (backpropagation) into design thinking language, and I illustrate how both follow the same idea: iterative improvement through feedback loops. In the latter half of the article I explain more machine learning concepts: “bias”, the “cost function”, “overfitting” and “underfitting”, as well as “activation functions”. And what seems incredibly complicated or simply unknown to you now will be a little more clear and relatable by the end of this article. — Read More
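
A tiny numeric example (mine, not the article's) of the feedback loop being described: one parameter nudged repeatedly in the direction that reduces the error, which is the loop backpropagation automates across millions of weights.

```python
# Tiny illustration of learning through feedback; values are made up.
def learn(target: float = 3.0, lr: float = 0.1, steps: int = 20) -> float:
    w = 0.0
    for _ in range(steps):
        error = w - target       # feedback: how far off are we?
        w -= lr * 2 * error      # step against the gradient of (w - target) ** 2
    return w                     # approaches 3.0

print(learn())
```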

    #training

    The Continual Learning Problem

    If we want to move towards a world where models are “always training” and continually learning from experience over time, we need to address a basic challenge: how do we keep updating the parameters of a model without breaking it? In this post, I’ll motivate memory layers as a natural architecture for this paradigm: high-capacity, but sparse (few active parameters) on each forward pass. In our recent paper, we found that finetuning memory layers enables learning without forgetting much more effectively than LoRA. When learning TriviaQA facts, NaturalQuestions performance drops by 89% with full finetuning and 71% with LoRA, but only 11% with memory layers. Along the way, I’ll also discuss the challenges of the continual learning problem broadly.
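
For intuition, here is a minimal sketch of a memory layer as described: a high-capacity value store where only a handful of slots are active, and therefore updated, per token. The plain top-k lookup and the sizes are illustrative simplifications, not the paper's exact (product-key) architecture.

```python
# Minimal sketch of a sparse memory layer: large store, few active slots per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_slots: int = 65536, topk: int = 32):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.values = nn.Embedding(n_slots, d_model)    # high-capacity store
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = x @ self.keys.T                        # (batch, seq, n_slots)
        weights, idx = scores.topk(self.topk, dim=-1)   # sparse: only topk slots fire
        weights = F.softmax(weights, dim=-1)
        retrieved = self.values(idx)                    # (batch, seq, topk, d_model)
        return x + (weights.unsqueeze(-1) * retrieved).sum(dim=-2)
```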

    Read More

    Check out the paper here: Continual Learning via Sparse Memory Finetuning

    #training

    Less is More: Recursive Reasoning with Tiny Networks

    Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (around 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters. — Read More

    #training

    The Parallelism Mesh Zoo

    When training large-scale LLMs, there is a large assortment of parallelization strategies that you can employ to scale your training runs to work on more GPUs. There are already a number of good resources for understanding how to parallelize your models: I particularly recommend How To Scale Your Model and The Ultra-Scale Playbook. The purpose of this blog post is to discuss parallelization strategies in a more schematic way by focusing only on how they affect your device mesh. The device mesh is an abstraction used by both PyTorch and JAX that takes your GPUs (however many of them you’ve got in your cluster!) and organizes them into an N-D tensor that expresses how the devices communicate with each other. When we parallelize computation, we shard a tensor along one dimension of the mesh, and then do collectives along that dimension when there are nontrivial dependencies between shards. Being able to explain why a device mesh is set up the way it is for a collection of parallelization strategies is a good check for seeing if you understand how the parallelization strategies work in the first place! (Credit: This post was influenced by Visualizing 6D Mesh Parallelism.) — Read More
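
A minimal sketch of the abstraction in question, using PyTorch's device-mesh API (assuming 8 GPUs, a recent PyTorch, and an initialized distributed process group; the "dp"/"tp" dimension names are illustrative):

```python
# Organize 8 devices as a 2x4 mesh: shard batches along "dp", weights along "tp".
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
dp_mesh = mesh["dp"]   # 1-D sub-mesh for data-parallel collectives (e.g. gradient all-reduce)
tp_mesh = mesh["tp"]   # 1-D sub-mesh for tensor-parallel collectives (e.g. all-gather of shards)
```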

    #training

    Test Time Scaling Will Be MUCH Bigger Than Anyone Realizes

    Read More

    #training, #videos

    INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning

    We introduce INTELLECT-2, the first globally distributed reinforcement learning (RL) training run of a 32 billion parameter language model. Unlike traditional centralized training efforts, INTELLECT-2 trains a reasoning model using fully asynchronous RL across a dynamic, heterogeneous swarm of permissionless compute contributors.

    To enable a training run with this unique infrastructure, we built various components from scratch: we introduce PRIME-RL, our training framework purpose-built for distributed asynchronous reinforcement learning, based on top of novel components such as TOPLOC, which verifies rollouts from untrusted inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers.

    Beyond infrastructure components, we propose modifications to the standard GRPO training recipe and data filtering techniques that were crucial to achieving training stability and ensuring that our model successfully learned its training objective, thus improving upon QwQ-32B, the state-of-the-art reasoning model in the 32B-parameter range.

    We open-source INTELLECT-2 along with all of our code and data, hoping to encourage and enable more open research in the field of decentralized training. — Read More
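
For context on the GRPO recipe mentioned above, here is a minimal sketch of the standard group-relative advantage computation that GRPO builds on; the epsilon and tensor shapes are illustrative, and this is the baseline recipe, not INTELLECT-2's modified one.

```python
# Standard group-relative advantages: each rollout's reward is normalized
# against the other rollouts sampled for the same prompt.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (n_prompts, group_size), one reward per rollout in each group
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)   # rollouts scored relative to their own group
```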

    #training