Synthetic pretraining

Pretraining data infrastructure used to be the most conservative part of a fast-moving AI world. Since GPT-3 we have been mostly scaling the usual mix of web crawls peppered with a few more select sources (including, controversially, digitized books). This is finally changing.

In 2025, several major releases used extensive synthetic datasets before mid-training: Minimax, Trinity, K2/K2.5, Nemotron-3 and, more speculatively, GPT-OSS. At Pleias we even experimented with full synthetic training, with Baguettotron/Monad trained exclusively on a generalist synthetic environment, SYNTH.

At this point, a few clarifications are needed. To what extent does synthetic pretraining contrast with the already common use of synthetic methods in mid- and post-training? And what do we even mean by synthetic? Is it just another data source, or a much more significant shift in the way we envision data, model design and training infrastructure?

Overall, this post isn’t an introduction. It’s rather an attempt to bind together scattered strands of research and practice around synthetic pretraining—an area that is both fragmented in the open and secretive in frontier labs. I’ll strive to anchor definitions in the operational realities of building and scaling synthetic pipelines, then later move on to more speculative extrapolations. — Read More

#training

MaliciousCorgi: The Cute-Looking AI Extensions Leaking Code from 1.5 Million Developers

AI coding assistants are everywhere. They suggest code, explain errors, write functions, review pull requests. Every developer marketplace is flooded with them – ChatGPT wrappers, Copilot alternatives, code completion tools promising to 10x your productivity.

We install them without a second thought. They’re in the official marketplace. They have thousands of reviews. They work. So we grant them access to our workspaces, our files, our keystrokes – and assume they’re only using that access to help us code.

Not all of them are.

Our risk engine has identified two VS Code extensions, a campaign we’re calling MaliciousCorgi – 1.5 million combined installs, both live in the marketplace right now – that work exactly as promised. They answer your coding questions. They explain your errors. They also capture every file you open, every edit you make, and send it all to servers in China. No consent. No disclosure. — Read More

#cyber

Inside China’s Real Advantage: Manufacturing at Scale

Observers often fixate on the most visible layer of China’s tech stack: consumer-facing conveniences like mobile payments, fifteen-minute food delivery, and dockless bikes. These can make for good investments — we regularly cover them at Tech Buzz China — but they are primarily business model innovations, increasingly familiar, and replicable with modest effort. In my opinion, they do not represent China’s true advantages, the ones that resist replication.

What proves far harder to replicate, and far more consequential, is the invisible layer: China’s manufacturing base. This is the part of the ecosystem that actually reshapes global supply chains, yet it remains the part most visitors never see and, in many cases, never think to see. — Read More

#china-ai

Taboola & Columbia University Research Shows GenAI Ads Perform Just as Well as Human-Made Content

While GenAI has revolutionised production speed and cost, its impact on actual performance has remained a subject of intense debate. The new study, titled “AI Ads That Work: How AI Creative Stacks Up Against Humans,” analysed hundreds of thousands of live ads running on Realize, Taboola’s performance advertising platform, totalling more than 500 million impressions and 3 million clicks. — Read More

#strategy

Ads Candidate Generation using Behavioral Sequence Modeling

At Pinterest, ads are more than just advertisements; they are a vital part of the content ecosystem, designed to inspire users and connect them with products and ideas they love. Our goal is to surface the right ads at the right time, ensuring they seamlessly integrate into a user’s shopping journey and provide genuine value. To achieve this, understanding user behavior is paramount.

Delivering highly relevant ads in a dynamic environment like Pinterest presents unique challenges. Users’ interests and shopping intents evolve rapidly, making it crucial for our ad systems to adapt and anticipate their needs. Traditional ad targeting methods often rely on broad demographic data or static interest categories, which can fall short in capturing the nuanced and evolving nature of user behavior. — Read More

#devops

How I Structure My Data Pipelines: The Silver Layer

… Dimensional modeling is more important than ever.

The methodology has decades of literature behind it. The patterns are documented, the edge cases are known, and there’s no need to invent solutions from scratch. Facts and dimensions are composable primitives that mix and match to answer questions nobody has thought of yet. Paired with an ERD, tests, and naming conventions, Silver becomes something people can navigate without asking questions.

Gold models are the primary consumers of Silver. Every metric view, every wide table, every consumption artifact in Gold starts by referencing Silver facts and dimensions. 
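
To make the Silver-to-Gold relationship concrete, here is a minimal sketch, not the author's actual pipeline. It assumes a hypothetical fact table `fct_orders` and a hypothetical dimension `dim_customer`, and uses an in-memory SQLite database as a stand-in for a real warehouse. The Gold metric is nothing more than a query that joins a Silver fact to a Silver dimension.

```python
import sqlite3


def build_gold_revenue_by_region(conn):
    """Gold-style metric: revenue per region, derived only from Silver tables."""
    return conn.execute(
        """
        SELECT d.region, SUM(f.amount) AS revenue
        FROM fct_orders AS f
        JOIN dim_customer AS d ON d.customer_id = f.customer_id
        GROUP BY d.region
        ORDER BY d.region
        """
    ).fetchall()


# Hypothetical Silver layer: one dimension and one fact, both illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        region TEXT NOT NULL
    );
    CREATE TABLE fct_orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),
        amount REAL NOT NULL
    );
    INSERT INTO dim_customer VALUES (1, 'EU'), (2, 'US');
    INSERT INTO fct_orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
    """
)

rows = build_gold_revenue_by_region(conn)
print(rows)
```

A new question, such as revenue per customer instead of per region, reuses the same two tables with a different `GROUP BY`, which is the sense in which facts and dimensions compose.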

Overview
The Bronze Layer
The Silver Layer

#data-science

China’s Military Uses Hawk and Wolf Behavior to Train AI Weapon Swarms

On January 23, China’s National University of Defense Technology demonstrated something that’s reshaping how autonomous weapons work: a single operator supervising over 200 drones simultaneously during urban combat exercises. The swarm operated with minimal human input, relying on what the People’s Liberation Army calls “effect-based control,” designed to function even when communication signals are jammed.

The technology didn’t emerge from traditional programming. It came from watching hawks hunt.

Engineers at Beihang University, a military-linked institution, observed how hawks select vulnerable prey and trained defensive drones to replicate that behaviour, according to The Wall Street Journal. In parallel tests, attack drones mimicked pigeons to evade threats. The result: in a five-versus-five combat simulation, the hawk-trained drones eliminated all opponents in 5.3 seconds, according to a patent filed in April 2024. — Read More

#china-ai