Backpropagation is one of those terms that gets thrown around so much in AI that people assume everyone already understands it.
But most explanations stop at “the network adjusts its weights using gradients” and leave you nodding along without actually knowing what is being computed or why.
In this blog, I’m going to fix that.
We’ll start from scratch and work all the way to a complete, clean idea of every gradient you need. — Read More
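To ground the idea before the full post, here is a minimal sketch of what "every gradient you need" means for a one-hidden-unit network with squared-error loss. This is our illustration, not the post's code; the parameter names (`w1`, `b1`, `w2`, `b2`) are invented for the example.

```python
import math

def forward(x, w1, b1, w2, b2, y):
    """One hidden tanh unit, linear output, squared-error loss."""
    z = w1 * x + b1              # pre-activation
    h = math.tanh(z)             # hidden activation
    yhat = w2 * h + b2           # network output
    loss = 0.5 * (yhat - y) ** 2
    return z, h, yhat, loss

def backward(x, w1, b1, w2, b2, y):
    """Backpropagation: apply the chain rule from the loss backwards."""
    z, h, yhat, loss = forward(x, w1, b1, w2, b2, y)
    dyhat = yhat - y             # dL/dyhat
    dw2 = dyhat * h              # dL/dw2 = dL/dyhat * dyhat/dw2
    db2 = dyhat
    dh = dyhat * w2              # propagate into the hidden layer
    dz = dh * (1 - h ** 2)       # tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * x
    db1 = dz
    return loss, (dw1, db1, dw2, db2)
```

A useful sanity check is comparing each gradient against a finite-difference estimate of the loss; the two should agree to several decimal places.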
Tag Archives: Training
This World Model Learns Physics by Watching Videos
Yann LeCun’s team just taught an AI to imagine the future from raw video. On one GPU. With a model smaller than most apps on your phone.
You know how you can close your eyes and imagine what happens when you push a coffee cup off the edge of a table? You don’t need to actually do it. Your brain just… knows. Gravity. Impact. Shattered ceramic. Coffee everywhere.
That is a world model: an internal simulation of how reality works. AI researchers have been trying to build the same thing for machines. Not by programming physics rules manually, but by letting the AI watch videos and figure it out on its own. If a robot can imagine the consequences of its actions before taking them, it can plan. It can reason. It can avoid stupid mistakes.

The problem? Building these things has been an absolute nightmare. — Read More
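As a toy analogy for what "learning physics from observation" means (our illustration, nothing like the actual model), here is a one-dimensional dynamics model that recovers the effect of gravity purely from observed velocity transitions, via gradient descent. The constants and data are made up.

```python
# Ground truth the "videos" obey: v_next = v - g*dt (constant gravity).
g, dt = 9.8, 0.1
data = [(v, v - g * dt) for v in [0.0, 1.0, -2.0, 5.0, 3.3]]

# Learned transition model: v_next ≈ a*v + b. The model never sees
# g or dt directly; it only sees successive observations.
a, b = 0.0, 0.0
lr = 0.01
for _ in range(20000):
    da = db = 0.0
    for v, v_next in data:
        err = (a * v + b) - v_next   # prediction error on one transition
        da += err * v
        db += err
    a -= lr * da / len(data)         # average-gradient descent step
    b -= lr * db / len(data)
```

After training, `a` is close to 1 and `b` is close to `-g*dt`: the model has "discovered" gravity from data alone, which is the intuition behind a world model, scaled down to one scalar.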
The Death of model.fit(): What Data Scientists Actually Do in the Age of AI Agents
A few months ago, I joined a team building two AI-agent products.
My first week, I opened a Jupyter notebook out of habit. Then I closed it. There was no training set, no features to engineer, no model.fit(X_train, y_train) waiting to be called. The agents orchestrated foundation models. The “intelligence” came from a model someone else trained. The entire codebase was TypeScript. No notebooks, no model, no Python. The toolbox I’d spent years filling was, on its surface, irrelevant.
So what, exactly, was I supposed to do?
The answer turned out to be hiding in a simple framework.
Every AI agent has three layers. The foundation model provides raw intelligence. The engineering provides the body: tools, APIs, orchestration, and product surfaces. But the behavior of the agent – what it actually does when a user shows up – is shaped by the context, prompts, policies, schemas, and guardrails that surround the model. That’s the brain of the system. Not the neural network itself, but the cognitive architecture built on top of it.
Someone needs to own the quality of that brain: to make it legible, understand its failure modes, measure its consistency, map its weaknesses, and create the feedback loops that systematically make it smarter. That someone, it turns out, is the data scientist. Not as a model trainer, but as the team’s methodologist. — Read More
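What "owning the brain" can look like in code: a tiny eval harness that runs an agent over a prompt set and tallies pass rates for simple policy checks. This is our sketch, not the team's system; `fake_agent` and the check names are placeholders for a real model call and real guardrails.

```python
def fake_agent(prompt):
    # Placeholder: a real system would call a foundation model here.
    return {"answer": "42", "cites_source": "42" not in prompt}

# Each check inspects one structured reply and returns pass/fail.
POLICY_CHECKS = {
    "has_answer": lambda r: bool(r.get("answer")),
    "cites_source": lambda r: r.get("cites_source", False),
}

def evaluate(prompts):
    """Run the agent over a prompt set; report per-check pass rates."""
    tallies = {name: 0 for name in POLICY_CHECKS}
    for p in prompts:
        reply = fake_agent(p)
        for name, check in POLICY_CHECKS.items():
            tallies[name] += check(reply)
    return {name: n / len(prompts) for name, n in tallies.items()}
```

The point is the shape of the loop, not the checks themselves: a per-failure-mode scorecard is what makes an agent's behavior legible and improvable.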
Lossy self-improvement
Fast takeoff, the singularity, and recursive self-improvement (RSI) are all top of mind in AI circles these days. There are elements of truth to them in what’s happening in the AI industry. Two, maybe three, labs are consolidating as an oligopoly with access to the best AI models (and the resources to build the next ones). The AI tools of today are abruptly transforming engineering and research jobs.
AI research is becoming much easier in many ways. The technical problems that must be solved to scale the training of large language models even further are formidable, but superhuman coding assistants are making them approachable, overturning many earlier assumptions about what building these systems entails. Together, this is setting us up for a year (or more) of rapid progress at the cutting edge of AI.
We’re also at a time when language models are already extremely good. They’re in fact good enough for plenty of extremely valuable knowledge-work tasks. It’s hard to imagine language models taking another big step: it’s unclear which tasks they’re going to master this year outside of code and CLI-based computer use. There will be some new ones! These capabilities unlock new styles of working that’ll send more ripples through the economy.
These dramatic changes almost make it seem like a foregone conclusion that language models can then just keep accelerating progress on their own. The popular language for this is a recursive self-improvement loop. — Read More
Open Weights isn’t Open Training
When I was in college, my data structures professor told a story. It went something like this:
“When I was your age, I received an assignment, and encountered an inexplicable bug. I debugged and debugged and found that adding a print statement resolved the bug. I was young like all of you, and I was certain I’d found a bug in the C compiler. Turns out the problem was me.”
The takeaway was clear: if you have a bug, it’s your fault.
This is a good heuristic for most cases, but with open source ML infrastructure, you need to throw this advice out the window. There might be features that appear to be supported but are not. If you’re suspicious about an operation or stage that’s taking a long time, it may be implemented in a way that’s efficient enough…for an 8B model, not a 1T+ one. HuggingFace is good, but it’s not always correct. Libraries have dependencies, and problems can hide several layers down the stack. Even PyTorch isn’t ground truth.
Over the past couple of months, I worked on developing infrastructure to post-train and serve models cheaply. Ultimately, my team decided to develop a custom training codebase, but only after I spent a few days attempting to use existing open-source options. The following is an account of my successes and failures and what it means for open-weights models. — Read More
Teaching LLMs to reason like Bayesians
AI systems based on large language models (LLMs) are increasingly used as agents that interact with users and the world. To do this successfully, LLMs need to construct internal representations of the world and estimate the probability that each of these representations is accurate. Take personalized recommendations, for example: the LLM needs to gradually infer the user’s preferences from their choices over the course of multiple interactions.
Bayesian inference defines the optimal way to perform such updates. By implementing this strategy, LLMs could optimize user interactions by updating their estimates of the user’s preferences as new info about the user arrives. But without specific training, LLMs often default to simple heuristics — like assuming everyone wants the cheapest option — instead of inferring a specific user’s unique preferences.
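The update described above can be sketched concretely. The following is our own minimal example, not the paper's setup: a posterior over two user types ("price-sensitive" vs. "quality-seeking") updated after each observed choice. The likelihood numbers are invented for the illustration.

```python
# P(user picks the cheap option | user type) -- invented likelihoods.
likelihood = {"price": 0.9, "quality": 0.2}

def update(prior_price, choice):
    """Bayes' rule: posterior P(price-sensitive) after one choice."""
    p_cheap_price = likelihood["price"]
    p_cheap_quality = likelihood["quality"]
    if choice == "cheap":
        num = prior_price * p_cheap_price
        den = num + (1 - prior_price) * p_cheap_quality
    else:  # user chose the premium option
        num = prior_price * (1 - p_cheap_price)
        den = num + (1 - prior_price) * (1 - p_cheap_quality)
    return num / den

# Start agnostic, then watch the user pick premium twice, cheap once.
belief = 0.5
for choice in ["premium", "premium", "cheap"]:
    belief = update(belief, choice)
```

After two premium choices the belief in "price-sensitive" collapses well below the heuristic default of "everyone wants the cheapest option", which is exactly the behavior untrained LLMs tend to miss.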
In “Bayesian teaching enables probabilistic reasoning in large language models”, we teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of the Bayesian model, which defines the optimal way to reason about probabilities. We find that this approach not only significantly improves the LLM’s performance on the particular recommendation task on which it is trained, but also enables generalization to other tasks. This suggests that this method teaches the LLM to better approximate Bayesian reasoning. More generally, our results indicate that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains. — Read More
Beyond Language Modeling: An Exploration of Multimodal Pretraining
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models. — Read More
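The Transfusion-style objective the abstract references combines a next-token loss for text with a denoising loss for vision. A minimal numeric sketch of that combination, with invented inputs and a hypothetical balancing weight `lam` (not the paper's code or hyperparameters):

```python
import math

def next_token_ce(probs, targets):
    """Average -log p(correct next token) over a text sequence."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def diffusion_mse(pred_noise, true_noise):
    """Denoising objective: MSE between predicted and true noise."""
    n = len(true_noise)
    return sum((a - b) ** 2 for a, b in zip(pred_noise, true_noise)) / n

def combined_loss(probs, targets, pred_noise, true_noise, lam=1.0):
    """Text cross-entropy plus lam-weighted image diffusion loss."""
    return next_token_ce(probs, targets) + lam * diffusion_mse(pred_noise, true_noise)
```

One model, two losses: each training token contributes whichever term matches its modality, which is what lets a single backbone do both understanding and generation.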
Andrej Karpathy Just Built an Entire GPT in 243 Lines of Python
I’ve read many transformer implementations during my PhD. Dense codebases. Thousands of files. Dependencies stacked on top of dependencies. You open a repo, run pip install -r requirements.txt, and watch 400 packages download before you can even see your model train (then come the errors, the dependency conflicts, and so on).
Then on February 11, 2026, Andrej Karpathy dropped a single Python file that trains and runs a GPT from scratch. 243 lines. Zero dependencies. — Read More
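To give a feel for why a GPT fits in so few lines, here is the core routine any such implementation boils down to: causal (masked) self-attention in plain Python. This sketch is ours, not Karpathy's code.

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def causal_attention(q, k, v):
    """q, k, v: lists of per-token vectors. Token t attends only to
    tokens 0..t -- the causal mask that makes the model generative."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # scaled dot-product scores against current and earlier tokens
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        out.append([sum(w[s] * v[s][i] for s in range(t + 1))
                    for i in range(d)])
    return out
```

Stack this with token embeddings, an MLP, layer norm, and a sampling loop, and there simply isn't much left, which is how a full GPT squeezes into a couple hundred lines with zero dependencies.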
Master Any Skill Faster With an AI Learning System
You can learn almost anything online.
So why does it still feel slow?
Most “learning” is simply the collection of information. Tabs. Notes. Videos. Highlights.
But skill only grows when you do three things again and again:
Try → Get feedback → Try again.
AI can make that loop faster — if you use it like a system, not a chat. — Read More
Top 10 YouTube Channels for Learning AI in 2026
Around 2.5 billion people used YouTube in January 2025, and a decent chunk of them are trying to figure out this whole AI thing. The platform has quietly become the best place to learn artificial intelligence without spending thousands on courses or going back to school. You can find everything from mathematical breakdowns to practical coding tutorials, and most of it is actually free.
The problem is not finding content but finding good content. YouTube is full of channels that either oversimplify to the point of being useless or overcomplicate to the point where you need a PhD to follow along. After watching dozens of hours of AI tutorials and checking what people are actually recommending in 2026, I put together this list of ten channels that genuinely teach you something useful. — Read More