Rick's Cafe AI 10:16 am on June 30, 2026
Tags: Training ( 77 )

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets.
Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self, we introduce On-Policy Self-Distillation (OPSD), a learning algorithm where a single LLM acts as both teacher and student with different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student’s own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving superior token efficiency compared to reinforcement learning methods and better performance than off-policy distillation methods. — Read More

Code repo: this https URL.
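The training signal described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the prompt format in the hypothetical `build_contexts` helper is invented, and the per-token divergence is assumed to be reverse KL, KL(student ‖ teacher), computed from next-token logits at every position of the student's own sampled rollout. The paper's exact divergence and weighting may differ. In a real setup, the same model weights would score the rollout twice, once under each context.

```python
import math
from typing import List, Sequence, Tuple


def build_contexts(question: str, privileged_trace: str) -> Tuple[str, str]:
    """Hypothetical prompt format: only the teacher context sees the verified trace."""
    teacher_ctx = f"Question: {question}\nReference solution: {privileged_trace}\nAnswer:"
    student_ctx = f"Question: {question}\nAnswer:"
    return teacher_ctx, student_ctx


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) over the vocabulary at a single position."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def opsd_loss(
    student_steps: Sequence[Sequence[float]],
    teacher_steps: Sequence[Sequence[float]],
) -> float:
    """Mean per-token divergence over one student rollout.

    student_steps[t] and teacher_steps[t] are the next-token logits at position t
    of the student's own sampled continuation, scored under the student context
    and the teacher (privileged) context respectively.
    """
    if not student_steps or len(student_steps) != len(teacher_steps):
        raise ValueError("need equal-length, non-empty logit sequences")
    kls = [token_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)
```

When both contexts yield identical distributions the loss is zero, and it grows as the question-only student drifts from the privileged teacher. A common choice (assumed here, not confirmed by the abstract) is to treat the teacher logits as fixed targets and backpropagate only through the student pass.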

#training

Rick's Cafe AI 3:40 pm on June 29, 2026
Tags: Big7 ( 271 )

Why is Google suddenly losing AI talent? The lure of pre-IPO equity is strong.

If you want to see a person’s eyes light up at a Silicon Valley party, just say the words “pre-IPO equity.” It works.

Google’s sudden AI talent losses may have less to do with dissatisfaction and more to do with a timeless Silicon Valley calculation: where the biggest equity upside lives. — Read More

#big7

Rick's Cafe AI 2:21 pm on June 29, 2026
Tags: Strategy ( 598 )

Lean Software Scaling Laws

Research proposal for measuring how coding LLM perplexity scales with codebase context size, using Lean as a test case for whether formal languages have better predictability exponents and could lead to safer, more secure software worldwide.

Research idea: empirically measure how coding LLM perplexity scales with codebase size, and use these measurements to estimate scaling laws of ‘predictability’ by programming language or other factors. Better predictability should, in turn, translate into better overall security and safety. — Read More
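The fitting step of such a study might look like the sketch below. It assumes per-token loss follows a pure power law L(C) = a·C^(-α) in codebase context size C, with no irreducible-loss term, and fits it by least squares in log-log space. The function name is hypothetical, and the (C, L) pairs would have to come from actually evaluating a coding LLM on prefixes of real codebases; comparing the fitted α across languages (for example Lean versus a mainstream language) is one way to read the proposal's "predictability exponent".

```python
import math
from typing import Sequence, Tuple


def fit_power_law(context_sizes: Sequence[float], losses: Sequence[float]) -> Tuple[float, float]:
    """Fit L(C) = a * C**(-alpha) by ordinary least squares on (ln C, ln L).

    Returns (a, alpha). Assumes a pure power law with no irreducible-loss
    offset; all inputs must be positive and at least two sizes must differ.
    """
    if len(context_sizes) != len(losses) or len(context_sizes) < 2:
        raise ValueError("need at least two (size, loss) pairs")
    if any(c <= 0 for c in context_sizes) or any(l <= 0 for l in losses):
        raise ValueError("sizes and losses must be positive")
    xs = [math.log(c) for c in context_sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("context sizes must not all be equal")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # ln L = ln a - alpha * ln C, so alpha is the negated slope.
    return math.exp(intercept), -slope
```

A larger fitted α means loss falls faster as more of the codebase is visible; a fuller study would likely add an irreducible-loss term and confidence intervals before comparing languages.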

#strategy

Rick's Cafe AI 2:17 pm on June 29, 2026
Tags: Performance ( 122 )

Discretizing Reward Models

Despite their widespread use, the role of reward models in shaping reinforcement learning is poorly understood. Reward models offer a tempting promise: they automatically estimate response quality in the absence of verifiers or human judges. Unlike “verifiable rewards,” which typically produce binary scores, reward models usually produce continuous scores, allowing them to be sensitive to fine-grained differences in responses. However, we show this apparent strength is a serious weakness: many popular reward models are oversensitive, assigning different scores to equally good responses. Theoretically, we show that seemingly perfect reward models can be highly oversensitive; empirically, this oversensitivity can lead to bad policies. In place of existing notions of “reward model accuracy,” we propose evaluating reward models using distinct measures of “discriminative ability” and “specificity” (the complement of oversensitivity). As a solution, we describe a training-free algorithm that uses Monte Carlo dropout on any neural reward model to produce discrete reward clusters. Theoretically, we prove there exist discretizations that reduce oversensitivity at minimal expense of discriminative ability; empirically, we show, in both controlled and natural RL settings, that discretizing rewards leads to less reward hacking and better policies than training on the original rewards. — Read More
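To make the idea of turning Monte Carlo dropout scores into discrete reward clusters concrete, here is a hedged sketch. The abstract does not specify the clustering rule, so the one below is a hypothetical stand-in: responses are sorted by mean dropout score, and a new reward level starts only when the gap to the previous response exceeds `z` times their combined dropout standard deviation. The function name and the `z` parameter are illustrative, not the paper's.

```python
import math
import statistics
from typing import List, Sequence


def discretize_rewards(dropout_scores: Sequence[Sequence[float]], z: float = 2.0) -> List[int]:
    """Map each response to an integer reward level (hypothetical rule).

    dropout_scores[i] holds K >= 2 scores for response i, each from one forward
    pass of the reward model with dropout kept active (Monte Carlo dropout).
    """
    if not dropout_scores:
        return []
    stats = [(statistics.fmean(s), statistics.stdev(s)) for s in dropout_scores]
    order = sorted(range(len(stats)), key=lambda i: stats[i][0])
    levels = [0] * len(stats)
    level = 0
    for prev, cur in zip(order, order[1:]):
        (mu_p, sd_p), (mu_c, sd_c) = stats[prev], stats[cur]
        # Neighbours whose mean gap is within the combined dropout noise are
        # treated as equally good and share a level.
        if mu_c - mu_p > z * math.sqrt(sd_p ** 2 + sd_c ** 2):
            level += 1
        levels[cur] = level
    return levels


# Two near-identical responses share level 0; a clearly better one gets level 1.
print(discretize_rewards([[1.0, 1.1, 0.9], [1.05, 0.95, 1.0], [3.0, 3.1, 2.9]]))
```

A larger `z` merges more responses into coarser levels (less oversensitivity, less discrimination), which mirrors the trade-off between specificity and discriminative ability that the abstract describes.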

#performance

Rick's Cafe AI 2:15 pm on June 29, 2026
Tags: Strategy ( 598 )

The next big breakthrough will be AIs learning on the job

Here’s the big research bet the labs are making currently: if we train AIs to accomplish millions of verifiable tasks across thousands of diverse RL environments, then we’ll basically have built AGI. The premise is that such training will create general problem-solving skills (like how to make progress on an open-ended task for weeks on end in the face of errors, mistakes, and ambiguity).

The people optimistic about this vision would say that anything we might consider a fundamental deficit of the current learning paradigm (for example, data inefficiency and lack of continual learning) can be steamrolled by just scaling training more, just as all the supposed “fundamental” research problems in natural language processing collapsed against the flood of compute thrown into LLMs. — Read More

#strategy

Rick's Cafe AI 12:53 pm on June 29, 2026
Tags: DevOps ( 385 )

“Bring it to our shop”: Workday’s pitch for keeping AI agents close to your most valuable data

Workday, the payroll and HR data platform, has been pursuing AI and agents for a while, but while other businesses may allow a little room for error, getting a payroll run in Workday 99% right is not exactly good enough.

… “There aren’t many systems that are more critical — or less forgiving — than ones that are dealing with people and money,” says Gabe Monroy, Workday’s CTO, in an interview. There’s no tolerance for “well, it works most of the time.” — Read More

#devops

Rick's Cafe AI 12:42 pm on June 29, 2026
Tags: Strategy ( 598 )

Forget AGI. The real prize is enterprise AGI

We believe much of the artificial intelligence industry is chasing the wrong prize. Frontier model vendors, such as Anthropic and OpenAI Group, may have shifted their commercial focus toward enterprise customers, but they’ve not changed their fundamental architecture.

Specifically, they’re still trying to concentrate ever more intelligence inside a generalized model. We agree with Databricks Inc. Chief Executive Ali Ghodsi that the practical definition of artificial general intelligence has actually been achieved. Moving the goalpost to superintelligence — or what we’ve called Messiah AGI in a prior Breaking Analysis — does little to create differentiation for enterprise customers.

The real prize as we see it is what we call enterprise AGI. What do we mean by that? Specifically, we’re talking about intelligence that is unique to and owned by each enterprise. — Read More

#strategy

Rick's Cafe AI 12:13 pm on June 29, 2026
Tags: Strategy ( 598 )

Genie’s Not Going Back in the Bottle

I was recently watching a Demis Hassabis interview, and one line of his has stuck with me. Stanford President Jonathan Levin asked him, “Demis, if you were back in school, what would you be studying, and what would be your advice to students about what to study in their careers?”

… Many prominent commencement speakers were met with backlash from students on stage for discussing and using artificial intelligence. This reflects a growing perception among students that AI will wipe out all jobs.

Current evidence suggests otherwise. — Read More

#strategy

Rick's Cafe AI 12:08 pm on June 29, 2026
Tags: Strategy ( 598 )

AI Is Changing Software Jobs, but Most Developers Are Still Needed

For the past year, the software industry has been flooded with one message: AI is getting so good at writing code that developers may soon become unnecessary.

You hear it everywhere. AI will replace programmers. Teams will shrink. Product owners will type one prompt and fully working software will appear on the other side. In that version of the future, developers are either already obsolete or about to be.

I do not claim to have insider knowledge about what is happening inside every company. I have spent the last five years working as a solo developer, so I have not been embedded in large engineering teams, watching their workflows change firsthand.

But I have been job hunting recently, and that experience gave me a different view. — Read More

#strategy

Rick's Cafe AI 12:03 pm on June 29, 2026
Tags: Strategy ( 598 )

Your Coding Skills Are Losing Value Faster Than You Think

For a long time, developers could build a career around becoming unusually good at implementation.

You learned a framework deeply enough to predict its failure modes. You got fast at tracing production bugs through a stack nobody had documented properly. Give you a vague ticket and you could infer the missing requirements, then ship something sane without leaving the codebase worse than you found it.

Teams valued those skills because they were scarce. More importantly, replacing them was expensive.

The market is repricing them now. — Read More

#strategy

Rick's Cafe AI

The latest in Artificial Intelligence carefully curated into its own special blend
