If you’re building AI products on top of closed models, anyone with an API key can get similar capabilities. Lasting differentiation comes from proprietary data, the training recipe, the infrastructure, and the speed of iteration.
Shopify has something most companies don’t: a product surface where millions of merchant interactions directly signal whether the model’s output is any good. That feedback loop is the foundation, but only if you keep learning from it.
We fine-tuned a tool-calling agent to turn natural language into a Shopify Flow for Sidekick, our AI commerce assistant. It’s 2.2x faster, 68% cheaper, and outperforms closed models. — Read More
Google is testing AI chatbot search for YouTube
Google is bringing conversational AI search to YouTube, marking the company’s latest push to infuse its products with AI-powered discovery tools. The feature, dubbed “Ask YouTube,” started rolling out to YouTube Premium subscribers in the US today as an experimental test. It transforms the platform’s search bar into a chatbot-style interface that pulls results from longform videos, Shorts, and text summaries – essentially giving YouTube its own version of Google’s AI Mode for search. — Read More
Can agents replace the search stack?
How is search implemented where you work? Probably as a complex set of capabilities on top of retrieval. Our search APIs understand queries, call backend search systems, and finally rerank results.
But if we had an agent in the loop, would we need all that? Could we replace search backends with an agent? After all, an agent understands user requests, calls retrieval tools, and evaluates relevance on its own. We see ChatGPT do this all the time, so why can’t our search bar?
In other words, if you give a basic BM25 backend to an agent, could it take the Search API’s job? — Read More
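To make the "basic BM25 backend" idea concrete, here is a minimal sketch of the kind of retrieval tool an agent could be handed. It is not the linked article's implementation: the function names (`tokenize`, `bm25_rank`, `search_tool`), the whitespace tokenizer, and the parameter defaults `k1=1.5` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, no stemming.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Fall back to 1.0 so empty corpora of empty strings do not divide by zero.
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = tokenize(query)
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def search_tool(query, docs, top_k=3):
    """Hypothetical tool an agent could call: returns the top_k matching docs."""
    ranked = bm25_rank(query, docs)
    return [docs[i] for i, score in ranked[:top_k] if score > 0]


if __name__ == "__main__":
    catalog = [
        "red running shoes",
        "blue denim jacket",
        "trail running shoes for men",
    ]
    print(search_tool("running shoes", catalog))
```

In this framing the agent, rather than the Search API, would decide how to rewrite the query, how many times to call `search_tool`, and which results are relevant enough to return.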
Can AI Attack the Cloud? Lessons From Building an Autonomous Cloud Offensive Multi-Agent System
The offensive capabilities of large language models (LLMs) have until recently existed as theoretical risks – frequently discussed at security conferences and in conceptual industry reports, but rarely discovered in practical exploits. However, in November 2025, Anthropic published a pivotal report documenting a state-sponsored espionage campaign. In this operation, AI didn’t just assist human operators – it became the operator, performing 80-90% of the campaign autonomously, at speeds that no human team could match.
This disclosure shifted the conversation from “could this happen?” to “this is happening.” But it also raised practical questions: Can AI actually operate autonomously end-to-end, or does it still require human guidance at each decision point? Where do current LLM capabilities excel, and where do they fall short compared to skilled human operators?
To answer these questions, we built a multi-agent penetration testing proof of concept (PoC), designed to empirically test autonomous AI offensive capabilities against cloud environments. — Read More
Agent Auth: Why OAuth Wasn’t Built for This: Where this leaves builders
Authentication is converging around known primitives. Authorization across trust domains is not.
Knowing that an agent is who it claims to be is one problem. Knowing what it is actually allowed to do during a specific task, and producing proof of that afterward, is harder. OWASP’s MCP Top 10 and A2A’s signed Agent Cards address pieces of this, as does the WIMSE architecture. No single specification covers the full chain from identity through intent to audit trail.
The infrastructure gap extends beyond the auth layer itself. — Read More
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
When language models (LMs) are trained via reinforcement learning (RL) to generate natural language “reasoning chains”, their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or “hallucinate”) in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score — a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations — outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models. — Read More
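The reward described in the abstract can be sketched in a few lines. This is one plausible reading, not a confirmed reproduction of the paper's code: it assumes the Brier term is subtracted from the binary correctness score as a penalty, and the function names `brier_score` and `rlcr_reward` are hypothetical.

```python
def brier_score(confidence, correct):
    """Squared error between stated confidence and the 0/1 outcome."""
    return (confidence - float(correct)) ** 2


def rlcr_reward(confidence, correct):
    """Binary correctness minus a Brier penalty (assumed combination).

    confidence: the model's verbalized probability that its answer is right.
    correct: whether the answer actually matched the reference.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return float(correct) - brier_score(confidence, correct)


if __name__ == "__main__":
    # A confident wrong answer is punished more than a hedged wrong answer,
    # which a purely binary reward cannot distinguish.
    for confidence, correct in [(1.0, True), (0.5, True), (0.5, False), (1.0, False)]:
        print(confidence, correct, rlcr_reward(confidence, correct))
```

Under a plain binary reward, both wrong answers in the loop above would score 0. With the Brier penalty, guessing with high confidence costs more than admitting uncertainty, which is the incentive toward calibration that the abstract describes.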
Introducing talkie: a 13B vintage language model from 1930
Have you ever daydreamed about talking to someone from the past? What would you ask someone with no knowledge of the modern world? What would they ask you? While we don’t have time machines yet, we can simulate this experience by training, in Owain Evans’s phrase, ‘vintage’ language models: LMs trained only on historical text.
These models are fascinating conversation partners (watch Claude prompt talkie, our 13B 1930 LM, in the widget above). But we are also excited by the possibility that the careful study of the behaviors and capabilities of vintage LMs will advance our understanding of AI in general. — Read More
The prefrontal cortex controls memory organization in the hippocampus
Prior memories can be integrated with novel experiences during learning to facilitate memory organization. This process must be tightly regulated to prevent inappropriate integration of unrelated memories. However, the biological mechanisms underlying such control are currently unknown. Using multiple imaging, chemogenetic and optogenetic techniques in mice, we demonstrate that the ventromedial prefrontal cortex is recruited over time to control memory integration in the hippocampus according to contextual similarities between experiences. This control is achieved through direct projections to the medial entorhinal cortex that modulate entorhinal activity, ensemble overlap in the dorsal hippocampus, memory linking, activity of neurogliaform cells in the dorsal CA1 and memory allocation. Together, our results provide new insights into the mechanisms controlling crucial processes of memory organization in the mammalian brain. — Read More
The Space Between Humans, AI, and the Work We’ve Been Avoiding
It can be hard to tell what’s real these days between the productivity/token maxxing and robot apocalypse – terrorizing our eyeballs with messages that the machine is either perfect or complete garbage. While technology is moving faster than I have ever seen in my lifetime, I can’t help but think we are applying it to solve our non-technical problems.
The cracks were always there. AI just made them visible.
At Monki Gras 2026, Laura Tacho called out that what holds us back are our human and systems-level constraints. Not the technology – Us and the ways in which we organize and communicate (or don’t). — Read More
I’m Sorry Dave, This Request Triggered Restrictions On Violative Cyber Content
In mid-April 2026, Context.ai was breached and used as a pivot into a Vercel employee’s Google Workspace account. From there, the threat actor pivoted into Vercel’s production environment. Vercel’s CEO Guillermo Rauch provided an update that is more noteworthy than the breach itself. In a tweet providing more details he said:
We believe the attacking group to be highly sophisticated and, I strongly suspect, significantly accelerated by AI. They moved with surprising velocity and in-depth understanding of Vercel.
Anyone doing red team work already knows this. — Read More