In this post, I will present an algorithm that was able to compute an optimal tokenizer in some settings. This result is cool because optimal tokenization is theoretically intractable, but seems to be solvable in practice. My finding is very similar to various results on the Traveling Salesman Problem (TSP), where even difficult instances can be solved optimally using cutting-plane techniques.
I’ll highlight that, while this result is cool, there are a few reasons that it isn’t necessarily useful. First, the existing state of the art was already somewhat close to optimal (often within 1%). Second, even if a tokenizer is optimal on the training data, it may not generalize as well as other tokenizers when evaluated on held out test data. Finally, inefficient tokenizers are basically fine: you can pay for the cost of a less efficient tokenizer by slightly increasing your vocabulary size. — Read More
The Agentic Reckoning: Enterprise AI organizations have a runtime problem, not a model problem
In Q1 2026, VentureBeat’s Pulse Research surfaced the “Governance Mirage”: the gap between the governance org charts enterprises had drawn and the control layers they had actually built. Forty-three percent said a central team owned AI governance; 23% couldn’t agree on who owned it at all; and 31% named vendor opacity as the single biggest obstacle.
This new wave of research asks the next question: Once you’ve admitted the governance problem, what breaks first when you try to fix it? The answer from our respondents is unambiguous. The failure point is not the model. It’s the runtime. — Read More
A Guide to AI Inference Engineering
Every time an LLM generates a response, two operations run in sequence on the same GPU. The first processes the input prompt and emits a single token. The second produces every token after that, one at a time.
From the outside, they look like stages of one process. However, inside the hardware, they have opposite bottlenecks. One is limited by raw compute. The other is limited by how fast data moves through memory. Most of the engineering work that makes production AI systems fast exists because of this split, and the techniques used to handle it are what inference engineering is built around.
Inference engineering is the discipline of running trained AI models in production efficiently. The work spans low-level GPU code, model serving frameworks, and the cloud infrastructure that ties them together. — Read More
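The compute-versus-memory split described in this excerpt can be made concrete with a back-of-envelope roofline estimate. The sketch below is not from the linked guide. It assumes roughly 2 FLOPs per parameter per token, weights read from memory once per forward pass, and it ignores activation and KV-cache traffic. The function names and the hardware figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) are hypothetical placeholders, not the specs of any real GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs per parameter per token, and each weight is read
    once per pass. Activations and KV-cache traffic are ignored.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bottleneck(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
               bytes_per_param: float = 2.0) -> str:
    """Return "compute" or "memory" depending on which side of the ridge we land."""
    # Ridge point: the intensity at which compute and memory take equal time.
    ridge = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity >= ridge else "memory"


# Hypothetical accelerator: 1e15 FLOP/s and 3e12 bytes/s (illustrative only).
PEAK_FLOPS = 1.0e15
MEM_BANDWIDTH = 3.0e12

# Prefill handles the whole prompt in one pass; decode handles one token per pass.
prefill_limit = bottleneck(1024, PEAK_FLOPS, MEM_BANDWIDTH)
decode_limit = bottleneck(1, PEAK_FLOPS, MEM_BANDWIDTH)

print("prefill is limited by:", prefill_limit)
print("decode is limited by:", decode_limit)
```

Under these assumptions, a 1,024-token prefill does about 1,024 FLOPs per byte of weights read, well above the ridge point of roughly 333, while single-token decode does about 1 FLOP per byte, far below it. Batching several decode requests together raises `tokens_per_pass`, which is one reason batching tends to help decode throughput.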
Doing nothing at work
Many engineers should be doing less work. I don’t necessarily mean producing less code or fewer changes, but literally working fewer hours in the day. When they do work, they should be working at a slower pace. I like to aim to be running at 80% utilization by default: unless I have a high-pressure project going on, I spend 20% of my workday away from the computer.
Why? Performance at tech companies is dominated by outlier events. When I think about the most impactful changes I’ve made, many of them involved a surprisingly trivial amount of work. There are no points for effort in software development. What matters is solving the right problem at the right time. — Read More
Dreaming: Better memory for a more helpful ChatGPT
Today we’re beginning to roll out a more capable and scalable system for synthesizing memory, developed to tackle the staleness, correctness, and scalability challenges that we observe when memory is applied to the hundreds of millions of users and multi-year time horizons in ChatGPT.
Memory is what helps ChatGPT learn your preferences, projects, and constraints, allowing future conversations to start from shared context rather than from scratch.
Over the last two years, memory has grown into a critical part of the ChatGPT experience, helping ChatGPT better understand your context so it can help you accomplish meaningful goals over time. This is central to making ChatGPT more useful: knowing you, helping you, and doing more for you. — Read More
All major AI models violate EU regulations — study
All of the big AI models violate EU rules on AI and data protection to varying degrees, according to the nonprofit research foundation Aithos.
Aithos tested the models using its own tool, LARA (Legal Assessment for Real-world Agents), which simulates real-world situations where AI assistants may find themselves in legally questionable situations, according to The Register. The tests measure compliance with the GDPR and the EU’s AI Regulation, among other things, and found that the models collected user data without proper consent, attempted to manipulate vulnerable individuals, or created psychological profiles of users. — Read More
Alibaba’s Metis Agent Reduces AI Tool Calls and Enhances Accuracy
Alibaba’s Metis agent represents a significant advancement in AI operational efficiency by utilizing a novel reinforcement learning framework called Hierarchical Decoupled Policy Optimization (HDPO). This innovative approach has drastically cut down redundant tool invocations from 98% to just 2%, while simultaneously improving reasoning accuracy across industry benchmarks. This breakthrough not only tackles operational bottlenecks but also challenges traditional AI training methods, promoting a more cost-effective and responsive AI deployment in business applications. — Read More
Geometric AI does not need attention
I got the idea for this post when I had a virtual coffee with an engineer who builds AI models for one of the big airplane builders. He hasn’t built a model that writes your emails or hallucinates your legal documents; his model does something different. It looks at, say, a winglet (the little upturned fin at the tip of every commercial aircraft wing) and predicts the turbulence it will generate with 98% accuracy.
Let that sit for a moment.
… I walked away from that coffee thinking about wave interference. Because turbulence is, at its core, a wave problem. Pressure waves, superimposed, creating chaotic but geometrically structured patterns. And if a model can learn those patterns in aerodynamics, the obvious question is, where else do superimposed wave systems produce instability that we desperately need to control? — Read More
How to achieve truly serverless GPUs
We are in the age of inference. Billion- to trillion-parameter neural networks are run on specialized accelerators at quadrillions of operations per second to generate media, author software, and fold proteins at massive scale.
Inference workloads are more variable and less predictable than the training workloads that previously dominated. That makes them a natural fit for serverless computing, where applications are defined at a level above the (virtual) machine so that they can be more readily scaled up and down to handle variable load.
But serverless computing only works if new replicas can be spun up quickly — as fast as demand changes, which can be at the scale of seconds. Naïvely spinning up a new instance of, say, SGLang serving a billion-parameter LLM on a B200 can take tens of minutes or stall for hours on GPU availability.
At Modal, we’ve done deep engineering work over the last five years to solve this problem. In this blog post, we walk through what we did. — Read More
SubQ AI Explained: How Good Is the 12M Context Window LLM?
On May 5, 2026, a tiny Miami-based startup called Subquadratic released a model named SubQ. The team is small, but they’ve raised $29M in seed funding and claim the model can process up to 12 million tokens in a single pass.
They have also made other crazy-sounding claims: that their model is up to 52 times more efficient than FlashAttention at 1M tokens, and that it achieves coding performance similar to Claude Opus at roughly 1/20th of the cost.
These are big statements, so it makes sense to break this down and see what’s actually going on. In this piece, I’ll walk through what SubQ is, how the architecture works, and what the early details and developer communities suggest about these claims. — Read More