Every frontier model in 2026 advertises a context window of at least a million tokens, but almost none of them make effective use of all that information. On MRCR v2, the multi-reference retrieval benchmark that labs report, the best model is GPT-5.5 at 74.0%; others, such as Claude Opus 4.7 at 32.2%, are far behind.
At this point, a million tokens appears to be the ceiling on the context windows the major frontier labs offer. One major reason is the same one that has shaped every transformer-based model since 2017: attention cost scales quadratically with context length, so doubling the input quadruples the work. RAG, agentic decomposition, hybrid model architectures, and every other workaround the industry has built are essentially tradeoffs made to get around this limit.
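That quadratic cost is visible in the shape of the attention score matrix. A minimal NumPy sketch of single-head attention (illustrative only, not any lab's production kernel):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention; the (n, n) score matrix is where the quadratic cost lives."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # (n, n) matrix: n^2 * d multiply-adds
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # another O(n^2 * d) matmul

d = 128
for n in (1_000, 2_000):                          # doubling n quadruples the score-matrix work
    Q = K = V = np.random.randn(n, d).astype(np.float32)
    _ = naive_attention(Q, K, V)
    print(f"{n:>5} tokens -> {n * n * d:,} multiply-adds just for QK^T")
```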
Subquadratic, a Miami-based startup, launched its first model on Tuesday and claims it can get around all of this, now offering a model that can handle a context window of 12 million tokens. What’s more, the company says it plans to offer a model with a 50-million-token context window soon. — Read More
Computer use is 45x More Expensive Than Structured APIs
We ran a benchmark comparing two ways of letting an AI agent operate the same admin panel, with the goal of putting a price tag on vision agents (browser-use, computer-use).
Here is what we measured, what we had to change to make the vision agent work at all, and what changes when generating an API surface stops being a separate engineering project. — Read More
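A rough back-of-envelope shows why the gap gets so large: a vision agent pays for a screenshot's worth of image tokens on every step, while a structured API call costs a few hundred text tokens. Every number below is invented for illustration, not one of the benchmark's measurements:

```python
# Back-of-envelope only; every figure here is an assumption, not a measurement.
VISION_STEPS = 25               # screenshot -> reason -> click, repeated per task
TOKENS_PER_SCREENSHOT = 1_200   # image tokens for one admin-panel screenshot
TOKENS_PER_REASONING = 400      # text tokens of reasoning per step

API_CALLS = 3                   # structured calls against a generated API surface
TOKENS_PER_CALL = 300           # schema + arguments + response

vision_tokens = VISION_STEPS * (TOKENS_PER_SCREENSHOT + TOKENS_PER_REASONING)
api_tokens = API_CALLS * TOKENS_PER_CALL

print(f"vision agent: {vision_tokens:,} tokens/task")
print(f"API agent:    {api_tokens:,} tokens/task")
print(f"ratio:        {vision_tokens / api_tokens:.0f}x")
```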
Salesforce launches Agentforce Operations to fix the workflows breaking enterprise AI
Enterprise AI teams are hitting a wall — not because their models can’t reason, but because the workflows underneath them were never built for agents. Tasks fail, handoffs break, and the problem compounds as organizations push agents deeper into back-office systems. A new architectural layer is emerging to address it: workflow execution control planes that impose deterministic structure on processes agents are expected to run.
One of the companies bringing this to the forefront is Salesforce, with a new workflow platform that turns back-office workflows into a set of tasks for specialized agents to complete. Users can upload their own processes or use one of the preset Blueprints Salesforce provides, and Agentforce Operations will break them down into tasks for agents. — Read More
Challenges and Research Directions for Large Language Model Inference Hardware
Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speed up communication. While our focus is datacenter AI, we also review their applicability to mobile devices. — Read More
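Why decode is memory-bound rather than compute-bound falls out of simple arithmetic: at low batch sizes, every generated token streams the full weight footprint from memory. A rough sketch with assumed numbers (a dense 70B model, FP16 weights, an HBM3e-class accelerator), not figures from the paper:

```python
# Illustrative decode-phase roofline; model size and bandwidth are assumptions.
params = 70e9                   # dense 70B-parameter model
bytes_per_param = 2             # FP16 weights
weight_bytes = params * bytes_per_param            # bytes read per generated token (batch 1)

hbm_bandwidth = 3.35e12         # ~3.35 TB/s for an HBM3e-class accelerator

tokens_per_second_ceiling = hbm_bandwidth / weight_bytes
print(f"weights streamed per token: {weight_bytes / 1e9:.0f} GB")
print(f"bandwidth-bound ceiling:    ~{tokens_per_second_ceiling:.0f} tokens/s per sequence")
# The FLOPs for that token are tiny by comparison, which is why the proposals above
# (High Bandwidth Flash, Processing-Near-Memory, 3D stacking, faster interconnect)
# attack memory and communication rather than compute.
```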
Building Hierarchical Agentic RAG Systems: Multi-Modal Reasoning with Autonomous Error Recovery
Enterprise AI teams face a persistent challenge: Most Retrieval-Augmented Generation (RAG) systems excel at either structured data queries or document search, but struggle when both are required simultaneously. A financial analyst who asks “Why are European operations underperforming?” needs data from both SQL databases (revenue, margins, and employee counts) and unstructured documents (market reports, competitive analysis, regulatory filings). Current RAG systems might return revenue data without regulatory context or surface market reports without quantitative validation, leaving analysts to manually bridge the gap. Current RAG approaches treat these modalities as separate concerns, forcing engineers to build custom orchestration layers or accept incomplete answers.
This article explores architectural patterns for solving the modality gap through hierarchical multi-agent orchestration, using Protocol-H as a reference implementation to illustrate these concepts in practice. The patterns discussed (a supervisor-worker topology with autonomous error recovery) build on LangGraph/LangChain agentic patterns used by teams at companies like xAI and Databricks. The accompanying open source code demonstrates these patterns deployed at enterprise scale with Docker/K8s, though readers can apply the same architectural principles using their preferred frameworks.
The architecture described in this article is based on a reference implementation and production-oriented experimentation with enterprise datasets; specific deployment details have been generalized to focus on the architectural patterns rather than any particular system implementation. — Read More
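For a sense of the supervisor-worker pattern the article describes: a supervisor routes one question to both a structured-data worker and a document worker, retrying failures before synthesizing an answer. The sketch below is a stripped-down plain-Python illustration; the worker names are hypothetical, and the actual reference implementation (Protocol-H, built on LangGraph/LangChain) is linked from the article.

```python
# Minimal sketch of a supervisor-worker topology with retry-based error recovery.
from dataclasses import dataclass

@dataclass
class Result:
    ok: bool
    content: str

def sql_worker(question: str) -> Result:
    # Would translate the question to SQL and query the warehouse.
    return Result(ok=True, content="revenue/margin rows for EU operations")

def document_worker(question: str) -> Result:
    # Would run vector retrieval over market reports and regulatory filings.
    return Result(ok=True, content="regulatory and competitive context")

def supervisor(question: str, max_retries: int = 2) -> str:
    workers = [sql_worker, document_worker]      # routing: both modalities are needed
    answers = []
    for worker in workers:
        for _ in range(max_retries + 1):
            result = worker(question)
            if result.ok:                        # autonomous recovery: retry failed workers
                answers.append(result.content)
                break
    return " | ".join(answers)                   # a real system would synthesize, not join

print(supervisor("Why are European operations underperforming?"))
```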
Scaling Karpathy’s Autoresearch: What Happens When the Agent Gets a GPU Cluster
We pointed Claude Code at autoresearch and gave it access to 16 GPUs on a Kubernetes cluster. Over 8 hours it submitted ~910 experiments, found that scaling model width mattered more than any single hyperparameter, taught itself to use H200s for validation while screening ideas on H100s, and drove val_bpb from 1.003 down to 0.974 – a 2.87% improvement over baseline.
Beyond raw speedup, parallelism changed how the agent searched. With one GPU, it’s stuck doing greedy hill-climbing – try one thing, check, repeat. With 16 GPUs, it ran factorial grids of 10-13 experiments per wave, catching interaction effects between parameters that sequential search would miss. For example, the agent tested six model widths in a single wave, saw the trend immediately, and zeroed in on the best one – one round instead of six. — Read More
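For a sense of what one of those waves looks like: a factorial grid is just the cross-product of a few hyperparameter settings submitted at once. The parameter names and values below are invented for illustration, not taken from the run.

```python
# Illustrative "wave": submit a small factorial grid in parallel
# instead of one greedy change at a time.
from itertools import product

widths = [512, 768, 1024]
learning_rates = [3e-4, 6e-4]
warmups = [100, 500]

wave = [
    {"width": w, "lr": lr, "warmup": wu}
    for w, lr, wu in product(widths, learning_rates, warmups)
]
print(len(wave), "experiments in one wave")   # 12 configs land on the cluster at once

# A sequential hill-climber would need ~12 rounds to see the same interactions;
# with 16 GPUs the whole grid returns in a single round.
```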
Anthropic’s Compute Advantage: Why Silicon Strategy is Becoming an AI Moat
Compute is not a commodity for frontier AI labs. It is a structural cost input that determines margin, throughput, and model iteration velocity at scale. The divergence in how Anthropic, OpenAI, and Microsoft have approached silicon procurement over the last 18 months is not just a supply chain story — it is a compounding strategic gap.
Anthropic has built what is today the most diversified and cost-efficient compute architecture among frontier AI labs. OpenAI remains almost entirely dependent on Nvidia. Microsoft’s internal chip program is years behind schedule. The structural implications favor Anthropic on unit economics and negotiating leverage as inference workloads scale. While demand has outrun supply and Anthropic has struggled with uptime, its long-term strategy is the most fundamentally resilient.
One important caveat up front: compute advantage amplifies model advantage; it does not replace it. If a competitor’s models are materially better, customers absorb the higher token cost. The argument here is not that Anthropic wins because of infrastructure. The argument is that equivalent model quality delivered at 30–60% lower cost per token is a compounding advantage — on margin, on training budget, and on the pace of iteration. — Read More
Kimi K2.5
… Artificial Analysis calls Kimi the new leading open weights model, ‘now closer than ever to the frontier’ behind only OpenAI, Anthropic and Google.
Kimi K2.5 gets to top some benchmarks: HLE-Full with tools (50%), BrowseComp with Agent Swarm (78%), OCRBench (92%), OmniDocBench 1.5 (89%), MathVista (90%) and InfoVQA (93%). It is not too far behind on AIME 2025 (96% vs. 100%), SWE-Bench (77% vs. 81%) and GPQA-Diamond (88% vs. 92%).
[B]enchmarks are highly useful, but easy to overinterpret.
Inference is cheap, and speed is similar to Gemini 3 Pro, modestly faster than Opus. — Read More
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic N-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone’s early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models. — Read More
Engram: How DeepSeek Added a Second Brain to Their LLM
When DeepSeek released their technical reports for V2 and V3, the ML community focused on the obvious innovations: massive parameter counts, clever load balancing, and Multi-head Latent Attention. But buried in their latest research is something that deserves more attention: a different way to think about what an LLM should remember.
The insight is deceptively simple. Large language models spend enormous computational effort reconstructing patterns they’ve seen millions of times before. The phrase “United States of” almost certainly ends with “America.” “New York” probably precedes “City” or “Times.” These patterns are burned into the training data, and the model learns them, but it learns them the hard way: by propagating gradients through billions of parameters across dozens of layers.
What if you could just look them up? — Read More
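A toy sketch of that lookup idea: hash the trailing n-gram of token IDs into a large static embedding table and add the result to the hidden state, so the model retrieves a memorized pattern instead of recomputing it layer by layer. This is illustrative only; the table size, hashing scheme, and placement are invented and are not DeepSeek's Engram implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramLookup(nn.Module):
    """Toy hashed n-gram memory: an O(1) table lookup added to the hidden state."""
    def __init__(self, table_size: int = 1 << 16, d_model: int = 256, n: int = 3):
        super().__init__()
        self.n, self.table_size = n, table_size
        self.table = nn.Embedding(table_size, d_model)    # static memory, no deep compute

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq) ints; hidden: (batch, seq, d_model)
        padded = F.pad(token_ids, (self.n - 1, 0))        # left-pad so every position has an n-gram
        bucket = torch.zeros_like(token_ids)
        for i in range(self.n):                           # simple rolling hash over the n-gram
            bucket = bucket * 1_000_003 + padded[:, i : i + token_ids.shape[1]]
        return hidden + self.table(bucket % self.table_size)

ids = torch.randint(0, 32_000, (2, 16))
hidden = torch.randn(2, 16, 256)
print(NGramLookup()(ids, hidden).shape)                   # torch.Size([2, 16, 256])
```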