Building Hierarchical Agentic RAG Systems: Multi-Modal Reasoning with Autonomous Error Recovery

Enterprise AI teams face a persistent challenge: most Retrieval-Augmented Generation (RAG) systems excel at either structured data queries or document search, but struggle when both are required simultaneously. A financial analyst who asks “Why are European operations underperforming?” needs data from both SQL databases (revenue, margins, and employee counts) and unstructured documents (market reports, competitive analysis, regulatory filings). A typical RAG system might return revenue data without regulatory context, or surface market reports without quantitative validation, leaving analysts to manually bridge the gap. Most current approaches treat these modalities as separate concerns, forcing engineers to build custom orchestration layers or accept incomplete answers.

This article explores architectural patterns for closing the modality gap through hierarchical multi-agent orchestration, using Protocol-H as a reference implementation to illustrate these concepts in practice. The patterns discussed (a supervisor-worker topology with autonomous error recovery) build on LangGraph/LangChain agentic patterns used by teams at companies like xAI and Databricks. The accompanying open-source code demonstrates these patterns deployed at enterprise scale with Docker/K8s, though readers can apply the same architectural principles using their preferred frameworks.

The architecture described in this article is based on a reference implementation and production-oriented experimentation with enterprise datasets; specific deployment details have been generalized to focus on the architectural patterns rather than any particular system implementation. — Read More

#performance

Scaling Karpathy’s Autoresearch: What Happens When the Agent Gets a GPU Cluster

We pointed Claude Code at autoresearch and gave it access to 16 GPUs on a Kubernetes cluster. Over 8 hours it submitted ~910 experiments, found that scaling model width mattered more than any single hyperparameter, taught itself to use H200s for validation while screening ideas on H100s, and drove val_bpb from 1.003 down to 0.974 – a 2.87% improvement over baseline.

Beyond raw speedup, parallelism changed how the agent searched. With one GPU, it’s stuck doing greedy hill-climbing – try one thing, check, repeat. With 16 GPUs, it ran factorial grids of 10-13 experiments per wave, catching interaction effects between parameters that sequential search would miss. For example, the agent tested six model widths in a single wave, saw the trend immediately, and zeroed in on the best one – one round instead of six. — Read More
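The shift from greedy hill-climbing to factorial waves is easy to see in code. A hedged sketch follows; the hyperparameter names, values, and stand-in objective are invented for illustration and are not the agent's actual search space.

```python
# Sketch of the search-strategy shift: greedy hill-climbing submits one run
# at a time, while a 16-GPU budget allows factorial waves that can expose
# interactions between hyperparameters. Names and values are illustrative.
from itertools import product

widths = [256, 384, 512, 640, 768, 896]   # six model widths in one wave
lrs = [3e-4, 6e-4]

# One wave: every (width, lr) combination submitted in parallel.
wave = list(product(widths, lrs))
assert len(wave) == 12                    # fits within a 16-GPU budget

def val_bpb(width, lr):
    # Stand-in objective; a real run would train and report validation bpb.
    return 1.003 - 1e-5 * width + 50 * (lr - 4.5e-4) ** 2

best = min(wave, key=lambda cfg: val_bpb(*cfg))
print("best config:", best)
```

Sequential search would need up to six rounds to test the widths alone; the wave surfaces the width trend, and any width-lr interaction, in one round.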

#performance

Anthropic’s Compute Advantage: Why Silicon Strategy is Becoming an AI Moat

Compute is not a commodity for frontier AI labs. It is a structural cost input that determines margin, throughput, and model iteration velocity at scale. The divergence in how Anthropic, OpenAI, and Microsoft have approached silicon procurement over the last 18 months is not just a supply chain story — it is a compounding strategic gap.

Anthropic has built what is today the most diversified and cost-efficient compute architecture among frontier AI labs. OpenAI remains almost entirely dependent on Nvidia. Microsoft’s internal chip program is years behind schedule. The structural implications favor Anthropic on unit economics and negotiating leverage as inference workloads scale. While demand has been so high that Anthropic has struggled with uptime, its long-term strategy is the most fundamentally resilient.

One important caveat up front: compute advantage amplifies model advantage; it does not replace it. If a competitor’s models are materially better, customers absorb the higher token cost. The argument here is not that Anthropic wins because of infrastructure. The argument is that equivalent model quality delivered at 30–60% lower cost per token is a compounding advantage — on margin, on training budget, and on the pace of iteration. — Read More

#performance

Kimi K2.5

Artificial Analysis calls Kimi the new leading open weights model, ‘now closer than ever to the frontier’ behind only OpenAI, Anthropic and Google.

Kimi K2.5 tops some benchmarks: HLE-Full with tools (50%), BrowseComp with Agent Swarm (78%), OCRBench (92%), OmniDocBench 1.5 (89%), MathVista (90%) and InfoVQA (93%). It is not too far behind on AIME 2025 (96% vs. 100%), SWE-Bench (77% vs. 81%) and GPQA-Diamond (88% vs. 92%).

[B]enchmarks are highly useful, but easy to overinterpret.

Inference is cheap, and speed is similar to Gemini 3 Pro, modestly faster than Opus. — Read More

#performance

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic N-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains (HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone’s early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models. — Read More

#performance

Engram: How DeepSeek Added a Second Brain to Their LLM

When DeepSeek released their technical reports for V2 and V3, the ML community focused on the obvious innovations: massive parameter counts, clever load balancing, and Multi-head Latent Attention. But buried in their latest research is something that deserves more attention: a different way to think about what an LLM should remember.

The insight is deceptively simple. Large language models spend enormous computational effort reconstructing patterns they’ve seen millions of times before. The phrase “United States of” almost certainly ends with “America.” “New York” probably precedes “City” or “Times.” These patterns are burned into the training data, and the model learns them, but it learns them the hard way: by propagating gradients through billions of parameters across dozens of layers.

What if you could just look them up? — Read More
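The lookup idea can be made concrete with a toy: hash the trailing N-gram into a fixed-size embedding table and fetch a vector in O(1). This is a sketch of the addressing mechanics only, not DeepSeek's Engram implementation; in the real module the table entries are trained, whereas here they are filled with seeded random vectors.

```python
# Toy illustration of conditional memory: hash the trailing N-gram and
# fetch an embedding in O(1), instead of recomputing the pattern through
# the full transformer stack. Not DeepSeek's actual Engram code.
import hashlib
import random

TABLE_SIZE = 2 ** 20
EMBED_DIM = 8

def ngram_slot(tokens: tuple[str, ...]) -> int:
    # Deterministic hash of the N-gram into a fixed-size embedding table.
    digest = hashlib.sha256(" ".join(tokens).encode()).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_SIZE

# The table would be trained end to end; here slots are filled lazily with
# seeded random vectors just to show the addressing.
table: dict[int, list[float]] = {}

def lookup(tokens: tuple[str, ...]) -> list[float]:
    slot = ngram_slot(tokens)
    if slot not in table:
        rng = random.Random(slot)          # reproducible per-slot vector
        table[slot] = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    return table[slot]

vec = lookup(("United", "States", "of"))
assert vec == lookup(("United", "States", "of"))  # same N-gram, same memory
```

The deterministic addressing is what makes the abstract's prefetching claim plausible: the slot for the next lookup is known before the forward pass needs it, so the entry can be fetched from host memory ahead of time.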

#performance

Use multiple models

The meta for getting the most out of AI in 2026.

… [I]t doesn’t feel like I could get away with just using one of these models without taking a substantial haircut in capabilities. This is a very strong endorsement for the notion of AI being jagged — i.e. with very strong capabilities spread out unevenly — while also being a bit of an unusual way to need to use a product. Each model is jagged in its own way. Through 2023, 2024, and the earlier days of modern AI, it quite often felt like there was always just one winning model and keeping up was easier. Today, it takes a lot of work and fiddling to make sure you’re not missing out on capabilities. — Read More

#performance

Evaluating Context Compression for AI Agents

We built an evaluation framework to measure how much context different compression strategies preserve. After testing three approaches on real-world, long-running agent sessions spanning debugging, code review, and feature implementation, we found that structured summarization retains more useful information than alternatives from OpenAI and Anthropic. — Read More
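The measurement idea behind such a framework can be sketched simply: score a compression strategy by how many ground-truth facts from the session survive in the compressed context. This is a hypothetical miniature, not the authors' framework; the session facts, the truncation baseline, and the structured summary below are all invented for illustration.

```python
# Minimal sketch of retention scoring for context compression (illustrative
# only): fraction of ground-truth session facts still present afterwards.
session_facts = {
    "bug is in retry loop",
    "fix landed in commit abc123",
    "tests pass after bumping timeout",
}

def retention(compressed: str, facts: set[str]) -> float:
    # Fraction of facts literally present in the compressed context. A real
    # framework would use semantic matching, not substring checks.
    kept = sum(1 for f in facts if f in compressed)
    return kept / len(facts)

truncated = "...tests pass after bumping timeout"               # keep-last-N
structured = ("Decisions: bug is in retry loop; fix landed in commit abc123. "
              "State: tests pass after bumping timeout.")       # summary

print(retention(truncated, session_facts), retention(structured, session_facts))
```

Even this toy shows why structured summarization can win: truncation preserves recency, while a summary organized by decisions and state preserves the facts an agent actually needs later in the session.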

#performance

Video App Zoom Shows Surprising Result By Topping Humanity’s Last Exam Benchmark, Beats Gemini 3 Pro

Topping AI benchmarks is usually thought to be the preserve of the top four AI frontier labs, but a surprising new name has emerged on the Humanity’s Last Exam leaderboard.

Zoom, the video conferencing platform, has announced it achieved a state-of-the-art score of 48.1% on the Humanity’s Last Exam (HLE) full-set benchmark, surpassing Google’s Gemini 3 Pro with tools, which previously held the top position at 45.8%. The 2.3 percentage point improvement marks a significant achievement for a company better known for video calls than AI research. — Read More

#performance

Baidu just dropped an open-source multimodal AI that it claims beats GPT-5 and Gemini

Baidu Inc., China’s largest search engine company, released a new artificial intelligence model on Monday that its developers claim outperforms competitors from Google and OpenAI on several vision-related benchmarks despite using a fraction of the computing resources typically required for such systems.

The model, dubbed ERNIE-4.5-VL-28B-A3B-Thinking, is the latest salvo in an escalating competition among technology companies to build AI systems that can understand and reason about images, videos, and documents alongside traditional text — capabilities increasingly critical for enterprise applications ranging from automated document processing to industrial quality control.

What sets Baidu’s release apart is its efficiency: the model activates just 3 billion parameters during operation while maintaining 28 billion total parameters through a sophisticated routing architecture. According to documentation released with the model, this design allows it to match or exceed the performance of much larger competing systems on tasks involving document understanding, chart analysis, and visual reasoning while consuming significantly less computational power and memory. — Read More
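The arithmetic behind the 28B-total / 3B-active split is worth seeing once. The sketch below shows top-k expert routing and the resulting active-parameter count; the expert count, top-k value, and shared-parameter figure are assumptions chosen so the numbers land near 3B, not ERNIE's actual configuration.

```python
# Back-of-envelope sketch of why a 28B-total / 3B-active MoE is cheap to run:
# a router picks the top-k experts per token, so only a fraction of the
# weights participate in each forward pass. Numbers are illustrative.
import random

TOTAL_EXPERTS = 64
TOP_K = 6
shared_params = 0.5e9                 # always-active weights (attention, etc.)
per_expert_params = (28e9 - shared_params) / TOTAL_EXPERTS

def route(token_scores: list[float], k: int = TOP_K) -> list[int]:
    # Indices of the k highest-scoring experts for this token.
    return sorted(range(len(token_scores)), key=token_scores.__getitem__)[-k:]

rng = random.Random(0)
scores = [rng.random() for _ in range(TOTAL_EXPERTS)]
chosen = route(scores)

active = shared_params + TOP_K * per_expert_params
print(f"{len(chosen)} of {TOTAL_EXPERTS} experts -> {active/1e9:.1f}B active "
      f"of 28B total ({active/28e9:.0%} of weights per token)")
```

Whatever the real expert layout, the mechanism is the same: per-token cost scales with active parameters, which is why a 3B-active model can be deployed on hardware that a dense 28B model would not fit comfortably.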

#performance