Rick's Cafe AI 3:50 pm on May 29, 2026
Tags: Performance

All major AI models violate EU regulations — study

All of the big AI models violate EU rules on AI and data protection to varying degrees, according to the nonprofit research foundation Aithos.

Aithos tested the models using its own tool, LARA (Legal Assessment for Real-world Agents), which simulates real-world scenarios in which AI assistants may find themselves in legally questionable situations, according to The Register. The tests measure compliance with the GDPR and the EU’s AI Regulation, among other things, and found that the models collected user data without proper consent, attempted to manipulate vulnerable individuals, or created psychological profiles of users. — Read More

#performance

Rick's Cafe AI 4:41 pm on May 22, 2026
Tags: Performance

Alibaba’s Metis Agent Reduces AI Tool Calls and Enhances Accuracy

Alibaba’s Metis agent represents a significant advancement in AI operational efficiency by utilizing a novel reinforcement learning framework called Hierarchical Decoupled Policy Optimization (HDPO). This innovative approach has drastically cut down redundant tool invocations from 98% to just 2%, while simultaneously improving reasoning accuracy across industry benchmarks. This breakthrough not only tackles operational bottlenecks but also challenges traditional AI training methods, promoting a more cost-effective and responsive AI deployment in business applications. — Read More

#performance

Rick's Cafe AI 9:22 am on May 21, 2026
Tags: Performance

Geometric AI does not need attention

I got the idea for this post when I had a virtual coffee with an engineer who builds AI models for one of the big airplane builders. He hasn’t built a model that writes your emails or hallucinates your legal documents. His model does something different: it looks at, say, a winglet (that’s the little upturned fin at the tip of every commercial aircraft wing) and predicts the turbulence it will generate with 98% accuracy.

Let that sit for a moment.

… I walked away from that coffee thinking about wave interference. Because turbulence is, at its core, a wave problem. Pressure waves, superimposed, creating chaotic but geometrically structured patterns. And if a model can learn those patterns in aerodynamics, the obvious question is, where else do superimposed wave systems produce instability that we desperately need to control? — Read More

#performance

Rick's Cafe AI 5:00 pm on May 13, 2026
Tags: Performance

How to achieve truly serverless GPUs

We are in the age of inference. Billion- to trillion-parameter neural networks are run on specialized accelerators at quadrillions of operations per second to generate media, author software, and fold proteins at massive scale.

Inference workloads are more variable and less predictable than the training workloads that previously dominated. That makes them a natural fit for serverless computing, where applications are defined at a level above the (virtual) machine so that they can be more readily scaled up and down to handle variable load.

But serverless computing only works if new replicas can be spun up quickly — as fast as demand changes, which can be at the scale of seconds. Naïvely spinning up a new instance of, say, SGLang serving a billion-parameter LLM on a B200 can take tens of minutes or stall for hours on GPU availability.

At Modal, we’ve done deep engineering work over the last five years to solve this problem. In this blog post, we walk through what we did. — Read More

#performance

Rick's Cafe AI 10:03 am on May 13, 2026
Tags: Performance

SubQ AI Explained: How Good Is the 12M Context Window LLM?

On May 5, 2026, a tiny Miami-based startup called Subquadratic released a model named SubQ. The team is small, but they’ve raised $29M in seed funding and claim the model can process up to 12 million tokens in a single pass.

They have also made other crazy-sounding claims, like their model is up to 52 times more efficient than FlashAttention at 1M tokens and achieves a coding performance similar to Claude Opus at roughly 1/20th of the cost.

These are big statements, so it makes sense to break this down and see what’s actually going on. In this piece, I’ll walk through what SubQ is, how the architecture works, and what the early details and developer communities suggest about these claims. — Read More

#performance

Rick's Cafe AI 2:43 pm on May 7, 2026
Tags: Performance

New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration

Today we’re launching dreaming in Claude Managed Agents as a research preview. Dreaming extends memory by reviewing past sessions to find patterns and help agents self-improve. We’re also making outcomes, multiagent orchestration, and webhooks available to developers building with Managed Agents. Together, these updates make agents more capable at handling complex tasks with minimal steering. — Read More

#performance

Rick's Cafe AI 1:49 pm on May 6, 2026
Tags: Performance

The context window has been shattered: Subquadratic debuts a 12-million-token window

Every frontier model in 2026 advertises a context window of at least a million tokens, but almost none of them are actually great at making use of all of that information. On MRCR v2, the multi-reference retrieval benchmark labs report, the best model is GPT-5.5, which scores 74.0%. Others like Claude Opus 4.7 at 32.2% are far behind.

At this point, a million tokens seems to be the maximum for the context window that the major frontier labs are offering. One major reason for the million-token max is the same one that has shaped every transformer-based model since 2017: Attention cost scales quadratically with context length, so doubling the input quadruples the work. Essentially, RAG, agentic decomposition, hybrid model architectures, and every other workaround the industry has built are ways of making tradeoffs to get around this.
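To make the quadratic claim concrete, here is a minimal arithmetic sketch. It assumes plain full self-attention, where every token is compared against every other token, and it counts only query-key pairs; it ignores constant factors, memory traffic, and any optimized kernel. The function names are hypothetical and come from no library or from the article.

```python
def attention_pairs(context_len: int) -> int:
    """Query-key pairs scored by full self-attention: n * n (assumed cost model)."""
    return context_len * context_len


def relative_cost(base_len: int, new_len: int) -> float:
    """How many times more pairwise work new_len needs compared with base_len."""
    return attention_pairs(new_len) / attention_pairs(base_len)


if __name__ == "__main__":
    base = 1_000_000
    for n in (1_000_000, 2_000_000, 12_000_000):
        print(f"{n:>10,} tokens -> {relative_cost(base, n):g}x the work of 1M tokens")
```

Under this simple model, doubling from 1M to 2M tokens quadruples the work, and a 12M-token window would need 144 times the pairwise work of a 1M-token one.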

Subquadratic, a Miami-based startup, launched its first model on Tuesday and claims it can get around all of this, now offering a model that can handle a context window of 12 million tokens. What’s more, the company says it plans to offer a model with a 50-million-token context window soon. — Read More

#performance

Rick's Cafe AI 1:46 pm on May 6, 2026
Tags: Performance

Computer use is 45x More Expensive Than Structured APIs

We ran a benchmark comparing two ways of letting an AI agent operate the same admin panel, with the goal of putting a price tag on vision agents (browser-use, computer-use).

Here is what we measured, what we had to change to make the vision agent work at all, and what changes when generating an API surface stops being a separate engineering project. — Read More

#performance

Rick's Cafe AI 1:40 pm on May 4, 2026
Tags: Performance

Salesforce launches Agentforce Operations to fix the workflows breaking enterprise AI

Enterprise AI teams are hitting a wall — not because their models can’t reason, but because the workflows underneath them were never built for agents. Tasks fail, handoffs break, and the problem compounds as organizations push agents deeper into back-office systems. A new architectural layer is emerging to address it: workflow execution control planes that impose deterministic structure on processes agents are expected to run.

One of the companies bringing this to the forefront is Salesforce, with a new workflow platform that turns back-office workflows into a set of tasks for specialized agents to complete. Users can upload their own processes or use one of the preset Blueprints provided by Salesforce, and Agentforce Operations will break them down for agents. — Read More

#performance

Rick's Cafe AI 11:19 am on April 23, 2026
Tags: Performance

Challenges and Research Directions for Large Language Model Inference Hardware

Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speed up communication. While our focus is datacenter AI, we also review their applicability for mobile devices. — Read More

#performance

Rick's Cafe AI

The latest in Artificial Intelligence carefully curated into its own special blend

Tag Archives: Performance
