When training large-scale LLMs, there is a large assortment of parallelization strategies that you can employ to scale your training runs across more GPUs. There are already a number of good resources for understanding how to parallelize your models: I particularly recommend How To Scale Your Model and The Ultra-Scale Playbook. The purpose of this blog post is to discuss parallelization strategies in a more schematic way by focusing only on how they affect your device mesh. The device mesh is an abstraction used by both PyTorch and JAX that takes your GPUs (however many of them you’ve got in your cluster!) and organizes them into an N-D tensor that expresses how the devices communicate with each other. When we parallelize computation, we shard a tensor along one dimension of the mesh, and then do collectives along that dimension when there are nontrivial dependencies between shards. Being able to explain why a device mesh is set up the way it is for a collection of parallelization strategies is a good check for seeing if you understand how the parallelization strategies work in the first place! (Credit: This post was influenced by Visualizing 6D Mesh Parallelism.) — Read More
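As a toy illustration (plain Python, not the actual PyTorch `DeviceMesh` or JAX `Mesh` APIs; the device count, mesh shape, and axis roles are made up), here is how eight devices could be arranged into a 2×4 mesh, and how the communication groups along each mesh axis fall out:

```python
import math
from itertools import product

def make_mesh(num_devices, shape):
    """Arrange flat device IDs 0..num_devices-1 into an N-D mesh,
    represented as a dict mapping mesh coordinates -> device ID."""
    assert num_devices == math.prod(shape)
    coords = product(*(range(n) for n in shape))
    return dict(zip(coords, range(num_devices)))

def groups_along(mesh, axis):
    """Communication groups for a collective along `axis`: the devices
    that share every mesh coordinate except the one on `axis`."""
    groups = {}
    for coord, dev in mesh.items():
        key = coord[:axis] + coord[axis + 1:]
        groups.setdefault(key, []).append(dev)
    return list(groups.values())

# A 2x4 mesh, e.g. axis 0 = "data" and axis 1 = "tensor".
mesh = make_mesh(8, (2, 4))
print(groups_along(mesh, axis=0))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(groups_along(mesh, axis=1))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

A data-parallel all-reduce would run within each axis-0 group, while tensor-parallel collectives run within each axis-1 group; reshaping the same eight devices into a different mesh shape changes which devices have to talk to each other, which is exactly the schematic view the post takes.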
Recent Updates
Python: The Documentary | An origin story
Every Abstraction Is a Door and a Wall: The Hidden Law of Abstraction
TL;DR: Virtualization has emerged as the strategy for increasing efficiency and achieving feats that physical reality never could, to the point where even our work, friends, and experiences have gone virtual. But what is the real cost of living in abstractions, and could reality itself be just another layer we can’t see through?
A July 2025 MIT study examined how large language models (LLMs) handle complex, changing information. Researchers tasked AI models with predicting the final arrangement of scrambled digits after a series of moves, without being given the final result. The transformer models learned to skip explicit simulation of every move: instead of following state changes step by step, they organized the moves into hierarchies, eventually making reasonable predictions.
In other words, the AI developed its own internal “language” of shortcuts to solve the task more efficiently. Does it hint at a broader truth? When faced with complexity, intelligent systems (biological or artificial) seek compressed, virtual representations that capture the essence without expending the energy to simulate every detail. — Read More
Google and Grok are catching up to ChatGPT, says a16z’s latest AI report
ChatGPT rivals like Google’s Gemini, xAI’s Grok, and, to a lesser extent, Meta AI are closing the gap with ChatGPT, OpenAI’s popular AI chatbot, according to a new report on the consumer AI landscape from venture firm Andreessen Horowitz.
The report, in its fifth iteration, showcases two and a half years of data about consumers’ evolving use of AI products.
And for the fifth time, 14 companies appeared on the list of top AI products: ChatGPT, Perplexity, Poe, Character AI, Midjourney, Leonardo, Veed, Cutout, ElevenLabs, Photoroom, Gamma, QuillBot, Civitai, and Hugging Face. — Read More
TIME100 AI 2025
Meet the innovators, leaders, and thinkers reshaping our world through groundbreaking advances in artificial intelligence. Time’s 100 most influential people in AI of 2025. The list includes familiar names like Sam Altman, Elon Musk, Jensen Huang, and Fei-Fei Li alongside newcomers like DeepSeek CEO Liang Wenfeng. — Read More
#strategy
Mass Intelligence
More than a billion people use AI chatbots regularly. ChatGPT has over 700 million weekly users. Gemini and other leading AIs add hundreds of millions more. In my posts, I often focus on the advances that AI is making (for example, in the past few weeks, both OpenAI’s and Google’s AI models achieved gold-medal performance in the International Math Olympiad), but that obscures a broader shift that’s been building: we’re entering an era of Mass Intelligence, where powerful AI is becoming as accessible as a Google search.
Until recently, free users of these systems (the overwhelming majority) had access only to older, smaller AI models that frequently made mistakes and had limited use for complex work. The best models, like Reasoners that can solve very hard problems and hallucinate much less often, required paying somewhere between $20 and $200 a month. And even then, you needed to know which model to pick and how to prompt it properly. But the economics and interfaces are changing rapidly, with fairly large consequences for how all of us work, learn, and think. — Read More
Building Agents for Small Language Models: A Deep Dive into Lightweight AI
The landscape of AI agents has been dominated by large language models (LLMs) like GPT-4 and Claude, but a new frontier is opening up: lightweight, open-source, locally-deployable agents that can run on consumer hardware. This post shares internal notes and discoveries from my journey building agents for small language models (SLMs) – models ranging from 270M to 32B parameters that run efficiently on CPUs or modest GPUs. These are lessons learned from hands-on experimentation, debugging, and optimizing inference pipelines.
SLMs offer immense potential: privacy through local deployment, predictable costs, and full control thanks to open weights. However, they also present unique challenges that demand a shift in how we design agent architectures. — Read More
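The post's own techniques aren't reproduced here, but the basic shape of such an agent can be sketched as a minimal tool-calling loop. Everything below is hypothetical: `fake_slm` stands in for a local model call, and the JSON action format is an illustrative assumption, not the author's protocol:

```python
import json

def fake_slm(prompt):
    """Stand-in for a local small-model call (hypothetical; a real
    pipeline would query e.g. a llama.cpp or transformers backend)."""
    if "Observation:" in prompt:
        # Pretend the model reads the last observation and answers with it.
        last = prompt.rsplit("Observation: ", 1)[1].splitlines()[0]
        return json.dumps({"tool": "final_answer", "input": last})
    return json.dumps({"tool": "calculator", "input": "23 * 17"})

# Tool registry: small models do best with a tiny, well-described tool set.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def run_agent(task, max_steps=5):
    """ReAct-style loop: ask the model for a JSON action, execute the
    tool, append the observation, and repeat until final_answer."""
    prompt = f'Task: {task}\nRespond with JSON {{"tool": ..., "input": ...}}'
    for _ in range(max_steps):
        action = json.loads(fake_slm(prompt))
        if action["tool"] == "final_answer":
            return action["input"]
        result = TOOLS[action["tool"]](action["input"])
        prompt += f"\nObservation: {result}"
    return None

print(run_agent("What is 23 * 17?"))  # prints 391
```

The loop shape matters more than the stub: with SLMs, constraining output to a strict, parseable action format and keeping the tool set small are typically what make the parse-execute-observe cycle reliable on modest hardware.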
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency in the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05x inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks — narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. — Read More
DINOv3: Self-supervised learning for vision at unprecedented scale
Self-supervised learning (SSL) — the concept that AI models can learn independently without human supervision — has emerged as the dominant paradigm in modern machine learning. It has driven the rise of large language models that acquire universal representations by pre-training on massive text corpora. However, progress in computer vision has lagged behind, as the most powerful image encoding models still rely heavily on human-generated metadata, such as web captions, for training.
Today, we’re releasing DINOv3, a generalist, state-of-the-art computer vision model trained with SSL that produces superior high-resolution visual features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks including object detection and semantic segmentation. — Read More
China unveils bionic antelope robot to observe endangered Tibetan species
A lifelike robotic Tibetan antelope is now roaming the high-altitude wilderness of Hoh Xil National Nature Reserve in Northwest China’s Qinghai Province.
Equipped with 5G ultra-low latency networks and advanced artificial intelligence (AI) algorithms, the bionic robot is being used to collect real-time data on Tibetan antelope populations without disturbing them.
This is the first time such a robotic antelope has been deployed in the heart of Hoh Xil, which sits more than 4,600 meters (roughly 15,000 feet) above sea level. — Read More