vLLM is an open-source inference engine that serves large language models. We deploy multiple vLLM instances across GPUs and load open-weight models like Llama 4 into them. We then load balance traffic across the instances, run health checks, and handle upgrades. Our customers consume our managed service by sending their prompts to our API endpoint, which also determines the vLLM instance that serves each prompt.
vLLM sits at the intersection of AI and systems programming, so we thought that diving into its details might interest some of our readers. In this blog post, we describe how an inference request travels through vLLM’s OpenAI-compatible API server and core engine. We also provide key code pointers.
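To make the request path concrete, here is a minimal sketch of a client calling a vLLM instance through its OpenAI-compatible API. The local URL and model name are illustrative assumptions, not details from this post:

```python
# Minimal sketch: sending a prompt to a vLLM OpenAI-compatible server.
# Assumes a vLLM instance is already running locally, e.g. started with
# `vllm serve <model>`; the base URL and model name are illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM accepts a placeholder key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # whichever model the server loaded
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```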
We assume readers are already familiar with the transformer architecture and large language models. If you’re not, we highly recommend this video by OpenAI co-founder Andrej Karpathy. We will focus on the new V1 architecture of vLLM and how it achieves state-of-the-art text generation performance. If you’re looking for the V0 behavior or multi-modal inference, please refer to other vLLM documentation. — Read More
Efficient Federated Learning with Encrypted Data Sharing for Data-Heterogeneous Edge Devices
As privacy protection gains increasing importance, more models are being trained on edge devices and subsequently merged into the central server through Federated Learning (FL). However, current research overlooks the impact of network topology, physical distance, and data heterogeneity on edge devices, leading to issues such as increased latency and degraded model performance. To address these issues, we propose a new federated learning scheme for edge devices called Federated Learning with Encrypted Data Sharing (FedEDS). FedEDS uses the client model and the model's stochastic layer to train a data encryptor. The data encryptor generates encrypted data and shares it with other clients. Each client uses the corresponding client's stochastic layer and encrypted data to train and adjust its local model. FedEDS thus trains on the client's local private data together with encrypted data shared by other clients. This approach accelerates the convergence of federated training and mitigates the negative impact of data heterogeneity, making it suitable for application services deployed on edge devices that require rapid convergence. Experimental results show the efficacy of FedEDS in improving model performance. — Read More
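The abstract leaves the encryptor and the stochastic layer unspecified. As a toy, runnable illustration of the data flow it describes (clients sharing encrypted rather than raw data, peers folding the shares into local training), here is a sketch in which a client-specific random projection stands in for the encryption. Every detail below is our illustrative assumption, not the paper's method:

```python
# Toy sketch of the FedEDS data flow described in the abstract. The random
# projection standing in for "encryption" and all names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N_CLIENTS, N_SAMPLES, D_IN, D_ENC = 3, 20, 5, 4

class Client:
    def __init__(self, cid):
        self.cid = cid
        self.private_X = rng.normal(size=(N_SAMPLES, D_IN))
        # Stand-in "stochastic layer": a client-specific random matrix.
        self.stochastic_layer = rng.normal(size=(D_IN, D_ENC))
        self.inbox = {}

    def encrypted_share(self):
        # Encryptor output: private data pushed through the stochastic layer.
        return self.private_X @ self.stochastic_layer

clients = [Client(i) for i in range(N_CLIENTS)]

# Each client shares encrypted data with every peer.
for c in clients:
    share = c.encrypted_share()
    for peer in clients:
        if peer.cid != c.cid:
            peer.inbox[c.cid] = share

# Each client then trains on its own (projected) data plus peers' shares.
for c in clients:
    train_X = np.vstack([c.private_X @ c.stochastic_layer, *c.inbox.values()])
    print(f"client {c.cid}: training on {train_X.shape[0]} rows")
```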
7 People Now Have Elon Musk’s Neuralink Brain Implant
The brain-computer interface lets those with cervical spinal cord injuries or ALS control a computer with their thoughts. This year, Neuralink has more than doubled the number of patients.
Neuralink has been quietly increasing the number of patients with its N1 brain implant. According to the Barrow Neurological Institute, seven people have now received one. — Read More
It’s Known as ‘The List’—and It’s a Secret File of AI Geniuses
The leader of Meta’s Superintelligence lab, Alexandr Wang, 28, is one of the priciest hires anywhere.
Meta has made offers to dozens of researchers at OpenAI. The startup has responded to Zuckerberg’s blitz with impressive packages of its own.
Not everyone on Meta’s list gets $100 million, though they’re still fetching astronomical sums. — Read More
AI’s Trillion-Dollar Opportunity: Sequoia AI Ascent 2025 Keynote
Checking In on AI and the Big Five
This is how I opened January 2023’s AI and the Big Five:
The story of 2022 was the emergence of AI, first with image generation models, including DALL-E, MidJourney, and the open source Stable Diffusion, and then ChatGPT, the first text-generation model to break through in a major way. It seems clear to me that this is a new epoch in technology.
Sometimes the accuracy of a statement is measured by its banality, and that certainly seems to be the case here: AI is the new epoch, consuming the mindshare of not just Stratechery but also the companies I cover. To that end, two-and-a-half years on, I thought it would be useful to revisit that 2023 analysis and re-evaluate the state of AI’s biggest players, primarily through the lens of the Big Five: Apple, Google, Meta, Microsoft, and Amazon.
The proximate cause for this reevaluation is the apparent five-alarm fire that is happening at Meta: the company’s latest Llama 4 release was disappointing — and in at least one case, deceptive — pushing founder and CEO Mark Zuckerberg to go on a major spending spree for talent. — Read More
Microsoft and OpenAI at Odds Over Future of AGI: A Tech Titans’ Tug-of-War
Microsoft and OpenAI are embroiled in a heated debate over the future of Artificial General Intelligence (AGI). OpenAI’s CEO, Sam Altman, is optimistic about nearing AGI, while Microsoft’s Satya Nadella remains doubtful, suspecting potential manipulation. This disagreement could disrupt their exclusive partnership and reshape the AI landscape. With big stakes in AGI’s advent, the two companies are grappling over contracts, ownership, and tech access, while OpenAI eyes new alliances with rivals like Oracle and Google. — Read More
2025 State of Foundation Models
How far can reasoning models scale?
Reasoning models like OpenAI’s o3 are less than a year old, but they’ve already seen rapid improvements in capability, and OpenAI researchers are very optimistic that this progress will continue. Yet it’s not clear how much further the techniques used to train reasoning models can scale.
After looking into the question, I think there is room to scale reasoning training further, but it’s unlikely that OpenAI or other frontier AI developers can scale by many orders of magnitude.
If reasoning training continues to scale at 10× every few months, in line with the jump from o1 to o3, it will reach the frontier of total training compute before long, perhaps within a year. At that point, the scaling rate will slow and converge with the overall growth rate in training compute of ~4× per year. Progress in reasoning models may slow down after this point as well. — Read More
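The convergence claim in that last item is simple exponent arithmetic. A quick check, where the growth rates come from the article but the starting gap (reasoning training compute at 1% of total frontier training compute) is our illustrative assumption:

```python
# Back-of-the-envelope check of the scaling argument above.
import math

# "10x every few months": assume every 4 months, i.e. 1000x per year.
reasoning_growth_per_year = 10 ** (12 / 4)
frontier_growth_per_year = 4          # ~4x per year overall training compute
initial_ratio = 0.01                  # assumed: reasoning is 1% of frontier compute

# Solve initial_ratio * r_reasoning^t = r_frontier^t for t (in years).
t_years = -math.log(initial_ratio) / math.log(
    reasoning_growth_per_year / frontier_growth_per_year)
print(f"catch-up in ~{t_years:.1f} years")  # ~0.8 years under these assumptions
```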
Leaking Secrets in the Age of AI
In a rush to adopt and experiment with AI, developers and other technology practitioners are willing to cut corners. This is evident from multiple recent security incidents, such as:
- Platform resource abuses (attackers hijack cloud infrastructure to power their own LLM applications)
- Vendors offering unsafe 3rd-party model execution (Probllama)
- Model escape vulnerabilities in hosting services (Replicate, HuggingFace and SAP-AI vulnerabilities)
Yet another side-effect of these hasty practices is the leakage of AI-related secrets in public code repositories. Secrets in public code repositories are nothing new. What’s surprising is the fact that after years of research, numerous security incidents, millions of dollars in bug bounty hunters’ pockets, and general awareness of the risk, it is still painfully easy to find valid secrets in public repositories. — Read More
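To make “painfully easy” concrete, here is a minimal sketch of the kind of naive scan that still turns up valid keys. The regexes are simplified approximations of common AI key prefixes (OpenAI’s sk-, Hugging Face’s hf_), not exact formats:

```python
# Naive secret scan over a checked-out repository. Patterns are
# simplified approximations for illustration, not production rules.
import pathlib
import re

PATTERNS = {
    "openai": re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    "huggingface": re.compile(r"hf_[A-Za-z0-9]{30,}"),
}

def scan(repo_root: str) -> None:
    for path in pathlib.Path(repo_root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for name, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                # Print a truncated prefix so the scan itself doesn't leak keys.
                print(f"{path}: possible {name} key: {match.group()[:12]}...")

scan(".")
```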