Unlike chatbots or image generators, robots must operate in real time. While a robot is “thinking”, the world around it keeps evolving according to physical laws, so any delay between inputs and outputs has a tangible impact on performance. For a language model, the difference between fast and slow generation is a satisfied or an annoyed user; for a vision-language-action model (VLA), it could be the difference between a robot handing you a hot coffee and spilling it in your lap. While VLAs have achieved promising results in open-world generalization, they can be slow to run. Like their cousins in language and vision, these models have billions of parameters and require heavy-duty GPUs. Edge devices like mobile robots often cannot carry such hardware onboard, so inference must be offloaded to a centralized server, which adds network latency on top of the model's own inference time.