We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05x inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks — narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released. — Read More
Tag Archives: Multi-Modal
Gemini 2.0 Flash ushers in a new era of real-time multimodal AI
Google’s release of Gemini 2.0 Flash this week, offering users a way to interact live with video of their surroundings, has set the stage for what could be a pivotal shift in how enterprises and consumers engage with technology.
This release — alongside announcements from OpenAI, Microsoft, and others — is part of a transformative leap forward happening in the technology area called “multimodal AI.” The technology allows you to take video — or audio or images — that comes into your computer or phone, and ask questions about it.
It also signals an intensification of the competitive race among Google and its chief rivals — OpenAI and Microsoft — for dominance in AI capabilities. But more importantly, it feels like it is defining the next era of interactive, agentic computing. — Read More
Meta’s Transfusion model handles text and images in a single architecture
Multi-modal models that can process both text and images are a growing area of research in artificial intelligence. However, training these models presents a unique challenge: language models deal with discrete values (words and tokens), while image generation models must handle continuous pixel values.
Current multi-modal models use techniques that reduce the quality of representing data. In a new research paper, scientists from Meta and the University of Southern California introduce Transfusion, a novel technique that enables a single model to seamlessly handle both discrete and continuous modalities. — Read More
Agent Q: Groundbreaking Research in AI Web Agents
Reward-Free Curricula for Training Robust World Models
There has been a recent surge of interest in developing generally-capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach, and enables policies to be trained using imagined experience for new tasks. However, achieving a general agent requires robustness across different environments. In this work, we address the novel problem of generating curricula in the reward-free setting to train robust world models. We consider robustness in terms of minimax regret over all environment instantiations and show that the minimax regret can be connected to minimising the maximum error in the world model across environment instances. This result informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection based on the estimated error of the world model for each environment. Our experiments demonstrate that WAKER outperforms several baselines, resulting in improved robustness, efficiency, and generalisation. —Read More
#multi-modal, #reinforcement-learningMM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting. — Read More
LEGO:Language Enhanced Multi-modal Grounding Model
Multi-modal large language models have demonstrated impressive performance across various tasks in different modalities. However, existing multi-modal models primarily emphasize capturing global information within each modality while neglecting the importance of perceiving local information across modalities. Consequently, these models lack the ability to effectively understand the fine-grained details of input data, limiting their performance in tasks that require a more nuanced understanding. To address this limitation, there is a compelling need to develop models that enable fine-grained understanding across multiple modalities, thereby enhancing their applicability to a wide range of tasks. In this paper, we propose LEGO, a language enhanced multi-modal grounding model. Beyond capturing global information like other multi-modal models, our proposed model excels at tasks demanding a detailed understanding of local information within the input. It demonstrates precise identification and localization of specific regions in images or moments in videos. To achieve this objective, we design a diversified dataset construction pipeline, resulting in a multi-modal, multi-granularity dataset for model training. The code, dataset, and demo of our model can be found at https: //github.com/lzw-lzw/LEGO. – Read More
#nlp, #multi-modalMulti-Modal AI is a UX Problem
Transformers and other AI breakthroughs have shown state-of-the-art performance across different modalities.
The next frontier in AI is combining these modalities in interesting ways. Explain what’s happening in a photo. Debug a program with your voice. Generate music from an image. There’s still technical work to be done with combining these modalities, but the greatest challenge is not a technical one but a user experience one.
What is the right UX for these use cases? — Read More
GPT-4V(ision) System Card — Safety Properties of GPT-4V
GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development [1, 2, 3]. Multimodal LLMs offer the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users.
In this system card, [4, 5]1 we analyze the safety properties of GPT-4V. Our work on safety for GPT-4V builds on the work done for GPT-4 [7] and here we dive deeper into the evaluations, preparation, and mitigation work done specifically for image inputs. — Read More
ChatGPT can now see, hear, and speak
We are beginning to roll out new voice and image capabilities in ChatGPT. They offer a new, more intuitive type of interface by allowing you to have a voice conversation or show ChatGPT what you’re talking about.
Voice and image give you more ways to use ChatGPT in your life. Snap a picture of a landmark while traveling and have a live conversation about what’s interesting about it. When you’re home, snap pictures of your fridge and pantry to figure out what’s for dinner (and ask follow up questions for a step by step recipe). After dinner, help your child with a math problem by taking a photo, circling the problem set, and having it share hints with both of you.
We’re rolling out voice and images in ChatGPT to Plus and Enterprise users over the next two weeks. Voice is coming on iOS and Android (opt-in in your settings) and images will be available on all platforms. — Read More