Midjourney, one of the most popular AI image generation startups, announced on Wednesday the launch of its much-anticipated AI video generation model, V1.
V1 is an image-to-video model, in which users can upload an image — or take an image generated by one of Midjourney’s other models — and V1 will produce a set of four five-second videos based on it. Much like Midjourney’s image models, V1 is only available through Discord, and it’s only available on the web at launch. — Read More
Tag Archives: Image Recognition
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world. — Read More
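To make "planning with image goals" more concrete, here is a minimal sketch of goal-conditioned planning with a world model. The excerpt does not say which planner the authors use, so the cross-entropy method here is an assumption. The `predict` function is a hypothetical stand-in for a learned action-conditioned predictor, and the 2-D tuples stand in for image embeddings.

```python
import random

def predict(state, action):
    """Toy stand-in for a learned action-conditioned world model (hypothetical)."""
    return (state[0] + action[0], state[1] + action[1])

def energy(state, goal):
    """L1 distance between a predicted state and the goal embedding."""
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])

def rollout(state, actions):
    """Apply the world model step by step along an action sequence."""
    for action in actions:
        state = predict(state, action)
    return state

def plan_cem(start, goal, horizon=3, samples=200, elites=20, iters=20, seed=0):
    """Cross-entropy method: sample action sequences, keep the best, refit, repeat."""
    rng = random.Random(seed)
    # One independent Gaussian per time step and action dimension.
    mean = [[0.0, 0.0] for _ in range(horizon)]
    std = [[1.0, 1.0] for _ in range(horizon)]
    for _ in range(iters):
        scored = []
        for _ in range(samples):
            seq = [
                (rng.gauss(mean[t][0], std[t][0]), rng.gauss(mean[t][1], std[t][1]))
                for t in range(horizon)
            ]
            scored.append((energy(rollout(start, seq), goal), seq))
        scored.sort(key=lambda item: item[0])
        best = [seq for _, seq in scored[:elites]]
        # Refit the sampling distribution to the lowest-energy sequences.
        for t in range(horizon):
            for d in range(2):
                values = [seq[t][d] for seq in best]
                m = sum(values) / len(values)
                var = sum((v - m) ** 2 for v in values) / len(values)
                mean[t][d] = m
                std[t][d] = max(var ** 0.5, 1e-3)
    return [tuple(step) for step in mean]

start, goal = (0.0, 0.0), (1.5, -0.5)
plan = plan_cem(start, goal)
final_energy = energy(rollout(start, plan), goal)
```

In a real system the planner would typically execute only the first action, observe the new image, and replan. That receding-horizon loop is omitted here.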
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs at huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structure. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows. — Read More
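As a rough illustration of what "parameterizing SVG commands and coordinates into discrete tokens" can look like, the sketch below maps a small subset of SVG path syntax to integer tokens and back. The command vocabulary, the 200-unit grid, and the one-token-per-point encoding are all assumptions made for this example. The excerpt does not describe OmniSVG's actual tokenizer.

```python
import re

COMMANDS = ["M", "L", "C", "Z"]  # hypothetical command vocabulary
GRID = 200                       # hypothetical canvas size for quantized coordinates

def tokenize_path(d):
    """Map an SVG path string (absolute M/L/C/Z only) to a flat list of integer tokens."""
    tokens = []
    for cmd, args in re.findall(r"([MLCZ])([^MLCZ]*)", d):
        tokens.append(COMMANDS.index(cmd))
        nums = [float(n) for n in re.findall(r"-?\d+\.?\d*", args)]
        for x, y in zip(nums[0::2], nums[1::2]):
            # Quantize to the integer grid and clamp to the canvas.
            xi = min(max(int(round(x)), 0), GRID - 1)
            yi = min(max(int(round(y)), 0), GRID - 1)
            # One token per point, offset past the command ids.
            tokens.append(len(COMMANDS) + xi * GRID + yi)
    return tokens

def detokenize(tokens):
    """Invert tokenize_path for coordinates already on the integer grid."""
    parts = []
    for token in tokens:
        if token < len(COMMANDS):
            parts.append(COMMANDS[token])
        else:
            xi, yi = divmod(token - len(COMMANDS), GRID)
            parts.append(f"{xi} {yi}")
    return " ".join(parts)

tokens = tokenize_path("M 10 20 L 30 40 Z")
restored = detokenize(tokens)
```

Folding each (x, y) pair into a single token keeps sequences short. The cost is a larger vocabulary, here 4 command ids plus 200 × 200 point ids.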
OpenAI rolls out image generation powered by GPT-4o to ChatGPT
OpenAI is integrating new image generation capabilities directly into ChatGPT — this feature is dubbed “Images in ChatGPT.” Users can now use GPT-4o to generate images within ChatGPT itself.
This initial release focuses solely on image creation and will be available across ChatGPT Plus, Pro, Team, and Free subscription tiers. The free tier’s usage limit is the same as DALL-E, spokesperson Taya Christianson told The Verge, but added that they “didn’t have a specific number to share” and “these may change over time based on demand.” Per the ChatGPT FAQ, free users were previously able to generate “three images per day with DALL·E 3.” As for the fate of DALL-E, Christianson said “fans” will “still have access via a custom GPT.” — Read More
Veo and Imagen 3: Announcing new video and image generation models on Vertex AI
Generative AI is leading to real business growth and transformation. Among enterprise companies with gen AI in production, 86% report an increase in revenue, with an estimated 6% growth. That’s why Google is investing in its AI technology with new models like Veo, our most advanced video generation model, and Imagen 3, our highest quality image generation model. … Veo, now available on Vertex AI in private preview, empowers companies to effortlessly generate high-quality videos from simple text or image prompts, while Imagen 3 generates the most realistic and highest quality images from simple text prompts, surpassing previous versions of Imagen in detail, lighting, and artifact reduction. Imagen 3 will be available to all Vertex AI customers starting next week. — Read More
Pixtral 12B
We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks, surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar sizes (Llama-3.2 11B & Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under Apache 2.0 license. – Read More
Webpage: https://mistral.ai/news/pixtral-12b/
Inference code: https://github.com/mistralai/mistral-inference/
Evaluation code: https://github.com/mistralai/mistral-evals/
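The abstract's point about token flexibility comes down to simple arithmetic: if an image is cut into fixed-size patches at its native resolution, the token count scales with its pixel dimensions. The sketch below assumes a 16-pixel square patch and counts only patch tokens. Both are assumptions for illustration, and any separator or end-of-image tokens a real model might add are ignored. The helper names are hypothetical.

```python
import math

def patch_token_count(width, height, patch=16):
    """Patch tokens for an image tiled into patch x patch squares (patch size is assumed)."""
    return math.ceil(width / patch) * math.ceil(height / patch)

def resize_to_budget(width, height, max_tokens, patch=16):
    """Shrink an image, roughly preserving aspect ratio, until it fits a token budget."""
    scale = 1.0
    w, h = width, height
    while patch_token_count(w, h, patch) > max_tokens:
        scale *= 0.9
        w = max(1, int(width * scale))
        h = max(1, int(height * scale))
    return w, h

square = patch_token_count(512, 512)
banner = patch_token_count(1024, 256)
small_w, small_h = resize_to_budget(1024, 1024, max_tokens=256)
```

Under these assumptions a square image and a wide banner with the same pixel area cost the same number of tokens. A caller can trade detail for context length by downscaling before encoding.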
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Consistency models (CMs) are a powerful class of diffusion-based generative models optimized for fast sampling. Most existing CMs are trained using discretized timesteps, which introduce additional hyperparameters and are prone to discretization errors. While continuous-time formulations can mitigate these issues, their success has been limited by training instability. To address this, we propose a simplified theoretical framework that unifies previous parameterizations of diffusion models and CMs, identifying the root causes of instability. Based on this analysis, we introduce key improvements in diffusion process parameterization, network architecture, and training objectives. These changes enable us to train continuous-time CMs at an unprecedented scale, reaching 1.5B parameters on ImageNet 512×512. Our proposed training algorithm, using only two sampling steps, achieves FID scores of 2.06 on CIFAR-10, 1.48 on ImageNet 64×64, and 1.88 on ImageNet 512×512, narrowing the gap in FID scores with the best existing diffusion models to within 10%. — Read More
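For readers unfamiliar with what "two sampling steps" means for a consistency model, here is a toy sketch of generic multistep consistency sampling. It is not the paper's parameterization or training algorithm. It assumes a simple additive-noise process, 1-D Gaussian data (for which the consistency function has a closed form), and hand-picked noise levels. A real model would replace `consistency_fn` with a trained network.

```python
import math
import random

SIGMA_DATA = 0.5   # toy data distribution: N(0, SIGMA_DATA^2)
SIGMA_MAX = 80.0   # assumed largest noise level
SIGMA_MID = 0.8    # assumed intermediate noise level for the second step

def consistency_fn(x, sigma):
    """Closed-form map from a noisy point to its clean endpoint for the toy Gaussian data."""
    return x * SIGMA_DATA / math.sqrt(SIGMA_DATA ** 2 + sigma ** 2)

def two_step_sample(rng):
    # Step 1: jump from pure noise straight to a clean estimate.
    x = rng.gauss(0.0, SIGMA_MAX)
    x0 = consistency_fn(x, SIGMA_MAX)
    # Step 2: add noise back at a smaller level, then jump to clean again.
    x_mid = x0 + SIGMA_MID * rng.gauss(0.0, 1.0)
    return consistency_fn(x_mid, SIGMA_MID)

rng = random.Random(0)
samples = [two_step_sample(rng) for _ in range(5000)]
mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```

With a learned, imperfect network the second step usually improves sample quality over one-step generation. In this toy case the function is exact, so the samples should have a standard deviation close to `SIGMA_DATA` after either step.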
Meta announces Movie Gen, an AI-powered video generator
A new AI-powered video generator from Meta produces high-definition footage complete with sound, the company announced today. The announcement comes several months after competitor OpenAI unveiled Sora, its text-to-video model — though public access to Movie Gen isn’t happening yet.
Movie Gen uses text inputs to automatically generate new videos, as well as edit existing footage or still images. The New York Times reports that the audio added to videos is also AI-generated, matching the imagery with ambient noise, sound effects, and background music. The videos can be generated in different aspect ratios. — Read More
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
We present SF3D, a novel method for rapid and high-quality textured object mesh reconstruction from a single image in just 0.5 seconds. Unlike most existing approaches, SF3D is explicitly trained for mesh generation, incorporating a fast UV unwrapping technique that enables swift texture generation rather than relying on vertex colors. The method also learns to predict material parameters and normal maps to enhance the visual quality of the reconstructed 3D meshes. Furthermore, SF3D integrates a delighting step to effectively remove low-frequency illumination effects, ensuring that the reconstructed meshes can be easily used in novel illumination conditions. Experiments demonstrate the superior performance of SF3D over the existing techniques. — Read More
Revisiting Feature Prediction for Learning Visual Representations from Video
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model’s parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K. — Read More