Veo and Imagen 3: Announcing new video and image generation models on Vertex AI

Generative AI is leading to real business growth and transformation. Among enterprise companies with gen AI in production, 86% report an increase in revenue, with an estimated 6% growth. That’s why Google is investing in its AI technology with new models like Veo, our most advanced video generation model, and Imagen 3, our highest quality image generation model. … Veo, now available on Vertex AI in private preview, empowers companies to effortlessly generate high-quality videos from simple text or image prompts, while Imagen 3 generates the most realistic and highest quality images from simple text prompts, surpassing previous versions of Imagen in detail, lighting, and artifact reduction. Imagen 3 will be available to all Vertex AI customers starting next week. — Read More

#image-recognition

Pixtral 12B

We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks, surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar sizes (Llama-3.2 11B & Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under the Apache 2.0 license. — Read More

Webpage: https://mistral.ai/news/pixtral-12b/
Inference code: https://github.com/mistralai/mistral-inference/
Evaluation code: https://github.com/mistralai/mistral-evals/
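The abstract's point about natural-resolution input can be made concrete with a little arithmetic. A minimal sketch, assuming a ViT-style encoder that splits the image into 16×16-pixel patches (the patch size here is an assumption for illustration, not a confirmed Pixtral detail):

```python
import math

# Hypothetical illustration: token cost of an image for an encoder that
# ingests images at their native resolution, assuming 16x16-pixel patches.
PATCH = 16

def image_token_count(width: int, height: int, patch: int = PATCH) -> int:
    """Number of patch tokens for an image at its natural resolution."""
    return math.ceil(width / patch) * math.ceil(height / patch)

# A small thumbnail costs far fewer tokens than a full-resolution photo,
# which is the "flexibility on the number of tokens" the abstract describes.
print(image_token_count(256, 256))    # 16 * 16 = 256 tokens
print(image_token_count(1024, 768))   # 64 * 48 = 3072 tokens
```

Because token count scales with resolution, users can trade detail for context budget, and several images still fit comfortably inside the 128K-token window.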

#image-recognition

Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

Consistency models (CMs) are a powerful class of diffusion-based generative models optimized for fast sampling. Most existing CMs are trained using discretized timesteps, which introduce additional hyperparameters and are prone to discretization errors. While continuous-time formulations can mitigate these issues, their success has been limited by training instability. To address this, we propose a simplified theoretical framework that unifies previous parameterizations of diffusion models and CMs, identifying the root causes of instability. Based on this analysis, we introduce key improvements in diffusion process parameterization, network architecture, and training objectives. These changes enable us to train continuous-time CMs at an unprecedented scale, reaching 1.5B parameters on ImageNet 512×512. Our proposed training algorithm, using only two sampling steps, achieves FID scores of 2.06 on CIFAR-10, 1.48 on ImageNet 64×64, and 1.88 on ImageNet 512×512, narrowing the gap in FID scores with the best existing diffusion models to within 10%. — Read More

#image-recognition

Meta announces Movie Gen, an AI-powered video generator

A new AI-powered video generator from Meta produces high-definition footage complete with sound, the company announced today. The announcement comes several months after competitor OpenAI unveiled Sora, its text-to-video model — though public access to Movie Gen isn’t happening yet.

Movie Gen uses text inputs to automatically generate new videos, as well as edit existing footage or still images. The New York Times reports that the audio added to videos is also AI-generated, matching the imagery with ambient noise, sound effects, and background music. The videos can be generated in different aspect ratios. — Read More

#image-recognition, #vfx

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

We present SF3D, a novel method for rapid and high-quality textured object mesh reconstruction from a single image in just 0.5 seconds. Unlike most existing approaches, SF3D is explicitly trained for mesh generation, incorporating a fast UV unwrapping technique that enables swift texture generation rather than relying on vertex colors. The method also learns to predict material parameters and normal maps to enhance the visual quality of the reconstructed 3D meshes. Furthermore, SF3D integrates a delighting step to effectively remove low-frequency illumination effects, ensuring that the reconstructed meshes can be easily used in novel illumination conditions. Experiments demonstrate the superior performance of SF3D over the existing techniques. Project page: this https URL. — Read More

#image-recognition

Revisiting Feature Prediction for Learning Visual Representations from Video

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model’s parameters (e.g., using a frozen backbone). Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K. — Read More
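The core of a feature-prediction objective can be illustrated in a few lines: a predictor regresses the features of masked video patches as produced by a target encoder, rather than reconstructing pixels. The shapes, random stand-in features, and L1 loss below are illustrative assumptions, not V-JEPA's exact architecture or loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hedged sketch in the spirit of V-JEPA: predict *features* of masked
# patches, with no pixels, text, or negative examples in the objective.
num_patches, dim = 16, 8
mask = rng.random(num_patches) < 0.5  # which patch positions are masked

target_features = rng.standard_normal((num_patches, dim))     # target encoder
predicted_features = rng.standard_normal((num_patches, dim))  # predictor

# The regression loss is computed only at masked positions.
loss = np.abs(predicted_features[mask] - target_features[mask]).mean()
print(float(loss) >= 0.0)  # True
```

Because the loss lives in feature space, the model never has to model low-level pixel detail, which is part of why the resulting representations transfer well with a frozen backbone.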

#image-recognition

Kling, the AI video generator rival to Sora that’s wowing creators

If you follow any AI influencers or creators on social media, there’s a good chance you’ve seen them more excited than usual lately about a new AI video generation model called “Kling.”

The videos it generates from pure text prompts and a few configurable in-app settings look incredibly realistic, on par with OpenAI’s still non-public, invitation-only, closed-beta model Sora, which OpenAI has shared with a small group of artists and filmmakers while it tests the model and probes its adversarial (read: risky, objectionable) uses.

[W]here did Kling come from? What does it offer? And how can you get your hands on it? Read on to find out. — Read More

#china-ai, #image-recognition

Ukraine Is Riddled With Land Mines. Drones and AI Can Help

Early on a June morning in 2023, my colleagues and I drove down a bumpy dirt road north of Kyiv in Ukraine. The Ukrainian Armed Forces were conducting training exercises nearby, and mortar shells arced through the sky. We arrived at a vast field for a technology demonstration set up by the United Nations. Across the 25-hectare field—that’s about the size of 62 American football fields—the U.N. workers had scattered 50 to 100 inert mines and other ordnance. Our task was to fly our drone over the area and use our machine learning software to detect as many as possible. And we had to turn in our results within 72 hours.

The scale was daunting: The area was 10 times as large as anything we’d attempted before with our drone demining startup, Safe Pro AI. My cofounder Gabriel Steinberg and I used flight-planning software to program a drone to cover the whole area with some overlap, taking photographs the whole time. It ended up taking the drone 5 hours to complete its task, and it came away with more than 15,000 images. Then we raced back to the hotel with the data it had collected and began an all-night coding session.

We were happy to see that our custom machine learning model took only about 2 hours to crunch through all the visual data and identify potential mines and ordnance. But constructing a map for the full area that included the specific coordinates of all the detected mines in under 72 hours was simply not possible with any reasonable computational resources. The following day (which happened to coincide with the short-lived Wagner Group rebellion), we rewrote our algorithms so that our system mapped only the locations where suspected land mines were identified—a more scalable solution for our future work. — Read More
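The scalability fix described above — georeferencing only the detections instead of stitching a full-area map — can be sketched with simple geometry. The flat-ground nadir-photo projection, the function name, and all parameters below are illustrative assumptions, not Safe Pro AI's actual pipeline:

```python
# Hedged sketch: map a detection's pixel position to a GPS coordinate,
# assuming a straight-down (nadir) photo of flat ground covering
# ground_w_m x ground_h_m meters, centered at the image's own GPS fix.
def detection_to_gps(img_lat, img_lon, px, py, img_w, img_h,
                     ground_w_m, ground_h_m):
    M_PER_DEG = 111_320.0  # approximate meters per degree of latitude
    dx_m = (px / img_w - 0.5) * ground_w_m   # east offset from image center
    dy_m = (0.5 - py / img_h) * ground_h_m   # north offset (pixel y grows down)
    # Note: this ignores the latitude-dependent scaling of longitude degrees,
    # which a real pipeline would correct for.
    return (img_lat + dy_m / M_PER_DEG,
            img_lon + dx_m / M_PER_DEG)

# Sanity check: a detection at the exact image center maps back to the
# image's own GPS coordinates.
print(detection_to_gps(50.5, 30.5, 200, 150, 400, 300, 40.0, 30.0))
```

Emitting one coordinate per detection is constant work per suspected mine, whereas orthomosaicking scales with the total number of overlapping images — the difference that made the 72-hour deadline reachable.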

#dod, #image-recognition

Microsoft AI creates scary real talkie videos from a single photo

Microsoft Research Asia has revealed an AI model that can generate frighteningly realistic deepfake videos from a single still image and an audio track. How will we be able to trust what we see and hear online from here on in?

… After training the [VASA-1] model on footage of around 6,000 real-life talking faces from the VoxCeleb2 dataset, the technology is able to generate scarily realistic video in which the newly animated subject not only accurately lip-syncs to a supplied voice audio track, but also sports varied facial expressions and natural head movements – all from a single static headshot photo. — Read More

#image-recognition

Stability AI Announces Stable Diffusion 3: All We Know So Far

Stability AI announced an early preview of Stable Diffusion 3, their text-to-image generative AI model. Unlike OpenAI’s Sora text-to-video announcement last week, this one included only limited demonstrations of the model’s new capabilities, though some details were provided. Here, we explore what the announcement means, how the new model works, and some implications for the advancement of image generation. — Read More

#image-recognition