OpenAI’s gpt-image generation models are designed for production-quality visuals and highly controllable creative workflows. They are well-suited for both professional design tasks and iterative content creation, and support both high-quality rendering and lower-latency use cases depending on the workflow.
… This guide highlights prompting patterns, best practices, and example prompts drawn from real production use cases for gpt-image-2. It is our most capable image model, with stronger image quality, improved editing performance, and broader support for production workflows. The low quality setting is especially strong for latency-sensitive use cases, while medium and high remain good fits when maximum fidelity matters. — Read More
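For readers who want to try the quality/latency trade-off the excerpt describes, here is a minimal sketch using the OpenAI Python SDK's Images API. It assumes gpt-image-2 is exposed through the same `images.generate` endpoint and `quality` parameter ("low", "medium", "high") as gpt-image-1; check the official docs for the exact model identifier and options.

```python
# Minimal sketch: generating an image at the latency-friendly "low" quality setting.
# Assumes gpt-image-2 uses the same Images API surface as gpt-image-1.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-2",   # model name as referenced in the article
    prompt="A flat-lay product photo of a ceramic mug on linen, soft morning light",
    size="1024x1024",
    quality="low",         # "low" for latency-sensitive work; "medium"/"high" for fidelity
)

# The API returns base64-encoded image data.
with open("mug.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```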
ELT: Elastic Looped Transformers for Visual Generation
We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To train these models effectively for image and video generation, we propose Intra-Loop Self Distillation (ILSD), in which student configurations (intermediate loops) are distilled from the teacher configuration (the maximum number of training loops) to ensure consistency across the model’s depth within a single training step. Our framework yields a family of elastic models from a single training run, enabling any-time inference with dynamic trade-offs between computational cost and generation quality at the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis: with a reduced parameter count under iso-inference-compute settings, ELT achieves competitive FID on class-conditional ImageNet and competitive FVD on class-conditional UCF-101. — Read More
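To make the abstract's training idea concrete, here is a minimal PyTorch sketch: a single weight-shared transformer block applied for a variable number of loops, with intermediate-loop ("student") outputs distilled toward the maximum-loop ("teacher") output in the same training step. The module choices and the MSE losses are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a weight-shared looped transformer with
# Intra-Loop Self Distillation (ILSD). Details are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoopedTransformer(nn.Module):
    def __init__(self, dim=512, heads=8, max_loops=12):
        super().__init__()
        # One transformer block whose weights are reused on every loop.
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.head = nn.Linear(dim, dim)  # e.g. predicts denoised latents
        self.max_loops = max_loops

    def forward(self, x, loops=None):
        loops = loops or self.max_loops
        for _ in range(loops):           # same parameters applied repeatedly
            x = self.block(x)
        return self.head(x)

def ilsd_step(model, x, target, student_loops=(4, 8)):
    """One training step: task loss at maximum depth, plus distillation of
    intermediate-loop outputs toward the max-loop (teacher) output."""
    teacher_out = model(x, loops=model.max_loops)
    loss = F.mse_loss(teacher_out, target)                  # task loss (assumed MSE)
    for k in student_loops:
        student_out = model(x, loops=k)
        loss = loss + F.mse_loss(student_out, teacher_out.detach())
    return loss

model = LoopedTransformer()
x = torch.randn(2, 16, 512)              # (batch, tokens, dim)
target = torch.randn(2, 16, 512)
ilsd_step(model, x, target).backward()   # at test time, pick `loops` for any-time inference
```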
VOID: Video Object and Interaction Deletion
Existing video object removal methods excel at inpainting content “behind” the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions — such as collisions with other objects — current models fail to correct them and produce implausible results.
We present VOID, a video object removal framework designed to perform physically plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. — Read More
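The two-stage inference described above (a VLM localizes the affected regions, then a video diffusion model regenerates them) can be pictured with a sketch like the following. The function names and data layout are placeholders; VOID's actual interfaces are not published in this excerpt.

```python
# Sketch of the two-stage VOID-style inference flow described above.
# Both model calls are placeholders, not VOID's real interfaces.
from dataclasses import dataclass
import numpy as np

@dataclass
class RemovalRequest:
    video: np.ndarray        # (T, H, W, 3) input frames
    object_mask: np.ndarray  # (T, H, W) mask of the object to delete

def find_affected_regions(video, object_mask):
    """Placeholder for the vision-language model: returns masks covering regions
    whose physics depend on the removed object (e.g. collision partners)."""
    return np.zeros(object_mask.shape, dtype=bool)

def diffusion_inpaint(video, region_mask):
    """Placeholder for the video diffusion model: regenerates the masked regions
    so the remaining objects evolve plausibly without the deleted one."""
    return video

def remove_object(req: RemovalRequest):
    affected = find_affected_regions(req.video, req.object_mask)
    # Inpaint both the object itself and everything it physically influenced.
    full_mask = req.object_mask.astype(bool) | affected
    return diffusion_inpaint(req.video, full_mask)
```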
How does AI understand my visual searches?
We’ve all been there: You see a photo of a perfectly styled living room or a well-curated street-style outfit, and you want to know where everything came from. Until recently, visual search was a one-item-at-a-time process. But a major update to Circle to Search and Lens now allows Google to break down and search for multiple objects within a single image simultaneously. This means if you use Circle to Search on Android to search for an entire outfit, you’ll see results for every component of a look, not just one piece at a time. In recent months, we’ve also launched several updates that enhance both visual search and image results in AI Mode, so you can better find inspiration as you search. — Read More
The First Fully General Computer Action Model
We trained a model on our 11-million-hour video dataset. Our model can explore complex websites, complete multi-action CAD modeling sequences, and drive a car in the real world, all at 30 FPS.
We designed FDM-1, a foundation model for computer use. FDM-1 is trained on videos from a portion of our 11-million-hour screen recording dataset, which we labeled using an inverse dynamics model that we trained. Our video encoder can compress almost 2 hours of 30 FPS video in only 1M tokens. FDM-1 is the first model with the long-context training needed to become a coworker for CAD, finance, engineering, and eventually ML research, and it consistently improves with scale. It trains and infers directly on video instead of screenshots and can learn unsupervised from the entirety of the internet. — Read More
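To put the compression claim in perspective, a quick back-of-the-envelope calculation (taking "almost 2 hours" as exactly 2 hours) gives the implied token budget per frame:

```python
# Rough arithmetic on the claimed compression: ~2 hours of 30 FPS video in ~1M tokens.
hours = 2
fps = 30
frames = hours * 3600 * fps       # 216,000 frames
tokens = 1_000_000
print(frames, tokens / frames)    # ~216000 frames, roughly 4.6 tokens per frame
```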
Why aren’t video codec intrinsics used to train generative AI?
Every video we feed into a model carries a hidden companion that seems to be largely ignored. Alongside the frames, the encoder leaves behind a rich trail of signals — motion vectors, block partitions, quantisation/rate-distortion decisions and residual energy. Call them “codec intrinsics”, or simply “codec signals.” They aren’t pixels, but they are shaped by decades of engineering about what people actually see, where detail matters and how motion really flows. If our generators learn from images and videos, why not let them learn from this perceptual map as well? It’s the difference between teaching an AI to paint by only showing it finished masterpieces versus letting it study the painter’s original sketches, compositional notes, and brush-stroke tests. — Read More
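As a concrete example of one such signal, a decoder can be asked to export the motion vectors the encoder computed. Below is a minimal sketch that drives FFmpeg's documented `export_mvs` flag and `codecview` filter from Python to visualize them; the file names are placeholders, and turning these side-data signals into training features would be a further step.

```python
# Overlay the encoder's motion vectors on a decoded video using FFmpeg's
# documented export_mvs flag and codecview filter (file names are placeholders).
import subprocess

subprocess.run([
    "ffmpeg",
    "-flags2", "+export_mvs",        # ask the decoder to export motion-vector side data
    "-i", "input.mp4",
    "-vf", "codecview=mv=pf+bf+bb",  # draw forward/backward predicted motion vectors
    "motion_vectors.mp4",
], check=True)
```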
Qwen Image Model — New Open Source Leader?
There has been some excitement over the last week or two around the new model in the Qwen series by Alibaba. Qwen Image is a 20B-parameter MMDiT (Multimodal Diffusion Transformer) model (3 billion parameters more than HiDream), open-sourced under the Apache 2.0 license.
Beyond the core model, it uses the Qwen2.5-VL LLM for text encoding and a specialised VAE (Variational Autoencoder). It can reportedly render readable, multilingual text in much longer passages than previous models, and the VAE is trained to preserve small fonts, text edges and layout. Using Qwen2.5-VL as the text encoder should mean better language, vision and context understanding.
… These improvements come at a cost: size. The full BF16 model is 40GB, with the FP16 version of the text encoder adding another 16GB. FP8 versions are more reasonable at 20GB for the model and 9GB for the text encoder. If those sizes are still too large for your setup, distilled versions are available from links in the ComfyUI guide. City96 has also created various GGUF versions, available for download from Hugging Face. — Read More
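For those who want to experiment outside ComfyUI, loading the open-weights release through Hugging Face diffusers looks roughly like the sketch below. The `Qwen/Qwen-Image` repository id, pipeline support, and bf16 settings are assumptions based on the public release; check the model card for current instructions and hardware requirements.

```python
# Minimal sketch: text-to-image with the open-weights Qwen-Image release via diffusers.
# The "Qwen/Qwen-Image" repo id and settings are assumptions; see the model card.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
)
pipe.to("cuda")  # the full BF16 weights need a large-memory GPU or CPU offloading

image = pipe(
    prompt="A storefront sign reading 'Fresh Coffee' in neon, rainy night street scene",
    num_inference_steps=50,
).images[0]
image.save("qwen_image_demo.png")
```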
Why do LLMs freak out over the seahorse emoji?
Is there a seahorse emoji?
… [P]opular language models are very confident that there’s a seahorse emoji. And they’re not alone in that confidence.
… Maybe LLMs believe a seahorse emoji exists because so many humans in the training data do. Or maybe it’s a convergent belief – given how many other aquatic animals are in Unicode, it’s reasonable for both humans and LLMs to assume (generalize, even) that such a delightful animal is as well. A seahorse emoji was even formally proposed at one point, but was rejected in 2018.
Regardless of the root cause, many LLMs begin each new context window fresh with the mistaken latent belief that the seahorse emoji exists. But why does that produce such strange behavior? I mean, I used to believe a seahorse emoji existed myself, but if I had tried to send it to a friend, I would’ve simply looked for it on my keyboard and realized it wasn’t there, not sent the wrong emoji and then gone into an emoji spam doomloop. So what’s happening inside the LLM that causes it to act like this? — Read More
DINOv3: Self-supervised learning for vision at unprecedented scale
Self-supervised learning (SSL), the concept that AI models can learn independently without human supervision, has emerged as the dominant paradigm in modern machine learning. It has driven the rise of large language models that acquire universal representations by pre-training on massive text corpora. However, progress in computer vision has lagged behind, as the most powerful image encoding models still rely heavily on human-generated metadata, such as web captions, for training.
Today, we’re releasing DINOv3, a generalist, state-of-the-art computer vision model trained with SSL that produces superior high-resolution visual features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks including object detection and semantic segmentation. — Read More
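To give a sense of how a frozen backbone gets used for dense prediction, the sketch below extracts patch-level features once and puts only a lightweight head on top. The checkpoint id is a placeholder, and loading DINOv3 this way assumes recent Hugging Face transformers support; consult the official release for the actual entry points and token layout.

```python
# Sketch: a frozen vision backbone supplying dense features to a small trainable head.
# The checkpoint id is a placeholder and DINOv3 loading details may differ; see the release.
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "facebook/dinov3-vitb16"    # placeholder id; check the official model card
processor = AutoImageProcessor.from_pretrained(ckpt)
backbone = AutoModel.from_pretrained(ckpt).eval()
for p in backbone.parameters():
    p.requires_grad_(False)        # backbone stays frozen; only the head would be trained

num_classes = 21
head = nn.Conv2d(backbone.config.hidden_size, num_classes, kernel_size=1)

image = Image.open("street.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    tokens = backbone(**inputs).last_hidden_state
num_extra = 1                      # CLS token; adjust if the checkpoint adds register tokens
patches = tokens[:, num_extra:, :]
side = int(patches.shape[1] ** 0.5)            # assumes a square grid of patch tokens
fmap = patches.transpose(1, 2).reshape(1, -1, side, side)
logits = head(fmap)                # coarse per-patch segmentation logits
```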
Midjourney launches its first AI video generation model, V1
Midjourney, one of the most popular AI image generation startups, announced on Wednesday the launch of its much-anticipated AI video generation model, V1.
V1 is an image-to-video model: users can upload an image (or take an image generated by one of Midjourney’s other models) and V1 will produce a set of four five-second videos based on it. Much like Midjourney’s image models, V1 is accessed through Midjourney’s Discord and web interfaces, though at launch it’s only available on the web. — Read More