Rick's Cafe AI 12:29 pm on November 5, 2025
Tags: Image Recognition

Why aren’t video codec intrinsics used to train generative AI?

Every video we feed into a model carries a hidden companion that seems to be largely ignored. Alongside the frames, the encoder leaves behind a rich trail of signals — motion vectors, block partitions, quantisation/rate-distortion decisions and residual energy. Call them “codec intrinsics”, or simply “codec signals.” They aren’t pixels, but they are shaped by decades of engineering about what people actually see, where detail matters and how motion really flows. If our generators learn from images and videos, why not let them learn from this perceptual map as well? It’s the difference between teaching an AI to paint by only showing it finished masterpieces versus letting it study the painter’s original sketches, compositional notes, and brush-stroke tests. — Read More
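
To make "motion vectors" and "residual energy" concrete, here is a minimal sketch of exhaustive block matching, the textbook idea behind encoder motion search. It is an illustration only: real codecs use far more elaborate searches and rate-distortion decisions, and the function names, block size and search range below are assumptions, not anything from the article.

```python
def sad(cur, ref, cy, cx, ry, rx, size):
    """Sum of absolute differences between a block of `cur` and a block of `ref`."""
    return sum(
        abs(cur[cy + r][cx + c] - ref[ry + r][rx + c])
        for r in range(size)
        for c in range(size)
    )


def best_motion_vector(prev, curr, y, x, size=4, search=2):
    """Find the (dy, dx) offset into `prev` that best predicts the block of
    `curr` at (y, x). Returns the offset and the leftover cost (residual)."""
    height, width = len(prev), len(prev[0])
    best = (0, 0)
    best_cost = sad(curr, prev, y, x, y, x, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            py, px = y + dy, x + dx
            # Skip candidate blocks that fall outside the reference frame.
            if py < 0 or px < 0 or py + size > height or px + size > width:
                continue
            cost = sad(curr, prev, y, x, py, px, size)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best, best_cost


# Synthetic 8x8 frames: `curr` is `prev` shifted one pixel to the right.
prev = [[r * 8 + c for c in range(8)] for r in range(8)]
curr = [[prev[r][c - 1] if c > 0 else 0 for c in range(8)] for r in range(8)]

mv, residual = best_motion_vector(prev, curr, y=2, x=3)
print(mv, residual)  # (0, -1) 0
```

The vector (0, -1) says the block came from one pixel to the left in the previous frame, and a residual of 0 means the prediction was exact. Signals of this kind are what the article calls codec intrinsics.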

#image-recognition

Rick's Cafe AI 11:44 am on October 28, 2025
Tags: Image Recognition

Qwen Image Model — New Open Source Leader?

There has been some excitement over the last week or two around the new model in the Qwen series by Alibaba. Qwen Image is a 20B-parameter MMDiT (Multimodal Diffusion Transformer) model, 3 billion parameters larger than HiDream, open-sourced under the Apache 2.0 license.

Beyond the core model, it uses the Qwen2.5-VL LLM for text encoding and a specialised VAE (Variational Autoencoder). It can reportedly render readable, multilingual text in much longer passages than previous models, and the VAE is trained to preserve small fonts, text edges and layout. Using Qwen2.5-VL as the text encoder should mean better language, vision and context understanding.

… These improvements come at a cost: size. The full BF16 model is 40GB in size, with the FP16 version of the text encoder coming in at an additional 16GB. FP8 versions are more reasonable at 20GB for the model and 9GB for the text encoder. If those sizes are still too large for your setup, then there are distilled versions available from links on the ComfyUI guide. City96 has also created various GGUF versions available for download from Hugging Face. — Read More
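
The quoted model sizes follow from parameter count times bytes per parameter. A rough sketch of that arithmetic, assuming decimal gigabytes and ignoring file metadata and any layers stored at a different precision (the helper name is made up for illustration):

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp8": 1}


def weights_size_gb(num_params, dtype):
    """Approximate weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


QWEN_IMAGE_PARAMS = 20e9  # 20B parameters, as stated in the post

for dtype in ("bf16", "fp8"):
    print(dtype, weights_size_gb(QWEN_IMAGE_PARAMS, dtype), "GB")
# bf16 40.0 GB
# fp8 20.0 GB
```

Run in reverse, the 16GB FP16 text encoder implies roughly 8 billion parameters (16 GB / 2 bytes). That is an inference from the quoted sizes, not a figure given in the post.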

#image-recognition

Rick's Cafe AI 11:06 am on October 10, 2025
Tags: Image Recognition

Why do LLMs freak out over the seahorse emoji?

Is there a seahorse emoji?

… [P]opular language models are very confident that there’s a seahorse emoji. And they’re not alone in that confidence.

… Maybe LLMs believe a seahorse emoji exists because so many humans in the training data do. Or maybe it’s a convergent belief – given how many other aquatic animals are in Unicode, it’s reasonable for both humans and LLMs to assume (generalize, even) that such a delightful animal is as well. A seahorse emoji was even formally proposed at one point, but was rejected in 2018.

Regardless of the root cause, many LLMs begin each new context window fresh with the mistaken latent belief that the seahorse emoji exists. But why does that produce such strange behavior? I mean, I used to believe a seahorse emoji existed myself, but if I had tried to send it to a friend, I would’ve simply looked for it on my keyboard and realized it wasn’t there, not sent the wrong emoji and then gone into an emoji spam doomloop. So what’s happening inside the LLM that causes it to act like this? — Read More
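
The "look for it on my keyboard" check has a rough programmatic analogue: ask the Unicode character database whether a character with that name exists. A small sketch using Python's standard `unicodedata` module follows. The helper name is an assumption, and the result depends on the Unicode version bundled with your Python build.

```python
import unicodedata


def find_character(name):
    """Return the character with this exact Unicode name, or None if absent."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


for name in ("TROPICAL FISH", "DOLPHIN", "OCTOPUS", "SEAHORSE"):
    ch = find_character(name)
    # Print code points rather than glyphs so the output is terminal-safe.
    print(name, "->", f"U+{ord(ch):04X}" if ch else "not in Unicode")
```

The aquatic neighbours resolve to code points while SEAHORSE does not, which is the gap between belief and lookup that the article goes on to examine inside the model.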

#image-recognition

Rick's Cafe AI 7:40 am on August 28, 2025
Tags: Image Recognition

DINOv3: Self-supervised learning for vision at unprecedented scale

Self-supervised learning (SSL), the concept that AI models can learn independently without human supervision, has emerged as the dominant paradigm in modern machine learning. It has driven the rise of large language models that acquire universal representations by pre-training on massive text corpora. However, progress in computer vision has lagged behind, as the most powerful image encoding models still rely heavily on human-generated metadata, such as web captions, for training.

Today, we’re releasing DINOv3, a generalist, state-of-the-art computer vision model trained with SSL that produces superior high-resolution visual features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks including object detection and semantic segmentation. — Read More

#image-recognition

Rick's Cafe AI 10:52 am on June 20, 2025
Tags: Image Recognition

Midjourney launches its first AI video generation model, V1

Midjourney, one of the most popular AI image generation startups, announced on Wednesday the launch of its much-anticipated AI video generation model, V1.

V1 is an image-to-video model, in which users can upload an image — or take an image generated by one of Midjourney’s other models — and V1 will produce a set of four five-second videos based on it. Much like Midjourney’s image models, V1 is only available through Discord, and it’s only available on the web at launch. — Read More

#image-recognition

Rick's Cafe AI 8:14 am on June 16, 2025
Tags: Image Recognition

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world. — Read More

#image-recognition

Rick's Cafe AI 1:35 pm on April 10, 2025
Tags: Image Recognition

OmniSVG: A Unified Scalable Vector Graphics Generation Model

Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs at huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structure. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows. — Read More
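
As a toy illustration of "parameterizing SVG commands and coordinates into discrete tokens", the sketch below maps a simple absolute-coordinate path string to integer ids. It is not the paper's tokenizer: the command subset, the grid size and the vocabulary layout (command ids first, then one id per coordinate pair) are all assumptions made for the example.

```python
import re

COMMANDS = ("M", "L", "C", "Z")  # hypothetical command subset
GRID = 200                       # hypothetical canvas size (coordinates 0..199)


def tokenize_path(d):
    """Turn a path string such as "M 10 20 L 30 40 Z" into integer tokens.

    Command tokens use ids 0..len(COMMANDS)-1. Each (x, y) pair is merged
    into a single token: len(COMMANDS) + x * GRID + y.
    """
    tokens = []
    pending = []  # coordinates waiting to be paired into (x, y)
    for part in re.findall(r"[A-Za-z]|\d+", d):
        if part.isalpha():
            tokens.append(COMMANDS.index(part))  # ValueError on unknown command
            continue
        pending.append(int(part))
        if len(pending) == 2:
            x, y = pending
            if x >= GRID or y >= GRID:
                raise ValueError(f"coordinate ({x}, {y}) outside {GRID}x{GRID} grid")
            tokens.append(len(COMMANDS) + x * GRID + y)
            pending.clear()
    return tokens


print(tokenize_path("M 10 20 L 30 40 Z"))  # [0, 2024, 1, 6044, 3]
```

A flat integer sequence like this is something a language-model-style decoder can predict token by token, with command tokens carrying the structure and coordinate tokens carrying the geometry.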

#image-recognition

Rick's Cafe AI 9:44 am on April 1, 2025
Tags: Image Recognition

OpenAI rolls out image generation powered by GPT-4o to ChatGPT

OpenAI is integrating new image generation capabilities directly into ChatGPT — this feature is dubbed “Images in ChatGPT.” Users can now use GPT-4o to generate images within ChatGPT itself.

This initial release focuses solely on image creation and will be available across ChatGPT Plus, Pro, Team, and Free subscription tiers. The free tier’s usage limit is the same as DALL-E, spokesperson Taya Christianson told The Verge, but added that they “didn’t have a specific number to share” and “these may change over time based on demand.” Per the ChatGPT FAQ, free users were previously able to generate “three images per day with DALL·E 3.” As for the fate of DALL-E, Christianson said “fans” will “still have access via a custom GPT.” — Read More

#image-recognition

Rick's Cafe AI 1:40 pm on December 6, 2024
Tags: Image Recognition

Veo and Imagen 3: Announcing new video and image generation models on Vertex AI

Generative AI is leading to real business growth and transformation. Among enterprise companies with gen AI in production, 86% report an increase in revenue¹, with an estimated 6% growth. That’s why Google is investing in its AI technology with new models like Veo, our most advanced video generation model, and Imagen 3, our highest quality image generation model. … Veo, now available on Vertex AI in private preview, empowers companies to effortlessly generate high-quality videos from simple text or image prompts, while Imagen 3 generates the most realistic and highest quality images from simple text prompts, surpassing previous versions of Imagen in detail, lighting, and artifact reduction. Imagen 3 will be available to all Vertex AI customers starting next week. — Read More

#image-recognition

Rick's Cafe AI 10:22 am on December 5, 2024
Tags: Image Recognition

Pixtral 12B

We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks, surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar sizes (Llama-3.2 11B & Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under the Apache 2.0 license. – Read More

Webpage: https://mistral.ai/news/pixtral-12b/
Inference code: https://github.com/mistralai/mistral-inference/
Evaluation code: https://github.com/mistralai/mistral-evals/

#image-recognition

Rick's Cafe AI

The latest in Artificial Intelligence carefully curated into its own special blend

Tag Archives: Image Recognition
