Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super-resolution models. We focus on two relevant real-world applications: simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Read More
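The core idea above — frozen image layers process each frame independently while newly added temporal layers share information across frames, merged by a learned blend — can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the paper's implementation: the placeholder `spatial_layer`, the moving-average stand-in for temporal attention, and the fixed `alpha` blend weight are all assumptions for illustration.

```python
import numpy as np

def spatial_layer(x):
    # Stand-in for a frozen image-LDM layer: it acts on each frame
    # independently (frames are folded into the batch axis).
    return x * 0.9 + 0.1  # placeholder transform

def temporal_mix(x, alpha=0.5):
    """Sketch of a temporal alignment layer.

    x: latent video batch of shape (batch, frames, channels, h, w).
    The spatial (image) path treats every frame independently; the
    temporal path averages neighboring frames so content is shared
    across time. In the real model alpha is learned and the temporal
    path is attention/convolution over the frame axis.
    """
    b, t, c, h, w = x.shape
    # Image path: fold frames into the batch dimension, apply the
    # frozen per-frame layer, then unfold.
    spatial = spatial_layer(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
    # Temporal path: simple moving average over the frame axis as a
    # stand-in for a temporal layer.
    temporal = (np.roll(spatial, 1, axis=1) + spatial
                + np.roll(spatial, -1, axis=1)) / 3.0
    # Learned blend between the image-only path and the temporal path.
    return alpha * spatial + (1.0 - alpha) * temporal

video_latents = np.random.randn(2, 8, 4, 16, 16)  # (batch, frames, c, h, w)
out = temporal_mix(video_latents)
print(out.shape)  # (2, 8, 4, 16, 16)
```

Note that with `alpha=1.0` the layer reduces exactly to the frozen per-frame image path, which mirrors why a pre-trained image LDM can be reused unchanged and only the temporal pieces need training.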

Paper

#image-recognition, #nvidia

Enhancing Vision-language Understanding with Advanced Large Language Models

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4’s advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4, like detailed image description generation and website creation from hand-written drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc. In our experiment, we found that only performing the pretraining on raw image-text pairs could produce unnatural language outputs that lack coherence, including repetition and fragmented sentences. To address this problem, we curate a high-quality, well-aligned dataset in the second stage to finetune our model using a conversational template. This step proved crucial for augmenting the model’s generation reliability and overall usability. Notably, our model is highly computationally efficient, as we only train a projection layer utilizing approximately 5 million aligned image-text pairs. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/. Read More
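The alignment recipe described above — a frozen visual encoder, a frozen LLM, and a single trainable projection mapping visual features into the LLM's token-embedding space — can be sketched in a few lines. This is a hedged NumPy sketch, not the MiniGPT-4 code: the dimensions, the token count, and the `frozen_visual_encoder` stand-in are all hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 1024-d visual features, 4096-d LLM embeddings.
VIS_DIM, LLM_DIM = 1024, 4096

def frozen_visual_encoder(image):
    # Stand-in for the frozen visual encoder: in the real system this
    # returns image features; here we fake 32 visual tokens.
    return rng.standard_normal((32, VIS_DIM))

# The ONLY trainable parameters in this scheme: one linear projection.
W = rng.standard_normal((VIS_DIM, LLM_DIM)) * 0.01

def project_to_llm_space(image):
    """Map frozen visual features into the LLM's token-embedding space.

    Encoder and LLM stay frozen; only W is updated during training, so
    the image is turned into a sequence of 'soft prompt' vectors that
    can be prepended to the text token embeddings.
    """
    feats = frozen_visual_encoder(image)
    return feats @ W  # shape: (num_visual_tokens, LLM_DIM)

visual_tokens = project_to_llm_space(image=None)  # stand-in ignores its input
print(visual_tokens.shape)  # (32, 4096)
```

Because only `W` receives gradients, the trainable parameter count is tiny compared to the encoder and LLM, which is consistent with the abstract's claim of training efficiently on roughly 5 million image-text pairs.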

Paper

Demo links: Link1, Link2, Link3, Link4, Link5, Link6

#chatbots, #image-recognition

Sony World Photography Award 2023: Winner refuses award after revealing AI creation

The winner of a major photography award has refused his prize after revealing his work was in fact an AI creation.

German artist Boris Eldagsen’s entry, entitled Pseudomnesia: The Electrician, won the creative open category at last week’s Sony World Photography Award.

He said he used the picture to test the competition and to create a discussion about the future of photography. Read More

#fake, #image-recognition

Stability AI debuts next-gen photorealistic image generation model

Generative artificial intelligence company Stability AI Ltd. today released an updated version of its popular open-source photorealistic image generation model.

…The new model is called Stable Diffusion XL, the latest addition to the Stable Diffusion suite. It’s being made available through an application programming interface and caters to enterprise developers. Using SDXL, developers will be able to create more detailed imagery. The company says it represents a key step forward in its image generation models. Read More

#image-recognition, #vfx

OpenAI looks beyond diffusion with ‘consistency’-based image generator

The field of image generation moves quickly. Though the diffusion models used by popular tools like Midjourney and Stable Diffusion may seem like the best we’ve got, the next thing is always coming — and OpenAI might have hit on it with “consistency models,” which can already do simple tasks an order of magnitude faster than the likes of DALL-E.

The paper was put online as a preprint last month, and was not accompanied by the fanfare OpenAI reserves for its major releases. That’s no surprise: this is definitely just a research paper, and it’s very technical. But the results of this early and experimental technique are interesting enough to note. Read More

#image-recognition

AI image creator comes to Microsoft Bing

Microsoft’s Bing search engine and Edge browser are now equipped with an AI-powered image creator.

Why it matters: The tool uses OpenAI’s DALL-E to generate images from text prompts, and its rollout today reflects how quickly Microsoft has been building on its OpenAI partnership.

Read More

#big7, #image-recognition

Runway debuts AI model that can generate videos from text

Startup Runway AI Inc. today debuted Gen-2, an artificial intelligence model that can generate brief video clips based on text prompts.

… Gen-2, the startup’s new AI model for generating videos, is an improved version of an existing neural network called Gen-1 that debuted in February. …Runway’s original Gen-1 neural network takes an existing video as input along with a text prompt that describes what edits should be made. A user could, for example, supply Gen-1 with a video of a green car and a text prompt that reads “paint the car red”. The model will then automatically make the corresponding edits. Read More

#image-recognition

Midjourney V5 is Out Now – Next Steps in Photorealistic Experience with AI Art

A recent breakthrough in AI you might have missed: the highly awaited Midjourney V5 is out now. The independent research lab has just released the latest version of its famous AI art generator. Some already call it “a world of photorealistic wonder” in terms of creating breathtaking images from text prompts. Wonder or not, the newly trained model promises significant improvements in language understanding, accuracy, and stylistic flexibility. Let’s try it out together and see what this update is capable of.

V5 is the second deep-learning model from Midjourney and has been in the works for the past five months. It claims to use a completely different neural architecture and new aesthetic techniques compared to its predecessor. As the developers put it: “You might hear it characterized as newly trained, bigger-brained, that it knows more, understands more, or listens better. All these things are true of V5.” Of course, we had to try for ourselves. And lo and behold, this release does create wonders, even if it is still just an alpha test. Read More

#image-recognition, #vfx

A Face Recognition Site Crawled the Web for Dead People’s Photos

PimEyes appears to have scraped a major ancestry website for pics, without permission. Experts fear the images could be used to identify living relatives.

Finding out Taylor Swift was her 11th cousin twice-removed wasn’t even the most shocking discovery Cher Scarlett made while exploring her family history. “There’s a lot of stuff in my family that’s weird and strange that we wouldn’t know without Ancestry,” says Scarlett, a software engineer and writer based in Kirkland, Washington. “I didn’t even know who my mum’s paternal grandparents were.”

Ancestry.com isn’t the only site that Scarlett checks regularly. In February 2022, the facial recognition search engine PimEyes surfaced non-consensual explicit photos of her at age 19, reigniting decades-old trauma. She attempted to get the pictures removed from the platform, which uses images scraped from the internet to create biometric “faceprints” of individuals. Since then, she’s been monitoring the site to make sure the images don’t return.

In January, she noticed that PimEyes was returning pictures of children that looked like they came from Ancestry.com URLs. As an experiment, she searched for a grayscale version of one of her own baby photos. It came up with a picture of her own mother, as an infant, in the arms of her grandparents—taken, she thought, from an old family photo that her mother had posted on Ancestry. Searching deeper, Scarlett found other images of her relatives, also apparently sourced from the site. They included a black-and-white photo of her great-great-great-grandmother from the 1800s, and a picture of Scarlett’s own sister, who died at age 30 in 2018. The images seemed to come from her digital memorial, Ancestry, and Find a Grave, a cemetery directory owned by Ancestry.

PimEyes, Scarlett says, has scraped images of the dead to populate its database. By indexing their facial features, the site’s algorithms can use those images to identify living people through their ancestral connections, raising privacy and data protection concerns, as well as ethical ones.

Read More

#image-recognition, #ethics

Online storm erupts over AI work in Dutch museum’s ‘Girl with a Pearl Earring’ display

Mauritshuis currently has 170 works on display as part of its “My Girl with a Pearl” initiative while Vermeer’s masterpiece is on loan

The Mauritshuis museum in The Hague, Netherlands, is facing criticism for showing an image made using artificial intelligence (AI) which is inspired by Vermeer’s famous Girl with a Pearl Earring.

The work by Berlin-based Julian van Dieken, who describes himself as a “digital creator”, is one of five images, out of around 3,480 submitted, chosen for the My Girl with a Pearl initiative, whereby devotees of the famous painting were invited to send in their own versions of the iconic image.

The winning entries are on show at the Mauritshuis while Vermeer’s 1665 original masterpiece is on loan to the Rijksmuseum in Amsterdam (until 4 June); 170 entries are shown on a loop in a digital frame. Read More

#image-recognition