Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super-resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Read More
Paper
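The core idea of the abstract — keep the pretrained spatial layers of an image LDM frozen and interleave newly trained temporal layers that attend across frames — can be sketched roughly as follows. This is a minimal, hypothetical PyTorch illustration; the layer name, the gating scheme, and all sizes are assumptions for clarity, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Sketch of a temporal alignment layer. The frozen image LDM processes
    each frame independently; this added layer attends only across the time
    axis. A learned mixing weight `alpha` starts at 1.0, so at initialization
    the block is an identity and the pretrained image model is preserved."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.alpha = nn.Parameter(torch.ones(1))  # identity gate at init

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width) latent feature map
        bt, c, h, w = x.shape
        b = bt // num_frames
        # Fold space into the batch and expose time as the sequence axis:
        # (batch * height * width, frames, channels)
        seq = (x.view(b, num_frames, c, h * w)
                .permute(0, 3, 1, 2)
                .reshape(b * h * w, num_frames, c))
        seq_n = self.norm(seq)
        temporal, _ = self.attn(seq_n, seq_n, seq_n)
        # Blend: alpha=1 keeps the per-frame features (identity),
        # alpha=0 uses the purely temporal output.
        mixed = self.alpha * seq + (1.0 - self.alpha) * temporal
        return (mixed.view(b, h * w, num_frames, c)
                     .permute(0, 2, 3, 1)
                     .reshape(bt, c, h, w))
```

Because only layers like this one are trained while the spatial weights stay frozen, an off-the-shelf image LDM such as Stable Diffusion can be converted into a video model without retraining its image backbone — which is how the abstract's "temporal alignment model" framing plays out in practice.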
Tag Archives: Image Recognition
Enhancing Vision-language Understanding with Advanced Large Language Models
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision language models. We believe the primary reason for GPT-4’s advanced multi-modal generation capabilities lies in the utilization of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4 like detailed image description generation and website creation from hand-written drafts. Furthermore, we also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, teaching users how to cook based on food photos, etc. In our experiment, we found that only performing the pretraining on raw image-text pairs could produce unnatural language outputs that lack coherency including repetition and fragmented sentences. To address this problem, we curate a high-quality, well-aligned dataset in the second stage to finetune our model using a conversational template. This step proved crucial for augmenting the model’s generation reliability and overall usability. Notably, our model is highly computationally efficient, as we only train a projection layer utilizing approximately 5 million aligned image-text pairs. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/. Read More
Paper
Demo links: Link1, Link2, Link3, Link4, Link5, Link6
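The alignment recipe described above is unusually light: only a single projection layer is trained, mapping features from the frozen visual encoder into the frozen LLM's token-embedding space so that image tokens can be prefixed to the text prompt. A minimal sketch — dimensions and names here are illustrative assumptions, not MiniGPT-4's actual configuration:

```python
import torch
import torch.nn as nn

class VisionToLLMProjection(nn.Module):
    """Illustrative sketch of the only trained component: a linear map from
    frozen visual-encoder features into the frozen LLM's embedding space.
    Projected image tokens are prepended to the text embeddings, acting as
    a soft prompt that the LLM conditions on."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vis_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # vis_feats:   (batch, n_img_tokens, vision_dim), frozen encoder output
        # text_embeds: (batch, n_text_tokens, llm_dim), frozen LLM embeddings
        img_tokens = self.proj(vis_feats)
        return torch.cat([img_tokens, text_embeds], dim=1)
```

Since both the encoder and the LLM stay frozen, training reduces to fitting roughly vision_dim × llm_dim parameters on the ~5 million aligned image-text pairs — which is why the approach is so computationally cheap; the second-stage conversational fine-tuning the abstract mentions then addresses the fluency problems of this first stage.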
Sony World Photography Award 2023: Winner refuses award after revealing AI creation
The winner of a major photography award has refused his prize after revealing his work was in fact an AI creation.
German artist Boris Eldagsen’s entry, entitled Pseudomnesia: The Electrician, won the creative open category at last week’s Sony World Photography Awards.
He said he used the picture to test the competition and to create a discussion about the future of photography. Read More
Stability AI debuts next-gen photorealistic image generation model
Generative artificial intelligence company Stability AI Ltd. today released an updated version of its popular open-source photorealistic image generation model.
…The latest model, called Stable Diffusion XL, is the newest addition to the Stable Diffusion suite. It’s being made available through an application programming interface and caters to enterprise developers. Using SDXL, developers will be able to create more detailed imagery. The company says it represents a key step forward in its image generation models. Read More
OpenAI looks beyond diffusion with ‘consistency’-based image generator
The field of image generation moves quickly. Though the diffusion models used by popular tools like Midjourney and Stable Diffusion may seem like the best we’ve got, the next thing is always coming — and OpenAI might have hit on it with “consistency models,” which can already do simple tasks an order of magnitude faster than the likes of DALL-E.
The paper was put online as a preprint last month, and was not accompanied by the understated fanfare OpenAI reserves for its major releases. That’s no surprise: This is definitely just a research paper, and it’s very technical. But the results of this early and experimental technique are interesting enough to note. Read More
AI image creator comes to Microsoft Bing
Microsoft’s Bing search engine and Edge browser are now equipped with an AI-powered image creator.
Why it matters: The tool uses OpenAI’s DALL-E to generate images from text prompts, and its rollout today reflects how quickly Microsoft has been building on its OpenAI partnership.
- Since expanding the relationship significantly two months ago, the tech giant has also launched a new AI-powered Bing and Edge browser and announced plans to bring AI into its Microsoft 365 tools (Word, Excel, etc.).
#big7, #image-recognition
Runway debuts AI model that can generate videos from text
Startup Runway AI Inc. today debuted Gen-2, an artificial intelligence model that can generate brief video clips based on text prompts.
… Gen-2, the startup’s new AI model for generating videos, is an improved version of an existing neural network called Gen-1 that debuted in February. …Runway’s original Gen-1 neural network takes an existing video as input along with a text prompt that describes what edits should be made. A user could, for example, supply Gen-1 with a video of a green car and a text prompt that reads “paint the car red”. The model will then automatically make the corresponding edits. Read More
Midjourney V5 is Out Now – Next Steps in Photorealistic Experience with AI Art
A recent breakthrough in AI, you might have missed: the highly awaited Midjourney V5 is out now. The independent research lab has just released the latest version of its famous AI art generator. Some already call it “a world of photorealistic wonder” in terms of creating breathtaking images from text prompts. Wonder or not, the newly trained model promises significant improvements in language understanding, accuracy, and stylistic flexibility. Let’s try it out together and see what this update is capable of.
V5 is the second deep-learning model from Midjourney and has been in the works for the past five months. It claims to use a completely different neural architecture and new aesthetic techniques compared to its predecessor. As the developers put it: “You might hear it characterized as newly trained, bigger-brained, that it knows more, understands more, or listens better. All these things are true of V5.” Of course, we had to try for ourselves. And lo and behold, this release does create wonders, even if it is still just an alpha test. Read More
A Face Recognition Site Crawled the Web for Dead People’s Photos
PimEyes appears to have scraped a major ancestry website for pics, without permission. Experts fear the images could be used to identify living relatives.
Finding out Taylor Swift was her 11th cousin twice-removed wasn’t even the most shocking discovery Cher Scarlett made while exploring her family history. “There’s a lot of stuff in my family that’s weird and strange that we wouldn’t know without Ancestry,” says Scarlett, a software engineer and writer based in Kirkland, Washington. “I didn’t even know who my mum’s paternal grandparents were.”
Ancestry.com isn’t the only site that Scarlett checks regularly. In February 2022, the facial recognition search engine PimEyes surfaced non-consensual explicit photos of her at age 19, reigniting decades-old trauma. She attempted to get the pictures removed from the platform, which uses images scraped from the internet to create biometric “faceprints” of individuals. Since then, she’s been monitoring the site to make sure the images don’t return.
In January, she noticed that PimEyes was returning pictures of children that looked like they came from Ancestry.com URLs. As an experiment, she searched for a grayscale version of one of her own baby photos. It came up with a picture of her own mother, as an infant, in the arms of her grandparents—taken, she thought, from an old family photo that her mother had posted on Ancestry. Searching deeper, Scarlett found other images of her relatives, also apparently sourced from the site. They included a black-and-white photo of her great-great-great-grandmother from the 1800s, and a picture of Scarlett’s own sister, who died at age 30 in 2018. The images seemed to come from her digital memorial, Ancestry, and Find a Grave, a cemetery directory owned by Ancestry.
PimEyes, Scarlett says, has scraped images of the dead to populate its database. By indexing their facial features, the site’s algorithms can use those images to identify living people through their ancestral connections, raising privacy and data protection concerns, as well as ethical ones.
Online storm erupts over AI work in Dutch museum’s ‘Girl with a Pearl Earring’ display
Mauritshuis currently has 170 works on display as part of its “My Girl with a Pearl” initiative while Vermeer’s masterpiece is on loan
The Mauritshuis museum in The Hague, Netherlands, is facing criticism for showing an image made using artificial intelligence (AI) which is inspired by Vermeer’s famous Girl with a Pearl Earring.
The work by Berlin-based Julian van Dieken, who describes himself as a “digital creator”, is one of five images, out of around 3,480 submitted, chosen for the My Girl with a Pearl initiative, whereby devotees of the famous painting were invited to send in their own versions of the girl.
The winning entries are on show at the Mauritshuis while Vermeer’s 1665 original masterpiece is on loan to the Rijksmuseum in Amsterdam (until 4 June); 170 entries are shown on a loop in a digital frame. Read More