Latte: Latent Diffusion Transformer for Video Generation

We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model the video distribution in the latent space. In order to model the substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices for Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where Latte achieves results comparable to those of recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation. — Read More

#nlp, #image-recognition

Retrieval-Augmented Generation for Large Language Models: A Survey

Large Language Models (LLMs) demonstrate significant capabilities but face challenges such as hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This enhances the accuracy and credibility of the models, particularly for knowledge-intensive tasks, and allows for continuous knowledge updates and integration of domain-specific information. RAG synergistically merges LLMs’ intrinsic knowledge with the vast, dynamic repositories of external databases. This comprehensive review paper offers a detailed examination of the progression of RAG paradigms, encompassing the Naive RAG, the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the tripartite foundation of RAG frameworks, which includes the retrieval, the generation, and the augmentation techniques. The paper highlights the state-of-the-art technologies embedded in each of these critical components, providing a profound understanding of the advancements in RAG systems. Furthermore, this paper introduces the metrics and benchmarks for assessing RAG models, along with the most up-to-date evaluation framework. In conclusion, the paper delineates prospective avenues for research, including the identification of challenges, the expansion of multi-modalities, and the progression of the RAG infrastructure and its ecosystem. — Read More
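The "Naive RAG" loop the survey describes — retrieve relevant passages, then condition generation on them — can be sketched in a few lines. The bag-of-words retriever and prompt template below are illustrative assumptions for the sketch, not anything from the paper; production systems use dense embeddings and a real LLM call in place of the toy pieces here:

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; real RAG systems use dense neural encoders.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Rank corpus passages by similarity to the query; keep the top k.
    q = embed(query)
    scored = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:k]

def build_prompt(query, passages):
    # Augmentation step: splice retrieved context into the LLM prompt.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )
```

The generation step would then pass `build_prompt(...)` to whichever LLM is in use; the survey's Advanced and Modular RAG variants add reranking, query rewriting, and iterative retrieval on top of this skeleton.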

Original Paper

#nlp, #performance

Sora, Groq, and Virtual Reality

Matthew Ball wrote a fun essay earlier this month entitled On Spatial Computing, Metaverse, the Terms Left Behind and Ideas Renewed, tracing the various terms that have been used to describe, well, that’s what the essay is about: virtual reality, augmented reality, mixed reality, Metaverse — these are words that have been floating around for decades now, both in science fiction and in products, to describe what Apple is calling spatial computing.

Personally, I agree with Ball that “Metaverse” is the best of the lot, particularly given Ball’s succinct description of the concept in his conclusion:

I liked the term Metaverse because it worked like the Internet, but for 3D. It wasn’t about a device or even computing at large, just as the Internet was not about the PC nor the client-server model. The Metaverse is a vast and interconnected network of real-time 3D experiences. For passthrough or optical MR to scale, a “3D Internet” is required – which means overhauls to networking infrastructure and protocols, advances in computing infrastructure, and more. This is, perhaps, the one final challenge with the term – it describes more of an end state than a transition. — Read More

#metaverse, #vfx

Microsoft, OpenAI say U.S. rivals use artificial intelligence in hacking

Russia, China and other U.S. adversaries are using the newest wave of artificial intelligence tools to improve their hacking abilities and find new targets for online espionage, according to a report Wednesday from Microsoft and its close business partner OpenAI. — Read More

#cyber, #russia, #china

If you thought Sora was impressive now watch it with AI generated sound from ElevenLabs

Artificial intelligence speech startup ElevenLabs offered an insight into what it’s planning to release in the future, adding sound effects to AI-generated video for the first time.

Best known for its near human-like text-to-speech and synthetic voice services, ElevenLabs added artificially generated sound effects to videos produced using OpenAI’s Sora.

OpenAI unveiled its impressive Sora text-to-video artificial intelligence model last week, showcasing some of the most realistic, consistent, and longest AI-generated video to date. — Read More

#audio, #vfx

Sora, and the Future of VFX Compositing

… The Future (You will experience this moment soon)

There’s a moment that stays with you—the first time you witness your thoughts materialize into visual marvels on the screen. It’s akin to the first successful alchemists turning lead into gold, except our lead is the raw, unshaped ideas, and our gold, the breathtaking visuals rendered from the ether of our imagination. The advent of AI-driven tools like OpenAI’s Sora has been nothing short of a revelation, a glimpse into a future where creating temporally consistent video content is as effortless as describing a sunrise to a friend. — Read More

#vfx

Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings

We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is widely used in chess and other competitive games. We invite the entire community to join this effort by contributing new models and evaluating them by asking questions and voting for your favorite answer. — Read More
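As a concrete illustration of how pairwise battles turn into a leaderboard, here is the standard Elo update after a single battle. The K-factor of 32 is a common default, not necessarily what Chatbot Arena uses; this is a sketch of the rating system itself, not of their exact pipeline:

```python
def elo_update(r_a, r_b, winner, k=32):
    # Expected score for model A under the Elo model: a logistic
    # function of the rating gap, scaled so that a 400-point lead
    # corresponds to ~10x better odds of winning.
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    # Actual score: 1 if A won the battle, 0 if B won.
    s_a = 1.0 if winner == "a" else 0.0
    # Each model's rating moves toward its observed result;
    # the updates are zero-sum.
    r_a_new = r_a + k * (s_a - e_a)
    r_b_new = r_b + k * ((1 - s_a) - (1 - e_a))
    return r_a_new, r_b_new
```

Two equally rated models (say, both at 1000) that fight once will diverge by 2 × K × 0.5 = 32 points: the winner gains 16 and the loser drops 16. Upsets against much higher-rated opponents move ratings more, which is what lets the leaderboard converge from crowdsourced votes.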

You can compare models’ relative performance for yourself, or add new models, here.

#performance

AI Sound Effects

We were blown away by the Sora announcement but felt it needed something… What if you could describe a sound and generate it with AI? — Read More

#audio

Groq

Groq is on a mission to set the standard for GenAI inference speed, helping real-time AI applications come to life today. Using a new type of end-to-end processing unit system, called an LPU Inference Engine, with LPU standing for Language Processing Unit™, Groq provides the fastest inference for computationally intensive applications with a sequential component to them, such as AI language applications (LLMs). Groq supports standard machine learning (ML) frameworks such as PyTorch, TensorFlow, and ONNX for inference. Groq does not currently support ML training with the LPU Inference Engine. — Read More

#nlp, #nvidia

Billy Joel With AI ‘Turn the Lights Back On’

Read More

#videos