Multi-modal models that can process both text and images are a growing area of research in artificial intelligence. However, training these models presents a unique challenge: language models deal with discrete values (words and tokens), while image generation models must handle continuous pixel values.
Current multi-modal models typically bridge this gap by quantizing images into discrete tokens, which degrades how faithfully the data is represented. In a new research paper, scientists from Meta and the University of Southern California introduce Transfusion, a novel technique that enables a single model to seamlessly handle both discrete and continuous modalities. — Read More
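To make the core idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of how one shared transformer can be trained on a mixed sequence: text positions receive a next-token cross-entropy loss, while continuous image patches receive a diffusion-style noise-prediction (MSE) loss. All module names, dimensions, and the simple additive loss weighting are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, PATCH_DIM, D_MODEL = 1000, 64, 128

class TinyTransfusionSketch(nn.Module):
    """Toy model: one backbone, two heads (discrete tokens + continuous patches)."""
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)         # discrete text tokens
        self.patch_in = nn.Linear(PATCH_DIM, D_MODEL)               # continuous image patches
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # shared transformer
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)               # predicts next tokens
        self.noise_head = nn.Linear(D_MODEL, PATCH_DIM)             # predicts added noise

    def forward(self, text_ids, noisy_patches):
        # Embed both modalities and process them as one mixed sequence.
        seq = torch.cat([self.text_embed(text_ids), self.patch_in(noisy_patches)], dim=1)
        h = self.backbone(seq)
        n_text = text_ids.shape[1]
        return self.lm_head(h[:, :n_text]), self.noise_head(h[:, n_text:])

model = TinyTransfusionSketch()
text_ids = torch.randint(0, VOCAB_SIZE, (2, 16))      # discrete side
clean_patches = torch.randn(2, 8, PATCH_DIM)          # continuous side
noise = torch.randn_like(clean_patches)
noisy_patches = clean_patches + noise                 # toy "noising" step

logits, noise_pred = model(text_ids, noisy_patches)

# Language-modeling loss on text, diffusion-style MSE on image patches.
lm_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB_SIZE), text_ids[:, 1:].reshape(-1)
)
diff_loss = F.mse_loss(noise_pred, noise)
loss = lm_loss + diff_loss                            # combined objective
loss.backward()
```

The point of the sketch is only that the image pathway never gets quantized into tokens: the continuous patches flow through the same backbone as the text and are supervised with a regression-style objective rather than a discrete one.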