LEGO: Language Enhanced Multi-modal Grounding Model

Multi-modal large language models have demonstrated impressive performance across various tasks in different modalities. However, existing multi-modal models primarily emphasize capturing global information within each modality while neglecting the importance of perceiving local information across modalities. Consequently, these models lack the ability to effectively understand the fine-grained details of input data, limiting their performance in tasks that require a more nuanced understanding. To address this limitation, there is a compelling need to develop models that enable fine-grained understanding across multiple modalities, thereby enhancing their applicability to a wide range of tasks. In this paper, we propose LEGO, a language enhanced multi-modal grounding model. Beyond capturing global information like other multi-modal models, our proposed model excels at tasks demanding a detailed understanding of local information within the input. It demonstrates precise identification and localization of specific regions in images or moments in videos. To achieve this objective, we design a diversified dataset construction pipeline, resulting in a multi-modal, multi-granularity dataset for model training. The code, dataset, and demo of our model can be found at https://github.com/lzw-lzw/LEGO. – Read More

#nlp, #multi-modal

OpenVoice: Versatile Instant Voice Cloning

We introduce OpenVoice, a versatile voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice represents a significant advancement in addressing the following open challenges in the field: 1) Flexible Voice Style Control. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. The voice styles are not directly copied from and constrained by the style of the reference speaker. Previous approaches lacked the ability to flexibly manipulate voice styles after cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. Unlike previous approaches, which typically require an extensive massive-speaker multi-lingual (MSML) dataset for all languages, OpenVoice can clone voices into a new language without any massive-speaker training data for that language. OpenVoice is also computationally efficient, costing tens of times less than commercially available APIs that offer even inferior performance. To foster further research in the field, we have made the source code and trained model publicly accessible. We also provide qualitative results on our demo website. Prior to its public release, our internal version of OpenVoice was used tens of millions of times by users worldwide between May and October 2023, serving as the backend of MyShell. – Read More

#nlp, #audio

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers’ computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.  – Read More
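To make the "selective" idea in the abstract concrete, here is a minimal sketch of a single-channel state space recurrence whose parameters depend on the current input. It is a toy under stated assumptions, not the authors' implementation: the function names, the scalar weights (`w_delta`, `b_delta`, `w_B`, `w_C`), the diagonal `A`, and the simplified discretization are all illustrative choices, and it runs as a plain sequential loop rather than the hardware-aware parallel algorithm the abstract mentions.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, A, w_delta, b_delta, w_B, w_C):
    """Toy selective SSM on a scalar sequence (all names are hypothetical).

    xs               : list of floats, the input sequence
    A                : list of negative floats, a fixed diagonal state matrix
    w_delta, b_delta : scalars that turn the input into a step size
    w_B, w_C         : lists of floats that turn the input into B_t and C_t
    """
    h = [0.0] * len(A)  # hidden state, one entry per state dimension
    ys = []
    for x in xs:  # one pass over the sequence: cost is linear in its length
        delta = softplus(w_delta * x + b_delta)  # input-dependent step size
        new_h = []
        for n, a in enumerate(A):
            keep = math.exp(delta * a)     # in (0, 1): how much old state survives
            write = delta * (w_B[n] * x)   # input-dependent write strength
            new_h.append(keep * h[n] + write * x)
        h = new_h
        # Input-dependent readout of the state.
        ys.append(sum((w_C[n] * x) * h[n] for n in range(len(A))))
    return ys
```

When `delta` is close to zero the state is carried forward almost unchanged and the current token is effectively ignored; a large `delta` lets the token overwrite the state. That per-token choice is the "selectively propagate or forget" behaviour described above.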

#nlp

You don’t need hosted LLMs, do you?

A comparison of self-hosted LLMs and OpenAI: cost, text generation quality, development speed, and privacy.

During the LLM hype, you can find a lot of articles like “Fine-tune your Private LLaMA/Falcon/Another Popular LLM”, “Train Your Own Private ChatGPT”, “How to Create a Local LLM” and others.

At the same time, only a few people explain why you would need one. I mean, are you really sure you need your own self-hosted LLM? Maybe the OpenAI API could be the best choice for you. – Read More

#strategy, #nlp

Google launches Gemini, the AI model it hopes will take down GPT-4

Google has been an ‘AI-first company’ for nearly a decade. Now, a year into the AI era brought on by ChatGPT, it’s finally making a big move.

It’s the beginning of a new era of AI at Google, says CEO Sundar Pichai: the Gemini era. Gemini is Google’s latest large language model, which Pichai first teased at the I/O developer conference in June and is now launching to the public. To hear Pichai and Google DeepMind CEO Demis Hassabis describe it, it’s a huge leap forward in an AI model that will ultimately affect practically all of Google’s products. “One of the powerful things about this moment,” Pichai says, “is you can work on one underlying technology and make it better and it immediately flows across our products.”  — Read More

Introducing Gemini

#nlp, #big7

Decoding LLMs: Creating Transformer Encoders and Multi-Head Attention Layers in Python from Scratch

Today, Computational Natural Language Processing (NLP) is a rapidly evolving endeavour in which the power of computation meets linguistics. The linguistic side of it is mainly attributed to the theory of Distributional Semantics by John Rupert Firth. He once said the following:

“You shall know a word by the company it keeps”

So, the semantic representation of a word is determined by the context in which it is being used. It is precisely in light of this assumption that the paper “Attention is all you need” by Ashish Vaswani et al. [1] assumes its groundbreaking relevance. It set the transformer architecture as the core of many of the rapidly growing tools like BERT, GPT-4, Llama, etc.

In this article, we examine the key mathematical operations at the heart of the encoder segment in the transformer architecture. — Read More
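As a rough illustration of the operations the article refers to, the sketch below implements scaled dot-product attention and a bare-bones multi-head wrapper in plain Python. It is not the linked article's code: the helper names and the list-of-lists matrix representation are assumptions made here for self-containment, and the final output projection, masking, residual connections, and layer normalization of a full encoder layer are left out.

```python
import math


def matmul(a, b):
    # Plain list-of-lists matrix product: (n x m) times (m x p) gives (n x p).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads):
    """heads is a list of (W_q, W_k, W_v) projection matrices, one triple per head.

    Each head attends independently; the per-token results are concatenated
    along the feature dimension.
    """
    outputs = [
        scaled_dot_product_attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))[0]
        for wq, wk, wv in heads
    ]
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]
```

Each row of `weights` sums to 1, so every output token is a weighted average of the value vectors, with the weights set by how well that token's query matches each key.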

#nlp, #devops

CogVLM: Visual Expert for Pretrained Language Models

We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at this https URL. — Read More

#nlp

Large Language Models, ALBERT — A Lite BERT for Self-supervised Learning

In recent years, the evolution of large language models has skyrocketed. BERT became one of the most popular and efficient models, making it possible to solve a wide range of NLP tasks with high accuracy. After BERT, a number of other models appeared on the scene, demonstrating outstanding results as well.

The obvious trend is that large language models (LLMs) grow more complex over time, with exponential increases in the number of parameters and the amount of data they are trained on. Research in deep learning showed that such techniques usually lead to better results. Unfortunately, the machine learning world has already dealt with several problems regarding LLMs, and scalability has become the main obstacle in effective training, storing and using them.

As a consequence, new LLMs have recently been developed to tackle scalability issues. In this article, we will discuss ALBERT, which was invented in 2020 with the objective of significantly reducing the number of BERT parameters. — Read More

#nlp

OpenAI turbocharges GPT-4 and makes it cheaper

OpenAI announced more improvements to its large language models, GPT-4 and GPT-3.5, including updated knowledge bases and a much longer context window. The company says it will also follow Google and Microsoft’s lead and begin protecting customers against copyright lawsuits.

GPT-4 Turbo, currently available via an API preview, has been trained with information dating to April 2023, the company announced Monday at its first-ever developer conference. The earlier version of GPT-4 released in March only learned from data dated up to September 2021. OpenAI plans to release a production-ready Turbo model in the next few weeks but did not give an exact date. — Read More

#chatbots, #nlp

“Math is hard” — if you are an LLM – and why that matters

Some Reply Guy on X assured me yesterday that “transformers can multiply”. Even pointed me to a paper, allegedly offering proof.

The paper turns out to be pretty great, doing exactly the right test, but it doesn’t prove what its title alleges. More like the opposite.

The paper alleges “GPT Can Solve Mathematical Problems Without a Calculator.” But it doesn’t really show that, except in the sense that I can shoot free throws in the NBA. Sure, I can toss the ball in the air, and sometimes I might even sink a shot, the more so with practice; but I am probably going to miss a lot, too. And 70% would be great for free throws; for multiplication it sucks. 47323 * 19223 = 909690029 and it shall always be; no partial credit for coming close. — Read More
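The "no partial credit" point is easy to check mechanically. The short sketch below verifies the product quoted above with Python's exact integer arithmetic and scores a batch of answers by exact match only. The function name `exact_match_accuracy` and the sample wrong answer are made up for illustration and do not come from the paper under discussion.

```python
def exact_match_accuracy(problems, answers):
    """Fraction of answers that equal the true product exactly.

    problems: list of (a, b) integer pairs
    answers:  list of integers proposed as a * b
    An answer that is off by a single digit scores the same as a wild guess.
    """
    correct = sum(1 for (a, b), ans in zip(problems, answers) if a * b == ans)
    return correct / len(problems)


# Python integers have arbitrary precision, so this comparison is exact.
print(47323 * 19223 == 909690029)  # True

# One exact answer and one near miss (hypothetical) give 50%, not "almost 100%".
print(exact_match_accuracy([(47323, 19223), (47323, 19223)],
                           [909690029, 909690019]))  # 0.5
```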

#nlp