China’s gigantic multi-modal AI is no one-trick pony

Sporting 1.75 trillion parameters, Wu Dao 2.0 is roughly ten times the size of OpenAI’s GPT-3.

When OpenAI’s GPT-3 model made its debut in May of 2020, its performance was widely considered to be the state of the art. Capable of generating text nearly indistinguishable from human-crafted prose, GPT-3 set a new standard in deep learning. But oh what a difference a year makes. Researchers from the Beijing Academy of Artificial Intelligence announced on Tuesday the release of their own generative deep learning model, Wu Dao, a mammoth AI seemingly capable of doing everything GPT-3 can do, and more.

First off, Wu Dao is flat out enormous. It has 1.75 trillion parameters (essentially, the coefficients the model learns during training), which is a full ten times as many as GPT-3’s 175 billion and 150 billion more than Google’s Switch Transformers. Read More

#nlp, #china-ai, #multi-modal

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code. Read More
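As a rough illustration of the unified text-to-text framing described above, the sketch below casts three different problems as plain (input string, target string) pairs, so a single sequence-to-sequence model could be trained on all of them. The class name, the helper functions, and the exact task prefixes are assumptions made for illustration, not the paper's released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    """One training pair: the model reads `source` and must emit `target`."""
    source: str
    target: str


def translation_example(english: str, german: str) -> TextToTextExample:
    # The task is named inside the input text itself (illustrative prefix).
    return TextToTextExample(f"translate English to German: {english}", german)


def summarization_example(document: str, summary: str) -> TextToTextExample:
    return TextToTextExample(f"summarize: {document}", summary)


def classification_example(sentence: str, label: int,
                           label_names: list[str]) -> TextToTextExample:
    # Even a class label becomes text: the target is the label's name,
    # not an integer index. The "classify sentiment:" prefix is hypothetical.
    return TextToTextExample(f"classify sentiment: {sentence}", label_names[label])


examples = [
    translation_example("That is good.", "Das ist gut."),
    summarization_example("A long article about transfer learning ...",
                          "Transfer learning overview."),
    classification_example("I loved this film.", 1, ["negative", "positive"]),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Because every task shares the same string-in, string-out shape, the same model, loss, and decoding procedure can in principle be reused across all of them, which is the property the abstract emphasizes.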

#transfer-learning

Learning Transferable Visual Models From Natural Language Supervision

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. Read More
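The zero-shot transfer step described above can be sketched as follows: each candidate class is written as a caption, and the image is assigned to the caption whose embedding is most similar to the image embedding. This is a minimal sketch, not the authors' implementation. The embeddings are hand-made toy vectors standing in for the outputs of trained image and text encoders, and the function names and the temperature value of 0.07 are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding: list[float],
                       caption_embeddings: dict[str, list[float]],
                       temperature: float = 0.07):
    """Return the best-matching caption and a probability for every caption.

    No classifier is trained for the label set: the captions themselves
    define the classes. `temperature` is an assumed scaling constant.
    """
    captions = list(caption_embeddings)
    logits = [cosine(image_embedding, caption_embeddings[c]) / temperature
              for c in captions]
    probs = softmax(logits)
    best = max(range(len(captions)), key=lambda i: probs[i])
    return captions[best], dict(zip(captions, probs))


# Toy stand-ins for encoder outputs (assumed values, not real model features).
toy_caption_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
toy_image_embedding = [0.8, 0.3, 0.1]

predicted, probabilities = zero_shot_classify(toy_image_embedding,
                                              toy_caption_embeddings)
print(predicted)
```

Swapping in a different set of captions changes the label set without any retraining, which is what lets one pre-trained model be benchmarked across many datasets.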

#image-recognition, #nlp

Towards General Purpose Vision Systems

A special purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation such as adding an output head for each new task or dataset. In this work, we propose a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text. The system supports a wide range of vision tasks such as classification, localization, question answering, captioning, and more. We evaluate the system’s ability to learn multiple skills simultaneously, to perform tasks with novel skill-concept combinations, and to learn new skills efficiently and without forgetting. Read More

#image-recognition