DeepNet: Scaling Transformers to 1,000 Layers

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DEEPNORM) to modify the residual connection in Transformer, accompanying with theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., good performance of Post-LN and stable training of Pre-LN, making DEEPNORM a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction. Read More

#training

Tackling multiple tasks with a single visual language model

One key aspect of intelligence is the ability to quickly learn how to perform a new task when given a brief instruction. For instance, a child may recognise real animals at the zoo after seeing a few pictures of the animals in a book, despite differences between the two. But for a typical visual model to learn a new task, it must be trained on tens of thousands of examples specifically labelled for that task. If the goal is to count and identify animals in an image, as in “three zebras”, one would have to collect thousands of images and annotate each image with their quantity and species. This process is inefficient, expensive, and resource-intensive, requiring large amounts of annotated data and the need to train a new model each time it’s confronted with a new task. As part of DeepMind’s mission to solve intelligence, we’ve explored whether an alternative model could make this process easier and more efficient, given only limited task-specific information.

Today, in the preprint of our paper, we introduce Flamingo, a single visual language model (VLM) that sets a new state of the art in few-shot learning on a wide range of open-ended multimodal tasks. This means Flamingo can tackle a number of difficult problems with just a handful of task-specific examples (in a “few shots”), without any additional training required. Flamingo’s simple interface makes this possible, taking as input a prompt consisting of interleaved images, videos, and text and then output associated language.  Read More

#image-recognition, #nlp