LongNet: Scaling Transformers to 1,000,000,000 Tokens

Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, restricting the maximum sequence length. In this work, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has linear computational complexity and a logarithmic dependency between any two tokens; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention and can be seamlessly integrated with existing Transformer-based optimizations. Experimental results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence. Read More
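
The abstract leaves the mechanics implicit, so here is a minimal NumPy sketch of the core idea as we read it: within each segment, attention is computed only over every r-th token, and the segment length and dilation rate grow geometrically so the attentive field expands exponentially with distance. The function and parameter names are illustrative, not LongNet's actual implementation.

```python
# Illustrative sketch of dilated attention's sparsification idea (not the
# authors' code): within segments of length w, keep every r-th position,
# so each attention call costs O((w/r)^2) instead of O(w^2).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(q, k, v, segment_length, dilation_rate):
    """q, k, v: (seq_len, d). Attention is restricted to dilated positions
    inside each segment; skipped positions are simply left as zeros here."""
    seq_len, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, seq_len, segment_length):
        idx = np.arange(start, min(start + segment_length, seq_len), dilation_rate)
        scores = q[idx] @ k[idx].T / np.sqrt(d)      # (n, n) with n = len(idx)
        out[idx] = softmax(scores) @ v[idx]
    return out

# In LongNet, several (segment_length, dilation_rate) pairs that grow
# geometrically, e.g. (2048, 1), (4096, 2), (8192, 4), ..., are mixed so the
# attentive field expands exponentially while total cost stays linear.
q = k = v = np.random.randn(16, 8)
y = dilated_attention(q, k, v, segment_length=8, dilation_rate=2)
```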

#performance

An Elo Style Leaderboard for Language Models

We use the Elo rating system to calculate the relative performance of the models. Elo is a method for calculating the relative skill levels of players in zero-sum games, originally invented as an improved chess-rating system. The difference in the ratings between two models serves as a predictor of their relative performance. You can view the voting data, basic analyses, and calculation procedure in this notebook. We will periodically release new leaderboards. Read More
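
For reference, the Elo update itself fits in a few lines. The sketch below uses the usual chess constants (a 400-point scale and K = 32), which may not match the exact settings in the leaderboard's notebook.

```python
# Standard Elo update (chess-style constants; the leaderboard may use a
# different K or scale): the rating gap predicts the expected win rate,
# and each battle nudges both ratings toward the observed outcome.
def expected_score(rating_a, rating_b):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a is 1 if model A wins, 0 if it loses, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    rating_a += k * (score_a - e_a)
    rating_b += k * ((1 - score_a) - (1 - e_a))
    return rating_a, rating_b

# Example: a 100-point gap predicts a ~64% win rate for the stronger model;
# an upset moves both ratings toward each other.
print(expected_score(1100, 1000))          # ~0.64
print(elo_update(1100, 1000, score_a=0))
```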

You can compare models’ relative performance for yourself, or add new models, here.

#chatbots, #performance

Navigating the High Cost of AI Compute

The generative AI boom is compute-bound. It has the unique property that adding more compute directly results in a better product. Usually, R&D investment is more directly tied to how valuable a product is, and that relationship is markedly sublinear. But this is not currently so with artificial intelligence and, as a result, a predominant factor driving the industry today is simply the cost of training and inference.

While we don’t know the true numbers, we’ve heard from reputable sources that the supply of compute is so constrained, demand outstrips it by a factor of 10(!) So we think it’s fair to say that, right now, access to compute resources — at the lowest total cost — has become a determining factor for the success of AI companies.

In fact, we’ve seen many companies spend more than 80% of their total capital raised on compute resources!

In this post, we try to break down the cost factors for an AI company. The absolute numbers will of course change over time, but we don’t see immediate relief from AI companies being bound by their access to compute resources. So, hopefully, this is a helpful framework for thinking through the landscape. Read More

#performance

Training Compute-Optimal Large Language Models

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher. Read More
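
The headline finding can be turned into a back-of-the-envelope calculator: with training compute C roughly 6·N·D FLOPs (N parameters, D tokens) and the paper's "scale N and D equally" rule, both grow as the square root of C. The constant below (roughly 20 tokens per parameter, implied by Chinchilla's 70B parameters and ~1.4T tokens) is our approximation, not the paper's fitted scaling law.

```python
# Rough compute-optimal allocation in the Chinchilla spirit (an approximation,
# not the paper's fitted laws): C ~ 6 * N * D and D ~ 20 * N,
# so N ~ sqrt(C / 120) and D ~ 20 * N.
import math

def compute_optimal(flops, tokens_per_param=20.0):
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla-scale budget: ~5.8e23 FLOPs (6 * 70e9 * 1.4e12) gives back
# roughly 70B parameters and 1.4T tokens.
n, d = compute_optimal(5.8e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```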

#performance

Train 18-billion-parameter GPT models with a single GPU on your personal computer! Open source project Colossal-AI has added new features!

When it comes to training large AI models, people tend to think of thousands of GPUs and training costs that only a few tech giants can afford, while other AI users, such as researchers at startups or universities, can do little but watch the news about large models from the sidelines.

Now, a PC with only one GPU can train a GPT model with up to 18 billion parameters, and a laptop can train a model with more than one billion parameters. Compared with existing mainstream solutions, the trainable parameter capacity is increased by more than ten times!

Such a significant improvement comes from Colossal-AI, an efficient training system for general large AI models. Best of all, it is completely open source and requires only minimal modifications for existing deep learning projects to train much larger models on a single consumer-grade graphics card, allowing everyone to train large AI models at home! In particular, it makes downstream tasks such as fine-tuning, inference, and application deployment of large AI models much easier! Read More

#performance

Good News About the Carbon Footprint of Machine Learning Training

Machine learning (ML) has become prominent in information technology, which has led some to raise concerns about the associated rise in the costs of computation, primarily the carbon footprint, i.e., total greenhouse gas emissions. While these assertions rightfully elevated the discussion around carbon emissions in ML, they also highlight the need for accurate data to assess the true carbon footprint, which can help identify strategies to mitigate carbon emissions in ML.

In “The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink”, accepted for publication in IEEE Computer, we focus on operational carbon emissions — i.e., the energy cost of operating ML hardware, including data center overheads — from training of natural language processing (NLP) models and investigate best practices that could reduce the carbon footprint. We demonstrate four key practices that reduce the carbon (and energy) footprint of ML workloads by large margins, which we have employed to help keep ML under 15% of Google’s total energy use. Read More

#big7, #performance

FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding

The few-shot natural language understanding (NLU) task has attracted much recent attention. However, prior methods have been evaluated under a disparate set of protocols, which hinders fair comparison and measuring the progress of the field. To address this issue, we introduce an evaluation framework that improves previous evaluation procedures in three key aspects, i.e., test performance, dev-test correlation, and stability. Under this new evaluation framework, we re-evaluate several state-of-the-art few-shot methods for NLU tasks. Our framework reveals new insights: (1) both the absolute performance and relative gap of the methods were not accurately estimated in prior literature; (2) no single method dominates most tasks with consistent performance; (3) improvements of some methods diminish with a larger pretrained model; and (4) gains from different methods are often complementary and the best combined model performs close to a strong fully-supervised baseline. We open-source our toolkit, FewNLU, that implements our evaluation framework along with a number of state-of-the-art methods. Read More
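
To make the "dev-test correlation" criterion concrete, here is a small sketch of one way to measure it across repeated few-shot runs; the choice of Spearman rank correlation and the scores are ours for illustration, not necessarily FewNLU's exact definition or data.

```python
# Illustrative check of dev-test correlation: across repeated runs (different
# seeds / data splits), does selecting by dev score also select good test
# scores? Spearman rank correlation is our choice of metric here.
from scipy.stats import spearmanr

dev_scores  = [71.2, 69.8, 73.5, 70.4, 72.1]   # hypothetical few-shot runs
test_scores = [70.1, 68.9, 72.8, 69.0, 71.5]

rho, pvalue = spearmanr(dev_scores, test_scores)
print(f"dev-test Spearman correlation: {rho:.2f} (p={pvalue:.3f})")
# High correlation means dev-based model selection is trustworthy under the
# protocol; low correlation signals an unstable evaluation setup.
```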

#nlp, #performance

Exploring the Limits of Large Scale Pre-training

Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with the number of parameters ranging from ten million to ten billion, trained on the largest scale of available image data (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. We propose a model for downstream performance that reflects the saturation phenomenon and captures the nonlinear relationship between the performance of upstream and downstream tasks. Delving deeper to understand the reasons that give rise to these phenomena, we show that the saturation behavior we observe is closely related to the way that representations evolve through the layers of the models. We showcase an even more extreme scenario where upstream and downstream performance are at odds with each other. That is, to have better downstream performance, we need to hurt upstream accuracy. Read More
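
To illustrate the qualitative claim, here is a toy curve fit showing downstream accuracy approaching a ceiling as upstream accuracy improves; the functional form and the data points are ours for illustration and are not the model proposed in the paper.

```python
# A generic saturating-curve fit (NOT the paper's fitted model): downstream
# accuracy approaches a ceiling acc_max as upstream accuracy improves,
# rather than tracking it linearly.
import numpy as np
from scipy.optimize import curve_fit

def saturating(upstream, acc_max, k):
    # downstream gains shrink exponentially as upstream accuracy rises
    return acc_max * (1.0 - np.exp(-k * upstream))

# Hypothetical (upstream, downstream) accuracy pairs
upstream   = np.array([0.55, 0.65, 0.75, 0.82, 0.87, 0.90])
downstream = np.array([0.40, 0.52, 0.60, 0.64, 0.66, 0.67])

(acc_max, k), _ = curve_fit(saturating, upstream, downstream, p0=[0.7, 3.0])
print(f"estimated downstream ceiling: {acc_max:.2f}")
```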

#performance

Google AI and Princeton discover this about Deep Learning

Much of what an ML model learns depends on its learning rate. The learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences the extent to which newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model “learns”.

The importance of the learning rate can’t be overstated. That is why there is a lot of research toward both discovering new learning rate schedules (how the LR should change over time) and comparing existing ones. Researchers at Google AI, Tel Aviv University, and Princeton collaborated to write Disentangling Adaptive Gradient Methods from Learning Rates. The paper looks at “how adaptive gradient methods interact with the learning rate schedule.” In this article, I will share some interesting takeaways from the paper that might help you in your ML journeys. Read More
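
As background for the schedule discussion, here are two common learning rate schedules written out as plain functions; these are generic illustrations, not the specific schedules analyzed in the paper.

```python
# Two common learning-rate schedules as plain functions: step decay and
# cosine decay from lr_max down to lr_min over total_steps.
import math

def step_decay(step, lr_init=0.1, drop=0.5, steps_per_drop=1000):
    # halve the learning rate every steps_per_drop steps
    return lr_init * (drop ** (step // steps_per_drop))

def cosine_decay(step, total_steps, lr_max=0.1, lr_min=0.0):
    # smooth decay following half a cosine wave
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

for step in (0, 500, 1000, 2000, 4000):
    print(step, round(step_decay(step), 4), round(cosine_decay(step, 4000), 4))
```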

#performance

8-bit Optimizers via Block-Wise Quantization

Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent, but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT’14 machine translation, MoCo v2 contrastive ImageNet retraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change. Read More
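
To make the block-wise idea concrete, here is a minimal NumPy sketch of block-wise absmax quantization of an optimizer-state tensor; the paper's non-linear dynamic code book and stable embedding layer are omitted, and the block size is just an example.

```python
# Minimal sketch of block-wise absmax quantization (linear 8-bit codes; the
# paper additionally uses a non-linear "dynamic" code book, omitted here).
import numpy as np

BLOCK = 2048  # each block gets its own scale, so outliers only affect one block

def quantize_blockwise(x, block=BLOCK):
    pad = (-len(x)) % block
    blocks = np.pad(x, (0, pad)).reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12   # per-block absmax
    codes = np.round(blocks / scales * 127).astype(np.int8)      # 8-bit codes
    return codes, scales, len(x)

def dequantize_blockwise(codes, scales, n):
    return (codes.astype(np.float32) / 127 * scales).reshape(-1)[:n]

state = np.random.randn(10_000).astype(np.float32) * 0.01        # fake optimizer state
codes, scales, n = quantize_blockwise(state)
restored = dequantize_blockwise(codes, scales, n)
print("max abs error:", np.abs(state - restored).max())          # small relative to state scale
```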

#performance