Train 18-billion-parameter GPT models with a single GPU on your personal computer! Open source project Colossal-AI has added new features!

When it comes to training large AI models, people tend to think of thousands of GPUs, expensive training runs, and a handful of tech giants that can afford them. Meanwhile, other AI users, such as researchers at startups or universities, can do little but watch the news about large models.

Now, a PC with only one GPU can train GPT with up to 18 billion parameters, and a laptop can also train a model with more than one billion parameters. Compared with the existing mainstream solutions, the parameter capacity can be increased by more than ten times!

Such a significant improvement comes from Colossal-AI, which is an efficient training system for general large AI models. Best of all, it’s completely open-sourced and requires only minimal modifications to allow existing deep learning projects to be trained with much larger models on a single consumer-grade graphics card, allowing everyone to train large AI models at home! In particular, it makes downstream tasks and application deployments such as large AI model fine-tuning and inference much easier! Read More

#performance

Good News About the Carbon Footprint of Machine Learning Training

Machine learning (ML) has become prominent in information technology, which has led some to raise concerns about the associated rise in the costs of computation, primarily the carbon footprint, i.e., total greenhouse gas emissions. While these assertions rightfully elevated the discussion around carbon emissions in ML, they also highlight the need for accurate data to assess true carbon footprint, which can help identify strategies to mitigate carbon emission in ML.

In “The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink”, accepted for publication in IEEE Computer, we focus on operational carbon emissions — i.e., the energy cost of operating ML hardware, including data center overheads — from training of natural language processing (NLP) models and investigate best practices that could reduce the carbon footprint. We demonstrate four key practices that reduce the carbon (and energy) footprint of ML workloads by large margins, which we have employed to help keep ML under 15% of Google’s total energy use. Read More

#big7, #performance

FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding

The few-shot natural language understanding (NLU) task has attracted much recent attention. However, prior methods have been evaluated under a disparate set of protocols, which hinders fair comparison and measuring progress of the field. To address this issue, we introduce an evaluation framework that improves previous evaluation procedures in three key aspects, i.e., test performance, dev-test correlation, and stability. Under this new evaluation framework, we re-evaluate several state-of-the-art few-shot methods for NLU tasks. Our framework reveals new insights: (1) both the absolute performance and relative gap of the methods were not accurately estimated in prior literature; (2) no single method dominates most tasks with consistent performance; (3) improvements of some methods diminish with a larger pretrained model; and (4) gains from different methods are often complementary and the best combined model performs close to a strong fully-supervised baseline. We open-source our toolkit, FewNLU, that implements our evaluation framework along with a number of state-of-the-art methods. Read More

#nlp, #performance

Exploring the Limits of Large Scale Pre-training

Recent developments in large-scale machine learning suggest that by scaling up data, model size and training time properly, one might observe that improvements in pre-training would transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with the number of parameters ranging from ten million to ten billion, trained on the largest scale of available image data (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. We propose a model for downstream performance that reflects the saturation phenomena and captures the nonlinear relationship in performance of upstream and downstream tasks. Delving deeper to understand the reasons that give rise to these phenomena, we show that the saturation behavior we observe is closely related to the way that representations evolve through the layers of the models. We showcase an even more extreme scenario where performance on upstream and downstream are at odds with each other. That is, to have a better downstream performance, we need to hurt upstream accuracy. Read More

#performance

Google AI and Princeton discover this about Deep Learning

Much of an ML model’s learning results depend on the model’s learning rate. The learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences the extent to which newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model “learns”.
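To make the "step size" idea concrete, here is a minimal sketch of plain gradient descent on a toy one-dimensional loss. The loss f(x) = (x - 3)^2, the function names, and the three learning rates are illustrative assumptions, not taken from the paper discussed here.

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """Run plain gradient descent and return the final iterate."""
    x = x0
    for _ in range(steps):
        # The learning rate scales how far each update moves along the negative gradient.
        x = x - learning_rate * grad(x)
    return x


def toy_grad(x):
    # Gradient of the assumed toy loss f(x) = (x - 3)^2, whose minimum is at x = 3.
    return 2.0 * (x - 3.0)


# Same loss, same start, same number of steps; only the learning rate differs.
too_small = gradient_descent(toy_grad, x0=0.0, learning_rate=0.01, steps=50)
reasonable = gradient_descent(toy_grad, x0=0.0, learning_rate=0.3, steps=50)
too_large = gradient_descent(toy_grad, x0=0.0, learning_rate=1.1, steps=50)

for name, value in [("too_small", too_small), ("reasonable", reasonable), ("too_large", too_large)]:
    print(name, value)
```

On this toy problem the smallest rate is still far from the minimum after 50 steps, the middle rate converges, and the largest rate overshoots further on every step and diverges.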

The importance of the learning rate can’t be overstated. That is why there is a lot of research into both discovering new learning rate schedules (how the LR should change over time) and comparing existing ones. Researchers at Google AI, Tel Aviv University, and Princeton collaborated to write Disentangling Adaptive Gradient Methods from Learning Rates. The paper looks at “how adaptive gradient methods interact with the learning rate schedule.” In this article, I will share some interesting takeaways from the paper that might help you in your ML journeys. Read More

#performance

8-bit Optimizers via Block-Wise Quantization

Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent, but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT’14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change. Read More
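The idea of quantizing each block independently can be sketched in a few lines. This is not the authors' implementation: it swaps in simple linear absmax scaling to signed 8-bit codes in place of the paper's non-linear dynamic quantization, and the function names, block size, and sample values are all assumptions made for illustration.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize a flat list of floats to signed 8-bit codes, one scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # Each block gets its own scale, so a large value only affects its own block.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map 8-bit codes back to floats using the scale of the block they belong to."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Assumed sample data: four small values followed by four large ones.
values = [0.01, -0.02, 0.015, 0.005, 100.0, -50.0, 25.0, 10.0]

# Block-wise: two blocks of four, each with its own scale.
codes, scales = quantize_blockwise(values, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)

# Baseline for comparison: one scale for the whole tensor (a single block of eight).
global_codes, global_scales = quantize_blockwise(values, block_size=8)
restored_global = dequantize_blockwise(global_codes, global_scales, block_size=8)

print("block-wise:", restored[:4])
print("single scale:", restored_global[:4])
```

With a single scale, the large values dominate and the four small values all round to zero. With per-block scales they survive the round trip with a small error.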

#performance

Moore’s Law, AI, and the pace of progress

It seems to be a minority view nowadays to believe in Moore’s Law, the routine doubling of transistor density roughly every couple of years, or even the much gentler claim, that There’s Plenty [more] Room at the Bottom. There’s even a quip for it: the number of people predicting the death of Moore’s law doubles every two years. This is not merely a populist view by the uninformed.

…Besides mere physical inevitability, improvements to transistor density are taking an economic toll. Building the fabs that manufacture transistors is becoming very expensive, as high as $20 billion each, and TSMC expects to spend $100 billion over just three years to expand capacity. This cost increases with each cutting-edge node.

This bleak industry view contrasts with the massively increasing demands of scale from AI, which has become a center of attention, in large part due to OpenAI’s focus on the question and their successful results with their various GPT-derived models. There, too, the economic factor exacerbates the divide; models around GPT-3’s size are the domain of only a few eager companies, and whereas before there was an opportunity to reap quick advances from scaling single- or few-machine models to datacenter scale, now all compute advances require new hardware of some kind, whether better computer architectures or bigger (pricier) data centers. Read More

#performance

Aggregating Nested Transformers

Although hierarchical structures are popular in recent vision transformers, they require sophisticated designs and massive datasets to work well. In this work, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical manner. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires only minor code changes to the original vision transformer and obtains improved performance compared to existing methods. Our empirical results show that the proposed method NesT converges faster and requires much less training data to achieve good generalization. For example, a NesT with 68M parameters trained on ImageNet for 100/300 epochs achieves 82.3%/83.8% accuracy evaluated on 224 × 224 image size, outperforming previous methods with up to 57% parameter reduction. Training a NesT with 6M parameters from scratch on CIFAR10 achieves 96% accuracy using a single GPU, setting a new state of the art for vision transformers. Beyond image classification, we extend the key idea to image generation and show NesT leads to a strong decoder that is 8× faster than previous transformer-based generators. Furthermore, we also propose a novel method for visually interpreting the learned model. Read More

#performance

Multimodal datasets: misogyny, pornography, and malignant stereotypes

We have now entered the era of trillion parameter machine learning models trained on billion-sized datasets scraped from the internet. The rise of these gargantuan datasets has given rise to formidable bodies of critical work that have called for caution while generating these large datasets. These address concerns surrounding the dubious curation practices used to generate these datasets, the sordid quality of alt-text data available on the world wide web, the problematic content of the CommonCrawl dataset often used as a source for training large language models, and the entrenched biases in large-scale visio-linguistic models (such as OpenAI’s CLIP model) trained on opaque datasets (WebImageText). In the backdrop of these specific calls of caution, we examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset. We found that the dataset contains troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content. We outline numerous implications, concerns and downstream harms regarding the current state of large-scale datasets while raising open questions for various stakeholders including the AI community, regulators, policy makers and data subjects. Read More

#bias, #performance

Does Your Dermatology Classifier Know What It Doesn’t Know? Detecting the Long-Tail of Unseen Conditions

Supervised deep learning models have proven to be highly effective in classification of dermatological conditions. These models rely on the availability of abundant labeled training examples. However, in the real world, many dermatological conditions are individually too infrequent for per-condition classification with supervised learning. Although individually infrequent, these conditions may collectively be common and therefore are clinically significant in aggregate. To prevent models from generating erroneous outputs on such examples, there remains a considerable unmet need for deep learning systems that can better detect such infrequent conditions. These infrequent ‘outlier’ conditions are seen very rarely (or not at all) during training. In this paper, we frame this task as an out-of-distribution (OOD) detection problem. We set up a benchmark ensuring that outlier conditions are disjoint between the model training, validation, and test sets. Unlike traditional OOD detection benchmarks where the task is to detect dataset distribution shift, we aim at the more challenging task of detecting subtle semantic differences. We propose a novel hierarchical outlier detection (HOD) loss, which assigns multiple abstention classes corresponding to each training outlier class and jointly performs a coarse classification of inliers vs. outliers, along with fine-grained classification of the individual classes. We demonstrate that the proposed HOD loss based approach outperforms leading methods that leverage outlier data during training. Further, performance is significantly boosted by using recent representation learning methods (BiT, SimCLR, MICLe). In addition, we explore ensembling strategies for OOD detection and propose a diverse ensemble selection process for the best result.
We also perform a subgroup analysis over conditions of varying risk levels and different skin types to investigate how OOD performance changes over each subgroup and demonstrate the gains of our framework in comparison to baseline. Furthermore, we go beyond traditional performance metrics and introduce a cost matrix for model trust analysis to approximate downstream clinical impact. We use this cost matrix to compare the proposed method against the baseline, thereby making a stronger case for its effectiveness in real-world scenarios. Read More
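One plausible reading of the coarse-plus-fine idea in the abstract can be sketched as below. This is an illustrative assumption rather than the paper's exact formulation: the function names, the `coarse_weight` value, the class layout (inlier classes first, outlier abstention classes last), and the use of the summed outlier probability as an OOD score are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hod_style_loss(logits, label, num_inlier, coarse_weight=0.1):
    """Fine-grained cross-entropy plus a weighted coarse inlier-vs-outlier term.

    Assumed layout: classes [0, num_inlier) are inlier conditions and the
    remaining classes are abstention classes for training outliers.
    Returns (loss, p_outlier), where p_outlier can serve as an OOD score.
    """
    probs = softmax(logits)
    # Fine-grained term: cross-entropy on the individual class.
    fine = -math.log(max(probs[label], 1e-12))
    # Coarse term: pool class probabilities into inlier and outlier groups.
    p_inlier = sum(probs[:num_inlier])
    p_outlier = sum(probs[num_inlier:])
    p_correct_group = p_outlier if label >= num_inlier else p_inlier
    coarse = -math.log(max(p_correct_group, 1e-12))
    return fine + coarse_weight * coarse, p_outlier


# Assumed toy example: 3 inlier classes followed by 2 outlier abstention classes.
logits = [2.0, 0.5, 0.1, 1.5, 1.0]
loss, ood_score = hod_style_loss(logits, label=0, num_inlier=3)
fine_only, _ = hod_style_loss(logits, label=0, num_inlier=3, coarse_weight=0.0)
print("loss:", loss, "OOD score:", ood_score)
```

In this sketch, an input can then be flagged as an unseen condition when the summed probability of the abstention classes exceeds a chosen threshold.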

#performance, #machine-learning