Honey I Shrunk the Model: Why Big Machine Learning Models Must Go Small

Bigger is not always better for machine learning. Yet deep learning models and the datasets on which they’re trained keep expanding, as researchers race to outdo one another while chasing state-of-the-art benchmarks. However groundbreaking they are, bigger models carry severe consequences for both budgets and the environment. For example, GPT-3, this summer’s massive, buzzworthy model for natural language processing, reportedly cost $12 million to train. What’s worse, UMass Amherst researchers found that the computing power required to train a large AI model can produce over 600,000 pounds of CO2 emissions, five times what a typical car emits over its lifespan.

At the pace the machine learning industry is moving today, there are no signs of these compute-intensive efforts slowing down. Research from OpenAI showed that between 2012 and 2018, computing power for deep learning models grew a shocking 300,000x, outpacing Moore’s Law. The problem lies not only in training these algorithms, but also in running them in production, known as the inference phase. For many teams, practical use of deep learning models remains out of reach due to sheer cost and resource constraints. Read More
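To put the 300,000x figure in perspective, the short calculation below converts it into a doubling time and compares it with Moore’s Law. It is a back-of-the-envelope sketch: treating 2012 to 2018 as exactly six years and Moore’s Law as a doubling every 24 months are assumptions made here, not figures from the article.

```python
import math

growth_factor = 300_000       # compute growth cited for 2012-2018
years = 6                     # assumption: the span is treated as six full years
moore_doubling_months = 24    # assumption: Moore's Law as one doubling every two years

months = years * 12
doublings = math.log2(growth_factor)                 # how many times compute doubled
implied_doubling_months = months / doublings         # average time per doubling
moore_growth = 2 ** (months / moore_doubling_months) # growth Moore's Law would give

print(f"doublings: {doublings:.1f}")
print(f"implied doubling time: {implied_doubling_months:.1f} months")
print(f"Moore's Law growth over the same span: {moore_growth:.0f}x")
```

Under these assumptions the growth corresponds to roughly 18 doublings, or about one every four months, against an 8x increase from a two-year doubling schedule.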

#iot, #performance

Zero-shot Learning for Relation Extraction

Most existing supervised and few-shot learning relation extraction methods have relied on labeled training data. However, in real-world scenarios, there exist many relations for which there is no available training data. We address this issue from the perspective of zero-shot learning (ZSL), which is similar to the way humans learn and recognize new concepts with no prior knowledge. We propose a zero-shot learning relation extraction (ZSLRE) framework, which focuses on recognizing novel relations that have no corresponding labeled data available for training. Our proposed ZSLRE model aims to recognize new relations based on prototypical networks that are modified to utilize side (auxiliary) information. The additional use of side information allows those modified prototypical networks to recognize novel relations in addition to recognizing previously known relations. We construct side information from labels and their synonyms, hypernyms of named entities, and keywords. We build an automatic hypernym extraction framework to help get hypernyms of various named entities directly from the web. We demonstrate using extensive experiments on two public datasets (NYT and FewRel) that our proposed model significantly outperforms state-of-the-art methods on supervised learning, few-shot learning and zero-shot learning tasks. Our experimental results also demonstrate the effectiveness and robustness of our proposed model in a combination scenario. Once accepted for publication, we will publish ZSLRE’s source code and datasets to enable reproducibility and encourage further research. Read More

#performance

The way we train AI is fundamentally flawed

The process used to build most of the machine-learning models we use today can’t tell if they will work in the real world or not—and that’s a problem.

It’s no secret that machine-learning models tuned and tweaked to near-perfect performance in the lab often fail in real settings. This is typically put down to a mismatch between the data the AI was trained and tested on and the data it encounters in the world, a problem known as data shift. For example, an AI trained to spot signs of disease in high-quality medical images will struggle with blurry or cropped images captured by a cheap camera in a busy clinic.

Now a group of 40 researchers across seven different teams at Google have identified another major cause for the common failure of machine-learning models. Called “underspecification,” it could be an even bigger problem than data shift. Read More

#performance, #training

It’s Hard For Neural Networks to Learn the Game of Life

Efforts to improve the learning abilities of neural networks have focused mostly on the role of optimization methods rather than on weight initializations. Recent findings, however, suggest that neural networks rely on lucky random initial weights of subnetworks called “lottery tickets” that converge quickly to a solution [8]. To investigate how weight initializations affect performance, we examine small convolutional networks that are trained to predict n steps of the two-dimensional cellular automaton Conway’s Game of Life [3], the update rules of which can be implemented efficiently in a 2n + 1 layer convolutional network. We find that networks of this architecture trained on this task rarely converge. Rather, networks require substantially more parameters to consistently converge. In addition, near-minimal architectures are sensitive to tiny changes in parameters: changing the sign of a single weight can cause the network to fail to learn. Finally, we observe a critical value d0 such that training minimal networks with examples in which cells are alive with probability d0 dramatically increases the chance of convergence to a solution. We conclude that training convolutional neural networks to learn the input/output function represented by n steps of Game of Life exhibits many characteristics predicted by the lottery ticket hypothesis [8], namely, that the size of the networks required to learn this function is often significantly larger than the minimal network required to implement the function. Read More
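For readers unfamiliar with the target function, here is a minimal NumPy sketch of n steps of the Game of Life, written as a 3x3 neighbor count followed by a threshold rule. It is a hand-coded reference for the input/output mapping the networks are asked to learn, not the paper’s 2n + 1 layer network; the function names and the choice of dead cells outside the border are assumptions made for illustration.

```python
import numpy as np

def life_step(board: np.ndarray) -> np.ndarray:
    """One Game of Life update on a 2D 0/1 array (cells outside the border count as dead)."""
    h, w = board.shape
    padded = np.pad(board, 1)
    # Summing the eight shifted copies equals a 3x3 convolution
    # with a kernel of ones and a zero in the centre.
    neighbors = sum(
        padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    born = (board == 0) & (neighbors == 3)
    survives = (board == 1) & ((neighbors == 2) | (neighbors == 3))
    return (born | survives).astype(board.dtype)

def life_steps(board: np.ndarray, n: int) -> np.ndarray:
    """Apply the update rule n times."""
    for _ in range(n):
        board = life_step(board)
    return board

# A "blinker": three cells in a row that oscillate with period 2.
blinker = np.zeros((5, 5), dtype=np.int64)
blinker[2, 1:4] = 1
after_one = life_step(blinker)
after_two = life_steps(blinker, 2)
print(after_one)
```

Pairs of random boards and their `life_steps` outputs are the kind of training examples such a network would see; the neighbor count is the part a convolutional layer can express directly.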

#performance

Google, Cambridge, DeepMind & Alan Turing Institute’s ‘Performer’ Transformer Slashes Compute Costs

It’s no coincidence that the Transformer neural network architecture is gaining popularity across so many machine learning research fields. Best known for natural language processing (NLP) tasks, Transformers not only enabled OpenAI’s 175 billion parameter language model GPT-3 to deliver SOTA performance; the power- and potential-packed architecture also helped DeepMind’s AlphaStar bot defeat professional StarCraft players. Researchers have now introduced a way to make Transformers more compute-efficient, scalable and accessible. Read More

#big7, #performance

‘Less Than One’-Shot Learning: Learning N Classes From M<N Samples

Deep neural networks require large training sets but suffer from high computational cost and long training times. Training on much smaller training sets while maintaining nearly the same accuracy would be very beneficial. In the few-shot learning setting, a model must learn a new class given only a small number of samples from that class. One-shot learning is an extreme form of few-shot learning where the model must learn a new class from a single example. We propose the ‘less than one’-shot learning task where models must learn N new classes given only M < N examples and we show that this is achievable with the help of soft labels. We use a soft-label generalization of the k-Nearest Neighbors classifier to explore the intricate decision landscapes that can be created in the ‘less than one’-shot learning setting. We analyze these decision landscapes to derive theoretical lower bounds for separating N classes using M < N soft-label samples and investigate the robustness of the resulting systems. Read More
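The idea of separating N classes with M < N soft-label samples can be illustrated with a toy example: two points on a line, each carrying a probability distribution over three classes. The sketch below is in the spirit of the soft-label nearest-neighbor classifier the abstract describes, but the prototype positions, label values and inverse-distance weighting are illustrative assumptions, not the paper’s exact algorithm.

```python
import numpy as np

# Two hypothetical prototypes on a line, each with a soft label over three classes.
prototypes = np.array([[0.0], [1.0]])
soft_labels = np.array([
    [0.6, 0.4, 0.0],   # mostly class 0, partly class 1
    [0.0, 0.4, 0.6],   # partly class 1, mostly class 2
])

def classify(x, prototypes, soft_labels, eps=1e-9):
    """Sum the soft labels weighted by inverse distance and return the top class."""
    distances = np.linalg.norm(prototypes - np.asarray(x, dtype=float), axis=1)
    weights = 1.0 / (distances + eps)   # eps avoids division by zero at a prototype
    scores = weights @ soft_labels
    return int(np.argmax(scores))

predictions = [classify([x], prototypes, soft_labels) for x in (0.1, 0.5, 0.9)]
print(predictions)
```

Points near either prototype take that prototype’s dominant class, while points in the middle are assigned class 1, which is never the dominant class of any single sample. Two samples therefore carve out three decision regions.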

#performance

TernaryBERT: Quantization Meets Distillation

A BERTology contribution by Huawei

The ongoing trend of building ever larger models like BERT and GPT-3 has been accompanied by a complementary effort to reduce their size at little or no cost in accuracy. Effective models are built either via distillation (Pre-trained Distillation, DistilBERT, MobileBERT, TinyBERT), quantization (Q-BERT, Q8BERT) or parameter pruning.

On September 27, Huawei introduced TernaryBERT, a model that leverages both distillation and quantization to achieve accuracy comparable to the original BERT model with ~15x decrease in size. What is truly remarkable about TernaryBERT is that its weights are ternarized, i.e. have one of three values: -1, 0, or 1 (and can hence be stored in only two bits). Read More
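As a rough illustration of what ternarized weights and two-bit storage mean in practice, the NumPy sketch below maps float weights to -1, 0 or 1 with a magnitude threshold and packs four values into each byte. The threshold rule (0.7 times the mean absolute weight), the function names and the bit layout are assumptions chosen for this example; the excerpt does not describe TernaryBERT’s actual procedure, and the distillation step is not shown.

```python
import numpy as np

def ternarize(weights, threshold_ratio=0.7):
    """Map float weights to {-1, 0, 1} plus one scale (threshold rule is an assumed heuristic)."""
    threshold = threshold_ratio * np.mean(np.abs(weights))
    ternary = np.zeros(weights.shape, dtype=np.int8)
    ternary[weights > threshold] = 1
    ternary[weights < -threshold] = -1
    mask = ternary != 0
    scale = float(np.mean(np.abs(weights[mask]))) if mask.any() else 0.0
    return ternary, scale

def pack_2bit(ternary):
    """Pack ternary values four per byte, storing -1, 0, 1 as the codes 0, 1, 2."""
    codes = (ternary.ravel() + 1).astype(np.uint8)
    codes = np.pad(codes, (0, -len(codes) % 4), constant_values=1)  # pad with the code for 0
    groups = codes.reshape(-1, 4)
    return groups[:, 0] | (groups[:, 1] << 2) | (groups[:, 2] << 4) | (groups[:, 3] << 6)

def unpack_2bit(packed, count):
    """Invert pack_2bit and return the first `count` ternary values."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 3
    return codes.ravel()[:count].astype(np.int8) - 1

weights = np.array([0.9, -0.05, -0.8, 0.1, 0.02, 0.7])
ternary, scale = ternarize(weights)
packed = pack_2bit(ternary)
restored = unpack_2bit(packed, ternary.size)
print(ternary, scale, packed.nbytes)
```

Here six 64-bit floats shrink to two bytes plus a single scale factor; multiplying the ternary values by `scale` gives the coarse approximation of the original weights.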

#nlp, #performance

Algorithms are not enough

The next breakthrough in AI requires a rethinking of our hardware

Today’s AI has a problem: it is expensive. Training ResNet-152, a modern computer vision model, is estimated to cost around 10 billion floating point operations, a figure dwarfed by modern language models. Training GPT-3, the recent natural language model from OpenAI, is estimated to cost 300 billion trillion (3 × 10^23) floating point operations, which costs at least $5M on commercial GPUs. Compare this to the human brain, which can recognize faces, answer questions, and drive cars with as little as a banana and a cup of coffee. Read More

#nvidia, #performance

Google Teases Large Scale Reinforcement Learning Infrastructure

“The new infrastructure reduces the training time from eight hours down to merely one hour compared to a strong baseline.”

The current state-of-the-art reinforcement learning techniques require many iterations over many samples from the environment to learn a target task. For instance, an agent learning the game Dota 2 trains on batches of 2 million frames every 2 seconds. The infrastructure that handles RL at this scale should not only be good at collecting a large number of samples, but also be able to iterate quickly over these extensive amounts of samples during training. Read More

#performance, #reinforcement-learning

Super Learner versus Deep Neural Network

Deep Learning has taken a prominent place for tasks involving predictive modelling and pattern recognition. With its automatic feature extraction and feed-forward methods, Deep Learning can extract low-level features in order to identify high-level identities in big data applications. However, deep neural networks have drawbacks, which include the need to tune many hyperparameters together, slow convergence on smaller datasets and difficulty explaining why a particular decision was made. While traditional machine learning algorithms can address these drawbacks, they are not typically capable of achieving the performance levels registered by deep neural networks. To improve performance, ensemble methods are used to combine multiple base learners. Read More

#neural-networks, #performance