Accelerating AI computing to the speed of light

Artificial intelligence and machine learning are already an integral part of our everyday lives online. … As the demands for AI online continue to grow, so does the need to speed up AI performance and find ways to reduce its energy consumption. Now a team of researchers has come up with a system that could help: an optical computing core prototype that uses phase-change material. This system is fast, energy efficient and capable of accelerating the neural networks used in AI and machine learning. The technology is also scalable and directly applicable to cloud computing. The team published these findings Jan. 4 in Nature Communications. Read More

#nvidia, #performance

Light-carrying chips advance machine learning

Researchers found that so-called photonic processors, which process data by means of light, can handle information much more rapidly, and with far greater parallelism, than electronic chips. Read More

#performance

DeepMind researchers claim neural networks can outperform neurosymbolic models

So-called neurosymbolic models, which combine algorithms with symbolic reasoning techniques, appear to be much better suited to predicting, explaining, and considering counterfactual possibilities than neural networks. But researchers at DeepMind claim neural networks can outperform neurosymbolic models under the right testing conditions. In a preprint paper, coauthors describe an architecture for spatiotemporal reasoning about videos in which all components are learned and all intermediate representations are distributed (rather than symbolic) throughout the layers of the neural network. The team says that it surpasses the performance of neurosymbolic models across all questions in a popular dataset, with the greatest advantage on the counterfactual questions. Read More

#neural-networks, #performance

Honey I Shrunk the Model: Why Big Machine Learning Models Must Go Small

Bigger is not always better for machine learning. Yet deep learning models and the datasets on which they’re trained keep expanding, as researchers race to outdo one another while chasing state-of-the-art benchmarks. However groundbreaking they are, bigger models have severe consequences for budgets and the environment alike. For example, GPT-3, this summer’s massive, buzzworthy model for natural language processing, reportedly cost $12 million to train. What’s worse, UMass Amherst researchers found that the computing power required to train a large AI model can produce over 600,000 pounds of CO2 emissions – roughly five times the lifetime emissions of a typical car.

At the pace the machine learning industry is moving today, there are no signs of these compute-intensive efforts slowing down. Research from OpenAI showed that between 2012 and 2018, the computing power used for deep learning models grew a shocking 300,000x, outpacing Moore’s Law. The problem lies not only in training these algorithms, but also in running them in production – the inference phase. For many teams, practical use of deep learning models remains out of reach due to sheer cost and resource constraints. Read More

#iot, #performance

Zero-shot Learning for Relation Extraction

Most existing supervised and few-shot relation extraction methods rely on labeled training data. In real-world scenarios, however, there are many relations for which no training data is available. We address this issue from the perspective of zero-shot learning (ZSL), which resembles the way humans learn and recognize new concepts with no prior knowledge. We propose a zero-shot learning relation extraction (ZSLRE) framework, which focuses on recognizing novel relations that have no corresponding labeled data available for training. Our proposed ZSLRE model recognizes new relations using prototypical networks that are modified to utilize side (auxiliary) information. This side information allows the modified prototypical networks to recognize novel relations in addition to previously known ones. We construct side information from labels and their synonyms, hypernyms of named entities, and keywords, and we build an automatic hypernym extraction framework to obtain hypernyms of various named entities directly from the web. We demonstrate through extensive experiments on two public datasets (NYT and FewRel) that our proposed model significantly outperforms state-of-the-art methods on supervised, few-shot, and zero-shot learning tasks. Our experimental results also demonstrate the effectiveness and robustness of our proposed model in a combination scenario. Once accepted for publication, we will publish ZSLRE’s source code and datasets to enable reproducibility and encourage further research. Read More
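
The summary doesn’t include code, but the core mechanism (prototypes built from auxiliary text so that unseen relations can still be matched) can be sketched. The snippet below is a rough illustration only, assuming a toy hashed bag-of-words encoder and dot-product matching; it is not the authors’ ZSLRE implementation, and the relation names and side-information strings are invented for demonstration.

```python
import numpy as np
from typing import Dict, List

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy text encoder: hashed bag-of-words (a stand-in for a real sentence encoder)."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def build_prototypes(side_info: Dict[str, List[str]]) -> Dict[str, np.ndarray]:
    """One prototype per relation: the mean embedding of its side information
    (label, synonyms, hypernyms, keywords), so relations with no labeled
    training examples still get a point in the embedding space."""
    return {rel: np.mean([embed(t) for t in texts], axis=0)
            for rel, texts in side_info.items()}

def classify(sentence: str, prototypes: Dict[str, np.ndarray]) -> str:
    """Assign the sentence to the relation whose prototype is most similar."""
    q = embed(sentence)
    return max(prototypes, key=lambda rel: float(q @ prototypes[rel]))

# Invented side information for two relations never seen during training.
side_info = {
    "place_of_birth": ["place of birth", "born in", "birthplace", "person city"],
    "employer":       ["employer", "works for", "company", "person organization"],
}
prototypes = build_prototypes(side_info)
print(classify("Ada Lovelace was born in London .", prototypes))
```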

#performance

The way we train AI is fundamentally flawed

The process used to build most of the machine-learning models we use today can’t tell if they will work in the real world or not—and that’s a problem.

It’s no secret that machine-learning models tuned and tweaked to near-perfect performance in the lab often fail in real settings. This is typically put down to a mismatch between the data the AI was trained and tested on and the data it encounters in the world, a problem known as data shift. For example, an AI trained to spot signs of disease in high-quality medical images will struggle with blurry or cropped images captured by a cheap camera in a busy clinic.

Now a group of 40 researchers across seven different teams at Google has identified another major cause of the common failure of machine-learning models. Called “underspecification,” it could be an even bigger problem than data shift. Read More
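
One rough way to see why underspecification is distinct from plain overfitting is to train several models that differ only in their random seed: they can score nearly identically on held-out i.i.d. data yet diverge once the inputs shift. The sketch below uses synthetic scikit-learn data and an artificial perturbation as a stand-in for data shift; it illustrates the general phenomenon, not the Google study’s actual experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic task: only the first 2 features are informative (shuffle=False keeps
# them in columns 0-1); the remaining columns are redundant or pure noise.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=2,
                           n_redundant=10, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Crude stand-in for data shift: perturb the non-informative columns a model may
# have latched onto, while the true signal (columns 0-1) is left untouched.
rng = np.random.default_rng(0)
X_shift = X_test.copy()
X_shift[:, 2:] += rng.normal(scale=2.0, size=X_shift[:, 2:].shape)

# Models differing only in random seed: similar i.i.d. accuracy, but accuracy
# under shift can spread apart, which is the signature of underspecification.
for seed in range(5):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    clf.fit(X_train, y_train)
    print(f"seed={seed}  iid acc={clf.score(X_test, y_test):.3f}  "
          f"shifted acc={clf.score(X_shift, y_test):.3f}")
```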

#performance, #training

It’s Hard For Neural Networks to Learn the Game of Life

Efforts to improve the learning abilities of neural networks have focused mostly on the role of optimization methods rather than on weight initializations. Recent findings, however, suggest that neural networks rely on lucky random initial weights of subnetworks called “lottery tickets” that converge quickly to a solution [8]. To investigate how weight initializations affect performance, we examine small convolutional networks that are trained to predict n steps of the two-dimensional cellular automaton Conway’s Game of Life [3], the update rules of which can be implemented efficiently in a 2n+1-layer convolutional network. We find that networks of this architecture trained on this task rarely converge. Rather, networks require substantially more parameters to consistently converge. In addition, near-minimal architectures are sensitive to tiny changes in parameters: changing the sign of a single weight can cause the network to fail to learn. Finally, we observe a critical value d_0 such that training minimal networks with examples in which cells are alive with probability d_0 dramatically increases the chance of convergence to a solution. We conclude that training convolutional neural networks to learn the input/output function represented by n steps of Game of Life exhibits many characteristics predicted by the lottery ticket hypothesis [8], namely, that the size of the networks required to learn this function is often significantly larger than the minimal network required to implement the function. Read More
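
For context, the target function itself is simple: one update of Conway’s Game of Life is a neighbour count followed by a threshold rule, which is why a hand-constructed convolutional network with 2n+1 layers can represent n steps exactly. The snippet below is a plain NumPy/SciPy illustration of that update (and of how n-step training targets could be generated); it is not the paper’s training code.

```python
import numpy as np
from scipy.signal import convolve2d

# One Game of Life step via a single 3x3 convolution that counts live neighbours;
# this is the fixed function the paper trains small CNNs to learn.
KERNEL = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])

def life_step(board: np.ndarray) -> np.ndarray:
    neighbours = convolve2d(board, KERNEL, mode="same", boundary="fill")
    # A cell is alive next step if it has 3 neighbours, or is alive with 2.
    return ((neighbours == 3) | ((board == 1) & (neighbours == 2))).astype(int)

# A glider on an 8x8 board; applying life_step n times gives the n-step target
# function that a 2n+1-layer convolutional network can in principle represent.
board = np.zeros((8, 8), dtype=int)
board[1, 2] = board[2, 3] = board[3, 1] = board[3, 2] = board[3, 3] = 1
for _ in range(4):
    board = life_step(board)
print(board)
```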

#performance

Google, Cambridge, DeepMind & Alan Turing Institute’s ‘Performer’ Transformer Slashes Compute Costs

It’s no coincidence that the Transformer neural network architecture is gaining popularity across so many machine learning research fields. Best known for natural language processing (NLP) tasks, Transformers not only enabled OpenAI’s 175-billion-parameter language model GPT-3 to deliver SOTA performance, but also helped DeepMind’s AlphaStar bot defeat professional StarCraft players. Researchers have now introduced a way to make Transformers more compute-efficient, scalable and accessible. Read More
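
The Performer’s key trick is to replace exact softmax attention with a kernel approximation built from random features, which brings the cost down from quadratic to linear in sequence length. The sketch below shows a simplified positive-random-feature attention in that spirit; it omits the orthogonal projections and other refinements of the actual FAVOR+ mechanism, and the dimensions and scaling choices are illustrative assumptions.

```python
import numpy as np

def positive_random_features(x, projection):
    """FAVOR+-style positive features: E[phi(q) . phi(k)] approximates exp(q . k)."""
    m = projection.shape[0]
    return np.exp(x @ projection.T - np.sum(x**2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

def performer_style_attention(Q, K, V, n_features=256, seed=0):
    """Linear-time attention: kernelized softmax via random features,
    O(L * m * d) work instead of the O(L^2 * d) of exact attention."""
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(n_features, d))
    # Scale queries/keys so q'.k' = q.k / sqrt(d), as in scaled dot-product attention.
    q_prime = positive_random_features(Q / d**0.25, projection)
    k_prime = positive_random_features(K / d**0.25, projection)
    numerator = q_prime @ (k_prime.T @ V)        # (L, d_v), never forms the LxL matrix
    denominator = q_prime @ k_prime.sum(axis=0)  # (L,), row-wise normalizer
    return numerator / denominator[:, None]

L, d = 512, 64
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d)), rng.normal(size=(L, d))
out = performer_style_attention(Q, K, V)
print(out.shape)  # (512, 64)
```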

#big7, #performance

‘Less Than One’-Shot Learning: Learning N Classes From M<N Samples

Deep neural networks require large training sets but suffer from high computational cost and long training times. Training on much smaller training sets while maintaining nearly the same accuracy would be very beneficial. In the few-shot learning setting, a model must learn a new class given only a small number of samples from that class. One-shot learning is an extreme form of few-shot learning in which the model must learn a new class from a single example. We propose the ‘less than one’-shot learning task, where models must learn N new classes given only M < N examples, and we show that this is achievable with the help of soft labels. We use a soft-label generalization of the k-Nearest Neighbors classifier to explore the intricate decision landscapes that can be created in the ‘less than one’-shot learning setting. We analyze these decision landscapes to derive theoretical lower bounds for separating N classes using M < N soft-label samples and investigate the robustness of the resulting systems. Read More
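
To make the idea concrete, here is a toy soft-label nearest-neighbour classifier that carves three class regions out of just two prototype points. The inverse-distance weighting is an assumption of this sketch (the paper’s exact soft-label kNN rule may differ in detail), and the prototype positions and soft labels are invented for illustration.

```python
import numpy as np

def soft_label_knn(query, prototypes, soft_labels, k=2, eps=1e-9):
    """Distance-weighted soft-label kNN: sum the label distributions of the
    k nearest prototypes, weighted by inverse distance, and take the argmax."""
    dists = np.linalg.norm(prototypes - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)
    scores = weights @ soft_labels[nearest]   # shape: (n_classes,)
    return int(np.argmax(scores))

# Two prototype points on a line, each carrying a soft label over THREE classes,
# i.e. M = 2 samples encoding N = 3 classes.
prototypes = np.array([[0.0], [1.0]])
soft_labels = np.array([[0.6, 0.4, 0.0],
                        [0.0, 0.4, 0.6]])

for x in (0.1, 0.5, 0.9):
    print(x, soft_label_knn(np.array([x]), prototypes, soft_labels))
# Near the left prototype class 0 wins, near the right class 2 wins, and the
# "hidden" middle class 1 emerges in between: three regions from two points.
```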

#performance

TernaryBERT: Quantization Meets Distillation

A BERTology contribution by Huawei

The ongoing trend of building ever larger models like BERT and GPT-3 has been accompanied by a complementary effort to reduce their size at little or no cost in accuracy. Effective smaller models are built via distillation (Pre-trained Distillation, DistilBERT, MobileBERT, TinyBERT), quantization (Q-BERT, Q8BERT), or parameter pruning.

On September 27, Huawei introduced TernaryBERT, a model that leverages both distillation and quantization to achieve accuracy comparable to the original BERT model with a ~15x decrease in size. What is truly remarkable about TernaryBERT is that its weights are ternarized, i.e. each weight takes one of three values: -1, 0, or 1 (and can hence be stored in only two bits). Read More
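
As a rough illustration of what ternarization means in practice, the snippet below applies a standard TWN-style (Ternary Weight Networks) heuristic: zero out weights below a threshold and map the rest to +/- alpha, where alpha is a per-matrix scaling factor. This is a generic sketch of the quantization step under those assumptions, not TernaryBERT’s full training procedure, which combines ternarization with knowledge distillation.

```python
import numpy as np

def ternarize(W: np.ndarray):
    """TWN-style ternarization: map each weight to {-alpha, 0, +alpha}.
    The threshold delta is approximately 0.7 * mean|W|; alpha is the mean
    magnitude of the weights that survive the threshold."""
    delta = 0.7 * np.mean(np.abs(W))
    mask = np.abs(W) > delta
    alpha = np.abs(W[mask]).mean() if mask.any() else 0.0
    W_ternary = alpha * np.sign(W) * mask          # values in {-alpha, 0, +alpha}
    codes = np.sign(W).astype(np.int8) * mask      # {-1, 0, +1}: storable in two bits
    return W_ternary, codes, alpha

rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(768, 768))        # a BERT-sized weight matrix
W_t, codes, alpha = ternarize(W)
print(alpha, np.unique(codes), np.mean(W_t != 0))  # scale, code values, sparsity
```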

#nlp, #performance