DeepNet: Scaling Transformers to 1,000 Layers

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DEEPNORM) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DEEPNORM a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction. Read More
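
The abstract boils down to replacing the standard residual connection with x ← LayerNorm(α·x + sublayer(x)) and down-scaling the initialization of certain sublayer weights by a factor β. Below is a minimal PyTorch-style sketch of that rule, assuming the encoder-only constants reported for DEEPNORM; treat the exact α/β values and the wiring as assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """DEEPNORM-style residual: x <- LayerNorm(alpha * x + sublayer(x)).

    Up-weighting the residual branch by alpha (together with down-scaled
    initialization) is what keeps model updates bounded as depth grows.
    """

    def __init__(self, d_model: int, alpha: float):
        super().__init__()
        self.alpha = alpha
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        return self.norm(self.alpha * x + sublayer(x))

# Constants for an N-layer encoder-only stack (per the paper's encoder-only
# formulas; an assumption here): alpha scales the residual branch, beta is the
# gain used to shrink the Xavier init of the FFN / value / output projections.
N = 200
alpha = (2 * N) ** 0.25
beta = (8 * N) ** -0.25

# Hypothetical usage with a feed-forward sublayer:
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
res = DeepNormResidual(d_model=512, alpha=alpha)
y = res(torch.randn(2, 16, 512), ffn)
```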

#training

AI program that started at Maricopa County colleges expanding nationwide

An artificial intelligence program that started at Maricopa Community Colleges in collaboration with Intel and Dell Technologies will expand nationwide by 2023.

The technology companies are partnering with the American Association of Community Colleges (AACC) to bring the program, which uses Intel’s AI-based curriculum to offer an associate degree and certificate of completion in the industry, to all 50 states after beginning in the Valley in fall 2020.

“This is an exciting partnership that will build a robust and diverse talent pipeline to help support future jobs in AI and technology,” Carlos Contreras, senior director of AI and Digital Readiness at Intel, said in a press release. Read More

#training

A Cartel of Influential Datasets Is Dominating Machine Learning Research, New Study Suggests

A new paper from the University of California and Google Research has found that a small number of ‘benchmark’ machine learning datasets, largely from influential western institutions, and frequently from government organizations, are increasingly dominating the AI research sector.

The researchers conclude that this tendency to ‘default’ to highly popular open source datasets, such as ImageNet, brings up a number of practical, ethical and even political causes for concern.

Among their findings – based on core data from the Facebook-led community project Papers With Code (PWC) – the authors contend that ‘widely-used datasets are introduced by only a handful of elite institutions’, and that this ‘consolidation’ has increased to 80% in recent years. Read More

The Paper

#bias, #training

AI’s Smarts Now Come With a Big Price Tag

As language models get more complex, they also get more expensive to create and run. Some companies are locked out.

Calvin Qi, who works at a search startup called Glean, would love to use the latest artificial intelligence algorithms to improve his company’s products.

Glean provides tools for searching through applications like Gmail, Slack, and Salesforce. Qi says new AI techniques for parsing language would help Glean’s customers unearth the right file or conversation a lot faster.

But training such a cutting-edge AI algorithm costs several million dollars. So Glean uses smaller, less capable AI models that can’t extract as much meaning from text. Read More

#deep-learning, #machine-learning, #training

Never invest your time in learning complex things.

The data scientist hype train has come to a grinding halt. It has been a joy ride for me, for I was one of the people who got hooked on data science as soon as it came out. Math, engineering, and the ability to predict stuff were very attractive indeed for a self-professed geek. I couldn’t resist, and soon I was devouring one book after the other. I started with Springer publications (Max Kuhn), Trevor Hastie, and a lot of O’Reilly books, and followed it up with statistics and math courses until I had the math and the techniques (Linear/Logistic Regression, SVM, Random Forests, Decision Trees, and some 20 others) down pat. Sounds great, right? Not quite.

Then came the Deep Learning revolution. I was first exposed to it thanks to Jeremy Howard, who in my opinion still runs the best damn deep learning course on the internet. He explains vision, NLP, and even structured-data machine learning. The guy is literally able to translate gobbledygook for the masses (me :-)). Plug: https://www.fast.ai/. Read More

#data-science, #training

Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks

Deep neural networks are powerful machines for visual pattern recognition, but reasoning tasks that are easy for humans may still be difficult for neural models. Humans possess the ability to extrapolate reasoning strategies learned on simple problems to solve harder examples, often by thinking for longer. For example, a person who has learned to solve small mazes can easily extend the very same search techniques to solve much larger mazes by spending more time. In computers, this behavior is often achieved through the use of algorithms, which scale to arbitrarily hard problem instances at the cost of more computation. In contrast, the sequential computing budget of feed-forward neural networks is limited by their depth, and networks trained on simple problems have no way of extending their reasoning to accommodate harder problems. In this work, we show that recurrent networks trained to solve simple problems with few recurrent steps can indeed solve much more complex problems simply by performing additional recurrences during inference. We demonstrate this algorithmic behavior of recurrent networks on prefix sum computation, mazes, and chess. In all three domains, networks trained on simple problem instances are able to extend their reasoning abilities at test time simply by “thinking for longer.” Read More
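
Below is a toy sketch of the mechanism, assuming a weight-tied convolutional block and a maze-style per-pixel output (the channel counts and input encoding are illustrative, not the paper's exact architecture); the only lever is how many times the same block is applied.

```python
import torch
import torch.nn as nn

class RecurrentReasoner(nn.Module):
    """A single weight-tied block that can be iterated more times at test time
    than during training, trading extra compute for harder problem instances."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.embed = nn.Conv2d(3, channels, 3, padding=1)
        # Weight-tied "recurrent" block: reusing it adds compute, not parameters.
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(channels, 2, 3, padding=1)  # e.g. per-pixel on-path / off-path

    def forward(self, x: torch.Tensor, iterations: int = 20) -> torch.Tensor:
        h = self.embed(x)
        for _ in range(iterations):  # "thinking for longer" = more iterations
            h = self.block(h)
        return self.head(h)

model = RecurrentReasoner()
_ = model(torch.randn(1, 3, 9, 9), iterations=20)     # training-time budget, small maze
_ = model(torch.randn(1, 3, 33, 33), iterations=100)  # harder instance: more recurrences at inference
```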

#training, #recurrent-neural-networks

Ultra-Low Precision 4-bit Training of Deep Neural Networks

In this paper, we propose a number of novel techniques and numerical representation formats that enable, for the very first time, the precision of training systems to be aggressively scaled from 8-bits to 4-bits. To enable this advance, we explore a novel adaptive Gradient Scaling technique (GradScale) that addresses the challenges of insufficient range and resolution in quantized gradients as well as explores the impact of quantization errors observed during model training. We theoretically analyze the role of bias in gradient quantization and propose solutions that mitigate the impact of this bias on model convergence. Finally, we examine our techniques on a spectrum of deep learning models in computer vision, speech and NLP. In combination with previously proposed solutions for 4-bit quantization of weight and activation tensors, 4-bit training shows non-significant loss in accuracy across application domains while enabling significant hardware acceleration (>7× over state-of-the-art FP16 systems). Read More
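
Here is a rough sketch of the gradient-scaling idea, using a plain symmetric INT4-style fake-quantizer as a stand-in (the paper's actual 4-bit number formats and the exact GradScale rule are not reproduced): without rescaling, small-magnitude gradients would underflow the 4-bit grid entirely.

```python
import torch

def quantize_int4_symmetric(x: torch.Tensor) -> torch.Tensor:
    """Uniform symmetric 4-bit fake-quantization onto the integers -7..7."""
    return torch.clamp(torch.round(x), -7, 7)

def gradscale_quantize(grad: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Sketch of adaptive gradient scaling: rescale the tensor so its largest
    magnitude sits at the top of the 4-bit grid, quantize, then undo the scale."""
    scale = 7.0 / (grad.abs().max() + eps)  # per-tensor scale; the paper's rule may differ
    return quantize_int4_symmetric(grad * scale) / scale

g = torch.randn(1024) * 1e-4  # gradients this small vanish if quantized unscaled
print((g - gradscale_quantize(g)).abs().mean())  # residual quantization error after scaling
```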

#training

The way we train AI is fundamentally flawed

The process used to build most of the machine-learning models we use today can’t tell if they will work in the real world or not—and that’s a problem.

It’s no secret that machine-learning models tuned and tweaked to near-perfect performance in the lab often fail in real settings. This is typically put down to a mismatch between the data the AI was trained and tested on and the data it encounters in the world, a problem known as data shift. For example, an AI trained to spot signs of disease in high-quality medical images will struggle with blurry or cropped images captured by a cheap camera in a busy clinic.

Now a group of 40 researchers across seven different teams at Google have identified another major cause for the common failure of machine-learning models. Called “underspecification,” it could be an even bigger problem than data shift. Read More
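
A toy illustration of the point (not the Google study's setup): models that look interchangeable on an i.i.d. test split, differing only in training randomness, can behave quite differently once the inputs shift.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic task plus a "shifted" evaluation set built by perturbing the inputs
# (a stand-in for blurry images from a cheap clinic camera).
X, y = make_classification(n_samples=4000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_shifted = X_test + np.random.default_rng(0).normal(scale=1.5, size=X_test.shape)

for seed in (1, 2, 3):
    # Same data, same architecture; only the training seed differs.
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=seed)
    model.fit(X_train, y_train)
    print(seed,
          round(model.score(X_test, y_test), 3),    # near-identical in-distribution scores...
          round(model.score(X_shifted, y_test), 3)) # ...can spread apart under shift
```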

#performance, #training

21 amazing YouTube channels for you to learn AI, Machine Learning, and Data Science for free

This is the perfect moment to start learning something new, and why not start with AI?

I know the pandemic is keeping everyone at home, working from home is becoming the new normal for many of us, and it is hard to find good in-person training these days, but that does not mean you need to stop learning!

I would say that this is the perfect moment to start learning something new, and why not start with Data Science? Read More

#training

Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves

Much as replacing hand-designed features with learned functions has revolutionized how we solve perceptual tasks, we believe learned algorithms will transform how we train models. In this work we focus on general-purpose learned optimizers capable of training a wide variety of problems with no user-specified hyperparameters. We introduce a new, neural network parameterized, hierarchical optimizer with access to additional features such as validation loss to enable automatic regularization. Most learned optimizers have been trained on only a single task, or a small number of tasks. We train our optimizers on thousands of tasks, making use of orders of magnitude more compute, resulting in optimizers that generalize better to unseen tasks. The learned optimizers not only perform well, but learn behaviors that are distinct from existing first-order optimizers. For instance, they generate update steps that have implicit regularization and adapt as the problem hyperparameters (e.g. batch size) or architecture (e.g. neural network width) change. Finally, these learned optimizers show evidence of being useful for out-of-distribution tasks such as training themselves from scratch. Read More
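
For intuition, a drastically simplified per-parameter learned optimizer might look like the sketch below. The paper's optimizer is hierarchical, consumes far richer features (including validation loss), and is itself meta-trained over thousands of tasks, so treat this only as a schematic of the general idea.

```python
import torch
import torch.nn as nn

class TinyLearnedOptimizer(nn.Module):
    """A small MLP maps per-parameter features (gradient, momentum) to updates.
    Its own weights would be meta-trained (e.g. with evolution strategies or
    truncated backprop) across many inner tasks."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.momentum_decay = 0.9

    @torch.no_grad()
    def step(self, params, grads, momenta):
        for p, g, m in zip(params, grads, momenta):
            m.mul_(self.momentum_decay).add_(g, alpha=1 - self.momentum_decay)
            feats = torch.stack([g.flatten(), m.flatten()], dim=-1)  # (numel, 2)
            p.add_(0.01 * self.net(feats).view_as(p))  # fixed output scale; the real one is learned

# Hypothetical inner-loop usage on a tiny model:
model = nn.Linear(10, 1)
model(torch.randn(4, 10)).pow(2).mean().backward()
opt = TinyLearnedOptimizer()
momenta = [torch.zeros_like(p) for p in model.parameters()]
opt.step(list(model.parameters()), [p.grad for p in model.parameters()], momenta)
```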

#machine-learning, #training