Ultra-Low Precision 4-bit Training of Deep Neural Networks

In this paper, we propose a number of novel techniques and numerical representation formats that enable, for the first time, the precision of training systems to be aggressively scaled from 8 bits to 4 bits. To enable this advance, we explore a novel adaptive Gradient Scaling technique (GradScale) that addresses the challenges of insufficient range and resolution in quantized gradients, and we examine the impact of quantization errors observed during model training. We theoretically analyze the role of bias in gradient quantization and propose solutions that mitigate the impact of this bias on model convergence. Finally, we evaluate our techniques on a spectrum of deep learning models in computer vision, speech, and NLP. In combination with previously proposed solutions for 4-bit quantization of weight and activation tensors, 4-bit training shows a non-significant loss in accuracy across application domains while enabling significant hardware acceleration (>7× over state-of-the-art FP16 systems). Read More
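
Below is a minimal, hypothetical sketch of per-tensor gradient scaling followed by 4-bit fake quantization, intended only to illustrate the general idea of scaling gradients into a representable range before quantizing; the uniform symmetric level grid and scaling rule here are assumptions for clarity, not the paper's GradScale algorithm or its radix-4 FP4 format.

```python
# Illustrative only: generic per-tensor gradient scaling + 4-bit fake quantization.
# The uniform symmetric grid is an assumption; the paper's adaptive GradScale
# and FP4 representation differ in detail.
import torch

def quantize_grad_4bit(grad: torch.Tensor, num_levels: int = 15) -> torch.Tensor:
    """Scale a gradient tensor into [-1, 1], round to a symmetric 4-bit grid,
    then rescale back (simulated / fake quantization)."""
    scale = grad.abs().max().clamp(min=1e-12)   # per-tensor scale factor
    g = grad / scale                            # map gradients into [-1, 1]
    step = 2.0 / (num_levels - 1)               # spacing of the uniform grid
    q = torch.round(g / step) * step            # round to the nearest level
    return q * scale                            # dequantize to original range
```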

#training

Deep reinforcement-learning architecture combines pre-learned skills to create new sets of skills on the fly

A team of researchers from the University of Edinburgh and Zhejiang University has developed a way to combine deep neural networks (DNNs) to create a system with a new kind of learning ability. The group describes the architecture and its performance in the journal Science Robotics.

Deep neural networks learn functions by training repeatedly on many examples. To date, they have been used in a wide variety of applications, such as recognizing faces in a crowd or deciding whether a loan applicant is creditworthy. In this new effort, the researchers combined several DNNs developed for different applications to create a single system with the benefits of all of its constituent DNNs. Read More
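
One common way to realize this kind of skill combination is a mixture-of-experts style blend, in which a small gating network weights the outputs of frozen, pre-trained expert networks. The sketch below is a hypothetical illustration under that assumption; the class name and blending rule are inventions for this example, and the architecture reported in the paper may combine its experts differently.

```python
# Hypothetical mixture-of-experts style skill blending (not the paper's exact design).
import torch
import torch.nn as nn

class SkillBlender(nn.Module):
    def __init__(self, experts: nn.ModuleList, obs_dim: int):
        super().__init__()
        self.experts = experts                       # frozen pre-trained skill networks
        for p in self.experts.parameters():
            p.requires_grad_(False)
        self.gate = nn.Sequential(                   # learned gating network
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, len(experts)), nn.Softmax(dim=-1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        weights = self.gate(obs)                                   # (batch, num_experts)
        actions = torch.stack([e(obs) for e in self.experts], 1)   # (batch, num_experts, act_dim)
        return (weights.unsqueeze(-1) * actions).sum(dim=1)        # blended action
```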

#reinforcement-learning

Scale Neural Network Training with SageMaker Distributed

As a machine learning practitioner, you might find yourself in the following situations. You might have found the perfect state-of-the-art transformer-based model, only to run into memory issues when you try to fine-tune it. You might have just added billions of parameters to your model, which should improve its performance, but this too only gets you an out-of-memory error. Or you might be able to comfortably fit a model on a single GPU, yet struggle to take advantage of all your GPUs and find that training still takes days.

Should you just accept the status quo and limit your applications to models that fit within the existing hardware capacity or that train within an acceptable time? Of course not! Enter model parallelism and data parallelism on Amazon SageMaker. Read More
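
As a rough sketch of what this looks like in practice, the snippet below launches a training job with SageMaker's distributed data parallel library via the SageMaker Python SDK; the entry point, IAM role, instance settings, and S3 path are placeholders, and model parallelism is enabled through the analogous "modelparallel" block instead.

```python
# Minimal sketch: SageMaker distributed data parallel training job.
# All names below (script, role ARN, bucket) are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                               # your training script (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    instance_count=2,
    instance_type="ml.p3.16xlarge",
    framework_version="1.8.1",
    py_version="py36",
    # Data parallelism; for models too large for one GPU, use the
    # "modelparallel" block here instead.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit({"training": "s3://my-bucket/train"})        # placeholder S3 input
```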

#mlaas