In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example. The result is a sparsely-activated model – with an outrageous number of parameters – but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs, and training instability. We address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques mitigate the instabilities, and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large (Raffel et al., 2019) to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus”, and achieve a 4x speedup over the T5-XXL model. Read More
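The core simplification described above is routing each token to a single expert (top-1, or "switch" routing) instead of a weighted mixture of several. A minimal sketch of that idea, with illustrative names and shapes that are assumptions rather than the paper's implementation:

```python
# Minimal sketch of top-1 ("switch") routing: each token is sent to exactly
# one expert, chosen by the argmax of a learned router's logits.
import numpy as np

def switch_route(tokens, router_weights, experts):
    """Route each token to a single expert and scale its output by the
    router probability so the router remains trainable despite the
    discrete choice."""
    logits = tokens @ router_weights                       # [n_tokens, n_experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)             # softmax over experts
    chosen = probs.argmax(axis=-1)                         # one expert per token
    out = np.empty_like(tokens)
    for i, tok in enumerate(tokens):
        out[i] = probs[i, chosen[i]] * experts[chosen[i]](tok)
    return out, chosen

rng = np.random.default_rng(0)
d, n_experts = 8, 4
tokens = rng.normal(size=(5, d))
router_weights = rng.normal(size=(d, n_experts))
# Each "expert" here is just an independent linear map for illustration.
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
out, chosen = switch_route(tokens, router_weights, experts)
```

Because only one expert runs per token, compute per example stays constant no matter how many experts (and hence parameters) the model holds.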
Tag Archives: Performance
Re-imagining Algorithmic Fairness in India and Beyond
Conventional algorithmic fairness is West-centric, as seen in its sub-groups, values, and optimisations. In this paper, we de-center algorithmic fairness and analyse AI power in India. Based on 36 qualitative interviews and a discourse analysis of algorithmic deployments in India, we find that several assumptions of algorithmic fairness are challenged in India. We find that data is not always reliable due to socio-economic factors, users are given third world treatment by ML makers, and AI signifies unquestioning aspiration. We contend that localising model fairness alone can be window dressing in India, where the distance between models and oppressed communities is large. Instead, we re-imagine algorithmic fairness in India and provide a roadmap to re-contextualise data and models, empower oppressed communities, and enable Fair-ML ecosystems. Read More
AI chips in the real world: Interoperability, constraints, cost, energy efficiency, and models
The answer to the question of how to make the best of AI hardware may not be solely, or even primarily, related to hardware
How do you make the best of the proliferating array of emerging custom silicon hardware without spreading yourself too thin trying to keep up with each and every one of them?
If we were to put a price tag on that question, it would be in multi-billion dollar territory: that is the combined estimated value of the different markets it touches. As AI applications are exploding, so is the specialized hardware that supports them. Read More
Artificial Intelligence is a Supercomputing problem
The next generation of Artificial Intelligence applications imposes new and demanding requirements on computing infrastructure. What do the computer systems that support artificial intelligence look like? How did we get here? Who has access to these systems? And what is our responsibility as Artificial Intelligence practitioners?
[These posts will be used in the master's course Supercomputers Architecture at UPC Barcelona Tech with the support of the BSC]
Part 1
Part 2
Machine learning at the speed of light: New paper demonstrates use of photonic structures for AI
As we enter the next chapter of the digital age, data traffic continues to grow exponentially. To further enhance artificial intelligence and machine learning, computers will need the ability to process vast amounts of data as quickly and as efficiently as possible.
Conventional computing methods are not up to the task, but in looking for a solution, researchers have seen the light—literally.
Light-based processors, called photonic processors, enable computers to complete complex calculations at incredible speeds. New research published this week in the journal Nature examines the potential of photonic processors for artificial intelligence applications. The results demonstrate for the first time that these devices can process information rapidly and in parallel, something that today’s electronic chips cannot do. Read More
Accelerating AI computing to the speed of light
Artificial intelligence and machine learning are already an integral part of our everyday lives online. … As the demands for AI online continue to grow, so does the need to speed up AI performance and find ways to reduce its energy consumption. Now a team of researchers has come up with a system that could help: an optical computing core prototype that uses phase-change material. This system is fast, energy efficient and capable of accelerating the neural networks used in AI and machine learning. The technology is also scalable and directly applicable to cloud computing. The team published these findings Jan. 4 in Nature Communications. Read More
Light-carrying chips advance machine learning
Researchers found that so-called photonic processors, which process data by means of light, can handle information much more rapidly, and in parallel, than electronic chips. Read More
DeepMind researchers claim neural networks can outperform neurosymbolic models
So-called neurosymbolic models, which combine algorithms with symbolic reasoning techniques, appear to be much better-suited to predicting, explaining, and considering counterfactual possibilities than neural networks. But researchers at DeepMind claim neural networks can outperform neurosymbolic models under the right testing conditions. In a preprint paper, coauthors describe an architecture for spatiotemporal reasoning about videos in which all components are learned and all intermediate representations are distributed (rather than symbolic) throughout the layers of the neural network. The team says that it surpasses the performance of neurosymbolic models across all questions in a popular dataset, with the greatest advantage on the counterfactual questions. Read More
Honey I Shrunk the Model: Why Big Machine Learning Models Must Go Small
Bigger is not always better for machine learning. Yet deep learning models and the datasets on which they’re trained keep expanding, as researchers race to outdo one another while chasing state-of-the-art benchmarks. However groundbreaking they are, the consequences of bigger models are severe for budgets and the environment alike. For example, GPT-3, this summer’s massive, buzzworthy model for natural language processing, reportedly cost $12 million to train. What’s worse, UMass Amherst researchers found that the computing power required to train a large AI model can produce over 600,000 pounds of CO2 emissions – five times the lifetime emissions of a typical car.
At the pace the machine learning industry is moving today, there are no signs of these compute-intensive efforts slowing down. Research from OpenAI showed that between 2012 and 2018, computing power for deep learning models grew a shocking 300,000x, outpacing Moore’s Law. The problem lies not only in training these algorithms, but also running them in production, or the inference phase. For many teams, practical use of deep learning models remains out of reach, due to sheer cost and resource constraints. Read More
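To put that 300,000x figure in perspective, a quick back-of-the-envelope comparison with what Moore's law (a doubling roughly every two years) would predict over the same 2012–2018 window:

```python
# Compare the reported 300,000x growth in deep-learning compute (2012-2018)
# with the growth Moore's law would predict over the same period.
years = 2018 - 2012
moores_law_growth = 2 ** (years / 2)         # one doubling every ~2 years
reported_growth = 300_000

print(moores_law_growth)                     # 8.0
print(reported_growth / moores_law_growth)   # 37500.0
```

Moore's law would account for roughly an 8x increase over those six years; the reported growth in training compute outpaces it by a factor of tens of thousands.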
Zero-shot Learning for Relation Extraction
Most existing supervised and few-shot relation extraction methods rely on labeled training data. However, in real-world scenarios there exist many relations for which no training data is available. We address this issue from the perspective of zero-shot learning (ZSL), which mirrors the way humans learn and recognize new concepts with no prior knowledge. We propose a zero-shot learning relation extraction (ZSLRE) framework, which focuses on recognizing novel relations that have no corresponding labeled data available for training. Our proposed ZSLRE model aims to recognize new relations using prototypical networks that are modified to utilize side (auxiliary) information. This side information allows the modified prototypical networks to recognize novel relations in addition to previously known ones. We construct side information from labels and their synonyms, hypernyms of named entities, and keywords, and we build an automatic hypernym extraction framework to obtain hypernyms of various named entities directly from the web. Extensive experiments on two public datasets (NYT and FewRel) demonstrate that our proposed model significantly outperforms state-of-the-art methods on supervised, few-shot, and zero-shot learning tasks. Our experimental results also demonstrate the effectiveness and robustness of our proposed model in a combination scenario. Once accepted for publication, we will publish ZSLRE’s source code and datasets to enable reproducibility and encourage further research. Read More
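The prototypical-network idea above can be sketched simply: each relation gets a prototype vector, and for a novel (zero-shot) relation with no labeled support examples, the prototype falls back entirely on the side-information embedding. All names, shapes, and the fusion rule below are illustrative assumptions, not ZSLRE's actual code:

```python
# Sketch of prototypical-network classification augmented with side
# information: queries are assigned to the nearest relation prototype.
import numpy as np

def prototype(support_embs, side_emb, alpha=0.5):
    """Fuse the mean of labeled support examples with a side-information
    embedding. With no support (the zero-shot case), fall back entirely
    on side information built from labels, synonyms, and hypernyms."""
    if len(support_embs) == 0:
        return side_emb
    return alpha * np.mean(support_embs, axis=0) + (1 - alpha) * side_emb

def classify(query, protos):
    """Assign the query embedding to the nearest prototype (Euclidean)."""
    return min(protos, key=lambda rel: np.linalg.norm(query - protos[rel]))

rng = np.random.default_rng(1)
d = 16
side = {"founder_of": rng.normal(size=d), "located_in": rng.normal(size=d)}
support = {"founder_of": [side["founder_of"] + 0.1 * rng.normal(size=d)],
           "located_in": []}  # novel relation: zero-shot, side info only
protos = {rel: prototype(support[rel], side[rel]) for rel in side}
pred = classify(side["located_in"] + 0.05 * rng.normal(size=d), protos)
```

The key design point is that a relation never seen in training can still be recognized, because its prototype is constructed from auxiliary knowledge rather than labeled examples.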