A critical analysis of metrics used for measuring progress in artificial intelligence

Comparing model performances on benchmark datasets is an integral part of measuring and driving progress in artificial intelligence. A model’s performance on a benchmark dataset is commonly assessed based on a single or a small set of performance metrics. While this enables quick comparisons, it may also entail the risk of inadequately reflecting model performance if the metric does not sufficiently cover all performance characteristics. It is currently unknown to what extent this impacts benchmarking efforts. To address this question, we analysed the current landscape of performance metrics based on data covering 3867 machine learning model performance results from the web-based open platform ‘Papers with Code’. Our results suggest that the large majority of metrics currently used to evaluate classification AI benchmark tasks have properties that may result in an inadequate reflection of a classifier’s performance, especially when used with imbalanced datasets. While alternative metrics that address problematic properties have been proposed, they are currently rarely applied as performance metrics in benchmarking tasks. Finally, we noticed that the reporting of metrics was partly inconsistent and partly unspecific, which may lead to ambiguities when comparing model performances. Read More
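
A hedged toy illustration of the imbalanced-data problem the paper highlights (the data below are made up, not taken from the paper): a classifier that always predicts the majority class scores high accuracy, while balanced accuracy and the Matthews correlation coefficient expose that it has no predictive value.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, matthews_corrcoef

# Toy illustration: a degenerate classifier that always predicts the
# majority class on a 95/5 imbalanced dataset.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)  # always predict class 0

print(accuracy_score(y_true, y_pred))           # 0.95 -- looks strong
print(balanced_accuracy_score(y_true, y_pred))  # 0.50 -- chance level
print(matthews_corrcoef(y_true, y_pred))        # 0.0  -- no predictive value
```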

#performance

Can Your AI Differentiate Cats from Covid-19? Sample Efficient Uncertainty Estimation for Deep Learning Safety

Deep Neural Networks (DNNs) are known to make highly overconfident predictions on Out-of-Distribution data. Recent research has shown that uncertainty-aware models, such as Bayesian Neural Networks (BNNs) and Deep Ensembles, are less susceptible to this issue. However, research in this area has been largely confined to the big-data setting. In this work, we show that even state-of-the-art BNNs and ensemble models tend to make overconfident predictions when the amount of training data is insufficient. This is especially concerning for emerging applications in physical sciences and healthcare, where overconfident and inaccurate predictions can lead to disastrous consequences. To address the issue of accurate uncertainty (or confidence) estimation in the small-data regime, we propose a probabilistic generalization of the popular sample-efficient non-parametric kNN approach. We demonstrate the usefulness of the proposed approach on a real-world application of COVID-19 diagnosis from chest X-rays by (a) highlighting surprising failures of existing techniques, and (b) achieving superior uncertainty quantification compared to the state of the art. Read More
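
A rough, hedged sketch of the general idea (not the authors’ exact probabilistic formulation): derive a confidence estimate from distance-weighted votes of a query point’s k nearest training neighbours in some feature space. The features, labels and temperature below are illustrative placeholders; in practice the features would come from a trained network.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Illustrative stand-ins: in practice these would be embeddings and labels
# from the training set of a real model.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 16))
train_labels = rng.integers(0, 2, size=500)

knn = NearestNeighbors(n_neighbors=10).fit(train_feats)

def knn_confidence(query_feats, temperature=1.0):
    """Class probabilities from distance-weighted votes of the k nearest neighbours."""
    dists, idx = knn.kneighbors(query_feats)
    weights = np.exp(-dists / temperature)       # closer neighbours count more
    probs = np.zeros((len(query_feats), 2))
    for c in (0, 1):
        probs[:, c] = (weights * (train_labels[idx] == c)).sum(axis=1)
    return probs / probs.sum(axis=1, keepdims=True)  # low max prob => low confidence

print(knn_confidence(rng.normal(size=(3, 16))))
```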

#accuracy, #performance

A solution to the learning dilemma for recurrent networks of spiking neurons

Recurrently connected networks of spiking neurons underlie the astounding information processing capabilities of the brain. Yet in spite of extensive research, how they can learn through synaptic plasticity to carry out complex network computations remains unclear. We argue that two pieces of this puzzle were provided by experimental data from neuroscience. A mathematical result tells us how these pieces need to be combined to enable biologically plausible online network learning through gradient descent, in particular deep reinforcement learning. This learning method, called e-prop, approaches the performance of backpropagation through time (BPTT), the best-known method for training recurrent neural networks in machine learning. In addition, it suggests a method for powerful on-chip learning in energy-efficient spike-based hardware for artificial intelligence. Read More
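
A heavily simplified, hedged sketch of the eligibility-trace idea behind e-prop (a single leaky unit rather than a spiking network, with illustrative constants, not the paper’s derivation): weights are updated online from the product of a local eligibility trace and an instantaneous learning signal, instead of unrolling the whole sequence as BPTT does.

```python
import numpy as np

# Toy online learning with eligibility traces (illustrative only).
rng = np.random.default_rng(0)
T, n_in = 50, 3
x = rng.normal(size=(T, n_in))   # input sequence
y_target = rng.normal(size=T)    # target readout at each time step

w = np.zeros(n_in)               # input weights to learn
alpha = 0.9                      # leak factor of the unit's state
lr = 0.01                        # learning rate

for epoch in range(201):
    h = 0.0                      # membrane-like state variable
    elig = np.zeros(n_in)        # eligibility trace, one entry per weight
    grad = np.zeros(n_in)
    sq_err = 0.0
    for t in range(T):
        h = alpha * h + w @ x[t]           # forward dynamics
        elig = alpha * elig + x[t]         # dh_t/dw, maintained locally online
        learning_signal = h - y_target[t]  # instantaneous readout error
        grad += learning_signal * elig     # e-prop-style update contribution
        sq_err += learning_signal ** 2
    w -= lr * grad / T                     # apply the accumulated online update
    if epoch % 50 == 0:
        print(f"epoch {epoch}: mse = {sq_err / T:.3f}")
```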

#performance, #recurrent-neural-networks

There’s plenty of room at the Top: What will drive computer performance after Moore’s law?

The doubling of the number of transistors on a chip every 2 years, a seemingly inevitable trend that has been called Moore’s law, has contributed immensely to improvements in computer performance. However, silicon-based transistors cannot get much smaller than they are today, and other approaches should be explored to keep performance growing. Leiserson et al. review recent examples and argue that the most promising place to look is at the top of the computing stack, where improvements in software, algorithms, and hardware architecture can bring the much-needed boost. Read More
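
A small, hedged illustration of what “room at the Top” means in software alone (matrix size and timings are only indicative, not the paper’s own case study, which uses much larger matrices): the same matrix multiplication, timed as naive interpreted loops versus an optimized library call.

```python
import time
import numpy as np

# Same computation, two implementations: naive Python loops vs. an optimized
# BLAS-backed library call. Size kept small so the naive version finishes quickly.
n = 150
A = np.random.rand(n, n)
B = np.random.rand(n, n)

def naive_matmul(A, B):
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

t0 = time.perf_counter(); naive_matmul(A, B); t1 = time.perf_counter()
t2 = time.perf_counter(); A @ B;              t3 = time.perf_counter()
print(f"naive loops: {t1 - t0:.2f} s, numpy: {t3 - t2:.4f} s")
```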

#performance, #nvidia

Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges

Machine learning has evolved into an enabling technology for a wide range of highly successful applications. The potential for this success to continue and accelerate has placed machine learning (ML) at the top of research, economic and political agendas. Such unprecedented interest is fueled by a vision of ML applicability extending to healthcare, transportation, defense and other domains of great societal importance. Achieving this vision requires the use of ML in safety-critical applications that demand levels of assurance beyond those needed for current ML applications. Our paper provides a comprehensive survey of the state-of-the-art in the assurance of ML, i.e. in the generation of evidence that ML is sufficiently safe for its intended use. The survey covers the methods capable of providing such evidence at different stages of the machine learning lifecycle, i.e. of the complex, iterative process that starts with the collection of the data used to train an ML component for a system, and ends with the deployment of that component within the system. The paper begins with a systematic presentation of the ML lifecycle and its stages. We then define assurance desiderata for each stage, review existing methods that contribute to achieving these desiderata, and identify open challenges that require further research. Read More

#accuracy, #assurance, #performance

A Benchmark for Machine Learning from an Academic/Industry Cooperative

MLPerf is a benchmarking consortium of more than 40 leading companies and university researchers that has released several rounds of results. MLPerf’s goals are:

Accelerate progress in ML via fair and useful measurement

Encourage innovation across state-of-the-art ML systems

Serve both industrial and research communities

Enforce replicability to ensure reliable results

Keep benchmark effort affordable so all can play. Read More

#mlperf, #nvidia, #performance

The Vision Behind MLPerf

A broad ML benchmark suite for measuring the performance of ML software frameworks, ML hardware accelerators, and ML cloud and edge platforms.

“… since 2012 the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5-month doubling time (by comparison, Moore’s Law had an 18-month doubling period). Since 2012, this metric has grown by more than 300,000x (an 18-month doubling period would yield only a 12x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.” Read More
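
A quick back-of-the-envelope check of the quoted figures (assuming an elapsed window of roughly 64 months, which is the period that makes both quoted numbers consistent): growth is 2 raised to the elapsed time divided by the doubling time.

```python
# Back-of-the-envelope check of the quoted growth figures. The ~64-month
# window is an assumption, not stated in the quote.
months_elapsed = 64
for doubling_months in (3.5, 18):
    growth = 2 ** (months_elapsed / doubling_months)
    print(f"{doubling_months}-month doubling over {months_elapsed} months -> ~{growth:,.0f}x")
# 3.5-month doubling -> on the order of 300,000x; 18-month doubling -> ~12x
```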

#mlperf, #nvidia, #performance