Reinforcement learning (RL) is a framework that lets agents learn decision making from experience. One of the many variants of RL is off-policy RL, where an agent is trained using a combination of data collected by other agents (off-policy data) and data it collects itself to learn generalizable skills like robotic walking and grasping. In contrast, fully off-policy RL is a variant in which an agent learns entirely from older data, which is appealing because it enables model iteration without requiring a physical robot. With fully off-policy RL, one can train several models on the same fixed dataset collected by previous agents, then select the best one. However, fully off-policy RL comes with a catch: while training can occur without a real robot, evaluation of the models cannot. Furthermore, ground-truth evaluation with a physical robot is too inefficient to test promising approaches that require evaluating a large number of models, such as automated architecture search with AutoML. Read More
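The workflow described above can be made concrete with a small sketch. The snippet below is illustrative only (the toy linear Q-model and all names are our own, not from the post): it trains several candidate Q-functions on one fixed dataset and stops exactly at the point the post identifies as the hard part, namely deciding which candidate to deploy without a real robot.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed, previously collected dataset of (state, action, reward) tuples.
# In practice this would come from logs of earlier agents.
N, state_dim, action_dim = 5000, 8, 2
dataset = {
    "s": rng.normal(size=(N, state_dim)),
    "a": rng.normal(size=(N, action_dim)),
    "r": rng.binomial(1, 0.3, size=N).astype(float),  # sparse binary reward
}

def train_q_model(data, lr, seed):
    """Toy stand-in for an off-policy Q-learning run on the fixed dataset:
    a linear Q(s, a) regressed toward the observed reward."""
    w = np.random.default_rng(seed).normal(scale=0.01, size=state_dim + action_dim)
    x = np.concatenate([data["s"], data["a"]], axis=1)
    for _ in range(200):
        grad = x.T @ (x @ w - data["r"]) / len(x)
        w -= lr * grad
    return w

# Train several candidate models on the *same* fixed dataset.
candidates = [train_q_model(dataset, lr, seed=i) for i, lr in enumerate([0.01, 0.1, 0.5])]

# The catch described above: choosing among `candidates` normally requires
# rolling each policy out on a physical robot, which is exactly what
# off-policy evaluation tries to avoid.
print(f"trained {len(candidates)} candidate Q-models on one fixed dataset")
```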
Off-Policy Evaluation via Off-Policy Classification
In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE for value-based methods, which are of particular interest in deep RL, with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Existing OPE metrics either rely on a model of the environment or on the use of importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real world of policies trained in simulation for an image-based robotic manipulation task. Read More
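To make the classification framing more concrete, here is a minimal sketch of the idea in Python. It assumes a log of state-action pairs with a per-episode binary success label: pairs from successful episodes are treated as positives and everything else as unlabeled, and a Q-function is scored by how well it separates the two. The `soft_opc_score` helper below (mean Q on positives minus mean Q on all data) is one simple soft variant of this idea, not necessarily the paper's exact metric.

```python
import numpy as np

def soft_opc_score(q_values, success_mask):
    """Score a Q-function as a PU-style classifier of logged (s, a) pairs.

    q_values:     Q(s, a) evaluated on the logged state-action pairs.
    success_mask: True for pairs drawn from episodes that ended in success
                  (the "positive" set); all other pairs are unlabeled.
    """
    q = np.asarray(q_values, dtype=float)
    pos = np.asarray(success_mask, dtype=bool)
    return q[pos].mean() - q.mean()

# Usage: rank candidate Q-functions using only logged data, no new rollouts.
rng = np.random.default_rng(0)
success = rng.random(1000) < 0.3                                    # pairs from successful episodes
q_model_a = np.where(success, 1.5, 0.0) + rng.normal(0, 0.3, 1000)  # separates positives well
q_model_b = rng.normal(0, 1.0, 1000)                                # uninformative Q-function
print(soft_opc_score(q_model_a, success))   # clearly positive -> preferred
print(soft_opc_score(q_model_b, success))   # near zero
```

The point of the score is that a Q-function which ranks state-action pairs from successful episodes above the rest is more likely to correspond to a well-performing policy, so candidates can be compared without any new environment interaction.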
AlphaStar: An Evolutionary Computation Perspective
In January 2019, DeepMind revealed AlphaStar to the world—the first artificial intelligence (AI) system to beat a professional player at the game of StarCraft II—representing a milestone in the progress of AI. AlphaStar draws on many areas of AI research, including deep learning, reinforcement learning, game theory, and evolutionary computation (EC). In this paper we analyze AlphaStar primarily through the lens of EC, presenting a new look at the system and relating it to many concepts in the field. We highlight some of its most interesting aspects—the use of Lamarckian evolution, competitive co-evolution, and quality diversity. In doing so, we hope to provide a bridge between the wider EC community and one of the most significant AI systems developed in recent times. Read More
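Of the three EC concepts named above, Lamarckian evolution is the easiest to illustrate in code: when the population-based outer loop copies a strong individual, the offspring inherits the parent's learned network weights rather than only its hyperparameters. The sketch below is a generic population-based-training-style loop written for illustration, not AlphaStar's actual training code.

```python
import copy, random

# A small population of agents: learned parameters plus an evolvable hyperparameter.
population = [
    {"weights": [random.gauss(0, 1) for _ in range(4)],
     "lr": random.choice([1e-4, 1e-3, 1e-2]),
     "fitness": 0.0}
    for _ in range(8)
]

def inner_learning(agent):
    """Stand-in for gradient-based learning that modifies the weights in place."""
    agent["weights"] = [w - agent["lr"] * w for w in agent["weights"]]
    agent["fitness"] = -sum(w * w for w in agent["weights"])  # toy objective

for generation in range(5):
    for agent in population:
        inner_learning(agent)
    population.sort(key=lambda a: a["fitness"], reverse=True)
    # Lamarckian step: losers copy the winners' *trained* weights (and mutate
    # the hyperparameter), so acquired learning is passed on to offspring
    # instead of being retrained from scratch.
    for loser, winner in zip(population[4:], population[:4]):
        loser["weights"] = copy.deepcopy(winner["weights"])
        loser["lr"] = winner["lr"] * random.choice([0.5, 1.0, 2.0])
```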
The Power of Self-Learning Systems
AI Codes its Own ‘AI Child’ – AutoML
a16z Podcast: The History and Future of Machine Learning
How have we gotten to where we are with machine learning? Where are we going?
a16z Operating Partner Frank Chen and Carnegie Mellon professor Tom Mitchell first stroll down memory lane, visiting the major landmarks: the symbolic approach of the 1970s, the “principled probabilistic methods” of the 1980s, and today’s deep learning phase. Then they go on to explore the frontiers of research. Along the way, they cover:
– How planning systems from the 1970s and early 1980s were stymied by the “banana in the tailpipe” problem
– How the relatively slow neurons in our visual cortex work together to deliver very speedy and accurate recognition
– How fMRI scans of the brain reveal common neural patterns across people when they are exposed to common nouns like chair, car, knife, and so on
– How the computer science community is working with social scientists (psychologists, economists, and philosophers) on building measures for fairness and transparency for machine learning models
– How we want our self-driving cars to have reasonable answers to the Trolley Problem, but no one sitting for their DMV exam is ever asked how they would respond
– How there were inflated expectations (and great social fears) for AI in the 1980s, and how US concerns about Japan then compare to our concerns about China today
– Whether this is the best time ever for AI and ML research and what continues to fascinate and motivate Tom after decades in the field
Read More