A few selected papers I've enjoyed reading. A more comprehensive list can be found in [
1. Playing Atari with Deep Reinforcement Learning, Mnih et al, 2013. Algorithm: DQN
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.
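The core of DQN is the one-step Q-learning target computed with a separate, slowly-updated target network. A minimal numpy sketch of just the target computation (illustrative names, not the paper's code):

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """One-step targets r + gamma * max_a' Q_target(s', a').

    rewards: (B,) batch of rewards
    next_q_target: (B, A) target-network Q-values for the next states
    dones: (B,) 1.0 where the episode terminated (no bootstrap)
    """
    max_next = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next
```

The online network is then regressed toward these targets on minibatches drawn from a replay buffer.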
2. Deep Reinforcement Learning with Double Q-learning, van Hasselt et al, 2015. Algorithm: Double DQN
The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented.
3. Prioritized Experience Replay, Schaul et al, 2015. Algorithm: Prioritized Experience Replay (PER)
Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory.
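PER instead samples transitions with probability proportional to priority^alpha (e.g. priority = |TD error|) and corrects the resulting bias with importance weights (N * P(i))^-beta. A minimal sketch, assuming a flat array of priorities rather than the paper's sum-tree:

```python
import numpy as np

def per_sample(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample indices with P(i) ∝ p_i^alpha; return importance weights.

    Weights are normalized by their max, a common stability trick.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights
```

With uniform priorities this degenerates to uniform replay with unit weights, which is a handy sanity check.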
1. Trust Region Policy Optimization, Schulman et al, 2015. Algorithm: TRPO
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO).
2. High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al, 2015. Algorithm: GAE.
Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks.
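GAE computes advantages as an exponentially-weighted sum of TD residuals, trading bias for variance via lambda. A minimal numpy sketch of the backward recursion (illustrative, not the paper's code):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: length-T array; values: length T+1 (bootstrap value appended).
    A_t = delta_t + (gamma * lam) * A_{t+1},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

At lam=1 (and V=0) this reduces to Monte Carlo returns; at lam=0 it is the one-step TD residual.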
3. Proximal Policy Optimization Algorithms, Schulman et al, 2017. Algorithm: PPO-Clip, PPO-Penalty.
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
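PPO-Clip's surrogate is the minimum of the unclipped and clipped importance-weighted advantage, which removes the incentive to move the policy ratio outside [1-eps, 1+eps]. The loss in a few lines of numpy (sketch only):

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps=0.2):
    """Clipped surrogate loss (to minimize).

    ratio: pi_new(a|s) / pi_old(a|s) per sample; adv: advantage estimates.
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -np.minimum(unclipped, clipped).mean()
```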
4. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al, 2018. Algorithm: SAC.
Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks.
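SAC's critic is trained toward a soft Bellman target that adds an entropy bonus (weighted by a temperature alpha) and takes the minimum over twin Q-networks. A sketch of just the target, assuming the two Q-values have already been reduced with a min:

```python
import numpy as np

def sac_targets(rewards, next_q_min, next_logp, dones, gamma=0.99, alpha=0.2):
    """Soft Bellman targets: r + gamma * (min_i Q_i(s', a') - alpha * log pi(a'|s')).

    next_q_min: elementwise min of the twin target critics at sampled a' ~ pi.
    next_logp: log pi(a'|s') for those sampled actions.
    """
    soft_v = next_q_min - alpha * next_logp
    return rewards + gamma * (1.0 - dones) * soft_v
```

The `-alpha * log pi` term is what makes the objective "maximum entropy": low-probability actions earn a larger bootstrap value.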
1. Unifying Count-Based Exploration and Intrinsic Motivation, Bellemare et al, 2016. Algorithm: CTS-based Pseudocounts.
We consider an agent's uncertainty about its environment and the problem of generalizing this uncertainty across observations. Specifically, we focus on the problem of exploration in non-tabular reinforcement learning.
2. Curiosity-driven Exploration by Self-supervised Prediction, Pathak et al, 2017. Algorithm: Intrinsic Curiosity Module (ICM).
In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful later in its life.
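ICM's curiosity bonus is the forward model's prediction error in a learned feature space (the inverse model shapes the features; it is omitted here). A deliberately tiny sketch with a linear forward model standing in for the paper's networks:

```python
import numpy as np

def icm_bonus(phi_s, phi_next, action_onehot, w_fwd):
    """Intrinsic reward = 0.5 * ||f([phi(s); a]) - phi(s')||^2.

    phi_s, phi_next: feature embeddings of s and s'.
    w_fwd: weights of a linear forward model mapping [phi(s); a] -> phi(s').
    States the forward model predicts well (familiar dynamics) earn ~0 bonus.
    """
    x = np.concatenate([phi_s, action_onehot])
    pred_next = x @ w_fwd
    return 0.5 * ((pred_next - phi_next) ** 2).sum()
```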
3. Exploration by Random Network Distillation, Burda et al, 2018. Algorithm: RND.
We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network.
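RND in one line: a predictor network is trained to imitate a frozen random network, and the residual error is the novelty bonus, since it shrinks on states seen often and stays large on novel ones. A sketch with linear stand-ins for both networks:

```python
import numpy as np

def rnd_bonus(obs, w_target, w_pred):
    """Per-observation bonus = squared error between target and predictor features.

    w_target: fixed, randomly initialized, never trained.
    w_pred:   trained (elsewhere) to match the target's outputs on visited states.
    """
    f_target = obs @ w_target
    f_pred = obs @ w_pred
    return ((f_target - f_pred) ** 2).sum(axis=1)
```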
4. Variational Intrinsic Control, Gregor et al, 2016. Algorithm: VIC.
In this paper we introduce a new unsupervised reinforcement learning method for discovering the set of intrinsic options available to an agent.
5. Diversity is All You Need: Learning Skills without a Reward Function, Eysenbach et al, 2018. Algorithm: DIAYN.
Intelligent creatures can explore their environments and learn useful skills without supervision. In this paper, we propose DIAYN ('Diversity is All You Need'), a method for learning useful skills without a reward function.
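DIAYN replaces the task reward with a skill-discrimination pseudo-reward, log q(z|s) - log p(z): the agent is rewarded when a learned discriminator can tell which skill z produced the current state. With a uniform skill prior this is easy to sketch:

```python
import numpy as np

def diayn_reward(disc_logits, z, n_skills):
    """Pseudo-reward r = log q(z|s) - log p(z), with p(z) uniform over n_skills.

    disc_logits: the discriminator's logits over skills for the current state.
    z: index of the skill the agent is currently executing.
    """
    m = disc_logits.max()
    log_q = disc_logits - (m + np.log(np.exp(disc_logits - m).sum()))  # log-softmax
    return log_q[z] + np.log(n_skills)
```

When the discriminator is at chance (uniform logits) the reward is exactly 0, so skills only earn reward by visiting distinguishable states.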
1. Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al, 2000. Contributions: Established the policy gradient theorem and showed convergence of a policy gradient algorithm for arbitrary policy classes.
Part of Advances in Neural Information Processing Systems 12 (NIPS 1999).
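The policy gradient theorem gives grad J(theta) = E[grad_theta log pi(a|s) * Q(s, a)], which the REINFORCE estimator approximates with sampled returns. A minimal score-function sketch for a single softmax policy over logits (illustrative, single state):

```python
import numpy as np

def reinforce_grad(logits, action, ret):
    """grad_logits of log pi(action) scaled by the return.

    For a softmax policy, d/dlogits log pi(a) = onehot(a) - softmax(logits),
    so the estimator is (onehot(a) - probs) * G.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_logp = -probs
    grad_logp[action] += 1.0
    return grad_logp * ret
```

Averaging this over sampled trajectories (optionally with a baseline subtracted from the return) yields an unbiased gradient ascent direction on expected return.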
2. A Natural Policy Gradient, Kakade, 2002. Contributions: Brought natural gradients into RL, later leading to TRPO, ACKTR, and several other methods in deep RL.
Part of Advances in Neural Information Processing Systems 14 (NIPS 2001).
3. Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning, Lee et al, 2019. Contributions: A unified framework for entropy-regularized RL via Tsallis entropy.
In this paper, we present a new class of Markov decision processes (MDPs), called Tsallis MDPs, with Tsallis entropy maximization, which generalizes existing maximum entropy reinforcement learning (RL).
1. Learning Contact-Rich Manipulation Skills with Guided Policy Search, Levine et al, 2015. Algorithm: Guided Policy Search (GPS)
Autonomous learning of object manipulation skills can enable robots to acquire rich behavioral repertoires that scale to the variety of objects found in the real world.
2. Trajectory-based Probabilistic Policy Gradient for Learning Locomotion Behaviors, Choi and Kim, 2018. Algorithm: Deep Latent Policy Gradient (DLPG)
In this paper, we propose a trajectory-based reinforcement learning method named deep latent policy gradient (DLPG) for learning locomotion skills.
3. DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills, Peng et al, 2018. Algorithm: DeepMimic
A longstanding goal in character animation is to combine data-driven specification of behavior with a system that can execute a similar behavior in a physical simulation, thus enabling realistic responses to perturbations and environmental variation.