A few selected papers I've enjoyed reading. A more comprehensive list can be found at https://spinningup.openai.com/en/latest/spinningup/keypapers.html
Model-Free RL
1. Playing Atari with Deep Reinforcement Learning, Mnih et al, 2013. Algorithm: DQN
https://arxiv.org/abs/1312.5602
2. Deep Reinforcement Learning with Double Q-learning, van Hasselt et al, 2015. Algorithm: Double DQN. (A sketch of the double-Q target follows this list.)
https://arxiv.org/abs/1509.06461
3. Prioritized Experience Replay, Schaul et al, 2015. Algorithm: Prioritized Experience Replay (PER)
https://arxiv.org/abs/1511.05952
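A minimal sketch of the Double DQN target from the van Hasselt et al paper, assuming PyTorch and two Q-networks ("online" and "target") that map a batch of observations to one value per action; the variable names are illustrative, not from any reference implementation.

    import torch

    def double_dqn_target(online_q, target_q, reward, next_obs, done, gamma=0.99):
        """Standard DQN uses max_a Q_target(s', a), which tends to overestimate values.
        Double DQN decouples selection from evaluation: the online network picks the
        argmax action, the target network evaluates it. `done` is a 0/1 float tensor."""
        with torch.no_grad():
            next_action = online_q(next_obs).argmax(dim=1, keepdim=True)       # selection
            next_value = target_q(next_obs).gather(1, next_action).squeeze(1)  # evaluation
            return reward + gamma * (1.0 - done) * next_value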
Policy Gradient
1. Trust Region Policy Optimization, Schulman et al, 2015. Algorithm: TRPO
https://arxiv.org/abs/1502.05477
2. High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al, 2015. Algorithm: GAE.
https://arxiv.org/abs/1506.02438
3. Proximal Policy Optimization Algorithms, Schulman et al, 2017. Algorithm: PPO-Clip, PPO-Penalty. (A GAE + PPO-Clip sketch follows this list.)
https://arxiv.org/abs/1707.06347
4. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al, 2018. Algorithm: SAC.
https://arxiv.org/abs/1801.01290
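A minimal sketch combining GAE (Schulman et al, 2015) with the PPO-Clip surrogate (Schulman et al, 2017), assuming PyTorch. Tensor names and shapes are illustrative, not taken from the papers' code.

    import torch

    def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
        """Generalized Advantage Estimation: an exponentially weighted sum of TD errors.
        `values` has one extra bootstrap entry for the state after the last step."""
        T = len(rewards)
        adv = torch.zeros(T)
        last = 0.0
        for t in reversed(range(T)):
            nonterminal = 1.0 - dones[t]
            delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
            last = delta + gamma * lam * nonterminal * last
            adv[t] = last
        return adv

    def ppo_clip_loss(new_logp, old_logp, adv, clip_eps=0.2):
        """PPO-Clip surrogate: take the pessimistic (min) of the unclipped and clipped
        importance-weighted advantage, so large policy updates are not rewarded."""
        ratio = torch.exp(new_logp - old_logp)
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        return -torch.min(unclipped, clipped).mean()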
Exploration Methods
1. Unifying Count-Based Exploration and Intrinsic Motivation, Bellemare et al, 2016. Algorithm: CTS-based Pseudocounts.
https://arxiv.org/abs/1606.01868
2. Curiosity-driven Exploration by Self-supervised Prediction, Pathak et al, 2017. Algorithm: Intrinsic Curiosity Module (ICM).
https://arxiv.org/abs/1705.05363
3. Exploration by Random Network Distillation, Burda et al, 2018. Algorithm: RND. (A bonus-computation sketch follows this list.)
https://arxiv.org/abs/1810.12894
4. Variational Intrinsic Control, Gregor et al, 2016. Algorithm: VIC.
https://arxiv.org/abs/1611.07507
5. Diversity is All You Need: Learning Skills without a Reward Function, Eysenbach et al, 2018. Algorithm: DIAYN.
https://arxiv.org/abs/1802.06070
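A minimal sketch of the Random Network Distillation bonus (Burda et al, 2018), assuming PyTorch. The target network is randomly initialized and frozen; the predictor is trained to match it, and the prediction error is the exploration bonus. Network sizes and dimensions here are illustrative.

    import torch
    import torch.nn as nn

    obs_dim, feat_dim = 8, 32
    target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
    predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
    for p in target.parameters():
        p.requires_grad_(False)  # the random target stays fixed forever

    opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

    def rnd_bonus(obs):
        """Per-observation exploration bonus = squared prediction error.
        Novel observations are poorly predicted, so they receive a larger bonus."""
        return (predictor(obs) - target(obs)).pow(2).mean(dim=-1)

    def rnd_update(obs_batch):
        """Minimizing the same error drives the bonus toward zero for familiar states."""
        loss = rnd_bonus(obs_batch).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()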
Theoretical Side
1. Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al, 2000. Contributions: Established the policy gradient theorem and showed convergence of a policy gradient algorithm for arbitrary policy classes. (The theorem is reproduced after this list.)
https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation
2. A Natural Policy Gradient, Kakade, 2002. Contributions: Brought natural gradients into RL, later leading to TRPO, ACKTR, and several other methods in deep RL.
https://papers.nips.cc/paper/2073-a-natural-policy-gradient
3. Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning, Lee et al, 2019. Contributions: Generalizes maximum entropy (entropy-regularized) RL through Tsallis entropy maximization.
https://arxiv.org/abs/1902.00137
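For reference, the policy gradient theorem from the Sutton et al paper, written in common modern notation, where d^{pi} denotes the (discounted) state visitation distribution and Q^{pi} the action-value function:

    \nabla_\theta J(\theta)
      = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)
      = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]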
Robotics Domain
1. Learning Contact-Rich Manipulation Skills with Guided Policy Search, Levine et al, 2015. Algorithm: Guided Policy Search (GPS).
https://arxiv.org/abs/1501.05611
2. Trajectory-based Probabilistic Policy Gradient for Learning Locomotion Behaviors, Choi and Kim, 2018. Algorithm: Deep Latent Policy Gradient (DLPG).
https://la.disneyresearch.com/publication/trajectory-based-probabilistic-policy-gradient-for-learning-locomotion-behaviors/
3. DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills, Peng et al, 2018. Algorithm: DeepMimic. (A pose-imitation reward sketch follows this list.)
https://arxiv.org/abs/1804.02717
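A toy sketch of a DeepMimic-style pose-imitation term (Peng et al, 2018), assuming NumPy. The paper's full reward combines pose, velocity, end-effector, and center-of-mass terms with specific weights; this shows only the general exp(-k * tracking error) shape. The joint layout and the scale k are illustrative, not the paper's values.

    import numpy as np

    def pose_imitation_reward(joint_rot, ref_joint_rot, k=2.0):
        """Reward is high when the character's joint rotations track the reference
        motion, and decays exponentially with the squared tracking error."""
        err = np.sum((joint_rot - ref_joint_rot) ** 2)
        return np.exp(-k * err)

    # Example: a 10-joint pose that almost matches the reference gets a reward near 1.
    pose = np.random.randn(10) * 0.01
    print(pose_imitation_reward(pose, np.zeros(10)))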