
Curation of RL papers

해리s 2019. 12. 6. 04:00

A few selected papers I've enjoyed reading. A more comprehensive list can be found at [https://spinningup.openai.com/en/latest/spinningup/keypapers.html].

 

Model-Free RL

1. Playing Atari with Deep Reinforcement Learning, Mnih et al, 2013. Algorithm: DQN

https://arxiv.org/abs/1312.5602

 

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning.

2. Deep Reinforcement Learning with Double Q-learning, van Hasselt et al, 2015. Algorithm: Double DQN

https://arxiv.org/abs/1509.06461

 

The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented.
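To make the overestimation point concrete, here is a minimal sketch contrasting the standard DQN target with the Double DQN target. The Q-value lists are made-up numbers for illustration, and the function names are hypothetical, not taken from the paper's code.

```python
def dqn_target(reward, gamma, q_target_next, done):
    """Standard target: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN target: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Hypothetical Q-value estimates for the next state (three actions).
q_online_next = [1.0, 2.0, 0.5]
q_target_next = [1.2, 0.8, 3.0]

y_dqn = dqn_target(1.0, 0.9, q_target_next, done=False)
y_double = double_dqn_target(1.0, 0.9, q_online_next, q_target_next, done=False)
```

In this toy case the standard target uses the largest target-network value, while the Double DQN target uses the target-network value of the action the online network prefers, which is lower here.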

3. Prioritized Experience Replay, Schaul et al, 2015. Algorithm: Prioritized Experience Replay (PER)

https://arxiv.org/abs/1511.05952

 

Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory.
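As a rough illustration of moving away from uniform sampling, the sketch below implements proportional prioritization: transitions are sampled with probability proportional to priority^alpha, and importance-sampling weights correct for the resulting bias. The TD errors, the values of alpha and beta, and the flat-list storage are assumptions for illustration; an efficient implementation would typically use a sum-tree instead.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalized so that the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


# Hypothetical absolute TD errors; a small constant keeps every priority positive.
td_errors = [0.1, 2.0, 0.5, 0.0]
priorities = [abs(td) + 1e-6 for td in td_errors]

probs = sampling_probabilities(priorities)
weights = importance_weights(probs)

rng = random.Random(0)
batch = rng.choices(range(len(priorities)), weights=probs, k=2)  # sampled indices
```

Transitions with large TD error are replayed more often but receive smaller importance weights in the loss.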

 

Policy Gradient

1. Trust Region Policy Optimization, Schulman et al, 2015. Algorithm: TRPO

https://arxiv.org/abs/1502.05477

 

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO).

2. High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al, 2015. Algorithm: GAE.

https://arxiv.org/abs/1506.02438

 

Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks.
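For readers who want to see what GAE computes, here is a minimal sketch of the advantage recursion for a single trajectory segment: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and A_t = delta_t + gamma * lambda * A_{t+1}. The function name and the example numbers are hypothetical, and episode-termination masking is left out for brevity.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory segment.

    values[t] is V(s_t); last_value is V of the state after the final step
    (use 0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Work backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]

# lam = 1 gives Monte Carlo returns minus the baseline (low bias, high variance).
adv_mc = gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
# lam = 0 gives one-step TD errors (higher bias, low variance).
adv_td = gae_advantages(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
```

Values of lambda between 0 and 1 interpolate between these two extremes.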

3. Proximal Policy Optimization Algorithms, Schulman et al, 2017. Algorithm: PPO-Clip, PPO-Penalty.

https://arxiv.org/abs/1707.06347

 

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
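The PPO-Clip variant of the surrogate objective can be sketched per sample as min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio between the new and old policy. The function name, the log-probability inputs, and eps = 0.2 are assumptions chosen for illustration.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound on ratio * advantage.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped_gain = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
# Ratio 0.5 with a negative advantage: the objective is held at (1 - eps) * A.
capped_loss = ppo_clip_objective(math.log(0.5), 0.0, advantage=-1.0)
# Ratio 1.5 with a negative advantage: no clipping, the full penalty applies.
full_penalty = ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0)
```

The clipping removes the incentive to move the ratio outside [1 - eps, 1 + eps] when doing so would improve the objective, while still penalizing moves that make it worse.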

4. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al, 2018. Algorithm: SAC.

https://arxiv.org/abs/1801.01290

 

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision making and control tasks.

 

Exploration Methods

1. Unifying Count-Based Exploration and Intrinsic Motivation, Bellemare et al, 2016. Algorithm: CTS-based Pseudocounts.

https://arxiv.org/abs/1606.01868

 

We consider an agent's uncertainty about its environment and the problem of generalizing this uncertainty across observations. Specifically, we focus on the problem of exploration in non-tabular reinforcement learning.

2. Curiosity-driven Exploration by Self-supervised Prediction, Pathak et al, 2017. Algorithm: Intrinsic Curiosity Module (ICM).

https://arxiv.org/abs/1705.05363

 

In many real-world scenarios, rewards extrinsic to the agent are extremely sparse, or absent altogether. In such cases, curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills that might be useful.

3. Exploration by Random Network Distillation, Burda et al, 2018. Algorithm: RND.

https://arxiv.org/abs/1810.12894

 

We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations.
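The idea of a prediction-error bonus can be sketched with a toy example. This is not the paper's architecture: both networks are replaced by small linear maps (an assumption to keep the example dependency-free), with a frozen random target and a predictor trained by plain SGD on observations the agent has visited.

```python
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix stored as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def features(weights, obs):
    """Linear feature map: weights @ obs."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def rnd_bonus(predictor, target, obs):
    """Intrinsic reward: squared error between predictor and frozen target features."""
    pred, targ = features(predictor, obs), features(target, obs)
    return sum((p - t) ** 2 for p, t in zip(pred, targ))


def train_predictor(predictor, target, obs, lr=0.1):
    """One SGD step on the squared prediction error; the target is never updated."""
    pred, targ = features(predictor, obs), features(target, obs)
    for k, (p, t) in enumerate(zip(pred, targ)):
        for j, x in enumerate(obs):
            predictor[k][j] -= lr * 2.0 * (p - t) * x


rng = random.Random(0)
target = make_matrix(3, 4, rng)             # fixed, randomly initialized
predictor = [[0.0] * 4 for _ in range(3)]   # trained on visited observations

seen = [1.0, 0.0, 0.0, 0.0]
novel = [0.0, 1.0, 0.0, 0.0]
for _ in range(100):
    train_predictor(predictor, target, seen)

bonus_seen = rnd_bonus(predictor, target, seen)
bonus_novel = rnd_bonus(predictor, target, novel)
```

After training, the frequently visited observation yields almost no bonus, while the unvisited one still produces a large prediction error and therefore a larger exploration reward.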

4. Variational Intrinsic Control, Gregor et al, 2016. Algorithm: VIC.

https://arxiv.org/abs/1611.07507

 

In this paper we introduce a new unsupervised reinforcement learning method for discovering the set of intrinsic options available to an agent. This set is learned by maximizing the number of different states an agent can reliably reach.

5. Diversity is All You Need: Learning Skills without a Reward Function, Eysenbach et al, 2018. Algorithm: DIAYN.

https://arxiv.org/abs/1802.06070

 

Intelligent creatures can explore their environments and learn useful skills without supervision. In this paper, we propose DIAYN ('Diversity is All You Need'), a method for learning useful skills without a reward function.

 

Theoretical Side

1. Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al, 2000. Contributions: Established the policy gradient theorem and showed convergence of policy gradient algorithms for arbitrary policy classes.

https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation

 

Published in Advances in Neural Information Processing Systems 12 (NIPS 1999).
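For reference, the policy gradient theorem is commonly stated as follows. The notation is an assumption and may differ from the paper's: $J(\theta)$ is the expected return of the policy $\pi_\theta$, $d^{\pi}$ is the state distribution induced by the policy, and $Q^{\pi}$ is its action-value function.

```latex
\nabla_\theta J(\theta)
  = \sum_{s} d^{\pi}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)
  = \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) \right]
```

The key point is that the gradient contains no derivative of the state distribution $d^{\pi}$, so it can be estimated from samples collected by running the policy.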

2. A Natural Policy Gradient, Kakade, 2002. Contributions: Brought natural gradients into RL, later leading to TRPO, ACKTR, and several other methods in deep RL.

https://papers.nips.cc/paper/2073-a-natural-policy-gradient

 

Published in Advances in Neural Information Processing Systems 14 (NIPS 2001).
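As a brief sketch of the idea, in commonly used notation that is an assumption and not necessarily the paper's, the natural gradient preconditions the ordinary policy gradient with the inverse Fisher information matrix of the policy:

```latex
\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta),
\qquad
F(\theta) = \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}
  \left[ \nabla_\theta \log \pi_\theta(a \mid s)\,
         \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right]
```

This makes the update direction independent of how the policy happens to be parameterized, which is the property that later trust-region methods such as TRPO build on.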

3. Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning, Lee et al, 2019. Topic: entropy-regularized RL.

https://arxiv.org/abs/1902.00137

 

In this paper, we present a new class of Markov decision processes (MDPs), called Tsallis MDPs, with Tsallis entropy maximization, which generalizes existing maximum entropy reinforcement learning (RL).

 

Robotics Domain

1. Learning Contact-Rich Manipulation Skills with Guided Policy Search, Levine et al, 2015. Algorithm: Guided Policy Search (GPS).

https://arxiv.org/abs/1501.05611

 

Autonomous learning of object manipulation skills can enable robots to acquire rich behavioral repertoires that scale to the variety of objects found in the real world.

2. Trajectory-based Probabilistic Policy Gradient for Learning Locomotion Behaviors, Choi and Kim, 2018. Algorithm: Deep Latent Policy Gradient (DLPG).

https://la.disneyresearch.com/publication/trajectory-based-probabilistic-policy-gradient-for-learning-locomotion-behaviors/

 

In this paper, we propose a trajectory-based reinforcement learning method named deep latent policy gradient (DLPG) for learning locomotion skills.

3. DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills, Peng et al, 2018. Algorithm: DeepMimic.

https://arxiv.org/abs/1804.02717

 

A longstanding goal in character animation is to combine data-driven specification of behavior with a system that can execute a similar behavior in a physical simulation, thus enabling realistic responses to perturbations and environmental variation.