A curated list of awesome exploration RL resources (continually updated)
Updated on 2024.03.07
Here is a collection of research papers on exploration methods in Reinforcement Learning (ERL). The repository will be continuously updated to track the frontier of ERL. Feel free to follow and star!
The balance of exploration and exploitation is one of the most central problems in reinforcement learning. To give readers an intuitive feel for exploration, we provide a visualization of a typical hard-exploration environment from MiniGrid below. In this task, reaching the goal often requires a sequence of dozens or even hundreds of actions, and the agent must explore the state-action space thoroughly to learn the skills required to achieve the goal.
A typical hard-exploration environment: MiniGrid-ObstructedMaze-Full-v0.
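To make the reward sparsity concrete, the snippet below rolls out a purely random policy in this environment. This is a minimal sketch assuming the `gymnasium` and `minigrid` packages are installed; it is an illustration, not part of any method in this list.

```python
# Minimal sketch (assumes the gymnasium and minigrid packages).
# A random policy essentially never reaches the goal in this environment,
# which is what makes it a hard-exploration benchmark.
import gymnasium as gym
import minigrid  # noqa: F401  -- importing registers the MiniGrid-* env ids

env = gym.make("MiniGrid-ObstructedMaze-Full-v0")
obs, info = env.reset(seed=0)
total_reward, terminated, truncated = 0.0, False, False
while not (terminated or truncated):
    action = env.action_space.sample()  # undirected random exploration
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
print(total_reward)  # almost always 0.0 under a random policy
```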
In general, we can divide the reinforcement learning process into two phases: the collect phase and the train phase. In the collect phase, the agent chooses actions based on the current policy and then interacts with the environment to collect useful experience. In the train phase, the agent uses the collected experience to update the current policy and obtain a better-performing one.
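To make this loop concrete, here is a schematic, self-contained sketch on a toy sparse-reward chain environment, using epsilon-greedy collection and tabular Q-learning training. All names (`TinyChain`, `buffer`, `q`) are illustrative stand-ins, not from any particular library.

```python
# Schematic sketch of the collect/train loop on a toy sparse-reward chain.
# All components here are illustrative stand-ins, not from any RL library.
import random

class TinyChain:
    """States 0..5; reward 1 only on reaching state 5 (sparse reward)."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):  # action in {-1, +1}
        self.state = min(5, max(0, self.state + action))
        done = self.state == 5
        return self.state, float(done), done

env, buffer, q = TinyChain(), [], {}

for iteration in range(200):
    # --- collect phase: act epsilon-greedily, store experience ---
    s, done, steps = env.reset(), False, 0
    while not done and steps < 50:
        if random.random() < 0.2:
            a = random.choice([-1, 1])                          # explore
        else:
            a = max([-1, 1], key=lambda b: q.get((s, b), 0.0))  # exploit
        s2, r, done = env.step(a)
        buffer.append((s, a, r, s2, done))
        s, steps = s2, steps + 1
    # --- train phase: one-step Q-learning on sampled experience ---
    for (s, a, r, s2, d) in random.sample(buffer, min(32, len(buffer))):
        target = r + 0.9 * (0.0 if d else max(q.get((s2, b), 0.0) for b in (-1, 1)))
        q[(s, a)] = q.get((s, a), 0.0) + 0.1 * (target - q.get((s, a), 0.0))
```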
According to the phase in which the exploration component is explicitly applied, we divide the methods in Exploration RL into two main categories: Augmented Collecting Strategy and Augmented Training Strategy.
Augmented Collecting Strategy covers the exploration strategies commonly used in the collect phase, which we further divide into four categories (a minimal sketch of the first category follows the list):
Action Selection Perturbation
Action Selection Guidance
State Selection Guidance
Parameter Space Perturbation
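As a concrete illustration of the first category, Action Selection Perturbation, the sketch below shows the two most common perturbations: epsilon-greedy for discrete actions and additive Gaussian noise for continuous actions. The helper names are hypothetical.

```python
# Hypothetical helpers illustrating Action Selection Perturbation.
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Return the greedy action with prob. 1 - epsilon, else a uniform random one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def gaussian_perturbed(action, sigma=0.1):
    """Add zero-mean Gaussian noise to each dimension of a continuous action."""
    return [a + random.gauss(0.0, sigma) for a in action]
```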
Augmented Training Strategy covers the exploration strategies commonly used in the train phase, which we further divide into seven categories (a count-based example follows the list):
Count Based
Prediction Based
Information Theory Based
Entropy Augmented
Bayesian Posterior Based
Goal Based
(Expert) Demo Data
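To make the first train-phase category concrete, the sketch below implements a count-based intrinsic bonus of the form beta / sqrt(N(s)), in the spirit of #Exploration [4]. The function and table names are illustrative, not from any library.

```python
# Illustrative count-based intrinsic bonus, in the spirit of #Exploration [4]:
# rarely visited states receive a larger bonus beta / sqrt(N(s)).
import math
from collections import defaultdict

visit_counts = defaultdict(int)

def shaped_reward(state, extrinsic_reward, beta=0.1):
    """Augment the extrinsic reward with a count-based exploration bonus."""
    visit_counts[state] += 1
    return extrinsic_reward + beta / math.sqrt(visit_counts[state])
```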
Note that these categories may overlap, and an algorithm may belong to several of them. For more detailed surveys on exploration methods in RL, you can refer to Tianpei Yang et al. and Susan Amin et al.
Here are the links to the papers that appear in the taxonomy:
[1] Go-Explore: Adrien Ecoffet et al, 2021
[2] NoisyNet: Meire Fortunato et al, 2018
[3] DQN-PixelCNN: Marc G. Bellemare et al, 2016
[4] #Exploration: Haoran Tang et al, 2017
[5] EX2: Justin Fu et al, 2017
[6] ICM: Deepak Pathak et al, 2017
[7] RND: Yuri Burda et al, 2018
[8] NGU: Adrià Puigdomènech Badia et al, 2020
[9] Agent57: Adrià Puigdomènech Badia et al, 2020
[10] VIME: Rein Houthooft et al, 2016
[11] EMI: Hyoungseok Kim et al, 2019
[12] DIAYN: Benjamin Eysenbach et al, 2019
[13] SAC: Tuomas Haarnoja et al, 2018
[14] BootstrappedDQN: Ian Osband et al, 2016
[15] PSRL: Ian Osband et al, 2013
[16] HER: Marcin Andrychowicz et al, 2017
[17] DQfD: Todd Hester et al, 2018
[18] R2D3: Caglar Gulcehre et al, 2019
format:
- [title](paper link) (presentation type, openreview score [if the score is public])
  - author1, author2, author3, ...
  - Key: key problems and insights
  - ExpEnv: experiment environments
A Theoretical Explanation of Deep RL Performance in Stochastic Environments
DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization
METRA: Scalable Unsupervised RL with Metric-Aware Abstraction
Text2Reward: Reward Shaping with Language Models for Reinforcement Learning
Pre-Training Goal-based Models for Sample-Efficient Reinforcement Learning
Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning
Simple Hierarchical Planning with Diffusion
Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks
PAE: Reinforcement Learning from External Knowledge for Efficient Exploration
In-context Exploration-Exploitation for Reinforcement Learning
Learning to Act without Actions
On the Importance of Exploration for Generalization in Reinforcement Learning
Monte Carlo Tree Search with Boltzmann Exploration
Breadcrumbs to the Goal: Supervised Goal Selection from Human-in-the-Loop Feedback
MIMEx: Intrinsic Rewards from Masked Input Modeling
Accelerating Exploration with Unlabeled Prior Data
On the Convergence and Sample Complexity Analysis of Deep Q-Networks with ε-Greedy Exploration
Pitfall of Optimism: Distributional Reinforcement Learning by Randomizing Risk Criterion
CQM: Curriculum Reinforcement Learning with a Quantized World Model
Safe Exploration in Reinforcement Learning: A Generalized Formulation and Algorithms
Successor-Predecessor Intrinsic Exploration
Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration
ELDEN: Exploration via Local Dependencies
A Study of Global and Episodic Bonuses for Exploration in Contextual MDPs
Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments
Representations and Exploration for Deep Reinforcement Learning using Singular Value Decomposition
Reparameterized Policy Learning for Multimodal Trajectory Optimization
Flipping Coins to Estimate Pseudocounts for Exploration in Reinforcement Learning
Fast Rates for Maximum Entropy Exploration
Guiding Pretraining in Reinforcement Learning with Large Language Models
Go Beyond Imagination: Maximizing Episodic Reachability with World Models
Efficient Online Reinforcement Learning with Offline Data
Anti-Exploration by Random Network Distillation
The Impact of Exploration on Convergence and Performance of Multi-Agent Q-Learning Dynamics
An Adaptive Entropy-Regularization Framework for Multi-Agent Reinforcement Learning
Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning
Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection (Oral: 10, 8, 8)
The Role of Coverage in Online Reinforcement Learning (Oral: 8, 8, 5)
Near-optimal Policy Identification in Active Reinforcement Learning (Oral: 8, 8, 8)
Planning Goals for Exploration (Spotlight: 8, 8, 8, 8, 6)
Pink Noise Is All You Need: Colored Noise Exploration in Deep Reinforcement Learning (Spotlight: 8, 8, 8)
Learning About Progress From Experts (Spotlight: 8, 8, 6)
DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems (Spotlight: 10, 8, 8, 8)
Does Zero-Shot Reinforcement Learning Exist? (Spotlight: 10, 8, 8, 3)
Human-level Atari 200x faster (Poster: 8, 8, 3)
Learning Achievement Structure for Structured Exploration in Domains with Sparse Reward (Poster: 8, 8, 5, 5)
Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-Free RL (Poster: 8, 8, 6, 6)
Latent State Marginalization as a Low-cost Approach to Improving Exploration (Poster: 6, 6, 6)
Revisiting Curiosity for Exploration in Procedurally Generated Environments (Poster: 8, 8, 5, 3, 3)
MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations (Poster: 8, 6, 6, 6)
Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective (Poster: 8, 6, 6, 6, 6)
EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model (Poster: 6, 6, 6, 6)
Guarded Policy Optimization with Imperfect Online Demonstrations (Oral: 8, 8, 6, 5)
Redeeming Intrinsic Rewards via Constrained Optimization (Poster: 8, 7, 7)
You Only Live Once: Single-Life Reinforcement Learning via Learned Reward Shaping (Poster: 6, 6, 5, 5)
Curious Exploration via Structured World Models Yields Zero-Shot Object Manipulation (Poster: 8, 7, 6)
Model-based Lifelong Reinforcement Learning with Bayesian Exploration (Poster: 7, 6, 6)
On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL (Poster: 7, 6, 5, 5)
DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning (Poster: 8, 7, 4)
Bayesian Optimistic Optimization: Optimistic Exploration for Model-based Reinforcement Learning
Active Exploration for Inverse Reinforcement Learning (Poster: 7, 7, 7, 7)
Exploration-Guided Reward Shaping for Reinforcement Learning under Sparse Rewards (Poster: 6, 6, 4)
Monte Carlo Augmented Actor-Critic for Sparse Reward Deep Reinforcement Learning from Suboptimal Demonstrations (Poster: 6, 6, 5, 5)
Incentivizing Combinatorial Bandit Exploration (Poster: 7, 6, 5, 3)
From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses (Oral)
The Importance of Non-Markovianity in Maximum State Entropy Exploration (Oral)
Phasic Self-Imitative Reduction for Sparse-Reward Goal-Conditioned Reinforcement Learning (Spotlight)
Thompson Sampling for (Combinatorial) Pure Exploration (Spotlight)
Near-Optimal Algorithms for Autonomous Exploration and Multi-Goal Stochastic Shortest Path (Spotlight)
Safe Exploration for Efficient Policy Evaluation and Comparison (Spotlight)
The Information Geometry of Unsupervised Reinforcement Learning (Oral: 8, 8, 8)
When should agents explore? (Spotlight: 8, 8, 6, 6)
Learning more skills through optimistic exploration (Spotlight: 8, 8, 8, 6)
Learning Long-Term Reward Redistribution via Randomized Return Decomposition (Spotlight: 8, 8, 8, 5)
Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration (Spotlight: 8, 8, 8, 6, 6)
Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning (Spotlight: 8, 8, 8, 6)
Learning Altruistic Behaviours in Reinforcement Learning without External Rewards (Spotlight: 8, 8, 6, 6)
Anti-Concentrated Confidence Bonuses for Scalable Exploration (Poster: 8, 6, 5)
Lipschitz-constrained Unsupervised Skill Discovery (Poster: 8, 6, 6, 6)
LIGS: Learnable Intrinsic-Reward Generation Selection for Multi-Agent Learning (Poster: 8, 6, 5, 5)
Multi-Stage Episodic Control for Strategic Exploration in Text Games (Spotlight: 8, 8, 6, 6)
On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning (Poster: 8, 8, 5, 5)
Interesting Object, Curious Agent: Learning Task-Agnostic Exploration (Oral: 9, 8, 8, 8)
Tactical Optimism and Pessimism for Deep Reinforcement Learning (Poster: 9, 7, 6, 6)
Which Mutual-Information Representation Learning Objectives are Sufficient for Control? (Poster: 7, 6, 6, 5)
On the Theory of Reinforcement Learning with Once-per-Episode Feedback (Poster: 6, 5, 5, 4)
MADE: Exploration via Maximizing Deviation from Explored Regions (Poster: 7, 7, 6, 5)
Adversarial Intrinsic Motivation for Reinforcement Learning (Poster: 7, 7, 6)
Information Directed Reward Learning for Reinforcement Learning (Poster: 9, 8, 7, 6)
Dynamic Bottleneck for Robust Self-Supervised Exploration (Poster: 8, 6, 6, 6)
Hierarchical Skills for Efficient Exploration (Poster: 7, 6, 6, 6)
Exploration-Exploitation in Multi-Agent Competition: Convergence with Bounded Rationality (Spotlight: 8, 6, 6)
NovelD: A Simple yet Effective Exploration Criterion (Poster: 7, 6, 6, 6)
Episodic Multi-agent Reinforcement Learning with Curiosity-driven Exploration (Poster: 7, 6, 6, 5)
Learning Diverse Policies in MOBA Games via Macro-Goals (Poster: 7, 6, 5, 5)
CIC: Contrastive Intrinsic Control for Unsupervised Skill Discovery (not accepted at the time of writing: 8, 8, 6, 3)
A Contextual-Bandit Approach to Personalized News Article Recommendation WWW 2010
(More) Efficient Reinforcement Learning via Posterior Sampling NeurIPS 2013
An Empirical Evaluation of Thompson Sampling NeurIPS 2011
A Tutorial on Thompson Sampling arXiv 2017
Unifying Count-Based Exploration and Intrinsic Motivation NeurIPS 2016
Deep Exploration via Bootstrapped DQN NeurIPS 2016
VIME: Variational Information Maximizing Exploration NeurIPS 2016
#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning NeurIPS 2017
EX2: Exploration with Exemplar Models for Deep Reinforcement Learning NeurIPS 2017
Hindsight Experience Replay NeurIPS 2017
Curiosity-driven Exploration by Self-supervised Prediction ICML 2017
Deep Q-learning from Demonstrations AAAI 2018
Noisy Networks for Exploration ICLR 2018
Exploration by Random Network Distillation ICLR 2019
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor ICML 2018
Large-Scale Study of Curiosity-Driven Learning ICLR 2019
Diversity is All You Need: Learning Skills without a Reward Function ICLR 2019
Episodic Curiosity through Reachability ICLR 2019
EMI: Exploration with Mutual Information ICML 2019
Making Efficient Use of Demonstrations to Solve Hard Exploration Problems arXiv 2019
Optimistic Exploration even with a Pessimistic Initialisation ICLR 2020
RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments ICLR 2020
Never Give Up: Learning Directed Exploration Strategies ICLR 2020
Agent57: Outperforming the Atari Human Benchmark ICML 2020
Neural Contextual Bandits with UCB-based Exploration ICML 2020
Rank the Episodes: A Simple Approach for Exploration in Procedurally-Generated Environments ICLR 2021
First Return, Then Explore Nature 2021
Our goal is to provide a starter paper guide for those who are interested in exploration methods in RL. If you are interested in contributing, please refer to HERE for contribution instructions.
Awesome Exploration RL is released under the Apache 2.0 license.