References on Optimal Control, Reinforcement Learning and Motion Planning
REPS
Relative Entropy Policy Search, Peters J. et al. (2010).
ExpectiMinimax
Optimal strategy in games with chance nodes, Melkó E., Nagy B. (2007).
Sparse sampling
A sparse sampling algorithm for near-optimal planning in large Markov decision processes, Kearns M. et al. (2002).
MCTS
Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, Coulom R. (2006).
UCT
Bandit based Monte-Carlo Planning, Kocsis L., Szepesvári C. (2006).
OPD
Optimistic Planning for Deterministic Systems, Hren J., Munos R. (2008).
OLOP
Open Loop Optimistic Planning, Bubeck S., Munos R. (2010).
SOOP
Optimistic Planning for Continuous-Action Deterministic Systems, Buşoniu L. et al. (2011).
OPSS
Optimistic planning for sparsely stochastic systems, Buşoniu L., Munos R., De Schutter B., Babuska R. (2011).
HOOT
Sample-Based Planning for Continuous Action Markov Decision Processes, Mansley C., Weinstein A., Littman M. (2011).
HOLOP
Bandit-Based Planning and Learning in Continuous-Action Markov Decision Processes, Weinstein A., Littman M. (2012).
BRUE
Simple Regret Optimization in Online Planning for Markov Decision Processes, Feldman Z., Domshlak C. (2014).
LGP
Logic-Geometric Programming: An Optimization-Based Approach to Combined Task and Motion Planning, Toussaint M. (2015). 🎞️
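Most of the tree-search entries above (MCTS, UCT, OPD, OLOP) share the same optimistic child-selection rule: pick the child maximizing an empirical mean plus an exploration bonus. A minimal sketch of the UCT selection step from Kocsis & Szepesvári (2006) — function and data layout are my own, for illustration only:

```python
import math

def uct_select(children, c=math.sqrt(2)):
    """Return the index of the child maximizing the UCT score.

    children: list of (total_value, visit_count) pairs.
    Unvisited children score +inf, so they are expanded first.
    """
    total_visits = sum(n for _, n in children)

    def score(child):
        value, visits = child
        if visits == 0:
            return float("inf")  # always try unvisited children first
        # empirical mean + optimism bonus shrinking with visit count
        return value / visits + c * math.sqrt(math.log(total_visits) / visits)

    return max(range(len(children)), key=lambda i: score(children[i]))
```

For example, with children `[(3.0, 10), (1.0, 2), (0.0, 0)]` the unvisited third child is selected; once all children are visited, the bonus trades off mean value against visit count.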
AlphaGo
Mastering the game of Go with deep neural networks and tree search, Silver D. et al. (2016).
AlphaGo Zero
Mastering the game of Go without human knowledge, Silver D. et al. (2017).
AlphaZero
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Silver D. et al. (2017).
TrailBlazer
Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning, Grill J. B., Valko M., Munos R. (2017).
MCTSnets
Learning to search with MCTSnets, Guez A. et al. (2018).
ADI
Solving the Rubik's Cube Without Human Knowledge, McAleer S. et al. (2018).
OPC/SOPC
Continuous-action planning for discounted infinite-horizon nonlinear optimal control with Lipschitz values, Buşoniu L., Pall E., Munos R. (2018).
PI²
A Generalized Path Integral Control Approach to Reinforcement Learning, Theodorou E. et al. (2010).
PI²-CMA
Path Integral Policy Improvement with Covariance Matrix Adaptation, Stulp F., Sigaud O. (2010).
iLQG
A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems, Todorov E. (2005). :octocat:
iLQG+
Synthesis and stabilization of complex behaviors through online trajectory optimization, Tassa Y. (2012).
MPCC
Optimization-based autonomous racing of 1:43 scale RC cars, Liniger A. et al. (2014). 🎞️ | 🎞️
MIQP
Optimal trajectory planning for autonomous driving integrating logical constraints: An MIQP perspective, Qian X., Altché F., Bender P., Stiller C., de La Fortelle A. (2016).
Robust DP
Robust Dynamic Programming, Iyengar G. (2005).
Tube-MPPI
Robust Sampling Based Model Predictive Control with Sparse Objective Information, Williams G. et al. (2018). 🎞️
RA-QMDP
Risk-averse Behavior Planning for Autonomous Driving under Uncertainty, Naghshvar M. et al. (2018).
StoROO
X-Armed Bandits: Optimizing Quantiles and Other Risks, Torossian L., Garivier A., Picheny V. (2019).
ICS
Will the Driver Seat Ever Be Empty?, Fraichard T. (2014).SafeOPT
Safe Controller Optimization for Quadrotors with Gaussian Processes, Berkenkamp F., Schoellig A., Krause A. (2015). 🎞️ :octocat:
SafeMDP
Safe Exploration in Finite Markov Decision Processes with Gaussian Processes, Turchetta M., Berkenkamp F., Krause A. (2016). :octocat:
RSS
On a Formal Model of Safe and Scalable Self-driving Cars, Shalev-Shwartz S. et al. (2017).
CPO
Constrained Policy Optimization, Achiam J., Held D., Tamar A., Abbeel P. (2017). :octocat:
RCPO
Reward Constrained Policy Optimization, Tessler C., Mankowitz D., Mannor S. (2018).
BFTQ
A Fitted-Q Algorithm for Budgeted MDPs, Carrara N. et al. (2018).
SafeMPC
Learning-based Model Predictive Control for Safe Exploration, Koller T., Berkenkamp F., Turchetta M., Krause A. (2018).
CCE
Constrained Cross-Entropy Method for Safe Reinforcement Learning, Wen M., Topcu U. (2018). :octocat:
LTL-RL
Reinforcement Learning with Probabilistic Guarantees for Autonomous Driving, Bouton M. et al. (2019).
Envelope MOQ-Learning
A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation, Yang R. et al. (2019).
HJI-reachability
Safe learning for control: Combining disturbance estimation, reachability analysis and reinforcement learning with systematic exploration, Heidenreich C. (2017).
MPC-HJI
On Infusing Reachability-Based Safety Assurance within Probabilistic Planning Frameworks for Human-Robot Vehicle Interactions, Leung K. et al. (2018).
Lyapunov-Net
Safe Interactive Model-Based Learning, Gallieri M. et al. (2019).
ATACOM
Robot Reinforcement Learning on the Constraint Manifold, Liu P. et al. (2021).
TS
On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Thompson W. (1933).
UCB1 / UCB2
Finite-time Analysis of the Multiarmed Bandit Problem, Auer P., Cesa-Bianchi N., Fischer P. (2002).
Empirical Bernstein / UCB-V
Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, Audibert J-Y., Munos R., Szepesvári C. (2009).
kl-UCB
The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond, Garivier A., Cappé O. (2011).
KL-UCB
Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation, Cappé O. et al. (2013).
IDS
Information Directed Sampling and Bandits with Heteroscedastic Noise, Kirschner J., Krause A. (2018).
LinUCB
A Contextual-Bandit Approach to Personalized News Article Recommendation, Li L. et al. (2010).
OFUL
Improved Algorithms for Linear Stochastic Bandits, Abbasi-Yadkori Y., Pál D., Szepesvári C. (2011).
Successive Elimination
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, Even-Dar E. et al. (2006).
LUCB
PAC Subset Selection in Stochastic Multi-armed Bandits, Kalyanakrishnan S. et al. (2012).
UGapE
Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence, Gabillon V., Ghavamzadeh M., Lazaric A. (2012).
Sequential Halving
Almost Optimal Exploration in Multi-Armed Bandits, Karnin Z. et al. (2013).
M-LUCB / M-Racing
Maximin Action Identification: A New Bandit Framework for Games, Garivier A., Kaufmann E., Koolen W. (2016).
Track-and-Stop
Optimal Best Arm Identification with Fixed Confidence, Garivier A., Kaufmann E. (2016).
LUCB-micro
Structured Best Arm Identification with Fixed Confidence, Huang R. et al. (2017).
GP-UCB
Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design, Srinivas N., Krause A., Kakade S., Seeger M. (2009).
HOO
X-Armed Bandits, Bubeck S., Munos R., Stoltz G., Szepesvári C. (2009).
DOO/SOO
Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness, Munos R. (2011).
StoOO
From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning, Munos R. (2014).
StoSOO
Stochastic Simultaneous Optimistic Optimization, Valko M., Carpentier A., Munos R. (2013).
POO
Black-box optimization of noisy functions with unknown smoothness, Grill J-B., Valko M., Munos R. (2015).
EI-GP
Bayesian Optimization in AlphaGo, Chen Y. et al. (2018).
UCRL2
Near-optimal Regret Bounds for Reinforcement Learning, Jaksch T. (2010).
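UCRL2 applies the same optimism-in-the-face-of-uncertainty principle as UCB1 from Auer et al. (2002), listed in the bandit section above. A toy UCB1 loop on Bernoulli arms, as a sketch (function name and interface are my own):

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 (Auer et al., 2002) on Bernoulli arms; return pull counts per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # initialization: pull each arm once
        else:
            # empirical mean + sqrt(2 log t / n_i) optimism bonus
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

Over a few hundred rounds the pull counts concentrate on the best arm while every arm keeps a logarithmic share of exploration.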
PSRL
Why is Posterior Sampling Better than Optimism for Reinforcement Learning?, Osband I., Van Roy B. (2016).
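The posterior-sampling idea behind PSRL traces back to Thompson (1933); in the Bernoulli-bandit special case it reduces to sampling from Beta posteriors and acting greedily on the sample. A minimal sketch (my own naming, uniform Beta(1, 1) priors assumed):

```python
import random

def thompson_bernoulli(arm_means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling: pull the arm with the largest posterior draw."""
    rng = random.Random(seed)
    k = len(arm_means)
    successes = [1] * k  # Beta(1, 1) uniform priors
    failures = [1] * k
    counts = [0] * k
    for _ in range(horizon):
        # one posterior sample per arm; exploration comes from posterior spread
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        if rng.random() < arm_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        counts[arm] += 1
    return counts
```

PSRL lifts exactly this pattern to MDPs: sample a model from the posterior, solve it, act, update.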
UCBVI
Minimax Regret Bounds for Reinforcement Learning, Azar M., Osband I., Munos R. (2017).
Q-Learning-UCB
Is Q-Learning Provably Efficient?, Jin C., Allen-Zhu Z., Bubeck S., Jordan M. (2018).
LSVI-UCB
Provably Efficient Reinforcement Learning with Linear Function Approximation, Jin C., Yang Z., Wang Z., Jordan M. (2019).
QVI
On the Sample Complexity of Reinforcement Learning with a Generative Model, Azar M., Munos R., Kappen B. (2012).
OFU-LQ
Regret Bounds for the Adaptive Control of Linear Quadratic Systems, Abbasi-Yadkori Y., Szepesvári C. (2011).
TS-LQ
Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems, Abeille M., Lazaric A. (2018).
Coarse-Id
On the Sample Complexity of the Linear Quadratic Regulator, Dean S., Mania H., Matni N., Recht B., Tu S. (2017).
NFQ
Neural fitted Q iteration - First experiences with a data efficient neural Reinforcement Learning method, Riedmiller M. (2005).
DQN
Playing Atari with Deep Reinforcement Learning, Mnih V. et al. (2013). 🎞️
DDQN
Deep Reinforcement Learning with Double Q-learning, van Hasselt H., Silver D. et al. (2015).
DDDQN
Dueling Network Architectures for Deep Reinforcement Learning, Wang Z. et al. (2015). 🎞️
PDDDQN
Prioritized Experience Replay, Schaul T. et al. (2015).
NAF
Continuous Deep Q-Learning with Model-based Acceleration, Gu S. et al. (2016).
Rainbow
Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel M. et al. (2017).
Ape-X DQfD
Observe and Look Further: Achieving Consistent Performance on Atari, Pohlen T. et al. (2018). 🎞️
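DQN and the variants above all train from an experience replay buffer. A minimal uniform-sampling version (prioritized replay, as in Schaul et al., would add priorities and importance weights on top); class and method names are my own sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity uniform experience replay, as popularized by DQN."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement breaks temporal correlations
        return self.rng.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Training alternates `push` (from environment interaction) with `sample` (for gradient steps on the Q-network).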
REINFORCE
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Williams R. (1992).
Natural Gradient
A Natural Policy Gradient, Kakade S. (2002).
TRPO
Trust Region Policy Optimization, Schulman J. et al. (2015). 🎞️
PPO
Proximal Policy Optimization Algorithms, Schulman J. et al. (2017). 🎞️
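The PPO paper's clipped surrogate objective is compact enough to state directly. A sketch of the per-sample scalar term (no autodiff, just the formula from Schulman et al., 2017):

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated advantage A(s, a).
    """
    clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # taking the min removes the incentive to move the ratio outside the clip range
    return min(ratio * advantage, clipped * advantage)
```

In practice this term is averaged over a batch and maximized by gradient ascent on the policy parameters.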
DPPO
Emergence of Locomotion Behaviours in Rich Environments, Heess N. et al. (2017). 🎞️
AC
Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton R. et al. (1999).
NAC
Natural Actor-Critic, Peters J. et al. (2005).
DPG
Deterministic Policy Gradient Algorithms, Silver D. et al. (2014).
DDPG
Continuous Control With Deep Reinforcement Learning, Lillicrap T. et al. (2015). 🎞️ 1 | 2 | 3 | 4
MACE
Terrain-Adaptive Locomotion Skills Using Deep Reinforcement Learning, Peng X., Berseth G., van de Panne M. (2016). 🎞️ | 🎞️
A3C
Asynchronous Methods for Deep Reinforcement Learning, Mnih V. et al. (2016). 🎞️ 1 | 2 | 3
SAC
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja T. et al. (2018). 🎞️
MPO
Maximum a Posteriori Policy Optimisation, Abdolmaleki A. et al. (2018).
CEM
Learning Tetris Using the Noisy Cross-Entropy Method, Szita I., Lörincz A. (2006). 🎞️
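The cross-entropy method behind the Tetris entry (and iCEM further down) is a generic derivative-free optimizer: sample candidates, keep the elites, refit a Gaussian, repeat. A 1-D sketch (my own parameterization; the noise floor loosely echoes the noisy CEM of Szita & Lörincz):

```python
import random
import statistics

def cross_entropy_method(f, mean=0.0, std=5.0, n_samples=50, n_elite=10,
                         n_iters=30, seed=0):
    """Maximize f by iteratively refitting a Gaussian to the elite samples."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        elites = sorted(samples, key=f, reverse=True)[:n_elite]  # best candidates
        mean = statistics.mean(elites)
        std = statistics.stdev(elites) + 1e-6  # floor avoids premature collapse
    return mean
```

On `f(x) = -(x - 3)^2` the sampling distribution contracts around the maximizer at 3; in model-predictive planners the same loop runs over action sequences instead of scalars.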
CMAES
Completely Derandomized Self-Adaptation in Evolution Strategies, Hansen N., Ostermeier A. (2001).
NEAT
Evolving Neural Networks through Augmenting Topologies, Stanley K. (2002). 🎞️
iCEM
Sample-efficient Cross-Entropy Method for Real-time Planning, Pinneri C. et al. (2020).
Dyna
Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, Sutton R. (1990).
PILCO
PILCO: A Model-Based and Data-Efficient Approach to Policy Search, Deisenroth M., Rasmussen C. (2011). (talk)
DBN
Probabilistic MDP-behavior planning for cars, Brechtel S. et al. (2011).GPS
End-to-End Training of Deep Visuomotor Policies, Levine S. et al. (2015). 🎞️
DeepMPC
DeepMPC: Learning Deep Latent Features for Model Predictive Control, Lenz I. et al. (2015). 🎞️
SVG
Learning Continuous Control Policies by Stochastic Value Gradients, Heess N. et al. (2015). 🎞️
FARNN
Nonlinear Systems Identification Using Deep Dynamic Neural Networks, Ogunmolu O. et al. (2016). :octocat:
BPTT
Long-term Planning by Short-term Prediction, Shalev-Shwartz S. et al. (2016). 🎞️ 1 | 2
VIN
Value Iteration Networks, Tamar A. et al. (2016). 🎞️
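VIN embeds the classic value-iteration backup as a differentiable layer; for reference, the underlying tabular recursion is a few lines (data layout is my own sketch):

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8):
    """Tabular value iteration: V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')].

    transitions[s][a]: list of (prob, next_state) pairs; rewards[s][a]: scalar.
    """
    n = len(transitions)
    values = [0.0] * n
    while True:
        new_values = [
            max(rewards[s][a]
                + gamma * sum(p * values[s2] for p, s2 in transitions[s][a])
                for a in range(len(transitions[s])))
            for s in range(n)
        ]
        # stop once the Bellman backup is a fixed point (up to tol)
        if max(abs(a - b) for a, b in zip(values, new_values)) < tol:
            return new_values
        values = new_values
```

VIN replaces the `max` and expectation with convolution and max-pooling so the whole recursion can be trained end-to-end.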
VPN
Value Prediction Network, Oh J. et al. (2017).
DistGBP
Model-Based Planning with Discrete and Continuous Actions, Henaff M. et al. (2017). 🎞️ 1 | 2
Predictron
The Predictron: End-To-End Learning and Planning, Silver D. et al. (2017). 🎞️
MPPI
Information Theoretic MPC for Model-Based Reinforcement Learning, Williams G. et al. (2017). :octocat: 🎞️
PlaNet
Learning Latent Dynamics for Planning from Pixels, Hafner D. et al. (2018). 🎞️
NeuralLander
Neural Lander: Stable Drone Landing Control using Learned Dynamics, Shi G. et al. (2018). 🎞️
DBN+POMCP
Towards Human-Like Prediction and Decision-Making for Automated Vehicles in Highway Scenarios, Sierra Gonzalez D. (2019).
MuZero
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model, Schrittwieser J. et al. (2019). :octocat:
BADGR
BADGR: An Autonomous Self-Supervised Learning-Based Navigation System, Kahn G., Abbeel P., Levine S. (2020). 🎞️ :octocat:
H-UCRL
Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning, Curi S., Berkenkamp F., Krause A. (2020). :octocat:
Pseudo-count
Unifying Count-Based Exploration and Intrinsic Motivation, Bellemare M. et al. (2016). 🎞️
HER
Hindsight Experience Replay, Andrychowicz M. et al. (2017). 🎞️
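The core trick in HER is to replay a failed goal-conditioned trajectory as if its final achieved state had been the goal all along. A sketch of the "final" relabeling strategy from Andrychowicz et al. (2017) — the data layout here is my own assumption:

```python
def hindsight_relabel(trajectory):
    """Relabel a trajectory with its final achieved state as the goal.

    trajectory: list of (state, action, achieved_state) tuples.
    Returns (state, action, goal, reward) transitions with sparse 0/1 reward.
    """
    new_goal = trajectory[-1][2]  # pretend the final achieved state was the goal
    relabeled = []
    for state, action, achieved in trajectory:
        reward = 1.0 if achieved == new_goal else 0.0  # sparse success signal
        relabeled.append((state, action, new_goal, reward))
    return relabeled
```

The relabeled transitions are pushed into the replay buffer alongside the originals, so every episode yields at least one successful goal.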
VHER
Visual Hindsight Experience Replay, Sahni H. et al. (2019).
RND
Exploration by Random Network Distillation, Burda Y. et al. (OpenAI) (2018). 🎞️
Go-Explore
Go-Explore: a New Approach for Hard-Exploration Problems, Ecoffet A. et al. (Uber) (2018). 🎞️
C51-IDS
Information-Directed Exploration for Deep Reinforcement Learning, Nikolov N., Kirschner J., Berkenkamp F., Krause A. (2019). :octocat:
Plan2Explore
Planning to Explore via Self-Supervised World Models, Sekar R. et al. (2020). 🎞️ :octocat:
RIDE
RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments, Raileanu R., Rocktäschel T. (2020). :octocat:
OC
The Option-Critic Architecture, Bacon P-L., Harb J., Precup D. (2016).
FuNs
FeUdal Networks for Hierarchical Reinforcement Learning, Vezhnevets A. et al. (2017).
DeepLoco
DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning, Peng X. et al. (2017). 🎞️ | 🎞️
DAC
DAC: The Double Actor-Critic Architecture for Learning Options, Zhang S., Whiteson S. (2019).
H-REIL
Reinforcement Learning based Control of Imitative Policies for Near-Accident Driving, Cao Z. et al. (2020). 🎞️ 1, 2
PBVI
Point-based Value Iteration: An anytime algorithm for POMDPs, Pineau J. et al. (2003).
cPBVI
Point-Based Value Iteration for Continuous POMDPs, Porta J. et al. (2006).
POMCP
Monte-Carlo Planning in Large POMDPs, Silver D., Veness J. (2010).
MOMDP
Intention-Aware Motion Planning, Bandyopadhyay T. et al. (2013).
DNC
Hybrid computing using a neural network with dynamic external memory, Graves A. et al. (2016). 🎞️
social perception
Behavior Planning of Autonomous Cars with Social Perception, Sun L. et al. (2019).
IT&E
Robots that can adapt like animals, Cully A., Clune J., Tarapore D., Mouret J-B. (2014). 🎞️
MAML
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Finn C., Abbeel P., Levine S. (2017). 🎞️
ME-TRPO
Model-Ensemble Trust-Region Policy Optimization, Kurutach T. et al. (2018). 🎞️
GrBAL / ReBAL
Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning, Nagabandi A. et al. (2018). 🎞️
IT&E
Learning and adapting quadruped gaits with the "Intelligent Trial & Error" algorithm, Dalin E., Desreumaux P., Mouret J-B. (2019). 🎞️
FAMLE
Fast Online Adaptation in Robotics through Meta-Learning Embeddings of Simulated Priors, Kaushik R., Anne T., Mouret J-B. (2020). 🎞️
PACOH
PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees, Rothfuss J., Fortuin V., Josifoski M., Krause A. (2021).
SimGAN
SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning, Jiang Y. et al. (2021). 🎞️ :octocat:
Minimax-Q
Markov games as a framework for multi-agent reinforcement learning, Littman M. (1994).
MILP
Time-optimal coordination of mobile robots along specified paths, Altché F. et al. (2016). 🎞️
MIQP
An Algorithm for Supervised Driving of Cooperative Semi-Autonomous Vehicles, Altché F. et al. (2017). 🎞️
SA-CADRL
Socially Aware Motion Planning with Deep Reinforcement Learning, Chen Y. et al. (2017). 🎞️
MAgent
MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence, Zheng L. et al. (2017). 🎞️
MPPO
Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning, Long P. et al. (2017). 🎞️
COMA
Counterfactual Multi-Agent Policy Gradients, Foerster J. et al. (2017).
MADDPG
Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, Lowe R. et al. (2017). :octocat:
FTW
Human-level performance in first-person multiplayer games with population-based deep reinforcement learning, Jaderberg M. et al. (2018). 🎞️
MAPPO
The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games, Yu C. et al. (2021). [:octocat:](https://github.com/marlbenchmark/on-policy)
DeepDriving
DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving, Chen C. et al. (2015). 🎞️
MERLIN
Unsupervised Predictive Memory in a Goal-Directed Agent, Wayne G. et al. (2018). 🎞️ 1 | 2 | 3 | 4 | 5 | 6
FERM
A Framework for Efficient Robotic Manipulation, Zhan A., Zhao R. et al. (2021). :octocat:
S4RL
S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning, Sinha S. et al. (2021).
SPI-BB
Safe Policy Improvement with Baseline Bootstrapping, Laroche R. et al. (2019).
AWAC
AWAC: Accelerating Online Reinforcement Learning with Offline Datasets, Nair A. et al. (2020).
CQL
Conservative Q-Learning for Offline Reinforcement Learning, Kumar A. et al. (2020).
DAgger
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross S., Gordon G., Bagnell J. A. (2011).
QMDP-RCNN
Reinforcement Learning via Recurrent Convolutional Neural Networks, Shankar T. et al. (2016). (talk)
DQfD
Learning from Demonstrations for Real World Reinforcement Learning, Hester T. et al. (2017). 🎞️
GAIL
Generative Adversarial Imitation Learning, Ho J., Ermon S. (2016).
Branched
End-to-end Driving via Conditional Imitation Learning, Codevilla F. et al. (2017). 🎞️ | talk
UPN
Universal Planning Networks, Srinivas A. et al. (2018). 🎞️
DeepMimic
DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills, Peng X. B. et al. (2018). 🎞️
R2P2
Deep Imitative Models for Flexible Inference, Planning, and Control, Rhinehart N. et al. (2018). 🎞️
PS-GAIL
Multi-Agent Imitation Learning for Driving Simulation, Bhattacharyya R. et al. (2018). 🎞️ :octocat:
Projection
Apprenticeship learning via inverse reinforcement learning, Abbeel P., Ng A. (2004).
MMP
Maximum margin planning, Ratliff N. et al. (2006).
BIRL
Bayesian inverse reinforcement learning, Ramachandran D., Amir E. (2007).
MEIRL
Maximum Entropy Inverse Reinforcement Learning, Ziebart B. et al. (2008).
LEARCH
Learning to search: Functional gradient techniques for imitation learning, Ratliff N., Silver D., Bagnell J. A. (2009).
CIOC
Continuous Inverse Optimal Control with Locally Optimal Examples, Levine S., Koltun V. (2012). 🎞️
MEDIRL
Maximum Entropy Deep Inverse Reinforcement Learning, Wulfmeier M. (2015).
GCL
Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization, Finn C. et al. (2016). 🎞️
RIRL
Repeated Inverse Reinforcement Learning, Amin K. et al. (2017).
Dijkstra
A Note on Two Problems in Connexion with Graphs, Dijkstra E. W. (1959).
A*
A Formal Basis for the Heuristic Determination of Minimum Cost Paths, Hart P. et al. (1968).
RRT*
Sampling-based Algorithms for Optimal Motion Planning, Karaman S., Frazzoli E. (2011). 🎞️
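The graph-search classics in this block are a few lines with a binary heap; a sketch of Dijkstra's algorithm (A* is the same loop with a heuristic added to the queue priority):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source; graph[u] = list of (v, weight) edges."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry for an already-settled node
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

For example, on `{"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)]}` the shortest a-to-c distance is 3.0 via b.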
LQG-MP
LQG-MP: Optimized Path Planning for Robots with Motion Uncertainty and Imperfect State Information, van den Berg J. et al. (2010).
PRM-RL
PRM-RL: Long-range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning, Faust A. et al. (2017).
PF
Real-time obstacle avoidance for manipulators and mobile robots, Khatib O. (1986).
VFH
The Vector Field Histogram - Fast Obstacle Avoidance For Mobile Robots, Borenstein J. (1991).
VFH+
VFH+: Reliable Obstacle Avoidance for Fast Mobile Robots, Ulrich I., Borenstein J. (1998).
Velocity Obstacles
Motion planning in dynamic environments using velocity obstacles, Fiorini P., Shiller Z. (1998).