Reinforcement-Learning-for-Decision-Making-in-self-driving-cars
A practical implementation of RL algorithms for Decision-Making in Autonomous Driving.
Related video: an introduction to the concepts of Markov Decision Process (MDP) and Reinforcement Learning (RL), with a focus on applications for Autonomous Driving.
My repository is structured as follows. Each time, a main, an agent and an environment are required:
- `src` contains:
  - `environments`:
    - `main_simple_road.py` is the main for the `simple_road` environment.
  - `brains`:
    - `simple_brains.py` contains the definition of simple agents. In particular, the `q_table` is stored either as a `collections.defaultdict`, a `pandas.DataFrame` or a `numpy.array`; the `learn()` method differs accordingly.
    - `simple_DQN_tensorflow.py` contains the definition of a DQN agent with TensorFlow.
    - `simple_DQN_pytorch.py` contains the definition of a DQN agent with PyTorch (comment out the corresponding `import` statements if you do not want to use it).
A very basic scenario, but useful to apply and understand RL concepts.
The environment is designed similarly to an OpenAI Gym environment.
Environment definition:
In addition, the driver must respect different constraints.
Let's make the agent learn how to do that!
To visualize the progress of the driver at each episode, a tkinter-based animation can be used (the goal cell is marked "G").
- Note 1: the animation is launched from `main_simple_road.py`.
- Note 2: to disable the animation (the Tkinter window), set `flag_tkinter = False` in `main_simple_road.py`. If you do not have the `Tkinter` module, set the flag to `False` and uncomment the first line in the definition of the class `Road` in `environments/road_env.py`.
- Discrete state space: the state is made of the `position` and `velocity` features (for instance, the initial state is `[0, 3]`).
- Discrete action space:
  - `no_change` - maintain the current speed
  - `speed_up` - accelerate
  - `speed_up_up` - hard accelerate
  - `slow_down` - decelerate
  - `slow_down_down` - hard decelerate
- Reward function: the rewards are classified into 4 groups (see `environments/road_env.py`).
- Transition function: the transitions are based on the next velocity, e.g. if the agent goes for `velocity = 3`, then its next position will be 3 cells further (see the sketch after this list).
- Termination condition: the episode ends once the agent reaches `position = 18` or beyond. The task is considered solved when the average return exceeds `+17` over `100` consecutive episodes.
- Finally, hard constraints are used to eliminate certain undesirable behaviours.
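As an illustration, here is a minimal sketch of the transition logic described above. The action-to-acceleration mapping (±1 / ±2) is my reading of the five actions; the actual implementation lives in `environments/road_env.py`:

```python
# Hypothetical sketch of the deterministic transition, not the actual road_env.py code.
ACTION_TO_ACCELERATION = {
    "no_change": 0,
    "speed_up": +1,
    "speed_up_up": +2,
    "slow_down": -1,
    "slow_down_down": -2,
}

def transition(position, velocity, action):
    """Apply the action, then move by as many cells as the *next* velocity."""
    next_velocity = velocity + ACTION_TO_ACCELERATION[action]
    next_position = position + next_velocity  # e.g. next velocity = 3 -> 3 cells further
    return next_position, next_velocity

# consistent with the sample episode shown further below: [0, 3] --no_change--> [3, 3]
assert transition(0, 3, "no_change") == (3, 3)
```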
Create a conda environment:

```bash
conda create --name rl-for-ad python=3.6
```

Install the packages:

```bash
pip install -r requirements.txt
```

Using Python 3 with the following modules:
All the RL algorithms presented in these figures are implemented.
- Figure: DP-, MC- and TD-backups are implemented.
- Figure: model-based and model-free control methods are implemented.
In `src`, `main_simple_road.py` is the central file you want to use.
Choose:
- the control agent you want to use (uncomment it and comment out the others):
  - Q-learning (= max-SARSA; see the sketch after this list)
  - SARSA
  - SARSA-lambda
  - expected-SARSA
  - Monte-Carlo
  - Dynamic Programming
- the task, playing with the flags:
  - training
  - testing
  - hyper-parameter tuning
- whether you want the environment window to be displayed (flag `flag_tkinter`)
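To make the "(= max-SARSA)" remark above concrete, here is a minimal sketch of the two one-step TD targets. The helper functions are hypothetical, not the exact code of `simple_brains.py`:

```python
# Hypothetical helpers showing the only difference between the two one-step TD targets.
def q_learning_target(reward, gamma, q_next_row):
    # off-policy: bootstrap with the best next action, hence "max-SARSA"
    return reward + gamma * max(q_next_row.values())

def sarsa_target(reward, gamma, q_next_row, next_action):
    # on-policy: bootstrap with the action actually chosen for the next step
    return reward + gamma * q_next_row[next_action]
```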
For the simple environment, I tried different formats to store the q-values table. Each one has its advantages and drawbacks.
The q-values are stored in a q-table that looks like:
```
[id][-------------------------actions---------------------------] [--state features--]
     no_change   speed_up  speed_up_up  slow_down  slow_down_down  position  velocity
0       -4.500  -4.500000       3.1441  -3.434166       -3.177462       0.0       0.0
1       -1.260  -1.260000       9.0490   0.000000        0.000000       2.0       2.0
2        0.396   0.000000       0.0000   0.000000        0.000000       4.0       2.0
3        2.178   0.000000       0.0000   0.000000        0.000000       6.0       2.0
```
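As an illustration of the `collections.defaultdict` format mentioned above, here is a minimal, hypothetical sketch (the actual agent classes live in `simple_brains.py`):

```python
import collections
import numpy as np

ACTIONS = ["no_change", "speed_up", "speed_up_up", "slow_down", "slow_down_down"]

# hypothetical sketch: one row of q-values per state key, initialised to zeros
q_table = collections.defaultdict(lambda: np.zeros(len(ACTIONS)))

state = (0, 0)                                     # (position, velocity)
q_table[state][ACTIONS.index("speed_up_up")] = 3.1441
greedy_action = ACTIONS[int(np.argmax(q_table[state]))]
print(greedy_action)  # speed_up_up
```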
SARSA-lambda updates the model by giving credit to all the steps that contributed to the final return. It can consider:
- only the last step (`lambda = 0`, equivalent to SARSA),
- all the steps of the episode (`lambda = 1`, close to Monte-Carlo),
- or something in between (`lambda` in `[0, 1]`).

It is useful to visualize the eligibility trace used by SARSA-lambda. Here is an example with `lambda = 0.2` and `gamma = 0.99`.
The `id` denotes the index of occurrence: the smaller the index, the older the experience. The first experience was seen 6 steps ago. Therefore, its trace is `1 * (lambda * gamma) ** 6 = 0.000060`.
The trace decays quickly due to the small value of `lambda`. For this reason, this setting is closer to SARSA than to Monte-Carlo.
```
[id][-------------------------actions---------------------------] [--state features--]
     no_change  speed_up  speed_up_up  slow_down  slow_down_down  position  velocity
0     0.000060  0.000000        0.000        0.0        0.000000       0.0       3.0
1     0.000000  0.000304        0.000        0.0        0.000000       3.0       3.0
2     0.001537  0.000000        0.000        0.0        0.000000       7.0       4.0
3     0.000000  0.000000        0.000        0.0        0.007762      11.0       4.0
4     0.000000  0.000000        0.000        0.0        0.039204      13.0       2.0
5     0.000000  0.000000        0.198        0.0        0.000000      13.0       0.0
6     0.000000  0.000000        1.000        0.0        0.000000      15.0       2.0
7     0.000000  0.000000        0.000        0.0        0.000000      19.0       4.0
```
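A quick sanity check of these decay values (assuming the trace is set to 1 when a pair is visited and multiplied by `lambda * gamma` at every subsequent step):

```python
LAMBDA, GAMMA = 0.2, 0.99

def trace(steps_ago):
    # the trace is set to 1 on visit, then multiplied by (lambda * gamma) at each later step
    return 1 * (LAMBDA * GAMMA) ** steps_ago

print(round(trace(6), 6))  # 6e-05 -> the 0.000060 of entry id=0, visited 6 steps ago
print(round(trace(1), 6))  # 0.198 -> entry id=5, visited one step before the most recent one
```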
Figure: optimal policy and value function.
The Policy Iteration algorithm approximates the optimal policy π*.
Observations:
- A pedestrian stands at `position = 12`: the velocity must be smaller than `3` when passing `position = 12`. Otherwise, an important negative reward is given.
- The values of states close to `position = 12` with `velocity >= 4` are therefore very low: there is no chance for these states to slow down enough before passing the pedestrian. Hence, they cannot escape the high penalty.
- I noticed that the convergence of Policy Iteration is faster (~10 times) than Value Iteration:

```
# Duration of Value Iteration  = 114.28 - counter = 121 - delta_value_functions = 9.687738053543171e-06
# Duration of Policy Iteration =  12.44 - counter = 5   - delta_policy = 0.0 with theta = 1e-3 and final theta = 1e-5
```
- In some configurations, the returned policy recommends `no_change` whatever the initial velocity. This cannot be the optimal policy.

Model-free and model-based agents all propose trajectories that are close or equal to the optimal one. For instance, the following episode (list of "state-action" pairs):
```
[[0, 3], 'no_change', [3, 3], 'no_change', [6, 3], 'no_change', [9, 3], 'slow_down', [11, 2], 'no_change', [13, 2], 'speed_up', [16, 3], 'no_change', [19, 3]]
```
Which makes sense:
It is possible to play with hyper-parameters and appreciate their impacts:
```python
hyper_parameters = (
    method_used,  # the control RL-method
    0.99,         # gamma_learning
    0.02,         # learning_rate_learning
    1.0,          # eps_start_learning
    0.02,         # eps_end_training
    0.998466,     # eps_decay_training
)
```
I noticed that the decay rate of epsilon has a substantial impact on the convergence and the performance of the model-free agents. I therefore implemented an epsilon-decay scheduling. At each episode: `eps = max(eps_end, eps_decay * eps)`.
- `eps_end` is reached at `episode_id = log10(eps_end / eps_start) / log10(eps_decay)`.
- Conversely, to reach `eps_end` after `episode_id` episodes, use `eps_decay = (eps_end / eps_start) ** (1 / episode_id)`.
- `eps_decay_training = 0.998466` (i.e. about 3000 episodes) helps converging to a robust solution for all model-free agents.
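A minimal sketch of this scheduling and of the two formulas above, using the hyper-parameter values listed earlier:

```python
import math

eps_start, eps_end = 1.0, 0.02
eps_decay = 0.998466

# at each episode, decay epsilon but never go below eps_end
eps = eps_start
for episode in range(4000):
    eps = max(eps_end, eps_decay * eps)

# episode at which eps_end is reached, for a given decay rate
episode_id = math.log10(eps_end / eps_start) / math.log10(eps_decay)

# conversely, the decay rate that reaches eps_end after a target number of episodes
eps_decay_for_target = (eps_end / eps_start) ** (1 / episode_id)
```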
I implemented the action-masking mechanism described in Simon Chauvin, "Hierarchical Decision-Making for Autonomous Driving". It helps reduce exploration and it ensures safety.
In this example of q-values for `position = 1`: when driving at the maximal speed = 5, the agent is prevented from taking the `speed_up` actions (`q = -inf`).
```
velocity  no_change   speed_up  speed_up_up  slow_down  slow_down_down
0         -3.444510  -0.892310    -0.493900       -inf            -inf
1          1.107690   1.506100     1.486100  -5.444510            -inf
2          3.506100   3.486100     2.782100  -0.892310       -7.444510
3          5.486100   4.782100         -inf   1.506100       -2.892310
4          6.782100       -inf         -inf   3.486100       -0.493900
5              -inf       -inf         -inf   4.782100        1.486100
```
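A minimal, hypothetical sketch of such a masking step. The action-to-acceleration mapping and the allowed velocity range `[0, 4]` are my reading of the table above and of the configuration below; the actual mechanism follows the paper cited above:

```python
import numpy as np

ACTIONS = ["no_change", "speed_up", "speed_up_up", "slow_down", "slow_down_down"]
ACTION_TO_ACCELERATION = {"no_change": 0, "speed_up": 1, "speed_up_up": 2,
                          "slow_down": -1, "slow_down_down": -2}
MIN_VELOCITY, MAX_VELOCITY = 0, 4  # assumed allowed range (min_velocity / max_velocity_1)

def mask_q_values(q_values, velocity):
    """Set to -inf the q-values of actions that would leave the allowed velocity range."""
    masked = np.array(q_values, dtype=float)
    for i, action in enumerate(ACTIONS):
        next_velocity = velocity + ACTION_TO_ACCELERATION[action]
        if not MIN_VELOCITY <= next_velocity <= MAX_VELOCITY:
            masked[i] = -np.inf
    return masked

# the greedy action is then selected among the valid actions only
q_row = [6.7821, 1.0, 1.0, 3.4861, -0.4939]
best = ACTIONS[int(np.argmax(mask_q_values(q_row, velocity=4)))]  # speed_up* get masked out
print(best)  # no_change
```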
In the Bellman equation, if the episode is terminated, then `q_target = r` (the next state has a value of 0).
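In code, this typically looks like the following hedged sketch (not the exact repository code):

```python
def td_target(reward, gamma, next_q_values, done):
    # terminal transition: the next state is worth 0, so q_target = r
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```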
Figure: monitoring of the convergence of a given q-value.
The q-value estimate for the state/action pair `([16, 3], "no_change")` converges to the optimal value (`+40` = reward for reaching the end with the correct velocity).
- Overview of the parameters used for the environment
- Weights of the Q-table
- Plots in the training phase
- Plots of the final Q-table

Figure: best actions learnt by a model-free agent after 4000 episodes. Due to the specification of the initial state, the model-free agent cannot explore all the states.
Figure: returns for each episode during the training of a model-free agent. The orange curve shows the average return over 100 consecutive episodes; it reaches the success threshold after 2400 episodes.
After training, `env_configuration.json` is generated to summarize the configuration:
```json
{
  "min_velocity": 0,
  "previous_action": null,
  "initial_state": [0, 3, 12],
  "max_velocity_2": 2,
  "state_ego_velocity": 3,
  "obstacle1_coord": [12, 2],
  "actions_list": ["no_change", "speed_up", "speed_up_up", "slow_down", "slow_down_down"],
  "goal_velocity": 3,
  "goal_coord": [19, 1],
  "previous_state_position": 0,
  "obstacle": null,
  "initial_position": [0, 0],
  "previous_state_velocity": 3,
  "state_features": ["position", "velocity"],
  "state_obstacle_position": 12,
  "obstacle2_coord": [1, 3],
  "rewards_dict": {
    "goal_with_bad_velocity": -40,
    "negative_speed": -15,
    "under_speed": -15,
    "action_change": -2,
    "over_speed": -10,
    "over_speed_near_pedestrian": -40,
    "over_speed_2": -10,
    "per_step_cost": -3,
    "goal_with_good_velocity": 40
  },
  "max_velocity_1": 4,
  "max_velocity_pedestrian": 2,
  "using_tkinter": false,
  "state_ego_position": 0,
  "reward": 0
}
```
I am working on a more complex environment with a richer state space. Fine-tuning of the hyper-parameters for the DQN also belongs to the to-do list. Stay tuned!