Random network distillation on Montezuma's Revenge and Super Mario Bros.
- Visit the RNN_Policy branch for an RNN policy implementation instead of the CNN policy.
- An implementation for SuperMarioBros-1-1-v0 has been added; it is configured to receive no reward until the flag is reached. Visit the mario branch for the code.
Implementation of Exploration by Random Network Distillation on the Atari game Montezuma's Revenge. The algorithm generates intrinsic rewards based on the novelty of the states the agent encounters and uses them to mitigate the sparsity of the game's extrinsic rewards. The agent is trained with Proximal Policy Optimization, which combines extrinsic and intrinsic rewards easily and has relatively low variance during training. A minimal sketch of the intrinsic-reward idea follows.
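The sketch below (illustrative names, not this repo's actual classes) shows the core of RND: a fixed, randomly initialized target network embeds each observation, a trainable predictor tries to match that embedding, and the prediction error serves as the intrinsic reward, since it is large for states the predictor has rarely seen.

```python
import torch
from torch import nn

def make_encoder(out_dim=512):
    # Tiny CNN over a single normalized 84x84 frame, for illustration only.
    return nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.LeakyReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.LeakyReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.LeakyReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, out_dim),
    )

target = make_encoder()
predictor = make_encoder()
for p in target.parameters():  # the target network is never trained
    p.requires_grad = False

def intrinsic_reward(obs):
    # obs: (batch, 1, 84, 84), normalized with running statistics
    with torch.no_grad():
        t = target(obs)
    error = (predictor(obs) - t).pow(2).mean(dim=-1)  # per-state novelty
    return error  # the same error is minimized as the predictor's loss
```

Because the target is fixed and the predictor is only trained on visited states, revisiting familiar states drives the error (and thus the intrinsic reward) toward zero, while novel states keep it high.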
(Demo GIFs, see the demo directory: RNN Policy | CNN Policy | Super Mario Bros)
(Result plots, see the Plots directory: RNN Policy | CNN Policy)
- The kernel_size in this part of the original implementation is wrong; it should be 3 (the same as the DQN Nature paper) but it is 4.
- The RewardForwardFilter in the original implementation is definitely wrong, as has been pointed out here and solved here (see the sketch after this list).
- With a max-and-skip of 4 frames, the max frames per episode should be 4500, since 4500 * 4 = 18000, as mentioned in the paper.
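To make the RewardForwardFilter point concrete, here is a hedged sketch (illustrative names, not this repo's exact code) of the corrected computation: the original filter accumulates rewards in forward time order, so the running standard deviation used to scale intrinsic rewards is taken over the wrong quantity; discounting backward over the rollout yields the intended discounted returns.

```python
import numpy as np

def discounted_returns(intrinsic_rewards, gamma=0.99):
    # intrinsic_rewards: (T, n_envs) array for one rollout.
    # Accumulate backward in time, as a discounted return should be.
    returns = np.zeros_like(intrinsic_rewards)
    running = np.zeros(intrinsic_rewards.shape[1])
    for t in reversed(range(len(intrinsic_rewards))):
        running = intrinsic_rewards[t] + gamma * running
        returns[t] = running
    return returns

# The intrinsic rewards are then scaled by the std of these returns, e.g.:
# int_rewards /= (discounted_returns(int_rewards).std() + 1e-8)
```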
Parameters | Value |
---|---|
total rollouts per environment | 30000 |
max frames per episode | 4500 |
rollout length | 128 |
number of environments | 128 |
number of epochs | 4 |
number of mini batches | 4 |
learning rate | 1e-4 |
extrinsic gamma | 0.999 |
intrinsic gamma | 0.99 |
lambda | 0.95 |
extrinsic advantage coefficient | 2 |
intrinsic advantage coefficient | 1 |
entropy coefficient | 0.001 |
clip range | 0.1 |
steps for initial normalization | 50 |
predictor proportion | 0.25 |
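To make the advantage-related entries in the table concrete, here is a minimal, hypothetical sketch (not this repo's exact code) of how the two reward streams could be combined: each stream gets its own GAE pass, extrinsic with gamma = 0.999 and intrinsic with gamma = 0.99 (both with lambda = 0.95), and the results are summed with the coefficients 2 and 1. Done-masking is omitted here for brevity; in RND the intrinsic stream is treated as non-episodic anyway.

```python
import numpy as np

def gae(rewards, values, gamma, lam=0.95):
    # rewards: (T,), values: (T + 1,) including the bootstrap value.
    advs = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advs[t] = running
    return advs

def combined_advantage(ext_rew, int_rew, ext_val, int_val):
    ext_adv = gae(ext_rew, ext_val, gamma=0.999)  # extrinsic gamma
    int_adv = gae(int_rew, int_val, gamma=0.99)   # intrinsic gamma
    return 2.0 * ext_adv + 1.0 * int_adv          # coefficients from the table
```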
PPO-RND
├── Brain
│ ├── brain.py
│ └── model.py
├── Common
│ ├── config.py
│ ├── logger.py
│ ├── play.py
│ ├── runner.py
│ └── utils.py
├── demo
│ ├── CNN_Policy.gif
│ └── RNN_Policy.gif
├── main.py
├── Models
│ └── 2020-10-20-15-39-45
│ └── params.pth
├── Plots
│ ├── CNN
│ │ ├── ep_reward.png
│ │ ├── RIR.png
│ │ └── visited_rooms.png
│ └── RNN
│ ├── ep_reward.png
│ ├── RIR.png
│ └── visited_rooms.png
├── README.md
└── requirements.txt
pip3 install -r requirements.txt
usage: main.py [-h] [--n_workers N_WORKERS] [--interval INTERVAL] [--do_test]
[--render] [--train_from_scratch]
Variable parameters based on the configuration of the machine or user's choice
optional arguments:
-h, --help show this help message and exit
--n_workers N_WORKERS
Number of parallel environments.
--interval INTERVAL The interval specifies how often different parameters
should be saved and printed, counted by iterations.
--do_test The flag determines whether to train the agent or play
with it.
--render The flag determines whether to render each agent or
not.
--train_from_scratch The flag determines whether to train from scratch or
continue previous tries.
python3 main.py --n_workers=128 --interval=100
python3 main.py --n_workers=128 --interval=100 --train_from_scratch
python3 main.py --do_test