Playing Mario with Deep Reinforcement Learning
This project contains code to train a model that automatically plays the first level of Super Mario World using only raw pixels as the input (no hand-engineered features). The technique used is deep Q-learning, as described in the Atari paper (Summary), combined with a Spatial Transformer.
The training method is deep Q-learning with a replay memory, i.e. the model observes sequences of screens, saves them into its memory and later trains on them, where "training" means that it learns to accurately predict the expected action reward values ("action" means "press button X") based on the collected memories. The replay memory has by default a size of 250k entries. When it starts to get full, new entries replace older ones. For the training batches, examples are chosen randomly (uniform distribution) and rewards of memories are reestimated based on what the network has learned so far.
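The replay memory described above can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not the project's actual storage code; the class name `ReplayMemory` and its methods are hypothetical, and it assumes that "new entries replace older ones" means oldest-first overwriting.

```python
import random


class ReplayMemory:
    """Fixed-capacity memory; once full, new entries overwrite the oldest ones.

    Hypothetical sketch: names and the oldest-first policy are assumptions.
    """

    def __init__(self, capacity=250_000):
        self.capacity = capacity
        self.entries = []
        self.next_index = 0  # slot that the next write goes to

    def add(self, entry):
        if len(self.entries) < self.capacity:
            self.entries.append(entry)
        else:
            # Memory is full: replace the oldest entry.
            self.entries[self.next_index] = entry
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        """Draw a batch uniformly at random (without replacement)."""
        return rng.sample(self.entries, min(batch_size, len(self.entries)))

    def __len__(self):
        return len(self.entries)
```

The rewards of the sampled entries would then be reestimated with the current network before each training step.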
Each example's input has the following structure:
T is currently set to 4 (note that this includes the last state of the sequence). Screens are captured at every 5th frame. Each example's output consists of the action reward values of the chosen actions (received direct reward + discounted Q-value of the next state). The model can choose two actions per state: one arrow button (up, down, right, left) and one of the other control buttons (A, B, X, Y). This differs from the Atari model, in which the agent could only pick one button at a time. (Without this change, the agent would in theory be unable to perform many jumps, because they require holding the A button while moving to the right.) As the reward function is constructed in such a way that it is almost never 0, exactly two of each example's output values are expected to be non-zero.
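To make the output format concrete, here is a small Python sketch of how such a target vector might be built. The function `build_target` and its parameters are hypothetical. It assumes one output value per button, that both chosen buttons receive the same value, and that the caller supplies the estimated value of the next state.

```python
ARROWS = ["up", "down", "right", "left"]
BUTTONS = ["A", "B", "X", "Y"]
ACTIONS = ARROWS + BUTTONS  # assumed layout: one output value per button


def build_target(arrow, button, reward, next_state_value, gamma=0.9):
    """Return an 8-value target that is non-zero only for the chosen pair.

    Assumption: both chosen outputs get the same value, namely
    direct reward + discounted estimate of the next state's value.
    """
    value = reward + gamma * next_state_value
    target = [0.0] * len(ACTIONS)
    target[ACTIONS.index(arrow)] = value
    target[ACTIONS.index(button)] = value
    return target
```

For example, `build_target("right", "A", 0.5, 1.0)` yields a vector in which only the entries for "right" and "A" are non-zero.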
The agent gets the following rewards:
+0.5 if the agent moved to the right,
+1.0 if it moved fast to the right (8 pixels or more compared to the last game state),
-1.0 if it moved to the left and
-1.5 if it moved fast to the left (-8 pixels or more).
+2.0 while the level-finished-animation is playing.
-3.0 while the death animation is playing.
The gamma (discount for expected/indirect rewards) is set to 0.9.
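The rewards listed above could be computed roughly as in the following Python sketch. The function name `direct_reward` and its parameters are hypothetical. The text does not say how the rules combine, so the sketch assumes that the animation rewards replace the movement reward, that the slow and fast movement rewards are alternatives rather than cumulative, and that standing still gives 0.

```python
def direct_reward(x_now, x_last, level_finished=False, dying=False):
    """Direct reward for one step, following the rewards listed above.

    Assumptions (not confirmed by the text): animation rewards take
    precedence, slow/fast rewards are mutually exclusive, no movement = 0.
    """
    if level_finished:
        return 2.0
    if dying:
        return -3.0
    dx = x_now - x_last  # horizontal movement in pixels since the last state
    if dx >= 8:
        return 1.0   # fast to the right
    if dx > 0:
        return 0.5   # to the right
    if dx <= -8:
        return -1.5  # fast to the left
    if dx < 0:
        return -1.0  # to the left
    return 0.0
```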
Training the model only on score increases (like in the Atari paper) would most likely not work, because enemies respawn when their spawning location moves outside of the screen, so the agent could just kill them again and again, each time increasing its score.
A selective MSE is used to train the agent. That is, for each example gradients are calculated just like they would be for an MSE. However, the gradients of all action values are set to 0 if their target reward was 0. That's because each example contains only the received reward for one pair of chosen buttons (arrow button, other button). Other pairs of actions would have been possible, but the agent didn't choose them and so the reward for them is unclear. Their reward values (per example) are set to 0, not because they were truly 0, but because we don't know what reward the agent would have received if it had chosen them. Backpropagating a gradient for them (i.e. if the agent predicts a value unequal to 0) is therefore not reasonable.
This implementation can afford to differentiate between the chosen and not chosen buttons (in the target vector) based on the reward being unequal to 0, because the received reward of a chosen button is (here) almost never exactly 0 (due to the construction of the reward function). Other implementations might need to take more care of this step.
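The selective MSE can be illustrated with a few lines of Python. This is only a sketch of the masking idea; the function name is hypothetical, and the scaling assumes a per-output loss of 0.5 * (predicted - target)^2, which may differ from the factor used in the actual criterion.

```python
def selective_mse_gradient(predicted, target):
    """Per-output gradient of 0.5 * (predicted - target)^2.

    Outputs whose target is exactly 0 are treated as "not chosen" and
    receive a gradient of 0, as described above.
    """
    return [0.0 if t == 0.0 else (p - t) for p, t in zip(predicted, target)]
```

With predictions `[0.3, 1.0, -0.2]` and targets `[0.0, 1.5, 0.0]`, only the second output receives a gradient.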
The policy is an epsilon-greedy one, which starts at epsilon=0.8 and anneals that down to 0.1 at the 400k-th chosen action. Whenever, according to the policy, a random action should be chosen, the agent flips a coin (i.e. 50:50 chance) and either randomizes one of its two actions (arrow, other button) or randomizes both of them.
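A Python sketch of this policy might look as follows. The names `epsilon_at` and `choose_action` are hypothetical. The sketch assumes linear annealing and, when only one action is randomized, a 50:50 choice between the arrow and the other button; neither detail is stated in the text.

```python
import random

ARROWS = ["up", "down", "right", "left"]
BUTTONS = ["A", "B", "X", "Y"]


def epsilon_at(step, start=0.8, end=0.1, anneal_steps=400_000):
    """Anneal epsilon from start to end over anneal_steps actions (assumed linear)."""
    fraction = min(max(step / anneal_steps, 0.0), 1.0)
    return start + fraction * (end - start)


def choose_action(greedy_arrow, greedy_button, step, rng=random):
    """Return the (arrow, button) pair picked by the epsilon-greedy policy."""
    arrow, button = greedy_arrow, greedy_button
    if rng.random() < epsilon_at(step):
        if rng.random() < 0.5:
            # Randomize only one of the two actions (which one: assumed 50:50).
            if rng.random() < 0.5:
                arrow = rng.choice(ARROWS)
            else:
                button = rng.choice(BUTTONS)
        else:
            # Randomize both actions.
            arrow = rng.choice(ARROWS)
            button = rng.choice(BUTTONS)
    return arrow, button
```

Here `greedy_arrow` and `greedy_button` stand for the buttons with the highest predicted reward in each group.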
The model consists of three branches:
At the end of the branches, everything is merged to one vector, fed through a hidden layer, before reaching the output neurons. These output neurons predict the expected reward per pressed button.
Overview of the network:
The Spatial Transformer requires a localization network, which is shown below:
Both networks have overall about 6.6M parameters.
The agent is trained only on the first level (first to the right in the overworld at the start). Other levels suffer significantly more from various difficulties with which the agent can hardly deal. Some of these are:
The first level has hardly any of these difficulties and therefore lends itself to DQN, which is why it is used here. Training on any level and then testing on another one is also rather difficult, because each level seems to introduce new things, like new and quite different enemies or new mechanics (climbing, new items, objects that squeeze you to death, etc.).
Make sure that the following Torch packages are installed (luarocks install packageName): nn, cudnn, paths, image, display. display is usually not part of torch.
git clone https://github.com/qassemoquab/stnbhwd.git
cd stnbhwd
luarocks make stnbhwd-scm-1.rockspec
sudo apt-get install sqlite3 libsqlite3-dev
luarocks install lsqlite3
Open source/src/libray/lua.cpp and insert the following code under namespace {:
#ifndef LUA_OK
#define LUA_OK 0
#endif
#ifdef LUA_ERRGCMM
REGISTER_LONG_CONSTANT("LUA_ERRGCMM", LUA_ERRGCMM, CONST_PERSISTENT | CONST_CS);
#endif
This makes the emulator run in lua 5.1. Newer versions (than beta23) of lsnes rr2 might not need this.
Open source/include/core/controller.hpp and change the function do_button_action from private to public. Simply cut the line void do_button_action(const std::string& name, short newstate, int mode); in the private: block and paste it into the public: block.
Open source/src/lua/input.cpp and before lua::functions LUA_input_fns(... (at the end of the file) insert:
int do_button_action(lua::state& L, lua::parameters& P)
{
    auto& core = CORE();
    std::string name;
    short newstate;
    int mode;
    P(name, newstate, mode);
    core.buttons->do_button_action(name, newstate, mode);
    return 1;
}
This method was necessary to actually press buttons from custom lua scripts. All of the emulator's default lua functions for that would just never work, because core.lua2->input_controllerdata apparently never gets set (which btw will let these functions silently fail, i.e. without any error).
In source/src/lua/input.cpp, at the block lua::functions LUA_input_fns(..., add do_button_action to the lua commands that can be called from lua scripts loaded in the emulator. To do that, change the line {"controller_info", controller_info}, to {"controller_info", controller_info}, {"do_button_action", do_button_action},.
Switch to source/.
Compile the emulator with make.
Build options can be found in options.build.
Install libwxgtk3.0-dev and not version 2.8-dev, as that package's official page might tell you to do.
From source/ execute sudo cp lsnes /usr/bin/ && sudo chown root:root /usr/bin/lsnes. After that, you can start lsnes by simply typing lsnes in a console window.
Create a ramdisk for the game screenshots:
sudo mkdir /media/ramdisk
sudo chmod 777 /media/ramdisk
sudo mount -t tmpfs -o size=128M none /media/ramdisk && mkdir /media/ramdisk/mario-ai-screenshots
The screenshot path is set via SCREENSHOT_FILEPATH in config.lua.
Clone the repository via git clone https://github.com/aleju/mario-ai.git.
cd into the created directory.
Start lsnes in a terminal window.
In the emulator, go to Configure -> Settings -> Advanced and set the lua memory limit to 1024MB. (Only has to be done once.)
Start the game (the controls can be configured via Configure -> Settings -> Controller). Play until the overworld pops up. There, move to the right and start that level. Play that level a bit and save a handful or so of states via the emulator's File -> Save -> State to the subdirectory states/train. The names don't matter, but they have to end in .lsmv. (Try to spread the states over the whole level.)
Start the display server via th -ldisplay.start. If that doesn't work, you haven't installed display yet; use luarocks install display.
Open http://localhost:8000/ in your browser.
In the emulator, go to Tools -> Run Lua script... and select train.lua.
To stop the training, use Tools -> Reset Lua VM.
The training state is stored in learned/. Note that you can keep the replay memory (memory.sqlite) and train a new network with it.
You can test the model using test.lua. Don't expect it to play amazingly well. The agent will still die a lot, even more so if you ended the training on a bad set of parameters.