Minghan Li, Fan Mo, Ancong Wu
The Pig Chase Challenge requires us to design an agent that cooperates with another agent to catch a pig in a fenced area. Successfully catching the pig rewards 25 points. The agent can also choose to go to the lapis blocks to get 5 points and end the game early. The agent is tested with multiple kinds of collaborators and benchmarked on its overall performance. Further descriptions of this challenge can be found here.
Figure 1: The overview of the pig chase challenge from https://github.com/Microsoft/malmo-challenge/blob/master/ai_challenge/pig_chase/README.md.
After installing all dependencies, put all files of the
Run python test.py to see the agent's performance.
Run python train.py if you want to retrain the agent.
To use GPUs, install tensorflow 0.8 rather than 1.0, and run
The difficulty of this challenge comes from the uncertainty of the collaborator's behavior and from the success condition (the two agents must corner the pig rather than simply move to the pig's location). In the context of this task, we hope the agent can learn strategies such as flanking and ambushing, as well as exploiting the collaborator's behavioral pattern to catch the pig. Therefore, temporal abstraction and inference are needed for this specific task. We make two main contributions in this work:
Because of our limited resources, we have to split the learning process into data collection on CPUs and training on GPUs. We know the model is likely to overfit the dataset this way, especially a model such as a neural network. However, the results do show that with HiDDeN, the agent is able to learn some high-level strategies and exhibits collaborative patterns.
Figure 2: The Critic contains a Deep Q-Network that outputs Q-values for each goal given the current state. The Meta is the central controller of the hierarchical model; it receives Q-values from the Critic and specifies the goal. The Actor is an AStar agent, which moves greedily toward the current goal. The particle filter module is used to encode the behavior of our collaborator.
Temporal abstractions (high-level strategies) are usually very difficult to define, let alone to learn. Thus, we use the concept of subgoals, which in this task are specific coordinates. The Q-value function we use is therefore Q(s, g) instead of Q(s, a). This avoids training the agent on primitive actions and speeds up the data-collection process.
Critic Module: The Critic uses a fully connected neural network with 4 hidden layers; each layer has 1024 neurons with a rectifier nonlinearity. It takes the modified state feature vector and the goal as input, i.e. it computes Q(s, g). To stabilize the training process and break the correlations among the data, we also use experience replay and a target network in our model.
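As a concrete sketch, the Critic's forward pass can be written as a plain NumPy MLP. The layer count and width follow the description above; the input dimensions and the He-style initialization are illustrative assumptions, not our actual training code.

```python
import numpy as np

HIDDEN = 1024   # neurons per hidden layer, as described above
N_LAYERS = 4    # number of hidden layers

def init_critic(state_dim, goal_dim, rng):
    """Initialize a 4-hidden-layer ReLU MLP computing scalar Q(s, g)."""
    sizes = [state_dim + goal_dim] + [HIDDEN] * N_LAYERS + [1]
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def critic_q(params, state, goal):
    """Forward pass: concatenate state features and goal, return scalar Q."""
    x = np.concatenate([state, goal])
    for W, b in params[:-1]:
        x = np.maximum(0.0, x @ W + b)   # rectifier nonlinearity
    W, b = params[-1]
    return float(x @ W + b)
```

Scoring every candidate goal then amounts to calling `critic_q` once per goal and letting the Meta pick among the resulting values.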
Q-learning is known to have an overestimation problem, especially in non-stationary and stochastic environments. This can be addressed by Double Q-learning. To incorporate Double Q-learning into the DQN, we use the current network to select the greedy goal and the target network to evaluate it. Now we have a new update rule:

y = r + γ Q(s', argmax_g' Q(s', g'; θ); θ⁻)

where θ denotes the parameters of the current network and θ⁻ those of the target network.
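The Double Q-learning target described above separates goal selection from goal evaluation. A minimal sketch of computing that target, assuming the two networks' Q-values over candidate goals at the next state are already available as arrays:

```python
import numpy as np

def double_q_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: select the best next goal with the online
    (current) network, but evaluate it with the target network.

    next_q_online / next_q_target: Q-values over candidate goals at s'.
    """
    if done:
        return reward
    best_goal = int(np.argmax(next_q_online))         # selection: online net
    return reward + gamma * next_q_target[best_goal]  # evaluation: target net
```

Using the online network only for the argmax keeps the maximization bias of vanilla Q-learning from compounding in the bootstrapped target.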
Meta Module: The Meta produces a goal based on the Q-values from the Critic, and it also uses a particle filter to update the agent's belief about the behavioral pattern of the collaborator. We use a vector to encode the collaborator's type, based on noisy readings of its behavior. By resampling from the normalized probability vector in each episode, we make our agent more adaptive to changes in the collaborator's behavior. The normalized vector is also concatenated with the state feature vector as input to the Critic.
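One belief-update step of this kind can be sketched as follows. The particle representation and the per-type likelihood function are illustrative assumptions; the sketch only shows the weight-normalize-resample cycle the paragraph describes.

```python
import numpy as np

def update_belief(particles, likelihoods, rng):
    """Particle-filter update over collaborator types.

    particles:   array of type indices, one per particle.
    likelihoods: likelihoods[t] = P(observed move | collaborator type t),
                 a noisy reading of the collaborator's behavior.
    Returns the resampled particles and the normalized belief vector.
    """
    weights = likelihoods[particles]
    weights = weights / weights.sum()
    # Resample so the particle set can track changes in behavior.
    particles = rng.choice(particles, size=len(particles), p=weights)
    n_types = len(likelihoods)
    belief = np.bincount(particles, minlength=n_types) / len(particles)
    return particles, belief
```

The returned `belief` vector is what would be concatenated with the state features before feeding the Critic.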
Actor Module: The Actor is essentially an AStar agent, which receives a goal from the Meta and acts greedily toward it. The code for the Actor Module is modified from the provided AStar agent.
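On an obstacle-free grid the Actor's behavior reduces to stepping toward the goal; a toy stand-in for the provided AStar agent (not its actual code) is a greedy Manhattan-distance step, where `passable` is a hypothetical predicate for walkable cells:

```python
def greedy_step(pos, goal, passable):
    """Move one step toward the current goal, picking the walkable
    neighbor that minimizes Manhattan distance to the goal."""
    x, y = pos
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    candidates = [c for c in candidates if passable(c)]
    if not candidates:
        return pos  # boxed in: stay put
    dist = lambda c: abs(c[0] - goal[0]) + abs(c[1] - goal[1])
    return min(candidates, key=dist)
```

Real A* differs in that it plans a full path around the fence and obstacles, but the greedy-toward-goal interface to the Meta is the same.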
Off-Policy Learning: During data collection we do not use the Meta to output a goal; instead, we use our current coordinate as the goal to relabel all the previous states within an episode. In this manner, if our episode is 25 steps long, we can gather 325 (the sum of 1 to 25) samples from one episode. We use three behavior policies to interact with the collaborator: always chasing the pig, random walking, and always going to the lapis block. Since combining TD learning, function approximation, and off-policy learning can easily cause divergence, we tried different tricks to avoid it. Even so, our agent still behaves strangely and gets stuck in some states.
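The relabeling scheme above can be sketched directly; the trajectory representation here is an assumption for illustration, but the counting matches the 325-samples-per-25-step-episode figure:

```python
def relabel_episode(trajectory):
    """Relabeling used during data collection: after each step, the
    agent's current coordinate becomes the goal for every state seen
    so far in the episode. An episode of T steps yields T*(T+1)/2
    (state, goal) samples.

    trajectory: list of (state, coordinate) pairs, one per step.
    """
    samples = []
    for t, (_, coord) in enumerate(trajectory):
        # Relabel all states up to and including step t with goal = coord.
        for s, _ in trajectory[: t + 1]:
            samples.append((s, coord))
    return samples
```

Because every visited coordinate becomes a training goal, one episode produces many Q(s, g) samples without ever querying the Meta, which is what makes this an off-policy scheme.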
Since we do not deploy Project Malmo on GPUs, the whole learning process is split into two stages: data collection on CPUs and training on GPUs; i.e., we use the offline version of the algorithm to train our agent.
Figure 3: The results of HiDDeN vs. Focused and Focused vs. Focused. The red line represents the Focused agent and the blue one represents the HiDDeN agent. We can see our method indeed outperforms the AStar heuristic.