Building a Pac-Man Agent using Reinforcement Learning

Glenn Wysen and Jace Bruner - Fall 2019

Fig.1 - A pseudo-randomly generated Pac-Man board.

Overview

Our main objective with this project was to use reinforcement learning to make a Pac-Man agent that could win any game of Pac-Man. We started by getting a foundational understanding of value iteration, building an agent that took in a Markov decision process (MDP) and output an optimal policy. Once we were comfortable with the math behind that agent, we built on it and eventually created a Q-learning agent, which learned by trial and error from its interactions with the environment around it.

Value Iteration Agent

Our first approach to this problem was to build a value iteration agent that could take in a grid with a "goal" square and a "death" square and compute the values (and corresponding Q-values) for each position on the grid. The result is shown in Fig. 2. Each square's color corresponds to how likely the agent is to "win" the game if it follows the path given by the arrows. To keep the solution from being trivial, noise was added so that the agent moves in its intended direction only 80% of the time, slipping left 10% and right 10% of the time.
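
As a concrete illustration, the Bellman backup at the heart of this agent can be sketched as follows. The MDP interface used here (getStates, getPossibleActions, getTransitionStatesAndProbs, getReward) is illustrative shorthand for whatever the grid world exposes, not a verbatim copy of our code:

    import collections

    def value_iteration(mdp, discount=0.9, iterations=100):
        """Minimal batch value iteration sketch (illustrative MDP interface)."""
        values = collections.defaultdict(float)      # V_0(s) = 0 for every state
        for _ in range(iterations):
            new_values = collections.defaultdict(float)
            for state in mdp.getStates():
                actions = mdp.getPossibleActions(state)
                if not actions:                       # terminal states keep value 0
                    continue
                # Bellman backup: V_k+1(s) = max_a sum_s' T(s,a,s') [R(s,a,s') + gamma * V_k(s')]
                new_values[state] = max(
                    sum(prob * (mdp.getReward(state, action, next_state)
                                + discount * values[next_state])
                        for next_state, prob in mdp.getTransitionStatesAndProbs(state, action))
                    for action in actions)
            values = new_values                       # batch update: old values stay on the right-hand side
        return values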

A second case of value iteration uses a bridge-like map, shown in Fig.3. In this game state, the agent gets a small reward for reaching the left exit and a large reward for reaching the right exit. The catch is that if the agent slips left or right while crossing the bridge (which it will do 20% of the time), it suffers a severe negative reward. The purpose of this model was to cement our understanding of how noise (slipping left or right when told to go straight) affects an agent weighing different rewards.

Another aspect of the agent we needed to explore was the risk-versus-reward tradeoff. To do this, we used the board state shown in Fig.4, with two positive terminal states and a row of negative terminal states along the bottom (which could represent a dangerous fall off a cliff, or perhaps a street where the agent would be hit by a car). In this scenario we could adjust parameters like the noise (the chance of slipping sideways instead of moving in the intended direction) to change whether the agent takes the shorter, riskier path (red arrow) or the longer, safer path (green arrow). We also added a small penalty for every extra step taken, so that the agent completes its path in as few steps as possible.
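
For illustration, the kinds of settings that push the agent toward one path or the other might look like the following; the exact numbers are hypothetical, not the ones we tuned:

    # Hypothetical settings for the Fig.4 board. No slipping plus a harsh per-step
    # penalty favors the short path along the cliff edge; some slipping plus a
    # gentler penalty favors the longer, safer route.
    risky_short_path = {"discount": 0.9, "noise": 0.0, "living_reward": -1.0}
    safe_long_path   = {"discount": 0.9, "noise": 0.2, "living_reward": -0.1}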

The final aspect of the value iteration agent we implemented was prioritized sweeping value iteration. This process, modeled after the one described in this paper, aims to achieve the fast real-time performance of temporal-difference methods and Q-Learning while maintaining the accuracy of classical methods. Using prioritized sweeping value iteration, we were able to run thousands of simulations of a board in a fraction of a second.
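
A rough sketch of the prioritized sweeping loop is below. It reuses the same illustrative MDP interface as the earlier sketch, takes a precomputed map from each state to its predecessors, and reflects our reading of the algorithm rather than the exact agent code:

    import heapq
    import itertools

    def prioritized_sweeping(mdp, values, predecessors, discount=0.9, iterations=1000, theta=1e-5):
        """Illustrative prioritized sweeping: fix the states whose values are most wrong first."""
        tie = itertools.count()                       # tie-breaker so the heap never compares states

        def best_q(state):
            # Highest Q-value over the state's actions, using the current value estimates.
            return max(
                sum(prob * (mdp.getReward(state, a, s2) + discount * values[s2])
                    for s2, prob in mdp.getTransitionStatesAndProbs(state, a))
                for a in mdp.getPossibleActions(state))

        heap = []                                     # min-heap keyed on -|Bellman error|
        for s in mdp.getStates():
            if mdp.getPossibleActions(s):             # skip terminal states
                heapq.heappush(heap, (-abs(values[s] - best_q(s)), next(tie), s))

        for _ in range(iterations):
            if not heap:
                break
            _, _, s = heapq.heappop(heap)
            values[s] = best_q(s)                     # sweep the highest-priority state
            for p in predecessors[s]:                 # its predecessors may now be out of date
                diff = abs(values[p] - best_q(p))
                if diff > theta:
                    heapq.heappush(heap, (-diff, next(tie), p))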

Fig.2 - A simple board after five rounds of value iteration.
Fig.3 - Value iteration run on the bridge board.
Fig.4 - A board state with an important choice between a risky and a safe path.

Q-Learning

The issue with using value iteration to build an agent is that it doesn't actually learn from experience. Instead, it traverses an MDP and finds the best policy without ever interacting with the environment. This is undesirable for a Pac-Man agent because the environment is dynamic (more like the real world than a static board state). The Q-Learning agent, by contrast, learns through trial-and-error interactions as it moves through the board. This process updates the values of the states it leaves behind, so that it "leaves learning in its wake" (Fig.5).
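
The heart of this is the temporal-difference update Q(s,a) ← (1-α)Q(s,a) + α[r + γ·max_a' Q(s',a')]. A minimal tabular sketch, with generic state and action arguments rather than the project's actual classes:

    from collections import defaultdict

    class QLearner:
        """Minimal tabular Q-Learning sketch (illustrative, not the exact project code)."""
        def __init__(self, alpha=0.5, discount=0.9):
            self.q = defaultdict(float)   # Q(s, a), defaults to 0
            self.alpha = alpha            # learning rate
            self.discount = discount      # gamma

        def update(self, state, action, next_state, reward, legal_next_actions):
            # Value of the best action available in the next state (0 if terminal).
            next_value = max((self.q[(next_state, a)] for a in legal_next_actions), default=0.0)
            sample = reward + self.discount * next_value
            # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * sample
            self.q[(state, action)] = (1 - self.alpha) * self.q[(state, action)] + self.alpha * sample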

The next part of the Q-Learning agent we implemented was a way for it to randomly explore during some trials. Under the existing scheme the agent would never discover new paths, because it always followed the path with the highest Q-value. To solve this we introduced epsilon-greedy action selection, where the agent moves in the "correct" direction according to its Q-values with probability 1-ε but chooses a random legal direction with probability ε. This lets the agent occasionally stumble onto a better solution than the current best one, updating the Q-values for future episodes. An example of epsilon-greedy action selection is shown below (Fig.6), where the agent's goal is to move forward and its actions change the angles of its two legs.
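
A sketch of that selection rule, assuming Q-values are stored in a dictionary keyed by (state, action):

    import random

    def epsilon_greedy_action(q, state, legal_actions, epsilon=0.1):
        """With probability epsilon explore a random legal action, otherwise exploit."""
        if random.random() < epsilon:
            return random.choice(legal_actions)                  # explore
        return max(legal_actions, key=lambda a: q[(state, a)])   # exploit: highest Q-value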

Fig.5 - The same board state from before, this time with Q-Learning implemented.
Fig.6 - A simple crawler whose goal is to move forward. When the epsilon value is decreased, the agent takes fewer random deviations and settles on a steady gait that it knows works.

Putting It All Together

The final step was to put everything we had learned and built together into an agent on a Pac-Man board. To make the agent behave even better, we added approximate Q-Learning, where the agent learns weights for features of the game board instead of a separate value for every state. The agent first runs through a training phase in which it simulates thousands of games and learns good moves for the states it encounters. It then runs a testing phase in which epsilon and alpha are set to zero, turning off exploration and learning. This final agent is able to win every game put in front of it.
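
In approximate Q-Learning the table of Q-values is replaced by a weighted sum of features, Q(s,a) = Σ_i w_i·f_i(s,a), and the same TD error now nudges every feature weight. A sketch, assuming a feature extractor that maps a (state, action) pair to a dictionary of feature values (names are illustrative):

    from collections import defaultdict

    class ApproximateQLearner:
        """Illustrative approximate Q-Learning: Q(s,a) = sum_i w_i * f_i(s,a)."""
        def __init__(self, feature_extractor, alpha=0.2, discount=0.8):
            self.weights = defaultdict(float)
            self.features = feature_extractor   # maps (state, action) -> {feature_name: value}
            self.alpha = alpha
            self.discount = discount

        def q_value(self, state, action):
            return sum(self.weights[f] * v for f, v in self.features(state, action).items())

        def update(self, state, action, next_state, reward, legal_next_actions):
            next_value = max((self.q_value(next_state, a) for a in legal_next_actions), default=0.0)
            # The same TD error as in tabular Q-Learning, distributed across the feature weights.
            correction = (reward + self.discount * next_value) - self.q_value(state, action)
            for f, v in self.features(state, action).items():
                self.weights[f] += self.alpha * correction * v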

Fig.7 - The agent beating the Pac-Man game from the beginning of the page, with random ghost movements.

Limitations and Future Work

Currently the agent treats ghosts as always bad, even after it eats a large pellet, which puts the ghosts into a "scared" mode and removes their ability to kill Pac-Man. A nice extension of this project would be to award a score bonus for eating ghosts while they are in the "scared" state, sending them back to their starting location. Additionally, if there were some way to adapt the agent to a real Pac-Man game, it would be fun to see it rack up high scores.