Solving Obstacle Tower Challenge using Imitation Learning from Extracted Features

By: Yi Han, Kai Qin, Wenwen Xu

Inspiration

When playing video games, we encounter AI of all levels of sophistication, often implemented by a team of game developers. The more authentic and interactive the AI, the more creativity, time, and budget were poured into making it feel genuine. Consider the lively and charismatic character Elizabeth in the game Bioshock Infinite: a team of programmers, level designers, and animation directors spent months building interactable stages, performing motion capture, and expanding dialog trees for her AI. Our goal in studying deep reinforcement learning is to discover a more efficient way of generating game AI.

Definition

What is Reinforcement Learning? Let’s say you are agent 007 and your mission is to destroy the cybertruck in the game Goldeneye. You can choose to blow it up, shoot it, or do nothing as your action, and the rewards given for these actions are 1, 0.2, and 0 respectively. At the end of each cycle, the game environment rewards you for the action you took. This continuous process of an agent picking actions based on the state and reward returned by the environment forms the essence of reinforcement learning. When a neural network such as a CNN or LSTM acts as the agent’s decision-making “brain”, we call the process deep reinforcement learning.
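To make the loop concrete, here is a minimal sketch of that agent-environment cycle using the classic OpenAI Gym interface (the environment and the random action choice are stand-ins for an actual game and a trained agent):

```python
import gym

# A stand-in environment; a real agent would replace the random action sampling.
env = gym.make("CartPole-v1")
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()             # the agent picks an action
    state, reward, done, info = env.step(action)   # the environment returns the next state and reward
env.close()
```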

Learning the Ropes

We started by reading “Playing Atari with Deep Reinforcement Learning” [7], the 2013 paper that sparked broad interest in deep reinforcement learning by introducing the Deep Q-Network (DQN), an algorithm that could train agents to play Atari games. The process of taking game-state images and rewards as input and using gradient descent to update a CNN eventually trains an agent that takes actions maximizing the rewards it receives; this process was used to produce the playing results shown in Figures 2 and 3.
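As a rough illustration of the DQN idea (not the paper’s original code), the update pulls the network’s Q-value for the taken action toward the bootstrapped target r + γ·max Q(s′, ·). A minimal PyTorch sketch, assuming the networks, optimizer, and replay-buffer batch already exist:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN gradient step. batch = (states, actions, rewards, next_states, dones),
    where dones is a float tensor of 0/1 episode-termination flags."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a) for the taken actions
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values        # max_a' Q(s', a')
        target = rewards + gamma * (1 - dones) * max_next_q           # bootstrapped target
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```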

We also ran three different algorithms, PPO2, DDPG, and TRPO, for 500 timesteps each to train the agents shown in Figures 4, 5, and 6. As we showed during the class presentation, TRPO performs best on this particular physics-based simulation, Walker2D, because the algorithm is designed to handle a continuous action space. To understand the concept of policy gradients, which optimize the policy directly instead of iterating on value estimates, we surveyed lectures from a deep reinforcement learning course offered at UC Berkeley [4].
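A hedged sketch of how such a comparison can be run with Stable Baselines 2.x (the Walker2d environment id and its MuJoCo/PyBullet backend are assumptions; TRPO additionally requires mpi4py, and DDPG needs a continuous action space, which Walker2D provides):

```python
from stable_baselines import DDPG, PPO2, TRPO   # TRPO requires mpi4py to be installed

for algo in (PPO2, DDPG, TRPO):
    model = algo("MlpPolicy", "Walker2d-v2")    # continuous-control Walker2D task
    model.learn(total_timesteps=500)            # 500 timesteps, as in our class demo
    model.save("walker2d_" + algo.__name__.lower())
```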

Obstacle Tower Challenge

After familiarizing ourselves with some of the tools of reinforcement learning, we looked for a game environment more complex than the Atari games of the 1970s and set out to produce an intelligent agent to conquer it. The Obstacle Tower Challenge (OTC) was introduced by Unity, a popular game-engine company. The environment consists of procedurally generated floors, and our job is to train an agent that can navigate each floor and solve puzzles to advance to the next one. Each floor is more difficult to pass than the previous one, introducing new challenges like locked doors and doors activated by floor tiles. The environment provides game frames as the input state and gives a reward of 1 for completing a floor and 0.1 for picking up items. The agent chooses from a discrete action space of 54 combinations such as [move forward, move left, rotate camera left, jump]. We focused the last three weeks of our project on delivering results in this challenge.
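A hedged sketch of interacting with the environment through its Gym wrapper (the obstacle_tower_env package is the official wrapper; the local binary path below is an assumption):

```python
from obstacle_tower_env import ObstacleTowerEnv

# Path to the downloaded Obstacle Tower binary is a placeholder.
env = ObstacleTowerEnv("./ObstacleTower/obstacletower", retro=True)
print(env.action_space)                          # covers the 54 action combinations
obs = env.reset()                                # observation: a game frame
obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```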

Direct OpenAI Baseline Implementation

We first trained agents using OpenAI Baselines’ PPO and Trust Region Policy Optimization (TRPO) [2]. After more than 8 hours of training, both agents showed similarly abysmal performance of about 1 solved floor on average, on par with the results discussed in the original paper released by the challenge creators [5]. Both agents fell into a strategy of walking in circles, with no recognition of important features in the environment. We concluded that the raw image observations, combined with the 54 possible actions, are too complex for an agent to learn from in only 8 hours.

Sampling the Winners

Since the challenge ended on April 30th of this year, we examined articles written by the 1st-place [6] and 4th-place winners to extract some of the methods they used to reach higher floors. Across the top finishers, pre-training the agent with behavior cloning was a common practice that yielded better results in the initial stages of training, so we decided to behavior-clone this practice into our own implementation. Our best result, shown during the class presentation, was based on the two datasets collected by the first-place winner [6]. Dataset one includes about 47k labeled frames, each classified into one of 11 labels such as closed door or goal, and dataset two contains roughly 2 million frames (~2.3 days) of human demonstration data. Using the human demonstration dataset with a simple 2-hidden-layer CNN to pre-train a TRPO model, we were able to produce an agent that reaches floors 6 and 7 with less than 8 hours of pre-training time. The agent recognizes time-extension orbs, doors, and keys, and reliably navigates the first 5 floors.
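A minimal sketch of this pre-training setup, assuming Stable Baselines 2.x and that the human demonstrations have been converted to its .npz expert format (the file name, binary path, and hyperparameters below are illustrative):

```python
from obstacle_tower_env import ObstacleTowerEnv
from stable_baselines import TRPO
from stable_baselines.gail import ExpertDataset

# "human_demos.npz" is a placeholder: the human demonstration frames and actions
# would first be converted into Stable Baselines' expert .npz format.
env = ObstacleTowerEnv("./ObstacleTower/obstacletower", retro=True)
demos = ExpertDataset(expert_path="human_demos.npz", batch_size=128)

model = TRPO("CnnPolicy", env)
model.pretrain(demos, n_epochs=10)      # behavior cloning on the demonstrations
model.learn(total_timesteps=1000000)    # then continue with regular TRPO training
```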

Our Own Ideas and Implementation

With the labeled dataset, the first thing that came to mind was transfer learning from a state-of-the-art ImageNet-pretrained model such as ResNet34 or ResNet50. Since the frame size is less than a quarter of the ImageNet image size, we decided to start with ResNet34, which has fewer parameters than ResNet50. The pipeline for our method is shown in Figure 7. The last 1000-way fully connected layer was replaced by three fully connected layers that output the 11 labels. After training for 2 epochs on dataset one with the convolutional layers frozen, we achieved an error rate of around 12% in less than 3 minutes of training time (Figure 8). While reviewing the top losses, we found that many frames contain multiple features, so the true error rate of this classifier is likely lower than 12%. One advantage of using a pre-trained model is reduced training time: starting from scratch, the first-place winner took a couple of hours on a single GPU to train his classifier, while ours took less than 3 minutes to reach comparable accuracy.

Figure 7

Figure 8
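A minimal sketch of the transfer-learning setup described above, assuming PyTorch/torchvision; the hidden-layer widths of the new head are illustrative:

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet34 and freeze its convolutional backbone.
model = models.resnet34(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-way head with three fully connected layers ending in 11 labels
# (closed door, goal, ...). Only these new layers are trained.
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 11),
)
```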

With a pre-trained game-feature classifier at hand, we thought it would be interesting to combine it with the reinforcement learning agent through transfer learning or environment augmentation. The first-place winner appended the classifier’s predicted categories to the game frame to improve his agent’s ability to read the environment state, and judging by his final score, this approach seemed to work. We decided to explore a different way to utilize the classifier.

Motivated by the fact that the classifier was able to recognize different features in a game frame, our first transfer-learning approach was to transform each game frame into a feature vector taken from the output of the classifier’s second-to-last layer (Figure 9). This approach can be used to train agents with BC, PPO, TRPO, or GAIL. One analogy for the method is a player wearing smart glasses that pre-label the input frame with the key findings critical to solving the game.

Figure 9
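A minimal sketch of the frame-to-feature transformation, again assuming PyTorch/torchvision: the classifier’s layers up to and including global average pooling serve as a fixed 512-dimensional feature extractor.

```python
import torch
import torch.nn as nn
from torchvision import models

# 'model' would be the fine-tuned ResNet34 classifier from the previous section;
# a fresh ResNet34 stands in for it here. Dropping the final head leaves the
# 512-dimensional output of the global average pooling layer.
model = models.resnet34(pretrained=True)
feature_extractor = nn.Sequential(*list(model.children())[:-1])
feature_extractor.eval()

def extract_features(frames):
    """frames: (N, 3, H, W) float tensor of game frames -> (N, 512) feature vectors."""
    with torch.no_grad():
        return feature_extractor(frames).flatten(1)
```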

To evaluate the performance of BC using ResNet34-extracted features, we compared the mean episodic rewards of BC with normal frames and BC with extracted features. Both were trained on the same dataset of 127 episodes containing more than 385,525 action-observation pairs. The mean episodic reward vs. training epochs is shown in Figure 10. BC using extracted features performs similarly to the OpenAI Baselines PPO and TRPO, but worse than BC with the original game frames. BC with the original game frames took 49 minutes to complete 1 training epoch, while BC using extracted features took only 7.22 seconds. We hypothesized that the performance gap comes from the loss of location information when transforming the original frames into feature vectors. It is also possible that the neural network architecture of the TRPO implementation in Stable Baselines is not well suited to this BC problem with an input vector of size 512.

Figure 10

Augmented Feature Extraction with Half-Image Location Information

In order to mitigate the loss of location information when transforming the original frames into feature vectors, Dr. Alex Dimakis suggested incorporating location information into the extracted feature vector by blocking half of the frame and appending the resulting features to the whole-image features. Combining the 512×1 feature vectors extracted from the left half, the right half, and the original frame, the new augmented feature vector has size 1536×1. The performance of BC with the augmented feature vector was comparable to that of the 512×1 extracted features and worse than that of the original observations. Although the results are not promising, possibly due to the simple architecture of the OpenAI implementation of TRPO, the idea itself is still worth sharing.

Figure 11
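A minimal sketch of the augmentation, building on the extract_features helper sketched earlier; zeroing out half of the frame is one possible way to “block” it:

```python
import torch

def augmented_features(frames):
    """frames: (N, 3, H, W) batch of game frames -> (N, 1536) augmented feature vectors."""
    half = frames.shape[-1] // 2
    left_only = frames.clone()
    left_only[..., half:] = 0            # block out the right half of each frame
    right_only = frames.clone()
    right_only[..., :half] = 0           # block out the left half of each frame
    full_feats = extract_features(frames)        # (N, 512)
    left_feats = extract_features(left_only)     # (N, 512)
    right_feats = extract_features(right_only)   # (N, 512)
    return torch.cat([full_feats, left_feats, right_feats], dim=1)  # (N, 1536)
```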

Future work

One next step would be to replace the default OpenAI common neural network with a more sophisticated MLP architecture. Second, we would like to integrate the pre-trained feature classifier with a deep reinforcement learning algorithm so that the entire network becomes end-to-end trainable using the reinforcement learning objective.

One major resource we did not have time to examine is the work of Google DeepMind, whose agents have beaten professional players in Go and StarCraft. We would like to draw on DeepMind’s practices in LSTMs, self-play, and adversarial play to keep improving our OTC agent and to tackle other games.

References

[1] Araffin. RL Baselines Zoo: a Collection of Pre-Trained Reinforcement Learning Agents, 2018.

[2] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust Region Policy Optimization, 2015.

[3] OpenAI. Proximal Policy Optimization, 2017.

[4] Sergey Levine, Frederik Ebert, Avi Singh, Kelvin Xu, and Anusha Nagabandi. CS 285 at UC Berkeley: Deep Reinforcement Learning.

[5] Unity Technologies. Unity Obstacle Tower Challenge, 2019.

[6] Unixpickle. Competing in the Obstacle Tower Challenge, 2017.

[7] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning, 2013.

[8] Yi Han, Kai Qin, and Wenwen Xu. Shallow Dip into Deep Reinforcement Learning, 2019.

[9] Yi Han, Kai Qin, and Wenwen Xu. Solving Obstacle Tower Challenge using Imitation Learning from Extracted Features, 2019.

Medium post link:

https://medium.com/solving-obstacle-tower-challenge-using-imitation/solving-obstacle-tower-challenge-using-imitation-learning-from-extracted-features-62900e591b05?source=friends_link&sk=8b9e3446c73ea68d5e41bc234fec4b48