Proximal Policy Optimization Tutorial (Part 1/2: Actor-Critic Method)

Original article was published on Artificial Intelligence on Medium

The Actor model

The Actor model performs the task of learning what action to take under a particular observed state of the environment. In our case, it takes the RGB image of the game as input and gives a particular action like shoot or pass as output.


Let’s implement this first.
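A rough sketch of such an Actor in Keras is shown below. The helper name `get_model_actor`, the 256-unit Dense layer, and the exact frame size `state_dims` are illustrative assumptions; the frozen pretrained MobileNet feature extractor, the softmax output over `n_actions`, and the temporary MSE loss follow the description in the text.

```python
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.models import Model

# Assumed frame size: the env's RGB frames can be resized to this.
state_dims = (128, 128, 3)
# Size of the football environment's default action set.
n_actions = 19

def get_model_actor(input_dims, n_actions):
    state_input = Input(shape=input_dims)

    # Pretrained MobileNet used as a frozen feature extractor
    feature_extractor = MobileNet(include_top=False, weights='imagenet',
                                  input_shape=input_dims)
    for layer in feature_extractor.layers:
        layer.trainable = False

    # Trainable classification head: a probability distribution over actions
    x = Flatten()(feature_extractor(state_input))
    x = Dense(256, activation='relu')(x)
    out_actions = Dense(n_actions, activation='softmax')(x)

    model = Model(inputs=state_input, outputs=out_actions)
    # MSE is a placeholder loss; it is replaced by the custom PPO loss later
    model.compile(optimizer='adam', loss='mse')
    return model

model_actor = get_model_actor(state_dims, n_actions)
```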

Here, we first define the input shape state_input for our neural net, which is the shape of our RGB image. n_actions is the total number of actions available to us in this football environment, and it will be the number of output nodes of the neural net.

I’m using the first few layers of a pretrained MobileNet CNN to process our input image. I’m also making these layers’ parameters non-trainable, since we do not want to change their weights. Only the classification layers added on top of this feature extractor will be trained to predict the correct actions. Let’s combine these layers as a Keras Model and compile it using a mean-squared error loss (for now; this will be changed to a custom PPO loss later in this tutorial).

The Critic model

We send the action predicted by the Actor to the football environment and observe what happens in the game. If something positive happens as a result of our action, like scoring a goal, then the environment sends back a positive response in the form of a reward. If an own goal occurs due to our action, then we get a negative reward. This reward is taken in by the Critic model.


The job of the Critic model is to learn to evaluate whether the action taken by the Actor led our environment to a better state or not, and to give this feedback to the Actor, hence its name. It outputs a real number indicating a rating (Q-value) of the action taken in the previous state. Using this rating obtained from the Critic, the Actor can compare its current policy with a new policy and decide how to improve itself to take better actions.

Let’s implement the Critic.
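The Critic can be sketched along the same lines as the Actor. Again, the helper name `get_model_critic`, the 256-unit Dense layer, and the frame size are assumptions; the single tanh-activated output unit follows the text below.

```python
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.models import Model

# Assumed frame size, matching the Actor's input.
state_dims = (128, 128, 3)

def get_model_critic(input_dims):
    state_input = Input(shape=input_dims)

    # Same frozen pretrained MobileNet feature extractor as the Actor
    feature_extractor = MobileNet(include_top=False, weights='imagenet',
                                  input_shape=input_dims)
    for layer in feature_extractor.layers:
        layer.trainable = False

    # Value head: a single real number rating the state
    x = Flatten()(feature_extractor(state_input))
    x = Dense(256, activation='relu')(x)
    out_value = Dense(1, activation='tanh')(x)

    model = Model(inputs=state_input, outputs=out_value)
    model.compile(optimizer='adam', loss='mse')
    return model

model_critic = get_model_critic(state_dims)
```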

As you can see, the structure of the Critic neural net is almost the same as the Actor’s. The only major difference is that the final layer of the Critic outputs a single real number. Hence, the activation used is tanh rather than softmax, since we do not need a probability distribution here as we did with the Actor.

Now, an important step in the PPO algorithm is to run through this entire loop with the two models for a fixed number of steps, known as PPO steps. So essentially, we are interacting with our environment for a certain number of steps and collecting the states, actions, rewards, etc. that we will use for training.

Tying it all together

Now that we have our two models defined, we can use them to interact with the football environment for a fixed number of steps and collect our experiences. These experiences will be used to update the policies of our models once we have a large enough batch of such samples. Here is how to implement the loop that collects such sample experiences.
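The collection loop can be sketched as follows. To keep this snippet self-contained and runnable, `DummyEnv` and the two `*_predict` stubs stand in for the gfootball environment and the Actor/Critic Keras models described above; only the buffer-filling logic mirrors what the tutorial does.

```python
import numpy as np

class DummyEnv:
    """Stand-in for the gfootball env: a minimal Gym-style interface
    that returns RGB-shaped observations."""
    def __init__(self, shape=(128, 128, 3)):
        self.shape = shape
    def reset(self):
        return np.zeros(self.shape, dtype=np.uint8)
    def step(self, action):
        obs = np.zeros(self.shape, dtype=np.uint8)
        reward = 0.0
        done = np.random.rand() < 0.05  # episodes end occasionally
        return obs, reward, done, {}

n_actions = 19
env = DummyEnv()

def actor_predict(state_input):
    # stand-in for model_actor.predict(state_input): uniform policy
    return np.ones((1, n_actions)) / n_actions

def critic_predict(state_input):
    # stand-in for model_critic.predict(state_input)
    return np.zeros((1, 1))

ppo_steps = 128

# Buffers for one batch of experience
states, actions, actions_onehot, actions_probs = [], [], [], []
values, masks, rewards = [], [], []

state = env.reset()
for _ in range(ppo_steps):
    state_input = np.expand_dims(state, 0)

    action_dist = actor_predict(state_input)  # policy over actions
    q_value = critic_predict(state_input)     # rating of current state

    # Sample an action from the predicted distribution
    action = np.random.choice(n_actions, p=action_dist[0])
    onehot = np.zeros(n_actions)
    onehot[action] = 1.0

    next_state, reward, done, info = env.step(action)

    states.append(state)
    actions.append(action)
    actions_onehot.append(onehot)
    actions_probs.append(action_dist)
    values.append(q_value)
    masks.append(not done)
    rewards.append(reward)

    state = env.reset() if done else next_state
```

After the loop, each buffer holds exactly ppo_steps entries, which together form one training batch.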

As you can see in the code above, we have defined a few Python list objects that will be used to store information like the observed states, actions, rewards, etc. while we are interacting with our environment for a total of ppo_steps steps. This gives us a batch of 128 sample experiences that will be used later on for training the Actor and Critic neural networks.

The following two videos explain this code line by line and also show what the end result looks like on the game screen.

To be continued…

That’s all for this part of the tutorial. We installed the Google Football Environment on our Linux system and implemented a basic framework to interact with it. Next, we defined the Actor and Critic models and used them to interact with the game and collect sample experiences. I hope you were able to keep up so far; if something held you up, let me know in the comments below and I’ll try to help.

Next time we’ll see how to use these experiences we collected to train and improve the actor and critic models. We’ll go over the Generalized Advantage Estimation algorithm and use that to calculate a custom PPO loss for training these networks. So stick around!

EDIT: Here’s PART 2 of this tutorial series.