Deep-Q-Networks (1 of 2): How Computers Learn

An in-depth look at classic Q-Learning and how it works.

How do human beings learn? Let’s picture a child (this example was taken from an excellent article by Thomas Simonini).

The child approaches a fireplace. They feel warm and cozy. Reward += 1

The child wants more of that feeling, so they touch the fireplace. Ouch! Reward -= 1

Now the child knows that fire is positive from a distance (warm and cozy) but negative to touch (burn). This trial-and-error loop is exactly how reinforcement learning works, and it’s doing some crazy stuff.

DeepMind’s AlphaZero, successor of AlphaGo, managed to outperform Stockfish after just 4 hours of self-play training, without any human examples to learn from. It’s part of a wave of reinforcement learning systems that, like a human child, start out knowing nothing and learn entirely from their own experience.

“Shedding new light on chess, Go and shogi”

RL occupies a decidedly unique space between the established supervised and unsupervised branches of machine learning. Supervised ML (classification, support-vector machines) is when the computer is given training samples with features and their respective labels to train on. Unsupervised ML (clustering) is essentially when the computer is given no labelled samples at all. Is reinforcement learning closer to one or the other, or is it its own field? I’ll come back to this later once we cover the groundwork.

I’m sold! How does this work anyway?

Before getting into Deep Q-Learning, it’s best to have an understanding of what classic Q-learning is.

Let’s take a simple example:

This is the environment MountainCar-v0 from OpenAI’s gym module for Python

The goal in this “game” is to get the car to the flag, the catch being that the car’s engine isn’t strong enough to drive straight up the hill; the agent has to build momentum by rocking back and forth. First, let’s define some terminology:

  • Agent — the “player”. It can take actions to change the environment.
  • Action(s) — the way in which the agent can interact with its environment.
  • State — the information describing the environment at the current time-step (for MountainCar, the car’s position and velocity).
  • Reward — how the agent can track if it is accomplishing its goal.

Let’s look a little deeper at this game.

Mountain Car 2 — Electric Boogaloo

In any RL environment, the agent takes actions, based on the state of the environment, that will try to maximize reward (or minimize punishment — we’ll get to that soon). So in this example:

Intuitively, we can understand the goal of this environment: get to the flag as efficiently as possible. How does the computer know that? Each time-step yields a reward of -1, and reaching the goal yields a reward of 0. The agent has 200 time-steps to reach the flag before the episode ends. This is minimizing punishment: the agent is trying to get the highest possible total reward, so it follows that the optimal actions take the agent to the flag in the least amount of time.
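To make the reward structure concrete, here’s a minimal sketch using gym. The variable names are mine, and it assumes the older gym API where reset() returns the state and step() returns four values:

```python
import gym

env = gym.make("MountainCar-v0")
state = env.reset()          # older gym API: reset() returns the initial state

total_reward, steps, done = 0, 0, False
while not done:
    action = env.action_space.sample()         # random action: 0, 1 or 2
    state, reward, done, _ = env.step(action)  # reward is -1 on every time-step
    total_reward += reward
    steps += 1

env.close()
print(steps, total_reward)   # a random agent usually hits the 200-step limit, scoring -200
```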

Ok I understand — how the heck do we do it?

This is done using Q-tables. A Q-table is the walkthrough, or guidebook, that tells the agent what to do given a certain state. It contains every possible state in the environment and assigns one q-value per possible action. In this situation, we have 3 q-values for each state in the environment (one for pushing left, one for doing nothing and one for pushing right). The agent references this q-table and, given the state, chooses the action with the highest q-value.
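In code, “choose the action with the highest q-value” is just an argmax over one row of the table. A toy illustration with made-up numbers (the q-values below are hypothetical):

```python
import numpy as np

# One row of a Q-table: the three q-values for a single state,
# ordered [push left, do nothing, push right] (numbers are made up).
q_values_for_state = np.array([-4.7, -5.1, -3.9])

best_action = np.argmax(q_values_for_state)
print(best_action)   # 2, i.e. push right
```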

The Q-table is initialized randomly, and after every step the q-value of the action the agent just took is updated using a formula:

Q(sₜ, aₜ) ← Q(sₜ, aₜ) + α · [rₜ + γ · maxₐ Q(sₜ₊₁, a) - Q(sₜ, aₜ)]

This is… quite the formula

Let’s break it down.

  • Learning rate (α): a measure (a value between 0 and 1) of how readily the algorithm gives up its old estimate in favour of new information.
  • Reward (rₜ): the reward the agent receives right now, after taking action aₜ.
  • Discount factor (γ): a measure (a value between 0 and 1) of how much the agent values future reward relative to immediate reward. Long-term rewards are uncertain, so they should be discounted; but if they are discounted too much, the agent will prioritize short-term rewards over the long-term overarching goal.
  • Max Q (maxₐ Q(sₜ₊₁, a)): the maximum of the q-values of the new state (reached after action aₜ) over all possible actions a.

The function Q takes in a certain state-action pair and essentially spits out the “quality” of that action. It follows that our Q-table is just a collection of the outputs of the Q function for every state-action pair! After each action, the Q-table is updated. The more the agent plays, the more rewards (negative or positive) it collects and the more optimal our Q-table becomes.
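To make the update concrete, here’s a tiny worked example. Every number below is hypothetical; it just shows the arithmetic:

```python
learning_rate = 0.1     # alpha
discount = 0.95         # gamma
current_q = -4.0        # Q(s_t, a_t) before the update
reward = -1.0           # reward received for this step
max_future_q = -3.5     # best q-value available from the new state

# Q <- Q + alpha * (r + gamma * maxQ' - Q)
new_q = current_q + learning_rate * (reward + discount * max_future_q - current_q)
print(new_q)            # -4.0325: the estimate moves slightly toward the observed target
```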

Programming a Q-Learning algorithm in Python

For this project, I used OpenAI’s gym module, which is an open-source collection of many different RL environments (I also used it to do the Deep-Q-Network, stay tuned). I worked with the example we’ve already been using, the mountain-car. Here’s essentially what we want to do:

A great flowchart from Thomas Simonini explaining the structure of Q-Learning
Setting up imports and constants
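The original gist isn’t reproduced here, so below is a minimal sketch of what the setup might look like. The exact hyperparameter values are assumptions on my part (apart from the 40,000 episodes and the halfway-point epsilon decay mentioned later in the article), and it uses the older gym API:

```python
import gym
import numpy as np

env = gym.make("MountainCar-v0")

# Hyperparameters (these particular values are assumptions, not necessarily the originals)
LEARNING_RATE = 0.1
DISCOUNT = 0.95            # how much we value future reward
EPISODES = 40000           # the article runs 40,000 episodes

# Exploration settings: epsilon decays linearly and reaches 0 at the halfway point
epsilon = 1.0
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)
```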

What is that epsilon value doing there? Why would we want the agent to act randomly?

First of all, the agent is going to act randomly in the beginning anyway, because the Q-table is initialized randomly. Even after the Q-table starts improving, though, we still want the agent to explore. The agent shouldn’t get stuck iterating on and optimizing a sub-optimal strategy; it should keep exploring in search of potentially better ones.

Of course, we still want the agent to start exploiting at some point; what’s the point of finding the optimal strategy if the agent never uses it? That’s why epsilon is set to decay over time, and in this case epsilon reaches 0 and stops decaying at the halfway point of training.

In this example I decay epsilon linearly, but there are other ways to decay it too! Many programs decay it exponentially, and there is also some research suggesting that reward-based epsilon decay works better. Reward-based decay is exactly what it sounds like: only decay epsilon when the agent earns a good enough reward. You can think of this as “responsibility”: as the agent performs better, we allow it more responsibility over its actions. The sketch below compares the three approaches.
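Here’s a rough sketch of the three decay styles as small helper functions. The threshold and decay constants are placeholders I chose for illustration, not values from the original program:

```python
MIN_EPSILON = 0.0
EPSILON_DECAY_VALUE = 1.0 / 20000   # placeholder linear decrement
REWARD_THRESHOLD = -150.0           # hypothetical "good enough" episode reward

def decay_linear(epsilon):
    # Subtract a fixed amount every episode (what this article uses)
    return max(MIN_EPSILON, epsilon - EPSILON_DECAY_VALUE)

def decay_exponential(epsilon, factor=0.999):
    # Multiply by a factor slightly below 1 every episode
    return max(MIN_EPSILON, epsilon * factor)

def decay_reward_based(epsilon, episode_reward):
    # Only shrink epsilon once the agent earns a good enough reward ("responsibility")
    if episode_reward > REWARD_THRESHOLD:
        return max(MIN_EPSILON, epsilon - EPSILON_DECAY_VALUE)
    return epsilon
```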

Now we have to initialize the Q-table, right?

Not quite. What are the dimensions of our table? Recall that the state has 2 values: position and velocity. How many possible states are there? Infinitely many. We can have a velocity of 0, 0.01, 0.001, 0.0001 and so on. Since our state information is continuous (as opposed to discrete), we have to quantize that information. This means we have to sort all possible states into a certain number of buckets. This gives us a finite number of possible states.

Setting up our discrete observation space and defining a helper function to access it
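A sketch of that step, continuing from the setup above (20 buckets per state dimension is my choice; the original may use a different number):

```python
# 20 buckets for position and 20 for velocity (an assumption; tune as needed)
DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE

def get_discrete_state(state):
    """Map a continuous (position, velocity) state to a tuple of bucket indices."""
    discrete_state = (state - env.observation_space.low) / discrete_os_win_size
    return tuple(discrete_state.astype(int))
```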

Now we initialize the q-table:
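Something along these lines; initializing in the range -2 to 0 is a common choice because every per-step reward is -1, but the exact range is an assumption on my part:

```python
# Shape: one q-value per action for every discretized state, e.g. (20, 20, 3)
q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))
```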

Now the agent is ready to start learning!
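The training loop itself isn’t shown above, so here’s a minimal sketch of how it might look, building on the earlier snippets (env, q_table, get_discrete_state and the constants). It assumes the older gym API, and names like new_discrete_state are mine:

```python
for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False

    while not done:
        # Epsilon-greedy action selection: explore with probability epsilon
        if np.random.random() > epsilon:
            action = np.argmax(q_table[discrete_state])
        else:
            action = np.random.randint(0, env.action_space.n)

        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)

        if not done:
            # Standard Q-learning update for the action we just took
            max_future_q = np.max(q_table[new_discrete_state])
            current_q = q_table[discrete_state + (action,)]
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            q_table[discrete_state + (action,)] = new_q
        elif new_state[0] >= env.goal_position:
            # Reached the flag: the reward for this terminal transition is 0
            q_table[discrete_state + (action,)] = 0

        discrete_state = new_discrete_state

    # Linear epsilon decay until the halfway point
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon = max(0, epsilon - epsilon_decay_value)

env.close()
```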

This looks complicated at first glance, but it really isn’t. The while loop is where all the magic happens: at each step the agent takes an action, evaluates the reward it receives, and updates its Q-table, and every time the environment ends we return to the top of the for loop for the next episode. The gym documentation is there if you need a more granular understanding!

Where’s that huge unwieldy equation from earlier?

Python makes it significantly more readable:
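Here’s a sketch of that line, using the names from the loop sketch above. It’s the same rule as the formula, just rearranged into a weighted average of the old estimate and the new target:

```python
# (1 - alpha) * old estimate + alpha * (reward + gamma * best future q-value)
new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
```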

I don’t know about you, but that looks much more elegant than before 🙂

Results and Wrap-up

How did the agent do?

Here’s the graph of the agent rewards (I decided to run for 40,000 episodes):

It looks like the max levels out at around 20,000 episodes

And here’s a gif showing the best-performing agent:

Cool, isn’t it?

Q-learning is a powerful tool, but it is entirely classical. It doesn’t make use of any of the powerful new tools (PyTorch, TensorFlow, etc.). It also can’t be used effectively for more complicated tasks, since their state spaces quickly become far too large to capture in a simple table. In my next article, I’m going to talk about Deep-Q-Networks.

Here’s something to think about: how can we use neural networks and combine them with Q-learning (hint: “Apples and Oranges”)?

Let’s return to the question in the introduction: how do we classify RL? At first glance, RL shows some similarities with unsupervised ML: the computer isn’t given any labelled examples, nor does it have a large labelled training set from which to learn a direct input-to-output mapping the way a supervised neural net does.

However, it does have quite a lot of training data, and — here’s the key — it has an indication of how it’s doing based on the rewards it gets. I don’t provide the agent with the optimal actions for many different situations, but I provide it with overarching rewards that indicate the agent’s success/failure. In reality, reinforcement learning is unique and can’t be easily placed and — in my opinion — is a completely different field.

Key Takeaways:

  • Reinforcement Learning is an up-and-coming field in ML
  • RL agents interact with their environments by way of actions
  • Rewards motivate the agent to take certain actions
  • Q-tables represent a “guidebook” for the agent to choose an action given a state
  • Q-tables are filled with the values of the outputs of the Q function
  • The Q function looks complicated! But it isn’t really.
  • The agent should take random actions early (exploration), and will take the optimal actions it discovers later (exploitation)
  • Q-learning is cool, but Deep-Q-Learning is cooler. Stay tuned.

Before you go

Thanks for reading! If you want to see more, you can check out my other articles. Part 2 of this series should be out in the coming days. Feel free to reach out to me at dron.h.to@gmail.com. Stay safe!