Original article was published by Aditya Mittal on Artificial Intelligence on Medium

# An AI Algorithm That Can Win Almost ANY Game

The year was 1997. It seemed that the world chess champion, Garry Kasparov, would easily hold his own against Deep Blue, a computer algorithm that couldn’t possibly comprehend the 10¹²⁰ possibilities that a chess game could unfold in. However, rather than shorting out due to the immense number of computations it had to perform, the supercomputer Deep Blue actually ended up defeating Kasparov in the match by a score of 3.5–2.5. How could a computer possibly make these moves without trying out every single possible combination?

# Intro to Reinforcement Learning

Enter **reinforcement learning**, a subsection of machine learning where computers learn from trial and error using past feedback to improve. Rather than using a common mapping of input and output like with supervised learning, reinforcement learning uses **rewards** for positive behavior and punishment for negative behavior. Computers “train” in a virtual environment where they are passed a certain state that they are in, and have to take a certain action that will maximize the reward that it will get.

An example can be seen in the game of Pong, a game where two players have paddles and continuously bounce the ball back to one another by hitting it with their paddles. The goal of the game is to defeat your opponent by making him miss a ball. The reward function in this case would be positive for hitting the ball back or making your opponent miss the ball, while the reward function would be negative for missing the ball.

When the computer first starts training, it chooses actions randomly from the **action space**, a set of actions that the agent can take within the game. For the game of Pong, the action space would be moving up or down. However, as the computer attempts to maximize its reward, it recognizes patterns in the game, such as if the ball is moving up, you should move your paddle up. Eventually, the game will be able to play at a decent level.

One thing to note is that the model attempts to maximize its cumulative reward, not its instantaneous reward. An analogy can be seen in the game of chess; capturing a pawn may seem like the best instantaneous move, but gaining a better positioning will provide a better cumulative reward.

# Reinforcement Learning Terms

**Term #1: **Markov Decision Processes (MDPs)

*Feel free to skip the first paragraph if you’re not interested in a technical definition for Markov Decision Processes.*

One of the frameworks used to generalize reinforcement learning problems are **Markov Decision Processes **(MDPs). In short, MDPs work in the following way: the process starts in a state *s, *and the machine will choose an action *a *from the action set *A(s)* available in state *s. *Then, the machine is moved to a new state *s’*, and is given a reward *R(s, s’) *depending on its initial and current state. The probability that the process moves into state *s’* given action *a *and state *s* is P(*s,* *s’ | a*), assuming that *s *and *a *are conditionally independent from past actions made.

Wow! That was a lot to unpack! If you’re interested in learning more about the mathematics behind MDPs, click here. MDPs are used to describe an environment in reinforcement learning, and can be used to formalize nearly any reinforcement learning problem. However, this approach falls short in real-world scenarios where environments are often not fully mapped out.

This is an example of a non-model-free reinforcement learning method, where a formal definition of the environment is necessary for the algorithm to run. In summary, these should only be used with a small amount of possible states and a simple environment.

**Term #2: Q-Learning**

**Q-Learning** is an example of a model-free approach that can be used for reinforcement learning. This method updates Q-values when an action *a* is done in state *s. *Similar to neural networks, Q-learning heavily relies on this value update rule; a more formalized definition can be seen below.

Note that Q-learning uses several hyperparameters that need to be set beforehand for the model, which makes it more complicated to train properly. Also note that there needs to be an estimate for the optimal future value, which is necessary as the machine tries to optimize the total cumulative reward, not just the instantaneous reward.

Methods for developing estimates for the future values include **Deep Q-Networks**, which use neural networks to estimate Q-values, and **Deep Deterministic Policy Gradient** (DDPG), which tackles the problem by learning policies in continuous, high dimensional action spaces.

# Reinforcement Learning in Games

One of the most significant and popular accomplishments of reinforcement learning lie in its potential to defeat complex games, such as DOTA 2. Currently, OpenAI Five, has learned by playing over 10,000 years of games against itself, and can defeat professional teams in DOTA 2, a multiplayer strategy game.

This feat is monumental: DOTA 2 is a much more complex game than chess, involving different sets of abilities, experience points, attacks, defensive moves. In overall, a character can perform over 100,000 possible actions, and each match has over 80,000 individual frames. Another issue is the fact that there are 5 players on each team, meaning that the AI has to work with other AIs to defend their base and attack the opponent’s base.

To meet this problem, OpenAI Five uses a deep reinforcement learning model that uses rewards, such as kills, deaths, and assists, to help them improve. Despite never being taught this, AIs have learned how to master professional tactics in the game, and how to prioritize their team reward over their individual reward. This shows that the AI has the capability to learn strategies and moves; it’s incredible how the AI has developed this technology after starting off with random moves.

Reinforcement learning has the power to disrupt many other technologies as well. It can be used for text summarization, chatbots, online stock trading, and much more. As Kasparov said:

“It’s about setting the rules. And setting the rules means that you have the perimeter. And as long as a machine can operate in the perimeter knowing what the final goal is, even if this is the only piece of information, that’s enough for machines to reach the level that is impossible for humans to compete.”

With the growth that reinforcement learning has been receiving in the last couple of years, Kasparov’s statement may soon become a reality, as reinforcement learning continues to disrupts hundreds of fields.

# TL;DR

- Reinforcement learning is a subset of machine learning that optimizes a machine’s actions by providing a reward function and training in a virtual environment.
- Markov Decision Processes (MDPs) allow us to formalize our understanding of the environment into an environment a reinforcement learning algorithm can run on.
- Q-Learning uses a model-free approach to reinforcement learning, and trains by updating q-values.
- Current innovations in reinforcement learning can be seen in OpenAI’s disruption in DOTA 2, where their AI algorithm is able to defeat some of the top players in the world.
- Reinforcement learning has a variety of applications, including chatbots and online stock trading.

# Further Reading

*For information about projects that I am currently working on, consider subscribing to my **newsletter**! Here’s the link to **subscribe**. If you’re interested in connecting, follow me on **Linkedin**, **Github**, and **Medium**.*