Human-Like Playtesting with Deep Learning

Source: Deep Learning on Medium



By Alex Nodet — Artificial Intelligence Engineer

King, like many other game companies, follows a free-to-play business model, which has become increasingly common in the gaming industry in recent years. The model's effectiveness depends on frequent releases of new in-game content.

As of October 2018, Candy Crush Saga offers more than 3,700 levels to its players, and 15 new ones are released every week. Offering high-quality content is one of our core values at King. Therefore it is important to make sure that every level we release is correctly balanced. One traditional way is to ask playtesters for feedback. However, it comes with limitations we will discuss later in this article.

This blog post introduces the paper Human-Like Playtesting with Deep Learning, authored by Gudmundsson, Eisen, Poromaa, Nodet, Purmonen, Kozakowski, Meurling and Cao from our Central AI team. It describes how we use Deep Learning in our automated playtesting pipeline, and shows the advantages of Artificial Intelligence over human-based playtesting. In the first section, we will focus on the Machine Learning model described in the paper. Next, we will explain how we turned the model into a production-grade application that everyone in the company can enjoy.

The Traditional Way and Its Limitations

Playtesting in games is used to understand the player experience and can have different perspectives — difficulty balancing and crash testing being two common examples (Gudmundsson et al.).

Adding a new level to Candy Crush Saga is a process that can be divided into three areas:

  1. Creation: level designers use their imagination and creative skills to make a new and fun puzzle. This is where the core idea of the level is created.
  2. Balancing: level designers reach out to other people or playtesting companies to make sure that their level is challenging. Designers often need to modify their level and tweak it in order to make it enjoyable for everyone.
  3. Release: the level is released and available worldwide in the game.

Balancing usually takes the most time and does not need all the creative talent our designers are capable of. We looked for ways to improve it and free designers from tedious tasks, and found that a lot of time was spent waiting for playtest results. We rely on those results to know how to tweak the new levels appropriately. Unfortunately, human playtesters need one week to test new levels to our requirements. This puts stress on our level designers, as they have to keep switching context between tweaking the current week’s levels and creating content for the next one.

The Benefits of Automated Playtesting

The idea behind automated playtesting is to create and teach virtual players to play King games as human players would. By running them at scale in the Cloud, level designers will get playtest feedback faster. With our approach using Deep Learning and Google Cloud Platform, we managed to reduce the waiting time from one week to a few minutes.

Virtual players enable more frequent feedback reports (red lines), at any stage of the balancing

Improving our content production pipeline has led to several benefits and new use-cases:

Better Quality Content

A faster playtest allows for more iterations of the new levels. It means that level designers can refine more quickly. Because playtesting is not time-consuming anymore, it is possible to get feedback right before release to make sure that all tweaks work as intended. Finally, level designers can focus on the same content throughout the day, reducing the context switching mentioned above which impacts creativity.

More Thorough and Stable Playtests

One issue with human playtesters is that, inherently, the more they play the better they get at the game. This introduces a bias into their feedback. Virtual players are version-controlled software and therefore avoid such bias. On top of that, because virtual players communicate directly with the game engine, their measurements are both more precise and more diverse.

A QA Byproduct

By building an automated playtesting platform for content balancing purposes, we actually created a QA byproduct for developers. They can use the platform to explore levels and find bugs. They can also check that new features don’t break the rest of the game. It is a powerful tool to increase the game’s quality as a whole.

Virtual Players and AI

In order to play Candy Crush Saga and provide meaningful feedback, a virtual player needs to understand the game and make decisions. At the core of the problem, the question it needs to answer is simple:

“Which action would a human player take on this game board?”

Although this question is simple, it is nonetheless challenging for artificial intelligence. Why? Here are a few reasons that distinguish Candy Crush Saga from traditional board games, such as chess or Go, where AI has reached competitive performance.

  • The game is non-deterministic. There are many sources of randomness involved in the game, e.g. the colour of candies falling from the top of the board. In chess or Go, the starting position is always the same and making a move has a deterministic outcome on the game.
  • Each level is unique. Although the number of objectives you can pursue is small, such as reaching a score or collecting ingredients, each level offers a different dynamic and is governed by a different set of game elements. For instance, conveyor belts, portals, gravity and the shape of a level greatly affect the strategy needed to pass it.
  • The state space is huge. A standard game board in Candy Crush Saga is a 9×9 grid, which is considerably smaller than a 19×19 Go grid. However, there are close to a hundred different kinds of game elements, far more than in Go (black, white) or chess (King, Queen, Rook, Bishop, Knight, Pawn). In addition, a game element’s behaviour depends on the other game elements it interacts with. On top of that, game elements may or may not stack on top of each other, which modifies their behaviour as well.

Different approaches are possible to determine which action should be played. Here are three different techniques that we tried, with their pros and cons.

Hard-Coded Heuristics

Heuristics of this kind are fast at runtime and cheap to write when you start a project. They also help you fill in the blanks during your integration tests. If the problem you are trying to solve is simple enough, they might even be a good mid-term solution. However, their accuracy is often mediocre and the embedded knowledge is biased towards what their developer values the most. On top of that, they are very expensive to maintain in the long term.

Monte-Carlo Tree Search (MCTS)

Tree-search algorithms allow you to explore different possibilities and plan ahead to reach the best outcome. In Candy Crush Saga, it is common to play small moves in order to make a better combo later in the game. This kind of planning is closer to how our players play Candy Crush Saga than simple heuristics. On the other hand, since it is based on simulations, it is much slower at runtime, by up to a factor of 1,000 in our case. In addition, reaching the best possible outcome amounts to imitating a superhuman player. This algorithm is therefore not the best choice, since our player base is, by definition, not superhuman.

Deep Supervised Learning

Supervised Learning algorithms learn from real-world data. When that data comes from real players, we avoid the superhuman problem MCTS has. These algorithms can also express heuristics that are hard to describe or code. On top of that, Deep Supervised Learning has proven its performance on classification problems. However, it needs a lot of samples to train properly, and a lot of data often comes with technical challenges, such as efficient storage and processing.

Working with large amounts of data is not a problem at King, and Deep Supervised Learning answers exactly the question virtual players are asking. It is the method we chose to use in production. However, it is good to keep in mind that hard-coded heuristics are more than enough to start building an infrastructure, and MCTS could be a good fallback solution if you don’t have any data at hand.

Our Deep Supervised Learning Solution

First, let’s annotate our question to turn it into a classification problem. We want to predict which action (label) a human player (data source) would take on this game board (input). Or, put into an image:

The input of the supervised learning algorithm is a Candy Crush Saga game board. The board is in a stable state, waiting for an action to be made. We are therefore solving a problem very similar to image classification, which is why our deep neural network relies heavily on convolution layers. We assumed the game has the Markov property: “That is, given the present [game board], the future does not depend on the past” (Markov process, n.d.). Indeed, all the information needed to make a move is contained in a game state. While a sequence model such as an LSTM could increase the performance of the classification algorithm, most of that performance is already captured by a feedforward deep neural network. Also, since the game is non-deterministic and involves many sources of randomness during play, there would be a lot of noise interfering with the training of a sequence model.

Before we dive deeper into the architecture of the neural network, let’s have a look at how the input and label are encoded.

Input Encoding

A Candy Crush Saga game board could be treated like an RGB image. However, our neural network would then understand the input only at the pixel level. A lot of resources and training time would be spent on learning that a blob of yellow pixels close to each other represents a yellow candy and that a candy is one atomic game element.

Thanks to the game engine, we can encode the input differently and prevent our Deep Learning model from wasting resources on learning something we already know. We represent the input as a 9×9 grid with 102 binary channels. Each channel is associated with one kind of game element and describes whether that element is present on each cell of the 9×9 grid. The input is now understood at the game-element level, which makes learning much faster and more efficient.
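As a rough illustration of this encoding, here is a minimal sketch in plain Python. The channel ids (`YELLOW_CANDY`, `RED_CANDY`, `JELLY`), the `Board` type and the `encode_board` helper are hypothetical names chosen for the example. The real mapping between game elements and the 102 channels is not described in this post.

```python
from typing import List, Set

BOARD_SIZE = 9       # 9x9 grid, as described above
NUM_CHANNELS = 102   # one binary channel per kind of game element

# Hypothetical channel ids, for illustration only.
YELLOW_CANDY = 0
RED_CANDY = 1
JELLY = 2

# board[row][col] holds the ids of every element present on that cell,
# which lets several elements stack on the same cell.
Board = List[List[Set[int]]]


def encode_board(board: Board) -> List[List[List[int]]]:
    """Return a [channel][row][col] tensor of 0/1 values."""
    channels = [
        [[0] * BOARD_SIZE for _ in range(BOARD_SIZE)]
        for _ in range(NUM_CHANNELS)
    ]
    for row in range(BOARD_SIZE):
        for col in range(BOARD_SIZE):
            for element_id in board[row][col]:
                channels[element_id][row][col] = 1
    return channels


# Example: an otherwise empty board with a yellow candy on jelly at (0, 0)
# and a red candy at (0, 1).
board: Board = [[set() for _ in range(BOARD_SIZE)] for _ in range(BOARD_SIZE)]
board[0][0] = {YELLOW_CANDY, JELLY}
board[0][1] = {RED_CANDY}
encoded = encode_board(board)
```

In practice the same tensor would more likely be built as a NumPy or TensorFlow array. Nested lists are used here only to keep the sketch free of dependencies.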

Gudmundsson et al., 2018 — An example game board of CCS encoded as 102-channel 2D input (only 7 channels shown).

Action Encoding

In order for the supervised learning model to train properly, it needs labels as a source of truth. We need to define what an action is and label it in a consistent way. However, the game boards in our training data do not contain a fixed number of available actions, since that number depends on the game elements present on the board.

To tackle that issue, we define an action as a swap between two cells on the game board. Then we use the one-hot encoded index of the edge between those cells as a label. It makes the action definition agnostic to the number of available actions in a particular sample.

Mapping between actions and labels

There are a few edge cases in Candy Crush Saga where the direction of the swap matters. One improvement to the model could be to take it into account and use 288 labels instead of 144, or use a second output to predict swap direction.
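The edge-based labelling can be sketched as follows. A 9×9 grid has 9 × 8 = 72 edges between horizontal neighbours and 8 × 9 = 72 edges between vertical neighbours, which gives the 144 labels. The ordering of the indices below (horizontal edges first, then vertical ones) is an assumption made for this example and is not necessarily the ordering used in the paper. `swap_to_label` and `one_hot` are hypothetical helper names.

```python
from typing import List, Tuple

BOARD_SIZE = 9
NUM_HORIZONTAL = BOARD_SIZE * (BOARD_SIZE - 1)  # 72 edges between left/right neighbours
NUM_VERTICAL = (BOARD_SIZE - 1) * BOARD_SIZE    # 72 edges between up/down neighbours
NUM_LABELS = NUM_HORIZONTAL + NUM_VERTICAL      # 144

Cell = Tuple[int, int]  # (row, col)


def swap_to_label(a: Cell, b: Cell) -> int:
    """Map a swap between two adjacent cells to an edge index in [0, 144).

    Assumed ordering: horizontal edges first, row by row,
    then vertical edges, row by row.
    """
    (r1, c1), (r2, c2) = sorted([a, b])  # the direction of the swap is ignored
    if r1 == r2 and c2 == c1 + 1:        # horizontal neighbours
        return r1 * (BOARD_SIZE - 1) + c1
    if c1 == c2 and r2 == r1 + 1:        # vertical neighbours
        return NUM_HORIZONTAL + r1 * BOARD_SIZE + c1
    raise ValueError(f"cells {a} and {b} are not adjacent")


def one_hot(label: int) -> List[int]:
    """One-hot encode an edge index as a vector of length 144."""
    return [1 if i == label else 0 for i in range(NUM_LABELS)]
```

With this scheme, swapping (0, 0) with (0, 1) and swapping (0, 1) with (0, 0) give the same label. A direction-aware variant would need one extra bit per edge, hence 2 × 144 = 288 labels.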

In the unlikely event that the network recommends an action that defies the game rules, we can detect it at runtime and pick the next most recommended action.
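A minimal sketch of that fallback, assuming the game engine can list the labels of the currently legal swaps. The `pick_action` helper and the `legal_labels` argument are hypothetical names and not part of any King API.

```python
from typing import Sequence, Set


def pick_action(probabilities: Sequence[float], legal_labels: Set[int]) -> int:
    """Return the highest-probability label that the game accepts as legal."""
    # Rank all labels from most to least recommended by the network.
    ranked = sorted(
        range(len(probabilities)),
        key=lambda label: probabilities[label],
        reverse=True,
    )
    for label in ranked:
        if label in legal_labels:
            return label
    raise ValueError("no legal action available on this board")


# Toy example with 3 labels instead of 144: label 1 is the network's
# favourite but is illegal, so the next most recommended label is played.
chosen = pick_action([0.1, 0.6, 0.3], legal_labels={0, 2})
```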

Network Architecture and Training

Since our problem is close to image classification, we focused on a feed-forward deep convolutional network architecture. As described above, the input for our network is a 9×9 grid with 102 binary channels, whereas the output is a probability distribution over 144 swap indices. We tried different architectures to achieve the best performance. You can find more details about them and the results we reached in our paper “Human-Like Playtesting with Deep Learning”.

Gudmundsson et al., 2018 — Each layer of our network architecture

Our dataset, before it is split, consists of approximately 12 million samples. It is ingested from our game servers into Google Cloud BigQuery. We used Python and TensorFlow to implement the neural network and trained it with Google Cloud Machine Learning Engine.

From Machine Learning to Production

Developing a production-grade ML product is not a trivial task. Even when the model’s prediction performance is satisfying, it is crucial to follow software and systems engineering best practices. These include traditional methods like versioning, monitoring and documentation. For ML applications, they also apply to the data pipeline that prepares and feeds your training or inference data to your model.

Many components are needed to make a service based on Machine Learning performant and viable in the long run. A resilient ML application requires both a mature data pipeline upstream and an efficient serving pipeline downstream.

Sculley, D., et al. “Hidden technical debt in machine learning systems” — Only a small fraction of real-world ML systems are composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.

Fortunately, King has a good history when it comes to handling massive amounts of data. As announced this summer, we moved our data warehouse to Google Cloud Platform. We chose to push our playtesting pipeline to GCP as well. Kubernetes Engine and PubSub will be the main components responsible for distributing the virtual players and their workload.

The Core of Our Playtesting Pipeline

Before running in the cloud, let’s start simple and assume we just want to play one game round with our ML model locally. The core of our pipeline is made of three applications:

  • The game: for example, Candy Crush Saga
  • The brain: our trained ML model
  • The agent: the bridge between the two previous parts

Both the game and the brain are accessible through REST APIs to ease both communication and their integration in a bigger system later.

At the beginning of a playtest session, the agent will send a level to the game. Then its task will be to retrieve the game state, ask the brain for a prediction and play the recommended action in the game. It will repeat that until the game round ends with success or loss. During the playtest session, the agent records all the metrics we are interested in, and exports them when the session ends.

Setup for playing a game round locally
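The agent’s loop can be sketched as follows. This is a simplified illustration under stated assumptions: the `Game` and `Brain` interfaces and their method names are hypothetical, and in-process objects replace the REST calls so that the example runs on its own.

```python
from typing import Dict, Protocol


class Game(Protocol):
    def load_level(self, level_id: str) -> None: ...
    def get_state(self) -> Dict: ...
    def play(self, action: int) -> None: ...
    def is_over(self) -> bool: ...
    def has_won(self) -> bool: ...


class Brain(Protocol):
    def predict(self, state: Dict) -> int: ...


def run_round(game: Game, brain: Brain, level_id: str) -> Dict:
    """Play one game round and return the metrics recorded by the agent."""
    game.load_level(level_id)
    moves = 0
    while not game.is_over():
        state = game.get_state()       # 1. retrieve the game state
        action = brain.predict(state)  # 2. ask the brain for a prediction
        game.play(action)              # 3. play the recommended action
        moves += 1
    return {"level": level_id, "moves": moves, "success": game.has_won()}


# Toy stand-ins so the sketch runs without a real game or model.
class CountdownGame:
    """Fake game that is won once three moves have been played."""

    def load_level(self, level_id: str) -> None:
        self.remaining = 3

    def get_state(self) -> Dict:
        return {"remaining": self.remaining}

    def play(self, action: int) -> None:
        self.remaining -= 1

    def is_over(self) -> bool:
        return self.remaining == 0

    def has_won(self) -> bool:
        return self.remaining == 0


class ConstantBrain:
    """Fake brain that always recommends swap index 0."""

    def predict(self, state: Dict) -> int:
        return 0


metrics = run_round(CountdownGame(), ConstantBrain(), "level-1")
```

Because `run_round` only depends on the two interfaces, either side can be replaced by a mock or by a different implementation without touching the loop.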

Why do we have a separate agent? Its role is very simple, so simple it could easily be integrated into the brain or the game.

In fact, the agent’s existence as a third, separate entity is essential. It allows the game and the brain to have a loose coupling between each other and with the rest of the pipeline, which will grow bigger in the next sections.

APIs, Docker and Kubernetes

Our three applications are packaged in separate Docker images. Using containers and APIs abstracts away their implementation details. It eases the prototyping, versioning, release and deployment of the different components of the core pipeline.

Having an ML model in a container makes it simple to use locally, which is useful to our game developers, who can treat it as a black box during game QA sessions. Another benefit is that their workflow will not change if we decide to replace the ML model with another decision algorithm. In addition, using containers and APIs unifies the way different games are integrated into the pipeline, which makes it much easier to support new games.

From game events to containerised ML model actionable by API

Last but not least, Docker containers and Swagger-defined APIs make it straightforward to work with mocks during integration tests. Therefore, we didn’t need a fully trained ML model, or a fully compatible game to start working on a scalable pipeline in the cloud.

Because our games involve randomness, we need to run hundreds of game rounds in parallel for each level to be able to calculate reliable statistics with the playtest metrics. In order to achieve that, we scale horizontally by creating Kubernetes Deployments in GKE.
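To see why hundreds of rounds are needed, consider estimating a level’s success rate from n independent rounds. The standard error of the estimate is sqrt(p(1 − p)/n), so the uncertainty shrinks only with the square root of the number of rounds. The sketch below uses the textbook normal-approximation interval. The choice of success rate as the metric and the `success_rate_interval` helper are assumptions made for this example.

```python
import math


def success_rate_interval(successes: int, rounds: int, z: float = 1.96):
    """Normal-approximation confidence interval for a level's success rate.

    Returns (rate, lower bound, upper bound); z = 1.96 gives roughly 95%.
    """
    p = successes / rounds
    half_width = z * math.sqrt(p * (1 - p) / rounds)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Same observed success rate (30%), increasing number of game rounds.
for rounds in (10, 100, 1000):
    rate, low, high = success_rate_interval(successes=rounds * 3 // 10, rounds=rounds)
    print(f"{rounds:>5} rounds: rate={rate:.2f}, interval=[{low:.3f}, {high:.3f}]")
```

Each tenfold increase in rounds narrows the interval by a factor of about sqrt(10) ≈ 3.2.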

In Kubernetes the smallest deployable unit is called a Pod, and contains one or more containers. The agent and the game scale together, so we group them in the same Pod definition. Several replicas will be created with the first Deployment. The brain is stateless, and its lifecycle is independent from the game and the agent. Thus it has its own Pod definition, Deployment, and replicas. The brain Pods are exposed through a Kubernetes Service to make sure the game-agent Pods can access them. Having both the brains and the game-agents in Kubernetes opens a door for Reinforcement Learning with Kubeflow.

The scaling of the brain Pods is handled by a Horizontal Pod Autoscaler. The game-agents, however, are scaled proactively through the Kubernetes API by the orchestration server (described in the next section). This allows smarter and faster scaling than the autoscaler provides. Benefits include, for example, better management of cold starts and prioritisation between games. We usually run a few hundred game-agents per new level.

Our production pipeline in Google Cloud Platform. Yellow: new level sent to the orchestration server. Blue: attempts on the new level. Red: game state and action prediction from the core pipeline. Green: playtest metrics. Black: control flow.

Task Queue, Orchestration and Accessibility

We used Cloud PubSub to make a distributed task queue. It will deliver the level and parameters for each game round attempt to the game-agents, and store the playtest metrics they export.
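The flow of messages can be illustrated with an in-memory stand-in. The sketch below deliberately uses Python’s standard `queue` module and not the Pub/Sub client library. The message fields (`level`, `attempt`, `max_moves`, `success`) are hypothetical, and the reported result is a placeholder, not real playtest data.

```python
import json
import queue

tasks = queue.Queue()    # stand-in for the topic carrying game round attempts
metrics = queue.Queue()  # stand-in for the topic carrying playtest metrics

# The orchestrator publishes one message per game round attempt.
for attempt in range(3):
    tasks.put(json.dumps({"level": "level-1", "attempt": attempt, "max_moves": 30}))

# A game-agent pulls tasks until none are left and publishes its metrics.
while True:
    try:
        task = json.loads(tasks.get_nowait())
    except queue.Empty:
        break
    # Placeholder result: a real agent would play the round here.
    result = {"level": task["level"], "attempt": task["attempt"], "success": True}
    metrics.put(json.dumps(result))
```

Serialising each task as a small self-describing message means any idle game-agent can pick up any attempt, which is what lets the rounds run in parallel.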

RBEA, our real-time analytics platform, aggregates and analyses these metrics. Our data scientists can investigate the metrics without much difference from their usual workflow. It also gives them a lot of flexibility regarding the data without having to modify the pipeline backend code.

We coordinate the different pipeline components with an on-premises orchestration server, which also offers an API. Level designers, data scientists and developers can use automated playtesting without worrying about resource management in the Cloud.

The pipeline is deployed with Deployment Manager. On top of the advantages of a scripted, versioned deployment process, our main use cases are creating pipelines to support new games and keeping development pipelines separate from production ones.

Summary

Congrats, you made it to the end! We hope you learned something or enjoyed reading about one of our usages of Machine Learning and Google Cloud Platform.

We saw what kind of advantages an automated playtesting platform brings compared to human-based playtests. Our level designers are now able to iterate several times per day to release better content and suffer less from context switching, and developers can use it for regression testing or bug hunts.

Deep Learning is at the core of this new tool and allows us to simulate players. It is backed by a distributed infrastructure running in the Cloud, which gives it the quality of production-grade software and makes it seamlessly accessible across King. As of October 2018, 25 million game rounds have been played with virtual players.

If you want to go deeper into the Machine Learning section, more details are available in our paper Human-Like Playtesting with Deep Learning (S. F. Gudmundsson, P. Eisen, E. Poromaa, A. Nodet, S. Purmonen, B. Kozakowski, R. Meurling, L. Cao, 2018).

As always, feel free to reach out if you have any questions or want to discuss!