Project Pendragon Part 2: A Reinforcement Learning Bot for Fate Grand Order

Fate Grand Order wallpaper that I passed through a YOLOv2 object detector I fine-tuned to detect human faces. It is somewhat surprising that it detects any cartoon faces…

In my previous post I outlined a bot, nicknamed Pendragon, that I built to play the mobile phone game Fate Grand Order (FGO). The core of the Pendragon bot is three neural networks: two classic CNNs and one siamese CNN. The bot uses the output of these networks to decide which of the five cards it was dealt that turn to play.

At the end of that previous post I discussed some possible next steps, and one of them was to use reinforcement learning to train a model to choose cards on its own without me having to lay out rules for it to follow. I thought this would be very difficult because I would have to build an environment for the bot to experiment in from scratch… but I went ahead and did that because I felt it would be an interesting project and worth the extra effort. This post will cover how I built a simplified FGO environment for the bot to play in, structured a neural network to decide which cards to pick, trained it using reinforcement learning, and my impressions of its learned gameplay.

After training the new reinforcement learning network and integrating it back into the framework I built to play FGO, I decided it needed a nickname. For now I am going with Pendragon Alter. FGO makes characters and then also evil “Alter” versions of them, which lets them make more characters without a lot of extra development (and boosts their revenue). So I figure my original Pendragon bot could use a, possibly evil, reinforcement learning counterpart.

As some background, FGO is a game where the player is dealt five cards every turn and picks three of those five as their actions for that turn. Each type of card has its own properties, and playing them in different combinations has different benefits. For example, playing three cards of the same color adds bonuses on top of the cards’ base statistics. The nature of the bonuses depends on the type of card. Arts cards (blue) charge powerful ultimate abilities, Buster cards (red) deal additional damage, and Quick cards (green) create the possibility of critical hits, which deal double damage. I try to explain the game mechanics that are important in this post, but feel free to check out the previous post for a more detailed overview of other general mechanics that make up FGO.

Pendragon Alter building an “Arts chain” picking 3 Arts cards

Building a Fate Grand Order Environment

Like many others I have used OpenAI’s gym library to train some simple reinforcement learning models using other people’s code, but part of my motivation for this project was to see what building a custom environment from scratch would be like. A lot of the inspiration for this came from reading a 2016 post by Scott Rome where he created an environment to train a network to play blackjack (here).

In FGO, the player usually has 3 characters on the field at any given time with around nine opponents spawning in over three waves. To simplify this I structured my environment as a one-vs-one battle where the goal of the two entities is to reduce the opponent to 0 hit-points by dealing damage every turn.

For now the basic structure of these battles is a loop in which every iteration consists of both teams dealing damage. The loop breaks when one team is dropped to 0 hit points.

The enemy team deals damage by randomly selecting a number within a range every turn. I figured that this simple solution would work for now and that it was more important to scale the enemy’s damage reasonably than to recreate how they deal it. The player team, on the other hand, deals damage based on which three-card combination they pick out of the five cards they are dealt from their initial card deck. I did this to more closely mirror how players fight battles in FGO.
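To make the loop concrete, here is a minimal sketch of a one-vs-one battle. It is not the code from the repo: the function name, the hit-point totals, the damage range, the player-acts-first ordering, and the placeholder player damage function are all assumptions chosen for illustration.

```python
import random


def run_battle(player_hp=100, enemy_hp=100, enemy_damage_range=(5, 15),
               player_damage_fn=None, rng=None):
    """Play one simplified one-vs-one battle; return True if the player wins."""
    rng = rng or random.Random()

    if player_damage_fn is None:
        # Placeholder: the real environment derives this from the cards picked.
        def player_damage_fn():
            return rng.randint(5, 15)

    while True:
        # Assumption: the player acts first each turn.
        enemy_hp -= player_damage_fn()
        if enemy_hp <= 0:
            return True
        # The enemy deals a random amount of damage within a fixed range.
        player_hp -= rng.randint(*enemy_damage_range)
        if player_hp <= 0:
            return False
```

In a fuller version, `player_damage_fn` would be replaced by logic that deals five cards, asks the agent for three of them, and computes damage from the chosen chain.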

See lines 275–432 in the GitHub repo for the implementation.

Once that loop was built I began to add in backend calculations and mechanics that FGO uses to amplify damage. I found the values for these calculations and mechanics (here).

Some examples of these backend mechanics:

  1. Noble Phantasms (NPs): powerful ultimate abilities that characters have in FGO, which typically deal lots of damage. NPs have to charge to 100% before they are used, and this is tracked in an “NP gauge”. The current charge is increased by playing certain cards. In my implementation, after the player’s NP gauge reaches 100% the NP is used on the next turn to deal additional damage. The ability for a player to charge their characters’ NPs is crucial for them to be able to successfully clear FGO battles.
  2. Critical hits: the percentage chance of landing a critical hit is based on the generation of “critical stars”. The rate at which critical stars are generated changes based on what cards you use. Stars are generated on turn N and consumed on turn N+1. The more stars there are, the higher the likelihood of dealing critical damage. While there is some randomness involved in getting a critical hit, landing one deals double damage, which is very valuable.
  3. Card placement in the chain: There are bonuses to placing a card in different locations of the three-card chain. For example, the first card in the chain applies bonuses to all cards in the chain, while the third card has its own properties increased significantly. This makes the first and third cards in the chain quite important. My Pendragon bot does not actually make good use of this mechanic. The new reinforcement learning bot does a much better job of this; see the final section for details.
  4. Card Chains: As I mentioned previously, there are bonuses to placing three cards of the same type (color) together because the cards’ statistics are amplified. This bonus, combined with the bonuses from card placement, helps to make card chains very powerful within the game and thus was essential to add into the game environment.

These additional mechanics can be found in lines 104–235 of
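As a rough illustration of how placement bonuses, same-type chains, and the NP gauge can interact, here is a small sketch. Every number in it is a made-up placeholder rather than a real FGO value, the function and constant names are hypothetical, and critical stars are left out to keep it short.

```python
# All numbers below are illustrative placeholders, not FGO's real values.
BASE_DAMAGE = {"Arts": 10.0, "Buster": 15.0, "Quick": 8.0}
NP_GAIN = {"Arts": 15.0, "Buster": 0.0, "Quick": 5.0}
POSITION_MULTIPLIER = (1.0, 1.2, 1.4)  # later slots amplify the card's own stats
FIRST_CARD_BUSTER_BONUS = 1.5          # a Buster opener boosts the whole chain
SAME_TYPE_CHAIN_BONUS = 1.2            # three cards of one type


def resolve_chain(cards, np_gauge=0.0):
    """Return (damage, new_np_gauge, np_ready) for a three-card chain."""
    if len(cards) != 3:
        raise ValueError("a chain is exactly three cards")
    damage = 0.0
    for slot, card in enumerate(cards):
        damage += BASE_DAMAGE[card] * POSITION_MULTIPLIER[slot]
        np_gauge += NP_GAIN[card] * POSITION_MULTIPLIER[slot]
    if cards[0] == "Buster":
        damage *= FIRST_CARD_BUSTER_BONUS
    if len(set(cards)) == 1:
        damage *= SAME_TYPE_CHAIN_BONUS
    # The gauge is capped at 100%; at that point the NP is ready to fire.
    np_gauge = min(np_gauge, 100.0)
    return damage, np_gauge, np_gauge >= 100.0
```

Even with placeholder numbers, the ordering effect shows up: opening with a Buster card produces more damage than opening with an Arts card for the same three cards.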

Structuring the Neural Network

Before I could integrate the network into my new game environment, I had to decide how to represent the action space for the model. In my first round of brainstorming I wanted the model to classify hands based on all the possible ways to build a chain out of the three different card types (Arts, Buster, Quick). But I hit a wall with this pretty quickly; see below.

Hand: “Arts”, “Quick”, “Quick”, “Buster”, “Arts”
output: “Arts”, “Arts”, “Arts”

How do I constrain the output such that it doesn’t try to play three Arts cards when only two are available?

I wasn’t able to think of an elegant way to do this so I had to go back to the drawing board and figure out a way to represent the action space such that every output node was always valid and it would just have to pick the best one.

The methodology I ended up settling on was to have the action space be all the possible permutations of picking three card slots out of the five possible card slots. My thought process here was that, while the card types change from hand to hand, there will always be five valid card slots. So the 60 possible permutations of picking 3 slots out of 5, where order matters, could be a good way to represent the action space. See below for the previous example’s hand; for this input, all outputs 1–60 are valid.

Hand: “Arts”, “Quick”, “Quick”, “Buster”, “Arts”
output_1: “Card 1”, “Card 2”, “Card 3”
output_2: “Card 1”, “Card 2”, “Card 4”
output_60: “Card 5”, “Card 4”, “Card 3”

This meant that any network I built would have 5 input nodes representing the five card slots and 60 output nodes representing all the ways to pick three cards out of five without replacement, where order matters. contains the full network code as well as how the network gets actions based on current game state and updates itself based on observed results.
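The 60-node action space described above can be enumerated with Python’s standard library. This is a small sketch rather than the code from the repo; the names `ACTIONS` and `action_to_cards` are my own, and slots are 0-based here while the post counts cards from 1.

```python
from itertools import permutations

NUM_SLOTS = 5
CHAIN_LENGTH = 3

# Each action is an ordered triple of distinct 0-based slot indices.
ACTIONS = list(permutations(range(NUM_SLOTS), CHAIN_LENGTH))


def action_to_cards(action_index, hand):
    """Translate an output-node index into the card types it would play."""
    return [hand[slot] for slot in ACTIONS[action_index]]


hand = ["Arts", "Quick", "Quick", "Buster", "Arts"]
print(len(ACTIONS))               # 60
print(ACTIONS[0], ACTIONS[-1])    # (0, 1, 2) (4, 3, 2)
print(action_to_cards(0, hand))   # ['Arts', 'Quick', 'Quick']
```

Because every triple uses distinct slots, any of the 60 indices is playable for any hand, which is what sidesteps the invalid-output problem.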

Initially I was concerned over whether or not the network would be able to generalize concepts like “playing three of the same card type is good” between different categories. Given two input vectors:

Hand_1: “Arts”, “Arts”, “Quick”, “Buster”, “Arts”
Hand_2: “Buster”, “Quick”, “Arts”, “Arts”, “Arts”

Would the network learn that picking the three Arts cards is a good tactic in both hands, even though choosing “Card 1”, “Card 2”, “Card 5” and choosing “Card 3”, “Card 4”, “Card 5” correspond to different output nodes? It did learn this behavior, which pleasantly surprised me.
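To see why this was a real concern, the sketch below lists which output nodes form an Arts chain in each of the two hands. The helper name `chain_actions` is hypothetical and the indices are 0-based.

```python
from itertools import permutations

# The same 60 ordered slot triples used as the action space.
ACTIONS = list(permutations(range(5), 3))


def chain_actions(hand, card_type):
    """Indices of output nodes that play three cards of `card_type`."""
    return [
        index for index, slots in enumerate(ACTIONS)
        if all(hand[slot] == card_type for slot in slots)
    ]


hand_1 = ["Arts", "Arts", "Quick", "Buster", "Arts"]
hand_2 = ["Buster", "Quick", "Arts", "Arts", "Arts"]
arts_1 = chain_actions(hand_1, "Arts")
arts_2 = chain_actions(hand_2, "Arts")
print(len(arts_1), len(arts_2))          # 6 6
print(set(arts_1).isdisjoint(arts_2))    # True
```

Each hand has six nodes that build an Arts chain (one per ordering of the three Arts slots), and the two sets do not overlap, so the network has to learn the chain concept separately for different slot layouts.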

Pendragon Alter constructing another Arts Chain by picking the node corresponding to cards 3, 2, and 4

Deep Q-Learning

In reinforcement learning there is no initial dataset. Instead, we allow a network to explore the action space of an environment, where it is rewarded or punished for its decisions, so that it can try to discover how to navigate the environment optimally.

The basic process is as follows:

  1. The Agent (the network in this case) is given the current state of the game. This could be the pixels of an Atari Pong game or whatever representation you choose. In my case it is a length 5 list of the cards the player was dealt that turn.
  2. The Agent chooses an action out of its action space. In the Pong example Andrej Karpathy has it as the probability of going up. With my FGO game it is which of the 60 possible card slot combinations is best. However, it should be noted that in reinforcement learning there is the idea of exploration vs exploitation: sometimes an action should be chosen at random rather than simply doing what the Agent thinks is best. This helps the Agent explore and find additional rewards that it would not have found if it had simply exploited the rewards it already knows.
  3. The action is fed into the environment, any rewards are collected, and the environment moves on to its next state, frame, or turn. Mechanically I do this by adding the rewards to the card slot combination that the network outputted. For a positive reward that category’s value in the output array increases, and the network will see that, given that input again, that particular category is a beneficial one.
  4. The Agent is updated based on the rewards it received. After the rewards are used to modify the output array, the network is trained on the initial input state with the modified output array as the target. This helps to reinforce good choices while taking bad choices into account as well.
  5. Rinse and repeat.
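The steps above can be sketched with a tiny stand-in agent. This is my reading of the process rather than the post’s network: a single linear layer in NumPy replaces the neural network, the integer card encoding, epsilon value, and learning rate are assumptions, and the class and method names are hypothetical.

```python
import numpy as np

# Assumption: each card type is encoded as a number, giving a length-5 input.
CARD_CODES = {"Arts": 1.0, "Buster": 2.0, "Quick": 3.0}
NUM_SLOTS = 5
NUM_ACTIONS = 60


def encode_hand(hand):
    """Turn a five-card hand into a length-5 input vector."""
    return np.array([CARD_CODES[card] for card in hand])


class LinearQAgent:
    """Stand-in for the network: one linear layer mapping a hand to 60 values."""

    def __init__(self, epsilon=0.1, learning_rate=0.05, seed=0):
        self.rng = np.random.default_rng(seed)
        self.weights = np.zeros((NUM_ACTIONS, NUM_SLOTS))
        self.epsilon = epsilon
        self.learning_rate = learning_rate

    def predict(self, state):
        return self.weights @ state

    def choose_action(self, state):
        # Exploration: occasionally pick a random action.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(NUM_ACTIONS))
        # Exploitation: otherwise pick the highest-valued action.
        return int(np.argmax(self.predict(state)))

    def update(self, state, action, reward):
        # Build the target by adding the reward to the chosen action's value.
        q_values = self.predict(state)
        target = q_values.copy()
        target[action] += reward
        # One gradient step on the squared error between prediction and target.
        error = target - q_values
        self.weights += self.learning_rate * np.outer(error, state)
```

A training loop would call `choose_action` on the encoded hand, play the chosen slots in the environment, and pass the observed reward to `update` before the next turn.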

This process cycle can be seen in where the Battle class calls the network to get actions and updates itself with rewards at the end of every iteration and game.

How to Reward a Bot for Good Behavior?

Excerpt from the official FGO comics making a joke about how players are addicted to rolling for new characters. It talks about rewards, so I figured I would include it.

This section took a lot of experimenting, but eventually I found a set of rewards that works well for this bot. Initially I only placed rewards on the final outcome of the battles: a +1 for winning and a -1 for losing. At this point the bots would demonstrate some knowledge of beneficial card placements, but none of the more advanced behavior like building card chains consistently. In these initial rounds of training the networks rarely got above a 30% win rate in the custom environment I built. The best run may have had a win rate of 50%, which still left a lot of room for improvement. So to adjust for this I added additional rewards into the game for successfully using the game’s more complex mechanics.

The first reward I added was for playing three cards of the same type (Arts, Buster, or Quick), which helped the models. After some testing I also added a reward for the bot building up its NP gauge to 100%. As stated before, card chains greatly amplify the effects of the individual cards, and NPs are often used by players to finish the battle in one shot, so I felt that both were important for the bot to use in order to be successful.
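A shaped reward along these lines might look like the sketch below. Only the +1 and -1 for winning and losing come from the post; the sizes of the chain and NP rewards, and all of the names, are illustrative guesses.

```python
# Win/loss values follow the post; the two bonus sizes are illustrative guesses.
WIN_REWARD = 1.0
LOSS_REWARD = -1.0
CHAIN_REWARD = 0.25
NP_FULL_REWARD = 0.25


def turn_reward(cards_played, np_gauge, battle_outcome=None):
    """Score one turn.

    cards_played: the three card types played this turn.
    np_gauge: NP charge (0-100) after the turn.
    battle_outcome: "win", "loss", or None if the battle continues.
    """
    reward = 0.0
    if len(set(cards_played)) == 1:   # three cards of the same type
        reward += CHAIN_REWARD
    if np_gauge >= 100:               # NP gauge fully charged
        reward += NP_FULL_REWARD
    if battle_outcome == "win":
        reward += WIN_REWARD
    elif battle_outcome == "loss":
        reward += LOSS_REWARD
    return reward
```

Keeping the bonus rewards smaller than the win reward is one way to nudge the bot toward chains and NP charging without letting those intermediate goals outweigh actually winning.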

One downside of doing this is that the bot will place emphasis on these actions, so I may be overly constraining it to match my own views on the game. In the future I may be able to fix this by adding a memory mechanic so the bot can track all of the actions and rewards from a full battle and then evaluate them together, rather than evaluating after every turn. However, with my current setup where I evaluate after every turn, I have found this reward structure to be effective.

With these additional rewards in place I was able to get the bots to around a 60–70% win rate while training, and with more tweaking the final network has around an 80% win rate after training for 50,000 battles. I benchmarked these win rates by tracking how the bot did over its most recent 10,000 games, so when I say 60% it means it won around 6,000 of the past 10,000 games, and so on.
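One simple way to track a win rate over the most recent games is a fixed-size queue. This is a sketch of that idea, not necessarily how the repo does it; the class name is hypothetical.

```python
from collections import deque


class RollingWinRate:
    """Track the fraction of wins over the most recent `window` games."""

    def __init__(self, window=10_000):
        # A deque with maxlen drops the oldest result automatically.
        self.results = deque(maxlen=window)

    def record(self, won):
        self.results.append(bool(won))

    @property
    def rate(self):
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)
```

Calling `record` after every battle and reading `rate` periodically gives the kind of "wins out of the past 10,000 games" figure quoted above.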

These win rates can likely be increased further, but I found the network had already learned some good and very interesting behaviors, which I will discuss below.

Pendragon Alter’s Results and Comparison to Original Pendragon bot

So being able to win 80% of the simulated battles I put it through is all well and good, but now we get to look at how it performs in the actual game.

For this I integrated the reinforcement learning bot into the framework I built to interact with the FGO mobile game in my previous post and had it play a number of matches.

I outlined previously that I felt a good strategy to approach FGO with would be to have the bot play card chains whenever available and then I had it play cards in the order of Arts, Buster, and Quick. I found that Pendragon Alter plays Arts and Buster cards and chains fairly consistently.

Pendragon Alter picking a Buster chain based on card input

However, while it does play Quick chains, it does not do so consistently. Instead, if a hand has three Quick cards plus two cards of the same other type, it will play a Quick card in card slot 2, with the other two cards in slots 1 and 3 (see below). This maximizes the effect of the two cards that are placed in slots 1 and 3.

Alter ignoring a Quick Chain to charge its NP using Arts cards in the first and third slot

If there are three Quick cards, one Arts card, and one Buster card, it mixes a Buster or Arts card into the second card slot rather than playing the three Quick cards to form a Quick Chain. This means that the bot forgoes the bonus of playing a Quick Chain (increased critical chance) to either deal more damage right now or charge its NP further.

Alter ignoring a Quick Chain to charge its NP using Arts cards in the second slot

This is very interesting to me for two reasons. First, it appears not to have developed a liking for playing Quick cards if it doesn’t have to. This may be because the 15-card deck I initialized it with contains only three Quick cards while having six Arts and six Buster cards, so it would be more used to playing Arts and Buster cards. The second interesting detail is that Pendragon Alter appears to have some intuition about the additional bonuses for playing cards in the first and third card slots, as I discussed above.

Pendragon Alter’s understanding of the first and third cards having additional bonuses also shows up in cases where no chains are available. In the case below it does not pick either Quick card, and instead builds a chain using two Buster cards and one Arts card. By placing Buster cards at the beginning and end of the chain, the chain will deal the most damage out of any card combination in that hand, while the middle Arts card will help it charge its NP gauge. So basically Pendragon Alter is trying to maximize its own damage while still charging its NP gauge.

Pendragon Alter maximizing damage by playing Buster cards in slots 1 and 3 while filling the 2nd card slot with an Arts card

When there are no card chains available, I would argue that Pendragon Alter is superior to my previous Pendragon bot because it makes better usage of card positioning in the chain. See below for how the two bots would play the hand seen in the previous GIF.

Hand: “Buster”, “Quick”, “Arts”, “Buster”, “Quick”
Project Pendragon Bot: “Arts”, “Buster”, “Buster”
Alter Pendragon Bot: “Buster”, “Arts”, “Buster”

Very similar, but the difference of swapping the first two card slots is important.

I built Project Pendragon to play all available Arts cards, then Buster, then Quick. This means that it will always play all of the Arts cards it has available, but it does not maximize the effect of the Arts card or the damage from the Buster cards. The way to maximize those two things is to play the hand that Pendragon Alter plays. The first Buster card increases the damage of the entire chain including the final Buster card, the middle Arts card causes more NP gain than an Arts card in the first slot, and the final Buster card will have its damage increased by both the first and third card slot bonuses.
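The original bot’s fallback rule can be written in a few lines. This sketch covers only the Arts, then Buster, then Quick ordering described above, not the chain detection, and the function name is hypothetical.

```python
# Lower number means the card type is played earlier.
PRIORITY = {"Arts": 0, "Buster": 1, "Quick": 2}


def rule_based_pick(hand):
    """Pick three cards by playing Arts first, then Buster, then Quick."""
    # sorted() is stable, so cards of the same type keep their hand order.
    return sorted(hand, key=PRIORITY.__getitem__)[:3]


hand = ["Buster", "Quick", "Arts", "Buster", "Quick"]
print(rule_based_pick(hand))  # ['Arts', 'Buster', 'Buster']
```

The rule always leads with the Arts card, so it can never produce the “Buster”, “Arts”, “Buster” ordering that Pendragon Alter chooses for this hand.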

So while I could add logic into Project Pendragon to account for this, it exposes a weakness in my initial algorithm and an area where the reinforcement learning trained Pendragon Alter beats out my initial bot.

Final Notes

So far these posts have been inspired by many other data scientists and their work: Sentdex’s Grand Theft Auto tutorial series, Chintan Trivedi’s post on how he trained a network to play FIFA, and the previously mentioned post by Scott Rome on training a network to play blackjack, to name a few. So thank you to everyone who contributes to the data science community to make it so rich in information!

While Pendragon Alter shows how some basic reinforcement learning can yield some interesting results, there is still room for improvement. As I mentioned, being able to take entire games into consideration instead of evaluating turn by turn would be an interesting extension of the bot.

In Part 1 of this series I built a rules-based bot powered by three neural networks to play Fate Grand Order. Part 2 shows how I built a, possibly evil, “Alter” reinforcement learning counterpart. So it seems only fitting that I write a Part 3 that shows how the two bots fare when placed head to head!

Source: Deep Learning on Medium