Predicting Hearthstone Arena scores — part 1

Source: Deep Learning on Medium

Hearthstone is indeed a fun game: the rules are simple enough for anyone to pick up quickly, the characters are funny, and the overall production quality is very high.

Now what is even more fun is machine learning! So can we combine the two for even greater fun? I know you are now thinking about a super-smart AI that takes you to Legend while you watch YouTube… well, that is not going to happen, at least not here. And honestly, who wants to play against (and possibly lose to) a machine all the time? So let’s do something that is still fun but hurts no feelings (Blizzard’s especially). With the new expansion only days away from release, we have to prepare for a completely new balance not just in constructed but in the arena scene as well. So the question is: knowing the existing cards and their scores, can we predict the scores of the new ones? Let’s find out.


Machine learning always starts with data, and so do we. To pursue our quest we need data, preferably in an easy-to-digest format, JSON for example. Thanks to the HSReplay team we can easily load the data of all cards via their public API. Since we are working with the arena, we can restrict ourselves to the collectible card set. For the tier scores we use rembound’s Arena-Helper on GitHub:

import json
import urllib.request

# Arena tier scores (URL omitted in the original)
with urllib.request.urlopen("") as url:
    arena_data = json.load(url)
print("There are %d cards with tier score" % len(arena_data))
# Change key to ID
arena_data = {x['id']: x for x in arena_data}

# Card data from the HSReplay public API (URL omitted in the original)
with urllib.request.urlopen(urllib.request.Request("", headers={'User-Agent': "Magic Browser"})) as url:
    card_data = json.load(url)
# Change key to ID, keeping only the cards that have a tier score
card_data = {x['id']: x for x in card_data if x['id'] in arena_data}

Feature transform

So we have cards with all their attributes (= features) and their corresponding tier scores (= labels), so we can try some form of supervised learning. Our assumption is thus that the attributes of a card predict its tier score independently of the other cards in our deck. We all know this is not true: the game builds heavily on card synergies and combos (though these are much less dominant in the arena), but for the sake of simplicity we will ignore this for the moment.

With that said, we need to build a model that learns to transform the attributes of a card into a scalar number. In machine learning this is called regression. However, machines can only operate on numbers, yet our cards feature a lot of text. Therefore, our first step is to vectorize each card, meaning we have to produce a one-dimensional vector from each card.

The logic is very simple: each scalar attribute is normalized to the range [0, 1], and each categorical attribute (e.g. type or class) is transformed into a one-hot encoded vector. These are zero vectors with as many elements as there are categories, with a 1 set at each element corresponding to an attribute present on the card. Our function looks something like this:

import mxnet as mx

# card_classes, card_types, card_rarities, card_races, card_mechanics,
# max_attack and max_health are global lists/values precomputed from the card set
def card2vec(card):
    # vectorize class
    class_idx = card_classes.index(card['cardClass'])
    class_vec = mx.nd.one_hot(mx.nd.array([class_idx]), len(card_classes)).squeeze()

    # vectorize type
    type_idx = card_types.index(card['type'])
    type_vec = mx.nd.one_hot(mx.nd.array([type_idx]), len(card_types)).squeeze()

    # vectorize attack
    attack_vec = mx.nd.array([card.get('attack', 0) / max_attack])

    # vectorize health
    health_vec = mx.nd.array([card.get('health', 0) / max_health])

    # vectorize rarity
    rarity = card['rarity']
    if rarity == 'FREE':  # Free gets the same ID as common as there is no difference in occurrence probability
        rarity = 'COMMON'
    rarity_idx = card_rarities.index(rarity)
    rarity_vec = mx.nd.one_hot(mx.nd.array([rarity_idx]), len(card_rarities)).squeeze()

    # vectorize race
    race_vec = mx.nd.zeros(len(card_races))
    if 'race' in card:
        race_vec[card_races.index(card['race'])] = 1

    # vectorize mechanics
    mechanics_vec = mx.nd.zeros(len(card_mechanics))
    if 'mechanics' in card:
        for m in card['mechanics']:
            mechanics_vec[card_mechanics.index(m)] = 1

    # Concatenate vectors
    sum_vec = mx.nd.concat(class_vec, type_vec, attack_vec, health_vec, rarity_vec, race_vec, mechanics_vec, dim=0)

    return sum_vec
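The function above relies on several globals (the category lists and the normalization maxima) built once from the full card set. A minimal sketch of how they could be collected, assuming `card_data` is the id-to-card dictionary loaded earlier (`build_vocabularies` is a hypothetical helper, not from the original code):

```python
def build_vocabularies(card_data):
    """Collect the category lists and normalization maxima used by card2vec."""
    cards = card_data.values()
    vocab = {
        'card_classes': sorted({c['cardClass'] for c in cards}),
        'card_types': sorted({c['type'] for c in cards}),
        # FREE is folded into COMMON inside card2vec, so it is excluded here
        'card_rarities': sorted({c['rarity'] for c in cards} - {'FREE'}),
        'card_races': sorted({c['race'] for c in cards if 'race' in c}),
        'card_mechanics': sorted({m for c in cards for m in c.get('mechanics', [])}),
    }
    # Missing attack/health (e.g. on spells) count as 0
    max_attack = max(c.get('attack', 0) for c in cards)
    max_health = max(c.get('health', 0) for c in cards)
    return vocab, max_attack, max_health
```

Sorting the categories keeps the one-hot positions stable between runs, which matters if you ever save the model and vectorize new cards later.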

Similarly, we normalize our labels and assign them to our data. Since neutral cards have different tier scores depending on the chosen hero class, we end up with 3917 card–label pairs for the 999 arena cards.

import re

def get_label(card_id, hero_idx):
    tier_score = arena_data[card_id]['value'][hero_idx]
    if tier_score != '':
        # remove all non-numeric characters
        tier_score = re.sub(r"\D", "", tier_score)
        return float(tier_score) / maximum_tier_score
    return None
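Pairing cards with labels then boils down to iterating over every card for every hero class and skipping combinations without a score. A sketch under the assumption that a `hero_classes` list exists and `vectorize`/`label_fn` stand in for `card2vec`/`get_label` (the helper name `build_dataset` is mine, not from the original code):

```python
def build_dataset(card_data, hero_classes, vectorize, label_fn):
    """Build card-label pairs; a neutral card yields one pair per hero class."""
    data, labels = [], []
    for card_id, card in card_data.items():
        for hero_idx in range(len(hero_classes)):
            label = label_fn(card_id, hero_idx)
            if label is None:
                continue  # no tier score for this class (e.g. off-class card)
            data.append(vectorize(card))
            labels.append(label)
    return data, labels
```

This is how 999 cards can expand into 3917 pairs: class cards contribute one pair each, while neutral cards contribute one pair per hero class.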

If you look closely you will notice two things:

  • We use Apache MXNet for our vectors
  • We do not use the text on the cards

Apache MXNet is my library of choice for matrix operations; feel free to use NumPy instead, the code should be very similar. The reason we are not using the text is that we rely instead on the mechanics attribute of the card, which is hidden from the user but used by the game. A mechanic can be as simple as “DEATHRATTLE” or as complex as “RECEIVES_DOUBLE_SPELLDAMAGE_BONUS”. As you will see later, it does not perfectly substitute for the text, but it provides a good start for our experimentation.

Measuring success

Modern machine learning is good but not perfect, so we need to be able to measure how good our model is. A simple metric would be how far our predictions are from the real values. This is nice, but we would like to emphasize larger deviations from the correct value, so I recommend using the mean squared error instead. Also, in order not to fool ourselves, I split our data pairs into a training (80%) and a validation (20%) set.
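The split and the metric can be sketched in a few lines of plain Python (the helper names and the fixed seed are my assumptions, not from the original code):

```python
import random

def train_val_split(data, labels, val_ratio=0.2, seed=42):
    """Shuffle indices so the split is not biased by card ordering, then cut 80/20."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_val = int(len(idx) * val_ratio)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    pick = lambda ids: ([data[i] for i in ids], [labels[i] for i in ids])
    return pick(train_idx), pick(val_idx)

def mse(predictions, targets):
    """Mean squared error: large deviations are penalized quadratically."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```

Fixing the random seed makes the split reproducible, so different models are compared on the same validation cards.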

Learning arena scores

It seems we are all set to train our computer, so let’s just throw some deep learning at it… uh-oh, not so fast! Deep learning is fancy and indeed powerful, but people often forget that it also needs a hell of a lot of data! We have fewer than 4000 data pairs, so why don’t we try something more classic first?

Let’s try linear regression first.


The MSE is 653.65; that’s terrible. Now what about Bayesian regression? 0.01303. OK, that’s a whole lot better: we are approximately 20 points off with each prediction. We can also try support vector machines of different complexity (linear, quadratic, polynomial).


No improvement, still around 0.013… it seems we really do need to try deep learning.
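The original code for these classic baselines is not shown, but with scikit-learn they could look like the sketch below (the function name, the model selection, and the default hyperparameters are my assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

def evaluate_classic_models(train_X, train_y, val_X, val_y):
    """Fit several classic regressors and report their validation MSE."""
    models = {
        'linear': LinearRegression(),
        'bayesian': BayesianRidge(),
        'svr_linear': SVR(kernel='linear'),
        'svr_quadratic': SVR(kernel='poly', degree=2),
        'svr_polynomial': SVR(kernel='poly', degree=3),
    }
    scores = {}
    for name, model in models.items():
        model.fit(train_X, train_y)
        scores[name] = mean_squared_error(val_y, model.predict(val_X))
    return scores
```

Because all models are scored on the same validation set with the same metric, the resulting numbers are directly comparable.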

Deep learning model

Again, we have very little data for this experiment, so we need to make our model as compact as possible; a single hidden layer perceptron should suffice. In MXNet Gluon our model looks something like this:

from mxnet.gluon import nn

def create_model(input_size):
    print('Model input size is %d' % input_size)

    model = nn.HybridSequential()
    model.add(nn.Dense(input_size, activation='relu'))
    model.add(nn.Dropout(0.5))  # dropout helps against overfitting; the rate here is an assumption
    model.add(nn.Dense(1))      # single output neuron for the predicted score

    return model

Essentially, we have a hidden layer consisting of as many neurons as we have inputs (60) and a single output neuron for our prediction. We also add a dropout layer, which helps with learning.


And 500 epochs (and a few minutes) later we get 0.0056. That’s a lot better: we are only about ±15 points off. Not terrific, but a good start.

Finally the scores

Now we can run our model on the cards of the new expansion to generate their predictions. Fortunately (or unfortunately) the new expansion introduces a few new mechanics, like Overkill, that we have to ignore since we have no data about them, so those predictions will be very inaccurate. Nevertheless, here is the table of our predictions using this very simple model.

Interestingly, the highest-scoring card is Bloodclaw, the new 1-mana paladin weapon. This is a good example of the model overestimating a card’s value due to the lack of text processing. Thus, in part two we will do some fancy NLP to replace the mechanics attribute with real text understanding and see if things get any better. Stay tuned!