Reducing the Artificial Neural Network complexity by transforming your data

Original article was published by Walinton Cambronero on Artificial Intelligence on Medium



Introduction

The need to reduce the complexity of a given model can have multiple drivers, e.g., improving computational performance. A model has usually been chosen because, after many iterations of training and testing, it is the one that produced the best results, so its complexity can’t be reduced arbitrarily. Research on this topic is active: for example, Sebastiaan Koning et al. (2019) discuss this same problem for the AstroNet model [1]. Their paper proposes two methods:

The first method makes only a tactical reduction of layers in AstroNet while the second method also modifies the original input data by means of a Gaussian pyramid

The second method (modifying or transforming the input data) is common practice. According to Google’s Crash Course on Machine Learning, transformations are done primarily for two reasons:

  1. Mandatory transformations: make the data processable by a Machine Learning algorithm, e.g. converting non-numeric features into numeric.
  2. Optional quality transformations: help the model perform better, e.g. normalizing numeric features.

Both the transformation proposed by Koning et al. in [1] and the one proposed in this article fall into the second category.

Objective

In this article, I present a linear data transformation for the Poker hand dataset and show how it helps reduce the model complexity of a Multi-layer Perceptron (MLP) Neural Network. In a previous story, I talked about the Poker hand dataset and how to measure the performance of a classifier on it. A 3-layer MLP performed relatively well. Today I want to show that it is possible to obtain equivalent accuracy with a less complex model by simply understanding the data we’re working with and transforming it so that it’s more appropriate for the problem we’re trying to solve.

Dataset description

This particular dataset is very human-friendly. It uses an 11-dimensional description of poker hands, explicitly listing the suit and rank of each card along with the associated poker hand. Each data instance represents 5 cards. Each card has two attributes: suit and rank. The last attribute is the poker hand.

Encoding and example

The following is the encoding used in the dataset and one example. The dataset is very well-documented here in case you’d like more details.


Suit: 1: Hearts, 2: Spades, 3: Diamonds, 4: Clubs
Rank: 1: Ace, 2: 2, …, 11: Jack, 12: Queen, 13: King
Hand: 0: Nothing, 1: Pair, 2: Two pairs, …, 8: Straight Flush, 9: Royal Flush
Sample (Royal Flush of Hearts): 1,10,1,11,1,13,1,12,1,1,9
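To make the encoding concrete, here is a minimal sketch that decodes one row back into card names. The function name decode_hand and the lookup tables are illustrative helpers written for this article, not part of the dataset or its tooling; the hand label is left numeric because only some of the hand names are listed above.

```python
# Lookup tables copied from the encoding listed above.
SUITS = {1: "Hearts", 2: "Spades", 3: "Diamonds", 4: "Clubs"}
RANKS = {1: "Ace", 11: "Jack", 12: "Queen", 13: "King"}
RANKS.update({n: str(n) for n in range(2, 11)})  # number cards 2..10


def decode_hand(row):
    """Split an 11-value row into five card names and the hand label."""
    *cards, label = row
    pairs = zip(cards[0::2], cards[1::2])  # (suit, rank) for each card
    return [f"{SUITS[s]}-{RANKS[r]}" for s, r in pairs], label


cards, label = decode_hand([1, 10, 1, 11, 1, 13, 1, 12, 1, 1, 9])
print(cards, label)
```

Note that the cards come back in the order in which they were stored, which is exactly the ordering information the transformation in the next section discards.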

Transformation

The transformation is based on the fact that the order in which the cards appear in a hand doesn’t matter for classifying the hand, and that a more important attribute is the number of cards of the same rank or suit that appear in it. The original dataset model gives artificial importance to the order of the cards (samples are ordered lists of 5 cards) and does not explicitly encode the number of times a rank or suit appears, so this attribute has to be learned by the neural network. The premise is that, by making this attribute explicitly available in the data, a given neural network should be able to describe the classifier function better than the same network using the original model, in which the attribute is hidden.

Linear transformation

The following is a linear transformation from the original 11D space to a new 18D space. A linear transformation is preferable due to its reduced computational requirements. The new dimensions and descriptions are:

Attributes 0 through 12: The 13 different ranks, i.e. 0: Ace, 1: Two, 2: Three, …, 10: Jack, 11: Queen, 12: King.

Attributes 13 through 16: The 4 different suits, i.e. 13: Hearts, 14: Spades, 15: Diamonds, 16: Clubs.

Domain for both: [0–5]. Each dimension represents the number of times the rank or suit appears in the hand. There are 5 cards per hand, hence the maximum value is 5 (in practice a rank can appear at most 4 times, since a deck has only four cards of each rank).

Last dimension: (label): Poker hand [0–9] (remains unchanged).
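The mapping described above can be sketched for a single hand as follows. This is a plain-Python illustration under the column layout just listed; the function name transform_hand is hypothetical and the code is not taken from the author's notebook.

```python
def transform_hand(row):
    """Map an 11-value row (5 suit/rank pairs + label) to the 18-value layout."""
    *cards, label = row
    rank_counts = [0] * 13  # attributes 0-12: Ace ... King
    suit_counts = [0] * 4   # attributes 13-16: Hearts, Spades, Diamonds, Clubs
    for suit, rank in zip(cards[0::2], cards[1::2]):
        rank_counts[rank - 1] += 1  # original ranks are 1-based
        suit_counts[suit - 1] += 1  # original suits are 1-based
    return rank_counts + suit_counts + [label]


# Royal Flush of Hearts sample from the dataset description.
royal_flush = [1, 10, 1, 11, 1, 13, 1, 12, 1, 1, 9]
transformed = transform_hand(royal_flush)
print(transformed)
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 5, 0, 0, 0, 9]
```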

Encoding and example

The following is an example transformation for the same Royal Flush.

Representation in original dimensions (11D):

Data: 1,10,1,11,1,13,1,12,1,1,9
Encodes: Hearts-Ten, Hearts-Jack, Hearts-King, Hearts-Queen, Hearts-Ace, Royal-Flush

Representation in new dimensions (18D):

Data: 1,0,0,0,0,0,0,0,0,1,1,1,1,5,0,0,0,9
Encodes: 1st column = 1 Ace; 10th through 13th columns = one each of Ten, Jack, Queen and King; 14th column = 5 cards are Hearts; the 18th column still means a Royal Flush.

The new model represents any given combination of 5 cards the same way regardless of order, and it explicitly exposes information useful for poker hands, such as the number of cards of the same rank.
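The order-independence claim can be checked with a short NumPy sketch that transforms a whole array of hands at once. The function name transform_dataset is an assumption made for this illustration, and the approach shown (np.add.at on a zero matrix) is only one possible way to do the counting, not necessarily the one used in the notebook.

```python
import numpy as np


def transform_dataset(data):
    """Vectorised rank/suit counting for an (n, 11) integer array."""
    data = np.asarray(data)
    suits = data[:, 0:10:2] - 1  # (n, 5) suit codes, shifted to 0-based
    ranks = data[:, 1:10:2] - 1  # (n, 5) rank codes, shifted to 0-based
    rows = np.arange(len(data))[:, None]
    out = np.zeros((len(data), 18), dtype=int)
    np.add.at(out, (rows, ranks), 1)       # columns 0-12: rank counts
    np.add.at(out, (rows, suits + 13), 1)  # columns 13-16: suit counts
    out[:, 17] = data[:, 10]               # label is copied unchanged
    return out


hand = np.array([1, 10, 1, 11, 1, 13, 1, 12, 1, 1, 9])
# Same five cards, listed in reverse order.
cards = hand[:10].reshape(5, 2)
reordered = np.append(cards[::-1].ravel(), hand[10])

result = transform_dataset([hand, reordered])
print(result[0])
print(np.array_equal(result[0], result[1]))
```

np.add.at is used instead of plain fancy-index assignment because it accumulates correctly when the same rank or suit occurs several times in one hand.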

Tools

Scikit-learn, NumPy and Seaborn are used for machine learning, data processing and visualization, respectively.

Where is the code

A Jupyter notebook with both the MLP and the linear transformation code is available on my GitHub.

Results

In my previous story, I showed that an MLP with 3 hidden layers of 100 neurons each, with alpha=0.0001 and learning rate=0.01, achieves ~78% accuracy on the original dataset. These hyper-parameters were found by running an extensive grid search over a wide range of values, so the following measurements are based on these same values.
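In Scikit-learn terms, this configuration could be written roughly as follows. This is a sketch under assumptions: the helper name make_mlp is made up for illustration, "learning rate" is assumed to correspond to the learning_rate_init argument, and every other argument is left at the library default, which may differ from what the notebook uses.

```python
from sklearn.neural_network import MLPClassifier


def make_mlp(n_hidden_layers):
    """Build an MLP with the given number of 100-neuron hidden layers."""
    return MLPClassifier(
        hidden_layer_sizes=(100,) * n_hidden_layers,
        alpha=0.0001,             # L2 regularisation strength
        learning_rate_init=0.01,  # assumed meaning of "learning rate"
    )


mlp = make_mlp(3)
print(mlp.hidden_layer_sizes)
```

Keeping the layer count as the only parameter makes it easy to compare networks that differ only in depth while sharing the same hyper-parameters.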

Metrics

The MLP accuracy is measured with the F1 macro-average metric. This is an appropriate metric for the Poker hand dataset, as it deals nicely with the fact that this dataset is extremely imbalanced. From Scikit-learn’s documentation:

The F-measure can be interpreted as a weighted harmonic mean of the precision and recall … In problems where infrequent classes are nonetheless important, macro-averaging may be a means of highlighting their performance

Scikit-learn’s Classification Report is shown for the different experiments. It contains the macro-average F1 metric, among others.

In addition, the time spent in the MLP training for both cases is also shown.
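To see why the macro-averaged F1 is more informative than plain accuracy on imbalanced data, consider the toy example below. The labels are invented for illustration and have nothing to do with the Poker hand dataset; only the Scikit-learn functions are real.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Toy labels: class 1 is rare and the classifier misses one of its two samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean over classes
print(f"accuracy: {accuracy:.2f}, macro F1: {macro_f1:.2f}")
print(classification_report(y_true, y_pred))
```

A single mistake on the rare class barely moves the accuracy, but it pulls the macro average down noticeably because each class contributes equally to it.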

3-layer MLP with original data

Accuracy: For the 3-layer MLP and the original data (no transformation applied yet), a macro-average F1-score of ~78% is obtained.

Time spent in training: 20+ seconds

Classification report

              precision    recall  f1-score   support
           0       1.00      0.99      0.99    501209
           1       0.99      0.99      0.99    422498
           2       0.96      1.00      0.98     47622
           3       0.99      0.99      0.99     21121
           4       0.85      0.64      0.73      3885
           5       0.97      0.99      0.98      1996
           6       0.77      0.98      0.86      1424
           7       0.70      0.23      0.35       230
           8       1.00      0.83      0.91        12
           9       0.04      0.33      0.07         3
    accuracy                           0.99   1000000
   macro avg       0.83      0.80      0.78   1000000
weighted avg       0.99      0.99      0.99   1000000

2-layer MLP with transformed data

Accuracy: For the 2-layer MLP with the transformed data, a macro-average F1-score of ~83% is obtained. This is in fact a simpler neural-network model: it has the same number of neurons per layer and the same hyper-parameters, but one fewer hidden layer.

Time spent in training: 10–15 seconds

Classification report

              precision    recall  f1-score   support
           0       1.00      1.00      1.00    501209
           1       1.00      1.00      1.00    422498
           2       1.00      1.00      1.00     47622
           3       0.97      1.00      0.98     21121
           4       1.00      0.99      1.00      3885
           5       1.00      0.98      0.99      1996
           6       0.83      0.48      0.61      1424
           7       1.00      0.41      0.58       230
           8       0.38      0.75      0.50        12
           9       0.50      1.00      0.67         3
    accuracy                           1.00   1000000
   macro avg       0.87      0.86      0.83   1000000
weighted avg       1.00      1.00      1.00   1000000

1-layer MLP with transformed data

Accuracy: With a single hidden layer, the MLP with the transformed data achieved ~70% accuracy. Compared with the accuracy obtained by a single-layer MLP on the original dataset (~30%), it performs more than twice as well.

Time spent in training: ~10 seconds

Other experiments

Feel free to take a look at the Jupyter notebook that has the code and results for these and other experiments.

Conclusion

By applying a simple linear transformation that makes the dataset less human-friendly but more ML-friendly, the MLP model was simplified. Specifically, a hidden layer of 100 neurons was removed without compromising the performance of the classifier. The results show that the accuracy is similar to or better than that achieved by the more complex neural network, and the time spent on training was reduced by approximately 25% to 50%.
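One rough way to quantify the simplification is to count the trainable weights and biases of the two networks. The sketch below assumes fully connected layers and one output unit per poker-hand class (10 outputs); the helper count_parameters is written for this illustration and the numbers are not taken from the notebook.

```python
def count_parameters(n_inputs, hidden_layers, n_outputs):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layers, n_outputs]
    # Each layer has (inputs x outputs) weights plus one bias per output.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


original = count_parameters(10, (100, 100, 100), 10)  # 10 card attributes
simplified = count_parameters(17, (100, 100), 10)     # 17 count attributes
print(original, simplified)
# 22310 12910
```

Under these assumptions, the 2-layer network on the transformed data has roughly 42% fewer parameters, even though its input is wider.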

References

[1] Sebastiaan Koning, Caspar Greeven, Eric Postma (2019) Reducing Artificial Neural Network Complexity: A Case Study on Exoplanet Detection. https://arxiv.org/abs/1902.10385