“Can the Game be played on paper? (Part 3)”

Original article was published by Saurav Shenoy on Deep Learning on Medium

Here is another odd observation! Batch normalization often helps reduce overfitting, but in this case it did not affect variance at all, although it did reduce bias. Still, it may be useful: combined with other regularization methods, it may help reach an acceptable level of both bias and variance.

L1 performed better than L2 but was still unable to prevent overfitting (even with lambda as high as 0.8!). I attempted both together and prevented overfitting, but only by causing the training accuracy to drop to 65%. This is a steep trade-off, but likely the only way to prevent overfitting with so little data.

L1L2 regularization
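To make the combined penalty concrete, here is a minimal sketch of how an L1 plus L2 term is typically added to a loss. The function names, the example weights, and the example loss value are all hypothetical and are not taken from the model in this article. Conventions also vary between libraries (some scale the L2 term by 1/2), so treat this as an illustration of the idea rather than any framework's exact formula.

```python
def l1_l2_penalty(weights, l1=0.0, l2=0.0):
    """Return l1 * sum(|w|) + l2 * sum(w^2) over a flat list of weights."""
    l1_term = l1 * sum(abs(w) for w in weights)
    l2_term = l2 * sum(w * w for w in weights)
    return l1_term + l2_term


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Add the combined penalty to an already computed data loss."""
    return data_loss + l1_l2_penalty(weights, l1, l2)


# Hypothetical weights and data loss, for illustration only.
example_weights = [0.5, -1.0, 2.0]
example_loss = regularized_loss(0.6, example_weights, l1=0.8, l2=0.8)
```

With a lambda as large as 0.8, the penalty can easily dominate the data loss, which is consistent with the drop in training accuracy described above.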

This may seem counterintuitive, but increasing the number of layers and units per layer should increase training accuracy, and the same regularization methods may then lead to higher validation accuracy as a result. It could also lead to more overfitting in the first place, so that the validation accuracy stays the same, but I think dropout is more likely to work as the number of trainable parameters in the network increases.
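As a rough sense of how quickly the number of trainable parameters grows with depth and width, the sketch below counts the weights and biases of a fully connected network. The layer widths are assumptions chosen for illustration; they are not the architectures actually trained here.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


# Hypothetical architectures for a 4-input model with one output unit.
shallow = [4, 8, 8, 1]
deep = [4, 32, 32, 32, 32, 32, 32, 1]

shallow_params = dense_param_count(shallow)
deep_params = dense_param_count(deep)
```

Under these assumed widths, the deeper network has dozens of times more parameters than the shallow one, which gives dropout more units to work with but also gives the model far more capacity to memorize a small dataset.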

Unfortunately, this did not work: preventing overfitting required dropping to an even lower accuracy:

7 layer model

I will try one last variation before moving on. Although it was concluded earlier that 3 seasons of data is best before performance drops off, adding more training data to a deep neural network can reduce overfitting, so as a last-ditch attempt I will use 10 seasons of data. This attempt was also to no avail, although the model did train much faster (150 epochs to reach the same accuracy).

So logistic regression was actually able to perform better than the deeper neural network, most likely due to the lack of data.

Using the same intuitions gained from the 4-input model, I was able to test the 8-input model much faster, but again the neural network was unable to beat logistic regression.


  1. The emphasis on “Big Data” is borne out here: the small dataset makes neural networks much more difficult to train
  2. The Adam optimizer proved versatile: it worked better than simple batch gradient descent, even on such a small dataset
  3. You could continue to gather more seasons of data, but the game evolves over time, which likely changes the weights and offsets the reduction in overfitting that extra data would bring; so in this case, more data does not necessarily help
  4. Overfitting was a HUGE problem with such a small dataset, and it proved too large to solve. The network did reach 90% training accuracy, but the trade-off needed to reduce overfitting was much, much larger than in a typical application of a neural network
  5. Logistic regression performed better than the neural network, meaning a simple linear relationship between the inputs suffices
  6. Even though the model was unable to reach the desired 70%, 67% is pretty good!
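
For readers curious about the difference behind point 2, here is a minimal sketch of one plain gradient descent step next to one Adam step for a single scalar parameter. The toy loss (w - 3)^2, the learning rate of 0.1, and all function names are assumptions made for illustration; the other defaults are the commonly quoted Adam hyperparameters, not values confirmed for the models in this article.

```python
import math


def gradient_descent_step(w, grad, lr):
    """Plain gradient descent: move against the gradient."""
    return w - lr * grad


def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    state holds the step counter t and the moment estimates m and v.
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def grad_fn(w):
    """Gradient of the toy loss (w - 3)^2, chosen only for illustration."""
    return 2.0 * (w - 3.0)


w_gd, w_adam = 0.0, 0.0
adam_state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(500):
    w_gd = gradient_descent_step(w_gd, grad_fn(w_gd), lr=0.1)
    w_adam = adam_step(w_adam, grad_fn(w_adam), adam_state, lr=0.1)
```

The key difference is that Adam divides each step by a running estimate of the gradient's magnitude, so the step size adapts per parameter instead of depending directly on the raw gradient scale.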

Stay tuned: in the next series I will try different methods (SVMs, Random Forest).