Deep Learning: Why is my validation loss lower than my training loss?

Original article was published on Artificial Intelligence on Medium

I first became interested in studying machine learning and neural networks in late high school. Back then there weren’t many accessible machine learning libraries — and there certainly was no scikit-learn.

After every school day I would hop on the bus home, and within 15 minutes I would be in front of my laptop, studying machine learning, and attempting to implement various algorithms by hand.

I rarely stopped for a break, more than occasionally skipping dinner just so I could keep working and studying late into the night. During these late-night sessions I would hand-implement models and optimization algorithms (and in Java of all languages; I was learning Java at the time as well). And since they were hand-implemented ML algorithms by a budding high school programmer with only a single calculus course under his belt, my implementations were undoubtedly prone to bugs.

I remember one night in particular.

The time was 1:30 AM. I was tired. I was hungry (since I skipped dinner). And I was anxious about my History test the next day which I most certainly did not study for.

I was attempting to train a simple feedforward neural network to classify image contents based on basic color channel statistics (i.e., mean and standard deviation).

My network was training…but I was running into a very strange phenomenon:

My validation loss was lower than training loss!

In this article, you will learn the three primary reasons your validation loss may be lower than your training loss when training your own custom deep neural networks.

How could that possibly be?

  • Did I accidentally switch the plot labels for training and validation loss? Potentially. I didn’t have a plotting library like matplotlib so my loss logs were being piped to a CSV file and then plotted in Excel. Definitely prone to human error.
  • Was there a bug in my code? Almost certainly. I was teaching myself Java and machine learning at the same time — there were definitely bugs of some sort in that code.
  • Was I just so tired that my brain couldn’t comprehend it? Also very likely. I wasn’t sleeping much during that time of my life and could have very easily missed something obvious.

But as it turns out, it was none of the above cases — my validation loss was legitimately lower than my training loss.

It took me until my second year of college, when I took my first formal machine learning course, to finally understand why validation loss can be lower than training loss.

And a few months ago, author Aurélien Geron posted a tweet thread that concisely explains why you may encounter a validation loss lower than your training loss.

I was inspired by Aurélien’s excellent explanation and wanted to share it here with my own commentary, ensuring that no students (like me many years ago) have to scratch their heads and wonder “Why is my validation loss lower than my training loss?!”.

Why is my validation loss lower than my training loss?

First, let’s take a look at what a “loss” is when training neural networks.

Figure 1: What is the “loss” in the context of machine/deep learning?

At the most basic level, a loss function quantifies how “good” or “bad” a given predictor is at classifying the input data points in a dataset.

The smaller the loss, the better a job the classifier is at modeling the relationship between the input data and the output targets.

That said, there is a point where we can overfit our model — by modeling the training data too closely, our model loses the ability to generalize.

We, therefore, seek to:

  1. Drive our loss down, thereby improving our model accuracy.
  2. Do so as quickly as possible, with as few hyperparameter updates/experiments as we can.
  3. All without overfitting our network and modeling the training data too closely.

It’s a balancing act and our choice of loss function and model optimizer can dramatically impact the quality, accuracy, and generalizability of our final model.

Typical loss functions (also called “objective functions” or “scoring functions”) include:

  • Binary cross-entropy
  • Categorical cross-entropy
  • Sparse categorical cross-entropy
  • Mean Squared Error (MSE)
  • Mean Absolute Error (MAE)
  • Standard Hinge
  • Squared Hinge

A full review of loss functions is outside the scope of this article, but for the time being, just understand that for most tasks:

  • Loss measures the “goodness” of your model
  • The smaller the loss, the better
  • But you need to be careful not to overfit
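As a quick illustration of the “smaller loss is better” idea, here is binary cross-entropy computed by hand with NumPy. The predictions and labels below are made-up toy values, not from the article’s experiments — the point is simply that a confident, correct classifier receives a smaller loss than an unsure one:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])

good_preds = np.array([0.9, 0.1, 0.8, 0.95])  # confident and correct
bad_preds = np.array([0.6, 0.4, 0.5, 0.55])   # correct but unsure

# The better predictor earns the smaller loss
print(binary_cross_entropy(y_true, good_preds))
print(binary_cross_entropy(y_true, bad_preds))
```

The same “lower is better” logic applies to every loss function in the list above; only the formula changes.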

Reason #1: Regularization applied during training, but not during validation/testing

Figure 2: The first reason is that regularization is applied during training but not during validation/testing.

When training a deep neural network we often apply regularization to help our model:

  1. Obtain higher validation/testing accuracy
  2. And ideally, to generalize better to the data outside the validation and testing sets

Regularization methods often sacrifice training accuracy to improve validation/testing accuracy — in some cases that can lead to your validation loss being lower than your training loss.

Secondly, keep in mind that regularization methods such as dropout are not applied at validation/testing time.

As Aurélien shows in Figure 2, factoring regularization into the validation loss (e.g., applying dropout during validation/testing time) can make your training/validation loss curves look more similar.
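You can see this effect with a tiny NumPy simulation of inverted dropout (the toy network and numbers below are illustrative, not a real training run). During training-mode forward passes, units are randomly zeroed and the survivors rescaled, which makes the measured loss noisier and larger; during inference the same network runs deterministically with dropout disabled:

```python
import numpy as np

rng = np.random.default_rng(42)

def forward(x, w, drop_rate=0.5, training=True):
    h = np.maximum(0, x @ w)  # ReLU hidden layer
    if training:
        # Inverted dropout: zero units at random, rescale the survivors
        mask = rng.random(h.shape) > drop_rate
        h = h * mask / (1.0 - drop_rate)
    return h.sum(axis=1)  # toy "prediction"

x = rng.normal(size=(256, 8))
w = rng.normal(size=(8, 16))
y = forward(x, w, training=False)  # treat the clean pass as the target

# MSE of noisy training-mode passes vs. the deterministic inference pass
train_mse = np.mean([(forward(x, w, training=True) - y) ** 2 for _ in range(20)])
test_mse = np.mean((forward(x, w, training=False) - y) ** 2)
print(train_mse, test_mse)
```

The training-mode loss is higher purely because dropout is active — the weights and data are identical in both passes.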

Reason #2: Training loss is measured during each epoch while validation loss is measured after each epoch

Figure 3: The second reason has to do with when the measurement is taken

The second reason you may see validation loss lower than training loss is due to how the loss values are measured and reported:

  1. Training loss is measured during each epoch
  2. While validation loss is measured after each epoch

Your training loss is continually reported over the course of an entire epoch; however, validation metrics are computed over the validation set only once the current training epoch is completed.

This implies that, on average, training losses are measured half an epoch earlier.

If you shift the training losses half an epoch to the left, you’ll see that the gaps between the training and validation loss values are much smaller.
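A small simulation makes the measurement gap concrete. Assume a hypothetical loss curve that decays smoothly within one epoch (the curve and batch count below are made up): the reported training loss is the running average over all batches, while the validation loss is computed once, at the end of the epoch, when the model is already better:

```python
batches_per_epoch = 100

def loss_at(step):
    # Hypothetical loss curve: decays as training progresses
    return 1.0 / (1.0 + 0.05 * step)

epoch_losses = [loss_at(s) for s in range(batches_per_epoch)]

reported_train_loss = sum(epoch_losses) / batches_per_epoch  # averaged mid-epoch
reported_val_loss = loss_at(batches_per_epoch)               # measured at epoch end

print(reported_train_loss, reported_val_loss)
```

Even though both measurements come from the same loss curve with no difference in data difficulty, the end-of-epoch validation measurement is lower.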

Reason #3: The validation set may be easier than the training set (or there may be leaks)

The final most common reason for validation loss being lower than your training loss is due to the data distribution itself.

Consider how your validation set was acquired:

  • Can you guarantee that the validation set was sampled from the same distribution as the training set?
  • Are you certain that the validation examples are just as challenging as your training examples?
  • Can you assure there was no “data leakage” (i.e., training samples getting accidentally mixed in with validation/testing samples)?
  • Are you confident your code created the training, validation, and testing splits properly?
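A few cheap sanity checks on your splits catch most of these mistakes. The sketch below uses scikit-learn’s `train_test_split` on a toy dataset of made-up sample IDs and labels — the assertions are what matter, and they apply to any splitting code:

```python
from sklearn.model_selection import train_test_split

# Toy dataset: each sample gets a unique ID so we can audit the split
sample_ids = list(range(1000))
labels = [i % 2 for i in sample_ids]

train_ids, val_ids = train_test_split(
    sample_ids, test_size=0.2, stratify=labels, random_state=42)

# Sanity checks every split should pass:
overlap = set(train_ids) & set(val_ids)
assert len(overlap) == 0, "data leakage: samples appear in both splits!"
assert len(train_ids) + len(val_ids) == len(sample_ids), "samples lost or duplicated"
print(f"train={len(train_ids)} val={len(val_ids)} overlap={len(overlap)}")
```

Stratifying on the labels also helps ensure the validation set isn’t accidentally easier (or harder) than the training set due to class imbalance.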

Every single deep learning practitioner has made the above mistakes at least once in their career.

Yes, it is embarrassing when it happens — but that’s the point — it does happen, so take the time now to investigate your code.

BONUS: You may be over-regularizing your model. Try relaxing your regularization constraints: increase your model capacity (i.e., make it deeper with more parameters), reduce dropout, reduce L2 weight decay strength, etc.

Hopefully, this helps clear up any confusion on why your validation loss may be lower than your training loss!

It was certainly a head-scratcher for me when I first started studying machine learning and neural networks and it took me until mid-college to understand exactly why that happens — and none of the explanations back then were as clear and concise as Aurélien’s.

I hope you enjoyed this article!