Real Examples of Bias Variance Tradeoff in Deep Learning

Original article can be found here (source): Deep Learning on Medium


Ahhh… the bias-variance trade-off…

You have probably seen tons of articles written on this all over the internet by now.

This isn’t like one of them.

There seems to be a lack of deep learning examples showing what (high/low) bias and variance actually look like.

Hopefully this post fills that gap.

Here’s what I’ll be covering:

  1. A brief conceptual understanding of bias vs variance
  2. What the loss curves might look like in different scenarios
  3. How to mitigate them
  4. How to prioritize what to mitigate first — should you lower bias or variance first?

Bias and Variance in Machine Learning

The terms “Bias” and “Variance” actually have different meanings across industries.

In psychology, “Bias” could refer to a whole gang of cognitive biases, e.g. information bias, confirmation bias, attention bias etc.

Fun fact: I actually spent 4 years of my life getting a psychology degree. Yet here I am writing a post that has nothing to do with it. It’s funny how life turns out.

In mathematics, the term “Variance” refers to the average squared deviation from the mean.

In the context of machine learning, however, the two terms simply describe whether your trained model has learned too little (bias) or learned too much (variance).

In simple terms,

Bias = A simple model that under-fits the data


Variance = A complex model that over-fits the data

So what does this look like conceptually?

Adapting the axes and data points from Andrew Ng’s course, here’s what the different scenarios look like when I draw on the images.

Example of High Bias with Low Variance and Low Bias with High Variance
Example of High Bias with High Variance and Low Bias with Low Variance

I think the hardest concept to get your head around is the High Bias and High Variance scenario.

I’ve annotated the diagram to indicate the areas of high bias and high variance so you can see what it looks like conceptually.

So far so good yes?

If you’re looking for the math and derivations, there are tons of other posts that cover that. I won’t be covering those.

Now that you’ve got a brief understanding of what bias and variance mean in the context of machine learning, let’s go ahead and view examples of what they actually look like.

Examples of the Bias Variance Trade-off

Here are some actual screen captures of what the different bias vs variance scenarios look like.

These are the results of running multiple deep learning models to find a good example of each. 🙂

Low Bias and Low Variance

Example of Low Bias and Low Variance

Low Bias and High Variance

Example of Low Bias and High Variance

High Bias and Low Variance

Example of High Bias and Low Variance

High Bias and High Variance

Example of High Bias and High Variance

You have just seen four examples of what each combination might look like.

Could you tell which ones had high biases or high variances? What were you looking at to come to your conclusion?

Here’s what I look at to determine the biases and variances.

The first thing I look at is the fit error statistics.

In the above results, these are the two numbers labelled on the curves. Those numbers represent the minimum training/validation error across epochs.

If training error is high, let’s say at 15.39%, then you could say that your model is under-fitted or has high bias.

I will explain why “could” is in italics later in the post. There is an underlying assumption here.

If the validation error is high, let’s say 17.77%, AND the difference between the validation error and training error is relatively big (look at the white space between the lines), then you could say your model is over-fitted or has high variance.

In combination, if both training and validation errors are high, AND the difference between the two statistics is relatively big, then there is high bias and high variance.
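The rules above can be sketched as a small helper. This is only an illustration: the function name `diagnose_fit` and the two thresholds (`high_error`, `big_gap`) are my own assumptions, since what counts as "high" or "relatively big" depends on your task.

```python
def diagnose_fit(train_errors, val_errors, high_error=0.10, big_gap=0.02):
    """Label a run from its per-epoch training/validation errors.

    high_error and big_gap are illustrative thresholds, not fixed rules.
    """
    # Use the minimum error across epochs, as labelled on the curves.
    train_err = min(train_errors)
    val_err = min(val_errors)
    gap = val_err - train_err

    # High training error suggests under-fitting (high bias).
    high_bias = train_err >= high_error
    # High validation error plus a big gap suggests over-fitting (high variance).
    high_variance = val_err >= high_error and gap >= big_gap

    return {
        "train_error": train_err,
        "val_error": val_err,
        "gap": gap,
        "high_bias": high_bias,
        "high_variance": high_variance,
    }


# Per-epoch errors ending at the 15.39% / 17.77% figures used above.
result = diagnose_fit([0.30, 0.20, 0.1539], [0.32, 0.25, 0.1777])
print(result["high_bias"], result["high_variance"])
```

With these assumed thresholds, the run is flagged for both high bias and high variance. Change the thresholds and the labels change too, which is exactly why the baseline matters.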

Feel free to scroll up again to view the results one more time to confirm your understanding.

How do you mitigate bias and variance problems?

Here’s a general framework for addressing bias/variance in machine learning or, in this case, in your deep learning projects.

Flowchart on how to address bias and variance in deep learning projects

Here’s a concrete example of the effect regularization can have.

Example of the effects of regularization on a deep learning model

Sadly, regularization can sometimes leave you with the above scenario.

The model went from low bias, high variance to high bias, low variance.

In other words, by setting the L2 regularization strength to 0.001, I penalised the weights too much, causing the model to underfit.
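To make "penalising the weights" concrete, here is a minimal sketch of how an L2 penalty is commonly added to the loss. The weights and the data loss of 0.40 are made-up numbers, and the exact formula varies between libraries (some include a factor of 1/2), so treat this as an assumption rather than what my framework did internally.

```python
def l2_penalty(weights, l2_strength):
    """L2 penalty: the strength times the sum of squared weights."""
    return l2_strength * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l2_strength):
    """Total loss the optimizer sees: data loss plus the L2 penalty."""
    return data_loss + l2_penalty(weights, l2_strength)


# Hypothetical weights and data loss, for illustration only.
weights = [0.5, -1.2, 3.0]
for strength in (0.0, 0.0001, 0.001):
    print(strength, regularized_loss(0.40, weights, strength))
```

The larger the strength, the more the optimizer is rewarded for shrinking the weights instead of fitting the data. Push it too far and the model underfits.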

Prioritizing Mitigation Efforts

As you have seen in the flowchart above, you should almost always prioritize reducing bias first.

However, recall that I mentioned “could” in italics above and alluded to an underlying assumption.

Food for thought.

What is your bias measured against?

Think about it for a moment.

If you haven’t noticed, the basis for variance is your bias: validation error is judged against training error.

So what is the baseline for bias? Was 15.39% training error really bad?

It depends.

It could be bad for sure! Or… perhaps not.

The basis for bias comparison is the assumption of human-level performance for that particular task.

Suppose I told you that the task I was training on was a difficult Named Entity Recognition (NER) task, and that humans tend to get the labels wrong 15% of the time.


15% vs 15.39% isn’t too bad!

I would even say the model is a good fit!

The model is performing at roughly the level a human would.

Now, this raises a question.

How do you measure human-level performance?

A good way is to take a stratified sample of the training set and hide the labels. Then get a person or a group of people to painstakingly label the samples.

The result of this manual labeling task is your human-level performance.
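Here is a rough sketch of that procedure in plain Python. The helper names, the 50% sampling fraction, and the toy NER-style labels are all hypothetical. I am also assuming that "human-level error" means the share of sampled items where the human label disagrees with the existing label.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=0):
    """Pick the same fraction of example indices from every label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    chosen = []
    for label in sorted(by_label):
        idxs = by_label[label]
        k = max(1, round(fraction * len(idxs)))  # at least one per label
        chosen.extend(rng.sample(idxs, k))
    return chosen


def human_level_error(reference_labels, human_labels):
    """Fraction of sampled items where the human disagrees with the reference."""
    wrong = sum(r != h for r, h in zip(reference_labels, human_labels))
    return wrong / len(reference_labels)


# Toy data: 6 "PER" and 4 "ORG" examples.
labels = ["PER"] * 6 + ["ORG"] * 4
picked = stratified_sample(labels, fraction=0.5)
reference = [labels[i] for i in picked]

# Pretend the annotator got exactly one of the sampled items wrong.
human = list(reference)
human[0] = "LOC"
print(len(picked), human_level_error(reference, human))
```

In practice the `human` list would come from your annotators, not from copying the reference labels. The copy is only there so the sketch runs on its own.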

Now just for formalities and technicalities…

The difference between your training error and human-level error is what is known as “avoidable bias”.

Note: You might come across the term “Bayes Error” when doing research. Human-level error is commonly used as a proxy for Bayes error. In statistics, Bayes error is the lowest possible error for any classifier of a random outcome. It is analogous to an asymptote in mathematics; your goal is to get as close to it as possible but you know you will never reach it.

Now that that’s out of the way, let’s get back to the initial question on prioritization.

When should you focus on reducing bias vs reducing variance?

Machine Learning Error Analysis

To understand where to focus your time and effort, you will need to do an error analysis.

The following are examples of machine learning error analysis.

It’s pretty straightforward to comprehend.

An example of prioritizing bias

In this example, the avoidable bias is higher than variance; 4% vs 1%.

Therefore, your priority should be to reduce the bias in the model.

Now let’s say we change it up a little.

Which should you prioritize now?

An example of prioritizing variance

In this case, the variance is higher than the avoidable bias; 5% vs 1.5%.

If this is the case, then focus your time on reducing the variance.
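The comparison in both examples boils down to two subtractions. In the sketch below, the function name and the individual error values are hypothetical; I picked them only so that the gaps match the 4% vs 1% and 1.5% vs 5% figures above. Errors are in percentage points.

```python
def error_analysis(human_error, train_error, val_error):
    """Return (avoidable bias, variance, what to prioritize)."""
    avoidable_bias = train_error - human_error
    variance = val_error - train_error
    # On a tie this sketch defaults to "variance"; that choice is arbitrary.
    priority = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, priority


# Hypothetical errors giving 4% avoidable bias vs 1% variance.
print(error_analysis(human_error=1.0, train_error=5.0, val_error=6.0))

# Hypothetical errors giving 1.5% avoidable bias vs 5% variance.
print(error_analysis(human_error=1.0, train_error=2.5, val_error=7.5))
```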

Now, as of this point in the post, I’ve covered what I wanted to cover.

But since we are on the topic of error analysis, let’s take it one step further.

What do you do if this happens?

Example of high test error

I don’t know the name for the difference between test and validation error so I labelled it a “?”.

But notice that the model is doing pretty well on both training and validation sets.

However, it failed to perform on the test set.

Why is that?

Here are a few things to check if this happens to you:

  1. Does your test set come from the same distribution as your validation set? If you trained a cat classifier, is your test set full of pictures of tigers?
  2. Have you over-fitted on your validation set? Increase your validation set size. Remember to shuffle your data as well.
  3. Was the evaluation metric you used during training/validation a good indicator? If not, you will probably have to change the metric.
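For the first check, one crude starting point is to compare the label proportions of the two sets. The sketch below uses total variation distance (0 means identical proportions, 1 means no overlap). The function names and toy labels are my own. It only catches a shift in labels: it would not notice tiger pictures that are labelled "cat", so treat it as a first sanity check and not a full test for distribution mismatch.

```python
from collections import Counter


def label_proportions(labels):
    """Share of each label in a dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(labels_a, labels_b):
    """Half the summed absolute difference between label proportions."""
    p = label_proportions(labels_a)
    q = label_proportions(labels_b)
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in all_labels)


# Toy example: the test set is mostly a label the validation set never saw.
val_labels = ["cat"] * 8 + ["dog"] * 2
test_labels = ["cat"] * 2 + ["tiger"] * 8
print(total_variation_distance(val_labels, test_labels))
```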

There could be more reasons but these are the few things I would do.