Bias-variance decomposition: the story behind it

Source: Deep Learning on Medium

A gentle introduction

Most of us have run into the same general problems in machine learning projects: overfitting, underfitting, and so on. The naive way to solve them is to try every suggestion we can find on Google without analysing the situation or the parameters that matter: the nature of the learning algorithm, its characteristics, the quality of the data. In this article we will learn to identify the origin of a model's errors by understanding the “bias-variance decomposition”, so that you can then improve your model's accuracy. Take a breath, grab a cup of coffee, and let’s get to work 😊😊😊😉.

Definitions and notations

In the following, “P” is a distribution over pairs (x, y), where x denotes an element of the input set and y is the vector associated with x (i.e. the vector we should get as output when the classifier is given x as input, according to the distribution P). We denote by “A” a classification algorithm (imagine whatever you want: Decision Trees, Random Forest, Neural Networks, etc.). Finally, “h(D)” is the classifier we get by training the algorithm “A” on a dataset D = {(x1, y1), (x2, y2), …} drawn from the distribution “P”.

Note: for simplicity, we will use the squared error, squared_error(h(x), y) = [h(x) - y]^2, as the error function 😇😇😄😄😄

The expected test error of a classifier

Intuitively, to calculate the expected test error of a classifier h(D), we repeat the following process (call it alpha) a large number of times:

1. We draw a pair (x, y) from the distribution P (making sure that (x, y) does not belong to the dataset D).
2. We calculate the squared error for this pair.
3. We record the result.

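At the end of process alpha, all we need to do is average the recorded results. Formally, writing h_D for the classifier h(D) and (x_1, y_1), …, (x_n, y_n) for the n pairs we drew:

$$\frac{1}{n}\sum_{i=1}^{n}\left[h_D(x_i) - y_i\right]^2$$

More precisely, the expected test error of a classifier is given as follows:

$$E_{(x,y)\sim P}\left[\left[h_D(x) - y\right]^2\right] = \int_x \int_y \left[h_D(x) - y\right]^2 \Pr(x, y)\, dy\, dx$$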
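Process alpha can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy distribution P (y = 2x plus Gaussian noise) and the fixed classifier `h` are hypothetical stand-ins, not something defined in this article.

```python
import random

def sample_point(rng):
    """Draw one (x, y) pair from a toy distribution P (an assumption)."""
    x = rng.uniform(0.0, 1.0)
    y = 2.0 * x + rng.gauss(0.0, 0.1)  # signal 2x plus label noise
    return x, y

def h(x):
    """A fixed, already trained classifier h(D); hypothetical stand-in."""
    return 1.9 * x + 0.05

def squared_error(prediction, y):
    return (prediction - y) ** 2

def expected_test_error(classifier, n_draws=100_000, seed=0):
    """Monte Carlo version of process alpha."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x, y = sample_point(rng)                  # step 1: fresh pair from P
        total += squared_error(classifier(x), y)  # steps 2 and 3: compute, record
    return total / n_draws                        # average of the recorded errors

estimate = expected_test_error(h)
print(f"estimated expected test error: {estimate:.4f}")
```

The more pairs we draw, the closer the average gets to the true expectation; with too few draws the estimate itself becomes noisy.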

The expected classifier

As we said before, the classifier h(D) results from combining two components:

1. The classification algorithm “A”.
2. The dataset “D” that we used to fit this algorithm.

We can build many classifiers (h1, h2, h3, …) using different datasets (D1, D2, D3, …) drawn from the same distribution “P”, so we define the expected classifier as the average of these classifiers over all such datasets (writing h_D for h(D)):

$$\bar{h} = E_{D}\left[h_D\right]$$

The output of the expected classifier for an input vector x is then calculated with the formula:

$$\bar{h}(x) = E_{D}\left[h_D(x)\right]$$
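In practice we cannot average over every possible dataset, but we can approximate the expected classifier by training on many sampled datasets and averaging the outputs. The sketch below assumes a toy distribution (y = x² plus noise) and takes a least-squares line fit as the algorithm “A”; both choices are illustrative assumptions.

```python
import random

def sample_dataset(rng, n=20):
    """Draw a dataset D of n pairs from a toy distribution P (an assumption)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        data.append((x, x * x + rng.gauss(0.0, 0.1)))
    return data

def train_line(data):
    """Algorithm A (assumed here): ordinary least-squares line fit."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept  # the classifier h(D)

def expected_classifier(n_datasets=2000, seed=0):
    """Approximate the expected classifier by averaging many h(D)."""
    rng = random.Random(seed)
    classifiers = [train_line(sample_dataset(rng)) for _ in range(n_datasets)]

    def h_bar(x):
        return sum(h(x) for h in classifiers) / len(classifiers)

    return h_bar

h_bar = expected_classifier()
print(f"expected classifier at x = 0.5: {h_bar(0.5):.3f}")
```

Note that the expected classifier depends only on the algorithm and the distribution, not on any single dataset.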

Take a sip of your coffee and rest a little before we move on to the next part 😂😂😂😁😁😁.

The expected error of the algorithm “A”

Put simply, we can calculate this value by repeating the following process (call it beta) a large number of times:

1. We build a dataset “D” from the distribution “P”.
2. We get the classifier h(D) by training the algorithm “A” on the dataset “D”.
3. We use the alpha process described before to calculate the expected test error for this particular classifier.
4. We record the result.

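At the end of this process, all we need to do is average the recorded results to get the value we wanted. Formally (writing h_D for the classifier h(D)):

$$E_{D}\left[E_{(x,y)\sim P}\left[\left[h_D(x) - y\right]^2\right]\right] = E_{(x,y),D}\left[\left[h_D(x) - y\right]^2\right]$$

Using standard algebra and the properties of the expectation, we can expand this expression into a final form made of three terms, each of which represents a specific type of “error”. Here $\bar{h}(x) = E_{D}\left[h_D(x)\right]$ is the output of the expected classifier, and $\bar{y}(x) = E_{y \mid x}\left[y\right]$ is the expected label for a given x (a notation introduced for this formula):

$$E_{(x,y),D}\left[\left[h_D(x) - y\right]^2\right] = \underbrace{E_{x,D}\left[\left[h_D(x) - \bar{h}(x)\right]^2\right]}_{\text{variance}} + \underbrace{E_{(x,y)}\left[\left[\bar{y}(x) - y\right]^2\right]}_{\text{noise}} + \underbrace{E_{x}\left[\left[\bar{h}(x) - \bar{y}(x)\right]^2\right]}_{\text{bias}^2}$$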
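We can check numerically that the expected error of the algorithm splits into three terms (variance, noise, and squared bias). The sketch below is a toy experiment under assumptions: the distribution (y = x² plus Gaussian noise, so the expected label for x is x²) and the line-fit algorithm are hypothetical choices made for illustration.

```python
import random

def true_mean(x):
    return x * x  # expected label for x in this toy distribution (assumption)

def sample_point(rng):
    x = rng.uniform(0.0, 1.0)
    return x, true_mean(x) + rng.gauss(0.0, 0.1)

def train_line(data):
    """Algorithm A (assumed): least-squares line, returned as (slope, intercept)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rng = random.Random(0)
test_points = [sample_point(rng) for _ in range(1000)]
# Process beta: many datasets, one trained classifier per dataset.
models = [train_line([sample_point(rng) for _ in range(20)]) for _ in range(200)]

# Output of the (approximate) expected classifier at every test input.
h_bar = [mean(predict(m, x) for m in models) for x, _ in test_points]

total = mean((predict(m, x) - y) ** 2 for m in models for x, y in test_points)
variance = mean((predict(m, x) - hb) ** 2
                for m in models for (x, _), hb in zip(test_points, h_bar))
noise = mean((true_mean(x) - y) ** 2 for x, y in test_points)
bias_sq = mean((hb - true_mean(x)) ** 2 for (x, _), hb in zip(test_points, h_bar))

print(f"expected error          : {total:.4f}")
print(f"variance + noise + bias2: {variance + noise + bias_sq:.4f}")
```

The two printed values should agree up to a small sampling error, because the cross terms of the expansion only vanish exactly in expectation.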

Now it’s time to interpret each component and understand how it can influence the accuracy of a classifier 👌👌👌👌👌😎😎😎😎.

The first component: “the variance”

This value describes how sensitive the algorithm is to small changes in the data. Think about it: if we train the algorithm “A” on two different datasets D1 and D2 to get two different classifiers h1 and h2, how close will their outputs be when we give h1 and h2 the same vector x? In fact, this is the cause of the overfitting problem 😲😲😲😲: the algorithm cares too much about the details in the data, so it ignores the “main” characteristics and captures only the specific details of each sample during the training phase. That is why we get bad results when we show the classifier new observations: it can’t find the captured details in the new data.
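To make this concrete, we can measure how much the output at one fixed x changes from dataset to dataset for two algorithms. The sketch below is a toy experiment under assumptions: a rigid line fit and a very flexible 1-nearest-neighbour rule stand in for “A”, and the data distribution (y = x² plus noise) is hypothetical.

```python
import random

def sample_dataset(rng, n=20):
    """Draw a dataset from a toy distribution P (an assumption)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        data.append((x, x * x + rng.gauss(0.0, 0.1)))
    return data

def train_line(data):
    """Rigid algorithm: least-squares line fit."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def train_nearest_neighbor(data):
    """Flexible algorithm: copy the label of the closest training point."""
    return lambda x: min(data, key=lambda pair: abs(pair[0] - x))[1]

def prediction_variance(train, x=0.5, n_datasets=500, seed=0):
    """Variance of h(D)(x) over many datasets D, for a fixed input x."""
    rng = random.Random(seed)
    outputs = [train(sample_dataset(rng))(x) for _ in range(n_datasets)]
    mean_output = sum(outputs) / len(outputs)
    return sum((o - mean_output) ** 2 for o in outputs) / len(outputs)

variance_line = prediction_variance(train_line)
variance_nn = prediction_variance(train_nearest_neighbor)
print(f"line fit variance         : {variance_line:.4f}")
print(f"nearest neighbour variance: {variance_nn:.4f}")
```

The nearest-neighbour rule memorises individual noisy samples, so its output at the same x moves much more between datasets than the line fit does.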

The second component: “the noise in the data”

Think carefully about this 🧐🧐🧐🧐: (x, y) is a pair in which x denotes an element of the input set and y is the vector associated with x (i.e. the vector we should get as output when the classifier is given x as input, according to the distribution P). What I mean is that the association “x <-> y” is not necessarily correct, which sometimes happens in real life: during data preparation, a sample may be given a wrong label. This component describes the noise in the data.
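A small simulation shows why this component cannot be removed by a better classifier. The sketch below is only an illustration: it assumes a single fixed input x whose correct label is 1, and a hypothetical annotation process that records the wrong label 10% of the time.

```python
import random

def observed_label(rng, flip_probability=0.1):
    """Label recorded for one fixed x whose correct label is 1 (toy setup)."""
    true_label = 1.0
    if rng.random() < flip_probability:
        return 1.0 - true_label  # annotation mistake: wrong label recorded
    return true_label

rng = random.Random(0)
labels = [observed_label(rng) for _ in range(100_000)]

# Expected label for this x: the best possible prediction under squared error.
y_bar = sum(labels) / len(labels)

# Noise term: the squared error that even this best prediction still makes.
noise = sum((y_bar - y) ** 2 for y in labels) / len(labels)

print(f"expected label: {y_bar:.3f}")
print(f"noise term    : {noise:.3f}")
```

Whatever algorithm we choose, its expected error on this x cannot go below the noise term; only cleaner labels can reduce it.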