Chapter 5: Machine learning basics

Source: Deep Learning on Medium

Chapter 5: Machine learning basics (part 1)

This story summarizes my intuition from the Deep Learning book by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

If you haven't read my previous stories on the earlier chapters, please go check them out.

This chapter begins by explaining linear regression. Easy, huh? I know, but this book really goes beyond the simple linear equation we have seen in many tutorials. So let's begin with the linear regression equation:

y = beta * X

where y is the target value we need to predict, beta is the slope (in other words, the weight) and X is the feature we have from the dataset. The weight gets tuned until it gives us good results for y. So how do we measure the goodness of this simple model? Have you ever heard about cross entropy and been shocked? Yes, likewise. Cross entropy just measures the mismatch between the distribution of the actual data and the distribution the model defines. For linear regression the model's distribution is assumed to be Gaussian, centered on the prediction. So imagine a Gaussian bell curve on one side and a scatter of dataset points on the other.

What the model does during training is try to match the two as closely as possible, without mimicking the exact distribution of the training data, so that we don't overfit. Under the Gaussian assumption, minimizing the cross entropy is the same as minimizing the mean squared error, i.e. the average squared L2 distance between the predicted and the actual points.
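To make the training idea concrete, here is a minimal sketch in Python with NumPy. The synthetic data, the true slope of 2, and the names `mse` and `beta_hat` are assumptions made for illustration; they do not come from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 2 * x plus Gaussian noise.
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + rng.normal(0.0, 0.1, size=200)


def mse(beta, x, y):
    """Mean squared error of the predictions beta * x against the targets y."""
    return np.mean((beta * x - y) ** 2)


# Closed-form least-squares solution for a single weight with no intercept.
beta_hat = np.sum(x * y) / np.sum(x * x)

print(f"estimated beta: {beta_hat:.3f}")
print(f"training MSE:   {mse(beta_hat, x, y):.4f}")
```

The estimated weight should land close to the slope used to generate the data, and no other value of beta gives a lower training MSE on this dataset.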

By the way, we can influence whether the model is going to overfit or underfit by controlling its capacity.

In fact, a model's capacity is its ability to fit a wide variety of functions. For example, we can increase the capacity of this linear regression model by including polynomial terms. This enlarges its hypothesis space, and thus its complexity, and makes it more likely to overfit if the given dataset is small.
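Here is a rough sketch of that effect, assuming a noisy sine curve as the hidden function and only 10 training points (both choices are mine, for illustration). It fits polynomials of degree 1, 3 and 9 with `numpy.polynomial.Polynomial.fit` and reports the training and test errors.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_fn(x):
    """Hidden function the data is drawn from (assumed for this demo)."""
    return np.sin(2 * np.pi * x)


# A deliberately small training set, plus a dense noise-free test grid.
x_train = np.sort(rng.uniform(0.0, 1.0, 10))
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, 10)
x_test = np.linspace(0.0, 1.0, 200)
y_test = true_fn(x_test)


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    poly = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    return train_mse, test_mse


results = {d: fit_and_score(d) for d in (1, 3, 9)}
for d, (tr, te) in results.items():
    print(f"degree {d}: train MSE = {tr:.4f}, test MSE = {te:.4f}")
```

The training error can only go down as the degree grows, because each larger hypothesis space contains the smaller ones. The test error is the number to watch: with so few points, the highest degree typically chases the noise.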

Here’s a plot that shows the optimal model capacity for a given dataset.

So what actually measures model capacity? Here the Vapnik-Chervonenkis dimension, or VC dimension, comes onto the scene. The VC dimension of a classifier is defined by Vapnik and Chervonenkis as the cardinality (size) of the largest set of points that the classification algorithm can shatter, that is, label correctly under every possible assignment of binary labels. Here's an article that goes deeper if you are interested.
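To get a feel for "shattering", here is a small brute-force sketch for a toy family of classifiers that I picked for illustration (it is not an example from the book): one-dimensional thresholds h_t(x) = 1 if x >= t, else 0. The helper names `threshold_labelings` and `is_shattered` are hypothetical.

```python
from itertools import product


def threshold_labelings(points):
    """All labelings of `points` achievable by h_t(x) = 1 if x >= t else 0."""
    pts = sorted(points)
    # It is enough to try one threshold below all points, one between each
    # pair of neighbours, and one above all points.
    candidates = [pts[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    candidates += [pts[-1] + 1.0]
    return {tuple(int(x >= t) for x in points) for t in candidates}


def is_shattered(points):
    """True if every one of the 2^n binary labelings is achievable."""
    all_labelings = set(product((0, 1), repeat=len(points)))
    return threshold_labelings(points) == all_labelings


print(is_shattered([0.5]))       # True: a single point can get either label
print(is_shattered([0.2, 0.8]))  # False: (1, 0) needs the left point above t only
```

Since one point can be shattered but no pair of points can, this toy family has VC dimension 1.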

Parametric vs non-parametric models?

Unlike linear regression, logistic regression, and SVMs, where the form of the model is pre-defined, non-parametric learners do not have a model structure specified a priori. We don't speculate about the form of the function f that we're trying to learn before training the model, as we did previously with linear regression. Instead, the model structure is determined purely from the data. In other words, linear regression has a fixed set of parameters, the weights (beta), whose number does not depend on the size of the dataset. KNN with a single neighbor works differently: it predicts y for Xtest by returning the y associated with the training point Xtrain that is nearest to Xtest. Here's the math formula:

y = y_i, where i = argmin_i ||Xtrain_i - Xtest||²

So here the prediction depends directly on the stored dataset, and there is no fixed set of parameters.
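The nearest-neighbor rule is short enough to sketch directly in NumPy. The function name `nearest_neighbor_predict` and the tiny dataset are made up for illustration.

```python
import numpy as np


def nearest_neighbor_predict(X_train, y_train, x_test):
    """Return the target of the training row closest to x_test (squared L2)."""
    sq_dists = np.sum((X_train - x_test) ** 2, axis=1)
    i = np.argmin(sq_dists)  # index of the nearest training point
    return y_train[i]


# Toy training set (assumed values): three 2-D points with their targets.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y_train = np.array([10.0, 20.0, 30.0])

print(nearest_neighbor_predict(X_train, y_train, np.array([0.9, 1.2])))  # 20.0
```

Notice there is no training step at all: the "model" is just the stored dataset, which is why it grows with the data instead of having a fixed set of weights.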


Bias is how far off, on average, our estimate of the weights is. The bias equation is bias(Theta_hat) = E(Theta_hat) - Theta, so the bias of an estimator Theta_hat is the expected value of the estimate minus the actual Theta.

So if the bias is zero, we say that the estimator is unbiased. I will give an example from the book to make things clear.

Consider a set of samples {x(1), ..., x(m)} that are independently and identically distributed according to a Bernoulli distribution with mean Theta.
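A natural estimator for this setup is the sample mean. The short derivation below is a sketch of why it is unbiased; the symbol \hat{\theta}_m for the estimator is notation I am assuming here.

```latex
\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}

\operatorname{bias}(\hat{\theta}_m)
  = \mathbb{E}[\hat{\theta}_m] - \theta
  = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\big[x^{(i)}\big] - \theta
  = \frac{1}{m} \cdot m\theta - \theta
  = 0
```

Each sample has expectation Theta because that is the mean of the Bernoulli distribution, so the average of m samples has expectation Theta too and the bias is zero.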


Variance, in plain English, is how much the variation in the dataset changes our estimator (weight) and thus affects our model's performance. We of course want an estimator with low bias and low variance.

In the end, the mean squared error of a given estimator (weight) is the sum of its squared bias and its variance: MSE = E[(Theta_hat - Theta)²] = bias(Theta_hat)² + var(Theta_hat)
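A quick simulation can check this decomposition numerically. This is only a sketch under my own assumptions: the true mean 0.3, the sample size, and the deliberately biased estimator (sum + 1) / (m + 2) are all chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3       # true Bernoulli mean (assumed value for the demo)
m = 20            # samples per dataset
trials = 50_000   # number of simulated datasets

# Each row is one dataset of m Bernoulli(theta) draws.
samples = rng.binomial(1, theta, size=(trials, m))

# A deliberately biased estimator: it pulls the sample mean toward 0.5.
estimates = (samples.sum(axis=1) + 1) / (m + 2)

bias = estimates.mean() - theta
variance = estimates.var()  # spread of the estimator across datasets
mse = np.mean((estimates - theta) ** 2)

print(f"bias^2 + variance = {bias**2 + variance:.6f}")
print(f"MSE               = {mse:.6f}")
```

The two printed numbers agree up to floating-point rounding, because the decomposition is an algebraic identity and not just an approximation.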

Here bias goes with underfitting: we got our parameters wrong, so we missed important information in the data. Variance goes with overfitting: when it increases, the estimator captures every detail of the training data, noise included, and generalizes relatively poorly.

This chapter will be completed in another story, where I will talk about the differences between frequentist and Bayesian theory, and about supervised and unsupervised algorithms from the inside!