Original article was published by Asitdubey on Artificial Intelligence on Medium
Supervised learning is best understood with the help of the bias-variance trade-off. The main aim of any supervised learning model is to estimate the target function that maps input variables to an output. Supervised learning algorithms analyze data by looking at previously observed outcomes: they take past actions and their known outcomes, learn from them, and predict the likely outcomes of future actions.
Every supervised learning algorithm works on previously known, labelled data; labelled here means that the correct output is given for every example. The algorithm is trained repeatedly on that labelled data, and the machine then uses what it learned to predict outcomes for new inputs. These predictions tend to resemble the past outcomes, which helps us make decisions about events that have not occurred yet. Weather forecasting, stock market and house/property price prediction, email spam detection, recommendation systems, self-driving cars, churn modelling and sales forecasting all rely on supervised learning.
In supervised learning, you supervise the learning process: the data you have collected is labelled, so you know which input should be mapped to which output. It is the process of making an algorithm learn to map an input to a particular output using those labelled datasets. If the mapping is correct, the algorithm has learned successfully. Otherwise, you make the necessary changes to the algorithm so that it can learn correctly. Once trained, a supervised learning algorithm can make predictions for new, unseen data that we obtain in the future. It is the same as a teacher-student scenario.
A teacher teaches the students from a book (the labelled dataset); the students learn from it and later take a test (the algorithm's predictions). If a student fails (overfitting or underfitting), the teacher tunes the student (hyperparameter tuning) to perform better next time. But there is a big gap between the ideal condition and what is possible in practice, because no student (algorithm) or teacher (dataset) can be 100 percent correct. In the same way, every model, and every dataset fed into a model, has advantages and disadvantages. Datasets may be imbalanced, contain many missing values, be badly shaped or sized, or contain many outliers, all of which make a model's task harder. Similarly, every model has weaknesses and makes errors when mapping inputs to outputs. I will talk about the errors that prevent models from performing at their best and how we can overcome them.
Before proceeding with model training, we should know about the errors (bias and variance) related to it. Knowing about them not only helps us train better models but also helps us deal with underfitting and overfitting.
Prediction error is of three types:
1. Bias error
2. Variance error
3. Irreducible error
Let’s deal with Bias first.
1. Bias Error
Bias comes from the simplifying assumptions a model makes to make the target function easier to learn. It is the difference between the model's average prediction and the correct value we are trying to predict. Bias occurs when an algorithm has too little flexibility to capture the true relationship in the data, so its predictions are systematically off. It arises from wrong assumptions made by the algorithm.
Wikipedia states, “… bias is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).”
Bias measures how far an algorithm's predictions are, on average, from the correct values. If an algorithm consistently makes wrong predictions, it has high bias; the more accurate its predictions are on average, the lower the bias. Think of it as biased judgement in people: the more biased we are towards a person, the more likely we are to make wrong assumptions about them, and vice versa.
According to Forman’s article:
“Bias is the algorithm’s tendency to consistently learn the wrong thing by not taking into account all the information in the data (underfitting).”
Generally, linear algorithms (an example of parametric algorithms) have high bias, which makes them fast to learn and easy to understand but less flexible. In turn, they have lower predictive performance on complex problems that fail to meet the algorithm's simplifying assumptions. Parametric algorithms such as linear models have a fixed number of parameters, independent of the number of training examples.
Bias can be high or low:
High Bias: suggests more assumptions about the form of the target function. E.g., Linear Regression, Logistic Regression, Linear Discriminant Analysis.
Low Bias: suggests fewer assumptions about the form of the target function. E.g., SVC, KNN, Decision Tree.
If we fit a Linear Regression model to a non-linear dataset, then no matter how much data we collect, a straight line cannot model the non-linear curve. This is underfitting.
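A minimal sketch of this underfitting behaviour, using only the Python standard library. The quadratic target y = x², the noise level and the sample sizes are assumptions chosen for illustration and do not come from the original article. If the reasoning above holds, the training error of the best straight line should stay far above the noise variance however many points we add.

```python
import random

NOISE_STD = 0.5  # assumed noise level; noise variance is 0.25


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def line_mse(xs, ys):
    """Mean squared error of the best-fitting straight line on (xs, ys)."""
    slope, intercept = fit_line(xs, ys)
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(42)
mse_by_n = {}
for n in (50, 500, 5000):
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [x * x + rng.gauss(0, NOISE_STD) for x in xs]  # non-linear target
    mse_by_n[n] = line_mse(xs, ys)

for n, mse in mse_by_n.items():
    print(f"n={n:5d}  MSE of straight-line fit = {mse:.2f}  (noise variance = {NOISE_STD ** 2})")
```

More data does not help here because the error comes from the model's assumption (a straight line), not from a shortage of examples.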
2. Variance Error
Variance is the amount by which the estimate of the target function changes when different training data is used. Think of apples: the same variety tastes different depending on where it is grown.
From EliteDataScience: “Variance refers to an algorithm’s sensitivity to specific sets of the training set.” It occurs when an algorithm is so flexible that it learns the noise in the dataset rather than the true signal (overfitting).
Wikipedia states, “… variance is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).”
According to Forman’s article:
“Variance is the algorithm’s tendency to learn random things irrespective of the real signal by fitting highly flexible models that follow the error/noise in the data too closely (overfitting).”
When this happens, the model tries to connect every point, noise included: an unconstrained, flexible model fits the training data perfectly by covering all the random noise. This is overfitting.
Variance can be low or high:
Low Variance: suggests small changes to the estimate of the target function with changes to the training dataset. E.g., Linear Regression, Linear Discriminant Analysis, Logistic Regression.
High Variance: suggests large changes to the estimate of the target function with changes to the training dataset. E.g., SVC, KNN, Decision Tree.
Algorithms with more flexibility try to capture every bit of random noise in order to predict better. As a result, the model fits the training set very well but cannot predict the target accurately on new data. It overfits the data.
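The idea of variance as "how much the estimate changes with the training set" can be sketched by refitting two models on many freshly drawn training sets and looking at how much their prediction at one fixed point moves. The linear target, the noise level, the query point `x0` and the hand-written 1-nearest-neighbour predictor are all assumptions made for this illustration.

```python
import random
import statistics


def make_training_set(rng, n=30):
    """Draw a fresh training set from an assumed linear target with noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 2) for x in xs]
    return xs, ys


def predict_linear(xs, ys, x0):
    """Fit a least-squares line and predict at x0 (a rigid, low-variance model)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return mean_y + (sxy / sxx) * (x0 - mean_x)


def predict_1nn(xs, ys, x0):
    """Return the label of the single nearest training point (a flexible model)."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


rng = random.Random(0)
x0 = 5.0
linear_preds, knn_preds = [], []
for _ in range(500):  # 500 independent training sets
    xs, ys = make_training_set(rng)
    linear_preds.append(predict_linear(xs, ys, x0))
    knn_preds.append(predict_1nn(xs, ys, x0))

var_linear = statistics.pvariance(linear_preds)
var_knn = statistics.pvariance(knn_preds)
print(f"variance of linear prediction at x0: {var_linear:.3f}")
print(f"variance of 1-NN prediction at x0:   {var_knn:.3f}")
```

The 1-nearest-neighbour prediction copies the noise of whichever point happens to be closest, so it should swing much more from one training set to the next than the straight-line fit does.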
3. Irreducible Error
Irreducible error cannot be reduced regardless of which algorithm we use. It does not come from the algorithm's inability to fit the data, but from noise and unknown variables that influence the outcome and are not captured in the dataset. No model can remove it, although proper data collection and cleaning can sometimes lower the noise in the data itself.
We know that we have only one Earth, and that every person living on it is one of a kind, i.e., not identical to anyone else. But let's say, as in many science fiction stories and series, that there is a multiverse of universes identical to the one we live in. Since there are multiple universes, there can be multiple identical Earths with people identical to us living on them. Now, we have a very likable fictional character, Clark Kent (Superman), on our Earth. He is brave, good and kind-hearted, and he is our hero. There is a good chance that every Earth has its own Superman. Our Superman is a great and good person, but can you tell whether the Superman on another Earth is as good as ours? Maybe, maybe not. On some Earth, Clark Kent might not be Superman at all, just a reporter; on another he might work in a bakery; on another he might be running for president; or he might be an evil, superpowered Superman (as shown in some comics and films). He can be anything; we cannot say for sure. If we assume that he is the same Superman as ours, we are biased towards him and will make erroneous decisions. If we assume that he plays a different character on every Earth and try to predict accordingly, we make our algorithm more complex, which results in high variance, or overfitting. With cross-validation we can train on many subsets and average the results, but that alone does not remove the overfitting. That is where the bias-variance trade-off comes into the picture.
To achieve better results, every supervised algorithm has to balance bias against variance. The goal of any supervised machine learning algorithm is to achieve low bias and low variance, which in turn gives good prediction performance.
Low Bias — High Variance
A low bias and high variance problem is overfitting. Models trained on different datasets pick up the insights specific to their own dataset, so they predict differently. However, if we average their results, we get a fairly accurate prediction. Non-linear machine learning models have low bias and high variance. E.g., SVC, KNN, Decision Tree.
High Bias — Low Variance
A high bias and low variance problem is underfitting. Models trained on different datasets will make similar predictions to one another, but those predictions may be inaccurate. Linear machine learning models have high bias and low variance. E.g., Linear Regression, Logistic Regression, Linear Discriminant Analysis.
Total Error = Bias² + Variance + Irreducible Error
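The decomposition above can be checked numerically at a single input point. This is a rough Monte Carlo sketch under assumed settings (a quadratic true function, Gaussian noise, a straight-line model and a query point `x0`, none of which come from the original article): it estimates each term separately and compares their sum with the directly measured squared error.

```python
import random
import statistics

NOISE_STD = 0.5  # assumed; irreducible error is NOISE_STD ** 2


def true_f(x):
    """Assumed true target function."""
    return x * x


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(1)
x0 = 2.5
preds, squared_errors = [], []
for _ in range(2000):  # one model per independently drawn training set
    xs = [rng.uniform(-3, 3) for _ in range(30)]
    ys = [true_f(x) + rng.gauss(0, NOISE_STD) for x in xs]
    slope, intercept = fit_line(xs, ys)
    pred = slope * x0 + intercept
    y0 = true_f(x0) + rng.gauss(0, NOISE_STD)  # fresh noisy observation at x0
    preds.append(pred)
    squared_errors.append((y0 - pred) ** 2)

bias_sq = (statistics.mean(preds) - true_f(x0)) ** 2  # (average prediction - truth)^2
variance = statistics.pvariance(preds)                # spread of predictions
irreducible = NOISE_STD ** 2                          # noise no model can remove
total_error = statistics.mean(squared_errors)         # measured directly

print(f"bias^2      = {bias_sq:.3f}")
print(f"variance    = {variance:.3f}")
print(f"irreducible = {irreducible:.3f}")
print(f"sum         = {bias_sq + variance + irreducible:.3f}")
print(f"total error = {total_error:.3f}")
```

The sum and the measured total should agree up to simulation noise. For this rigid model on a curved target, the bias term is expected to dominate.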
Dealing with High Variance
Increase the training data: the larger the dataset, the more accurate the predictions will be.
Because the model is overfitting, we can try using fewer features.
Increase the regularization by increasing lambda, which constrains the model more.
Tune the model's hyperparameters: for example, in KNN we can increase the n_neighbors value so that each prediction averages over more neighbors, and in SVC we can adjust the C parameter that controls how strongly margin violations on the training data are penalized. In scikit-learn's convention C is the inverse of the regularization strength, so decreasing C tolerates more violations and thus increases the bias.
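As an illustration of the KNN point in the list above, here is a small hand-written k-nearest-neighbours regressor (standard library only, not scikit-learn's implementation) compared at k = 1 and k = 15. The sine target, noise level and sample sizes are assumptions for the sketch; `k` plays the role of `n_neighbors`.

```python
import math
import random


def knn_predict(train_x, train_y, x0, k):
    """Average the labels of the k training points closest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k


def knn_mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN predictor on the points (xs, ys)."""
    return sum(
        (knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)
    ) / len(xs)


def sample(rng, n):
    """Draw n noisy observations of an assumed sine-shaped target."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(7)
train_x, train_y = sample(rng, 200)
test_x, test_y = sample(rng, 500)

results = {}  # k -> (training MSE, test MSE)
for k in (1, 15):
    results[k] = (
        knn_mse(train_x, train_y, train_x, train_y, k),
        knn_mse(train_x, train_y, test_x, test_y, k),
    )
    print(f"k={k:2d}  train MSE = {results[k][0]:.3f}  test MSE = {results[k][1]:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training error is zero while the test error is not, which is the signature of overfitting. Averaging over more neighbours gives up that perfect training fit in exchange for a lower test error.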
Dealing with High Bias
We can try increasing the number of features; adding polynomial features also makes the model more complex.
We can decrease lambda to regularize the model less, so that it can fit the data better.
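Both remedies can be sketched together with a tiny polynomial least-squares fit that takes an optional ridge penalty. The helper names (`solve`, `fit`, `mse`), the quadratic target and the value `lam=10000.0` are hypothetical choices for this illustration; a real project would more likely rely on a library implementation.

```python
import random


def solve(A, b):
    """Solve the linear system A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w


def fit(xs, ys, degree, lam=0.0):
    """Least squares on features 1, x, ..., x**degree with ridge penalty lam.

    The intercept is left unpenalized.
    """
    feats = [[x ** d for d in range(degree + 1)] for x in xs]
    p = degree + 1
    A = [[sum(f[i] * f[j] for f in feats) for j in range(p)] for i in range(p)]
    for i in range(1, p):
        A[i][i] += lam
    b = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(p)]
    return solve(A, b)


def mse(w, xs, ys):
    """Mean squared error of the polynomial with coefficients w."""
    preds = [sum(w_d * x ** d for d, w_d in enumerate(w)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


rng = random.Random(3)
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [x * x + rng.gauss(0, 0.5) for x in xs]  # assumed quadratic target

mse_line = mse(fit(xs, ys, degree=1), xs, ys)                    # high bias
mse_quad = mse(fit(xs, ys, degree=2), xs, ys)                    # x**2 feature added
mse_quad_heavy = mse(fit(xs, ys, degree=2, lam=10000.0), xs, ys)  # over-regularized

print(f"degree 1, lam=0:     MSE = {mse_line:.2f}")
print(f"degree 2, lam=0:     MSE = {mse_quad:.2f}")
print(f"degree 2, lam=10000: MSE = {mse_quad_heavy:.2f}")
```

Adding the squared feature lets the model follow the curve, while a very large lambda shrinks the coefficients so far that the bias comes back; lowering lambda again removes it.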
There is no way to escape the trade-off completely, but by tuning the model and controlling the training features we may be able to lessen its effect.
By balancing bias and variance, we can finally get a model that fits the data well. This is the goal of every machine learning model: to fit the data well and to improve its performance.
I am new to Data Science and Machine Learning. If I missed anything here, or wrote anything wrong, please guide me through it. It would be a great help to me.