Original article was published by Kaif Kohari on Deep Learning on Medium
The ideal workflow for your Machine Learning Project
When we start a new machine learning project, we put most of our emphasis on training and testing models and too little on understanding the data. It is really important to become ‘one with the data’ before fitting a model to it.
Most of the machine learning models that make it to production fail badly, and there are plenty of reasons for it. Some of the most common are that the training and testing data come from different distributions, or that the training data has problems such as outliers, missing values, or low-variance features. Such models perform really badly on unseen, real-world data. So if you want a successful ML model in production, you have to do a lot more than just training and testing, and you have to keep a lot of things in mind.
In this blog, I have summarized all the things you should be aware of while working on a Machine Learning project, from data cleaning and analysis to choosing the right model and hyperparameter tuning.
Disclaimer: This post is heavily biased towards ordinary tabular data. I have not covered issues specific to other kinds of data, such as data leakage in time series, tokenization in NLP, or augmentation and resizing for images. I will also use the words ‘features’ and ‘columns’ interchangeably, because they mean the same thing here.
The first thing we need is data. Kaggle and the UCI ML repository are some of the best places to get ready-made data. Scikit-learn and TensorFlow also ship with collections of ready-to-use datasets.
Now let’s say you want to solve a problem with machine learning and the kind of data you need is not available in ready-made form on the internet. In such cases you will have to build your own dataset. For this you should be familiar with scraping data from the web and converting it to a file. This blog also gives good insight into making your own datasets.
Once you have your data organised in tabular form, it’s time to do some analysis on it.
We have a long way to go, so grab a cup of coffee and get in the ride!!!
Glance through the data: It’s always good to start by glancing through your data, for example by looking at a few rows to get an overview.
Now let’s analyze the features of our data.
1. It’s very important to know the types of values you will be dealing with in your dataset, such as integer, float, or object. You can check this using the info() method in pandas.
2. Then you can use the describe() method in pandas to get a statistical overview (mean, median, min, etc.) of all the numerical features of the dataset.
3. Check for missing values in the dataset. In pandas this can be done with the isna() method. If your dataset has missing values, you need to deal with them, or else they can heavily affect your machine learning model during predictions.
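Here is a minimal sketch of the three checks above. The tiny DataFrame and its column names are made up for illustration; in practice you would load your own file instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, only for illustration.
df = pd.DataFrame({
    "age": [22, 38, np.nan, 35, 54],
    "fare": [7.25, 71.28, 7.92, np.nan, 51.86],
    "sex": ["male", "female", "female", "female", "male"],
})

print(df.head(3))      # glance at the first few rows
df.info()              # column dtypes and non-null counts
print(df.describe())   # count, mean, std, min, quartiles, max of numeric columns

# isna() returns a boolean frame; summing it counts missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

The median shows up in the describe() output as the 50% row.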
Dealing with Missing values
1. Removing rows with missing values
The simplest strategy for handling missing data is to remove the entire row that contains a missing value. Pandas provides the dropna() function that can be used to drop either columns or rows with missing data.
2. Predicting missing values
We can predict the missing values with an ML algorithm such as regression: train on the rows where the feature is present, and treat the rows where it is missing as the test set to predict.
3. Replace the missing values
We can also replace the missing values with the mean or median of the column. Usually the median is preferred over the mean, because an outlier (we will talk about outliers further in the post) can affect the mean heavily but will only slightly affect the median.
“If any feature in the dataset has a lot of missing values, none of the above three methods will do any good. In that case the only right thing to do is to drop the column from your dataset.”
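The options above can be sketched in pandas roughly like this. The toy data and the 50% cut-off for dropping a column are assumptions made for the example, not fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; "cabin" is mostly empty.
df = pd.DataFrame({
    "age": [22, 38, np.nan, 35, 54, 29],
    "fare": [7.25, 71.28, 7.92, np.nan, 51.86, 8.05],
    "cabin": [np.nan, "C85", np.nan, np.nan, np.nan, np.nan],
})

# Drop columns where more than half of the values are missing
# (the 0.5 cut-off is an arbitrary choice for this sketch).
missing_fraction = df.isna().mean()
df_reduced = df.loc[:, missing_fraction <= 0.5]

# Option 1: remove every row that still contains a missing value.
df_dropped = df_reduced.dropna()

# Option 3: replace missing numeric values with the column median.
df_filled = df_reduced.fillna(df_reduced.median(numeric_only=True))

print(df_dropped)
print(df_filled)
```

Dropping rows is the simplest option but throws data away, so it is usually worth checking how many rows are left afterwards.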
Now the dataset you have may contain lots of features, and there is a high probability that a few of them, or in some cases most of them, are of no use for our analysis and ML models. So you should know which features to drop and which to keep, i.e. feature selection.
Some commonly used methods for feature selection are:
1. Dropping features with a lot of repeated values (the problem of low variance)
Sometimes a feature has one value repeated many times, for example 1,1,1,1,1,1,1,1,0,2,1,1,1,1,1,1,1…. Features of this type have very low variance and are hardly useful for your machine learning model. The best thing is to drop them. You can either calculate the variance of each feature manually by doing all the math, or do it in a few lines of code with sklearn’s VarianceThreshold.
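A small sketch of the low-variance filter with scikit-learn’s VarianceThreshold. The toy array and the threshold of 0.2 are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: the first column is almost constant, the second one varies.
X = np.array([
    [1, 10],
    [1, 25],
    [1, 3],
    [1, 47],
    [0, 18],
    [1, 32],
])

# Keep only features whose variance is above the threshold.
# The value 0.2 is an arbitrary choice for this example.
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)

print(selector.variances_)      # variance of each original feature
print(selector.get_support())   # True for the features that are kept
print(X_reduced.shape)
```

Because variance depends on the scale of a feature, a sensible threshold has to be chosen per dataset.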
2. Dropping features that by default are of no use
Let’s understand this with an example. In the famous Titanic dataset you have to predict whether a passenger survived (label = Survived) given other features like Sex, Age, etc. When you analyze this dataset, you know that features like Age, Sex, and the number of relatives of the passenger played an important role in deciding who was given priority in escaping the sinking Titanic, whereas columns like Ticket number and Embarked (the port at which the passenger boarded) have nothing to do with whether they survived. These features can be dropped and need not be included in the model during training. (Before picking a dataset you should understand what its features mean, so you have an idea of how useful they will be when training the model.)
3. Chi-square Method
The chi-square test is a very well-known method for finding the dependence between each feature and the target label in the dataset. It is used only for categorical features. If you wish to know the in-depth math behind Chi-square test click here.
Scikit-learn has a built-in method to select the k best features from your dataset using the chi-square test, which can get your work done in a few lines of code. For references click here.
There are various other methods for feature selection too, such as lasso regression.
*You can also decrease the number of features using PCA.*
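As a rough sketch, this is how the chi-square selection could look with scikit-learn’s SelectKBest. The two binary columns (`is_female`, `odd_ticket`) and the labels are invented for the example; note that scikit-learn’s chi2 expects non-negative feature values such as counts or 0/1 indicators.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical binary features: the first matches the label exactly,
# the second is unrelated to it.
X = pd.DataFrame({
    "is_female":  [0, 0, 0, 0, 1, 1, 1, 1],
    "odd_ticket": [0, 1, 0, 1, 0, 1, 0, 1],
})
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Score every feature against the label and keep the single best one.
selector = SelectKBest(score_func=chi2, k=1)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)         # higher score = stronger dependence on y
print(selector.get_support())   # True for the selected features
```

Categorical features stored as strings need to be encoded (for example one-hot encoded) before they can be passed to chi2.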
Most of the time, different features in the data have varying magnitudes. For example, in a grocery shopping dataset, the weight of a product in grams will usually be a large number, while its price in dollars will be a much smaller one. Many machine learning algorithms use the Euclidean distance between data points in their computation. If two features have very different ranges, the feature with the bigger range will dominate the algorithm.
What is feature scaling?
The most commonly used technique of feature scaling is min-max normalization. It is used when we want to bound our values between two numbers, typically [0, 1]. Formula for normalization:
x_scaled = (x - min) / (max - min)
There are other methods for feature scaling, such as mean normalization and standardization (subtracting the mean and dividing by the standard deviation). There is no hard and fast rule on which to use.
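To make this concrete, here is a small sketch using scikit-learn’s MinMaxScaler (which rescales each column to [0, 1]) and StandardScaler (zero mean, unit variance). The weight and price values are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical grocery data: weight in grams, price in dollars.
X = np.array([
    [500.0, 2.5],
    [1000.0, 4.0],
    [250.0, 1.0],
    [2000.0, 9.0],
])

# Min-max scaling: (x - min) / (max - min), column by column.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, column by column.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```

In a real project the scaler should be fitted on the training data only and then applied to the test data with transform(), so that no information from the test set leaks into training.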
When to do feature scaling?
Feature scaling is essential for machine learning algorithms that calculate distances between data. If not scaled, the feature with a higher value range starts dominating when calculating distances. Some algorithms where feature scaling matters:
- Algorithms like KNN and K-means, which rely on Euclidean distance, require feature scaling.
- Scaling is critical when performing Principal Component Analysis (PCA). PCA looks for the directions of maximum variance, and variance is high for high-magnitude features, which skews the PCA towards those features.
- We can speed up gradient descent by scaling because θ descends quickly on small ranges and slowly on large ranges, and oscillates inefficiently down to the optimum when the variables are very uneven.
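The PCA point in the list above can be checked with a quick experiment. This is only a sketch on synthetic data: two independent random features, one of them a thousand times larger in magnitude than the other.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two independent synthetic features with very different magnitudes.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1000, size=500),
    rng.normal(0, 1, size=500),
])

# Without scaling, the large feature takes almost all the variance.
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization, both features contribute about equally.
X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("unscaled:", ratio_raw)
print("scaled:  ", ratio_scaled)
```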
Below image explains how feature scaling speeds up gradient descent:
Problem of Outliers
“Observation which deviates so much from other observations as to arouse suspicion it was generated by a different mechanism” — Hawkins(1980)
Most common causes of outliers on a data set:
- Data entry errors (human errors)
- Measurement errors (instrument errors)
- Experimental errors (data extraction or experiment planning/executing errors)
- Natural (not an error, novelties in data)
How to detect outliers?
1. Z-score: This score tells you how far a data value is from the mean of the feature. More specifically, the Z-score tells how many standard deviations away a data point is from the mean. For example, a Z-score of 2 indicates that an observation is two standard deviations above the mean, while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean.
Z score = (x - mean) / std. deviation
A standard cut-off (threshold) for finding outliers is a Z-score of +/-3 or further from zero, i.e. if the absolute value of the Z-score is greater than or equal to 3, the value is treated as an outlier.
2. Scatter plots
A simple way to detect outliers is with a scatter plot, where they can usually be identified visually.
After detecting the outliers, you can replace them with the median value of the respective feature.
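The Z-score rule and the median replacement can be sketched in pandas as follows. The numbers are invented: nineteen values around 50 and one obvious outlier of 500.

```python
import pandas as pd

# Hypothetical feature: 19 ordinary values and one extreme value.
values = [48, 52, 50, 49, 51] * 4
values[-1] = 500
s = pd.Series(values, dtype=float)

# Z-score = (x - mean) / standard deviation.
z_scores = (s - s.mean()) / s.std()

# Flag values that are 3 or more standard deviations from the mean.
is_outlier = z_scores.abs() >= 3

# Replace the flagged values with the median of the feature.
cleaned = s.mask(is_outlier, s.median())

print(s[is_outlier])
print(cleaned.describe())
```

Keep in mind that an extreme value also inflates the mean and standard deviation used to compute the Z-scores, so in very small samples even an obvious outlier may never reach the threshold of 3.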
Data visualization is often regarded as obvious or unimportant because of its inherently subjective nature. But what people forget is that data visualization is a medium of communication between technical and non-technical people. When you want to explain your project to a CEO who doesn’t know much about data science, you convey your results and conclusions through plots and graphs, not Python code.
Some basic data-visualizations I did in some of my projects:
Choosing the correct Machine learning model for your Problem
Now comes the big dilemma. Which Machine Learning model to choose for your problem? Confusion, confusion, confusion!!!
There is a lot of variety when it comes to choosing a model for your problem. Each category, supervised and unsupervised, has many options to choose from (there is a third category, semi-supervised learning, which is not discussed in this post). To choose the right model you should be familiar with the pros and cons of ML algorithms such as SVMs, KNN, trees, regression, and neural networks. This requires a good understanding of the math behind each model.
Tip: One thing many experienced data scientists say is ‘Keep it simple’. You don’t need neural networks to solve every problem. You can achieve great results with simple regression or clustering-based algorithms too. Many of us have a bad habit of trying to solve every other problem with complicated models or deep NNs.
(In no way do I mean that simple ML models are better than NNs, what I mean to say is you don’t have to use Neural Nets for every other problem. )
Some basic things to keep in mind while choosing a model that’ll fetch you the best results:
1. Size of training data: For really large datasets or those with many features, neural networks or boosted trees might be an excellent choice, whereas smaller datasets might be better served by regression, Naive Bayes, or KNN. It is often said that deep neural nets are hungry for data.
2. Complexity of data: Decision trees, neural nets, and polynomial regression work well when your data has a complex shape, whereas linear models such as linear regression and linear SVMs work best when the data is linearly separable to some extent.
3. Dimensionality: A dataset with a lot of features (high dimensionality) can become a big problem for traditional ML models. In today’s big data world, the term also covers several other issues that arise when your data has a huge number of dimensions.
If we have lots of features and few observations, there is a high risk of overfitting during training. With too many features, observations also become harder to cluster, because too many dimensions make every observation very far from the others. This badly affects distance-based algorithms like KNN and K-means.
The problem of high dimensionality can be tackled either by avoiding distance-based algorithms or by reducing the number of features using PCA or feature selection, as discussed before.
4. Accuracy and/or Interpretability of the output
The big issue -> As model accuracy increases, the complexity of the model increases and its interpretability decreases.
“As long as complex models are properly validated, it may be improper to use a model that is built for interpretation rather than predictive performance.“
Accuracy beats Interpretability:
Let’s say you are working on medical data and you really need good accuracy, because wrong predictions by your model during medical diagnosis can be very harmful. In this case you will go for the most complex model, which will usually give the best accuracy, and you will care least about its interpretability.
Predictive performance and accuracy should be given more importance, but the interpretability (explanation) of the model is also something you should take care of as far as the business goal of your project is concerned.
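One practical way to act on the ‘keep it simple’ advice above is to start with a simple baseline and only move to a more complex model if cross-validation shows a clear gain. The sketch below uses scikit-learn’s built-in breast cancer dataset; the choice of the two models and of 5-fold cross-validation is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Simple, interpretable baseline (scaled inside the pipeline).
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # More complex, less interpretable alternative.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each fold
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the simple model is within noise of the complex one, the simpler and more interpretable model is usually the easier one to justify.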
In machine learning, model parameters are the values that the classifier or other ML model learns on its own from the training data during training. For example, weights and biases, or the split points in a decision tree.
Model hyperparameters are instead properties that govern the entire training process. They include the variables that determine the network structure (for example, the number of hidden units) and the variables that determine how the network is trained (for example, the learning rate). Model hyperparameters are set before training. Our job is to tune the hyperparameters in such a way that the model gets the best accuracy on train/val data and also works well on unseen test data.
Two of the most famous methods for optimizing hyperparameters:
- Grid Search: In Grid Search, we try every combination of a preset list of values of the hyper-parameters and evaluate the model for each combination. We note down accuracy for all possible combinations of hyperparameters and later the combination which gave the best results is chosen. In short we brute force over all possible hyperparameter combinations.
- Random Search: Random search is a technique where random combinations of the hyperparameters are used to find the best solution for the built model. It tries random combinations of a range of values.
Now that you know about Grid Search and Random Search the question arises which one to use and when?
Avoid Grid Search in NNs: As seen above, grid search basically means trying all combinations, so it can be very time-consuming and expensive in deep neural nets, which have many hyperparameters. So random search is preferred over grid search in NNs.
In basic ML models, on the other hand, where there are only a few hyperparameters to tune, it can be convenient to use grid search, i.e. brute force over all hyperparameter combinations.
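Both strategies are available in scikit-learn as GridSearchCV and RandomizedSearchCV. The following is a small sketch on the iris dataset; the model, the hyperparameter values, and the number of random trials are arbitrary choices for the example.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0)

# Grid search: tries every combination (2 x 3 = 6 candidates here).
grid = GridSearchCV(
    model,
    param_grid={"n_estimators": [10, 50], "max_depth": [2, 4, None]},
    cv=3,
)
grid.fit(X, y)

# Random search: samples a fixed number of combinations (5 here).
random_search = RandomizedSearchCV(
    model,
    param_distributions={
        "n_estimators": randint(10, 100),  # any integer from 10 to 99
        "max_depth": [2, 4, None],
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(random_search.best_params_, random_search.best_score_)
```

The cost of grid search grows with the product of the list lengths, while the cost of random search is fixed by n_iter, which is why the latter scales better to many hyperparameters.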
Check this out for code examples on hyperparameter tuning in basic ML models with help of Sci-kit learn.
For hyperparameter tuning in NNs, one of the easiest methods is to use the sklearn wrapper KerasClassifier. This notebook by Matthew Stewart does quite a good job of explaining it, or you can also look up the official documentation.
Now that you have chosen the best model and hyperparameters, your model is ready to make some predictions. To measure how well your model predicts on test data:
1. Classification models: You can check performance using a confusion matrix or ROC curve.
2. Regression models: MSE (mean squared error) is a common choice.
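These metrics are all available in sklearn.metrics. Here is a minimal sketch with made-up labels, predictions, and scores, just to show the calls.

```python
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

# Hypothetical classification results.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]                # predicted classes
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]   # predicted probability of class 1

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Area under the ROC curve, computed from the scores, not the classes.
auc = roc_auc_score(y_true, y_score)

# Hypothetical regression results.
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.5, 5.0, 3.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)

print(cm)
print(auc)
print(mse)
```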
Once you make it here you have unlocked the full potential of your project. You have a deep understanding of the dataset and the problem, you’ve set up the entire training/evaluation infrastructure and achieved high confidence in its accuracy, and you’ve explored a lot, gaining performance improvements in ways you’ve predicted each step of the way. That’s it for now. If you have any queries feel free to contact me on Twitter or LinkedIn.