Original article was published on Artificial Intelligence on Medium
Implementing an end-to-end Machine Learning Project with Deployment
Hey, everyone! I am Sayantan Gupta, currently working as an Engineer at Qualcomm Inc., with over 2 years of experience.
Nowadays, almost everyone is familiar with the terms "Machine Learning" and "Artificial Intelligence". The exponential increase in their demand and application worldwide over the last 5–6 years has made them essential skills to learn. In this article, I am going to cover an end-to-end Machine Learning project, starting from data cleaning and pre-processing and going right through to deployment. So, let's get started!!
Let's first understand what a Machine Learning (or Data Science) pipeline is!
Machine Learning Pipelines:
Generally speaking, there are 5 major steps in any data science project. Together they are called the lifecycle of a data science project:
- Data Analysis
- Feature Engineering
- Outlier Removal
- Model Building
- Model Deployment
We can build a data science pipeline where each step functions independently; integrating all the steps then gives us a complete data science project.
About the Project:
Moving on to the project: it is about predicting the selling price of houses in Bangalore, India, based on various factors like location, square footage, number of BHKs, baths, etc. The dataset for this project is taken from Kaggle (Link). As some of you might have already guessed, this is a regression problem, so we have to solve it accordingly. Before going through the steps one by one, let's first look at the UI of the final project, which will be our output:
Project Link on Heroku: https://house-price-predictor-app.herokuapp.com/
Let's start with the steps. Before we begin, I must say that this is not the only way to solve this problem. There are multiple ways to solve it, and some will definitely be better and more efficient than this one, with higher accuracy. What I want to show you is one complete, end-to-end way of finishing an ML project. In case you have any queries related to this project, feel free to ping me on LinkedIn. My LinkedIn profile: https://www.linkedin.com/in/sayantan-gupta-779a23125/
So, lets get started!!
1. Data Analysis
1.1 Importing libraries
Import all the required libraries: pandas, NumPy and Matplotlib. Set the pandas option "display.max_columns" so that whenever you view the dataset, you see all the columns, i.e. all the features in the dataset.
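As a minimal sketch, the imports and the display option look like this (Matplotlib is imported here with its usual alias, ready for the plots later in the article):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Show every column when printing the DataFrame, instead of pandas
# truncating wide frames with "..." in the middle.
pd.set_option("display.max_columns", None)
```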
1.2 Get the dataset and check its basic information
Read the dataset CSV file and get its shape. As we can see, the shape is (13320, 9), meaning it has 13,320 records and 9 features. If we call dataset.head(), we see the first 5 records in the dataset. Then we view the dataset summary using the info() function. We can see that there are 9 columns (features) in total: 3 of them are of dtype "float64", which are numerical features, while 6 are of dtype "object", which are categorical features. We can also see the number of non-null values in each column.
1.3 Analyzing the features
This is where domain knowledge of the area you are building the project in comes into play. Good domain knowledge helps you judge whether to remove any column that you think is not significant in predicting the dependent feature (here, the selling price). In this case, we can drop "availability", since it generally does not affect the selling price of houses.
This completes our brief data analysis, and now we can move on to Feature Engineering.
2. Feature Engineering
2.1 Removing Null values from all the features
This is one of the most important steps in the entire data science pipeline. First, we check the number of null values in each column using the isnull().sum() function.
Generally, we can remove columns whose number of null values exceeds 50% of all the records. In this case, the threshold is 50% of 13,320, i.e. 6,660, and no column here exceeds it. However, "society" has a pretty high number of null values (5,502) compared to the others, so there is a disproportion in the dataset and it is better to remove "society".
We will first separate the numerical and categorical columns and then remove null values from each, since the procedure differs between the two. Let's do this for the numerical features first:
First, find all the numerical features using a list comprehension in Python: numerical_features = [feature for feature in dataset.columns if dataset[feature].dtype != 'O']. As mentioned above, numerical features are those whose dtype is not object ('O'), so we can use this property to collect them. For each numerical feature, we replace its null values with the mean of that column, using the fillna() function. Next, we remove the null values from the categorical features.
Using a list comprehension, we find the categorical features, using the property that dtype == object ('O') for categorical features. For each categorical feature, we replace its null values with the mode of that column, using the fillna() function. The mode is the most frequently occurring value in the data. After doing this for all the categorical features, all the null values are removed.
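Put together, the two imputation loops look roughly like this (the toy DataFrame below just stands in for the real dataset; the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (column names assumed).
dataset = pd.DataFrame({
    "bath":     [2.0, np.nan, 3.0, 2.0],
    "balcony":  [1.0, 2.0, np.nan, 1.0],
    "location": ["Hebbal", None, "Hebbal", "Whitefield"],
})

numerical_features = [f for f in dataset.columns if dataset[f].dtype != "O"]
categorical_features = [f for f in dataset.columns if dataset[f].dtype == "O"]

# Numerical columns: fill missing values with the column mean.
for feature in numerical_features:
    dataset[feature] = dataset[feature].fillna(dataset[feature].mean())

# Categorical columns: fill missing values with the mode (most frequent value).
for feature in categorical_features:
    dataset[feature] = dataset[feature].fillna(dataset[feature].mode()[0])

print(dataset.isnull().sum().sum())  # 0 -- no nulls remain
```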
2.2 Handle categorical features to remove redundant or unexpected values
If we observe the dataset carefully, we will see that some features have redundant values. For example, in the 'size' feature, both the '2 BHK' and '2 Bedroom' categories are present. We have to merge them into one category; otherwise our machine learning model will treat them as different categories, which is incorrect. Let's start by finding the unique categories of every categorical feature:
For area_type, there are no redundant categories
For location, it looks like there are no redundant categories either
For the 'size' feature, we can see redundant categories like '2 BHK' and '2 Bedroom', '4 BHK' and '4 Bedroom', etc. So we can create a new column that does not have these redundant categories. Since the categories are '2 BHK', '2 Bedroom', '3 BHK' and so on, it is better to keep only the leading number, i.e. 2, 3, 4, etc., as the number of bedrooms. For that, we can split the string on " " and keep only the first part, which is the number of bedrooms.
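That split is a one-liner (assuming the new column is called 'bhk'):

```python
import pandas as pd

# '2 BHK' and '2 Bedroom' both mean two bedrooms, so keep only the number.
dataset = pd.DataFrame({"size": ["2 BHK", "4 Bedroom", "3 BHK"]})
dataset["bhk"] = dataset["size"].apply(lambda s: int(s.split(" ")[0]))
print(dataset["bhk"].tolist())  # [2, 4, 3]
```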
The 'total_sqft' feature is numeric, but it is stored as a categorical (object) data type. We need to convert it to float. But first, let's check for irregular values:
As we can see above, some of the values are ranges like '1133 - 1384'. In such cases, we should take the average of the two interval endpoints and then cast it to float.
As we can see above, there are some cases where the convert_range_to_float function returns null. We need to replace these null values with the mean.
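One way to write such a converter (the function name follows the one mentioned above; the exact token format in the CSV may vary):

```python
def convert_range_to_float(value):
    """Convert '1133 - 1384' style ranges to their midpoint, plain numbers
    to float, and anything unparseable (e.g. '34.46Sq. Meter') to None so
    it can be imputed with the mean afterwards."""
    tokens = str(value).split("-")
    if len(tokens) == 2:
        return (float(tokens[0]) + float(tokens[1])) / 2
    try:
        return float(value)
    except ValueError:
        return None

print(convert_range_to_float("1133 - 1384"))     # 1258.5
print(convert_range_to_float("1056"))            # 1056.0
print(convert_range_to_float("34.46Sq. Meter"))  # None
```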
We now introduce a new feature, "price per sqft", which is a very important feature in feature engineering for this problem. We will see its usage later, while removing outliers.
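Assuming 'price' is recorded in lakhs of rupees (which appears to be the case in the Kaggle dataset, but is worth verifying), the new feature is just:

```python
import pandas as pd

# Toy rows; 'price' is in lakhs, so multiply by 1e5 to get rupees
# before dividing by the area.
dataset = pd.DataFrame({"price": [39.07, 120.0], "total_sqft": [1056.0, 2600.0]})
dataset["price_per_sqft"] = dataset["price"] * 100000 / dataset["total_sqft"]
```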
2.3 Handle rare categorical values
Some features, for example 'location', have many categories. If we check the total number of unique categories:
There are 1,305 unique categories in 'location', and one-hot encoding would create 1,304 extra features (columns), increasing the size of the dataset enormously, which is not at all feasible. This phenomenon of introducing so many new features through one-hot encoding is part of the curse of dimensionality, and we need to avoid it by performing dimensionality reduction. One technique is to group all rare categories under a single label (for example, "Other"). Rare categories are those that occur very few times; we can set the threshold at 1%, i.e. categories that occur in less than 1% of the total records are considered rare.
If we run dataset['location'].value_counts(), we see that many categories occur very few times:
We need to merge all these rare categories into a single category. For 'location', we can keep the threshold count at 30 and then apply a lambda function that maps every category whose value count is less than 30 to "other_loc".
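The merge can be sketched like this (toy data; the real threshold of 30 is kept):

```python
import pandas as pd

# Toy 'location' column: one frequent category, one rare one.
dataset = pd.DataFrame({"location": ["Hebbal"] * 40 + ["Rare Lane"] * 3})

counts = dataset["location"].value_counts()
# Keep categories seen at least 30 times; lump the rest into "other_loc".
dataset["location"] = dataset["location"].apply(
    lambda loc: loc if counts[loc] >= 30 else "other_loc"
)
print(dataset["location"].value_counts().to_dict())
# {'Hebbal': 40, 'other_loc': 3}
```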
3. Outlier Removal
Outliers are those observations which have either unusually high or unusually low values. They don’t fit in the expected range of values for that feature. Outliers should be removed so that the entire data is evenly distributed.
Again, there is no fixed way to remove outliers. Everything is based on the data insights, domain expertise and how the data analyst/scientist analyses the data, so it is obvious that the way to remove outliers will differ from person to person.
We can start by considering the total square feet per bedroom, taking 300 sq. ft. as the minimum threshold for one bedroom.
From the above, we can see some unusual records (a 6 BHK with 1,020 total sq. ft., an 8 BHK with 600 sq. ft., etc.). These are erroneous data points and therefore outliers, and they can be removed directly.
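That filter is a one-liner (assuming the bedroom count column derived earlier is named 'bhk'):

```python
import pandas as pd

# Toy rows: one clearly bad record (6 BHK in 1,020 sqft) and one sane one.
dataset = pd.DataFrame({"total_sqft": [1020.0, 1800.0], "bhk": [6, 3]})

# Anything under ~300 sqft per bedroom is treated as bad data.
dataset = dataset[dataset["total_sqft"] / dataset["bhk"] >= 300]
print(len(dataset))  # 1 -- the 6 BHK / 1020 sqft row is dropped
```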
Now consider price per sq. ft., which can be too low or too high. First, let's check the description of this feature:
As we can see, the minimum value is 267 and the maximum is 176,470. The minimum is very low and the maximum is very high, which might be plausible if the area is extremely upscale. But since we are building a generic model, we will treat values beyond certain thresholds as outliers and remove them.
Let's plot a histogram of the price per sqft feature:
From the above curve, we can see that 'price per sqft' roughly follows a normal distribution, so we can apply the standard deviation method to remove its outliers.
Standard deviation method for outlier removal
There are 3 key values: the mean, a lower limit and an upper limit. If a value lies between the lower and upper limits, it is not an outlier; otherwise it is. Generally, we set the limits at 3 standard deviations: lower limit = mean - 3 * standard deviation and upper limit = mean + 3 * standard deviation.
In this case, for the price per sqft feature, the mean is 6,312 and one standard deviation is 4,177. If we used 3 standard deviations, the lower limit would be negative. So, in this case, we use 1 standard deviation for both the lower and upper limits.
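A sketch of the 1-standard-deviation filter on toy data (in the real dataset, with mean 6,312 and std 4,177, the limits would be roughly 2,135 and 10,489):

```python
import pandas as pd

dataset = pd.DataFrame(
    {"price_per_sqft": [267.0, 5000.0, 6000.0, 7000.0, 176470.0]}
)

mean = dataset["price_per_sqft"].mean()
std = dataset["price_per_sqft"].std()

# One standard deviation on each side, since three would push the
# lower limit below zero for this feature.
lower, upper = mean - std, mean + std
dataset = dataset[(dataset["price_per_sqft"] >= lower)
                  & (dataset["price_per_sqft"] <= upper)]
```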
Next, coming to bathrooms: it is unusual for a 2 BHK house to have 5 or 6 bathrooms, so those are outliers. We can treat any record where the number of bathrooms is greater than the number of bedrooms + 2 as an outlier and remove it. The same applies to balconies.
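That rule is another one-line filter (column names assumed; the balcony rule is analogous):

```python
import pandas as pd

# A 2 BHK with 6 bathrooms is implausible and gets dropped.
dataset = pd.DataFrame({"bath": [2.0, 6.0], "bhk": [2, 2]})
dataset = dataset[dataset["bath"] <= dataset["bhk"] + 2]
print(len(dataset))  # 1
```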
To remove outliers for BHK, let's first plot a scatterplot of 'total_sqft' against 'price per sqft' for the "Hebbal" location.
In the highlighted circle above (although it doesn't look much like a circle! XD), we see that for the same total square feet, the price per sqft of a 2 BHK is greater than that of a 3 BHK. This might be possible due to various factors. However, since we are building a generic model, we can adopt a general rule: if the price per sqft of an n BHK is less than the mean price per sqft of (n-1) BHK homes, it is an outlier and should be removed.
Implementation approach: maintain a dictionary where the key is the number of BHKs and the value is the mean price per sqft for that BHK count. Then build a NumPy array containing the indexes of all records where the price per sqft of an n BHK is less than the mean price per sqft of (n-1) BHK homes, and finally drop those indexes.
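One possible implementation of that approach is below. Note one assumption on my part: since the rule is motivated by a per-location scatterplot, I compute the BHK statistics within each location; a global version would simply skip the outer groupby.

```python
import numpy as np
import pandas as pd

def remove_bhk_outliers(df):
    """Within each location, drop n-BHK rows whose price_per_sqft is below
    the mean price_per_sqft of (n-1)-BHK homes in the same location."""
    exclude_indices = np.array([], dtype=np.int64)
    for _, location_df in df.groupby("location"):
        # Dictionary: key = no. of BHK, value = mean price per sqft.
        bhk_stats = {bhk: bhk_df["price_per_sqft"].mean()
                     for bhk, bhk_df in location_df.groupby("bhk")}
        for bhk, bhk_df in location_df.groupby("bhk"):
            prev_mean = bhk_stats.get(bhk - 1)
            if prev_mean is not None:
                bad = bhk_df[bhk_df["price_per_sqft"] < prev_mean].index.values
                exclude_indices = np.append(exclude_indices, bad)
    return df.drop(exclude_indices)

# Toy data: one 3 BHK row priced below the 2 BHK mean (6100) is dropped.
dataset = pd.DataFrame({
    "location": ["Hebbal"] * 4,
    "bhk": [2, 2, 3, 3],
    "price_per_sqft": [6000.0, 6200.0, 5000.0, 7000.0],
})
cleaned = remove_bhk_outliers(dataset)
print(len(cleaned))  # 3
```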
Now, if we plot the scatterplot for the "Hebbal" location again:
In the above plot, we see that the outlier points in the highlighted circle have been removed. All outliers are now gone. Since the price per sqft column was only needed for removing outliers, we can delete it.
4. Model building
Since our machine learning model understands only numbers, we first need to convert all the categorical features into numerical ones. We can do that with one-hot encoding, applied to the 'area_type' and 'location' features.
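With pandas this is a single get_dummies call (toy data below; drop_first=True is my addition to avoid the dummy-variable trap for linear models):

```python
import pandas as pd

dataset = pd.DataFrame({
    "location":   ["Hebbal", "Whitefield", "Hebbal"],
    "area_type":  ["Plot Area", "Built-up Area", "Plot Area"],
    "total_sqft": [2600.0, 1056.0, 1440.0],
})

# One column per category; drop the first category of each feature so the
# remaining dummies are linearly independent.
dataset = pd.get_dummies(dataset, columns=["area_type", "location"],
                         drop_first=True)
print(sorted(dataset.columns))
```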
Now the model is ready to be trained. In the training set, Xtrain is the dataset without the dependent variable "price", and ytrain contains only the dependent variable "price". Let's first try GridSearchCV, a scikit-learn API that tries out multiple models and hyperparameters and reports the one with the best score.
Here we compared 3 models using 3 different algorithms: Linear Regression, Lasso Regression and Decision Tree Regression. We see that the Linear Regression model performs best among the three.
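The comparison can be sketched as below. The hyperparameter grids and the synthetic data are illustrative, not the ones from the original notebook:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

def find_best_model(X, y):
    """Grid-search the three algorithms compared above and return the
    best cross-validated score and parameters for each."""
    candidates = {
        "linear_regression": (LinearRegression(),
                              {"fit_intercept": [True, False]}),
        "lasso": (Lasso(), {"alpha": [0.5, 1.0, 2.0]}),
        "decision_tree": (DecisionTreeRegressor(random_state=0),
                          {"max_depth": [3, 5, None]}),
    }
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    results = {}
    for name, (estimator, param_grid) in candidates.items():
        gs = GridSearchCV(estimator, param_grid, cv=cv)
        gs.fit(X, y)
        results[name] = {"best_score": gs.best_score_,
                         "best_params": gs.best_params_}
    return results

# Synthetic, almost perfectly linear data: linear regression should win.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([3.0, 2.0, 1.0]) + rng.normal(0.0, 0.01, 200)
results = find_best_model(X, y)
```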
So, let's use linear regression in our final model. First, use train_test_split to split the data into a training set and a test set, then fit linear regression on the training set.
We got a score of 0.85, which is pretty decent. We also ran K-fold cross-validation using ShuffleSplit to randomize the data, and for every split we got a score above 0.85.
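The split, fit and cross-validation steps can be sketched like this (synthetic data stands in for the prepared, one-hot encoded feature matrix):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score, train_test_split

# Synthetic stand-in for the prepared feature matrix and price vector.
rng = np.random.default_rng(1)
X = rng.random((300, 4))
y = X @ np.array([4.0, 3.0, 2.0, 1.0]) + rng.normal(0.0, 0.05, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)
model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)  # R^2 on the held-out test set

# K-fold cross validation; ShuffleSplit re-randomizes the data per fold.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cv_scores = cross_val_score(LinearRegression(), X, y, cv=cv)
```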
Now, let's predict prices with our model. We wrote a function that takes the required parameters as input and returns the predicted price.
Lets predict below:
These prices are returned in lakhs. So we are finally done with model building. Now we save our model to a pickle file so that later we can use the already-trained model directly, without training it again. The pickle file contains only the weights and coefficients of the linear regression, not the whole dataset, so its size is very small, only around 3 KB. We also store all the columns (features) in a JSON file.
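The saving step looks roughly like this. The file names and the tiny fitted model are placeholders; use whatever fits your project layout:

```python
import json
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# A tiny fitted model standing in for the real one.
X = np.array([[1000.0, 2.0], [1500.0, 2.0], [2000.0, 3.0]])
model = LinearRegression().fit(X, np.array([50.0, 75.0, 110.0]))

# The pickle stores only the fitted estimator (coefficients + intercept).
with open("banglore_home_prices_model.pickle", "wb") as f:
    pickle.dump(model, f)

# The JSON stores the column order, so the server can rebuild inputs
# in exactly the order the model was trained on.
with open("columns.json", "w") as f:
    json.dump({"data_columns": ["total_sqft", "bath"]}, f)
```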
This pickle file and JSON file will be used by our Python Flask server during model deployment, which we discuss below.
5. Model Deployment
We will use a Python Flask server as the back end, serving whatever data the front-end client asks for. The front end is written in basic HTML and CSS (which we will not discuss here; you can refer to all the code in my GitHub repository). We have used jQuery to make GET and POST calls to the Flask server to retrieve the data.
The above part makes GET calls to the back-end server to retrieve the locations and area_types, so that when the page loads, all the locations and area types come pre-filled in dropdown menus.
The functions getbath(), getbhk() and getbalcony() read the values the user enters on the website, looking each input up by its element id.
The estimate() function is the main one: it gathers all the data entered by the user and makes a POST call to the back-end server to get the final selling price of the house.
Below is the code for our app.py file, which runs the Flask server. Always remember that execution of the project starts from app.py.
1. Initialize the Flask server and load the model and columns.
2. Redirect the home screen to our HTML file “home.html”
3. Return the locations and area_types when ‘get’ calls are done to pre-fill the website ,on loading, with all the locations and area types
4. Return the final predicted price when ‘post’ call is done with all the required input parameters provided by user
5. Now run the Flask app inside main, keeping debug mode as True
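Putting the five steps together, a stripped-down app.py could look roughly like this. To keep the sketch self-contained, a stub model and a short hand-written column list stand in for the real pickle and JSON artifacts, and the route names are illustrative:

```python
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Step 1 -- in the real app these come from the saved pickle and JSON files.
data_columns = ["total_sqft", "bath", "bhk", "hebbal", "whitefield"]

class StubModel:
    def predict(self, X):
        # Pretend the price (in lakhs) grows linearly with square feet.
        return X[:, 0] * 0.05

model = StubModel()

# Step 3 -- GET endpoint used to pre-fill the dropdowns on page load.
@app.route("/get_location_names")
def get_location_names():
    # Everything after the first three numeric columns is a location dummy.
    return jsonify({"locations": data_columns[3:]})

# Step 4 -- POST endpoint returning the predicted price.
@app.route("/predict_home_price", methods=["POST"])
def predict_home_price():
    total_sqft = float(request.form["total_sqft"])
    bath = float(request.form["bath"])
    bhk = int(request.form["bhk"])
    location = request.form["location"].lower()

    # Build the input row in the same column order the model was trained on.
    x = np.zeros(len(data_columns))
    x[0], x[1], x[2] = total_sqft, bath, bhk
    if location in data_columns:
        x[data_columns.index(location)] = 1.0
    price = float(model.predict(x.reshape(1, -1))[0])
    return jsonify({"estimated_price": round(price, 2)})

# Step 5 -- in the real app.py, start the server in debug mode:
# if __name__ == "__main__":
#     app.run(debug=True)
```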
When you run this file, you will see the output on localhost. But others can't see your project unless you deploy it somewhere, and for that we will use Heroku. Deploying on Heroku requires 2 more files besides the ones above: Procfile and requirements.txt.
Procfile content -> web: gunicorn app:app
Here, the first "app" is the name of the file that is executed first, in this case app.py; the second "app" is the name of the Flask application object inside it.
requirements.txt -> This text file lists all the imported libraries and their versions. To generate it, go to the project folder and, in Windows PowerShell or Command Prompt, type:
pip freeze > requirements.txt
Now create a GitHub repository and add all these files to it. After that, connect your GitHub repo from Heroku, give the app an appropriate name, and deploy. You can now see your deployed website and share it with others. That's it!! Congrats, your end-to-end ML project is done!