Original article was published on AI Magazine
How to Structure Data Science Projects
Why, When, and How to start your first data-science/machine learning project
When should I start my first project?
The question that every data science/machine learning aspirant comes across at least once while they are relatively new to the field is:
Is it too early to start my own project? What more do I need to learn before I start working on my own project?
The answer to this question varies from person to person, but a general rule of thumb is that once you feel comfortable with your command over a few fundamental subtopics of machine learning, you’re good to go! It’s never too early. We learn faster and retain more by DOING a task than by watching someone else perform it or reading a book about it.
Which Project Should I Choose?
Pick any one topic that you want to work on (regression, classification, computer vision, NLP, etc.), try to come up with real-life applications for the topic, and make a map, a rough sketch, of all the steps you need to take to get the idea out of your brain and into the real world.
Early in your data science career, you do not need to worry if your project has any real-world significance or possible business outcomes. The fundamental focus of this is to test yourself on your skills and find the areas where you lack knowledge.
From here on, I will be walking you through my first data science project, Resale Price Prediction of Cars. Take a look at the deployed version of this project to get an idea of how your project should look upon completion.
The code snippets used in this post are extracted from the code on my GitHub. RESALECARS
Here are the steps that you will need to follow in any data science project:
- Data Collection
- Data Preprocessing
- Exploratory Data Analysis
- Feature Engineering
- Model Building
In this article, we will stop at Model Building. Check out PART-2 for deployment using Heroku and Streamlit. I will go through each of the above steps using the Resale Car Price Prediction project mentioned earlier.
Data Collection :
Congratulations! You’ve made it this far. You have an idea you’re willing to bring to life. It’s your first data-science brainchild! Now you have to figure out what data you need to build a model. You can take one of two routes to collect data:
I used the Python library Selenium to collect data from the website CarsDirect. However, I will not be going into web scraping techniques and frameworks in this post.
Data Preprocessing :
Most real-world datasets will have null values and other kinds of values (which we will get into later) that you need to take care of before you go any further.
Let’s find out what our dataset looks like and how many null values it has.
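A minimal sketch of this inspection step with pandas. The tiny DataFrame and its column names (`price`, `mileage`, `fuel_type`) are made up for illustration and are not the project's real schema.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the scraped car listings;
# the column names are illustrative assumptions.
df = pd.DataFrame({
    "price": [5500.0, 7200.0, np.nan, 3100.0],
    "mileage": [82000, 61000, 45000, np.nan],
    "fuel_type": ["petrol", "diesel", None, "petrol"],
})

print(df.head())  # first few rows of the dataset
df.info()         # column dtypes and non-null counts

# Number of missing values in each column
null_counts = df.isnull().sum()
print(null_counts)
```

On a real dataset you would replace the toy DataFrame with something like `pd.read_csv(...)` on your own file.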
Now that we know which columns have missing values, we can impute them using simple statistics (mean, median, mode) or other methods, such as creating a column to indicate missing values or replacing missing values with a separate category.
Luckily the data that I worked with had no missing values!
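If your data does have gaps, simple statistical imputation could look roughly like this. It is a sketch on made-up data; the choice of median for the numeric column and mode for the categorical column is just one reasonable default, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "mileage": [82000.0, 61000.0, np.nan, 45000.0],
    "fuel_type": ["petrol", None, "diesel", "petrol"],
})

# Numeric column: fill missing values with the median of the observed values
df["mileage"] = df["mileage"].fillna(df["mileage"].median())

# Categorical column: fill missing values with the most frequent category
df["fuel_type"] = df["fuel_type"].fillna(df["fuel_type"].mode()[0])

print(df)
```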
Exploratory Data Analysis:
The use of statistics in data science is extremely underrated. Before we feed the data into a model we first need to understand what kind of data works for the chosen model.
– Continuous Data:
Visualization libraries such as seaborn and matplotlib provide a wide range of plots for exploring our data. With continuous data, we can start by plotting a histogram, a kernel density plot, and scatterplots, and finish with a correlation matrix.
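As a rough illustration, here is how a histogram, a scatterplot, and a correlation matrix could be drawn with matplotlib and pandas alone. The data is synthetic and the column names are assumptions; a kernel density plot is left out to keep the dependencies minimal.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for car listings: price falls as mileage rises
rng = np.random.default_rng(0)
mileage = rng.uniform(5_000, 150_000, size=200)
df = pd.DataFrame({
    "mileage": mileage,
    "price": 30_000 - 0.15 * mileage + rng.normal(0, 1_500, size=200),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: how the values of one continuous column are distributed
axes[0].hist(df["price"], bins=30)
axes[0].set_title("Histogram of price")

# Scatterplot: relationship between two continuous columns
axes[1].scatter(df["mileage"], df["price"], s=8)
axes[1].set_xlabel("mileage")
axes[1].set_ylabel("price")
axes[1].set_title("price vs mileage")

# Correlation matrix: pairwise Pearson correlation of the numeric columns
corr = df.corr()
image = axes[2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(len(corr.columns)))
axes[2].set_xticklabels(corr.columns)
axes[2].set_yticks(range(len(corr.columns)))
axes[2].set_yticklabels(corr.columns)
axes[2].set_title("Correlation matrix")
fig.colorbar(image, ax=axes[2])

fig.tight_layout()
print(corr)
```

In a notebook you would drop the `matplotlib.use("Agg")` line and call `plt.show()` to see the figure.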
– Categorical Data :
We can visualize categorical columns with bar plots or pie charts that show how often each class/category occurs in the column. Seaborn and matplotlib both provide beautiful visualization tools for this. However, the simplest method is to use pandas directly (which in turn uses matplotlib).
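A short sketch of the pandas route, assuming a hypothetical `fuel_type` column:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({
    "fuel_type": ["petrol", "diesel", "petrol", "electric", "petrol", "diesel"],
})

# Frequency of each category, most common first
counts = df["fuel_type"].value_counts()
print(counts)

# pandas draws the bar plot through matplotlib
ax = counts.plot(kind="bar", title="Listings per fuel type")
ax.set_ylabel("count")
plt.tight_layout()
```

Passing `kind="pie"` instead of `kind="bar"` gives the pie-chart version.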
Feature Engineering :
We have obtained our data. We have cleaned the data. We know how our data is distributed. So let’s get started with building a model to pre- STOP RIGHT THERE! We are not there yet. Our data is still not ready to be used for model training yet.
Not every distribution of data works well with every model. As an analogy, if I (an Indian) were to go to another country (say, China), I would not be able to adjust to the food and work at my full potential, because I’m accustomed to classic Indian meals. Linear models work well with Gaussian data, tree-based models do not require you to normalize your data, and other models need other kinds of data. It is good practice to start by fixing skewness and transforming your data toward a Gaussian distribution.
Before transforming anything, visualize the distribution of all your numerical columns and find out how skewed each one is.
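One way to do this, sketched on synthetic data with assumed column names, is to plot a histogram per numeric column and compute the sample skewness with pandas:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display
import numpy as np
import pandas as pd

# Synthetic data: a right-skewed price, a roughly uniform year, one text column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.lognormal(mean=9.5, sigma=0.6, size=500),
    "year": rng.integers(2005, 2021, size=500),
    "make": rng.choice(["A", "B"], size=500),
})

# Keep only the numeric columns
numeric_cols = df.select_dtypes(include="number").columns

# One histogram per numeric column
df[numeric_cols].hist(bins=30, figsize=(8, 3))

# Sample skewness: near 0 is symmetric, large positive means a long right tail
skewness = df[numeric_cols].skew().sort_values(ascending=False)
print(skewness)
```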
This skewness can be fixed with various transformations, such as:
- Log Transformation
- Box-Cox Transformation
- Exponential Transformation
- Reciprocal Transformation
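The transformations listed above can be compared side by side in a few lines. This is a sketch on a synthetic right-skewed column, not the project's data. The power transform with exponent 0.5 is my reading of the "exponential" entry and the exponent is an arbitrary choice; Box-Cox comes from SciPy and requires strictly positive values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed, strictly positive column
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=9.5, sigma=0.6, size=500), name="price")

# Box-Cox estimates its own lambda by maximum likelihood
boxcox_values, fitted_lambda = stats.boxcox(price.to_numpy())

transformed = {
    "original": price,
    "log": np.log1p(price),               # log(1 + x), safe at x = 0
    "box-cox": pd.Series(boxcox_values),
    "power (x ** 0.5)": price ** 0.5,     # exponent chosen arbitrarily
    "reciprocal": 1.0 / price,
}

# Skewness after each transformation; closer to 0 means more symmetric
skews = {name: float(values.skew()) for name, values in transformed.items()}
for name, value in skews.items():
    print(f"{name:>18}: skew = {value:.3f}")
```

Which transformation works best depends on the column, so it is worth checking the skewness (and the histogram) after each one rather than applying a single transformation everywhere.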
Creating New Columns:
There is no standard procedure for this step. It completely varies depending on the type of data that you have. I will try to elaborate by giving examples where you would use this.
- Instance 1: Let’s say you have a column of categorical data and 10%-15% of its values are missing. You cannot drop those rows, as that would considerably reduce your data size. What do you do? Create a new column that has the value 1 where the given column is not null and 0 where it is null, and then impute the null values with -1 or “N.A” as a new category.
- Instance 2: Let’s say you have a column of continuous data such as age. You can further use this column to create a new column (let’s say ‘Adult’), with value 0 if age < 18 and 1 for age ≥ 18…
There are many other approaches that you can take while performing feature engineering, but these are the two that I stick to most often.
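Both instances translate into a few lines of pandas. The DataFrame below is made up, and `color` and `color_present` are hypothetical column names; `Adult` follows the example in the text.

```python
import pandas as pd

# Hypothetical data: a categorical column with gaps and a continuous age column
df = pd.DataFrame({
    "color": ["red", None, "blue", None],
    "age": [16, 35, 17, 52],
})

# Instance 1: flag which rows had a value, then fill the gaps with a new category
df["color_present"] = df["color"].notnull().astype(int)  # 1 = not null, 0 = null
df["color"] = df["color"].fillna("N.A")

# Instance 2: derive a binary column from a continuous one
df["Adult"] = (df["age"] >= 18).astype(int)  # 0 if age < 18, 1 otherwise

print(df)
```

The indicator column has to be created before the fill, otherwise the information about which rows were missing is already gone.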
Split the data into training and testing sets before we start model building.
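With scikit-learn, the split could look like the sketch below. The data is synthetic, the column names are assumptions, and the 80/20 ratio is a common default rather than something taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mileage": rng.uniform(5_000, 150_000, size=100),
    "year": rng.integers(2005, 2021, size=100),
})
df["price"] = 30_000 - 0.15 * df["mileage"] + rng.normal(0, 1_500, size=100)

# Separate the features from the target
X = df.drop(columns="price")
y = df["price"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```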
We are finally here. Phew! That was long. Our data is now ready to be used for model building.
For this particular project (resale car price prediction), the task we are performing is regression. Thus we will try out various regression models, find out which gives us the best score (for example, the lowest RMSE), and then build stacked models.
The models I usually use here are: Linear Regression, Random Forest Regressor, Gradient Boosting Trees, SVM, Lasso, ElasticNet, BayesianRidge, Ridge, LassoLarsIC, Kernel Ridge, XGBoost, and LightGBM. Quite a long list, isn’t it?
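To compare candidates on equal footing, you could cross-validate each one and rank them by RMSE. The sketch below uses synthetic data and only a handful of scikit-learn models from the list; the hyperparameters are placeholder assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the car data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# A subset of the models named above, with placeholder settings
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

rmse_by_model = {}
for name, model in models.items():
    # scikit-learn reports negated RMSE so that higher is always better
    scores = cross_val_score(
        model, X, y, cv=3, scoring="neg_root_mean_squared_error"
    )
    rmse_by_model[name] = -scores.mean()

# Lowest RMSE first
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.2f}")
```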
For each of these models, create a parameter grid, then use Randomized Search to find good parameters for that model.
Given below is the code for Gradient Boosting Regressor. This process is performed on every single model in the list with a different parameter grid.
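A minimal sketch of such a search for a Gradient Boosting Regressor, using scikit-learn's `RandomizedSearchCV` on synthetic data. The parameter values are illustrative assumptions, not the grid used in the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression problem standing in for the car data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative parameter grid; a real grid depends on the dataset
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
    "subsample": [0.7, 0.85, 1.0],
}

# Try 5 random combinations, scoring each with 3-fold cross-validated RMSE
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=5,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)

# Check the tuned model on the held-out test set
test_rmse = float(np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2)))
print("Test RMSE:", test_rmse)
```

Randomized search samples only `n_iter` combinations instead of trying all of them, which keeps the cost manageable when the same process is repeated for a dozen models.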
Once you know which models give you the best results, pick all or a few of them and create multiple ensembles to get the best output.
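One way to build such an ensemble is scikit-learn's `StackingRegressor`, which trains a final model on the predictions of the base models. The choice of base models, the Ridge meta-model, and all hyperparameters below are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the car data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Base models feed their cross-validated predictions to the final estimator
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
        ("lasso", Lasso(alpha=0.1)),
    ],
    final_estimator=Ridge(),
    cv=3,
)
stack.fit(X_train, y_train)

predictions = stack.predict(X_test)
stack_rmse = mean_squared_error(y_test, predictions) ** 0.5
print("Stacked RMSE:", stack_rmse)
```

A stacked model is not guaranteed to beat its best base model, so compare its RMSE against the individual models before settling on it.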
We now have a good predictor, as good as it can get! We have come a long way from just an idea. We have finally completed our first project YAYYY! …oh wait. We haven’t deployed the model yet!
Thanks for sticking around. I hope this helped you. Leave a clap if you found this informative, and feel free to contact me if you have any queries related to data science.