Simple Introduction to Data Science

Original article was published on Artificial Intelligence on Medium


The lifecycle of a Data Science project

A data science project can be divided into seven steps:

1. Understanding the Business Problem

This is the first and foremost step in any data science project. You need to understand the business problem your project is built on, because any ambiguity or confusion at this stage can cost you a lot later. Understanding the problem properly is one of the most important traits of anyone who wants to succeed in data science. If you are working for a client, ask as many questions as possible at this initial stage. Before you start looking at the data, the business framework should be absolutely clear in your head.

For example, suppose the business problem requires you to build a predictive model that identifies which customers of a credit card issuing bank are most prone to attrition. The questions you might want to ask include the following:

  1. How do you define attrition? This will help you understand the outlook of your business stakeholders.
  2. Which behaviors fall under the umbrella term "attrition"? For instance, in one scenario a customer closes the account voluntarily, whereas in another the account stays dormant for a very long time. Each such case has to be clarified so that you can capture attrition correctly once you start working with the data.
  3. Is model accuracy more important, even at the cost of model readability, or does readability have to be prioritized? This will help you choose the modeling approach: a statistical model or an ML model.
  4. Who will be the end user of this model? Is it a standalone model, or does it feed into some other model? This will help in designing the model's input-output structure.

You get the idea. Got more questions? Go ahead and ask them all. This information forms the foundation of the modeling.
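Once the stakeholders have answered these questions, the agreed definition can be written down as an explicit rule. The sketch below is only an illustration of that idea: the function name `is_attrited` and the 365-day dormancy threshold are assumptions, not values from any real bank.

```python
from datetime import date

# Hypothetical threshold: days without activity before an account counts as dormant.
DORMANCY_DAYS = 365


def is_attrited(closed_voluntarily, last_activity, as_of, dormancy_days=DORMANCY_DAYS):
    """Return True if the account counts as attrited under the assumed definition."""
    if closed_voluntarily:
        return True  # scenario 1: the customer closed the account
    # scenario 2: the account has been inactive for too long
    return (as_of - last_activity).days >= dormancy_days


as_of = date(2020, 6, 30)
closed = is_attrited(True, date(2020, 6, 1), as_of)
dormant = is_attrited(False, date(2019, 1, 15), as_of)
active = is_attrited(False, date(2020, 6, 10), as_of)
print(closed, dormant, active)
```

Writing the rule down this way makes it easy to show the stakeholders exactly which accounts the model will treat as attrited before any modeling starts.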

2. Data Acquisition

The next step after understanding the business problem is to scout for data sources from which you can gather the behavioral, performance, geographical, and other descriptive data relevant to your business problem. Preferably, you should source your data from stable and reliable sources.

For example, if you want to target a customer base for a bank credit card balance transfer offer, you would need a plethora of attributes to determine the profile of a customer. You can use credit bureau data (trades, inquiries, delinquency, etc.) and internal historical data (past offer acceptance rates, end-of-month outstanding balances, internal credit score, etc.). In short, you need a group of characteristics that are statistically determined to be predictive in separating the good and bad accounts for a particular type of offer.

3. Data Preparation

Once the data is collected, it enters the data preparation stage, which is often referred to as “pre-processing” as well. At this stage the raw data is cleaned up and organized for the next stages of the data science project. During preparation, the raw data is diligently checked for errors. The purpose of this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create high-quality data for the best business intelligence.

For example, say you sourced data from the various sources discussed in stage 2, but while preparing the data you realize that one particular source does not join properly at the required level (say, the account ID level) and produces an erroneous attribute column. You can remove that data source at this stage. Redundant, incomplete, or incorrect data is tackled in the same manner at this stage.
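The checks described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the helper names `join_coverage` and `clean_records`, the field names, and the sample rows are all made up for the example.

```python
def join_coverage(base_ids, source_ids):
    """Share of base account IDs that find a match in the other source."""
    base = set(base_ids)
    if not base:
        return 0.0
    return len(base & set(source_ids)) / len(base)


def clean_records(records, required_fields):
    """Drop exact duplicate rows and rows missing any required field."""
    seen = set()
    cleaned = []
    for row in records:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # redundant row
        seen.add(key)
        if any(row.get(field) is None for field in required_fields):
            continue  # incomplete row
        cleaned.append(row)
    return cleaned


# Made-up internal data and a made-up list of IDs from a second source.
accounts = [
    {"account_id": "A1", "balance": 1200.0},
    {"account_id": "A1", "balance": 1200.0},  # duplicate
    {"account_id": "A2", "balance": None},    # incomplete
    {"account_id": "A3", "balance": 310.5},
]
bureau_ids = ["A3", "B7", "B9"]

cleaned = clean_records(accounts, required_fields=["account_id", "balance"])
coverage = join_coverage([row["account_id"] for row in cleaned], bureau_ids)
print(len(cleaned), coverage)
```

A low coverage value is a hint that the second source is not joining properly at the account ID level and may need to be fixed or dropped.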

4. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to analyzing datasets in order to summarize their main characteristics, most often with visual methods. EDA is used to see what the data can show us before the modeling task. It is not easy to look at a column of numbers or a whole spreadsheet and determine the important characteristics of the data, and deriving insights from data in its tabular form can seem overwhelming. EDA techniques include univariate analysis (box plots, histograms, etc.), which looks at the basic behavior of a single variable in the data: its maximum, minimum, interquartile range, outliers, and so on. They also include bivariate analysis (scatter plots, bar charts, line graphs, etc.), which looks at the relationships and trends between the dependent and independent variables.

EDA is a crucial step to perform before diving into the machine learning or statistical modeling stage. It provides the understanding and context needed to develop an appropriate model for the problem at hand and to interpret its results correctly. EDA helps make certain that the results produced are valid, correctly interpreted, and applicable to the desired business context.
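The numbers behind a box plot can also be computed directly. The sketch below is one possible univariate summary using only the standard library; the function name `univariate_summary`, the sample balances, and the 1.5 x IQR outlier rule are assumptions chosen for illustration.

```python
import statistics


def univariate_summary(values):
    """Min, max, quartiles, IQR and outliers for one numeric variable."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "outliers": [v for v in values if v < lower or v > upper],
    }


# Made-up month-end balances for a handful of accounts.
balances = [120, 135, 150, 160, 175, 190, 210, 950]
summary = univariate_summary(balances)
print(summary)
```

Here the single very large balance is flagged as an outlier, which is exactly the kind of observation a box plot would make visible.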

5. Predictive Modeling

Predictive modeling is a commonly used statistical technique to predict future behavior. Predictive modeling solutions are a form of data-mining technology that works by analyzing historical and current data and generating a model to help predict future outcomes.

There are various kinds of predictive modeling techniques, such as regression, classification, and clustering.

This step is the most crucial one. While developing a predictive model, make sure to evaluate the model after building it, and also develop 3–4 challenger models so that you can select the best one among them. If it’s an ML model, you will judge it largely on accuracy, because many ML models function as a black box. On the other hand, if it’s a statistical model, take a good look at the final variables you have used in your model and check their business relevance along with the results of the model evaluation metrics.
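Comparing challenger models can be as simple as scoring each one on the same validation sample. The following is a rough sketch, not a full evaluation framework: the model names and their predictions are invented, and accuracy is used only because it is the metric mentioned above.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def pick_champion(candidate_predictions, actual):
    """Score every candidate on the same sample and return the best one."""
    scores = {
        name: accuracy(preds, actual)
        for name, preds in candidate_predictions.items()
    }
    champion = max(scores, key=scores.get)
    return champion, scores


# Made-up validation labels (1 = attrited) and predictions from three
# hypothetical challenger models.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
candidates = {
    "logistic_regression": [1, 0, 0, 1, 0, 0, 1, 0],
    "decision_tree": [1, 0, 1, 1, 0, 0, 0, 1],
    "gradient_boosting": [1, 0, 0, 1, 0, 1, 0, 1],
}

champion, scores = pick_champion(candidates, actual)
print(champion, scores)
```

In practice the choice would also weigh the business relevance of the variables and other evaluation metrics, not accuracy alone.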

6. Visualization and Communication

Once the model is ready, it’s really important to communicate its results effectively. Prepare proper documentation showing the results of the model on the development, validation, and out-of-time samples, so that the Gini, KS, or other evaluation metrics are clearly visible. Communicating the results should also include a clear description of the assumptions used while building the model. A sensitivity analysis is another important visualization tool that can depict the robustness of your model. Choose wisely how to communicate and display the accuracy and the role of the model that you have developed.
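For readers who have not met Gini and KS before, here is a small sketch of how they can be computed for a binary model. It assumes the common definitions: Gini as 2 x AUC - 1, and KS as the largest gap between the cumulative score distributions of the two classes. The scores and labels are made up, and the function names are illustrative.

```python
def split_by_label(scores, labels):
    """Separate scores of events (label 1) from non-events (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg


def auc(scores, labels):
    """Probability that a random event scores higher than a random non-event."""
    pos, neg = split_by_label(scores, labels)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def gini(scores, labels):
    return 2 * auc(scores, labels) - 1


def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of the classes."""
    pos, neg = split_by_label(scores, labels)
    best = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Made-up model scores and outcomes for eight accounts.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

sample_gini = gini(scores, labels)
sample_ks = ks_statistic(scores, labels)
print(sample_gini, sample_ks)
```

Running the same functions on the development, validation, and out-of-time samples gives the side-by-side numbers that the documentation should report.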

7. Deployment and Maintenance

Once the model has been approved by both the developers and the business stakeholders, it has to be put into production. The job is not over once a model goes to the production stage. The model must be continuously monitored at an appropriate frequency. The PSI (Population Stability Index) and CSI (Characteristic Stability Index) are good measures of how the current population distribution compares with the development sample. If the distribution is significantly different, the results of the model can become unreliable and a further deep-dive is required. The important KPIs that the model is predicting should also be monitored, and if the percentage change between two months or two periods in which the model has been run becomes high, the model might need to be recalibrated or redeveloped.
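As a rough illustration of the monitoring described above, the PSI can be computed from the number of accounts falling into each score band in the development sample and in the current population. The sketch assumes the usual formula, the sum over bands of (actual share - expected share) x ln(actual share / expected share). The band counts are made up, and the small `eps` floor is an assumption to avoid dividing by zero for empty bands.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Convert counts to shares; floor them so the log is always defined.
        e_share = max(expected / expected_total, eps)
        a_share = max(actual / actual_total, eps)
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total


# Made-up counts per score band: development sample vs. latest month.
development = [100, 200, 400, 200, 100]
current = [150, 250, 350, 150, 100]

stable = psi(development, development)
shifted = psi(development, current)
print(stable, shifted)
```

The CSI uses the same calculation applied to the bands of a single input characteristic instead of the final score. A commonly quoted rule of thumb, which does not come from this article, treats values below 0.1 as stable and values above 0.25 as a significant shift.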

These are the various stages of a data science project. Some data science teams specialize in only a few of the stages discussed above, whereas in Kaggle competitions or in many firms you might be responsible for the end-to-end development of a data science project. Either way, all the stages are important for developing a robust and reliable model that can help tackle the business problem at hand.

Each stage involves many further technical sub-steps which together make up that stage. This post was my attempt to give a holistic view of a data science project without any technical jargon.