What Is a Data Science Pipeline?

Original article can be found here (source): Artificial Intelligence on Medium

“Data science” is undoubtedly a buzzword today. Everyone talks about it, but very few people know what data science actually is or what the stages of a data science pipeline are. In this article I will explain the data science pipeline from the very basics.

At a high level, a data science pipeline has three stages:

Data Collection

Data Modelling

Data Deployment

Data Collection

The first step in the pipeline is data collection. Nothing can be done with data until it has been collected.

Data is often called the new oil of this century. To run a business today, whether an established company or a new startup, we need data, because data provides the insights on which important decisions are based.

There are many sources through which data can be gathered, for example:

  1. Log Data
  2. Smart Devices Data
  3. Sensor Data
  4. Social Media Data
  5. Survey Data, etc.

This is the first stage of the data science pipeline, in which we gather the data.
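As a rough illustration of this stage, the sketch below loads a small sensor log into a single table. The column names, device ids and readings are made up for the example, and pandas is only one of many tools that could be used here.

```python
import io

import pandas as pd

# Hypothetical log export: timestamp, device id and a sensor reading.
# The last record has no reading, which is common in collected data.
raw_log = io.StringIO(
    "timestamp,device_id,temperature\n"
    "2020-01-01 00:00:00,sensor-a,21.5\n"
    "2020-01-01 00:05:00,sensor-b,19.8\n"
    "2020-01-01 00:10:00,sensor-a,\n"
)

# Load the raw records into one table; parse_dates converts the
# text timestamps into datetime values.
collected = pd.read_csv(raw_log, parse_dates=["timestamp"])

print(collected.shape)
print(collected.dtypes)
```

In a real project the `io.StringIO` object would be replaced by a file path, a database query or an API response.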

Data Modelling

This is the biggest part of the data science pipeline, because it covers all the steps taken to convert the acquired data into a format that can be used by a machine learning or deep learning model.

This part consists of:

  1. Data Exploration: The data is explored: important features are identified, correlations between features are examined, and the importance of each feature is estimated, either by plotting graphs or from the basic descriptive statistics provided by libraries such as pandas and NumPy.
  2. Data Cleaning: This involves removing unwanted values and filling in missing values.
  3. Data Transformation: This involves converting categorical data into numerical data by encoding it.
  4. Data Reduction: In this step, unwanted features are removed.
  5. Splitting Data: The data is split into training, validation and testing sets, so that a model can be fitted, tuned and then checked on data it has not seen. To know more about data splitting, refer to my article:

This covers the data modelling part. From here, we proceed to the last part.
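The five data modelling steps can be sketched in a few lines of pandas and NumPy. This is a minimal illustration, not a recipe: the data set is made up, and the choices made here (median filling, one-hot encoding, a 60/20/20 split) are assumptions picked for the example.

```python
import numpy as np
import pandas as pd

# Made-up data set, used only for illustration.
df = pd.DataFrame({
    "age":    [25, 32, 47, np.nan, 51, 38, 29, 44, 36, 58],
    "city":   ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"],
    "row_id": range(10),                       # identifier, no predictive value
    "income": [30, 42, 65, 50, 70, 48, 35, 60, 45, 80],
    "bought": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],  # target column
})

# 1. Data exploration: summary statistics and pairwise correlations
#    of the numeric columns.
print(df.describe())
print(df.select_dtypes("number").corr())

# 2. Data cleaning: fill the missing age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# 3. Data transformation: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"])

# 4. Data reduction: drop a feature that carries no useful signal.
df = df.drop(columns=["row_id"])

# 5. Splitting: shuffle the rows, then cut them into
#    60% training, 20% validation and 20% testing.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(df))
n_train = len(df) * 6 // 10
n_val = len(df) * 2 // 10
train = df.iloc[order[:n_train]]
val = df.iloc[order[n_train:n_train + n_val]]
test = df.iloc[order[n_train + n_val:]]

print(len(train), len(val), len(test))
```

Shuffling before the split matters whenever the rows arrive in some meaningful order, for example sorted by date or by class.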

Data Deployment

In this step, we train a model on the data prepared in the previous section, and then evaluate it to check whether it can be used in the real world.
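To make "train, then evaluate" concrete, here is a minimal sketch that uses ordinary least squares as a stand-in model. The data is synthetic and the model choice is an assumption for illustration only; the same pattern applies to any model: fit on the training rows, score on rows held back for testing.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data (an assumption for illustration): the target depends
# linearly on two features, plus a little noise.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=200)

# Hold out the last 40 rows as the test set.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]


def add_bias(features):
    """Append a column of ones so the model can learn an intercept."""
    return np.hstack([features, np.ones((len(features), 1))])


# Train: least squares fit using the training rows only.
weights, *_ = np.linalg.lstsq(add_bias(X_train), y_train, rcond=None)

# Evaluate: mean squared error and R^2 on rows the model has not seen.
pred = add_bias(X_test) @ weights
mse = np.mean((y_test - pred) ** 2)
r2 = 1.0 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print(f"test MSE = {mse:.4f}, R^2 = {r2:.4f}")
```

If the test score is much worse than the training score, the model is likely overfitting and is not yet ready for real-world use.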

If the model has low accuracy, we have to experiment to improve it. For example, we can tune the hyper-parameters of the model to improve the results.
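One simple way to tune a hyper-parameter is to try several values and keep the one that scores best on the validation set. The sketch below does this for the number of neighbours `k` in a small k-nearest-neighbours classifier written in NumPy. The classifier, the candidate values and the synthetic data are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic two-class data (assumption): two overlapping Gaussian blobs.
n = 150
X = np.vstack([rng.normal(0.0, 1.0, size=(n, 2)),
               rng.normal(2.0, 1.0, size=(n, 2))])
y = np.array([0] * n + [1] * n)

# Shuffle, then split into training, validation and testing sets.
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_train, y_train = X[:180], y[:180]
X_val, y_val = X[180:240], y[180:240]
X_test, y_test = X[240:], y[240:]


def knn_predict(X_ref, y_ref, X_new, k):
    """Classify each new point by majority vote among its k nearest reference points."""
    dists = np.linalg.norm(X_new[:, None, :] - X_ref[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_ref[nearest].mean(axis=1)  # fraction of neighbours in class 1
    return (votes > 0.5).astype(int)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(y_true == y_pred))


# Try several odd values of k (odd values avoid tied votes) and keep
# the one with the best validation accuracy.
candidates = [1, 3, 5, 7, 9, 11]
val_scores = {
    k: accuracy(y_val, knn_predict(X_train, y_train, X_val, k))
    for k in candidates
}
best_k = max(val_scores, key=val_scores.get)

# Report test accuracy once, for the chosen k only.
test_acc = accuracy(y_test, knn_predict(X_train, y_train, X_test, best_k))
print(val_scores, best_k, test_acc)
```

The test set is touched only once, after `k` has been chosen, so the reported accuracy is not biased by the tuning itself.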

Finally, we deploy the model, that is, we use it on real-world data to gain insights that help the business or whatever goal the model serves.
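In its simplest form, deployment means storing the trained parameters somewhere and loading them again wherever predictions are needed. The sketch below shows that idea with a JSON file. The model values, the file name and the `predict` helper are hypothetical, and a production system would usually add a serving layer, versioning and monitoring on top.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for a trained linear model (values made up for illustration).
model = {"weights": [3.0, -2.0], "bias": 1.0}

# "Deploy": persist the parameters so another process can load them later.
model_path = Path(tempfile.mkdtemp()) / "linear_model.json"
model_path.write_text(json.dumps(model))


def predict(path, features):
    """Load the stored parameters and score new rows of features."""
    params = json.loads(Path(path).read_text())
    return np.asarray(features) @ np.asarray(params["weights"]) + params["bias"]


# New input arriving after deployment.
new_rows = [[1.0, 0.0], [0.5, 2.0]]
print(predict(model_path, new_rows))
```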