Online Shopper’s Intention

Original article was published by Navaneeth Sharma on Artificial Intelligence on Medium

Hi! For the past week, I have been working on a project called OSI (Online Shoppers Intention). In this article, I explain how I completed the project by following the Data Science life cycle.

Data Science Life Cycle

The DS life cycle consists of 8 major parts. It is important to carry out all of them for a project to succeed. I have listed the components below:

  1. Understanding the business requirements
  2. Data Collection
  3. Data Preparation
  4. Exploratory Data Analysis (EDA)
  5. Modeling and Evaluation
  6. Deployment
  7. Real-World Testing
  8. Optimization (Including retraining and managing it)

So let’s go through each in detail and implement it in practice. (I used Python and Excel for visualization and model preparation.)

Understanding the Business Requirements

The first step in solving any problem is to understand it; without that, we cannot solve it well. Understanding the business requirements means framing a problem statement that matches them. For our case, let’s assume you are a Data Scientist at an e-commerce company, and you are asked to analyze the data and build a model that can predict whether a customer will generate revenue. From this description, it is clear that this is a classification problem, because we need to sort the customers into two categories (revenue generated or not).

Data Collection

Let’s get the data for our task from the UCI Machine Learning Repository, where it is available for download (it is listed there as the Online Shoppers Purchasing Intention Dataset).

Data Preparation And Cleaning

Data preparation and cleaning is one of the crucial parts of the DS life cycle. Fortunately, there are no missing values in the data set we downloaded; when there are, the Pandas and NumPy libraries for Python offer techniques to handle them. Let’s look at the data to understand how we can proceed.

import pandas as pd
df = pd.read_csv('osi.csv')
df.head()  # show the first few rows

A few data points of the OSI data set

By observing the data, we see that there are 17 independent features and one dependent feature, i.e. Revenue. To visualize the data, it is essential to convert all features to numerical form. Let’s transform them using the Sklearn library.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['VisitorType'] = le.fit_transform(df['VisitorType'])
df['Month'] = le.fit_transform(df['Month'])
df['Weekend'] = le.fit_transform(df['Weekend'])
df['Revenue'] = le.fit_transform(df['Revenue'])

That’s it for cleaning! If the data had missing values or other kinds of noise, cleaning and preprocessing would take more work.
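Had the data set contained missing values, a check-and-fill step along the following lines could handle them. This is only a minimal sketch on a made-up frame: the column names match the data set, but the values and the choice of median/mode imputation are my assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (illustrative values, not taken from the real data set)
toy = pd.DataFrame({
    'PageValues': [0.0, 12.5, np.nan, 3.2],
    'Month': ['Feb', None, 'Mar', 'Mar'],
})

# Count the missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fill numeric gaps with the median and categorical gaps with the most frequent value
toy['PageValues'] = toy['PageValues'].fillna(toy['PageValues'].median())
toy['Month'] = toy['Month'].fillna(toy['Month'].mode()[0])
print(toy)
```

Running `df.isnull().sum()` on the real frame is a quick way to confirm that every count is zero before moving on.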

Exploratory Data Analysis (EDA)

Let’s dig into the plots and graphs. Many Machine Learning algorithms assume that the independent features are not correlated with each other. Let’s check this with the Pearson correlation.

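import seaborn as sns
import matplotlib.pyplot as plt
# Plotting call assumed here: Pearson correlation matrix drawn as an annotated heatmap
sns.heatmap(df.corr(method='pearson'), annot=True)

Pearson Correlation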

From the above figure, it is clear that the administrative features (both the page count and the duration) are correlated. The Informational, Product Related, and rate features (Exit and Bounce) behave similarly. Page Values appears to be the feature most correlated with Revenue. So we can combine each group of highly correlated features into a single feature.

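# Ratio features: pages visited per unit of time spent (the small constant avoids division by zero)
df['ProductRel_per_dur'] = df['ProductRelated']/(df['ProductRelated_Duration']+0.00001)
df['Admin_per_dur'] = df['Administrative']/(df['Administrative_Duration']+0.00001)
df['Inform_per_dur'] = df['Informational']/(df['Informational_Duration']+0.00001)

def select_columns(data_frame, column_names):
    # Return a new frame that contains only the requested columns
    new_frame = data_frame.loc[:, column_names]
    return new_frame

# The remaining entries of this list, and the code that creates 'Bounce_by_exit',
# are not shown here; later steps expect 'Revenue' to be among the selected columns.
col = ['ProductRel_per_dur','PageValues','SpecialDay','Month','Admin_per_dur','Inform_per_dur','Bounce_by_exit']
Df = select_columns(df,col)
plt.figure(figsize=(18,12))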

Let’s see how it behaves after combining the features.

Pearson Correlation after Combining Features

This seems better than the previous case, since the features are now less correlated with each other. (I set a threshold of 60% for the correlation.)
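The 60% threshold can also be checked in code instead of by reading the heatmap. Here is a minimal sketch, assuming the threshold applies to the absolute Pearson coefficient; the helper name `correlated_pairs` and the toy data are mine, not part of the project.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.6):
    """Return (feature_a, feature_b, r) for pairs whose |Pearson r| exceeds the threshold."""
    corr = frame.corr(method='pearson')
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Toy data: 'b' is built from 'a', while 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})
pairs = correlated_pairs(toy)
print(pairs)
```

Calling `correlated_pairs(Df)` on the combined frame would list any feature pairs that are still above the threshold.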

One of the major factors is choosing the right features (it is a common misconception that more features always increase the prediction score). Let’s implement this.

X1 = Df.drop('Revenue',axis=1)
y1 = df['Revenue']
# Use the built-in feature_importances_ attribute of tree-based classifiers
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier(n_estimators = 5,
                             criterion ='entropy', max_features = 12)
model.fit(X1,y1)
feat_importances = pd.Series(model.feature_importances_, index=X1.columns)
ax = feat_importances.nlargest(15).plot(kind='barh',color='grey')
plt.xlabel('Feature Importance Score')

The output should look roughly like this:

Feature Importance

We can also pass our data through the chi-square statistical test (if you are unfamiliar with it, please get some intuition first).

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X1 = Df.drop('Revenue',axis=1)
y1 = Df['Revenue']
bestfeatures = SelectKBest(score_func=chi2, k=12)
fit = bestfeatures.fit(X1,df['Revenue'])
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X1.columns)
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']
print(featureScores.nlargest(12,'Score'))  # features sorted by score

This is the output:

                 Specs         Score
0   ProductRel_per_dur  1.649523e+07
4        Admin_per_dur  8.581059e+05
5       Inform_per_dur  6.048444e+05
1           PageValues  1.751268e+05
3                Month  8.616370e+01
2           SpecialDay  5.379709e+01
10         VisitorType  3.754752e+01
7              Browser  8.873291e+00
11             Weekend  8.120464e+00
8               Region  3.037565e+00
9          TrafficType  1.283194e+00
6     OperatingSystems  1.037132e+00
Output of Chi-Square test
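The scores above span many orders of magnitude, and one likely reason is that the chi-square score used for feature selection grows with the scale of a feature. The NumPy sketch below follows my understanding of how sklearn's chi2 computes its statistic (feature values are treated like counts); the function name `chi2_scores` and the toy data are assumptions for illustration.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square score of each non-negative feature against the class labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Observed: sum of each feature's values within each class
    observed = np.array([X[y == c].sum(axis=0) for c in classes])
    # Expected: each feature's total, split according to the class frequencies
    class_prob = np.array([(y == c).mean() for c in classes])
    expected = np.outer(class_prob, X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)

X = np.array([[1.0, 3.0], [2.0, 3.0], [8.0, 2.0], [9.0, 5.0]])
y = np.array([0, 0, 1, 1])

scores = chi2_scores(X, y)
scaled_scores = chi2_scores(X * 1000, y)
print(scores)
print(scaled_scores / scores)
```

Multiplying a feature by 1000 multiplies its score by 1000 as well. The ratio features divide by 0.00001 whenever a duration is zero, which produces very large values, so their scores are not directly comparable with those of small-valued features such as Weekend.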

Now we need to get more insights from the data. I have made a presentation file that explains the data a bit more. (Click here to visit that ppt file.)

By observing the results, it is clear that Page Values, Administrative, Product Related, Informational and Month have a great impact on the output (Revenue). Let’s continue with these features.

Now let’s look at some plots related to both model preparation and EDA. (I used Excel for the graph below.)

Revenue Generated In the Given Data

Oh! It looks like the data is highly imbalanced: it contains far more non-revenue sessions (10,422, which is almost 84.5% of the total data). Such an imbalance can mislead our model. For the algorithms to learn properly, the imbalance needs to be corrected. Let’s do this using the imblearn library for Python.

X1 = Df[['ProductRel_per_dur','Admin_per_dur','Inform_per_dur','VisitorType','SpecialDay','PageValues','Region','Month']]
y1 = Df['Revenue']
from imblearn.combine import SMOTETomek
smk = SMOTETomek(random_state=42)
X_res,y_res = smk.fit_resample(X1,y1)  # fit_sample was renamed to fit_resample in newer imblearn versions
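To see why the imbalance matters, consider a "model" that always predicts "no revenue". The arithmetic below is a small sketch that uses the 10,422 figure quoted above; the total of 12,330 sessions is the size of the UCI data set.

```python
# Class counts: 10,422 non-revenue sessions out of 12,330 in total
n_no_revenue = 10422
n_total = 12330
n_revenue = n_total - n_no_revenue

majority_share = n_no_revenue / n_total
print(f"Majority class share: {majority_share:.1%}")

# A baseline that always predicts "no revenue" never finds a buyer
true_positives = 0
false_negatives = n_revenue
baseline_accuracy = n_no_revenue / n_total
baseline_recall = true_positives / (true_positives + false_negatives)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline recall on buyers: {baseline_recall:.1f}")
```

Such a baseline already reaches about 84.5% accuracy without learning anything, which is why accuracy alone is a weak guide on this data and why resampling (or a metric such as the F1 score) is useful.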

Let’s move to the model selection part. (I used a pair plot in my notebook; it can give further guidance for feature selection. Do check it out.)

Modeling and Evaluation

Using the knowledge gained from the data observed so far, I used StandardScaler to scale the data. I used the wrapper method and univariate methods for feature selection. These powerful methods helped me pick the best features for each algorithm.
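Scaling matters most for distance-based models such as KNN, where a feature with large values would otherwise dominate the distance. The NumPy sketch below shows the standardization that, as far as I know, StandardScaler performs (subtract the column mean, divide by the population standard deviation); the toy values are made up.

```python
import numpy as np

# Two toy features on very different scales (illustrative values)
X = np.array([
    [0.5, 1000.0],
    [1.5, 3000.0],
    [2.5, 2000.0],
    [3.5, 4000.0],
])

# Standardize: subtract the column mean, divide by the column standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, every column has a mean of (numerically) zero and a standard deviation of one, so each feature contributes on a comparable scale.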

I trained models with algorithms such as Logistic Regression, SVM classifier, Random Forest, KNN, and ANN. I got the most accurate validation score with KNN (K=1 and the ball tree algorithm). With some extra features added, it gives a training accuracy of 99.997% and a validation accuracy of about 91.6%, with a training F1 score of 99.96% and a validation F1 score of about 91.7%. Check out the code below (you can build a strong model with just these few lines).

import pandas as pd
df = pd.read_csv('osi.csv')
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['VisitorType'] = le.fit_transform(df['VisitorType'])
df['Month'] = le.fit_transform(df['Month'])
df['Weekend'] = le.fit_transform(df['Weekend'])
df['Revenue'] = le.fit_transform(df['Revenue'])
df['ProductRel_per_dur'] = df['ProductRelated']/(df['ProductRelated_Duration']+0.00001)
df['Admin_per_dur'] = df['Administrative']/(df['Administrative_Duration']+0.00001)
df['Inform_per_dur'] = df['Informational']/(df['Informational_Duration']+0.00001)
X1 = df[['ProductRel_per_dur','Admin_per_dur','Inform_per_dur','VisitorType','SpecialDay','PageValues','Region','Month']]
y1 = df['Revenue']
from imblearn.combine import SMOTETomek
smk = SMOTETomek(random_state=42)
X_res,y_res = smk.fit_resample(X1,y1)
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_res)
X1 = scaler.transform(X_res)
from sklearn.model_selection import train_test_split
X1,x2,y1,Y2 = train_test_split(X1,y_res,test_size=0.25)
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1,algorithm='ball_tree',weights='uniform')
neigh.fit(X1,y1)
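The listing above ends with the fitted model. Accuracy and F1 scores like the ones quoted earlier can then be computed along these lines. This is a sketch on synthetic stand-in data from make_classification (an assumption; substitute the scaled, resampled features), so the numbers it prints will differ from mine.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 8 features (not the OSI data)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1, algorithm='ball_tree', weights='uniform')
knn.fit(X_train, y_train)

# Score the model on the data it was trained on and on the held-out data
train_acc = accuracy_score(y_train, knn.predict(X_train))
val_pred = knn.predict(X_val)
val_acc = accuracy_score(y_val, val_pred)
val_f1 = f1_score(y_val, val_pred)

print(f"train accuracy: {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}, validation F1: {val_f1:.3f}")
```

With K=1, every training point is its own nearest neighbour, so a training accuracy close to 100% is expected by construction; the validation scores are the ones that say something about the model.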


Deployment

I used the Flask framework to deploy the model. Flask is a lightweight framework for building simple web applications, so I chose it; you can choose other frameworks too. Heroku is one of the popular web hosting services, and I hosted my model there. The code for all of this is available on GitHub.
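For readers who have not used Flask, a prediction endpoint can be sketched roughly as follows. This is not the code from my repository: the `/predict` route, the JSON layout and the `predict_revenue` placeholder are assumptions for illustration, and the placeholder would be replaced by the fitted scaler and KNN model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Feature order used by the training code
FEATURES = ['ProductRel_per_dur', 'Admin_per_dur', 'Inform_per_dur', 'VisitorType',
            'SpecialDay', 'PageValues', 'Region', 'Month']

def predict_revenue(values):
    """Hypothetical placeholder: swap in scaler.transform followed by neigh.predict."""
    # Dummy rule so that the sketch runs without a trained model
    return int(values[FEATURES.index('PageValues')] > 0)

@app.route('/predict', methods=['POST'])
def predict():
    # Read the JSON body; treat a missing or invalid body as an empty payload
    payload = request.get_json(silent=True) or {}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify({'error': 'missing features', 'fields': missing}), 400
    values = [float(payload[name]) for name in FEATURES]
    return jsonify({'revenue': predict_revenue(values)})
```

Calling `app.run()` starts Flask's development server locally; on a hosting platform the app object is usually served by a production WSGI server instead.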

Real World Testing

You can test the model by clicking here. It leads to the home page of the web app created.

Optimization (Including retraining and managing it)

This step is not included in this project, since we have not created a database to collect more data. If we collected more data using SQL or NoSQL databases, we could use it to retrain the model.
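As a rough idea of what such a data store could look like, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the two example features and the helper names are hypothetical; a real setup would store all the model features and use a persistent database.

```python
import sqlite3

# In-memory database for illustration; a real deployment would use a file or a server
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE sessions ('
    'id INTEGER PRIMARY KEY, page_values REAL, special_day REAL, revenue INTEGER)'
)

def log_session(page_values, special_day, revenue):
    """Store one labelled session for later retraining."""
    conn.execute(
        'INSERT INTO sessions (page_values, special_day, revenue) VALUES (?, ?, ?)',
        (page_values, special_day, revenue),
    )
    conn.commit()

def load_training_data():
    """Return the features and labels of all stored sessions."""
    rows = conn.execute(
        'SELECT page_values, special_day, revenue FROM sessions ORDER BY id'
    ).fetchall()
    X = [[row[0], row[1]] for row in rows]
    y = [row[2] for row in rows]
    return X, y

log_session(12.5, 0.0, 1)
log_session(0.0, 0.4, 0)
X_new, y_new = load_training_data()
print(X_new, y_new)
```

The lists returned by `load_training_data()` could be appended to the original training data before the model is fitted again.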


This project is meant to enhance my (and your) skills and knowledge for moving forward in the field of Data Science. Thank you… Happy learning.