Project 1: Bigmart Sales Prediction

Hi everyone, I'm Luan, and I come from Vietnam. This is my first article on Medium. My English is not very good, so I want to write more to improve it. This is also my first project since I started learning Machine Learning and Deep Learning. If I have made any mistakes, I hope you will forgive me. Thank you all very much.

This project is based on the work of Diogo Menezes Borges and Ryan. The links are as follows:

Now let's go into the details of the project.

Introduction to the data:

According to the provided information, Big Mart is a large supermarket chain. The data were collected in 2013 and cover 1559 products across 10 stores in different cities. They are provided as a challenge for data scientists: build a model that predicts the company's sales, to help its business succeed.

Building this model follows these steps:
1. Exploring Data.
2. Data Pre-processing.
3. Feature Engineering.
4. Creating Model.
5. Evaluation.

1. Exploring data

The full exploration is available via this link: . Here I only give a summary.

1.1 Distribution of the target variable

[Figure: Item_Outlet_Sales distribution]

We can see that our target variable is skewed to the right, so we will pay attention to that.
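To make "skewed to the right" more concrete, here is a minimal sketch that measures skewness with pandas and shows how a log transform reduces it. The sales values are synthetic (drawn from a log-normal distribution), not the BigMart data, and the log transform is only one common option, not a step this project applies later.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales" values (an assumption, for illustration only)
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=5000),
                  name="Item_Outlet_Sales")

skew_raw = sales.skew()            # positive value => long tail to the right
skew_log = np.log1p(sales).skew()  # log1p compresses the long right tail

print("skewness before log: %.2f" % skew_raw)
print("skewness after log:  %.2f" % skew_log)
```

A skewness well above 0 confirms the long right tail; after the transform the value should be much closer to 0.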

1.2 Distribution of the variable Item_Fat_Content

We see that the variable Item_Fat_Content has five distinct values. We should merge them into two values: Low Fat and Regular.

1.3 Distribution of the variable Item_Type

We see that there are sixteen different types. We need to think of a way to reduce this number.

1.4 Distribution of the variable Outlet_Location_Type

1.5 Distribution of the variable Outlet_Type

Supermarket Type2, Grocery Store and Supermarket Type3 are poorly represented in the distribution. Maybe we should merge them into a single category.

1.2. Bivariate Analysis

Once again, I would like to emphasize that this part follows the articles I referred to at the top of this post.

First, we will analyze the numerical variables.

1.2.1 Item_Weight and Item_Outlet_Sales analysis

We see that the variable Item_Weight has a low correlation with the variable Item_Outlet_Sales.

1.2.2 Item_Visibility and Item_Outlet_Sales analysis

We might expect the visibility of an item to be highly correlated with its sales. However, the plot above shows that items with low visibility can still have high sales. Even items with a visibility of 0 still sell very well.
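The "low correlation" statements in this part can be checked numerically. Below is a small sketch using pandas' corr() on made-up data: Hypothetical_Price is an invented column that drives sales, while Item_Weight is generated independently of sales. None of these numbers come from the BigMart data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
toy = pd.DataFrame({
    "Hypothetical_Price": rng.uniform(30, 270, n),  # invented column
    "Item_Weight": rng.uniform(4, 21, n),           # independent of sales
})
# Made-up sales that depend on the price column but not on the weight
toy["Item_Outlet_Sales"] = 15 * toy["Hypothetical_Price"] + rng.normal(0, 500, n)

# Pearson correlation of every feature with the target
corr_with_sales = toy.corr()["Item_Outlet_Sales"].drop("Item_Outlet_Sales")
print(corr_with_sales.round(2))
```

A value near 0 (as for the weight here) means there is no linear relationship, which is what the scatter plots in this section suggest for Item_Weight and Item_Visibility.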

1.2.3 Impact of Outlet_Identifier on Item_Outlet_Sales

1.2.4 Outlet_Establishment_Year and Item_Outlet_Sales analysis

There seems to be no significant relationship between Outlet_Establishment_Year and Item_Outlet_Sales.

1.2.5 Impact of Outlet_Size on Item_Outlet_Sales

The variable Outlet_Size probably does not have a high correlation with our target variable. Most stores are of size “Medium”, yet the “High” and “Small” stores, which are clearly fewer in number, can come close to or even beat their numbers.

1.2.6 Impact of Outlet_Type on Item_Outlet_Sales

From this analysis, it might be a good idea to create a new feature that shows the sales ratio according to the store size.

The other variables can be seen at the link I gave at the top of this article.

2. Data Pre-Processing

After exploring the data, we can draw some conclusions:
– Item_Visibility does not have a high positive correlation with sales, contrary to what we expected.
– There are also no big variations in sales due to Item_Type.
– If we look at the variable Item_Identifier, we can see a group of letters at the start of each product ID, such as ‘FD’ (Food), ‘DR’ (Drinks) and ‘NC’ (Non-Consumable). From this we can create a new variable.
– Item_Fat_Content has the value “low fat” written in different ways.
– For Item_Type, we will try to create a new feature that does not have 16 unique values.
– Outlet_Establishment_Year is effectively a hidden category, with values ranging from 1985 to 2009. It should be converted to the age of the store to better see its impact on sales.

2.1. Looking for missing values

First, we combine the train data and the test data to avoid writing the same code twice.

# Join Train and Test Dataset
# Create source column to later separate the data easily
train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train, test], ignore_index=True)
print(train.shape, test.shape, data.shape)

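Next, we look for NaN values. NumPy's NaN values are convenient because pandas recognizes them and can count them.

We see that 39.9 % of the Item_Outlet_Sales values are NaN. These are the rows that come from the test data, where the sales are what we want to predict, so we leave them as they are. The missing values that we do impute below are in Item_Weight and Outlet_Size.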
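Counting the NaN values per column can be sketched as follows. The frame below is a tiny made-up stand-in for the combined data, so the counts and percentages are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the combined train + test frame (values are made up)
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "High"],
    # NaN for rows coming from the test set, which has no sales column
    "Item_Outlet_Sales": [3735.1, 443.4, np.nan, np.nan],
})

missing_count = toy.isnull().sum()                   # NaN count per column
missing_pct = (toy.isnull().mean() * 100).round(1)   # share of rows, in percent

print(pd.DataFrame({"missing": missing_count, "percent": missing_pct}))
```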

2.2. Imputing Missing Values

2.2.1. Imputing the mean for Item_Weight missing values

data.pivot_table() (refer to this link ) lets us build a table of all item identifiers and their respective weights. Since this method ignores NaN values and the same item is sold in more than one store, we can take from this table the mean weight of all rows with the same Item_Identifier and use it wherever the weight is missing.

# aggfunc is mean by default and ignores NaN by default
item_avg_weight = data.pivot_table(values='Item_Weight', index='Item_Identifier')

# Inspect all rows of a single item, e.g. 'DRI11'
data[data['Item_Identifier'] == 'DRI11']

def impute_weight(cols):
    weight = cols['Item_Weight']
    identifier = cols['Item_Identifier']
    if pd.isnull(weight):
        # Look up the mean weight of this item across all stores
        return item_avg_weight.loc[identifier, 'Item_Weight']
    return weight

print('Original #missing: %d' % data['Item_Weight'].isnull().sum())
data['Item_Weight'] = data[['Item_Weight', 'Item_Identifier']].apply(impute_weight, axis=1).astype(float)
print('Final #missing: %d' % data['Item_Weight'].isnull().sum())

Running the code above gives the following result:

The NaN values have been replaced by the mean weight of the item.
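The same idea can also be written without a row-by-row apply. This is only an alternative sketch, not the code used in this project: it uses groupby().transform('mean') on a few made-up rows to fill each missing weight with the mean weight of the same item.

```python
import numpy as np
import pandas as pd

# Made-up rows: the same item appears in several stores
toy = pd.DataFrame({
    "Item_Identifier": ["DRI11", "DRI11", "DRI11", "FDA15", "FDA15"],
    "Item_Weight": [8.0, np.nan, 9.0, 9.3, np.nan],
})

# Mean weight per item, broadcast back to every row (NaN is skipped by mean)
item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Keep the known weights and fill only the missing ones
toy["Item_Weight"] = toy["Item_Weight"].fillna(item_mean)
print(toy)
```

On large frames this vectorized form is usually faster than apply(axis=1), while giving the same imputed values.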

2.2.2. Imputing Outlet_Size missing values with the mode

For this column we apply the same logic. In this case, instead of the default aggfunc (the mean) for pivot_table(), we use the mode.

# Most frequent Outlet_Size for each Outlet_Type
outlet_size_mode = data.pivot_table(values='Outlet_Size', columns='Outlet_Type',
                                    aggfunc=lambda x: x.mode().iloc[0])

def impute_size_mode(cols):
    size = cols['Outlet_Size']
    outlet_type = cols['Outlet_Type']
    if pd.isnull(size):
        return outlet_size_mode.loc['Outlet_Size', outlet_type]
    return size

print('Original #missing: %d' % data['Outlet_Size'].isnull().sum())
data['Outlet_Size'] = data[['Outlet_Size', 'Outlet_Type']].apply(impute_size_mode, axis=1)
print('Final #missing: %d' % data['Outlet_Size'].isnull().sum())

Running the code above gives the following result:
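For readers who want to try mode imputation in isolation, here is a small sketch on made-up rows. It uses groupby() and map() rather than pivot_table(), so it is a variation on the approach above and not the exact code of this project.

```python
import numpy as np
import pandas as pd

# Made-up rows: two outlet types, each with one missing size
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type2", "Supermarket Type2", "Supermarket Type2"],
    "Outlet_Size": ["Small", "Small", np.nan, "Medium", np.nan, "Medium"],
})

# Most frequent size per outlet type; mode() ignores NaN
size_mode = toy.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iloc[0])

# Look up the mode for each row's outlet type and fill only the gaps
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Type"].map(size_mode))
print(toy)
```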

3. Feature Engineering

3.1 Item_Visibility minimum value is 0

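In our data, some products have a visibility of 0, which makes no sense, since every product in a store must be visible to customers. Let's treat these zeros as missing values and impute them with the mean visibility of that product.

# Mean visibility per item (same pattern as the Item_Weight table above)
visibility_item_avg = data.pivot_table(values='Item_Visibility', index='Item_Identifier')

def impute_visibility_mean(cols):
    visibility = cols['Item_Visibility']
    item = cols['Item_Identifier']
    if visibility == 0:
        return visibility_item_avg.loc[item, 'Item_Visibility']
    return visibility

print('Original #zeros: %d' % (data['Item_Visibility'] == 0).sum())
data['Item_Visibility'] = data[['Item_Visibility', 'Item_Identifier']].apply(impute_visibility_mean, axis=1).astype(float)
print('Final #zeros: %d' % (data['Item_Visibility'] == 0).sum())

As a result: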

3.2 Determine the years of operation of a store

We care about how long a store has been operating rather than the year it was established, so we compute it with the following code:

# Remember the data is from 2013
data['Outlet_Years'] = 2013 - data['Outlet_Establishment_Year']

3.3 Create a broad category of Item_Type

Segment Item_Type into 3 categories: “FD” (Food), “DR” (Drinks) and “NC” (Non-Consumables).

#Get the first two characters of ID:
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])
#Rename them to more intuitive categories:
data['Item_Type_Combined'] = data['Item_Type_Combined'].map({'FD':'Food', 'NC':'Non-Consumable', 'DR':'Drinks'})

As a result :

3.4 Modify categories of Item_Fat_Content

As analyzed above, we need to reduce the variable Item_Fat_Content to 2 categories (Low Fat and Regular).

# Change categories of low fat:
print('Original Categories:')
print(data['Item_Fat_Content'].value_counts())
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({'LF': 'Low Fat', 'reg': 'Regular', 'low fat': 'Low Fat'})
print('\nModified Categories:')
print(data['Item_Fat_Content'].value_counts())

We also saw that there are some non-consumable items, and a fat content should not be specified for them. So we can create a separate category for this kind of observation.
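The code for this step is not shown here, so the following is a sketch of one way to do it, on made-up rows. The label "Non-Edible" is taken from the category list mentioned in the one-hot encoding section; the exact line used in the project is an assumption.

```python
import pandas as pd

# Made-up rows with the combined item type and the cleaned fat content
toy = pd.DataFrame({
    "Item_Type_Combined": ["Food", "Non-Consumable", "Drinks", "Non-Consumable"],
    "Item_Fat_Content": ["Regular", "Low Fat", "Low Fat", "Low Fat"],
})

# Non-consumable items get their own fat-content category
is_non_consumable = toy["Item_Type_Combined"] == "Non-Consumable"
toy.loc[is_non_consumable, "Item_Fat_Content"] = "Non-Edible"

print(toy["Item_Fat_Content"].value_counts())
```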

As a result :

4. Feature Transformations [taken from ]

4.1. Categorical Variables — One Hot Encoding

Because scikit-learn only accepts numerical variables, we need to convert all categories of nominal variables into numeric types. Let’s start by turning all categorical variables into numerical values using LabelEncoder() (it encodes labels with values between 0 and n_classes-1). After that, we can use get_dummies to generate dummy variables from these encoded categorical variables.

#Import library:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
#New variable for outlet
data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])
var_mod = ['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Item_Type_Combined','Outlet_Type','Outlet']
for i in var_mod:
    data[i] = le.fit_transform(data[i])

One-hot encoding means creating dummy variables, one for each category of a categorical variable. For example, Item_Fat_Content has 3 categories: Low Fat, Regular and Non-Edible. One-hot encoding removes this variable and generates 3 new variables. Each holds a binary value: 0 (if the category is not present) or 1 (if the category is present). This can be done with the get_dummies function of pandas.

#Dummy Variables:
data = pd.get_dummies(data, columns =['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Outlet_Type','Item_Type_Combined','Outlet'])
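To see what the two steps do to a single column, here is a small sketch on four made-up rows. It assumes scikit-learn and pandas are installed; the dtype=int argument is only there so the dummy columns print as 0/1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Four made-up rows of one categorical column
toy = pd.DataFrame({"Item_Fat_Content": ["Low Fat", "Regular", "Non-Edible", "Low Fat"]})

# Step 1: label encoding maps each category to an integer (sorted order)
le = LabelEncoder()
toy["Item_Fat_Content"] = le.fit_transform(toy["Item_Fat_Content"])
print(dict(zip(le.classes_, range(len(le.classes_)))))

# Step 2: one-hot encoding replaces the column with one 0/1 column per category
dummies = pd.get_dummies(toy, columns=["Item_Fat_Content"], dtype=int)
print(dummies)
```

Each row ends up with exactly one 1 across the new columns, so no artificial ordering between the categories is passed to the model.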

4.2 Exporting Data

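After analyzing and processing the data, we separate it back into train data and test data by running the code below:

# Drop the columns which have been converted to different types
# (assumed here: Item_Type and Outlet_Establishment_Year):
data.drop(['Item_Type', 'Outlet_Establishment_Year'], axis=1, inplace=True)
# Divide into test and train:
train = data.loc[data['source'] == "train"].copy()
test = data.loc[data['source'] == "test"].copy()
# Drop unnecessary columns:
test.drop(['Item_Outlet_Sales', 'source'], axis=1, inplace=True)
train.drop(['source'], axis=1, inplace=True)
# Export files as modified versions:
train.to_csv('data/train_modified.csv', index=False)
test.to_csv('data/test_modified.csv', index=False)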

5. Model Building

train_df = pd.read_csv('data/train_modified.csv')
test_df = pd.read_csv('data/test_modified.csv')

Refer to 

5.1 Linear Regression

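# Baseline: predict the mean sales for every test row
mean_sales = train_df['Item_Outlet_Sales'].mean()
baseline_submission = pd.DataFrame({
    'Item_Identifier': test_df['Item_Identifier'],
    'Outlet_Identifier': test_df['Outlet_Identifier'],
    'Item_Outlet_Sales': mean_sales
})

from sklearn.linear_model import LinearRegression
# normalize=True was removed in scikit-learn 1.2, so the default estimator is used
lr = LinearRegression()
X_train = train_df.drop(['Item_Outlet_Sales','Item_Identifier','Outlet_Identifier'],axis=1)
Y_train = train_df['Item_Outlet_Sales']
X_test = test_df.drop(['Item_Identifier','Outlet_Identifier'],axis=1).copy()
lr.fit(X_train, Y_train)
lr_pred = lr.predict(X_test)

# score() returns R^2, here measured on the training data
lr_accuracy = round(lr.score(X_train,Y_train) * 100,2)
print('Linear regression R^2 on training data (%%): %.4g' % lr_accuracy)
linear_submission = pd.DataFrame({
    'Item_Identifier': test_df['Item_Identifier'],
    'Outlet_Identifier': test_df['Outlet_Identifier'],
    'Item_Outlet_Sales': lr_pred
})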

5.2 Decision Tree

from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor(max_depth=15, min_samples_leaf=100)
tree.fit(X_train, Y_train)
tree_pred = tree.predict(X_test)
tree_accuracy = round(tree.score(X_train,Y_train)*100,2)
print('Decision tree R^2 on training data (%%): %.4g' % tree_accuracy)
tree_submission = pd.DataFrame({
    'Item_Identifier': test_df['Item_Identifier'],
    'Outlet_Identifier': test_df['Outlet_Identifier'],
    'Item_Outlet_Sales': tree_pred
})

5.3 Random Forest

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=400, max_depth=6, min_samples_leaf=100, n_jobs=4)
rf.fit(X_train, Y_train)
rf_pred = rf.predict(X_test)
rf_accuracy = round(rf.score(X_train,Y_train)*100,2)
print('Random forest R^2 on training data (%%): %.4g' % rf_accuracy)

5.4 Result:

Conclusion: the decision tree has the best result, with a score of 61.58 % (R² on the training data).
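The scores in this article are computed on the same data the models were trained on. A common way to get a fairer estimate is cross-validation. The sketch below is not part of the original project: it runs 5-fold cross-validation with scikit-learn on synthetic data standing in for X_train and Y_train, and reports both R² and RMSE.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for X_train / Y_train (made up)
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(1000, 5))
y = 2000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 100, size=1000)

model = DecisionTreeRegressor(max_depth=15, min_samples_leaf=100)

# 5-fold cross-validation: each fold is scored on rows the model did not see
r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
rmse_scores = np.sqrt(-neg_mse)

print("CV R^2:  %.3f +/- %.3f" % (r2_scores.mean(), r2_scores.std()))
print("CV RMSE: %.1f +/- %.1f" % (rmse_scores.mean(), rmse_scores.std()))
```

With the real data, X and y would be replaced by X_train and Y_train, and the same call could be repeated for each of the three models to compare them on held-out folds.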

Source: Deep Learning on Medium