M5 Walmart Sales Forecasting

Original article was published by Keshu on Deep Learning on Medium

In this blog I discuss sales forecasting for Walmart stores in three US states over the next 28 days. The blog is organized as follows:

1. Sales Forecasting & Its Importance

2. Machine Learning Point of view of Problem & Structure of data

3. Traditional Approaches

4. Machine Learning Pipeline (Preprocessing, EDA, Feature Engineering, Modeling)

5. Comparison of ML Models Used

6. Future Work

7. References

Sales forecasting is an important area of business management in which we build a system to estimate future sales volumes. It helps a business avoid panic sales by manufacturing products in line with expected customer demand, which in turn helps maximize profit. Because so many decisions depend on expected demand, a good forecast informs virtually every aspect of running the business. It is similar to weather forecasting in that both rely on science and historical data. While a wrong weather forecast may leave you carrying an umbrella on a sunny day, an inaccurate business forecast can result in actual or opportunity losses.

Why Sales Forecasting?

In simple terms, sales forecasting is just predicting the future sales volume of a product, but in the real world it matters much more than that.

Some of the reasons, as discussed in [1], are:

  1. Help Sales Representatives Meet Their Targets: In any business, sales representatives need to make several decisions to achieve their sales targets. Sales forecasting helps them make those decisions based on predicted future sales.
  2. Improve and Speed Up Product Delivery: In most cases, customers shopping for a product look at how long a particular company will take to deliver it. Knowing the forecast, a company can manufacture goods ahead of demand so that the product is ready to ship as soon as a customer orders it.

Sales forecasting has several more benefits, but we will now focus on the main problem.

Machine Learning Problem

In this blog we discuss the Kaggle competition named M5 Forecasting - Accuracy. The competition provides sales data for Walmart stores in 3 states (California, Texas, Wisconsin) for 3 categories of items (HOBBIES, FOODS, HOUSEHOLD) from 2011 to 2016. We want to use this data to predict sales for the next 28 days using several ML techniques.


The sales data from 2011 to 2016 is provided in the form of 4 data frames, namely:

  1. calendar.csv: contains information about the dates on which the products are sold.
  2. sales_train_validation.csv: contains the historical daily unit sales per product and store [d_1-d_1913] (this is what we have used for training in our case).
  3. sell_prices.csv: contains information about the price of the products sold per store and date.
  4. sales_train_evaluation.csv: includes sales [d_1-d_1941] (we will use sales from d_1914-d_1941 from this as test).

We will also predict sales for d_1942-d_1969 for the private score on Kaggle.
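The train/test split described above can be sketched with pandas. This is only a minimal illustration, assuming the sales files are in wide format with one d_N column per day; the helper name split_day_columns and the tiny demo frame are made up for this example and are not part of the competition code.

```python
import pandas as pd

def split_day_columns(sales, train_end=1913, test_end=1941):
    """Split a wide sales frame into (id columns + train days, id columns + test days)."""
    day_cols = [c for c in sales.columns if c.startswith("d_")]
    id_cols = [c for c in sales.columns if not c.startswith("d_")]
    # The day number is the integer after the "d_" prefix.
    train_days = [c for c in day_cols if int(c[2:]) <= train_end]
    test_days = [c for c in day_cols if train_end < int(c[2:]) <= test_end]
    return sales[id_cols + train_days], sales[id_cols + test_days]

# Tiny synthetic frame standing in for sales_train_evaluation.csv.
demo = pd.DataFrame({
    "id": ["A", "B"],
    "d_1": [0, 1], "d_2": [2, 0], "d_3": [1, 1], "d_4": [3, 2],
})
train, test = split_day_columns(demo, train_end=2, test_end=4)
print(list(train.columns))
print(list(test.columns))
```

With the real file, the defaults (train_end=1913, test_end=1941) would give d_1-d_1913 for training and d_1914-d_1941 for testing.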


Structure of Data

The structure of the data is shown in the image on the left. There are 3 states: California, Texas, and Wisconsin. California has 4 stores, Texas has 3 stores, and Wisconsin has 3 stores. Each store carries 3 categories of items: Hobbies, Foods, and Household. The Foods category has 3 departments, while the Hobbies and Household categories have 2 departments each.
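To make the hierarchy concrete, here is a small sketch that writes the counts from the paragraph above as plain Python dictionaries and derives the totals. The variable names are chosen for this example only.

```python
# Store counts per state and department counts per category, as described above.
STORES_PER_STATE = {"California": 4, "Texas": 3, "Wisconsin": 3}
DEPTS_PER_CATEGORY = {"Foods": 3, "Hobbies": 2, "Household": 2}

n_states = len(STORES_PER_STATE)
n_stores = sum(STORES_PER_STATE.values())
n_categories = len(DEPTS_PER_CATEGORY)
n_departments = sum(DEPTS_PER_CATEGORY.values())

# Every store carries every department, so the store x department pairs multiply.
n_store_dept = n_stores * n_departments

print(f"{n_states} states, {n_stores} stores")
print(f"{n_categories} categories, {n_departments} departments")
print(f"{n_store_dept} store x department combinations")
```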


We use the Weighted Root Mean Squared Scaled Error (WRMSSE) to evaluate model performance. To compute WRMSSE, 42,840 time series are constructed from this data. More details on WRMSSE are given in the notebooks here.
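As a rough sketch of the metric, the code below computes RMSSE for a single series and a weighted sum over several series with NumPy. It follows the definition in the competition guide as I understand it: the squared forecast error is scaled by the mean squared one-step difference of the training series, counted from the first non-zero sale. The competition derives the weights from recent dollar sales; here they are simply passed in, and the function names are my own. The per-level item counts (3,049 items) come from the competition guide, not from this article.

```python
import numpy as np

# Number of series at each of the 12 aggregation levels (from the M5 guide).
LEVEL_SERIES = {
    "total": 1, "state": 3, "store": 10, "category": 3, "department": 7,
    "state x category": 9, "state x department": 21,
    "store x category": 30, "store x department": 70,
    "item": 3049, "item x state": 9147, "item x store": 30490,
}

def rmsse(train, actual, forecast):
    """Root Mean Squared Scaled Error for one series.

    Assumes the trimmed training series has at least two points and is not constant.
    """
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Drop leading zeros: the series is evaluated from its first non-zero sale.
    nonzero = np.flatnonzero(train)
    if nonzero.size:
        train = train[nonzero[0]:]
    scale = np.mean(np.diff(train) ** 2)      # in-sample naive one-step error
    mse = np.mean((actual - forecast) ** 2)   # out-of-sample forecast error
    return float(np.sqrt(mse / scale))

def wrmsse(trains, actuals, forecasts, weights):
    """Weighted sum of per-series RMSSE; weights are normalized to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    scores = [rmsse(t, a, f) for t, a, f in zip(trains, actuals, forecasts)]
    return float(np.dot(weights, scores))

print(sum(LEVEL_SERIES.values()))  # total number of series across all levels

score = wrmsse(
    trains=[[0, 0, 2, 4, 2, 4], [1, 2, 3, 4]],
    actuals=[[3, 3], [5, 6]],
    forecasts=[[2, 4], [5, 6]],
    weights=[3, 1],
)
print(score)
```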

Traditional Solutions

There are several existing statistical solutions for sales forecasting problems, such as ARIMA, moving averages, and exponential smoothing. To know more about them click here. The problem with these methods is that they don't take categorical variables into account and tend to work well for short-horizon predictions only. So in this blog we will discuss several machine learning models and see how they perform on this problem.
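For reference, two of these traditional methods can be sketched in a few lines of plain Python. This is a simplified illustration, not what a statistics library would do: the function names are my own, and the smoothing factor alpha is picked by hand rather than fitted.

```python
def moving_average_forecast(history, window, horizon):
    """Forecast every future day as the mean of the last `window` observations."""
    avg = sum(history[-window:]) / window
    return [avg] * horizon

def simple_exp_smoothing_forecast(history, alpha, horizon):
    """Simple exponential smoothing: recent observations get more weight."""
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

history = [2, 4, 6, 8]
ma = moving_average_forecast(history, window=2, horizon=3)
ses = simple_exp_smoothing_forecast(history, alpha=0.5, horizon=3)
print(ma)
print(ses)
```

Both functions see only the sales history of one series and return the same value for every future day, which illustrates the limitation mentioned above: there is no place for categorical information, and the forecast carries little signal beyond the first few days.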


Basic Pipeline Used

The figure gives a basic overview of the pipeline followed while working on this Kaggle competition.

Note: While constructing time-series features, we first split the training data into train and CV sets and only then calculated the features.

First Cut Approach: Here I have tried to focus on a direct modeling strategy rather than recursive modeling. The reason is that in recursive modeling, where each prediction is fed back as an input for the next one, even a small error in the early predictions may lead to a large error in the later predictions.
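One common way to set up direct modeling is to build every time-series feature from values that are at least 28 days old, so all 28 future days can be predicted without feeding predictions back in. The sketch below shows the idea with pandas. It assumes a long-format frame with columns id, d (integer day), and sales; those column names, the helper add_direct_features, and the choice of features are illustrative and not necessarily the exact features used in this project.

```python
import pandas as pd

HORIZON = 28  # forecast horizon in days

def add_direct_features(df, horizon=HORIZON, window=7):
    """Add lag and rolling-mean features that only use data >= `horizon` days old."""
    df = df.sort_values(["id", "d"]).copy()
    grouped = df.groupby("id")["sales"]
    # Sales exactly `horizon` days before the row's day.
    df[f"lag_{horizon}"] = grouped.shift(horizon)
    # Rolling mean over the `window` days ending `horizon` days before the row's day.
    df[f"rolling_mean_{window}"] = grouped.transform(
        lambda s: s.shift(horizon).rolling(window).mean()
    )
    return df

# Small demo with a shortened horizon so the result is easy to check by hand.
demo = pd.DataFrame({"id": ["A"] * 5, "d": [1, 2, 3, 4, 5], "sales": [1, 2, 3, 4, 5]})
out = add_direct_features(demo, horizon=2, window=2)
print(out)
```

Because no feature looks closer than the horizon, an error on day 1 of the forecast cannot propagate into day 28, which is the main motivation for the direct strategy.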



Data cleaning is an important step in every modeling strategy. The presence of incorrect data may cause our model to work incorrectly. There are various reasons for incorrect data, such as computation errors and human errors. Here, event_name_1, event_name_2, event_type_1, and event_type_2 contain nan values, which we have replaced with no_events. We are also trying to reduce the memory usage of all categorical data such as item_id, cat_id, store_id, dept_id, year, event_name_1, event_name_2, event_type_1, and event_type_2. Here I have shown this only for 2 features, item_id and dept_id.
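A minimal sketch of these two cleaning steps with pandas is given below. The helper names and the synthetic demo frames are made up for illustration, and converting to the pandas category dtype is one reasonable way to cut memory, not necessarily the exact approach used in the original notebook.

```python
import pandas as pd

EVENT_COLS = ["event_name_1", "event_name_2", "event_type_1", "event_type_2"]

def clean_events(calendar):
    """Replace missing event names and types with the placeholder 'no_events'."""
    calendar = calendar.copy()
    calendar[EVENT_COLS] = calendar[EVENT_COLS].fillna("no_events")
    return calendar

def to_category(df, cols):
    """Store repeated strings as the category dtype to reduce memory usage."""
    df = df.copy()
    for col in cols:
        df[col] = df[col].astype("category")
    return df

# Synthetic stand-ins for the calendar and sales frames.
calendar_demo = pd.DataFrame({
    "event_name_1": ["SuperBowl", None, None],
    "event_name_2": [None, "Easter", None],
    "event_type_1": ["Sporting", None, None],
    "event_type_2": [None, "Cultural", None],
})
sales_demo = pd.DataFrame({
    "item_id": ["HOBBIES_1_001", "FOODS_3_090"] * 500,
    "dept_id": ["HOBBIES_1", "FOODS_3"] * 500,
})

calendar_clean = clean_events(calendar_demo)
sales_small = to_category(sales_demo, ["item_id", "dept_id"])

before = sales_demo.memory_usage(deep=True).sum()
after = sales_small.memory_usage(deep=True).sum()
print(f"memory before: {before} bytes, after: {after} bytes")
```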

Similarly, we have applied label encoders to all other categorical features present in the data. We are also combining snap_ca, snap_tx, and snap_wi into a single column named snap.
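Both steps can be sketched with pandas as follows. Two assumptions are worth flagging: the frame is assumed to be already merged so that each row has a state_id (CA, TX, or WI) alongside all three snap columns, and pandas category codes are used here as a stand-in for scikit-learn's LabelEncoder (both map the sorted unique values to 0..n-1). The column names follow the text above; in the raw calendar file the columns are spelled snap_CA, snap_TX, and snap_WI, so adjust the mapping if you have not renamed them.

```python
import pandas as pd

# state_id value -> name of that state's SNAP column (names as used in the text).
SNAP_COLS = {"CA": "snap_ca", "TX": "snap_tx", "WI": "snap_wi"}

def combine_snap(df, snap_cols=SNAP_COLS):
    """Keep, for each row, only the SNAP flag of the row's own state."""
    df = df.copy()
    df["snap"] = 0
    for state, col in snap_cols.items():
        mask = df["state_id"] == state
        df.loc[mask, "snap"] = df.loc[mask, col]
    return df.drop(columns=list(snap_cols.values()))

def label_encode(df, cols):
    """Integer-encode string columns via category codes (sorted unique values)."""
    df = df.copy()
    for col in cols:
        df[col] = df[col].astype("category").cat.codes
    return df

demo = pd.DataFrame({
    "state_id": ["CA", "TX", "WI", "CA"],
    "store_id": ["CA_1", "TX_1", "WI_1", "CA_1"],
    "snap_ca": [1, 1, 0, 0],
    "snap_tx": [0, 1, 0, 1],
    "snap_wi": [0, 0, 1, 1],
})

# Combine the SNAP flags first, while state_id is still a string.
combined = combine_snap(demo)
encoded = label_encode(combined, ["state_id", "store_id"])
print(encoded)
```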


Exploratory Data Analysis (EDA) is a crucial part of any analysis. It gives us an overview of the data and helps us find patterns in it. We have performed EDA on this Walmart sales data using two main Python libraries, Seaborn and Matplotlib. We have plotted several plots, as shown below:

Sales Variation According to Categories of Items

In the plot shown here we have plotted the average sales of products according to their categories. From this plot we found that average sales for FOODS category items are higher than for the other two, and HOBBIES category items have the lowest average sales of all three. In fact, the average sales of the FOODS category are much higher than those of the other two.
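A plot like this can be produced roughly as follows. This is a sketch on a tiny synthetic frame, assuming the wide sales layout with a cat_id column and one d_N column per day; the numbers in the demo are invented and only mimic the ordering described above. The plotting function imports Seaborn and Matplotlib lazily so the aggregation can be run on its own.

```python
import pandas as pd

def average_sales_by_category(sales):
    """Mean daily unit sales per item, averaged within each category."""
    day_cols = [c for c in sales.columns if c.startswith("d_")]
    per_item_mean = sales[day_cols].mean(axis=1)
    return per_item_mean.groupby(sales["cat_id"]).mean().sort_values(ascending=False)

def plot_average_sales(avg):
    """Bar plot of the averages (requires seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    ax = sns.barplot(x=list(avg.index), y=list(avg.values))
    ax.set_xlabel("Category")
    ax.set_ylabel("Average daily sales per item")
    ax.set_title("Sales Variation According to Categories of Items")
    plt.show()

# Synthetic demo data; real values come from sales_train_validation.csv.
demo = pd.DataFrame({
    "cat_id": ["FOODS", "HOBBIES", "HOUSEHOLD"],
    "d_1": [4, 0, 1],
    "d_2": [6, 1, 2],
})
avg = average_sales_by_category(demo)
print(avg)
# plot_average_sales(avg)  # uncomment to draw the bar plot
```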