Original article was published by Kenillshah on Deep Learning on Medium
Restaurant Visitors Forecasting Challenge: A Machine Learning Case Study
This is my first Medium story. I hope you enjoy reading it as much as I enjoyed writing it.
Operating a successful restaurant requires owners and managers not only to manage day-to-day operations, but also to find ways to reduce costs and grow future sales.
Forecasting based on historical data can provide insight into your two largest costs, food and labor, and help you decide where to put your resources.
Forecasting the number of visitors is a meaningful task in the service industry. Several factors help predict a restaurant's future customers, such as location, quality of service, cost and staffing. Today I am going to take you through a real-world data science problem that I picked from a live Kaggle competition and show how I solved it. This case study works through everything from scratch, and I explain my approach along the way.
It is hard to know in advance how many visitors will come to a restaurant on a given day, and therefore hard to estimate the day's revenue. This gives us a real-world problem: forecast the number of visitors per day, so that restaurant owners know what to expect for the next day. Recruit Holdings has unique access to datasets that could make automated prediction of future customers possible.
Challenges to solve:
Given details about a restaurant such as its ID, name, location and area, can you build an algorithm that automatically predicts the number of visitors? Challenging, right?
But if solved correctly, it removes the need for manual visitor forecasts. That helps restaurant owners make better decisions about future customers, and helps staff know how much food to prepare each day.
Mapping the real world problem to a Machine Learning Problem:
Type of Machine Learning Problem:
For a given restaurant ID we need to predict the total number of visitors based on features such as location, area and genre name. This is a regression problem, because the output is a visitor count.
Use of Machine Learning/Deep Learning:
Here we use reservation and visit data to predict the total number of future visitors to a restaurant. We preprocess the data, engineer features from it, and feed the result into machine learning and deep learning models.
Source Of Data:
Here the data comes from two separate sites:
1) Hot Pepper Gourmet (hpg): users can search for restaurants (identified in the data by hpg_store_id) and make a reservation online.
2) AIRREGI (air): a reservation control and cash register system.
We use the reservations, visits and other information from these sites to forecast the total number of restaurant visitors on a given date. The training data covers the dates from 2016 until April 2017. The test set covers the last week of April and May 2017.
There are days in the test set where the restaurants were closed and had no visitors. These are ignored in the scoring.
The data for this problem is available at: https://www.kaggle.com/c/recruit-restaurant-visitor-forecasting/data
This is a relational dataset from two systems. Each file is prefixed with its source: air_ for the air system and hpg_ for the hpg system. Each restaurant has a unique air_store_id and hpg_store_id.
air_reserve.csv: This file contains the reservations made in the air system. Here reserve_datetime indicates when the reservation was created, and visit_datetime indicates the time in the future when the visit will occur.
o air_store_id: the restaurant's id in the air system.
o visit_datetime: the time of the reservation.
o reserve_datetime: the time when the reservation was made.
o reserve_visitors: the number of visitors for that reservation.
hpg_reserve.csv: This file contains the reservations made in the hpg system.
o hpg_store_id: the restaurant's id in the hpg system.
o visit_datetime: the time of the reservation.
o reserve_datetime: the time when the reservation was made.
o reserve_visitors: the number of visitors for that reservation.
air_store_info.csv: This file contains information about select air restaurants.
hpg_store_info.csv: This file contains information about select hpg restaurants.
store_id_relation.csv: This file allows you to join select restaurants that have both air and hpg systems.
air_visit_data.csv: This file contains historical visit data for the air restaurants.
sample_submission.csv: This file shows a submission in the correct format, including the days for which you must forecast.
o id: formed by joining the air_store_id and the visit date.
o visitors: the number of visitors forecasted for the store and date combination.
date_info.csv: This file gives basic information about the calendar dates in the dataset.
o holiday_flg: whether the day is a holiday in Japan.
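The two id conventions described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names, the made-up store ids and the assumption that the submission id has the form `<air_store_id>_<YYYY-MM-DD>` are mine, and in practice the same join would more likely be done with a DataFrame merge.

```python
from datetime import date
from typing import Dict, List, Tuple


def split_submission_id(submission_id: str) -> Tuple[str, date]:
    """Split an id of the assumed form '<air_store_id>_<YYYY-MM-DD>'."""
    # The store id itself contains an underscore ('air_...'),
    # so split only once, starting from the right.
    store_id, visit_date = submission_id.rsplit("_", 1)
    return store_id, date.fromisoformat(visit_date)


def attach_air_ids(hpg_rows: List[dict], relation: Dict[str, str]) -> List[dict]:
    """Keep hpg reservations whose store also exists in the air system.

    `relation` maps hpg_store_id -> air_store_id, as in store_id_relation.csv.
    """
    joined = []
    for row in hpg_rows:
        air_id = relation.get(row["hpg_store_id"])
        if air_id is not None:
            joined.append({**row, "air_store_id": air_id})
    return joined


# Made-up ids and values, for illustration only.
relation = {"hpg_aaa": "air_111"}
hpg_rows = [
    {"hpg_store_id": "hpg_aaa", "reserve_visitors": 4},
    {"hpg_store_id": "hpg_bbb", "reserve_visitors": 2},  # no air counterpart
]

joined = attach_air_ids(hpg_rows, relation)
store_id, visit_date = split_submission_id("air_111_2017-04-23")
```

Only the first hpg row survives the join, because the second store has no entry in the relation table.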
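The evaluation metric is Root Mean Squared Logarithmic Error (RMSLE), which is calculated as:
$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$$
where:
n is the total number of observations.
p_i is the predicted number of visitors.
a_i is the actual number of visitors.
log(x) is the natural logarithm of x.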
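A minimal sketch of the metric in Python, assuming the standard RMSLE form in which 1 is added to both p_i and a_i before taking the natural logarithm (so that zero visitors is handled). The function name `rmsle` is my own choice.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("predicted and actual must be non-empty and equally long")
    # log1p(x) computes log(1 + x) using the natural logarithm.
    squared_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


perfect = rmsle([10, 20, 30], [10, 20, 30])  # identical forecasts give 0.0
score = rmsle([0, 9], [0, 99])               # one exact, one off by a factor of 10
```

Because the error is taken on the log scale, the metric penalizes relative rather than absolute differences: predicting 9 instead of 99 costs as much as predicting 99 instead of 999.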
Exploratory Data Analysis:
1) The series starts in January 2016, and the number of visitors per visit date fluctuates until July, when it rises steeply from roughly 4,500 to above 15,000 visitors.
2) After July it fluctuates until November, and at the end of December the number of visitors drops to about a third of its previous level.
3) After January 2017 there is another steep increase, followed by further fluctuation.
1) The first four weeks show a small increase in visitors by weekday compared with the fifth week, and the last week shows a decreasing trend.
2) Visitors increase from the start to the end of the week, with slight rises and falls between individual weekdays.
1) March has the highest number of visitors of all months, and May and June contribute the fewest.
2) The number of visitors changes only slightly from month to month, except in March and May.
1) The trend fluctuates from the first to the last day of the month.
2) Visitors are highest on the 23rd and lowest on the last day of the month; the other days fluctuate.
1) Mean visitors increase over the first four weeks, and in the last week there is a steep decrease.
2) The highest mean visitors are in the last week and the lowest are in the first and third weeks.
1) December has the highest mean visitors. The other months increase gradually by small amounts, with a small decrease in August.
2) Overall, the number of visitors going to restaurants fluctuates only slightly.
1) Reservations start after 5 January 2016 and then fluctuate until the end of August 2016.
2) From September 2016 the contribution of visitors stays at the same level for a while, then fluctuates, and towards the end of June 2017 it decreases consistently.
1) Reserved visitors per hour are low in the early morning, rise steeply in the afternoon until 8 PM, and decrease steeply after 11 PM.
2) In other words, few reservations fall in the early hours, and they pick up from the afternoon onward.
1) Some reservations are made a very long time before the visit, and these gaps are the most extreme values in the air data. The gaps follow a 24-hour format, from less than 24 hours ahead to the following day's 24 hours.
2) This plot shows the difference between the reservation time and the visit time, with the number of visitors for each hour of difference. There are breaks between the reservation and the visit.
1) Visitors peak for gaps of 1 to 45 days. Beyond about 50 days the number of visitors stays at a similar level, up to gaps of 400 days.
2) As the number of days between reservation and visit increases, the number of visitors drops, and after a certain point it stays consistently at a similar level.
1) On every day from Monday to Sunday, visitors trend upward from the early hours until 6 PM, and then trend downward until 11 PM.
2) This plot is based on the weekday name of the reservation. It shows how many visitors go to a restaurant at each hour of the day, so the hours can be compared across weekdays.
- More visitors made reservations in 2016 than in 2017.
1) Reservations start after 5 January 2016 and then fluctuate until the end of 2016.
2) There is an upward trend in the first months of 2017 that lasts until May, followed by a steep decrease.
- No reservations are made until 10 AM, and only a small number are made from then until 5 PM. The highest number of reservations of any hour is made between 6 PM and 7 PM, and at night reservations trend downward.
1) Gaps between a reservation and a visit recur very frequently at regular intervals, and the longest gaps are the most extreme values in the hpg reservation data. The gaps follow a 24-hour format, from 24 hours ahead to the following day's 24 hours.
2) This plot shows the difference between the reservation time and the visit time, with the number of visitors for each hour of difference. There are breaks between the visits and the reservations.
1) Visitors peak for the shortest gaps. Beyond about 50 days the number of visitors stays at a similar level, up to gaps of 350 days.
2) As the number of days between reservation and visit increases, the number of visitors drops, and after a certain point it stays consistently at a similar level.
1) On every day from Monday to Sunday, visitors follow a similar trend in the early hours until 7 AM, then fluctuate until 6 PM, and after that trend downward until 11 PM.
2) On Monday, Wednesday and Friday there is a high peak of visitors between 12:30 PM and 3:30 PM.
- More visitors made reservations in 2016 than in 2017.
1) Izakaya has the highest number of restaurants, followed by Cafe/Sweets and Dining bar, while International cuisine has the lowest number of visitors.
2) This analysis shows the preference for specific types of restaurant through the number of branches in a particular area.
1) Japanese style has the highest number of restaurants, followed by International cuisine and Creation.
2) A number of genres have few or no restaurants.
1) There are more visitors on holidays than on non-holidays.
6th Place Kaggle Solution (Team: Yunfeng and Ankit):
1) In addition to the dataset given in the competition, external weather data for the Recruit Restaurant competition is used.
2) From the calendar information, a feature called hour gap is used. It gives the gap in hours between reserving a restaurant and visiting it, and is further divided into 5 categories based on gap length.
3) The average, median, maximum and minimum visitors per restaurant are computed separately for working and non-working days.
4) The total count of restaurants per area is also calculated.
5) From the weather information, temperature and precipitation are used, with temperature divided into low, average and high.
6) The mean visitor count per weekday, for all 7 days and all restaurants, is also calculated.
7) The monthly visitor count for all 12 months is also calculated for each restaurant.
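Two of the features listed above, the categorized hour gap and the per-restaurant visitor statistics, can be sketched as follows. This is not the team's code: the bucket edges in `GAP_EDGES_HOURS`, the function names and the sample values are all hypothetical, since the write-up only says that the gap is split into 5 categories. The datetime format is assumed to be `YYYY-MM-DD HH:MM:SS`.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

# Hypothetical bucket edges in hours (4 edges give 5 categories).
GAP_EDGES_HOURS = [12, 24, 72, 168]
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format of the datetime columns


def hour_gap(reserve_datetime: str, visit_datetime: str) -> float:
    """Hours between making the reservation and the visit."""
    reserved = datetime.strptime(reserve_datetime, DATETIME_FORMAT)
    visited = datetime.strptime(visit_datetime, DATETIME_FORMAT)
    return (visited - reserved).total_seconds() / 3600


def gap_category(gap_hours: float) -> int:
    """Map a gap in hours to a category index from 0 (shortest) to 4 (longest)."""
    return bisect_right(GAP_EDGES_HOURS, gap_hours)


def visitor_stats(visits):
    """Mean, median, max and min visitors per (store, non-working flag).

    `visits` is an iterable of (air_store_id, is_non_working, visitors).
    """
    grouped = defaultdict(list)
    for store_id, is_non_working, visitors in visits:
        grouped[(store_id, is_non_working)].append(visitors)
    return {
        key: {"mean": mean(v), "median": median(v), "max": max(v), "min": min(v)}
        for key, v in grouped.items()
    }


# Made-up values, for illustration only.
gap = hour_gap("2016-01-01 16:00:00", "2016-01-02 19:00:00")
category = gap_category(gap)
stats = visitor_stats([("air_111", 0, 10), ("air_111", 0, 20), ("air_111", 1, 35)])
```

Keeping working and non-working days in separate groups means a busy holiday does not inflate the typical weekday level for the same restaurant.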
First Cut Approach:
1) First, import the required libraries and the helper functions used to deal with this problem.
2) Load the data (air visits, air reservations, hpg reservations, etc.) and compute basic statistics such as the number of null values, the shape of the data, the mean and the median. Clean the data by removing duplicated columns, and preprocess the text columns.
3) Perform EDA on all the features.
3.1) Analyze the number of visits to the air restaurants: plot the number of visitors per day over the full training time range, together with the median visitors by day of the week and by month of the year. This shows which days of the week have more visitors than others, and the same is done for months and years.
3.2) Analyze, based on the reservations, how many visitors go to a particular restaurant on a given day, along with the hour of the visit and the time between the reservation and the visit.
3.3) Analyze how many reservations are made during each month, and compare the monthly reservations with the yearly reservations in the air system.
3.4) Analyze the number of visits to the hpg restaurants: plot the number of visitors per day over the full training time range, together with the median visitors by day of the week and by month of the year. This shows which days of the week have more visitors than others, and the same is done for months and years.
3.5) Analyze by area/region how many restaurants there are (in the air and hpg systems), and how many are vegetarian and how many are non-vegetarian.
3.6) Compare the average number of visitors, based on reservations, in multi-cuisine restaurants with that in single-cuisine restaurants.
3.7) Visualize the area each restaurant belongs to based on its latitude and longitude, and the number of cuisines (genres) in each area.
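The aggregation behind step 3.1 can be sketched as below: derive the weekday, month and year from each visit date, then take the median visitors per group. This is a standard-library illustration with made-up rows and hypothetical function names; the same result would normally come from a DataFrame groupby.

```python
from collections import defaultdict
from datetime import date
from statistics import median

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def date_features(visit_date: str) -> dict:
    """Derive weekday name, month and year from a 'YYYY-MM-DD' string."""
    d = date.fromisoformat(visit_date)
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return {"weekday": WEEKDAYS[d.weekday()], "month": d.month, "year": d.year}


def median_visitors_by(rows, key: str) -> dict:
    """Median visitors grouped by one derived feature.

    `rows` is an iterable of (visit_date, visitors);
    `key` is 'weekday', 'month' or 'year'.
    """
    grouped = defaultdict(list)
    for visit_date, visitors in rows:
        grouped[date_features(visit_date)[key]].append(visitors)
    return {group: median(values) for group, values in grouped.items()}


# Made-up rows, for illustration only.
rows = [
    ("2016-01-01", 20),  # Friday
    ("2016-01-08", 30),  # Friday
    ("2016-01-02", 40),  # Saturday
    ("2016-02-06", 10),  # Saturday
]
by_weekday = median_visitors_by(rows, "weekday")
by_month = median_visitors_by(rows, "month")
```

The median is used here, as in the plan above, because a few very busy days would pull a mean upward.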
· Hyperparameter tuning gave max_depth=5 and n_estimators=3000.
· RandomForestRegressor model training and prediction:
· The RMSLE we got for RandomForestRegressor is 10.59.
· Linear Regression:
· Hyperparameter tuning gave a best alpha value of 0.01.
· Linear Regression model training and prediction: