Building a Recommendation System — Evaluating Similarity Based on Correlation

Source: Deep Learning on Medium

As I mentioned in my previous post, I am sharing a step-by-step tutorial on developing a recommendation system with Python, following Machine Learning and AI from Lynda. In this tutorial, we use Pearson’s R correlation to recommend the items that are most similar to a given item, based on how users rated them.

If you haven’t already, please install Jupyter Notebook, as this will be your playground.

Start by running this command in your terminal.

$ jupyter notebook

For this tutorial, please download these csv files. We are using datasets of places to eat and restaurant goers. The recommendations are based on the similarity of user reviews.

The first step is to import pandas and NumPy.

import numpy as np
import pandas as pd

Next, read the data, in this case the csv files, so the program can access it. The datasets come from Mexico and are hosted at the University of California Irvine Machine Learning site. They were published by Blanca Vargas et al.

## reading csv file using read_csv function
frame = pd.read_csv("rating_final.csv")
cuisine = pd.read_csv("chefmozcuisine.csv")
geodata = pd.read_csv("geoplaces2.csv")

The file structure in Jupyter Notebook would look like this

To check that the csv files were read correctly, you can run these lines in the notebook

frame.head()
geodata.head()

If you only want to look at particular columns, you can select them like this. In this case, we only need the place ID and restaurant name.

places = geodata[['placeID', 'name']]

Grouping and Ranking Data

We now have the place ID and restaurant name. Next, we look at the rating column and compute the mean rating for each place.

## mean rating of each place
rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())
rating.head()

Now we add a column called rating_count, which holds the number of reviews each place received.

rating['rating_count'] = frame.groupby('placeID')['rating'].count()
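
The mean and the count can also be computed in a single groupby pass with agg. Here is a minimal sketch on a toy frame; the IDs and ratings are invented for illustration and do not come from rating_final.csv.

```python
import pandas as pd

# Toy ratings, invented for illustration only
toy_frame = pd.DataFrame({
    "userID": ["U1", "U2", "U3", "U1", "U2"],
    "placeID": [1, 1, 1, 2, 2],
    "rating": [2, 1, 0, 2, 2],
})

# One groupby, two named aggregations: mean rating and number of reviews
toy_rating = toy_frame.groupby("placeID")["rating"].agg(
    rating="mean", rating_count="count"
)
print(toy_rating)
```

The result has one row per place, with the same two columns as the rating data frame built above.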

Now let’s look at the summary statistics of this rating data frame by calling .describe()

rating.describe()

From this output, we can see that 130 unique places have been reviewed in the rating data frame. The rating_count column has a maximum of 36, which means the most-reviewed place received 36 reviews. Next, we find out which restaurant that is.

## Sort by review count to see which place has the most reviews
rating.sort_values('rating_count', ascending=False).head()
## The place with ID 135085 has the most reviews, so we look up which restaurant it is
places[places['placeID'] == 135085]

The place with ID 135085 turned out to be Tortas locas Hipocampo.

Next, we will check what kind of cuisine Tortas locas Hipocampo serves.

## Check what kind of cuisine the restaurant with ID 135085 serves
cuisine[cuisine['placeID'] == 135085]

Preparing for Data Analysis

In this section, we call the pivot_table function. It cross-tabulates each user against each place and outputs a matrix of ratings.

places_crosstab = pd.pivot_table(data=frame, values='rating', index='userID', columns='placeID')

You will notice that the cross tab is mostly filled with null values. This is because each user has reviewed only a few of the places. Next, we will see how we can use this matrix to find places that are correlated.
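
To make the shape of the cross tab concrete, here is a small sketch on invented data (the user IDs, place IDs, and ratings are made up). It shows where the null values come from and how to measure the share of empty cells.

```python
import pandas as pd

# Toy ratings, invented for illustration only
toy_frame = pd.DataFrame({
    "userID": ["U1", "U1", "U2", "U3"],
    "placeID": [1, 2, 1, 2],
    "rating": [2, 1, 0, 2],
})

# Rows are users, columns are places, cells are ratings
toy_crosstab = pd.pivot_table(
    data=toy_frame, values="rating", index="userID", columns="placeID"
)
print(toy_crosstab)

# U2 never rated place 2 and U3 never rated place 1, so those cells are NaN
missing_share = toy_crosstab.isna().to_numpy().mean()
print(missing_share)
```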

## Select the ratings column for the Tortas restaurant (place 135085)
Tortas_ratings = places_crosstab[135085]
## Show only the ratings that are not null
Tortas_ratings.dropna()

Evaluating Similarity Based on Correlation

In this section, we aim to find the correlation between each place and the Tortas restaurant. We do this by calling the corrwith method of our places cross tab and passing it the Tortas ratings series. This generates a Pearson R correlation coefficient between Tortas and every other place that has been reviewed in the dataset.
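
For intuition about what the coefficient measures for one pair of places, here is a sketch that computes Pearson's r by hand. The helper pearson_r and the ratings are hypothetical and exist only for illustration; in the tutorial itself pandas does this work.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side never varies
    return cov / sqrt(var_x * var_y)

# Made-up ratings from four users who rated both places
place_a = [2, 1, 0, 2]
place_b = [2, 1, 1, 2]
print(pearson_r(place_a, place_b))
```

A value near 1 means users who rated one place highly tended to rate the other highly as well.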

## Correlate every place's ratings column with the Tortas ratings
similar_to_Tortas = places_crosstab.corrwith(Tortas_ratings)
## Call the DataFrame constructor, pass in similar_to_Tortas, and name the new column 'PearsonR'
corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])
## We don't want to see the null values, so call the dropna method with inplace=True
corr_Tortas.dropna(inplace=True)
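
The same steps can be tried on a tiny cross tab to see what corrwith returns. The matrix below is invented for illustration. Place 30 shares only one reviewer with place 10, so its coefficient is undefined and comes back as null.

```python
import numpy as np
import pandas as pd

# Toy user x place matrix, invented for illustration; NaN means "no review"
toy_crosstab = pd.DataFrame(
    {
        10: [2, 1, 0, 2, np.nan],
        20: [2, 1, 1, 2, 0],
        30: [np.nan, np.nan, np.nan, 1, np.nan],
    },
    index=["U1", "U2", "U3", "U4", "U5"],
)

target_ratings = toy_crosstab[10]

# Correlate every column with the target column, using only shared reviewers
toy_similar = toy_crosstab.corrwith(target_ratings)

# Wrap the result in a data frame and drop places with no defined correlation
toy_corr = pd.DataFrame(toy_similar, columns=["PearsonR"])
toy_corr.dropna(inplace=True)
print(toy_corr)
```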

Let’s make some sense of this. If we find a restaurant that seems to be correlated with Tortas but has only, say, two ratings in total, it might not be similar to Tortas at all. With so little data, the correlation would not be significant.

So let’s create a filter that shows only the restaurants with at least 10 user reviews, and look at the Pearson R correlation coefficients sorted in descending order.

## Add the review counts to the correlation table
Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])
## Keep only the places with at least 10 reviews, sorted by Pearson R
Tortas_corr_summary[Tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(10)
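
Here is the join-then-filter pattern on two invented tables. The names toy_corr, toy_rating, and min_reviews, as well as all the values, are hypothetical.

```python
import pandas as pd

# Toy tables indexed by place ID, invented for illustration
toy_corr = pd.DataFrame({"PearsonR": [1.0, 0.9, 0.4]}, index=[10, 20, 30])
toy_rating = pd.DataFrame({"rating_count": [36, 12, 3]}, index=[10, 20, 30])

# join aligns the two frames on their shared index (the place ID)
toy_summary = toy_corr.join(toy_rating["rating_count"])

min_reviews = 10  # same threshold as in the text
popular = toy_summary[toy_summary["rating_count"] >= min_reviews]
top = popular.sort_values("PearsonR", ascending=False).head(10)
print(top)
```

Keeping the threshold in a named variable makes it easy to experiment with stricter or looser cut-offs.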

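As you may have noticed, some places have a Pearson R value of exactly 1. These values are not meaningful here. The likely reason is that only a very small number of users reviewed both places and scored them consistently. With just two shared reviewers, for example, Pearson R is always exactly 1 or -1 whenever their ratings differ for both places.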
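
The sketch below, on invented data, shows how a tiny overlap produces a perfect coefficient. Counting shared reviewers is one possible way to screen for such cases; it is an assumption added for illustration and not a step from the course.

```python
import numpy as np
import pandas as pd

# Toy user x place matrix, invented for illustration; NaN means "no review"
toy_crosstab = pd.DataFrame(
    {
        10: [2, 1, 0, 2],
        20: [2, 1, np.nan, np.nan],  # shares only U1 and U2 with place 10
    },
    index=["U1", "U2", "U3", "U4"],
)

# Only two users rated both places, so the two points lie on a straight line
r_two_points = toy_crosstab[10].corr(toy_crosstab[20])

# For each place, count the users who rated both it and place 10
rated_target = toy_crosstab[10].notna()
shared_reviewers = toy_crosstab.loc[rated_target].count()

print(r_two_points)
print(shared_reviewers)
```

A place with a high coefficient but very few shared reviewers deserves less trust than one backed by many.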

## Set aside the places with a Pearson R value of 1
## Take the top seven correlated results and see if any of them also serve fast food
## Merge the top correlated place IDs with the type of food they serve
places_corr_Tortas = pd.DataFrame([135085, 132754, 135045, 135062, 135028, 135042, 135046], index=np.arange(7), columns=['placeID'])
summary = pd.merge(places_corr_Tortas, cuisine, on='placeID')

As you can see from the summary table, some places are missing. That is because they are not listed in the cuisine dataset.
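
The rows disappear because pd.merge performs an inner join by default. As a sketch on invented tables (the IDs and the column name cuisine_type are hypothetical), passing how='left' keeps every place and leaves the missing cuisine as null.

```python
import pandas as pd

# Toy tables, invented for illustration
toy_places = pd.DataFrame({"placeID": [1, 2, 3]})
toy_cuisine = pd.DataFrame({
    "placeID": [1, 3],
    "cuisine_type": ["Fast_Food", "Mexican"],  # hypothetical column name
})

# Default inner merge: place 2 disappears because it has no cuisine row
inner = pd.merge(toy_places, toy_cuisine, on="placeID")

# Left merge keeps every place; a missing cuisine shows up as NaN
left = pd.merge(toy_places, toy_cuisine, on="placeID", how="left")

print(inner)
print(left)
```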

## Get the name of the restaurant with ID 135046, as both it and Tortas serve fast food
places[places['placeID'] == 135046]
## Get more info about Restaurante El Reyecito (the most similar place)

That’s it. It might look a bit complicated. In the next tutorial, I will use a machine learning algorithm instead, which should be a lot simpler.

Any feedback, suggestions, and corrections are most welcome.