Source: Deep Learning on Medium
Previously, I wrote a step-by-step tutorial, Recommendation System: Evaluating Similarity Based on Correlation, that uses Pearson’s R correlation.
Similarly, in this post I will walk step by step through a model-based collaborative filtering system using truncated singular value decomposition (SVD), which I learned from Lynda.
We will start by importing numpy, pandas, and TruncatedSVD from scikit-learn. SVD is a linear algebra method that can be used to decompose a utility matrix into three compressed matrices. It is very useful because, once the model is built, recommendations can be made from the compressed matrices without referring back to the complete dataset. SVD also uncovers latent variables: hidden factors that are present in the data and affect its behaviour.
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
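To make the decomposition idea concrete, here is a minimal sketch using plain numpy on a tiny made-up matrix. The matrix values, the variable names, and the choice of k = 2 are assumptions for illustration only; they are not part of the MovieLens workflow.

```python
import numpy as np

# Toy utility matrix: 4 items (rows) x 3 users (columns); the values are made up.
A = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
])

# Full SVD gives three matrices: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# "Truncating" means keeping only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each row of `reduced` describes one item with k latent features.
reduced = U[:, :k] * s[:k]

print(reduced.shape)
print(np.round(A_k, 1))
```

The rank-2 approximation A_k stays close to A even though each item is now described by two numbers instead of three. As far as I understand, this reduced representation is the kind of matrix that TruncatedSVD's fit_transform returns.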
In this tutorial, we will use the MovieLens dataset, which was collected by the GroupLens Research Project at the University of Minnesota. Please download the dataset from this link.
Preparing the data
First, we create a list of column names: user_id, item_id, rating, and timestamp. Then we read the data from the tab-separated file u.data into a DataFrame that uses those names.
## separate the data fetched from csv file with tab '\t'
## fetched from u.data
columns = ['user_id', 'item_id', 'rating', 'timestamp']
frame = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)
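If you want to try the parsing step without downloading anything, the following sketch reads two sample lines from an in-memory string. The sample values and the names sample and sample_frame are illustrative assumptions, and treating the timestamp as Unix seconds is my reading of the dataset's documentation.

```python
import io
import pandas as pd

# Two illustrative lines in the tab-separated layout of u.data (for demonstration only)
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

columns = ['user_id', 'item_id', 'rating', 'timestamp']
sample_frame = pd.read_csv(io.StringIO(sample), sep='\t', names=columns)

# Assumption: the timestamp column holds Unix seconds, so it can be made readable
sample_frame['rated_at'] = pd.to_datetime(sample_frame['timestamp'], unit='s')

print(sample_frame)
```

Passing names= matters here because the file has no header row; without it, pandas would treat the first rating as the column names.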
Then, we redefine the list called columns so that it consists of item_id, movie title, release date, video release date, IMDb URL, unknown, Action, Adventure, Animation, Childrens, and the other movie categories.
## fetched from u.item
columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')
movie_names = movies[['item_id', 'movie title']]
Then, we combine the two into a single table called combined_movies_data by merging frame and movie_names on their shared item_id column.
combined_movies_data = pd.merge(frame, movie_names, on='item_id')
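Here is a small sketch of what that merge does, using made-up rows. The frames ratings and titles and the titles 'Movie B' and 'Movie C' are hypothetical stand-ins for the real tables.

```python
import pandas as pd

# Hypothetical miniature versions of `frame` and `movie_names`
ratings = pd.DataFrame({
    'user_id': [1, 2, 3],
    'item_id': [50, 50, 181],
    'rating': [5, 4, 3],
})
titles = pd.DataFrame({
    'item_id': [50, 181, 1],
    'movie title': ['Star Wars (1977)', 'Movie B', 'Movie C'],
})

# pd.merge defaults to an inner join on the shared key
merged = pd.merge(ratings, titles, on='item_id')
print(merged)
```

Because the default is an inner join, a title with no ratings (here 'Movie C') does not appear in the result, while every rating row gains its movie title.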
Now we will check which movies have the most reviews, just like we did in this post.
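One possible way to do that count is a groupby on item_id, sketched below on made-up rows. The frame toy and the name rating_counts are assumptions; on the real data you would group combined_movies_data instead.

```python
import pandas as pd

# Hypothetical miniature of combined_movies_data; the rows are made up
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 2, 1],
    'item_id': [50, 50, 50, 181, 181, 1],
    'rating': [5, 4, 5, 4, 5, 3],
})

# Count the ratings per item and put the most-reviewed item first
rating_counts = (
    toy.groupby('item_id')['rating']
       .count()
       .sort_values(ascending=False)
)
print(rating_counts.head())
```

The first index value of rating_counts is the item_id with the most reviews, which is the ID to look up next.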
Then we will check which movie has the item_id of 50. Since we only want the unique title rather than every single record where item_id equals 50, we are going to use the .unique() function.
item_50_mask = combined_movies_data['item_id'] == 50
combined_movies_data[item_50_mask]['movie title'].unique()
Building a Utility Matrix
In this section, we will build the utility matrix as a crosstab generated from our combined movies data: one row per user, one column per movie title, and 0 wherever a user has not rated a movie.
rating_crosstab = combined_movies_data.pivot_table(values='rating', index='user_id', columns='movie title', fill_value=0)
Transposing the Matrix
Next, we transpose this utility matrix so that each row is a movie and each column is a user. Later we are going to use SVD to decompose it into synthetic representations of the user reviews. We call the transposed matrix X.
X = rating_crosstab.T
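The pivot and transpose steps can be seen on a tiny made-up table. The frame toy, its single-letter titles, and the name toy_crosstab are illustrative assumptions.

```python
import pandas as pd

# Hypothetical ratings: user 2 has not rated 'B', user 3 has not rated 'A'
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'movie title': ['A', 'B', 'A', 'B'],
    'rating': [5, 3, 4, 2],
})

# Rows are users, columns are titles; missing ratings become 0
toy_crosstab = toy.pivot_table(values='rating', index='user_id',
                               columns='movie title', fill_value=0)
print(toy_crosstab)

# After transposing, each row is a movie and each column is a user
toy_X = toy_crosstab.T
print(toy_X.shape)
```

Note that pivot_table aggregates with the mean by default, so if two distinct items share the same title, their ratings from one user are averaged into a single cell.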
Decomposing the Matrix
Now let’s decompose it. We instantiate a TruncatedSVD object, which we will call SVD, and set the resultant matrix to have 12 dimensions.
## passing random_state = 17 to get the same repeatable results
SVD = TruncatedSVD(n_components=12, random_state=17)
resultant_matrix = SVD.fit_transform(X)
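To see what fit_transform returns, here is a sketch on a small random matrix. The matrix size, the 5 components, and the names toy_X, toy_svd, and toy_reduced are assumptions chosen so that the example runs quickly.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(17)

# Hypothetical item-by-user matrix: 20 items, 30 users, ratings 0..5 (0 = unrated)
toy_X = rng.randint(0, 6, size=(20, 30)).astype(float)

toy_svd = TruncatedSVD(n_components=5, random_state=17)
toy_reduced = toy_svd.fit_transform(toy_X)

# One row per item, one column per latent component
print(toy_reduced.shape)

# Share of the variance kept by the 5 components
print(toy_svd.explained_variance_ratio_.sum())
```

The number of rows is unchanged, so each movie keeps its position; only the number of columns shrinks from one per user to one per component. Checking explained_variance_ratio_ is one way to judge whether the chosen number of components is reasonable.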
Generating a Correlation Matrix
We want to find out how similar each movie is to every other movie on the basis of user tastes. We use Pearson’s R correlation coefficient to do that. The correlation matrix will be 1,664 by 1,664, with one row and one column per movie. In this case, Star Wars (1977) will be the pivot.
For each movie pair in the matrix, we will calculate how strongly they correlate from the users' perspective. To do that, we will use numpy’s corrcoef function and pass it resultant_matrix.
corr_mat = np.corrcoef(resultant_matrix)
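It helps to know that np.corrcoef treats each row as one variable, which is why the transposed, movie-per-row layout matters. A minimal sketch with made-up feature rows (the values and the names toy_features and toy_corr are assumptions):

```python
import numpy as np

# Each row stands for one item described by 3 latent features (made-up values)
toy_features = np.array([
    [1.0, 2.0, 3.0],    # item 0
    [2.0, 4.0, 6.1],    # item 1: almost proportional to item 0
    [3.0, 1.0, -2.0],   # item 2: moves in the opposite direction
])

# Rows are variables, so the result is items x items
toy_corr = np.corrcoef(toy_features)

print(toy_corr.shape)
print(np.round(toy_corr, 3))
```

The result is symmetric with ones on the diagonal; items 0 and 1 correlate strongly, while items 0 and 2 correlate negatively.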
Isolating Star Wars From the Correlation Matrix
Now that we have a 1,664 by 1,664 correlation matrix, we move on to isolating Star Wars from it. First, we are going to generate a movie names index.
## pulling the movie names from ratings crosstab columns
## convert the pandas Index to a list, then retrieve the index of Star Wars (1977)
movie_names = rating_crosstab.columns
movies_list = list(movie_names)
star_wars = movies_list.index('Star Wars (1977)')
## isolating the array that represents Star Wars
corr_star_wars = corr_mat[star_wars]
corr_star_wars.shape
output -> (1664,)
Recommending a Highly Correlated Movie
Let’s now generate a list of movie names that correlate highly with Star Wars.
list(movie_names[(corr_star_wars<1.0) & (corr_star_wars > 0.9)])
Finally, we are going to retrieve only the movies that have a Pearson’s R coefficient close to one.
list(movie_names[(corr_star_wars<1.0) & (corr_star_wars > 0.95)])
Honestly, I have only watched a few Star Wars movies, but if you check online, there are similarities between these movies. Both come from a similar era, and both are very popular sci-fi films. Therefore, it is very likely that if you like Star Wars, you will also like Return of the Jedi.
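As an alternative to hand-picking thresholds such as 0.9 and 0.95, the lookup can be wrapped in a small helper that returns the top n matches. This is only a sketch: the function top_correlated, the titles 'A' to 'D', and the correlation values are all hypothetical and not part of the original tutorial.

```python
import numpy as np

def top_correlated(corr_mat, names, pivot_title, n=3):
    """Return the n (title, correlation) pairs most correlated with pivot_title."""
    names = list(names)
    pivot = names.index(pivot_title)           # row that belongs to the pivot movie
    row = np.asarray(corr_mat)[pivot]
    order = np.argsort(row)[::-1]              # indices by descending correlation
    # Skip the pivot itself, which always correlates perfectly with itself
    return [(names[i], float(row[i])) for i in order if i != pivot][:n]

# Made-up correlation matrix for four hypothetical titles
toy_names = ['A', 'B', 'C', 'D']
toy_corr = np.array([
    [1.00, 0.98, 0.20, -0.50],
    [0.98, 1.00, 0.10, -0.40],
    [0.20, 0.10, 1.00, 0.30],
    [-0.50, -0.40, 0.30, 1.00],
])

print(top_correlated(toy_corr, toy_names, 'A', n=2))
```

With the real data, the call would take corr_mat, movie_names, and 'Star Wars (1977)'. Ranking avoids the situation where a fixed threshold returns either nothing or far too many titles.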
That’s it. This is a little bit similar to my previous post, but here I used a machine learning algorithm to find the solution. Any feedback, suggestions, and corrections are most welcome.