Model-Based Collaborative Filtering Systems with a Machine Learning Algorithm

Source: Deep Learning on Medium



Previously, I wrote a step-by-step tutorial for a Recommendation System — Evaluating Similarity Based on Correlation — which used Pearson's R correlation.

Similarly, in this post, I will provide a step-by-step guide to a model-based collaborative filtering system using truncated singular value decomposition, which I learned from Lynda.

We will start by importing numpy, pandas, and TruncatedSVD from scikit-learn. SVD is a linear algebra method that decomposes a utility matrix into three compressed matrices. It is very useful because it lets us work with this compact representation instead of referring back to the complete dataset. SVD uncovers the latent variables that drive the behaviour of the dataset.

reference: https://www.lynda.com/Python-tutorials/Model-based-collaborative-filtering-systems/563080/632875-4.html
import numpy as np
import pandas as pd
import sklearn
from sklearn.decomposition import TruncatedSVD
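To see what the decomposition looks like concretely, here is a minimal sketch on a toy matrix (illustrative values only, not the MovieLens data): a full SVD splits the matrix into three factors, and keeping only the top singular values gives a compressed approximation.

```python
import numpy as np

# a toy 5-user x 4-item utility matrix (hypothetical ratings)
A = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

# decompose: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# reconstructing with all singular values recovers A exactly
A_full = U @ np.diag(s) @ Vt

# keeping only the top 2 singular values gives a compressed approximation
k = 2
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

The truncated reconstruction has the same shape as the original but is built from far fewer numbers, which is exactly the efficiency gain described above.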

In this tutorial, we will use the MovieLens dataset, which was collected by the GroupLens Research Project at the University of Minnesota. Please download the dataset on this link.


Preparing the data

First, we are going to create a list of column names — user_id, item_id, rating, and timestamp — and then fetch the data from the CSV file into a DataFrame with those columns.

## separate the data fetched from csv file with tab '\t'
## fetched from u.data
columns = ['user_id', 'item_id', 'rating', 'timestamp']
frame = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)
frame.head()
displaying each user and each of the movies they have reviewed, with its rating

Then, we will create a list called columns consisting of item_id, movie title, release date, video release date, IMDb URL, unknown, Action, Adventure, Animation, Childrens, and the other movie categories.

## fetched from u.item
columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')
movie_names = movies[['item_id', 'movie title']]
movie_names.head()

Then, we merge the two into a table called combined_movies_data by joining frame and movie_names on the item_id column.

combined_movies_data = pd.merge(frame, movie_names, on='item_id')
combined_movies_data.head()
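If you have not used pd.merge before, this toy sketch (hypothetical data, not the MovieLens files) shows how it joins two tables on a shared key:

```python
import pandas as pd

# hypothetical ratings and titles sharing an item_id key
ratings = pd.DataFrame({'user_id': [1, 1, 2],
                        'item_id': [10, 20, 10],
                        'rating': [4, 5, 3]})
titles = pd.DataFrame({'item_id': [10, 20],
                       'movie title': ['Alpha', 'Beta']})

# inner join on item_id: each rating row gains its movie title
merged = pd.merge(ratings, titles, on='item_id')
```

Each rating row now carries its movie title, which is exactly what combined_movies_data looks like.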

Now we will check which movies have the most reviews, just as we did in the previous post.

combined_movies_data.groupby('item_id')['rating'].count().sort_values(ascending=False).head()

Then we will look up which movie has an item_id of 50. Since we only want the unique title instead of every single record where item_id equals 50, we are going to use the .unique() function.

## using movie_filter to avoid shadowing Python's built-in filter()
movie_filter = combined_movies_data['item_id'] == 50
combined_movies_data[movie_filter]['movie title'].unique()

Building a Utility Matrix

In this section, we will create a crosstab (utility) matrix generated from our combined movies data, with users as rows and movie titles as columns.

rating_crosstab = combined_movies_data.pivot_table(values='rating', index='user_id', columns='movie title', fill_value=0)
rating_crosstab.head()
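On toy data (hypothetical ratings, not the MovieLens files), the pivot looks like this — each user becomes a row, each title a column, and unrated cells are filled with 0:

```python
import pandas as pd

# hypothetical long-format ratings
data = pd.DataFrame({'user_id': [1, 1, 2],
                     'movie title': ['Alpha', 'Beta', 'Alpha'],
                     'rating': [4, 5, 3]})

# users as rows, titles as columns; unrated cells become 0
crosstab = data.pivot_table(values='rating', index='user_id',
                            columns='movie title', fill_value=0)
```

User 2 never rated Beta, so that cell is 0 rather than missing — which is what lets us feed the matrix to SVD later.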

Transposing the Matrix

Next, we will transpose this utility matrix so that movies become the rows; later we will use SVD to decompose it down to a compact representation of the user reviews. We are going to call the transposed matrix X.

rating_crosstab.shape
X = rating_crosstab.T
X.shape

Decomposing the Matrix

Now let’s decompose it. We will instantiate an SVD object, call it SVD, using TruncatedSVD, and set the resultant matrix to have 12 dimensions.

## passing random_state = 17 to get the same repeatable results
SVD = TruncatedSVD(n_components=12, random_state=17)
resultant_matrix = SVD.fit_transform(X)
resultant_matrix.shape
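As a sanity check on what fit_transform returns, here is the same call on a toy matrix (random data with assumed shapes, not the MovieLens matrix): each of the 20 "movie" rows is compressed from 10 user-columns down to 3 latent components.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(17)
X_toy = rng.rand(20, 10)            # 20 "movies" x 10 "users" (toy data)

svd = TruncatedSVD(n_components=3, random_state=17)
reduced = svd.fit_transform(X_toy)  # each movie is now a 3-dimensional vector
```

In the tutorial the same step turns the 1,664 x 943 matrix into a 1,664 x 12 one.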

Generating a Correlation Matrix

We want to find out how similar each movie is to every other movie on the basis of user tastes. We will use Pearson’s R correlation coefficient to do that. The correlation matrix will be 1,664 by 1,664 — one row and column per movie. In this case, Star Wars (1977) will be the pivot.

For each movie pair in the matrix, we will calculate how strongly they correlate from the users’ perspective. To do that, we will use NumPy’s corrcoef function and pass it the resultant_matrix.

corr_mat = np.corrcoef(resultant_matrix)
corr_mat.shape
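np.corrcoef treats each row as one variable, so passing the (movies x components) matrix yields a movie-by-movie correlation matrix. A toy sketch with hypothetical latent vectors:

```python
import numpy as np

# three "movies", each described by a 4-dimensional latent vector (toy values)
vecs = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with the first
                 [4.0, 3.0, 2.0, 1.0]])   # perfectly anti-correlated

corr = np.corrcoef(vecs)
```

Each diagonal entry is 1.0 (every movie correlates perfectly with itself), which is why we will exclude values equal to 1.0 when filtering later.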

Isolating Star Wars From the Correlation Matrix

Now that we have a 1,664 by 1,664 correlation matrix, we can isolate Star Wars from it. First, we are going to generate an index of movie names.

## pulling the movie names from ratings crosstab columns
## convert numpy array to a list then retrieve index of Star Wars, 1977
movie_names = rating_crosstab.columns
movies_list = list(movie_names)
star_wars = movies_list.index('Star Wars (1977)')
star_wars
output -> 1398, the index for Star Wars (1977)
## isolating the row that represents Star Wars, using the index we just found
corr_star_wars = corr_mat[star_wars]
corr_star_wars.shape
output -> (1664,)

Recommending a Highly Correlated Movie

Let’s now generate a list of movie names that correlate highly with Star Wars.

list(movie_names[(corr_star_wars<1.0) & (corr_star_wars > 0.9)])

Finally, we are going to retrieve only the movies whose Pearson’s R coefficient is closest to one, by raising the threshold to 0.95.

list(movie_names[(corr_star_wars<1.0) & (corr_star_wars > 0.95)])
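The filtering above is plain boolean masking over the column index; here is the same pattern on toy data (hypothetical titles and correlation values):

```python
import numpy as np
import pandas as pd

names = pd.Index(['Alpha', 'Beta', 'Gamma'])
corr_row = np.array([1.0, 0.97, 0.40])  # hypothetical correlations with 'Alpha'

# keep titles highly correlated with the pivot, excluding the pivot itself
similar = list(names[(corr_row < 1.0) & (corr_row > 0.95)])
```

The pivot movie always correlates 1.0 with itself, so the `< 1.0` condition removes it from its own recommendation list.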

Honestly, I have only watched a few Star Wars movies, but if you check online, there are clear similarities between these films. They come from the same era, and both are very popular sci-fi films. Therefore, it is very likely that if you like Star Wars, you will also like Return of the Jedi.

That’s it. This is a little similar to my previous post, but here I used a machine learning algorithm to find the solution. Any feedback, suggestions, and corrections are most welcome.