How to Build a Movie Recommendation System?

Original article was published by Ramya Vidiyala on Artificial Intelligence on Medium


Step by step guide to building a simple recommendation system

Have you ever wondered how YouTube recommends content, or how Facebook suggests new friends? Perhaps you’ve noticed similar recommendations for LinkedIn connections, or how Amazon recommends similar products while you’re browsing. All of these recommendations are made possible by recommender systems.

Recommender systems encompass a class of techniques and algorithms that can suggest “relevant” items to users. They predict future behavior based on past data through a multitude of techniques including matrix factorization.

In this article, I’ll look at why we need recommender systems and the different types of recommender systems in use. Then, I’ll show you how to build your own movie recommendation system using an open-source dataset.


  • Why Do We Need Recommender Systems?
  • Types of Recommender Systems
    A) Content-Based Movie Recommendation Systems
    B) Collaborative Filtering Movie Recommendation Systems
  • The Dataset
  • Designing our Movie Recommendation System
  • Implementation
    Step 1: Matrix Factorization-based Algorithm
    Step 2: Creating Handcrafted Features
    Step 3: Creating a Final Model for our Movie Recommendation System
  • Performance Metrics
  • Summary

Why Do We Need Recommender Systems?

We now live in what some call the “era of abundance”. For any given product, there are sometimes thousands of options to choose from. Think of the examples above: streaming videos, social networking, online shopping; the list goes on. Recommender systems help to personalize a platform and help the user find something they like.

The simplest way to do this is to recommend the most popular items. However, to really enhance the user experience through personalized recommendations, we need dedicated recommender systems.

From a business standpoint, the more relevant products a user finds on the platform, the higher their engagement. This often results in increased revenue for the platform itself. Various sources say that as much as 35–40% of tech giants’ revenue comes from recommendations alone.

Now that we understand the importance of recommender systems, let’s have a look at the types of recommender systems, then build our own with open-source data!

Types of Recommender Systems

Machine learning algorithms in recommender systems typically fit into two categories: content-based systems and collaborative filtering systems. Many modern recommender systems combine both approaches.

Let’s have a look at how they work using movie recommendation systems as a base.

A) Content-Based Movie Recommendation Systems

Content-based methods are based on the similarity of movie attributes. With this type of recommender system, if a user watches one movie, similar movies are recommended. For example, if a user watches a comedy starring Adam Sandler, the system will recommend movies in the same genre, movies starring the same actor, or both. With this in mind, the input for building a content-based recommender system is movie attributes.

Figure 1: Overview of content-based recommendation system (Image created by author)
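
As a rough illustration of the content-based idea, the sketch below compares movies by the cosine similarity of their attribute vectors. The movie names and genre columns are hypothetical and are not taken from the MovieLens data; a real system would use far richer attributes.

```python
import numpy as np

# Hypothetical attribute vectors: one row per movie, one column per genre.
# Columns: comedy, romance, action
movies = ["Comedy A", "Comedy B", "Action C"]
features = np.array([
    [1.0, 1.0, 0.0],  # Comedy A: comedy + romance
    [1.0, 0.0, 0.0],  # Comedy B: comedy only
    [0.0, 0.0, 1.0],  # Action C: action only
])


def cosine_similarity_matrix(x):
    """Return the pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / norms  # assumes no movie has an all-zero attribute vector
    return unit @ unit.T


def most_similar(index, sim):
    """Return the index of the movie most similar to `index`, excluding itself."""
    scores = sim[index].copy()
    scores[index] = -np.inf
    return int(np.argmax(scores))


sim = cosine_similarity_matrix(features)
best = most_similar(0, sim)
print(movies[best])  # the movie to recommend after "Comedy A"
```

Here a user who watched "Comedy A" would be pointed to "Comedy B", because the two share the comedy attribute while "Action C" shares none.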

B) Collaborative Filtering Movie Recommendation Systems

With collaborative filtering, the system is based on past interactions between users and movies. With this in mind, the input for a collaborative filtering system is made up of past data of user interactions with the movies they watch.

For example, if user A watches M1, M2, and M3, and user B watches M1, M3, and M4, we recommend M1 and M3 to a similar user C. The figure below illustrates this.

Figure 2: An example of the collaborative filtering movie recommendation system (Image created by author)

This data is stored in a matrix called the user-movie interactions matrix, where the rows are the users and the columns are the movies.
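
To make the layout of this matrix concrete, here is a minimal sketch that builds it from a handful of (user, movie, rating) records. The users, movies, and ratings are made up for illustration, and 0 is used as an assumed marker for "no interaction".

```python
import numpy as np

# Hypothetical ratings as (user, movie, rating) triples.
ratings = [
    ("A", "M1", 5), ("A", "M2", 3), ("A", "M3", 4),
    ("B", "M1", 4), ("B", "M3", 5), ("B", "M4", 2),
]

# Map each user and movie to a row and column position.
users = sorted({u for u, _, _ in ratings})
movie_ids = sorted({m for _, m, _ in ratings})
user_index = {u: i for i, u in enumerate(users)}
movie_index = {m: j for j, m in enumerate(movie_ids)}

# Rows are users, columns are movies; 0 marks "no interaction" here.
interactions = np.zeros((len(users), len(movie_ids)))
for u, m, r in ratings:
    interactions[user_index[u], movie_index[m]] = r

print(interactions)
```

Even in this tiny example some cells stay empty, and those empty cells are what a collaborative filtering model tries to fill in.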

Now, let’s implement our own movie recommendation system using the concepts discussed above.

The Dataset

For our own system, we’ll use the open-source MovieLens dataset from GroupLens. The version used here contains about 100K ratings that users have given to movies.

We will use three columns from the data:

  • userId
  • movieId
  • rating

You can see a snapshot of the data in figure 3, below:

Figure 3: Snapshot of data (Image by author)

Designing our Movie Recommendation System

To obtain recommendations for our users, we will predict their ratings for movies they haven’t watched yet. Movies are then ranked and suggested to users based on these predicted ratings.
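
The step from predicted ratings to recommendations can be sketched as follows. The function name `recommend_top_n` and the sample numbers are assumptions for illustration; the predicted ratings would come from whatever model is trained later.

```python
import numpy as np


def recommend_top_n(predicted, already_rated, n=2):
    """Return indices of the n highest-predicted movies the user has not rated."""
    # Push already-rated movies to the bottom of the ranking.
    scores = np.where(already_rated, -np.inf, predicted)
    order = np.argsort(-scores, kind="stable")
    unseen_count = int((~already_rated).sum())
    return order[:min(n, unseen_count)].tolist()


# Hypothetical predicted ratings for one user over five movies.
predicted = np.array([4.5, 3.0, 4.8, 2.1, 4.0])
already_rated = np.array([False, True, True, False, False])

top = recommend_top_n(predicted, already_rated, n=2)
print(top)
```

Movie 2 has the highest predicted rating, but the user has already rated it, so it is skipped in favor of unseen movies.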

To do this, we will use past records of movies and user ratings to predict their future ratings. At this point, it’s worth mentioning that in the real world, we will likely encounter new users or movies without a history. Such situations are called cold start problems.

Let’s take a brief look at how cold start problems can be addressed.

Cold Start Problems

Cold start problems can be handled by recommendations based on meta-information, such as:

  • For new users, we can use their location, age, gender, browser, and user device to predict recommendations.
  • For new movies, we can use genre, cast, and crew to recommend it to target users.


For our recommender system, we’ll use both of the techniques mentioned above: content-based and collaborative filtering. To find the similarity between movies for our content-based method, we’ll use a cosine similarity function. For our collaborative filtering method, we’ll use a matrix factorization technique.

The first step towards this is creating a matrix factorization-based model. We’ll use the output of this model and a few handcrafted features to provide inputs to the final model. The basic process will look like this:

  • Step 1: Build a matrix factorization-based model
  • Step 2: Create handcrafted features
  • Step 3: Implement the final model

We’ll look at these steps in greater detail below.

Step 1: Matrix Factorization-based Algorithm

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. This family of methods became widely known during the Netflix prize challenge due to how effective it was.

Matrix factorization algorithms work by decomposing the user-movie interaction matrix into the product of two lower-dimensional rectangular matrices, say U and M. The decomposition is chosen so that the product closely approximates the known values of the user-movie interaction matrix. Here, U represents the user matrix, M represents the movie matrix, n is the number of users, and m is the number of movies, so U has n rows and M has m columns.

Each row of the user matrix represents a user and each column of the movie matrix represents a movie.

Figure 4: Matrix factorization (Image created by author)

Once we obtain the U and M matrices, which are learned from the non-empty cells in the user-movie interaction matrix, we compute the product of U and M and use it to predict the values of the empty cells in the user-movie interaction matrix.
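
The idea can be sketched in plain NumPy with stochastic gradient descent over the observed cells. This is a simplified illustration under assumed hyperparameters, not the algorithm implemented by the Surprise library, which also learns bias terms. The toy matrix is far too small to give meaningful predictions; it only shows the mechanics.

```python
import numpy as np


def factorize(ratings, observed, n_factors=2, lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Learn U (users x factors) and M (factors x movies) from observed cells only."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    M = rng.normal(scale=0.1, size=(n_factors, n_movies))
    rows, cols = np.nonzero(observed)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = ratings[i, j] - U[i] @ M[:, j]
            u_old = U[i].copy()
            # Gradient step on squared error with L2 regularization.
            U[i] += lr * (err * M[:, j] - reg * U[i])
            M[:, j] += lr * (err * u_old - reg * M[:, j])
    return U, M


# Toy interaction matrix: 0 marks an empty (unrated) cell.
ratings = np.array([[5, 3, 4, 0],
                    [4, 0, 5, 2]], dtype=float)
observed = ratings > 0

U, M = factorize(ratings, observed)
predicted = U @ M  # dense matrix: the empty cells now hold predictions

rmse = np.sqrt(np.mean((ratings[observed] - predicted[observed]) ** 2))
print("RMSE on observed cells:", float(rmse))
print("Predictions for empty cells:", predicted[~observed])
```

Only the observed cells contribute to the training error, which is what lets the product of U and M supply values for the cells that were empty.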

To implement matrix factorization, we use Surprise, a simple Python library for building and testing recommender systems. The data frame is converted into a trainset, the data format that the Surprise library expects for training.

import numpy as np
from surprise import SVD, Reader, Dataset

# Tell Surprise how to read the data frame: ratings range from 1 to 5.
reader = Reader(rating_scale=(1, 5))
# Load the data frame; columns must be in (user, item, rating) order.
train_data_mf = Dataset.load_from_df(train_data[['userId', 'movieId', 'rating']], reader)
# Build the trainset, the dataset format used by the Surprise library.
trainset = train_data_mf.build_full_trainset()
# Create the SVD model and fit it on the trainset.
svd = SVD(n_factors=100, biased=True, random_state=15, verbose=True)
svd.fit(trainset)

Now the model is ready. We’ll store its predictions on the training data to pass to the final model as an additional feature. This will help us incorporate collaborative filtering into our system.

#getting predictions of train set
train_preds = svd.test(trainset.build_testset())
train_pred_mf = np.array([pred.est for pred in train_preds])

Note that we also have to generate predictions for the test data in the same way, using the model fitted on the training data.

Step 2: Creating Handcrafted Features

Let’s convert the data from the data frame format into a user-movie interaction matrix. Matrices used in this type of problem are generally sparse because most users rate only a few movies.

We’ll store the matrix in compressed sparse row (CSR) format, one of several sparse matrix formats. Its advantages are as follows:

  • efficient arithmetic operations: CSR + CSR, CSR * CSR, etc.
  • efficient row slicing
  • fast matrix-vector products

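scipy.sparse.csr_matrix builds a sparse matrix from the rating values and their (row, column) indices, which lets us convert the data frame efficiently.

from scipy import sparse

# Creating a sparse matrix: rows are userIds, columns are movieIds, values are ratings
train_sparse_matrix = sparse.csr_matrix((train_data.rating.values, (train_data.userId.values, train_data.movieId.values)))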

‘train_sparse_matrix’ is the sparse matrix representation of the train_data data frame.
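
To see what this construction produces, the sketch below builds a small CSR matrix from made-up ratings and exercises the operations listed above. The IDs and ratings are hypothetical. Note that when raw IDs are used as indices, the matrix gets one row per possible ID up to the maximum, so IDs that start at 1 leave row 0 empty.

```python
import numpy as np
from scipy import sparse

# Hypothetical ratings: user_ids and movie_ids act as row and column indices.
user_ids = np.array([0, 0, 1, 2])
movie_ids = np.array([0, 2, 1, 2])
ratings = np.array([5.0, 3.0, 4.0, 2.0])

matrix = sparse.csr_matrix((ratings, (user_ids, movie_ids)), shape=(3, 3))

# Fraction of cells that hold no rating.
sparsity = 1.0 - matrix.nnz / (matrix.shape[0] * matrix.shape[1])

# Efficient row slicing: all ratings given by user 0.
row_of_user_0 = matrix[0].toarray().ravel()

# Matrix-vector product: sum of ratings per user.
row_sums = matrix @ np.ones(matrix.shape[1])

print(sparsity, row_of_user_0, row_sums)
```

Only the four stored ratings take up memory, which is what makes this format practical for a matrix with many users and movies.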

We’ll create 3 sets of features using this sparse matrix:

  1. Features which represent global averages
  2. Features which represent the top five similar users
  3. Features which represent the top five similar movies

Let’s take a look at how to prepare each in more detail.