Original article was published by James Loy on Deep Learning on Medium
Deep Learning based Recommender Systems
A gentle introduction to modern movie recommenders
Traditionally, recommender systems have been based on methods such as clustering, nearest neighbors, and matrix factorization. However, in recent years, deep learning has yielded tremendous success across multiple domains, from image recognition to natural language processing. Recommender systems have also benefited from deep learning’s success. In fact, today’s state-of-the-art recommender systems, such as those at YouTube and Amazon, are powered by complex deep learning systems and rely less on traditional methods.
Why this tutorial?
While reading through the many useful tutorials here that cover the basics of recommender systems using traditional methods such as matrix factorization, I noticed a lack of tutorials that cover deep learning based recommender systems. In this notebook, we’ll go through the following:
- How to create your own deep learning based recommender system using PyTorch Lightning
- The difference between implicit and explicit feedback for recommender systems
- How to train-test split a dataset for training recommender systems without introducing biases and data leakage
- Metrics for evaluating recommender systems (hint: accuracy and RMSE are not appropriate!)
Dataset for this tutorial
This tutorial uses movie reviews provided by the MovieLens 20M dataset, a popular movie ratings dataset containing 20 million movie reviews collected from 1995 to 2015.
If you would like to follow along with the code in this tutorial, you can view my Kaggle Notebook, where you can run the code and see the output as you go.
Building Recommender Systems using Implicit Feedback
Before we build our model, it is important to understand the distinction between implicit and explicit feedback, and why modern recommender systems are built on implicit feedback.
In the context of recommender systems, explicit feedback is direct, quantitative data collected from users. For example, Amazon allows users to rate purchased items on a scale of 1–5. These ratings are provided directly by users, and the scale allows Amazon to quantify user preference. Another example of explicit feedback is the thumbs up/down button on YouTube, which captures users’ explicit preference (i.e. like or dislike) for a particular video.
However, the problem with explicit feedback is that it is rare. If you think about it, when was the last time you clicked the like button on a YouTube video, or rated your online purchases? Chances are, the number of videos you watch on YouTube is far greater than the number of videos that you have explicitly rated.
On the other hand, implicit feedback is collected indirectly from user interactions, and it acts as a proxy for user preference. For example, videos that you watch on YouTube are used as implicit feedback to tailor recommendations for you, even if you don’t rate the videos explicitly. Another example of implicit feedback is the items that you have browsed on Amazon, which are used to suggest other similar items for you.
The advantage of implicit feedback is that it is abundant. Today, online recommender systems are built using implicit feedback, which allows the system to tune its recommendations in real time, with every click and interaction.
Before we start building and training our model, let’s do some preprocessing to get the MovieLens data in the required format.
In order to keep memory usage manageable, we will only use data from 30% of the users in this dataset. Let’s randomly select 30% of the users and only use data from the selected users.
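As a minimal sketch of this sampling step, the function below picks a random 30% of the user IDs and keeps only their rows. To stay dependency-free it assumes the ratings are held as a list of dictionaries rather than a pandas dataframe; the function name `sample_users`, the seed, and the toy rows are illustrative assumptions, not part of the notebook.

```python
import random

def sample_users(ratings, fraction=0.3, seed=42):
    """Keep only the rows belonging to a random `fraction` of the users."""
    # Sort the unique user IDs so the sample is reproducible for a given seed.
    user_ids = sorted({row["userId"] for row in ratings})
    rng = random.Random(seed)
    n_keep = max(1, int(len(user_ids) * fraction))
    kept_users = set(rng.sample(user_ids, n_keep))
    return [row for row in ratings if row["userId"] in kept_users]

# Toy data (made up): 10 users with 3 ratings each.
toy_ratings = [
    {"userId": user, "movieId": movie, "rating": 4.0, "timestamp": 1000 + movie}
    for user in range(1, 11)
    for movie in range(1, 4)
]
subset = sample_users(toy_ratings)
print(len({row["userId"] for row in subset}), "users,", len(subset), "rows")
```

Sampling users rather than rows matters here: every selected user keeps their full rating history, which the leave-one-out split relies on.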
After filtering the dataset, there are now 6,027,314 rows of data from 41,547 users (that’s still a lot of data!). Each row in the dataframe corresponds to a movie review made by a single user, as we can see below.
Along with the rating, there is also a timestamp column that shows the date and time the review was submitted. Using the timestamp column, we will implement our train-test split strategy using the leave-one-out methodology. For each user, the most recent review is used as the test set (i.e. leave one out), while the rest will be used as training data.
To illustrate this, the movies reviewed by user 39849 are shown below. The last movie reviewed by the user is the 2014 hit movie Guardians of the Galaxy. We’ll use this movie as the testing data for this user, and use the rest of the reviewed movies as training data.
This train-test split strategy is often used when training and evaluating recommender systems. Doing a random split would not be fair, as we could potentially be using a user’s recent reviews for training and earlier reviews for testing. This introduces data leakage with a look-ahead bias, and the measured performance of the trained model would not generalize to real-world use.
The code below will split our ratings dataset into a train and test set using the leave-one-out methodology.
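As a rough sketch of the leave-one-out idea (not necessarily the notebook's exact code), the function below holds out each user's most recent row for testing and keeps the rest for training. It assumes rows are plain dictionaries with `userId`, `movieId` and `timestamp` keys; the function name and the toy rows are illustrative assumptions.

```python
def leave_one_out_split(ratings):
    """Hold out each user's most recent rating for testing; train on the rest."""
    latest_index = {}  # userId -> index of that user's most recent row
    for index, row in enumerate(ratings):
        best = latest_index.get(row["userId"])
        if best is None or row["timestamp"] > ratings[best]["timestamp"]:
            latest_index[row["userId"]] = index
    test_indices = set(latest_index.values())
    train = [row for i, row in enumerate(ratings) if i not in test_indices]
    test = [ratings[i] for i in sorted(test_indices)]
    return train, test

# Toy data (made up): user 1 rated three movies, user 2 rated two.
toy_ratings = [
    {"userId": 1, "movieId": 10, "timestamp": 100},
    {"userId": 1, "movieId": 20, "timestamp": 300},
    {"userId": 1, "movieId": 30, "timestamp": 200},
    {"userId": 2, "movieId": 10, "timestamp": 50},
    {"userId": 2, "movieId": 40, "timestamp": 75},
]
train, test = leave_one_out_split(toy_ratings)
print("test movies:", [row["movieId"] for row in test])
```

Every user ends up with exactly one test row, and no training row for a user is newer than that user's test row, which is what rules out the look-ahead bias described above.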
Converting the dataset into an implicit feedback dataset
As discussed earlier, we will train a recommender system using implicit feedback. However, the MovieLens dataset that we are using is based on explicit feedback. To convert this dataset into an implicit feedback dataset, we’ll simply binarize the ratings and convert them to ‘1’ (i.e. positive class). The value of ‘1’ represents that the user has interacted with the item.
It is important to note that using implicit feedback reframes the problem that our recommender is trying to solve. Instead of trying to predict movie ratings when using explicit feedback, we are trying to predict whether the user will interact (i.e. click/buy/watch) with each movie, with the aim of presenting to users the movies with the highest interaction likelihood.
We do have a problem now though. After binarizing our dataset, we see that every sample in the dataset now belongs to the positive class. However, we also require negative samples to train our models, to indicate movies that the user has not interacted with. We assume that such movies are those that the user is not interested in. This is a sweeping assumption that may not be true, but it usually works out rather well in practice.
The code below generates 4 negative samples for each row of data. In other words, the ratio of negative to positive samples is 4:1. This ratio is chosen arbitrarily, but I found that it works rather well in practice (feel free to find the best ratio yourself!).
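A minimal sketch of this negative sampling step, under a few assumptions: interactions are given as `(userId, movieId)` tuples, `known_pairs` holds every interaction we know about (training and held-out) so that a held-out movie is never sampled as a negative, and every user has at least one movie they have not interacted with. The function name and toy data are illustrative, not the notebook's.

```python
import random

def add_negative_samples(train_pairs, known_pairs, all_movie_ids,
                         num_negatives=4, seed=0):
    """Return (users, items, labels) with sampled negatives for each positive."""
    rng = random.Random(seed)
    known = set(known_pairs)
    movie_ids = list(all_movie_ids)
    users, items, labels = [], [], []
    for user, movie in train_pairs:
        # The observed interaction is a positive sample (label 1).
        users.append(user)
        items.append(movie)
        labels.append(1)
        for _ in range(num_negatives):
            # Redraw until we find a movie this user has not interacted with.
            candidate = rng.choice(movie_ids)
            while (user, candidate) in known:
                candidate = rng.choice(movie_ids)
            users.append(user)
            items.append(candidate)
            labels.append(0)
    return users, items, labels

# Toy data (made up).
train_pairs = [(1, 10), (1, 30), (2, 10)]
known_pairs = train_pairs + [(1, 20), (2, 40)]  # includes held-out test items
all_movie_ids = [10, 20, 30, 40, 50, 60, 70]
users, items, labels = add_negative_samples(train_pairs, known_pairs, all_movie_ids)
print(len(labels), "samples,", sum(labels), "positives")
```

Negatives are drawn with replacement, so the same unseen movie can appear more than once for a user. With millions of rows and thousands of movies that is rarely an issue, but it is a design choice worth knowing about.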
Great! We now have the data in the format required by our model. Before we move on, let’s define a PyTorch Dataset to facilitate training. The class below simply encapsulates the code we have written above into a PyTorch Dataset class.
Our model — Neural Collaborative Filtering (NCF)
While there are many deep learning based architectures for recommendation systems, I find that the framework proposed by He et al. is the most straightforward, and it is simple enough to be implemented in a tutorial such as this.
Before we dive into the architecture of the model, let’s familiarize ourselves with the concept of embeddings. An embedding is a low-dimensional space that captures the relationship of vectors from a higher dimensional space. To better understand this concept, let’s take a closer look at user embeddings.
Imagine that we want to represent our users according to their preference for two genres of movies — action and romance movies. Let the first dimension be how much the user likes action movies, and the second dimension be how much the user likes romance movies.
Now, assume that Bob is our first user. Bob likes action movies but isn’t a fan of romance movies. To represent Bob as a two dimensional vector, we place him in the graph according to his preference.
Our next user is Joe. Joe is a huge fan of both action and romance movies. We represent Joe using a two dimensional vector just like Bob.
This two dimensional space is known as an embedding. Essentially, the embedding reduces our users such that they can be represented in a meaningful manner in a lower dimensional space. In this embedding, users with similar movie preferences are placed near each other, and users with dissimilar preferences are placed far apart.
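To make "near each other" concrete, here is a small sketch that measures Euclidean distance between users in this two dimensional space. The coordinates are made-up values chosen to match the descriptions of Bob and Joe, and Alice is a hypothetical third user (not in the original example) whose tastes resemble Bob's.

```python
import math

# Each vector is (preference for action, preference for romance), from -1 to 1.
# All coordinates are made up for illustration.
bob = (0.8, -0.6)    # likes action, not a fan of romance
joe = (0.9, 0.9)     # huge fan of both
alice = (0.7, -0.5)  # hypothetical user with tastes close to Bob's

bob_to_alice = math.dist(bob, alice)
bob_to_joe = math.dist(bob, joe)
print(f"Bob-Alice: {bob_to_alice:.2f}, Bob-Joe: {bob_to_joe:.2f}")
```

Bob sits much closer to Alice than to Joe, which is the property a good embedding should have. Note that `math.dist` requires Python 3.8 or later.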
Of course, we are not restricted to using just 2 dimensions to represent our users. We can use an arbitrary number of dimensions to represent our users. A larger number of dimensions would allow us to capture the traits of each user more accurately, at the cost of model complexity. In our code, we’ll use 8 dimensions (which we will see later).
Similarly, we will use a separate item embedding layer to represent the traits of the items (i.e. movies) in a lower dimensional space.
You might be wondering, how can we learn the weights of the embedding layer, such that it provides an accurate representation of users and items? In our previous example, we used Bob and Joe’s preference for action and romance movies to manually create our embedding. Is there a way to learn such embeddings automatically?
The answer is Collaborative Filtering: by using the ratings dataset, we can identify similar users and movies, creating user and item embeddings learned from existing ratings.
Now that we have a better understanding of embeddings, we are ready to define the model architecture. As you’ll see, the user and item embeddings are key to the model.
Let’s walk through the model architecture using the following training sample:
The inputs to the model are the one-hot encoded user and item vectors for userId = 3 and movieId = 1. Because this is a positive sample (movie actually rated by the user), the true label (interacted) is 1.
The user input vector and item input vector are fed to the user embedding and item embedding respectively, which results in smaller, denser user and item vectors.
The embedded user and item vectors are concatenated before passing through a series of fully connected layers, which map the concatenated embeddings into a prediction vector as output. At the output layer, we apply a sigmoid function to turn the output into a probability, from which we obtain the most probable class. In the example above, the most probable class is 1 (positive class), since 0.8 > 0.2.
Now, let’s define this NCF model using PyTorch Lightning!
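As a hedged sketch of the architecture just described, here is a plain PyTorch module (deliberately not a full PyTorch Lightning module, so it omits the training loop and data loading). The 8-dimensional embeddings come from the text above; the hidden layer widths of 64 and 32, the single-probability output, and the toy IDs are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    """Embeds users and items, concatenates them, and predicts P(interaction)."""

    def __init__(self, num_users, num_items, embedding_dim=8):
        super().__init__()
        # An embedding lookup by integer ID is equivalent to multiplying a
        # one-hot vector by the embedding weight matrix.
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)
        # Hidden sizes 64 and 32 are assumed, not prescribed by the text.
        self.mlp = nn.Sequential(
            nn.Linear(2 * embedding_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, user_ids, item_ids):
        user_vec = self.user_embedding(user_ids)
        item_vec = self.item_embedding(item_ids)
        x = torch.cat([user_vec, item_vec], dim=-1)
        # Sigmoid squashes the output into a probability of interaction.
        return torch.sigmoid(self.mlp(x)).squeeze(-1)

# Toy batch (made up): one positive and two negative samples.
model = NCF(num_users=5, num_items=10)
users = torch.tensor([3, 3, 4])
items = torch.tensor([1, 7, 2])
labels = torch.tensor([1.0, 0.0, 0.0])
probs = model(users, items)
loss = nn.functional.binary_cross_entropy(probs, labels)
print(probs.shape, loss.item())
```

Because the labels are binary (interacted or not), binary cross-entropy is a natural loss for this setup. Wrapping the same layers in a Lightning module mainly adds the training step and optimizer configuration around this forward pass.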
Let’s train our NCF model for 5 epochs using the GPU.
Note: One advantage of PyTorch Lightning over vanilla PyTorch is that you don’t need to write your own boilerplate training code. Notice how the Trainer class allows us to train our model with just a few lines of code.
Evaluating our Recommender System
Now that we have trained our model, we are ready to evaluate it using the test data. In traditional Machine Learning projects, we evaluate our models using metrics such as Accuracy (for classification problems) and RMSE (for regression problems). However, such metrics are too simplistic for evaluating recommender systems.
To design a good metric for evaluating recommender systems, we need to first understand how modern recommender systems are used.
Looking at Netflix, we see a list of recommendations like the one below:
Similarly, Amazon uses a list of recommendations:
The key here is that we don’t need the user to interact with every single item in the list of recommendations. Instead, we just need the user to interact with at least one item on the list — as long as the user does that, the recommendations have worked.
To simulate this, let’s run the following evaluation protocol to generate a list of top 10 recommended items for each user.
- For each user, randomly select 99 items that the user has not interacted with.
- Combine these 99 items with the test item (the actual item that the user last interacted with). We now have 100 items.
- Run the model on these 100 items, and rank them according to their predicted probabilities.
- Select the top 10 items from the list of 100 items. If the test item is present within the top 10 items, then we say that this is a hit.
- Repeat the process for all users. The Hit Ratio is then the fraction of users for whom we got a hit.
This evaluation protocol is known as Hit Ratio @ 10, and it is commonly used to evaluate recommender systems.
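The protocol above can be sketched in a few lines of Python. This is an illustrative version, not the notebook's code: `score_fn` is a hypothetical stand-in for the trained model (any function returning a score for a user and item), `interacted_by_user` is assumed to contain every item each user has interacted with, including the held-out test item, and the catalogue is assumed to be large enough to draw 99 unseen items per user.

```python
import random

def hit_ratio_at_k(score_fn, test_pairs, interacted_by_user, all_item_ids,
                   k=10, num_negatives=99, seed=0):
    """Fraction of users whose held-out item ranks in the top k of 100 candidates."""
    rng = random.Random(seed)
    all_items = list(all_item_ids)
    hits = 0
    for user, test_item in test_pairs:
        seen = interacted_by_user[user]
        # Randomly select items the user has not interacted with.
        unseen = [item for item in all_items if item not in seen]
        candidates = rng.sample(unseen, num_negatives) + [test_item]
        # Rank all 100 candidates by predicted score, highest first.
        ranked = sorted(candidates, key=lambda item: score_fn(user, item),
                        reverse=True)
        if test_item in ranked[:k]:
            hits += 1
    return hits / len(test_pairs)

# Toy setup (made up): 200 items, 2 users, one held-out item per user.
all_item_ids = range(200)
interacted_by_user = {1: {5, 17, 42}, 2: {3, 42, 99}}
test_pairs = [(1, 42), (2, 99)]
held_out = dict(test_pairs)

# Two extreme stand-in "models": one always ranks the held-out item first,
# the other always ranks it last.
best_model = lambda user, item: 1.0 if held_out[user] == item else 0.0
worst_model = lambda user, item: 0.0 if held_out[user] == item else 1.0

best_score = hit_ratio_at_k(best_model, test_pairs, interacted_by_user, all_item_ids)
worst_score = hit_ratio_at_k(worst_model, test_pairs, interacted_by_user, all_item_ids)
print(best_score, worst_score)
```

The two extreme scorers bracket the metric: a real model falls somewhere in between, and a model that scores items at random would land near 10/100 on average.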
Hit Ratio @ 10
Now, let’s evaluate our model using the described protocol.
We got a pretty decent Hit Ratio @ 10 score! To put this into context, what this means is that 86% of the users were recommended the actual item (among a list of 10 items) that they eventually interacted with. Not bad!
I hope that this has been a useful introduction to creating a deep learning based recommender system. To learn more, I recommend the following resources: