Deep Learning for Collaborative Filtering (using FastAI)

Source: Deep Learning on Medium

Photo by 
Mohamed BEN ELLEFI on Profile-based Dataset Recommendation
Go to the profile of Matteo Cimini


Recommender systems are one of the most popular application of machine learning that gained increasing importance in recent years.

At a very high level, recommender systems are algorithm that make use of machine learning techniques to mimic the psychology and personality of humans, in order to predict their needs and desires.

Standard models for recommender systems work with two kinds of data:

  1. The user-item interactions, such as ratings or buying behaviour (collaborative filtering).
  2. The attribute information about the user and the item such as textual profiles or relevant informations (content-based recommender).

In this article I want to show how to construct a very simple deep learning model to solve collaborative filtering problems.

I start with a bit of background in order to clarify some basic vocabulary. 
Then a brief description of memory-based methods follows and latent factor methods are introduced.
The proposed model is then applied to the full Movielens 100k data dataset
( by making use of the fastai library by Jeremy Howard. For further details about fastai library I suggest to read this article.
Finally, a deeper insight at the problem is presented by showing the connection between factorization of the user rating matrix (latent factor models) and neural networks.


Suppose we have a set of users U={1, … , N} and a set of items I={1, … , M}.
Let rᵤᵢ ∈ ℝ be the rating of user u ∈ U give to item i ∈ I. Every user rates an item at most one. We have a set of known ratings R={rᵤᵢ| u ∈ U, i ∈ I} and denote pᵤᵢ ∈ ℝ the prediction made by the model for a rating rᵤᵢ and P the set of predictions made for R. Our goal is to minimize the loss between P and R.

The set R can be represented as a sparse matrix R where each row represents an user and each column an item. The entries of this matrix are given by rᵤᵢ. This matrix is usually very sparse.

The basic idea of collaborative filtering methods is to use either user-user similarity or item-item similarity in order to make recommendation from a ratings matrix. These methods are also referred to as memory-based methods.

There are two basic principles used in memory-based methods:

  • User based models: Similar users have similar rating on the same item. Thus, the ratings provided by like-minded users of a target user A are used in order to make the recommendations for A. Thus, the basic idea is to determine users, who are similar to the target user A, and recommend ratings for the unobserved ratings of A by computing weighted averages of the ratings of this peer group.

Example: if Alice and Bob have rated movies in a similar way in the past, then one can use Alice’s observed ratings on the movie Terminator to predict Bob’s unobserved ratings on this movie.

  • Item based models: Similar items have are rated in a similar way by the same user. In order to make the rating predictions for target item B by user A, the first step is to determine a set S of items that are most similar to target item B. The ratings in item set S, which are specified by A, are used to predict whether the user A will like item B.

Example: if Alice and Bob have rated movies in a similar way in the past, then one can use Alice’s observed ratings on the movie Terminator to predict Bob’s unobserved ratings on this movie.

Weakness of memory-based methods

Memory-base methods have several advantages related to their simplicity and intuitive approach. It is often easy to justify why a specific item is recommended, and the interpretability of item-based methods is particularly notable.

There are, however, some drawbacks to take into account when putting those models in production.
The main disadvantages of these methods is that we need to determine either similar users or similar items in order to make predictions. This operation can be impractical in large scale settings. User-based methods requires at least O(m²) time and space for the computation of similarities. This might sometimes be too slow or space-intensive when m is of the order of tens of millions.
The other main disadvantage of these methods is their limited coverage because of sparsity of the user rating matrix. For example, if none of John’s nearest neighbors have rated Terminator, it is not possible to provide a rating prediction of Terminator for John. Sparsity also creates challenges for robust similarity computation when the number of mutually rated items between two users is small.

Latent factors methods

In model-based methods, machine learning and data mining methods are used in the context of predictive models. In other words, a summarized model of the data is created avoiding the computation of similarity (either user-user similarity nor item-item similarity).

In latent factor models, we use matrix factorization to find latent factor matrices representing users and items, whose production then reconstructs the original user rating matrix.

More formally, latent factor models try to solve collaborative filtering problems by decomposing the user rating matrix R into two smaller latent factor matrices U and I with K ≪ N, M such that RUI:

  • Each row uᵢ ∈ ℝᵏ of matrix U describes the corresponding user with i=1, …, N
  • Each columns iⱼ∈ ℝᵏ of matrix I describes the corresponding item with j=1, …, M

Thus, we come up with low dimensional representation for users and items.

Rows of latent factor matrices are called embedding vectors. Thus, an embedding is a mapping form discrete objects, such as users and items, to a vector of continuous values.
This can be used to find similarities between the discrete objects, that wouldn’t be apparent to the model if it didn’t use embedding layers.

In essence the model is represented by the following equation:

Model prediction of the rating = dot product of the embeddings vector + user baias + item baias.

The role of users and items biases can be justified by the fact that there is an intrinsic nature of the entity which is independent of its interaction with other entities. 
In other words, some users are more generous compared to others. On the other end, some items can be considered generally good and are rated high by the majority of the users. Baises can capture this informations.

Latent factors methods with fastai

For our experiment, we will use the full Movielens 100k data dataset which consists of:

  • 100.000 ratings (1–5) from 943 users on 1682 movies.
  • Each user has rated at least 20 movies

After downloading the dataset we need to load fastai collab filtering module, specify the path to the data and load the csv containing the ratings.

We can also extract some useful information about the items (movies), such as title, release date and genre (numbered from 0 to 18) and merge them with the previous dataframe.

We can now create a CollabDataBunch which is a subclass of the DataBunch class specifically provided for collaborative filtering problems.
Generally, a DataBunch in fastai is the data collector object that ‘bunches’ together several PyTorch classes into one.
In PyTorch the Dataset contains all labels, and the DataLoader gives the model chunks of the items in the Dataset. The fastai DataBunch bundles a Dataset and a Dataloader (for both training and validation sets) into a single object.

Through the parameter valid_pct we define the size of our randomly chosen validation set.

Collab_learner create a learner for collaborative filtering on data. More specifically, it binds data with a model that is either an EmbeddingDotBias with n_factors if use_nn = False or an EmbeddingNN otherwise. EmbeddingDotBias is a Pytorch model to find embeddings. It is a standard multi-layer perceptron without biases and one hidden layer. EmbeddingNN, instead, creates a deeper Neural Network suitable for collaborative filtering.

Latent factors can be created using EmbeddingDotBias model. For further details refer to the last section of this article.

As further arguments, we can pass the collab learner the n_factors argument which represents the size of the embedding vectors as well as the yrange argument which specifies the range of the rating values.

Now we can find the learning rate and train the model using fit_one_cylcle. If you are not familiar with this process I would recommend the article “An AI-based pizza detector” , in which Pietro La Torre build an image classification application which could recognize a pizza taste by analyzing a photo.


Since Embeddings learned are a low dimensional representation for users and items, they should contain some interesting feature which we can extract to have a better insight.

To start off we get the 100 most popolar movies by how much reviews they have.

We also extract their biases as well as their mean ratings in order to get informations about movies that are generally rated low or high no matter what user is rating them.

In most cases, high mean ratings correspond to high biases and viceversa.

However, there are some movies with high mean rating for whom the model has provided a low bias.

Those are niche movies rated by a small number of user and cannot be considered generally good. For instance, “Romeo Is Bleeding (1993)” belongs to the lowest_bias set even though it has a quite good mean ratings of 3.0. Indeed, it has been rated by a small number of users (37).
Bias then turns out to be more descriptive indicator of the phenomenon.

A deeper insight at latent factor methods

The basic model to find these latent factor matrices is a standard multi-layer perceptron without biases and identity activation functions.

This network has three layers: an input layer L¹ with N inputs, a hidden layer L² with K units and an output layer L³ with M units.
Thus, our network has many units in the input layer has there are users and many units in the output as there are items. The size of the hidden layer determines the dimension of the latent factors.

Let Wˡ be the set of all weight connecting to the lᵗʰ layer with l=2,3
Let x be the input of the network and oₓ the corresponding output.

In a network with no biases and only identity activation functions, the output of the network is obtained by subsequently applying matrices Wˡ to the input, i.e. oₓ = W³W²x.

Let xᵤ=1ᵤ= (X₀, X₁, …) with j= 1, …, N, xⱼ=1 if j=u and xⱼ=0 otherwise be the input.

For simplicity, denote oᵤ the corresponding output and oᵤᵢ the iᵗʰ component of oᵤ.

Then, oᵤᵢ is given by the dot product between the uᵗʰ row of and the iᵗʰ column of . Therefore, we achieve the decomposition of the user rating matrix, i.e. RUI where U=W², I =W³.

Optimal and are learned with standard backpropagation algorithm and using the mean squared error as cost function.
The neural network is extended by integrating biases to further improve the final accuracy of the model. This is done by introducing a bias layer as the final layer of the network.


Collaborative filtering models use the collaborative power of the ratings provided by multiple users to make recommendations. These models can be divided into memory-based and model-based methods.

Despite their simplicity and intuitive approach, memory-based methods have some drawbacks to take into account when putting those models in production.

In this article a model-based method which makes use of matrix factorization and deep learning to find latent factor matrices representing users and items has been presented.
The proposed model has been then applied to the full Movielens 100k data dataset by making use of the fastai library.

The code covered in this article is available as a Github Repository.

Thank you for reading my post! If you liked it I would appreciate claps, follows and shares. Moreover, visit my company website, or follow our linkedin page to discover more contents.