Source: Deep Learning on Medium
Collaborative filtering using fastai — a sincere perspicacity
I found collaborative filtering hard to grasp when I first read about it, and it took me many days before I understood it to my satisfaction. In this post, I will try to present the concept in the simplest possible way, using the fastai library throughout. So, bear with me and let us get started.
❓ What is Collaborative filtering
Collaborative filtering is a technique for building recommendation systems. Most major IT companies use it to learn what their users like, so that they can recommend things based on those interests. The data we get for this kind of analysis generally comes in one of two formats: a list of records, one per user and item pair, or a matrix with users along one axis and items along the other.
- Generally, we store data in the first format, with one record for each customer and product pair that actually exists. If you save it as a matrix where every combination of customer and product is a separate cell in that matrix, it’s going to be gigantic.
- So you either store it as a list of records, or you save it as a matrix using a special sparse matrix format.
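To make the difference between the formats concrete, here is a minimal sketch in plain Python. The user ids, movie ids and ratings are made up for illustration, and the dictionary is only a stand-in for a real sparse matrix format.

```python
# Record format: only the ratings that actually exist are stored.
# All ids and ratings below are invented for illustration.
ratings_long = [
    (1, 10, 4.0),  # (user_id, movie_id, rating)
    (1, 30, 5.0),
    (2, 20, 3.0),
    (3, 10, 2.0),
]

user_ids = sorted({u for u, _, _ in ratings_long})
movie_ids = sorted({m for _, m, _ in ratings_long})

# Dense matrix format: one cell per (user, movie) pair, None where no rating exists.
dense = [[None] * len(movie_ids) for _ in user_ids]
for u, m, r in ratings_long:
    dense[user_ids.index(u)][movie_ids.index(m)] = r

# Sparse format (stand-in): a dictionary keyed by (user, movie) keeps only filled cells.
sparse = {(u, m): r for u, m, r in ratings_long}

total_cells = len(user_ids) * len(movie_ids)
print(f"dense cells: {total_cells}, stored ratings: {len(sparse)}")
```

Even in this tiny example, the dense matrix has more than twice as many cells as there are ratings. With thousands of users and movies, most cells would be empty.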
For practical insights, I am using an academic dataset provided by fastai, i.e. MovieLens, created by GroupLens.
❓ What is Cold Start problem
Cold start problem mainly arises due to the below reasons:
- When a new user comes to the website, it is challenging to recommend movies to them because we have no metadata about the user.
- When a new movie arrives, we don’t have any information about that movie until some users rate it.
We may solve the cold start problem by asking new users some questions to collect metadata about them.
Behind the scenes
In a collaborative filtering problem, we start by assigning random weights to each user and each movie.
- Random weights are declared for each movie and each user. We can choose any number of weights for each movie and each user, but the count must be the same for users and movies.
- The predicted rating of a movie by a user is the dot product of the corresponding weight vectors for that movie and that user. The basic starting point of a neural net is a matrix multiplication of two matrices; that is what the first layer always is, and that is what we have done here.
- So we have just come up with a way of deciding which two matrices to multiply. Clearly, you need a vector for a user (a matrix for all the users) and a vector for a movie (a matrix for all the movies); multiply them together and you get some numbers. They don’t mean anything yet. They’re just random numbers.
- Once a rating is predicted, we calculate the RMSE loss between the actual rating and the predicted rating. Then the usual neural network training takes over, and the weights are adjusted to reduce the loss.
- This is how the weights are set for each particular user and each particular movie.
- We have now defined the most boring neural network possible for this case: a single linear layer and a non-linear layer (which you will learn about in a minute) at the end. But that’s all you need to solve the problem. 🆓
- The only drawback of the above is that it does not use sparse storage. But we will look into that soon.
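The steps above can be sketched in a few lines of plain Python. This is only a toy illustration under my own assumptions, not what fastai does internally. The ratings are invented, the learning rate and epoch count are arbitrary, and the update uses the gradient of the squared error (with the constant factor folded into the learning rate) while RMSE is used for reporting.

```python
import math
import random

rng = random.Random(0)

# Invented (user_index, movie_index, rating) triples, for illustration only.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 2, 2.0), (2, 1, 5.0), (2, 2, 4.0)]
n_users, n_movies, n_factors = 3, 3, 2

# Step 1: random weights, the same number for every user and every movie.
user_w = [[rng.uniform(0, 1) for _ in range(n_factors)] for _ in range(n_users)]
movie_w = [[rng.uniform(0, 1) for _ in range(n_factors)] for _ in range(n_movies)]

def predict(u, m):
    # Step 2: predicted rating = dot product of the user and movie vectors.
    return sum(a * b for a, b in zip(user_w[u], movie_w[m]))

def rmse():
    # Step 3: root mean squared error between actual and predicted ratings.
    return math.sqrt(sum((r - predict(u, m)) ** 2 for u, m, r in ratings) / len(ratings))

initial_loss = rmse()

# Step 4: stochastic gradient descent on the squared error, one rating at a time.
lr = 0.05
for _ in range(500):
    for u, m, r in ratings:
        err = r - predict(u, m)
        for k in range(n_factors):
            u_k, m_k = user_w[u][k], movie_w[m][k]
            user_w[u][k] += lr * err * m_k
            movie_w[m][k] += lr * err * u_k

final_loss = rmse()
print(f"RMSE before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

The weights start out meaningless, and the only thing that gives them meaning is the repeated adjustment that reduces the loss.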
Some Practical Stuff
Now, I will write the code using the fastai library. We will go through it bit by bit and learn more along the way.
Fastai has created a sample dataset of movie reviews, and I will use that.
from fastai import *
from fastai.collab import *

path = untar_data(URLs.ML_SAMPLE)
path  # PosixPath('/root/.fastai/data/movie_lens_sample')

ratings = pd.read_csv(path/'ratings.csv')
As usual, we will now create the data bunch. I will use CollabDataBunch, which fastai defines specifically for collaborative filtering.
data = CollabDataBunch.from_df(ratings, seed=42)
After the data bunch is defined, we need to define the learner. Just as we used CollabDataBunch for the data, we use collab_learner for the learner.
y_range = [0,5.2]
learn = collab_learner(data, n_factors=50, y_range=y_range)
- collab_learner creates a Learner for collaborative filtering on the given data.
- y_range restricts the output to the given range. Internally, collab_learner applies a non-linear sigmoid function as the last layer, scaled so that it asymptotes to the lower and upper bounds of the range. We have chosen the above range so that the predicted ratings are always more than 0 and less than 5.2. Adding the sigmoid is not strictly necessary: our model could learn to predict ratings in the right range on its own by adjusting its weights. But we want the model to spend its effort on learning something within the range we care about, so we constrain the outcome, i.e. the rating, to lie between 0 and 5.2.
- n_factors: We will learn about it later on, but for now you may think of it as the number of random weights that we defined for each user and each movie above. Later, the neural network treats those weights as different features/factors, such as whether the movie has a particular actress/actor, or whether the user likes movies featuring that actor/actress. These factors help the neural network learn more about the movie and the user.
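To see what the y_range restriction does to a raw score, here is a small sketch of a sigmoid rescaled to a range. The function name scaled_sigmoid is my own, and this is an illustration of the idea rather than the fastai source.

```python
import math

def scaled_sigmoid(x, y_range=(0.0, 5.2)):
    """Squash any real-valued score into the open interval (low, high)."""
    low, high = y_range
    return low + (high - low) / (1.0 + math.exp(-x))

# Very negative scores approach the lower bound, very positive ones the upper bound.
for raw_score in (-10.0, 0.0, 2.0, 10.0):
    print(f"{raw_score:6.1f} -> {scaled_sigmoid(raw_score):.3f}")
```

Because a sigmoid only approaches its bounds and never reaches them, an upper bound a little above the highest possible rating lets the model get close to a full 5.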
This is how we use fastai for collaborative filtering. Now we will dig into the concepts and find out everything behind the above horrifying code. 😆
❓ What is inside collab_learner
If you look at the fastai code for collab_learner, you will see that it can create one of two types of models.
Let’s look into one of them, EmbeddingDotBias.
- We are creating an embedding matrix for the random weights/parameters that we initialized as discussed above. Why we create an embedding matrix, and what it is used for, we will discuss below. For now, let us be clear that we are creating something like an embedding to store the weights defined for the movies and the users. If you look at the code that creates the embedding matrix, it simply calls the PyTorch embedding.
- If you know the basics of PyTorch nn.Module, you will know that forward is not called by hand: the dunder init function sets the model up, and forward then runs automatically every time the model object is called on an input. What happens inside the forward function is already well known to us.
- We take the dot product of the user weights and the movie weights, then add the biases to it. Finally, we check whether y_range has been defined: if it has, we apply the sigmoid function; otherwise we return the value as is.
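The flow described above can be imitated in plain Python. The class below is a hypothetical stand-in that I wrote for illustration. It is not the fastai EmbeddingDotBias source and uses lists instead of PyTorch embeddings, but it follows the same steps: dot product, add the biases, then an optional sigmoid.

```python
import math

class DotBiasSketch:
    """Plain-Python stand-in for a dot-product-plus-bias model (not the fastai source)."""

    def __init__(self, user_w, movie_w, user_b, movie_b, y_range=None):
        self.user_w, self.movie_w = user_w, movie_w
        self.user_b, self.movie_b = user_b, movie_b
        self.y_range = y_range

    def __call__(self, user, movie):
        # PyTorch's nn.Module behaves similarly: calling the object runs forward().
        return self.forward(user, movie)

    def forward(self, user, movie):
        dot = sum(u * m for u, m in zip(self.user_w[user], self.movie_w[movie]))
        res = dot + self.user_b[user] + self.movie_b[movie]
        if self.y_range is None:
            return res
        low, high = self.y_range
        return low + (high - low) / (1.0 + math.exp(-res))

# One user and one movie with invented weights and biases.
unbounded = DotBiasSketch([[0.5, 1.0]], [[1.0, 0.5]], [0.25], [-0.25])
bounded = DotBiasSketch([[0.5, 1.0]], [[1.0, 0.5]], [0.25], [-0.25], y_range=(0, 5.2))
print(unbounded(0, 0), bounded(0, 0))
```

Notice that nobody calls forward directly: the model object is called like a function, and forward runs behind the scenes.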
❓ What is Embedding
Embedding is just an array lookup. An embedding matrix is a matrix of weights that you can look up into and grab one item, i.e. one vector, out of. As the name suggests, it embeds the information about each item in a vector. It is a shortcut for multiplying a one-hot encoded matrix by another matrix. When we say we have an embedding matrix for the users and for the movies, we mean the below.
- The vector in yellow is the embedding vector for the user with id 0.82. The vector in red is the embedding vector for the user with id 1.26.
- Similarly, the vector in yellow is the embedding vector for the movie with id 2.39. The vector in red is the embedding vector for the movie with id 1.13.
Now, the question arises ❔ Why do we use Embedding
As stated earlier, an embedding is equivalent to multiplying a one-hot encoded version of something by the input weights. For the user embedding matrix, it is the multiplication of the one-hot encoded user matrix and the user weights defined above. For a movie, it is the multiplication of the one-hot encoded movie matrix and the movie weights described above. So, let us understand it diagrammatically.
- In the above operation, we are basically multiplying the one-hot encoded user matrix with the user weights matrix.
- For each one-hot encoded user vector, the position of the 1 matches the position of that user’s vector in the user weights matrix, i.e. for the user with input id 1, the vector in the user weights matrix is at the same index.
- Therefore, instead of doing one more matrix multiplication, we can do an array lookup, which saves memory and is also faster.
- That’s why we define and use embeddings: an array lookup.
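The equivalence between the one-hot multiplication and the array lookup is easy to check by hand. Here is a minimal sketch with a made-up user weights matrix. The helper names one_hot and vector_matrix_product are my own.

```python
# Invented user weights matrix: one row (embedding vector) per user, 3 factors each.
user_weights = [
    [0.1, 0.2, 0.3],  # user index 0
    [0.4, 0.5, 0.6],  # user index 1
    [0.7, 0.8, 0.9],  # user index 2
]

def one_hot(index, size):
    # A vector of zeros with a single 1.0 at the given index.
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_matrix_product(vec, matrix):
    # Multiply a row vector by a matrix, one output column at a time.
    return [sum(v * row[col] for v, row in zip(vec, matrix)) for col in range(len(matrix[0]))]

user_index = 1
via_matmul = vector_matrix_product(one_hot(user_index, len(user_weights)), user_weights)
via_lookup = user_weights[user_index]

print(via_matmul, via_lookup)
```

Both routes give the same vector, but the lookup never builds the one-hot vector and never multiplies by all those zeros.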
❓What are latent features
There is important meaning behind the weights defined for users and movies. We can think of each weight as representing some feature of the user or the movie. I am not saying that every weight stands for something exciting, but it is a useful way to think. If, after optimizing the weights through gradient descent, a particular weight is high for a user and the corresponding weight is also high for a movie, we might interpret it as the user liking a specific actress, say Priyanka Chopra, and the movie having that actress in the lead role. Or we could infer that it is an animated movie and the user likes animated movies. I am still sceptical about what exactly happens inside the neural network, but the only way gradient descent could possibly come up with a good answer is by figuring out what the aspects of movie taste are and what the corresponding features of movies are. These underlying features are called latent factors or latent features. They are hidden things that were there all along, and once we train the neural net, they suddenly appear, or rather, we suddenly try to connect the dots.
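As a rough intuition for the paragraph above, here is a tiny sketch with hand-picked factors. The labels are purely hypothetical: a trained model does not name its factors, and a real factor rarely maps to one clean concept.

```python
# Hypothetical factor order: [animation, action]. All numbers are invented.
user_taste = [0.9, 0.1]      # likes animated movies, mostly indifferent to action
animated_movie = [0.8, 0.2]  # strongly animated
action_movie = [0.1, 0.9]    # strongly action

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

score_animated = dot(user_taste, animated_movie)
score_action = dot(user_taste, action_movie)
print(f"animated: {score_animated:.2f}, action: {score_action:.2f}")
```

The score is high only when the user and the movie are both high on the same factor, which is exactly the matching described above.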
Now, if you recall the basic neural networks that you may have learned about, we always add a term besides the matrix multiplication, known as the bias. We will add a bias in the above case too. Refer to EmbeddingDotBias() above and see that we declare a bias for both users and movies. You may think that adding a bias is a routine detail, but here it carries a meaning of its own. Let us understand what this bias means.
❓ What does bias define
Consider the below two cases:
- Think of a particular user who generally rates movies high, even when a movie is bad and has a very low rating overall.
- Or think of movies starring Will Smith: everybody loves him and his movies, so the weights for a Will Smith movie are very high. But then comes a movie which has Will Smith in it and isn’t very great.
In both of the above situations, we cannot change the general weights to handle a few specific exceptions, so we define biases for users and movies. Thus, the bias for that poor Will Smith film will be low, while the bias for the user who generally rates movies high will be high. The biases are added to the result of multiplying the weight matrices together. We may also use these biases to interpret the results.
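Here is a small sketch of how a bias shifts a prediction without touching the shared weights. All the numbers are invented, and the function is my own illustration rather than library code.

```python
def predict(user_vec, movie_vec, user_bias, movie_bias):
    # Dot product of the factor vectors, plus one offset per user and one per movie.
    dot = sum(u * m for u, m in zip(user_vec, movie_vec))
    return dot + user_bias + movie_bias

# Invented numbers: two movies share the same factors (same star, same genre).
fan = [1.0, 0.5]
star_movie_factors = [2.0, 1.0]

good_film = predict(fan, star_movie_factors, user_bias=0.0, movie_bias=0.5)
weak_film = predict(fan, star_movie_factors, user_bias=0.0, movie_bias=-1.0)
# A user-level offset moves every prediction for that user by the same amount.
with_user_offset = predict(fan, star_movie_factors, user_bias=0.75, movie_bias=-1.0)
print(good_film, weak_film, with_user_offset)
```

The factor vectors are identical in all three calls. Only the biases differ, and they alone account for the different predictions.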