Source: Deep Learning on Medium
Deep dive into what happens behind the scenes when you are busy binge-watching or listening to an automatically curated playlist
Have you ever used Netflix and been utterly astounded by how well it understands your viewing habits and what you like/dislike? Isn’t it eerily correct at times at predicting the next show you’d love to watch? Well, the backbone of such magic is a Recommendation Engine, a tool used by industry giants such as Facebook, Netflix and more. By the end of this article, you’ll have learned about various processes involved in Deep Learning and will be able to create recommendation systems on your own.
So, what is a Recommendation System?
Imagine that friend whose taste in music is almost identical to yours and who always recommends songs that make you hit the replay button again and again. Now extend the idea of this friend to a large population whose cumulative experience leads to the next song recommended for you. In Computer Science jargon, this group plays the role of a recommendation system. Technically speaking,
A recommendation engine filters the data using different algorithms and recommends the most relevant items to users. It first captures a customer's past behaviour and, based on that, recommends products the user is likely to buy.
You encounter the results of these systems daily — the next item to buy on Amazon, the next movie to watch on Netflix, the next song to play on Spotify, and even the next person to add to your friend list on Facebook.
The type of recommendation system that we will build in the article is based on the approach of Collaborative Filtering.
Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B’s opinion on a different issue than that of a randomly chosen person.
What do I need to make my Recommendation System?
Let’s assume you find an interesting application of this wizardry, so what now? The foremost thing required is data which specifies the past activities of users in the system. For our use-case, we’d assume data where various users give ratings to multiple items. This relation can be shown in the form of a matrix with users corresponding to row indices and items corresponding to column indices and their intersection cells signifying the rating given by that user to the corresponding item.
This representation is rather inconvenient: since no user rates every item, a huge number of cells would be blank (a sparse matrix). A better layout is a three-column table (the triplet representation), with each row holding a user, an item and the rating that user gave the item.
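As a sketch of the two layouts, here is a tiny, hypothetical dense matrix melted into the triplet form with pandas (all names and values are made up for illustration):

```python
import pandas as pd
import numpy as np

# A tiny, hypothetical sparse ratings matrix: rows are users, columns are items.
dense = pd.DataFrame(
    [[5.0, np.nan, 3.0],
     [np.nan, 4.0, np.nan]],
    index=["user_a", "user_b"],
    columns=["item_x", "item_y", "item_z"],
)

# Melt it into the (user, item, rating) triplet form, dropping the blank cells.
triplets = (
    dense.stack()
    .dropna()
    .rename("rating")
    .reset_index()
    .rename(columns={"level_0": "user", "level_1": "item"})
)
```

Only the three observed ratings survive, so the triplet table stays small no matter how sparse the original matrix is.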
You might need to do some data pre-processing and data cleaning on your dataset to ensure that each user and each item have enough corresponding ratings for our recommendation system to learn something meaningful.
A GPU would surely come in handy to speed up the training time of the model. Be sure to check out Google Colab, which gives you free access to a Linux instance with Python and an Nvidia GPU for up to 12 hours at a time.
This guide will assume a beginner level understanding of Deep Learning and intermediate level knowledge of Python.
Getting your Data Ready
Python makes it quite easy to fiddle around with data-frames (spreadsheet-like tables), and it is preferable to have your dataset in CSV format.
We need a unique index in the range [0, n_users) for users and [0, n_items) for items, which will be used later to map them to rows of the weight/parameter matrices in our model.
As with all machine learning pipelines, we require a mutually exclusive training set and validation set to train and then verify our model, respectively. Here validation data is a subset of user-item ratings which are hidden from the model while training.
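A minimal sketch of such a split, using a random mask over a hypothetical triplet table (the column names and hold-out fraction are assumptions):

```python
import pandas as pd
import numpy as np

# Hypothetical ratings triplets; in practice these come from your CSV.
ratings = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3"],
    "item": ["i1", "i2", "i1", "i3", "i2"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Hold out roughly 20% of the ratings as a validation set
# that the model never sees during training.
rng = np.random.default_rng(42)
mask = rng.random(len(ratings)) < 0.8
train_df = ratings[mask]
valid_df = ratings[~mask]
```

Because the mask is applied per rating rather than per user, most users appear in both sets, which is what we want: the model knows the user, just not these particular ratings.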
Batch size is a hyper-parameter we can tune to optimise training. A large batch gives a better representation of the entire dataset than a small one, which may contain irregularities that push the weights in the wrong direction. A tradeoff also needs to be balanced between the batch size and the number of epochs. Note that the batch size must not be set so high that it uses up all the GPU memory.
The following code gist retrieves the list of unique users and items in the dataset and maps them to a unique index value.
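The gist can be sketched as follows, assuming a pandas DataFrame with hypothetical user, item and rating columns:

```python
import pandas as pd

# Hypothetical ratings table; the column names are assumptions.
ratings = pd.DataFrame({
    "user": ["alice", "bob", "alice", "carol"],
    "item": ["book1", "book1", "book2", "book3"],
    "rating": [5, 3, 4, 2],
})

# Map every unique user/item to a contiguous index in [0, n_users)/[0, n_items).
unique_users = ratings["user"].unique()
unique_items = ratings["item"].unique()
user2idx = {u: i for i, u in enumerate(unique_users)}
item2idx = {m: i for i, m in enumerate(unique_items)}

# Add the encoded index columns the model will consume.
ratings["user_idx"] = ratings["user"].map(user2idx)
ratings["item_idx"] = ratings["item"].map(item2idx)
```

The `user2idx` and `item2idx` dictionaries are kept around: they are needed again at inference time to translate a raw id into the row of its embedding.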
Creating your Model
We will be using a 1-layer neural network with embedding matrices as the only parameters for our recommendation task. This differs somewhat from a typical neural net in that not all weights are used in every model call; only the rows corresponding to the users and items in the current batch are looked up.
The model contains four matrices with sizes (n_users, n_factors), (n_items, n_factors), (n_users, 1) and (n_items, 1) which represent the weights for users and items and their bias terms respectively.
The weights for a given user_id or item_id can be seen as a vector of n_factors values looked up from the corresponding embedding matrix.
Here each weight term defines a specific characteristic of the user or item. The user bias can be understood as the user's tendency to give generally high ratings, while the item bias reflects the inherent quality of that item. To get a more intuitive understanding of the model's functioning, consider the following example:
When a user_id and an item_id are passed to the model, the embedding matrices return the weights and biases at the indexes these ids map to in the user2idx and item2idx dictionaries. The two weight vectors are multiplied element-wise and summed (a dot product), and then both bias terms are added. The result is passed through a sigmoid activation scaled to span the range of possible ratings.
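A minimal PyTorch sketch of such a model; the class name, the n_factors default and the y_range values are assumptions, not the author's exact code:

```python
import torch
import torch.nn as nn

class EmbeddingDotBias(nn.Module):
    """Dot-product model: user/item embeddings plus per-user and per-item biases."""
    def __init__(self, n_users, n_items, n_factors=50, y_range=(0.5, 5.5)):
        super().__init__()
        self.user_weight = nn.Embedding(n_users, n_factors)  # (n_users, n_factors)
        self.item_weight = nn.Embedding(n_items, n_factors)  # (n_items, n_factors)
        self.user_bias = nn.Embedding(n_users, 1)            # (n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)            # (n_items, 1)
        self.y_range = y_range

    def forward(self, user_idx, item_idx):
        # Dot product of the two weight vectors, then both biases added.
        dot = (self.user_weight(user_idx) * self.item_weight(item_idx)).sum(dim=1)
        res = dot + self.user_bias(user_idx).squeeze(1) + self.item_bias(item_idx).squeeze(1)
        # Sigmoid squashed into the range of possible ratings.
        lo, hi = self.y_range
        return torch.sigmoid(res) * (hi - lo) + lo

model = EmbeddingDotBias(n_users=100, n_items=200)
```

The sigmoid scaling is why the model's output always lands inside the valid rating range, regardless of how large the dot product grows.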
Let n_factors = 1, and let uw0 be the user weight and iw0 the item weight. Let's assume the item is a book, uw0 signifies whether the user likes mystery novels, and iw0 signifies whether mystery is the genre of the book. A high value for both uw0 and iw0 will result in a high rating, as can be seen from the model. This is obvious: a user who likes to read mystery would presumably enjoy a mystery novel and thus rate it higher.
But what if the book is poorly written? In this case, the bias term for the book will be negative, which when added to the result of the above matrix dot multiplication will bring down the resultant rating by the user.
Thus, by optimising these parameters(weights & biases), the model aims to understand the semantics of the dataset.
Since the problem is similar to a regression problem, we will be using a Mean Squared Error (MSE) loss function. MSE is the average of the squared differences between our target and predicted ratings. It is defined as
MSE = (1/N) * Σᵢ (yᵢ − ŷᵢ)²
where yᵢ is the true rating, ŷᵢ the predicted rating and N the number of examples.
In our training routine, we will use an SGD optimizer function.
Stochastic gradient descent (SGD) performs a parameter update for each training example x(i) and label y(i):
θ = θ − η * ∇J(θ; x(i); y(i))
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.
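As a worked sketch of a single SGD step on one training example, using a toy 1-parameter linear model (all values here are made up for illustration):

```python
# One SGD step on a single (x, y) example for a 1-parameter model
# y_hat = theta * x with squared-error loss J = (y_hat - y) ** 2.
theta = 0.0
eta = 0.1          # learning rate (eta in the update rule above)
x_i, y_i = 2.0, 4.0

grad = 2 * (theta * x_i - y_i) * x_i  # dJ/dtheta at this one example
theta = theta - eta * grad            # theta = theta - eta * grad(J)
```

With these numbers the gradient is -16, so one step moves theta from 0.0 to 1.6, already most of the way towards the value (2.0) that fits this example exactly.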
Training your model
Training of a neural model has certain fixed steps, and each training procedure is built around them:
- Put the model in training mode
- Load a batch from the training set
- Zero the gradients of the model
- Input the data to the model
- Calculate the loss from the predicted output and required output
- Optimise the parameters of the model through back-propagation
- Repeat steps 2–6 for every batch in the training set
- Put the model in evaluation mode
- Repeat steps 2, 4 and 5 for the validation data and calculate performance metrics
- In every epoch/iteration, steps 1–9 are repeated
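The steps above can be sketched as a minimal PyTorch loop; the toy model and data here are stand-ins for the ones built in the earlier sections:

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Tiny stand-in for the embedding dot-product model described earlier.
class DotModel(nn.Module):
    def __init__(self, n_users, n_items, n_factors=8):
        super().__init__()
        self.uw = nn.Embedding(n_users, n_factors)
        self.iw = nn.Embedding(n_items, n_factors)
        self.ub = nn.Embedding(n_users, 1)
        self.ib = nn.Embedding(n_items, 1)

    def forward(self, u, i):
        dot = (self.uw(u) * self.iw(i)).sum(dim=1)
        return dot + self.ub(u).squeeze(1) + self.ib(i).squeeze(1)

# Hypothetical toy data: 3 users, 3 items, 5 ratings.
users = torch.tensor([0, 0, 1, 1, 2])
items = torch.tensor([0, 1, 0, 2, 1])
ratings = torch.tensor([4.0, 3.0, 5.0, 2.0, 4.5])
train_dl = DataLoader(TensorDataset(users, items, ratings), batch_size=2, shuffle=True)

model = DotModel(n_users=3, n_items=3)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.05)

for epoch in range(10):
    model.train()                          # 1. training mode
    for u, i, r in train_dl:               # 2. load a batch
        opt.zero_grad()                    # 3. zero the gradients
        pred = model(u, i)                 # 4. forward pass
        loss = loss_fn(pred, r)            # 5. compute the loss
        loss.backward()                    # 6. back-propagate and
        opt.step()                         #    update the parameters
    model.eval()                           # 8. evaluation mode
    with torch.no_grad():                  # 9. validation pass (reusing the
        val_loss = loss_fn(model(users, items), ratings).item()  # toy data here)
```

In a real run the validation pass would iterate over a separate validation DataLoader rather than reuse the training tensors.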
In our case, the model parameters being optimised are the weights of the embedding matrices. Through training, our main aim is for the embedding matrices to efficiently capture the semantics of the users and items, reducing the error the model makes when predicting a user's rating for an item.
Knowing when to stop training can be challenging at times. One rule of thumb preached by Jeremy Howard is to keep training your model until your validation loss starts getting worse. You needn't worry about overfitting just because your training error is lower than your validation error, as long as both losses are still decreasing. And if you think your model is underfitting, you might want to train it a little longer or increase the model's complexity by adding more layers or increasing the value of n_factors.
After you are done training your model, you will want to predict a new item for a user to try; after all, this was the primary purpose of the recommendation system. Pass the required user's id together with the ids of all items the user hasn't yet tried through the model, then sort the predicted ratings to find the items the user is most likely to enjoy.
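A sketch of this ranking step, using randomly initialised embeddings as a stand-in for a trained model (the `seen` set of already-rated items is a hypothetical example):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_users, n_items, n_factors = 3, 5, 4
uw, iw = nn.Embedding(n_users, n_factors), nn.Embedding(n_items, n_factors)
ub, ib = nn.Embedding(n_users, 1), nn.Embedding(n_items, 1)

def predict(u, i):
    # Same dot-product-plus-bias scoring described in the model section.
    return (uw(u) * iw(i)).sum(dim=1) + ub(u).squeeze(1) + ib(i).squeeze(1)

user_id = 0
seen = {0, 2}  # items this user has already rated (assumption)
candidates = torch.tensor([i for i in range(n_items) if i not in seen])

with torch.no_grad():
    scores = predict(torch.full_like(candidates, user_id), candidates)
# Sort the candidate items by predicted rating, best first.
ranked = candidates[scores.argsort(descending=True)]
```

Predicting a user for a specific item works the same way: fix the item index and score every candidate user instead.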
Another important application could be to predict a user for a specific item, which can also be easily implemented by modifying the above logic.
Voila! You have successfully built a recommendation engine. All that is left is for you to find/collect an apt dataset, an exciting application and do wonders with this new found power.
What should I do if I have a new user?
Since you don’t know the preferences and behaviour of the new user (yet), the safest bet is to recommend the most popular items to them. You could analyse the bias term of each item and recommend the ones with relatively high values, or base the popularity entirely on the item's actual average rating. As you collect the new user's ratings over time, you can gradually shift towards recommending items suited to their taste.
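The average-rating variant of this popularity baseline can be sketched with pandas (the data here is hypothetical):

```python
import pandas as pd

# Hypothetical ratings triplets.
ratings = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u1", "u2"],
    "item": ["i1", "i1", "i1", "i2", "i2"],
    "rating": [5.0, 4.0, 5.0, 2.0, 3.0],
})

# Popularity baseline for a brand-new user: rank items by their average
# rating, breaking ties by how many ratings they have received.
popularity = (
    ratings.groupby("item")["rating"]
    .agg(["mean", "count"])
    .sort_values(["mean", "count"], ascending=False)
)
top_items = popularity.index.tolist()
```

Keeping the rating count around matters in practice: an item with a single 5-star rating shouldn't automatically outrank one with a 4.7 average over thousands of ratings.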
Get your hands dirty modifying and even adding new features to this basic approach to recommendation systems. Try using more data than just the ratings in your model, look up some more complex models that can be employed for the task, or think of a novel application useful for your friends and colleagues. A follow-up project to this article could be visualising what the model has learned using the t-SNE algorithm, and analysing the embeddings to understand the unique traits of the data they represent.
Some popular datasets for collaborative filtering are the MovieLens dataset, the Jester joke dataset, the Last.fm music recommendation dataset and the Book-Crossing dataset. TIP: if any encoding errors occur while loading these datasets, try setting the encoding to ‘latin-1’ (thanks to Jeremy Howard for this neat trick).
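The tip amounts to one pandas argument; here the Latin-1 bytes are simulated in memory rather than read from one of the datasets above:

```python
import pandas as pd
from io import BytesIO

# Bytes containing a Latin-1 encoded name that would break a UTF-8 load.
raw = "user,item,rating\nJosé,book1,5\n".encode("latin-1")

# The trick from the tip above: tell pandas to decode the file as Latin-1.
df = pd.read_csv(BytesIO(raw), encoding="latin-1")
```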