Source: Deep Learning on Medium
4. BPR in practice
Enough theory; let’s build a recommendation system in practice. I will use the MovieLens dataset from one of the Kaggle competitions.
4.1 Data overview
Let’s look into the data. The dataset contains user ids, movie ids, and ratings:
Some summary statistics:
- The number of unique users: 3255
- The number of unique movies: 3551
- The average number of rated movies per user: 153
- The average number of ratings per movie: 140
The distribution of the ratings looks like this:
You can find more information in a separate notebook.
4.2 Train/test split
Now we need to split the data into training and test parts. To be able to predict a score for a user-movie pair, both the user and the movie must appear in the training set. As a first step, we keep only the users who have more than 20 ratings with the value 5. As a second step, we randomly select two movies per user. That’s our test set: it won’t be used during training, but these movies should appear in the top recommendations for the corresponding users.
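The two steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the notebook: the function name `split_ratings` and its parameters are hypothetical, the held-out movies are assumed to be drawn from each user's 5-star ratings, and users who do not pass the filter simply stay in the training set.

```python
import random
from collections import defaultdict

def split_ratings(ratings, min_top=20, n_test=2, seed=0):
    """Split (user_id, movie_id, rating) tuples into train and test lists.

    Assumption: test movies are sampled from a user's 5-star ratings,
    and only for users with more than `min_top` such ratings.
    """
    rng = random.Random(seed)

    # Collect the positions of every 5-star rating, per user.
    top_rated = defaultdict(list)
    for idx, (user, _movie, rating) in enumerate(ratings):
        if rating == 5:
            top_rated[user].append(idx)

    # Hold out n_test random 5-star ratings for each qualifying user.
    held_out = set()
    for user, idxs in top_rated.items():
        if len(idxs) > min_top:
            held_out.update(rng.sample(idxs, n_test))

    train = [r for i, r in enumerate(ratings) if i not in held_out]
    test = [r for i, r in enumerate(ratings) if i in held_out]
    return train, test
```

Note that this sketch does not verify that every held-out movie still appears in the training set for some other user; a real pipeline would need that check, since the model cannot score a movie it has never seen.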
To evaluate the quality of the model on the training set, we also need ground truth for it. The assumption is that the recommendations should contain as many as possible of the highly rated movies that a specific user has already watched. We filter out all movies with a rating of less than 4 and group the data by user.
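A minimal sketch of that filtering and grouping step, assuming the training data is a list of `(user_id, movie_id, rating)` tuples (the function name `liked_movies_by_user` is hypothetical):

```python
from collections import defaultdict

def liked_movies_by_user(train, min_rating=4):
    """Map each user to the set of movies they rated min_rating or higher."""
    liked = defaultdict(set)
    for user, movie, rating in train:
        if rating >= min_rating:
            liked[user].add(movie)
    return dict(liked)
```

For example, `liked_movies_by_user([(1, 10, 5), (1, 11, 3)])` keeps only movie 10 for user 1.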
4.3 Building triplets
To train a model based on Bayesian Personalized Ranking, we need to define triplets of a user, a positive item, and a negative item. For each user, we pair every movie with a positive rating (higher than 3) with every movie with a negative rating (3 or lower).
Ideally, the difference between a positive and a negative item should be significant. Where to draw the line depends on the business task. For example, it can be very difficult to say how much more a user prefers a movie rated 4 to a movie rated 3.
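The triplet construction described in this section might look roughly like the sketch below. The function name `build_triplets` is hypothetical, and the `min_gap` parameter is an addition of this sketch rather than part of the original pipeline: it shows one possible way to require a larger rating difference between the positive and the negative item.

```python
from collections import defaultdict
from itertools import product

def build_triplets(ratings, threshold=3, min_gap=1):
    """Build (user, positive_movie, negative_movie) triplets.

    A movie is positive if its rating is above `threshold`, negative otherwise.
    `min_gap` (hypothetical) drops pairs whose rating difference is too small.
    """
    positives = defaultdict(list)  # user -> [(movie, rating), ...]
    negatives = defaultdict(list)
    for user, movie, rating in ratings:
        bucket = positives if rating > threshold else negatives
        bucket[user].append((movie, rating))

    triplets = []
    for user, pos in positives.items():
        # Pair every positive movie with every negative movie of the same user.
        for (p, p_rating), (n, n_rating) in product(pos, negatives.get(user, [])):
            if p_rating - n_rating >= min_gap:
                triplets.append((user, p, n))
    return triplets
```

With integer ratings and the default `min_gap=1`, every positive/negative pair is kept, which matches the description above. The number of triplets grows with the product of positive and negative counts per user, which is why the training set ends up much larger than the raw ratings table.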
4.4 Training the model
To build the model we will use TensorFlow 2.1. Here is the schema of the neural network:
Let’s go through the architecture:
- We have three inputs: a user, a positive item and a negative item.
- Our task is to make the difference between a positive interaction and a negative interaction significant. In this case, the positive item should be ranked higher than the negative item for a specific user:
- Multiplying the user vector by the item vector gives the user-item interaction score.
- We have two types of items but want a single representation for both. Technically, we use a single embedding layer whose weights are shared between the positive and negative items.
- Finally, as the last layer, we use a Lambda layer. This is a custom layer without any weights. We use it to wrap the function that calculates the triplet loss:
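To make the idea behind the architecture concrete, here is a framework-free NumPy sketch of the forward computation. It is not the TensorFlow model from this article: the exact loss used there is not reproduced here, so the sketch assumes the standard BPR criterion, -ln σ(x_ui - x_uj), where x_ui is the dot product of the user and item vectors. All names and sizes are illustrative.

```python
import numpy as np

def bpr_triplet_loss(user_vecs, pos_vecs, neg_vecs):
    """Per-triplet BPR loss: -log(sigmoid(score_pos - score_neg))."""
    pos_score = np.sum(user_vecs * pos_vecs, axis=-1)  # user-positive interaction
    neg_score = np.sum(user_vecs * neg_vecs, axis=-1)  # user-negative interaction
    diff = pos_score - neg_score
    # -log(sigmoid(d)) == log(1 + exp(-d)); logaddexp keeps it numerically stable.
    return np.logaddexp(0.0, -diff)

rng = np.random.default_rng(0)
n_users, n_items, latent_dim = 5, 7, 4           # toy sizes, not the real ones
user_emb = rng.normal(size=(n_users, latent_dim))
item_emb = rng.normal(size=(n_items, latent_dim))  # one table shared by positives and negatives

# A batch of two (user, positive, negative) triplets, given as row indices.
users, pos, neg = np.array([0, 1]), np.array([2, 3]), np.array([4, 5])
loss = bpr_triplet_loss(user_emb[users], item_emb[pos], item_emb[neg])
```

The loss shrinks as the positive item's score pulls ahead of the negative item's score, and equals ln 2 when the two scores are identical. Looking up positives and negatives in the same `item_emb` table is the NumPy counterpart of sharing one embedding layer.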
One of the important hyperparameters is the latent dimension. It is the length of the vector that represents each user and each movie. The right value depends on your data, but a value in the range 200–350 is a good starting point.
In total, the model has 2 382 100 trainable parameters. The training data contains 37 642 252 examples. So there should be enough data to train the current architecture of the neural network.
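As a quick sanity check on these numbers, the short calculation below relates them to the dataset statistics from section 4.1. It rests on two assumptions that the text does not state explicitly: that the only trainable weights are one user embedding table and one shared movie embedding table (no bias terms), and that the model sees exactly the 3255 users and 3551 movies listed earlier.

```python
n_users, n_movies = 3255, 3551
trainable_params = 2_382_100
training_examples = 37_642_252

# Under the assumptions above, params = (n_users + n_movies) * latent_dim.
latent_dim, remainder = divmod(trainable_params, n_users + n_movies)
print(latent_dim, remainder)  # 350 0

# Rough number of training examples available per trainable parameter.
examples_per_param = training_examples / trainable_params
print(round(examples_per_param, 1))  # 15.8
```

The division comes out exact, which is consistent with a latent dimension of 350, the upper end of the suggested range.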
4.5 Performance evaluation
As soon as training is finished, we need to evaluate the quality of the model.
One of the commonly used performance metrics is user-averaged AUC (Area Under the Curve). The idea is that the movies a specific user has rated highly should receive higher predicted scores than the other movies in the dataset.
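One way to compute this metric is sketched below in plain Python. For each user, AUC is taken as the fraction of (relevant, non-relevant) movie pairs in which the relevant movie gets the higher predicted score, with ties counted as half. The function names and the dictionary-based input format are assumptions of this sketch, not the article's evaluation code.

```python
def user_auc(scores, positives):
    """AUC for one user.

    scores: dict mapping movie_id -> predicted score.
    positives: set of movie_ids the user rated highly.
    Returns None when AUC is undefined (no positives or no negatives).
    """
    pos = [s for m, s in scores.items() if m in positives]
    neg = [s for m, s in scores.items() if m not in positives]
    if not pos or not neg:
        return None
    # Count correctly ordered pairs; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_user_auc(scores_by_user, positives_by_user):
    """Average the per-user AUC over all users for which it is defined."""
    aucs = [
        user_auc(scores, positives_by_user.get(user, set()))
        for user, scores in scores_by_user.items()
    ]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs) if aucs else float("nan")
```

The pairwise loop is quadratic in the number of movies per user, which is fine for illustration; a rank-based formula would be preferable on a full catalogue.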
Another widely used metric for evaluating algorithms that produce a ranked ordering of items is Mean Average Precision at k (MAP@k). MAP@k measures how many of the top k recommended items are relevant and how early in the list the relevant ones appear.
You can find more information about MAP@k here.
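For reference, a compact sketch of MAP@k follows. Several normalisation conventions exist; this one divides by min(number of relevant items, k), and it assumes each recommendation list contains no duplicates. The function names are illustrative.

```python
def average_precision_at_k(recommended, relevant, k=10):
    """Average precision of the first k recommended items for one user."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k)

def map_at_k(recs_by_user, relevant_by_user, k=10):
    """Mean of average_precision_at_k over all users with recommendations."""
    users = list(recs_by_user)
    total = sum(
        average_precision_at_k(recs_by_user[u], relevant_by_user.get(u, set()), k)
        for u in users
    )
    return total / len(users)
```

In the setting of this article, `relevant_by_user` would hold the two held-out test movies per user, and `recs_by_user` the top k movies by predicted score.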