How to improve recommendations for highly sparse datasets using Hybrid Recommender Systems?

Original article was published by Caboom.ai on Artificial Intelligence on Medium



Methods for generating recommendations and how they work

Author: Anish Pandey

When it comes to recommender systems, the default choices are usually Collaborative Filtering, which makes use of the wisdom of all the users, or Content-based Filtering, which is based on the user’s past interactions and item attributes.

Let’s briefly have a look at the methods for generating recommendations and how they work.

Collaborative Filtering

Collaborative filtering (CF) is a method for generating recommendations by calculating a user’s preference score for an item from the historical preferences of other, similar users for that item. The algorithm looks only at the recorded interactions with an item, not at the item’s attributes, so it is domain agnostic as long as we have sufficient historical interaction data.

It works on the assumption that if user A likes items X, Y, and Z, and user B likes items X, Y, and J, then user A will probably like item J and user B will probably like item Z.

Note that these predictions are specific to the user but make use of the interactions gathered from many users. This can lead to recommending a diverse set of items that the user might not otherwise have discovered.

This method is mostly used on e-commerce websites. If you have come across recommendations under a section titled “People who liked item X also liked item Y” you have seen Collaborative Filtering in action.
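The A/B example above can be sketched in a few lines of Python. This is a minimal illustration, not a production algorithm: the `likes` data, the extra user C, and the choice of Jaccard similarity between liked-item sets are all assumptions made for the example.

```python
# Hypothetical toy data mirroring the example above (user C is added for contrast).
likes = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y", "J"},
    "C": {"Q"},
}

def jaccard(a, b):
    """Overlap between two liked-item sets (an assumed choice of similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, likes, top_n=3):
    """Score unseen items by summing the similarity of the users who liked them."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        sim = jaccard(likes[user], items)
        if sim == 0:
            continue  # users with nothing in common contribute nothing
        for item in items - likes[user]:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

print(recommend("A", likes))  # ['J']
print(recommend("B", likes))  # ['Z']
```

User C shares no items with A or B, so C's item Q is never recommended to them. Note that no item attributes are used anywhere.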

Content-Based Filtering

In contrast to Collaborative Filtering, Content-based filtering (CBF) uses the attributes/features of the items directly to recommend items to a user. It builds a “user profile” that captures the user’s preference for each item attribute and then finds the items most similar to that profile. Concretely, the user’s profile is a vector in the same dimensions as the item attribute vectors, with weights computed from the user’s historical interactions with the items.

Note that the recommendations are specific to this user, as the model did not use any information from other users. Thus, this type of recommendation does not provide a diverse range of choices as they are only based on items a user has previously interacted with.

This type of recommendation focuses on personal preferences and item content. It is very useful in domains such as job recommendations, where a good match depends on taking the user’s own choices into account.
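The profile-vector idea described in this section can be sketched as follows. The item names, the three attribute dimensions, and the ratings are hypothetical, and the profile is built as a plain rating-weighted average (a deliberate simplification: a low rating still adds a small positive weight, where a real system might mean-center ratings first).

```python
from math import sqrt

# Hypothetical item attribute vectors over the dimensions [action, comedy, romance].
items = {
    "movie_1": [1.0, 0.0, 0.0],
    "movie_2": [1.0, 1.0, 0.0],
    "movie_3": [0.0, 0.0, 1.0],
    "movie_4": [1.0, 0.0, 1.0],
}
# Hypothetical ratings given by a single user.
ratings = {"movie_1": 5.0, "movie_3": 1.0}

def build_profile(ratings, items):
    """User profile = rating-weighted average of the rated items' attribute vectors."""
    dims = len(next(iter(items.values())))
    total = sum(ratings.values())
    profile = [0.0] * dims
    for item, rating in ratings.items():
        for d in range(dims):
            profile[d] += rating * items[item][d] / total
    return profile

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(ratings, items, top_n=2):
    """Rank items the user has not rated by similarity to the user's profile."""
    profile = build_profile(ratings, items)
    candidates = [i for i in items if i not in ratings]
    return sorted(candidates, key=lambda i: -cosine(profile, items[i]))[:top_n]

print(recommend(ratings, items))  # ['movie_4', 'movie_2']
```

Only this user's own ratings and the item attributes are used; no other user's data enters the computation.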

Hybrid Recommender Systems

As useful as these methods are, they have some limitations that we need to be aware of. For example, neither CF nor CBF works for new users for whom we have few or no historical interactions. They also don’t make use of other information we might have about the user, such as demographics or location.

Ideally, we would like to make use of all the information available in different data sources and also combine the algorithmic strengths of various recommender systems to make more robust recommendations. Hybrid recommender systems are used for these purposes, and we explore them more through the rest of this blog.

Most recommender systems these days use a hybrid approach by combining CF and CBF approaches. However, there is no reason why several other techniques cannot be hybridized.

Hybrid approaches can be implemented in several ways:

  1. Making content-based and collaborative-based recommendations separately and then combining them.
  2. Adding content-based capabilities to a collaborative-based approach (and vice versa).
  3. Unifying the approaches into one model.
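The first approach in the list (score separately, then combine) is the simplest to illustrate. The sketch below assumes each recommender has already produced a dictionary of item scores on the same scale; the function name `blend_scores`, the weight `alpha`, and the sample scores are hypothetical.

```python
def blend_scores(cf_scores, cbf_scores, alpha=0.5):
    """Weighted average of CF and CBF scores.

    alpha is the weight on the CF score. An item scored by only one
    recommender keeps that single score unchanged (an assumed fallback).
    """
    blended = {}
    for item in set(cf_scores) | set(cbf_scores):
        if item in cf_scores and item in cbf_scores:
            blended[item] = alpha * cf_scores[item] + (1 - alpha) * cbf_scores[item]
        else:
            blended[item] = cf_scores.get(item, cbf_scores.get(item))
    return blended

# Hypothetical predicted scores from the two separate recommenders.
cf = {"X": 4.0, "Y": 2.0}
cbf = {"X": 3.0, "Z": 5.0}

blended = blend_scores(cf, cbf, alpha=0.5)
ranking = sorted(blended, key=blended.get, reverse=True)
print(ranking)  # ['Z', 'X', 'Y']
```

In practice the weight would be tuned on held-out data, and the two score sets would usually need normalizing before they are averaged.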

Several studies that empirically compare the performance of hybrid recommenders with pure CF and CBF methods have shown that hybrid methods can provide more accurate recommendations than pure approaches. These methods can also be used to overcome some of the common problems in recommender systems such as the cold start and the sparsity problem.

Netflix is a good example of the use of hybrid recommender systems. The website makes recommendations by comparing the view and search habits of similar users (i.e., collaborative filtering) as well as by offering movies that share characteristics with films that a user has rated highly (content-based filtering).

Content Boosted Collaborative Filtering for recommender systems

Given the many ways recommenders can be combined, there are many variations of hybrid recommenders in use. In this blog we will focus on one of the earliest and most successful hybrid recommender algorithms, the Content Boosted Collaborative Filtering (CBCF) algorithm.

CBCF is a type of hybrid recommendation technique that uses a combination of content-based filtering and collaborative filtering. Its main idea is to overcome the sparsity problem that degrades the performance of collaborative filtering algorithms by using item content to make the user-item interaction matrix dense.

Its benefits over the pure CF and CBF methods are:

  1. Reduces user-item data sparsity.
  2. Handles cold-start problems when users have no or very few interactions.
  3. Handles the first-rater problem for new items that have not been rated or interacted with by any users.

How does content-boosted collaborative filtering work?

As mentioned, the basic idea behind CBCF is to use content-based filtering to convert a sparse user-item interaction matrix into a dense matrix, and then use CF to make the recommendations.

This is shown in Figure 1.

Fig 1. Basic mechanism of converting a sparse matrix to dense

The overall architecture of the recommendation system is shown in Figure 2.

Fig 2. The architecture of CBCF based recommender system

This architecture tries to predict the preference for items the user has not yet interacted with, based on similar items the user has rated. To do this, it groups the interacted items that are most similar in content and then weights each of these neighboring items according to the user’s preferences. The average of these weighted content feature vectors is then used to calculate the predicted preference for the item. This way we obtain a predicted score for every item for each user, and these scores form the new, dense user-item matrix.
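The fill-in step can be sketched as below. This is a simplified reading of the description above, not the exact architecture in Figure 2: each missing score is predicted as the similarity-weighted average of the user's ratings on the `k` most content-similar items they rated. The data, the cosine similarity, and the mean-rating fallbacks are assumptions.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def densify(ratings, item_features, k=2):
    """Keep observed ratings and fill every missing cell with a content-based prediction."""
    # Assumes at least one rating exists overall.
    all_ratings = [r for rated in ratings.values() for r in rated.values()]
    global_mean = sum(all_ratings) / len(all_ratings)
    dense = {}
    for user, rated in ratings.items():
        dense[user] = dict(rated)
        # Fallback when content gives no signal: the user's mean, else the global mean.
        fallback = sum(rated.values()) / len(rated) if rated else global_mean
        for item, feats in item_features.items():
            if item in rated:
                continue
            # The k rated items most similar in content to the target item.
            neighbours = sorted(
                ((cosine(feats, item_features[j]), r) for j, r in rated.items()),
                reverse=True,
            )[:k]
            weight = sum(s for s, _ in neighbours)
            if weight > 0:
                dense[user][item] = sum(s * r for s, r in neighbours) / weight
            else:
                dense[user][item] = fallback
    return dense

# Hypothetical item content vectors and a sparse set of ratings (u3 is a new user).
item_features = {"i1": [1.0, 0.0], "i2": [1.0, 1.0], "i3": [0.0, 1.0]}
ratings = {"u1": {"i1": 5.0}, "u2": {"i1": 4.0, "i3": 2.0}, "u3": {}}

dense = densify(ratings, item_features)
print(round(dense["u2"]["i2"], 2))  # 3.0
```

The resulting `dense` mapping has a score for every user-item pair and could then be passed to any CF algorithm in place of the original sparse matrix.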

Variation using Deep Learning Prediction Model for a better recommendation

With the advent of modern deep learning techniques, several variations of the content-based model have been developed. These methods create an item profile (feature vector) and are especially useful for the text-based features that are so prevalent, such as item reviews and descriptions.

Fig 3. Variation using Deep Learning-based Item Profiles

In this method, the item profiles are created using feature engineering techniques as before, but they are not grouped into clusters. The user profile is created from the user’s historical item interactions. The two profiles are then combined into a single feature vector, and the deep learning model is trained on these vectors. Once the model is trained, the profile of each item the user has not interacted with is combined with the user profile in the same way, and the model’s prediction for that feature vector is used to fill the sparse matrix and make it dense.
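The data flow described above (combine profiles, train, predict the missing cells) can be sketched without any deep learning library. In the sketch below a plain linear model trained by stochastic gradient descent stands in for the deep network, purely to keep the example self-contained; the profiles, ratings, and hyperparameters are hypothetical.

```python
def make_feature(user_profile, item_profile):
    """Combine the two profiles into one input vector by concatenation."""
    return list(user_profile) + list(item_profile)

def train_linear(samples, targets, epochs=500, lr=0.05):
    """Stand-in for the deep model: linear regression fitted by SGD on squared error."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(model, x):
    """Predicted rating for one combined feature vector."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical profiles and observed ratings.
item_profiles = {"i1": [1.0, 0.0], "i2": [0.0, 1.0], "i3": [1.0, 1.0]}
user_profiles = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
ratings = {("u1", "i1"): 5.0, ("u1", "i2"): 1.0,
           ("u2", "i1"): 1.0, ("u2", "i2"): 5.0, ("u2", "i3"): 3.0}

# Train on the observed (user, item) pairs only.
X = [make_feature(user_profiles[u], item_profiles[i]) for (u, i) in ratings]
y = list(ratings.values())
model = train_linear(X, y)

# Fill every unobserved cell with the model's prediction.
dense = dict(ratings)
for u in user_profiles:
    for i in item_profiles:
        if (u, i) not in dense:
            features = make_feature(user_profiles[u], item_profiles[i])
            dense[(u, i)] = predict(model, features)
```

A linear model on concatenated features cannot represent interactions between user and item attributes, which is one reason a nonlinear model is used for this step in practice.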

Evaluation and Result

In order to test the algorithm, we used the MovieLens 20M Dataset, which is widely used for recommender system testing and benchmarking.

The results we got can be seen below.

Pure CF method based on SVD Matrix Factorization

The default sparsity of the user-item matrix was 99.28%, and the average RMSE was around 1.0682 with minimal standard deviation.
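For reference, sparsity here means the share of user-item cells with no recorded interaction. A minimal sketch of the calculation, using made-up counts rather than the dataset's actual figures:

```python
def sparsity(n_interactions, n_users, n_items):
    """Percentage of empty cells in the user-item matrix."""
    return 100.0 * (1 - n_interactions / (n_users * n_items))

# Hypothetical toy matrix: 3 users, 4 items, 3 recorded ratings.
print(sparsity(3, 3, 4))  # 75.0
```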

Hybrid Recommender using CBCF

After applying the CBCF algorithm with a content-based filter, the sparsity drops to 0, as the preferences for all items a user has not interacted with are replaced by predicted values. The average RMSE is around 0.1099.

As we can see, both the RMSE and MAE values have decreased considerably for the CBCF model compared to the pure CF method.
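For readers unfamiliar with the two metrics, here is a minimal sketch of how RMSE and MAE are computed from held-out ratings and the model's predictions. The numbers are hypothetical and unrelated to the results reported here.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical held-out ratings and the corresponding predictions.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.0, 3.0, 4.0, 4.0]

print(round(rmse(actual, predicted), 4))  # 1.2247
print(mae(actual, predicted))             # 1.0
```

Lower is better for both metrics, and both are expressed in the same units as the ratings.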

The time required to fit (train) and test the model does increase considerably using the hybrid method. This is something that needs to be taken into consideration in production use-cases.

In conclusion, we have seen that the recommendation algorithm needs to be chosen based on the type of dataset we have. Standard Collaborative Filtering methods, although very popular and useful, do not always work, especially when the data is very sparse.

For user-specific recommendations where user preference data is available and a high degree of personalization is sought, Content-based filtering methods are useful. Collaborative filtering works better when explicit feedback is given and data sparsity isn’t too high. When we have item attributes, hybrid recommendation methods like Content Boosted Collaborative Filtering take advantage of both content-based and collaborative filtering, and improve the models by removing data sparsity and reducing errors.