Original article was published on Artificial Intelligence on Medium
How I implemented explainable movie recommendations using Python
I also tested whether users actually liked them. Here are the results.
This post is part 2 in my series of posts about explainable recommendations, based on my BSc dissertation. Part 1 introduces the concept of explainable recommendations, while part 3 discusses the application of post-hoc explainability in data science.
The previous post in the series discussed why improving the explainability of recommender systems matters. This is not a tutorial, but rather an overview of the approach I took in implementing a movie recommendation service with explainable recommendations. If you wish to explore it further, you can read the full dissertation or dive into the front-end or back-end codebases (which are open source!). I chose to first implement a black-box recommender system using the matrix factorisation algorithm SVD, and then implement two post-hoc explainers that would generate explanations for its recommendations. The effect of adding explanations was then tested through a web application simulating a movie recommender service.
Design and implementation
The recommender system was implemented in Python using the Surprise library. The dataset used for the problem was the publicly available MovieLens dataset, which consists of film ratings by real users. This dataset is widely used in recommender systems research, and since collaborative filtering (CF) is not domain-specific, the models and algorithms will usually generalise to other fields besides film ratings. When developing the recommender system, the development dataset containing 100,000 ratings was used to lower training time; the 20-million-rating benchmark set was used for the evaluation and the user study. Before training the latent factor model, the ratings were split into training and test data using a random 75%-25% train-test split.
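As a rough illustration of the random 75%-25% split described above, here is a minimal sketch in plain Python. It is not the code from the project: Surprise provides its own `train_test_split` helper in `surprise.model_selection`, and the toy ratings, the function name and the fixed seed below are assumptions made for the example.

```python
import random

def split_ratings(ratings, test_share=0.25, seed=0):
    """Randomly split (user, movie, rating) tuples into train and test lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]

# Toy ratings standing in for MovieLens rows (hypothetical data).
ratings = [(user, movie, 1 + (user * movie) % 5)
           for user in range(10) for movie in range(8)]

train, test = split_ratings(ratings)
print(len(train), len(test))
```

Splitting individual ratings at random means a user can appear in both sets, which is what allows the trained model to be evaluated on held-out ratings of users it already knows.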
After training, the recommender system can be used to predict ratings for the items and users in the training set. This is enough for static evaluation. However, the user study that was conducted using the system requires dynamic recommendations. Generating personalised recommendations for a newly added user with only a few ratings is not possible straight away: the model first needs to learn the user’s latent factors. This can be done by fully retraining the model, but that is infeasible in live systems, as training the model on millions of ratings is computationally expensive and would need to be repeated after every new rating. Because the existing model is “almost correct” and would work as a good starting point for adding the user, the method can be optimised in many ways depending on the underlying implementation of the SVD algorithm. For example, some systems based on gradient descent could be initialised with the original model’s weights (here, latent factors), which would allow the algorithm to converge faster. However, Surprise’s SVD runs its gradient descent for a fixed number of epochs and doesn’t stop until all epochs are done, so a better starting point would not save any time. As such, a new operation for the SVD model was devised, which adds a new user to the model. This operation only trains the new user’s latent factors, and leaves the factors for items and other users unchanged.
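The idea of adding a single user can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the operation from the project's codebase: it assumes a biased matrix factorisation model where a prediction is the global mean plus user bias plus item bias plus the dot product of the user and item factors, and the function names, hyperparameter values and toy model are all hypothetical.

```python
import random

def predict(mu, b_u, b_i, p_u, q_i):
    """Biased matrix-factorisation prediction: mu + b_u + b_i + q_i . p_u."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def fold_in_user(user_ratings, mu, item_bias, item_factors,
                 n_epochs=20, lr=0.005, reg=0.02, seed=0):
    """Learn a bias and latent factors for one new user with SGD.

    Only the new user's parameters change; mu, item_bias and item_factors
    come from the already-trained model and are left untouched.
    """
    rng = random.Random(seed)
    n_factors = len(next(iter(item_factors.values())))
    b_u = 0.0
    p_u = [rng.gauss(0, 0.1) for _ in range(n_factors)]
    for _ in range(n_epochs):
        for item, rating in user_ratings.items():
            q_i = item_factors[item]
            err = rating - predict(mu, b_u, item_bias[item], p_u, q_i)
            b_u += lr * (err - reg * b_u)
            p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    return b_u, p_u

def squared_error(user_ratings, mu, b_u, p_u, item_bias, item_factors):
    """Sum of squared errors of the user's own ratings under the model."""
    return sum((r - predict(mu, b_u, item_bias[i], p_u, item_factors[i])) ** 2
               for i, r in user_ratings.items())

# Hypothetical "trained" model with three movies and two latent factors.
mu = 3.5
item_bias = {"A": 0.2, "B": -0.1, "C": 0.0}
item_factors = {"A": [0.5, -0.2], "B": [-0.3, 0.4], "C": [0.1, 0.1]}

new_user = {"A": 5.0, "B": 2.0}
b_u, p_u = fold_in_user(new_user, mu, item_bias, item_factors)
print(predict(mu, b_u, item_bias["C"], p_u, item_factors["C"]))
```

Because each epoch only loops over the handful of ratings given by the new user, the cost is tiny compared with retraining on the whole dataset.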
The explanations I chose to add were based on association rules and influences, and were implemented in Python. The association rules explainer’s implementation followed a previously proposed method and aims to explain a recommendation by showing a rule mined from the dataset that describes which previously watched movies caused it to be recommended. The influence explanation aims to show which previously watched movies influenced the recommendation the most. The influence method used is novel, but it generates the same explanations as the previously proposed Fast Influence Analysis. It works by comparing the predicted rating of the recommended movie with and without each previously watched movie in the training set. This is achieved by repeatedly re-training the model without these individual data points. Normally this would be prohibitively expensive, but thanks to the optimised method developed for adding new users, it can be done efficiently.
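The leave-one-out comparison behind the influence explanation can be sketched like this. It is an illustrative plain-Python sketch, not the project's implementation: the single-user training routine, the biased matrix factorisation prediction, the hyperparameters and the toy model are assumptions made for the example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fold_in_user(user_ratings, mu, item_bias, item_factors,
                 n_epochs=20, lr=0.005, reg=0.02, seed=0):
    """Train only one user's bias and factors; item parameters stay fixed."""
    rng = random.Random(seed)
    n_factors = len(next(iter(item_factors.values())))
    b_u, p_u = 0.0, [rng.gauss(0, 0.1) for _ in range(n_factors)]
    for _ in range(n_epochs):
        for item, rating in user_ratings.items():
            q_i = item_factors[item]
            err = rating - (mu + b_u + item_bias[item] + dot(p_u, q_i))
            b_u += lr * (err - reg * b_u)
            p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    return b_u, p_u

def influences(user_ratings, target, mu, item_bias, item_factors):
    """Influence of each rated movie on the predicted rating of `target`.

    Influence = prediction with all ratings - prediction without that movie.
    """
    def predicted(ratings):
        b_u, p_u = fold_in_user(ratings, mu, item_bias, item_factors)
        return mu + b_u + item_bias[target] + dot(p_u, item_factors[target])

    full = predicted(user_ratings)
    return {
        movie: full - predicted({m: r for m, r in user_ratings.items() if m != movie})
        for movie in user_ratings
    }

# Hypothetical model: two similar sci-fi films and one unrelated romance.
mu = 3.5
item_bias = {"scifi_1": 0.0, "scifi_2": 0.0, "romance_1": 0.0}
item_factors = {"scifi_1": [0.9, 0.1], "scifi_2": [0.8, 0.2], "romance_1": [-0.1, 0.9]}

user = {"scifi_1": 5.0, "romance_1": 2.0}
result = influences(user, "scifi_2", mu, item_bias, item_factors)
for movie, value in sorted(result.items(), key=lambda kv: -kv[1]):
    print(movie, round(value, 3))
```

The movie with the largest positive influence is the natural candidate for an explanation of the form "recommended because you liked X". The fixed seed makes this sketch deterministic; with random initialisation, repeated runs would give somewhat different influence values.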
A user study was conducted to test the effect of explanation type on the measured trustworthiness and persuasiveness of the recommendations. To achieve this, a web application simulating a movie recommendation service was built. The front-end was built using React, and the back-end REST API was built using Flask. Besides providing access to the recommender system and explainers, the API also loaded data about real movies (e.g. titles, posters, and age ratings) through The Movie DB API. The web service was then deployed to my university web server.
Besides the two explainers, the user study also contained a baseline explanation of “This movie was recommended because you are similar to users who liked it”. The study was run by first asking the users to rate ten films they’ve seen in the past and then showing them recommendations with explanations from each category. 41 users participated in the study, and a statistically significant difference in both persuasiveness (p=0.008) and trust (p=0.001) was observed between explanation types in favour of the association rules explainer.
The explanation generators were also tested in various offline experiments. Most significantly, the association rules explainer was found to suffer from low model fidelity, the metric measuring the share of recommendations that can be made explainable. The low model fidelity caused a slight selection bias in the explanations shown in the user study: the association rule explanations were attached to recommendations of the most popular films (as these were most commonly represented in the mined association rules). It is surprising that this would result in an increased effect, since in general users ought to prefer recommendations that are better targeted to them. It is possible that the “fake” nature of the research platform affects this: the popular but less targeted content may represent films the users already knew and could easily say they wanted to watch, but would not actually watch in a real system. In other experiments, the algorithm for re-training the recommender system for a single user was shown to be just as accurate as a full re-training, and the influence calculation was shown to suffer from moderately high variance (i.e. multiple calculations of the same film’s influence can produce quite different results) due to the random variation in re-training.
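To make the model fidelity metric concrete, here is a minimal sketch of how an association rule explanation could be looked up and how fidelity could then be computed as the share of explainable recommendations. The rule format, the choice to prefer the highest-confidence rule, and the toy data are assumptions for illustration and are not taken from the project.

```python
def explain(recommended, watched, rules):
    """Return the best rule leading from watched films to the recommendation.

    Each rule is (antecedent, consequent, confidence). A rule applies when
    its consequent is the recommended film and the user has watched every
    film in its antecedent. Returns None if no rule applies.
    """
    candidates = [rule for rule in rules
                  if rule[1] == recommended and rule[0] <= watched]
    return max(candidates, key=lambda rule: rule[2], default=None)

def model_fidelity(recommendations, watched, rules):
    """Share of recommendations for which some explanation exists."""
    if not recommendations:
        return 0.0
    explained = [m for m in recommendations
                 if explain(m, watched, rules) is not None]
    return len(explained) / len(recommendations)

# Hypothetical mined rules: (antecedent, consequent, confidence).
rules = [
    (frozenset({"A"}), "B", 0.8),
    (frozenset({"A", "C"}), "D", 0.6),
    (frozenset({"E"}), "B", 0.9),   # not applicable: the user has not watched E
]
watched = {"A", "C"}
recommendations = ["B", "D", "F", "G"]

fidelity = model_fidelity(recommendations, watched, rules)
print(explain("B", watched, rules))
print(fidelity)
```

In this toy case only two of the four recommended films are the consequent of an applicable rule, so half of the recommendations would be shown without an association rule explanation.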