Creating a movie recommender using Convolutional Neural Networks

After finishing your favorite series on Netflix or a video on YouTube, you have to make an important decision: what do I watch next? Most of the time you get some help from the recommendation system of your favorite video-on-demand platform. These platforms spend a lot of time and effort (see: The Netflix Recommender System: Algorithms, Business Value, and Innovation & Deep Neural Networks for YouTube Recommendations) on making your user experience as pleasant as possible and increasing your total watch time on the platform. But even with all that help, how do you choose? I mostly choose the video thumbnail/poster that is most appealing to me. With that in mind, I built a movie recommender that only takes the movie thumbnail/poster as an input. Let’s see what that looks like.

Main idea

The main idea is to create a movie poster image dataset and extract features from a pre-trained Convolutional Neural Network (ConvNet) trained on ImageNet. I will use the extracted features to recommend the 5 most similar movie posters given a target movie poster. To verify this approach I will do the same thing using a shoe image dataset.

Step 1: Web scraping movie posters

First I need some movie posters. I decided to use the TMDB 5000 Movie Dataset on Kaggle. With the information provided in the dataset, I used web scraping to download the poster images from IMDB using the Python library BeautifulSoup. I added the poster id to the name of every image and stored all 4911 successfully downloaded images in one folder.

This is the complete WebScraping.ipynb notebook.

Step 2: The recommender

I am interested in finding similar movie posters based on the visual aspects of the poster image, therefore I will use ConvNets. ConvNets are currently the go-to models when it comes to visual recognition. For my recommender, I will not train a ConvNet from scratch but use a model pre-trained on ImageNet, which saves time and gives me a state-of-the-art model out of the box. This is called “transfer learning”.

For the recommender I will use the Inception-v3 model. One of the reasons for choosing this model is its relatively small output arrays compared to the VGG16 or VGG19 models, which makes it easier to process everything in memory. I am interested in the features the model has learned, not the class probabilities. The pre-learned layers that identify shapes, patterns, etc. will hopefully come up with representations that result in meaningful recommendations. For that reason, I remove the output layer and treat the rest of the ConvNet as a feature extractor for my movie posters. See here an example of what each layer or node may learn, based on a face image dataset.

For the implementation of the recommender, I use Keras with TensorFlow as a backend. For every image in my dataset, I save the flattened output array of the last hidden layer of the model. With this new feature array, I can calculate the x nearest neighbors of a target image/poster based on the Euclidean distance between the arrays. To compare the results to a baseline, I will also show the x nearest neighbors of the given target poster using the raw flattened image array. I will not share all my code, but I will share some code snippets in this blog.
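
To make the nearest-neighbor idea concrete, here is a minimal sketch in plain Python. It is not the code used for the results in this post: the function name, the poster ids, and the 3-dimensional toy vectors are all made up for illustration, and real feature vectors would be much longer.

```python
import math

def nearest_neighbours(target_id, features, k):
    """Return the k ids whose vectors are closest (Euclidean) to the target's."""
    target = features[target_id]
    # Rank every other item by its Euclidean distance to the target vector.
    ranked = sorted(
        (other_id for other_id in features if other_id != target_id),
        key=lambda other_id: math.dist(target, features[other_id]),
    )
    return ranked[:k]

# Hypothetical toy "feature vectors"; real ones would come from the ConvNet.
toy_features = {
    "poster_a": [0.0, 0.0, 0.0],
    "poster_b": [1.0, 0.0, 0.0],
    "poster_c": [0.0, 3.0, 4.0],
    "poster_d": [1.0, 1.0, 0.0],
}

closest = nearest_neighbours("poster_a", toy_features, k=2)
print(closest)  # ['poster_b', 'poster_d']
```

The same ranking works for the raw-image baseline: only the vectors change, not the distance computation.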

Select model feature layer:

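from keras.applications.inception_v3 import InceptionV3
from keras.models import Model

# name of the layer whose output is used as the feature array
selectedlayer = "..."
# load Inception-v3 with ImageNet weights, without the classification layers
base_modelv3 = InceptionV3(weights='imagenet', include_top=False)
# model that maps an input image to the output of the selected layer
model = Model(inputs=base_modelv3.input, outputs=base_modelv3.get_layer(selectedlayer).output)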

Extract features from image:

from os import listdir
import numpy as np
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array

def preprocess_input(x):
    # scale pixel values in place from [0, 255] to [-1, 1]
    x /= 255.
    x -= 0.5
    x *= 2.
    return x

def load_photos_predict(directory):
    # uses the feature extraction model defined in the previous snippet
    images = []
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(299, 299))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # add a batch dimension, as the model expects a batch of images
        image = np.expand_dims(image, axis=0)
        # prepare the image for the model
        image = preprocess_input(image)
        # get image id (the file name up to the first dot)
        image_id = name.split('.')[0]
        # extract the features and flatten them into a 1D array
        feature = model.predict(image).ravel()
        images.append((image_id, image, feature))
    return images
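
As a quick check of what preprocess_input above does to individual pixel values, the sketch below applies the same three steps to plain floats. The helper name scale_pixel is made up for this illustration; the arithmetic is copied from the snippet.

```python
def scale_pixel(value):
    """Apply the same steps as preprocess_input to a single pixel value."""
    value /= 255.   # [0, 255] -> [0, 1]
    value -= 0.5    # [0, 1]   -> [-0.5, 0.5]
    value *= 2.     # [-0.5, 0.5] -> [-1, 1]
    return value

scaled = [scale_pixel(v) for v in (0.0, 127.5, 255.0)]
print(scaled)  # [-1.0, 0.0, 1.0]
```

So black maps to -1, mid-gray to 0, and white to 1, which is the input range the snippet feeds into the model.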

Find Nearest Neighbors:

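from sklearn.neighbors import NearestNeighbors
import pandas as pd
import numpy as np

# 6 neighbors: the target itself plus 5 recommendations
nn_num = 6
# results is assumed to be a pandas DataFrame with one feature array per image
X = list(results["modelfeatures"])
nbrs = NearestNeighbors(n_neighbors=nn_num, algorithm='ball_tree', metric="euclidean", n_jobs=-1).fit(X)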
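
Once fitted, the model can be queried with kneighbors. The sketch below uses made-up 2-dimensional points instead of real ConvNet features (the toy data and the toy_ names are assumptions for illustration only). It shows why 6 neighbors are requested to get 5 recommendations: when the query image is part of the fitted data, it comes back as its own nearest neighbor at distance 0.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy features, one row per image (real rows would be ConvNet features).
toy_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

toy_nbrs = NearestNeighbors(n_neighbors=3, algorithm='ball_tree', metric='euclidean').fit(toy_X)

# kneighbors expects a 2D array with one row per query, hence the slice toy_X[0:1].
distances, indices = toy_nbrs.kneighbors(toy_X[0:1])

# The first hit is the query itself, so drop it to get the actual recommendations.
recommended = indices[0][1:].tolist()
print(recommended)  # [1, 2]
```

With n_neighbors=3 this yields 2 recommendations; with nn_num = 6 it yields the 5 shown in the results.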

Step 3: The Results

Let’s have a look at the recommendations based on the James Bond movie Spectre. I will show 5 recommendations based on the raw flattened image array and 5 based on the extracted feature array of the Inception-v3 model.

Recommender based on raw image array vs ConvNet features for James Bond: Spectre

Although I have not defined a clear evaluation metric or used A/B testing to see which recommendation approach is best, intuitively the model results seem slightly better. The model even recommends one additional Bond movie, Never Say Never Again, as its 4th recommendation. Let’s have a look at an additional movie, Indiana Jones and the Kingdom of the Crystal Skull.

Recommender based on raw image array vs ConvNet features for Indiana Jones and the Kingdom of the Crystal Skull

The model recommends Indiana Jones and the Last Crusade as its first recommendation, which looks very nice. The others are a bit less appropriate, but it seems the ConvNet features again performed better than using only the raw image array as an input.

Show result functions:

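import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def udfsimular(indices, table):
    # translate the row indices returned by kneighbors into image IDs
    neighbours = []
    for i in range(len(indices)):
        t = indices[i]
        idv = table[(table.index == t)].iloc[0]['ID']
        neighbours.append(idv)
    return neighbours

def udfidfpathh(ids, directory):
    # build the image file path for every ID
    # wdir is the base working directory, assumed to be defined elsewhere
    paths = []
    for i in range(len(ids)):
        t = ids[i]
        filename = wdir + directory + t + ".jpg"
        paths.append(filename)
    return paths

def show5recommendations(name, table, NearestN, idnr, directory, columnfeature):
    key = table[(table.ID == idnr)].iloc[0][columnfeature]
    # kneighbors expects a 2D array of shape (n_queries, n_features)
    distances, indices = NearestN.kneighbors(np.asarray(key).reshape(1, -1))
    listindices = pd.DataFrame(indices).values.tolist()
    listindices2 = listindices[0]
    ids = udfsimular(listindices2, table)
    paths2 = udfidfpathh(ids, directory)
    fig, ((ax1, ax2, ax3, ax4, ax5, ax6)) = plt.subplots(nrows=1, ncols=6, sharex=True, sharey=True, figsize=(14,3))
    # Doing each of these manually (ugh)
    ax1.set_title(r"$\bf{" + str(name) + "}$"+"\n Target:\n"+ ids[0])
    ax2.set_title("Rec 1:\n"+ ids[1])
    ax3.set_title("Rec 2:\n"+ ids[2])
    ax4.set_title("Rec 3:\n"+ ids[3])
    ax5.set_title("Rec 4:\n"+ ids[4])
    ax6.set_title("Rec 5:\n"+ ids[5])
    # the display step is not part of the original snippet; a minimal version:
    for ax, path in zip((ax1, ax2, ax3, ax4, ax5, ax6), paths2):
        ax.imshow(plt.imread(path))
        ax.axis('off')
    plt.show()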

Step 4: Verify recommender using shoe images

Now that we have seen how well it works for movie posters, let’s use a different dataset. Websites like Amazon, Zalando, and other online shops use similar techniques to recommend products to you. For example, when the item you are looking for is out of stock, they want to recommend a similar product to you. So let’s use shoe images. The dataset I use is UT Zappos50K, with catalog images collected from Zappos.com. I used 1882 shoe images.

So let’s repeat the same approach on this dataset and see what the results are for a “black open high heel shoe”:

Recommender based on raw image array vs ConvNet features for a black open high heel shoe

These look like good recommendations based on the extracted features; the model clearly learned to distinguish between different patterns of the shoes. The recommendations based on the raw image array, in contrast, clearly do not know what an open shoe is. And how about sneakers:

Recommender based on raw image array vs ConvNet features for a sneaker shoe

Good recommendations, again.

So why are these results better than those for the movie posters?

The Inception-v3 model was trained on ImageNet to distinguish between 1000 classes. The images the model was trained on typically show one object/class per image. One of the 1000 classes it was trained on is even called “running shoe”. Predicting one object per image is what the model was trained for and should be what the model does best. The movie posters, in contrast, are far more complex in the number of objects, text, etc. Therefore, using a model trained on a different image dataset may give better results for the movie posters.

Final remarks

As we have seen, creating a movie recommender using only movie posters in combination with a pre-trained ConvNet does result in some okay(ish) recommendations. The results are slightly better than using only the raw image array. For shoes, this approach already shows some very good recommendations. It was interesting to see what the pure visual recognition features of a ConvNet can already do. Depending on the intended purpose or the industry the recommender is created for, it seems like a good additional feature to add to the feature set used to develop a state-of-the-art recommendation system.

Source: Deep Learning on Medium