Original article was published by Behic Guven on Artificial Intelligence on Medium
Content Based Recommender
A content-based recommender is a recommendation model that returns a list of items similar to a specific item. Well-known examples are the recommenders behind Netflix, YouTube, Disney+ and more. For example, Netflix recommends shows similar to the ones you have already watched and liked. With this project, you will have a better understanding of how these online streaming services’ algorithms work.
Back to the project: as the input to our model we will use the movie overviews that we checked earlier. Then we will use some ready-made scikit-learn functions to build the model. Our recommender will be ready in four simple steps. Let’s begin!
1. Define Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vector = TfidfVectorizer(stop_words='english')
movie_data['overview'] = movie_data['overview'].fillna('')
tfidf_matrix = tfidf_vector.fit_transform(movie_data['overview'])
Understanding the above code
- We import the vectorizer from the scikit-learn module. You can learn more in the scikit-learn documentation.
- The TfidfVectorizer object, created with stop_words='english', removes all English stop words such as ‘the’, ‘a’ etc.
- We replace the null (empty) overviews with an empty string so that the vectorizer doesn’t raise an error when it processes them.
- Lastly, we construct the required Tf-idf matrix by fitting and transforming the overview column.
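To make the four bullets above concrete, here is a minimal sketch on a made-up corpus. The three overviews and the toy_ names are assumptions for illustration only; they stand in for movie_data['overview'] and are not part of the original project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical overviews standing in for movie_data['overview'];
# the empty string mimics a missing overview after fillna('').
toy_overviews = [
    "A space crew fights the alien on a ship",
    "An alien lands on Earth and befriends a boy",
    "",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_matrix = toy_vectorizer.fit_transform(toy_overviews)

# vocabulary_ maps each kept term to its column in the matrix
vocabulary = toy_vectorizer.vocabulary_

print(toy_matrix.shape[0])        # one row per overview
print('the' in vocabulary)        # stop words are dropped
print('alien' in vocabulary)      # content words are kept
print(toy_matrix[2].nnz)          # the empty overview has no non-zero entries
```

Each row of the matrix is one overview and each column is one term, so an empty overview simply becomes an all-zero row instead of causing an error.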
2. Linear Kernel
We will start by importing the linear kernel function from the scikit-learn module. The linear kernel will help us create a similarity matrix. These lines take a bit longer to execute; don’t worry, that’s normal. Calculating the dot product of two huge matrices is not easy, even for machines 🙂
from sklearn.metrics.pairwise import linear_kernel
sim_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
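Why does a plain dot product work as a similarity score here? TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows is the same as their cosine similarity. The sketch below checks this on a made-up corpus; the documents and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical overviews, each with at least one non-stop word
docs = [
    "A crew hunts an alien aboard a spaceship",
    "A spaceship crew patrols the galaxy",
    "Two chefs compete to run the best restaurant in town",
]

matrix = TfidfVectorizer(stop_words='english').fit_transform(docs)

dot = linear_kernel(matrix, matrix)      # plain dot products
cos = cosine_similarity(matrix, matrix)  # explicit cosine similarity

print(np.allclose(dot, cos))             # the two matrices agree
print(np.allclose(np.diag(dot), 1.0))    # every document matches itself with score 1
```

Since the rows are already normalized, linear_kernel skips the normalization that cosine_similarity would redo, which is why it tends to be the faster choice for Tf-idf matrices.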
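Now, we have to construct a reverse map from movie titles to their row indices. In the second line, we clean up the movie titles that repeat. Note that calling drop_duplicates on this Series would compare its values (the row indices, which are already unique) rather than the titles, so we filter on the index instead and keep the first occurrence of each title.
indices = pd.Series(movie_data.index, index=movie_data['title'])
indices = indices[~indices.index.duplicated(keep='first')]
indices[:10]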
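A small sketch of how a title-to-index map behaves when a title repeats. The three-row DataFrame is a made-up stand-in for movie_data, and the names by_value and by_title are hypothetical.

```python
import pandas as pd

# Hypothetical data with one repeated title
toy = pd.DataFrame({'title': ['Alien', 'Heat', 'Alien']})

# drop_duplicates compares the Series values (0, 1, 2), which are all unique,
# so both 'Alien' entries survive
by_value = pd.Series(toy.index, index=toy['title']).drop_duplicates()

# Filtering on the index keeps only the first row for each title
by_title = by_value[~by_value.index.duplicated(keep='first')]

print(len(by_value))      # 3
print(len(by_title))      # 2
print(by_title['Alien'])  # 0
```

With a unique index, a lookup such as by_title['Alien'] returns a single integer, which is what the recommender needs to pick one row of the similarity matrix.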
4. Finally — Recommender Function
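def content_based_recommender(title, sim_matrix=sim_matrix):
    # Look up the row index of the given title
    idx = indices[title]
    # Pair every movie index with its similarity score to that title
    sim_scores = list(enumerate(sim_matrix[idx]))
    # Sort by the score (second element of each pair), highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the next 10
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return movie_data['title'].iloc[movie_indices]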
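To see all four steps working together, here is a self-contained sketch on four made-up movies. The titles, overviews, the toy_recommender function and its top_n parameter are assumptions for illustration, not the article's dataset or exact function.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-in for movie_data
toy_movies = pd.DataFrame({
    'title': ['Space Hunt', 'Alien Friend', 'Kitchen Wars', 'Star Patrol'],
    'overview': [
        'A crew hunts an alien aboard a spaceship',
        'A boy befriends an alien stranded on Earth',
        'Two chefs compete to run the best restaurant in town',
        'A spaceship crew patrols the galaxy',
    ],
})

# Steps 1 and 2: Tf-idf matrix and pairwise similarity
toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_movies['overview'])
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)

# Step 3: reverse map from title to row index (titles are unique here)
toy_indices = pd.Series(toy_movies.index, index=toy_movies['title'])

# Step 4: rank the other movies by similarity to the given title
def toy_recommender(title, top_n=2):
    idx = toy_indices[title]
    scores = sorted(enumerate(toy_sim[idx]), key=lambda pair: pair[1], reverse=True)
    top = [i for i, _ in scores if i != idx][:top_n]
    return toy_movies['title'].iloc[top]

result = toy_recommender('Space Hunt')
print(result.tolist())  # ['Star Patrol', 'Alien Friend']
```

'Star Patrol' ranks first because its overview shares two terms ('crew', 'spaceship') with 'Space Hunt', while 'Alien Friend' shares only one ('alien'). This sketch excludes the queried movie by index rather than by slicing off the first result, which is one possible design choice when ties could put another movie in first place.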