Original article was published by Towards AI Team on Artificial Intelligence on Medium

Calculate the average rating per user or per movie:

```python
def get_average_rating(sparse_matrix, is_user):
    # axis=1 aggregates over each row (user); axis=0 over each column (movie)
    ax = 1 if is_user else 0
    sum_of_ratings = sparse_matrix.sum(axis=ax).A1
    no_of_ratings = (sparse_matrix != 0).sum(axis=ax).A1
    rows, cols = sparse_matrix.shape
    average_ratings = {i: sum_of_ratings[i] / no_of_ratings[i]
                       for i in range(rows if is_user else cols)
                       if no_of_ratings[i] != 0}
    return average_ratings
```

Average rating per user:

```python
average_rating_user = get_average_rating(train_sparse_data, True)
```

Average rating per movie:

```python
avg_rating_movie = get_average_rating(train_sparse_data, False)
```

**Check Cold Start Problem: User**

```python
total_users = len(np.unique(netflix_rating_df["customer_id"]))
train_users = len(average_rating_user)
uncommonUsers = total_users - train_users

print("Total no. of Users = {}".format(total_users))
print("No. of Users in train data = {}".format(train_users))
print("No. of Users not present in train data = {} ({}%)".format(
    uncommonUsers, np.round((uncommonUsers / total_users) * 100, 2)))
```

Here, about 1% of the users are new: they have no ratings in the training data, so the model cannot learn their preferences. This raises the cold start problem.

**Check Cold Start Problem: Movie**

```python
total_movies = len(np.unique(netflix_rating_df["movie_id"]))
train_movies = len(avg_rating_movie)
uncommonMovies = total_movies - train_movies

print("Total no. of Movies = {}".format(total_movies))
print("No. of Movies in train data = {}".format(train_movies))
print("No. of Movies not present in train data = {} ({}%)".format(
    uncommonMovies, np.round((uncommonMovies / total_movies) * 100, 2)))
```

Here, about 20% of the movies are new, and their ratings are not available in the training data. Again, this raises the cold start problem.
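Later in this pipeline, missing ratings are padded with averages. As a minimal, illustrative sketch of one common cold-start mitigation (not part of the article's code), a cold user or movie can fall back to the global average rating. The dictionary below is hypothetical toy data, shaped like the output of `get_average_rating`:

```python
import numpy as np

# Hypothetical per-movie averages; movie 99 is absent (a "cold" movie).
avg_rating_movie = {0: 3.8, 1: 4.2, 2: 2.9}
global_average = np.mean(list(avg_rating_movie.values()))

def rating_with_fallback(movie_id, movie_averages, global_avg):
    # Fall back to the global average when a movie never appeared in train.
    return movie_averages.get(movie_id, global_avg)

print(rating_with_fallback(1, avg_rating_movie, global_average))   # known movie
print(rating_with_fallback(99, avg_rating_movie, global_average))  # cold movie
```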

## Similarity Matrix

A similarity matrix is essential for generating recommendations: it quantifies how alike two data points are, and here it is used to compare user profiles with each other and movies with each other.

In the matrix shown in figure 17, video2 and video5 are very similar. Computing the full similarity matrix is expensive, which is why it requires a powerful computational system.
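As a minimal sketch of what such a matrix contains, the toy example below builds an item-item cosine-similarity matrix with scikit-learn. The ratings are made up for illustration, arranged so that the columns for "video2" and "video5" (indices 1 and 4) are rated alike and therefore come out highly similar, echoing the figure:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings matrix (rows = users, columns = videos, 0 = unrated).
# Made-up data: columns 1 and 4 ("video2" and "video5") are rated alike.
ratings = sparse.csr_matrix(np.array([
    [5, 4, 1, 0, 4],
    [4, 5, 0, 1, 5],
    [1, 0, 5, 4, 0],
]))

# Transpose so each row is a video's rating vector, then compare all pairs.
item_sim = cosine_similarity(ratings.T)
print(np.round(item_sim, 2))  # 5x5 symmetric matrix with 1.0 on the diagonal
```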

**Compute User Similarity Matrix**

Compute the user similarity matrix for the first 100 users:

```python
def compute_user_similarity(sparse_matrix, limit=100):
    row_index, col_index = sparse_matrix.nonzero()
    rows = np.unique(row_index)
    # 617 users x 100 similarity scores, pre-filled with zeros
    similar_arr = np.zeros(61700).reshape(617, 100)
    for row in rows[:limit]:
        sim = cosine_similarity(sparse_matrix.getrow(row), train_sparse_data).ravel()
        similar_indices = sim.argsort()[-limit:]  # indices of the top `limit` scores
        similar = sim[similar_indices]
        similar_arr[row] = similar
    return similar_arr

similar_user_matrix = compute_user_similarity(train_sparse_data, 100)
```

**Compute Movie Similarity Matrix**

Load the movie titles data set:

```python
movie_titles_df = pd.read_csv("movie_titles.csv", sep=",", header=None,
                              names=["movie_id", "year_of_release", "movie_title"],
                              index_col="movie_id", encoding="iso8859_2")
movie_titles_df.head()
```

Compute, for a given movie, its title and the number of similar movies:

```python
def compute_movie_similarity_count(sparse_matrix, movie_titles_df, movie_id):
    # Transpose so rows are movies, then compute movie-movie cosine similarity
    similarity = cosine_similarity(sparse_matrix.T, dense_output=False)
    no_of_similar_movies = (movie_titles_df.loc[movie_id][1],
                            similarity[movie_id].count_nonzero())
    return no_of_similar_movies
```

Get the similar movies count for movie 1775:

```python
similar_movies = compute_movie_similarity_count(train_sparse_data, movie_titles_df, 1775)
print("Similar Movies = {}".format(similar_movies))
```

# Building the Machine Learning Model

## Create a Sample Sparse Matrix

```python
def get_sample_sparse_matrix(sparseMatrix, n_users, n_movies):
    # Extract all (user, movie, rating) triples from the sparse matrix
    users, movies, ratings = sparse.find(sparseMatrix)
    uniq_users = np.unique(users)
    uniq_movies = np.unique(movies)
    np.random.seed(15)  # fixed seed so the sample is reproducible
    userS = np.random.choice(uniq_users, n_users, replace=False)
    movieS = np.random.choice(uniq_movies, n_movies, replace=False)
    # Keep only ratings whose user and movie were both sampled
    mask = np.logical_and(np.isin(users, userS), np.isin(movies, movieS))
    sparse_sample = sparse.csr_matrix((ratings[mask], (users[mask], movies[mask])),
                                      shape=(max(userS) + 1, max(movieS) + 1))
    return sparse_sample
```

Sample sparse matrix for the training data:

```python
train_sample_sparse_matrix = get_sample_sparse_matrix(train_sparse_data, 400, 40)
```

Sample sparse matrix for the test data:

```python
test_sparse_matrix_matrix = get_sample_sparse_matrix(test_sparse_data, 200, 20)
```

## Featuring the Data

Featuring (feature engineering) creates new features from different aspects of the existing variables. Here, for each (user, movie) pair, the ratings of the five most similar users and of the five most similar movies are added as features; these features encode the similarities between different users and movies. The following new features will be added to the data set:

```python
def create_new_similar_features(sample_sparse_matrix):
    # Per-movie and per-user average ratings computed earlier
    global_avg_rating = get_average_rating(sample_sparse_matrix, False)
    global_avg_users = get_average_rating(sample_sparse_matrix, True)
    global_avg_movies = get_average_rating(sample_sparse_matrix, False)
    sample_train_users, sample_train_movies, sample_train_ratings = sparse.find(sample_sparse_matrix)
    new_features_csv_file = open("/content/netflix_dataset/new_features.csv", mode="w")

    for user, movie, rating in zip(sample_train_users, sample_train_movies, sample_train_ratings):
        similar_arr = list()
        similar_arr.append(user)
        similar_arr.append(movie)
        # Global average rating over all non-zero entries
        similar_arr.append(sample_sparse_matrix.sum() / sample_sparse_matrix.count_nonzero())

        # Ratings given to this movie by the five most similar users
        similar_users = cosine_similarity(sample_sparse_matrix[user], sample_sparse_matrix).ravel()
        indices = np.argsort(-similar_users)[1:]  # skip the user themselves
        ratings = sample_sparse_matrix[indices, movie].toarray().ravel()
        top_similar_user_ratings = list(ratings[ratings != 0][:5])
        # Pad with the movie's average rating if fewer than five are available
        top_similar_user_ratings.extend([global_avg_rating[movie]] * (5 - len(top_similar_user_ratings)))
        similar_arr.extend(top_similar_user_ratings)

        # Ratings this user gave to the five most similar movies
        similar_movies = cosine_similarity(sample_sparse_matrix[:, movie].T, sample_sparse_matrix.T).ravel()
        similar_movies_indices = np.argsort(-similar_movies)[1:]  # skip the movie itself
        similar_movies_ratings = sample_sparse_matrix[user, similar_movies_indices].toarray().ravel()
        top_similar_movie_ratings = list(similar_movies_ratings[similar_movies_ratings != 0][:5])
        # Pad with the user's average rating if fewer than five are available
        top_similar_movie_ratings.extend([global_avg_users[user]] * (5 - len(top_similar_movie_ratings)))
        similar_arr.extend(top_similar_movie_ratings)

        similar_arr.append(global_avg_users[user])
        similar_arr.append(global_avg_movies[movie])
        similar_arr.append(rating)

        new_features_csv_file.write(",".join(map(str, similar_arr)))
        new_features_csv_file.write("\n")

    new_features_csv_file.close()
    new_features_df = pd.read_csv("/content/netflix_dataset/new_features.csv",
                                  names=["user_id", "movie_id", "global_average",
                                         "similar_user_rating1", "similar_user_rating2",
                                         "similar_user_rating3", "similar_user_rating4",
                                         "similar_user_rating5", "similar_movie_rating1",
                                         "similar_movie_rating2", "similar_movie_rating3",
                                         "similar_movie_rating4", "similar_movie_rating5",
                                         "user_average", "movie_average", "rating"])
    return new_features_df
```

Featuring (adding new similar features) for the training data:

```python
train_new_similar_features = create_new_similar_features(train_sample_sparse_matrix)
train_new_similar_features.head()
```

Featuring (adding new similar features) for the test data:

```python
test_new_similar_features = create_new_similar_features(test_sparse_matrix_matrix)
test_new_similar_features.head()
```

## Training and Prediction of the Model

Split features and target from the similar-features datasets:

```python
x_train = train_new_similar_features.drop(["user_id", "movie_id", "rating"], axis=1)
x_test = test_new_similar_features.drop(["user_id", "movie_id", "rating"], axis=1)
y_train = train_new_similar_features["rating"]
y_test = test_new_similar_features["rating"]
```

Utility method to compute the error metric (RMSE):

```python
def error_metrics(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return rmse
```

Fit an XGBRegressor with 100 estimators:

```python
# Note: `silent` was deprecated in later xgboost releases in favour of `verbosity`
clf = xgb.XGBRegressor(n_estimators=100, silent=False, n_jobs=10)
clf.fit(x_train, y_train)
```

Predict on the test data set:

```python
y_pred_test = clf.predict(x_test)
```

Check the error of the predictions:

```python
rmse_test = error_metrics(y_test, y_pred_test)
print("RMSE = {}".format(rmse_test))
```

As shown in figure 24, the RMSE (root mean squared error) of the model on the test set is about 0.99. Note that RMSE is an error, not an accuracy, so lower is better; if the error were higher than expected, we would continue to tune and retrain the model until it met our standard.

## Plot Feature Importance

**Feature importance** is a technique that assigns a score to each input feature based on how useful it is for predicting the target variable.

```python
def plot_importance(model, clf):
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_axes([0, 0, 1, 1])
    model.plot_importance(clf, ax=ax, height=0.3)
    plt.xlabel("F Score", fontsize=20)
    plt.ylabel("Features", fontsize=20)
    plt.title("Feature Importance", fontsize=20)
    plt.tick_params(labelsize=15)
    plt.show()

plot_importance(xgb, clf)
```

The plot shown in figure 25 displays the importance of each feature. Here, the user_average rating is the most important feature: its score is higher than any other. The similar-user and similar-movie rating features created earlier also contribute by capturing the similarity between different users and movies.

# Conclusion

Over the years, machine learning has solved major challenges for companies like Netflix, Amazon, Google, and Facebook. Netflix's recommender system helps users filter through a massive catalog of movies and shows based on their preferences. To provide good recommendations, a recommender system must interact with its users and learn what they like.

Collaborative filtering (CF) is a very popular recommendation algorithm that makes predictions and recommendations based on the ratings of other users. User-based collaborative filtering was the first automated collaborative filtering mechanism; it is also called *k-NN* collaborative filtering. *The problem of collaborative filtering is to predict how well a user will like an item that he has not rated given a set of existing choice judgments for a population of users [**4**].*
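As a minimal sketch of the user-based (k-NN) collaborative filtering idea described above, the toy example below predicts a missing rating as a similarity-weighted average of the k most similar users who rated the item. The data and helper function are illustrative only, not part of the article's pipeline:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user x item rating matrix (0 = unrated); made-up data.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 2.0, 4.0, 5.0],
])

def predict_knn(R, user, item, k=2):
    # Cosine similarity of this user to every user (including themselves)
    sims = cosine_similarity(R[user:user + 1], R).ravel()
    sims[user] = -1.0                      # exclude the user themselves
    raters = np.where(R[:, item] > 0)[0]   # users who actually rated the item
    top = raters[np.argsort(-sims[raters])][:k]
    # Similarity-weighted average of the neighbours' ratings
    return np.dot(sims[top], R[top, item]) / sims[top].sum()

print(round(predict_knn(R, user=0, item=2), 2))
```

Because user 0 is most similar to users who disliked item 2, the prediction lands near the low end of the rating scale.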