Source: Deep Learning on Medium
In this article I’m going to build a movie recommendation service using C#, ML.NET, and .NET Core.
ML.NET is Microsoft’s new machine learning library. It can run linear regression, logistic regression, clustering, deep learning, and many other machine learning algorithms.
And .NET Core is Microsoft’s cross-platform .NET framework that runs on Windows, macOS, and Linux. It’s the future of cross-platform .NET development.
The first thing I need for my movie recommendation app is a data file with thousands of movie reviews to train on. GroupLens has published a list of 100,000 ratings from their MovieLens project that I can work with.
The training data file is a very simple CSV file with only four columns:
- The ID of the user
- The ID of the movie
- The movie rating on a scale from 1–5
- The timestamp of the rating
I also have a CSV file with all the movie IDs and titles.
I will build a machine learning model that reads in each user ID, movie ID, and rating, and then predicts the ratings each user would give for every movie in the dataset.
So that gives me a list of movies and ratings for every user. To recommend a movie, all I need to do is sort the list by rating and report the top 5.
Let’s get started. Here’s how to set up a new console project in .NET Core:
$ dotnet new console -o Recommender
$ cd Recommender
Next, I need to install the ML.NET base package and the recommender extensions:
$ dotnet add package Microsoft.ML
$ dotnet add package Microsoft.ML.Recommender
Now I’m ready to add some classes. I’ll need one to hold a movie rating, and one to hold my model’s predictions.
I will modify the Program.cs file like this:
The MovieRating class holds a single movie rating. Note how each field is adorned with a Column attribute that tells the CSV data loading code which column to import data from.
I’m also declaring a MovieRatingPrediction class which will hold a single movie rating prediction.
Now I’m going to load the training data in memory:
This code uses the method LoadFromTextFile to load the CSV data directly into memory. The class field annotations tell the method how to store the loaded data in the MovieRating class.
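The two data classes and the loading step described above might look roughly like the sketch below. It assumes a recent ML.NET release, where the column attribute is named LoadColumn, and it assumes the ratings files are comma-separated with a header row. The DataLoading class and the trainPath and testPath parameters are names made up for this illustration, and the Label field name is an assumption; userId and movieId follow the column names used in the rest of this article.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// One row of the ratings file. The timestamp column is not loaded.
public class MovieRating
{
    [LoadColumn(0)] public float userId;
    [LoadColumn(1)] public float movieId;
    [LoadColumn(2)] public float Label;   // the rating (assumed field name)
}

// One prediction. The trained model writes its predicted rating to Score.
public class MovieRatingPrediction
{
    public float Label;
    public float Score;
}

// Hypothetical helper class, not part of ML.NET.
public static class DataLoading
{
    // Loads the training and test files into data views.
    // Assumes comma-separated files whose first line is a header.
    public static (IDataView Train, IDataView Test) Load(
        MLContext context, string trainPath, string testPath)
    {
        var train = context.Data.LoadFromTextFile<MovieRating>(
            trainPath, hasHeader: true, separatorChar: ',');
        var test = context.Data.LoadFromTextFile<MovieRating>(
            testPath, hasHeader: true, separatorChar: ',');
        return (train, test);
    }
}
```

If the files have no header line, hasHeader would need to be false, otherwise the first rating is silently skipped.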
Now I’m ready to start building the machine learning model:
Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.
My pipeline has the following components:
- MapValueToKey which reads the userId column and builds a dictionary of unique ID values. It then produces an output column called userIdEncoded containing an encoding for each ID. This step converts the IDs to numbers that the model can work with.
- Another MapValueToKey which reads the movieId column, encodes it, and stores the encodings in an output column called movieIdEncoded.
- A MatrixFactorization component that performs matrix factorization on the encoded ID columns and the ratings. This step calculates the movie rating predictions for every user and movie.
With the pipeline fully assembled, I can train the model with a call to Fit(…).
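The three components and the call to Fit could be assembled as in the following sketch. It assumes the Microsoft.ML and Microsoft.ML.Recommender packages installed earlier and a label column named Label. The Training class is a hypothetical wrapper, the choice of which encoded column is the matrix column index and which is the row index is an assumption, and the iteration count of 20 is taken from the 20 epochs reported in the results further down.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Hypothetical helper class, not part of ML.NET.
public static class Training
{
    // Builds the pipeline described above and trains it on the given data.
    public static ITransformer Train(MLContext context, IDataView trainingData)
    {
        var options = new MatrixFactorizationTrainer.Options
        {
            MatrixColumnIndexColumnName = "userIdEncoded",  // assumed orientation
            MatrixRowIndexColumnName = "movieIdEncoded",
            LabelColumnName = "Label",                      // assumed label column
            NumberOfIterations = 20
        };

        var pipeline = context.Transforms.Conversion
            // Encode the raw user IDs as keys.
            .MapValueToKey(outputColumnName: "userIdEncoded", inputColumnName: "userId")
            // Encode the raw movie IDs as keys.
            .Append(context.Transforms.Conversion.MapValueToKey(
                outputColumnName: "movieIdEncoded", inputColumnName: "movieId"))
            // Factorize the user-by-movie rating matrix.
            .Append(context.Recommendation().Trainers.MatrixFactorization(options));

        return pipeline.Fit(trainingData);
    }
}
```

All other trainer options are left at their defaults here; they are worth tuning once the basic pipeline runs.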
I now have a fully trained model. So now I need to load some validation data, predict the rating for each user and movie, and calculate the accuracy metrics of my model:
This code uses the Transform(…) method to make predictions for every user and movie in the test dataset.
The Evaluate(…) method compares these predictions to the actual rating values and automatically calculates three metrics for me:
- Rms: this is the root mean square error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate regression models and rate their accuracy. Geometrically, RMSE is the length of the vector of individual prediction errors in n-dimensional space, divided by the square root of n.
- L1: this is the mean absolute prediction error, expressed as a rating.
- L2: this is the mean square prediction error, or MSE value. Note that RMSE and MSE are related: RMSE is just the square root of MSE.
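A minimal sketch of the evaluation step, assuming a recent ML.NET release and label and score columns named Label and Score. In those releases the three metrics are exposed as RootMeanSquaredError, MeanAbsoluteError, and MeanSquaredError rather than Rms, L1, and L2, so the property names depend on the installed version. The Evaluation class is a hypothetical wrapper, and using the regression evaluator is an assumption about how the ratings are scored.

```csharp
using System;
using Microsoft.ML;

// Hypothetical helper class, not part of ML.NET.
public static class Evaluation
{
    // Scores the test data with the trained model and prints the three metrics.
    public static void Report(MLContext context, ITransformer model, IDataView testData)
    {
        // Adds a Score column with the predicted rating for every test row.
        var predictions = model.Transform(testData);

        // Compares the Score column against the Label column.
        var metrics = context.Regression.Evaluate(
            predictions, labelColumnName: "Label", scoreColumnName: "Score");

        Console.WriteLine($"RMSE: {metrics.RootMeanSquaredError:F4}");
        Console.WriteLine($"MAE:  {metrics.MeanAbsoluteError:F4}");
        Console.WriteLine($"MSE:  {metrics.MeanSquaredError:F4}");
    }
}
```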
To wrap up, let’s use the model to make a prediction.
I’m going to focus on a specific user, let’s say user number 6, and check if he or she likes the James Bond movie ‘GoldenEye’.
Here’s how to make the prediction:
I use the CreatePredictionEngine method to set up a prediction engine. The two type arguments are the input data class and the class to hold the prediction. And once my prediction engine is set up, I can simply call Predict(…) to make a single prediction on a MovieRating instance.
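The sketch below shows the two type arguments in isolation. It is written generically so that it compiles on its own; PredictOne and the SinglePrediction class are hypothetical names, and in this article TInput would be MovieRating and TPrediction would be MovieRatingPrediction.

```csharp
using Microsoft.ML;

// Hypothetical helper class, not part of ML.NET.
public static class SinglePrediction
{
    // TInput is the input data class, TPrediction the class that receives the result.
    // The constraints mirror those of CreatePredictionEngine.
    public static TPrediction PredictOne<TInput, TPrediction>(
        MLContext context, ITransformer model, TInput input)
        where TInput : class
        where TPrediction : class, new()
    {
        // Creating an engine is relatively costly, so in a loop over many
        // inputs it is better to create one engine and reuse it.
        var engine = context.Model.CreatePredictionEngine<TInput, TPrediction>(model);
        return engine.Predict(input);
    }
}
```

For user 6 this could be called as PredictOne<MovieRating, MovieRatingPrediction>(context, model, rating), where rating is a MovieRating holding user ID 6 and the ID that the movie file lists for GoldenEye. The predicted rating then sits in the Score field of the result.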
Let’s do one more thing and predict the top-5 favorite movies for this user:
This code uses a static helper class Movies to enumerate over every movie ID. It creates predictions for user 6 and every possible movie, sorts them by score in descending order, and takes the top 5 results.
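The sort-and-take step can be sketched without any ML.NET dependency. In this illustration the Recommendations class and its Top method are hypothetical names, and the predictScore delegate stands in for a call to the prediction engine for one fixed user.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class for illustration.
public static class Recommendations
{
    // Scores every candidate movie ID with the supplied function and returns
    // the `count` highest-scoring (MovieId, Score) pairs, best first.
    public static List<(int MovieId, float Score)> Top(
        IEnumerable<int> movieIds, Func<int, float> predictScore, int count = 5)
    {
        return movieIds
            .Select(id => (MovieId: id, Score: predictScore(id)))
            .OrderByDescending(p => p.Score)
            .Take(count)
            .ToList();
    }
}
```

Passing the scoring function in as a delegate keeps the ranking logic easy to test with made-up scores before wiring it to a trained model.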
Here’s the partial source of the helper class:
There’s a Movie class that represents a single movie. The static helper class Movies has an All property with a list of all movies, and a Get method to look up a single movie by ID value.
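One possible shape for such a helper is sketched below; it is not the original source. It assumes the movie file has a header row and three columns (ID, title, genres), with titles that contain commas wrapped in double quotes. The FilePath field and the ParseLine method are hypothetical additions for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// A single movie: its ID and its title.
public class Movie
{
    public int ID;
    public string Title = "";
}

public static class Movies
{
    // Hypothetical path; point this at the movie CSV file before first use.
    public static string FilePath = "movies.csv";

    // The file is read once, the first time All is accessed.
    // Skip(1) drops the assumed header row.
    private static readonly Lazy<List<Movie>> _all = new Lazy<List<Movie>>(
        () => File.ReadLines(FilePath).Skip(1).Select(ParseLine).ToList());

    // All movies in the file.
    public static List<Movie> All => _all.Value;

    // Looks up a movie by ID; returns null when the ID is unknown.
    public static Movie Get(int id) => All.FirstOrDefault(m => m.ID == id);

    // Parses one line of the assumed three-column layout.
    // The ID ends at the first comma and the genres start after the last one,
    // so everything in between is the title, even if it contains commas.
    public static Movie ParseLine(string line)
    {
        int first = line.IndexOf(',');
        int last = line.LastIndexOf(',');
        if (first < 0 || last <= first)
            throw new FormatException("Expected three comma-separated columns: " + line);

        return new Movie
        {
            ID = int.Parse(line.Substring(0, first)),
            Title = line.Substring(first + 1, last - first - 1).Trim('"')
        };
    }
}
```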
With the code all done, it’s time to check the predictions. I ran the app in the Visual Studio Code debugger on my Mac, and then again in a zsh shell to get more readable output.
After 20 epochs of training, my final RMSE on training is 0.5603. A quick check on the validation data gives an RMSE of 0.97. These numbers are fairly far apart, which suggests the model fits the training data better than unseen data, so I need to take another look at my data partitioning.
The mean absolute prediction error is 0.55, which means that this model will make movie rating predictions that are on average 0.55 rating points off the mark.
That’s not bad at all!
The model believes that user 6 would have given the movie ‘GoldenEye’ a rating of 3.71. And its predictions for the top-5 movies of user 6 are:
- Babes in Toyland
- Strictly Sexual
- Adam’s Rib
- White Squall
- Guess Who’s Coming to Dinner?
You can easily make predictions for yourself. Just add your own movie preferences to the end of the data file and run the training again. Then make the predictions for your own user ID.
And this recommendation algorithm will work on any dataset of users, products, and numerical ratings.
We’ve used Matrix Factorization in this example, but there are two other recommendation algorithms you can use to make predictions.
Here’s how to pick the right one:
If you have user IDs, product IDs, and ratings, then you can use Matrix Factorization like in this article.
However, if the data file contains no numerical ratings, only implicit feedback such as likes or purchases, then you need One-Class Matrix Factorization.
And if you have ratings but want to include other fields in the training too, you’ll need a Field-Aware Factorization Machine.
The ML.NET machine learning library has support for all three algorithms.
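As a rough illustration of the second case, the matrix factorization trainer in recent ML.NET releases can be switched to the one-class variant through its options. The sketch below assumes the same encoded column names as in this article; the OneClassOptions class is a hypothetical wrapper, and every other setting is left at its default.

```csharp
using Microsoft.ML.Trainers;

// Hypothetical helper class, not part of ML.NET.
public static class OneClassOptions
{
    // Returns trainer options for data without numerical ratings,
    // where a row only records that a user interacted with an item.
    public static MatrixFactorizationTrainer.Options Create()
    {
        return new MatrixFactorizationTrainer.Options
        {
            MatrixColumnIndexColumnName = "userIdEncoded",
            MatrixRowIndexColumnName = "movieIdEncoded",
            LabelColumnName = "Label",  // assumed label column
            // Selects the one-class loss instead of the default regression loss.
            LossFunction = MatrixFactorizationTrainer.LossFunctionType.SquareLossOneClass
        };
    }
}
```

The options class also has Alpha and C settings that apply to the one-class loss; suitable values depend on the dataset.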
You can get the full source code from here: https://github.com/dotnet/machinelearning-samples/tree/master/samples/csharp/getting-started/MatrixFactorization_MovieRecommendation
So what do you think?
Are you ready to start writing C# machine learning apps with ML.NET?
Add a comment and let me know!