Use C# and ML.NET Machine Learning To Predict Taxi Fares In New York

Source: Deep Learning on Medium


Building machine learning apps in C# has never been easier!

ML.NET is Microsoft’s new machine learning library. It can run linear regression, logistic classification, clustering, deep learning, and many other machine learning algorithms.

ML.NET is a first-class .NET library. There’s no need to use Python; you can tap into this library from any .NET language, including C#.

Microsoft is pouring all their effort into ML.NET right now. This is going to be their go-to solution for all machine learning in .NET going forward.

And it’s super easy to use. Watch this: I’m going to build an app that can predict taxi fares in New York.

The first thing I need is a data file with thousands of New York taxi rides. The NYC Taxi & Limousine Commission provides yearly TLC Trip Record Data files which have exactly what I need.

The data file looks like this:

I’m using the awesome Rainbow CSV plugin for Visual Studio Code which is highlighting my CSV data file with these nice colors.

The plugin can also run simple RBQL queries directly on the file:

The final column in the data file has the taxi fare I’m trying to predict.

I’ll use all the other columns as input data:

  • The data provider vendor ID
  • The rate code (standard, JFK, Newark, Nassau, negotiated, group)
  • Number of passengers
  • Trip time
  • Trip distance
  • Payment type (credit card, cash)

I’ll build a machine learning model in C# that takes these columns as input and predicts the taxi fare for every trip.

And I will use .NET Core to build my app.

.NET Core is really cool. It’s the multi-platform version of the .NET Framework and it runs on Windows, macOS, and Linux.

I’m using the 3.0 preview on my Mac right now and haven’t touched my Windows 10 virtual machine in days.

Here’s how to set up a new console project in .NET Core:

$ dotnet new console -o PricePrediction
$ cd PricePrediction

Next, I need to install the ML.NET NuGet package:

$ dotnet add package Microsoft.ML --version 0.10.0

Now I’m ready to add some classes. I’ll need one to hold a taxi trip, and one to hold my model’s predictions.

I will modify the Program.cs file like this:

The TaxiTrip class holds a single taxi trip. Note how each field is adorned with a Column attribute that tells the CSV data loading code which column to import data from.

I’m also declaring a TaxiTripFarePrediction class which will hold a single fare prediction.

Now I’m going to load the training data in memory:

This code sets up a TextLoader to load the CSV data into memory. Note that all column data types are what you’d expect, except RateCode. This column holds a numeric value from 1 to 6, but I’m loading it as a text field.

I’m doing this because RateCode is an enumeration with the following values:

  • 1 = standard
  • 2 = JFK
  • 3 = Newark
  • 4 = Nassau
  • 5 = negotiated
  • 6 = group

The actual numbers in this context don’t mean anything. And I certainly don’t want the machine learning model to start believing that a trip to Newark is three times as important as a standard fare.

So converting these values to strings is a perfect trick to show the model that RateCode is just a label, and the underlying numbers don’t mean anything.

With the TextLoader all set up, a single call to Read() is sufficient to load the entire data file in memory.
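The idea of keeping RateCode as text can be illustrated without ML.NET at all. The sketch below is plain C# and does not use the TextLoader API. The TaxiTripRow and CsvSketch names, the seven-column layout, and the sample line are all assumptions made for illustration; the real data file may order its columns differently.

```csharp
using System;
using System.Globalization;

// Hypothetical row type: one field per column of an assumed seven-column file.
public class TaxiTripRow
{
    public string VendorId;
    public string RateCode;      // deliberately text: the number is only a label
    public float PassengerCount;
    public float TripTime;
    public float TripDistance;
    public string PaymentType;
    public float FareAmount;
}

public static class CsvSketch
{
    // Parses one comma-separated line. Numeric columns are parsed with the
    // invariant culture so "3.75" reads the same on every machine.
    public static TaxiTripRow ParseLine(string line)
    {
        var cells = line.Split(',');
        return new TaxiTripRow
        {
            VendorId = cells[0],
            RateCode = cells[1], // no float.Parse here, on purpose
            PassengerCount = float.Parse(cells[2], CultureInfo.InvariantCulture),
            TripTime = float.Parse(cells[3], CultureInfo.InvariantCulture),
            TripDistance = float.Parse(cells[4], CultureInfo.InvariantCulture),
            PaymentType = cells[5],
            FareAmount = float.Parse(cells[6], CultureInfo.InvariantCulture)
        };
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample line, for illustration only.
        var row = CsvSketch.ParseLine("VTS,1,1,1140,3.75,CRD,15.5");
        Console.WriteLine($"RateCode is the string \"{row.RateCode}\", not a number");
    }
}
```

Because RateCode never becomes a float, nothing downstream can accidentally treat code 3 as three times code 1.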

Now I’m ready to start building the machine learning model:

Machine learning models in ML.NET are built with pipelines, which are sequences of data-loading, transformation, and learning components.

My pipeline has the following components:

  • CopyColumns which copies the FareAmount column to a new column called Label. This Label column holds the actual taxi fare that the model has to predict.
  • A group of three OneHotEncodings to perform one-hot encoding on the three columns that contain categorical data: VendorId, RateCode, and PaymentType. This is a required step because the learner works on numbers and cannot handle text labels directly.
  • Concatenate which combines all input data columns into a single column called Features. This is a required step because ML.NET can only train on a single input column.
  • A final FastTree regression learner which will train the model to make accurate predictions.
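To make the one-hot encoding and concatenation steps concrete, here is a minimal sketch in plain C#. It is not how ML.NET implements these transforms; the OneHotSketch class and its method names are hypothetical, and the rate codes are made-up sample values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OneHotSketch
{
    // Gives every distinct category its own slot, in order of first appearance.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> values)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (var value in values)
            if (!vocabulary.ContainsKey(value))
                vocabulary.Add(value, vocabulary.Count);
        return vocabulary;
    }

    // Encodes one category as a vector with a single 1 in that category's slot.
    // A category that is not in the vocabulary encodes as all zeros.
    public static float[] Encode(string value, Dictionary<string, int> vocabulary)
    {
        var vector = new float[vocabulary.Count];
        if (vocabulary.TryGetValue(value, out int slot))
            vector[slot] = 1f;
        return vector;
    }

    // Joins encoded and numeric columns into one flat feature vector.
    public static float[] Concatenate(params float[][] parts)
    {
        return parts.SelectMany(part => part).ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up rate codes, loaded as text.
        var rateCodes = new[] { "1", "2", "1", "5" };
        var vocabulary = OneHotSketch.BuildVocabulary(rateCodes);

        // Encode rate code "2", then append passenger count and trip distance.
        var encoded = OneHotSketch.Encode("2", vocabulary);
        var features = OneHotSketch.Concatenate(encoded, new[] { 1f, 3.75f });
        Console.WriteLine(string.Join(" | ", features));
    }
}
```

Each rate code ends up in its own slot, so no code is numerically "larger" than another, and the learner sees a single numeric vector per trip.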

The FastTreeRegressionTrainer is a very nice training algorithm that uses gradient boosting, a machine learning technique for regression problems.

A gradient boosting algorithm builds up a collection of weak regression models. It starts out with a weak model that tries to predict the taxi fare. Then it adds a second model that attempts to correct the error in the first model. And then it adds a third model, and so on.

The result is a fairly strong prediction model that is actually just an ensemble of weaker prediction models stacked on top of each other.
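The boosting loop described above can be sketched in a few dozen lines of plain C#. This is a toy, not the FastTree implementation: it uses one input column, the simplest possible weak model (a "stump" with a single split), and made-up distance and fare values. The Stump and BoostingSketch names and the learning rate of 0.5 are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One weak model: a single split on the input, with a constant output on each side.
public class Stump
{
    public double Threshold, LeftValue, RightValue;
    public double Predict(double x) => x <= Threshold ? LeftValue : RightValue;
}

public static class BoostingSketch
{
    // Tries every split point and keeps the one with the lowest squared error.
    public static Stump FitStump(double[] xs, double[] targets)
    {
        Stump best = null;
        double bestError = double.MaxValue;
        foreach (var threshold in xs.Distinct())
        {
            var left = targets.Where((t, i) => xs[i] <= threshold).ToArray();
            var right = targets.Where((t, i) => xs[i] > threshold).ToArray();
            if (right.Length == 0) continue; // everything on one side: not a split
            double leftMean = left.Average(), rightMean = right.Average();
            double error = left.Sum(t => (t - leftMean) * (t - leftMean))
                         + right.Sum(t => (t - rightMean) * (t - rightMean));
            if (error < bestError)
            {
                bestError = error;
                best = new Stump { Threshold = threshold, LeftValue = leftMean, RightValue = rightMean };
            }
        }
        return best;
    }

    // Each round fits a new stump to the error left over by the ensemble so far.
    public static List<Stump> Fit(double[] xs, double[] ys, int rounds, double learningRate)
    {
        var ensemble = new List<Stump>();
        var predictions = new double[xs.Length];
        for (int round = 0; round < rounds; round++)
        {
            var residuals = ys.Select((y, i) => y - predictions[i]).ToArray();
            var stump = FitStump(xs, residuals);
            if (stump == null) break;
            // Shrink the correction so no single weak model dominates.
            stump.LeftValue *= learningRate;
            stump.RightValue *= learningRate;
            ensemble.Add(stump);
            for (int i = 0; i < xs.Length; i++)
                predictions[i] += stump.Predict(xs[i]);
        }
        return ensemble;
    }

    // The ensemble's prediction is the sum of all the weak models' outputs.
    public static double Predict(List<Stump> ensemble, double x)
        => ensemble.Sum(stump => stump.Predict(x));

    public static double MeanSquaredError(List<Stump> ensemble, double[] xs, double[] ys)
        => xs.Select((x, i) => Math.Pow(ys[i] - Predict(ensemble, x), 2)).Average();
}

public static class Program
{
    public static void Main()
    {
        // Made-up (miles, fare) pairs, for illustration only.
        double[] miles = { 1, 2, 3, 4, 5, 6 };
        double[] fares = { 5, 8, 10.5, 13, 16, 18.5 };
        foreach (var rounds in new[] { 1, 5, 50 })
        {
            var ensemble = BoostingSketch.Fit(miles, fares, rounds, learningRate: 0.5);
            Console.WriteLine($"{rounds} rounds: MSE = {BoostingSketch.MeanSquaredError(ensemble, miles, fares):F3}");
        }
    }
}
```

Running it with more rounds should show the training error shrinking, which is the "each model corrects the previous ones" effect in miniature.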

With the pipeline fully assembled, I can train the model with a call to Fit().

I now have a fully trained model. Next, I need to load some validation data, predict the taxi fare for each trip, and calculate the accuracy of my model:

This code uses the TextLoader class to load another taxi trip data file for testing. And with a single call to Transform(…) I can set up predictions for every single trip in the file.

The Evaluate(…) method compares these predictions to the actual taxi fares and automatically calculates three very handy metrics for me:

  • Rms: this is the root mean square error or RMSE value. It’s the go-to metric in the field of machine learning to evaluate models and rate their accuracy. RMSE is the square root of the average squared prediction error, so large misses are penalized more heavily than small ones. Geometrically, it is the length of the n-dimensional vector of individual prediction errors, divided by √n.
  • L1: this is the mean absolute prediction error, expressed in dollars.
  • L2: this is the mean square prediction error, or MSE value. Note that RMSE and MSE are related: RMSE is just the square root of MSE.
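The three metrics are simple enough to compute by hand. The sketch below is plain C#, independent of ML.NET’s Evaluate method; the RegressionMetricsSketch class is hypothetical and the fares are made up.

```csharp
using System;
using System.Linq;

public static class RegressionMetricsSketch
{
    // L1: the average absolute difference between actual and predicted values.
    public static double MeanAbsoluteError(double[] actual, double[] predicted)
        => actual.Zip(predicted, (a, p) => Math.Abs(a - p)).Average();

    // L2: the average squared difference between actual and predicted values.
    public static double MeanSquaredError(double[] actual, double[] predicted)
        => actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average();

    // Rms: the square root of the mean squared error.
    public static double RootMeanSquaredError(double[] actual, double[] predicted)
        => Math.Sqrt(MeanSquaredError(actual, predicted));
}

public static class Program
{
    public static void Main()
    {
        // Made-up fares in dollars, for illustration only.
        double[] actual = { 10.0, 20.0, 30.0 };
        double[] predicted = { 11.0, 18.0, 30.0 };

        Console.WriteLine($"L1:  {RegressionMetricsSketch.MeanAbsoluteError(actual, predicted):F2}");
        Console.WriteLine($"L2:  {RegressionMetricsSketch.MeanSquaredError(actual, predicted):F2}");
        Console.WriteLine($"Rms: {RegressionMetricsSketch.RootMeanSquaredError(actual, predicted):F2}");
    }
}
```

With errors of 1, 2, and 0 dollars, L1 is (1 + 2 + 0) / 3 = 1.00, L2 is (1 + 4 + 0) / 3 ≈ 1.67, and Rms is √1.67 ≈ 1.29. L1 and Rms are in dollars, while L2 is in squared dollars, which is why L1 and Rms are the easier ones to interpret.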

To wrap up, let’s use the model to make a prediction.

I’m going to take a taxi trip for 3.75 miles and the trip will take me 19 minutes. I’ll be the only passenger and I’ll pay by credit card.

Here’s how to make the prediction:

I use the CreatePredictionEngine<…>(…) method to set up a prediction engine. The two type arguments are the input data class and the class to hold the prediction. And once my prediction engine is set up, I can simply call Predict(…) to make a single prediction.

I know from the data file that this trip is supposed to cost $15.50. How accurate will the model prediction be?

Here’s the code running in the Visual Studio Code debugger on my Mac:

The output is a bit small, so here’s the app again running in a zsh shell:

I get an RMSE value of 2.06 and an L1 value of 0.42. This means that my predictions are on average only 42 cents off.

How about that!

According to the model, my 19-minute trip covering 3.75 miles will cost me $15.79. But the actual fare price is $15.50, so in this case my model prediction is off by only 29 cents.

So what do you think?

Are you ready to start writing C# machine learning apps with ML.NET?

Add a comment and let me know!