Predicting Taxi Fares in NYC Using Google Cloud AI Platform (Billion+ Rows), Part 1

Source: Deep Learning on Medium

This project aims to create a machine learning model that estimates taxi fares in New York City, using a dataset of taxi rides hosted in BigQuery. The dataset has more than a billion rows and is about 130 GB in size. You can find it here.

To begin, we’ll have to set up a project on the Google Cloud Platform and start a notebook instance.

This instance would cost us around $0.14 per hour. We also need to enable the Compute Engine API and create a storage bucket.

Data Preparation

Install the BigQuery library and preview the data
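As a minimal sketch of this step (assuming the public `nyc-tlc.yellow.trips` table and default credentials inside the notebook instance), the preview might look like:

```python
# Sketch: install the client library with `pip install google-cloud-bigquery`,
# then preview a few rows. The table name assumes the public NYC taxi
# dataset `nyc-tlc.yellow.trips`.
PREVIEW_QUERY = """
SELECT pickup_datetime, pickup_longitude, pickup_latitude,
       dropoff_longitude, dropoff_latitude, passenger_count,
       trip_distance, fare_amount, tolls_amount
FROM `nyc-tlc.yellow.trips`
LIMIT 10
"""

def preview_trips():
    # Requires GCP credentials, e.g. inside an AI Platform notebook.
    from google.cloud import bigquery
    client = bigquery.Client()
    return client.query(PREVIEW_QUERY).to_dataframe()
```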

We used the RAND() function to take a small sample from this humongous dataset.

The disadvantage is that RAND() returns a different sample every time we run the query. Thus, we'll use a hash function instead.

MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 5000)= 1

This expression gives us a repeatable sample containing 0.02% (1/5000) of the original data. Exploring the entire dataset would take enormous computing power; I plan to build a model using the entire dataset in the future.

Load the data into a Pandas Dataframe
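A sketch of pulling the hashed sample into pandas (the table name and column list are assumptions based on the cleaning queries later in this post):

```python
# Sketch: run the repeatable-sample query and load the result into a
# pandas DataFrame. Wrapped in a function because it needs GCP credentials.
SAMPLE_QUERY = """
SELECT pickup_datetime, pickup_longitude, pickup_latitude,
       dropoff_longitude, dropoff_latitude, passenger_count,
       trip_distance, fare_amount, tolls_amount
FROM `nyc-tlc.yellow.trips`
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 5000) = 1
"""

def load_sample():
    from google.cloud import bigquery
    return bigquery.Client().query(SAMPLE_QUERY).to_dataframe()

# One bucket out of 5000 is 0.02% of the data.
sample_fraction = 1 / 5000
print(sample_fraction)  # 0.0002
```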

We definitely need to clean our data. For example, the latitude and longitude values are off (latitude should be between -90 and 90; longitude should be between -180 and 180). Some trips have a negative fare amount and some have a zero passenger count.
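The checks above can be sketched in pandas on a toy frame (the column names follow the BigQuery schema; the values are made up for illustration):

```python
import pandas as pd

# Sketch: flag obviously invalid rows in the sampled DataFrame.
trips = pd.DataFrame({
    "fare_amount":      [8.5, -2.0, 12.0],
    "passenger_count":  [1, 0, 2],
    "pickup_latitude":  [40.75, 95.0, 40.71],
    "pickup_longitude": [-73.99, -73.98, -200.0],
})

bad = (
    (trips["fare_amount"] <= 0)                      # negative/zero fares
    | (trips["passenger_count"] == 0)                # empty trips
    | ~trips["pickup_latitude"].between(-90, 90)     # impossible latitude
    | ~trips["pickup_longitude"].between(-180, 180)  # impossible longitude
)
print(bad.sum())  # number of suspect rows
```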

Let’s plot trip_distance vs fare_amount
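A minimal version of that plot, assuming the sample is already in a DataFrame (shown here with a tiny made-up frame):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for notebooks/scripts
import matplotlib.pyplot as plt
import pandas as pd

# Tiny illustrative sample; in the notebook this is the BigQuery sample.
trips = pd.DataFrame({"trip_distance": [1.0, 2.5, 0.0, 5.2],
                      "fare_amount":   [6.5, 11.0, 0.0, 18.5]})

ax = trips.plot.scatter(x="trip_distance", y="fare_amount")
ax.set_title("trip_distance vs fare_amount")
plt.savefig("scatter.png")
```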

Some bad data is coded as zero. Also, we will not include tips in our prediction, as they tend to be unpredictable. Our target variable will be fare_amount + tolls_amount.

Our predictors would be:

  • pickup_datetime (for day-of-week and hour-of-day features)
  • pickup_longitude and pickup_latitude
  • dropoff_longitude and dropoff_latitude
  • passenger_count
We have selected these as they best align with our objective. One could argue that trip_distance helps determine the fare, but we exclude it because it cannot always be known in advance. Our project might have applications in the ride-hailing industry, where the trip cost is predicted up front; there, the location coordinates of the source and destination should be chosen over trip_distance, which is not available before the ride.

Data Cleaning

We apply the following filters to remove unreasonable data:

  • Fare amounts should be at least $2.50
  • Source/Destination location coordinates should be within NYC limits
  • Passenger Counts cannot be zero
  • Trip Distance cannot be zero
SELECT
  (tolls_amount + fare_amount) AS fare_amount, -- label
  pickup_datetime,
  pickup_longitude, pickup_latitude,
  dropoff_longitude, dropoff_latitude,
  passenger_count
FROM
  `nyc-tlc.yellow.trips`
WHERE
  -- clean data
  trip_distance > 0
  AND passenger_count > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  -- repeatable 1/5000th sample
  AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 5000) = 1

Let us also create some categorical features by extracting dayofweek and hourofday from pickup_datetime:

(tolls_amount + fare_amount) AS fare_amount, -- label
EXTRACT(DAYOFWEEK FROM pickup_datetime) AS dayofweek,
EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
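For a sanity check outside BigQuery, the same two features can be derived in pandas. Note that BigQuery's DAYOFWEEK runs 1 (Sunday) through 7 (Saturday), while pandas' .dayofweek runs 0 (Monday) through 6 (Sunday), so a conversion is needed:

```python
import pandas as pd

# Sketch: pandas equivalent of the EXTRACTs above, on two toy timestamps.
ts = pd.to_datetime(["2015-01-04 13:10:00",   # a Sunday
                     "2015-01-05 08:30:00"])  # a Monday

dayofweek = (ts.dayofweek + 1) % 7 + 1  # convert to BigQuery's Sunday = 1
hourofday = ts.hour

print(list(dayofweek), list(hourofday))  # [1, 2] [13, 8]
```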

Split the data into Train, Validation, and Test sets

We split our 1/5000th sample 70–15–15
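The split can be sketched with a fixed-seed shuffle so it is repeatable (shown on a toy DataFrame; in the notebook the input is the BigQuery sample):

```python
import numpy as np
import pandas as pd

# Sketch: a repeatable 70/15/15 split via a seeded permutation of the index.
df = pd.DataFrame({"x": range(1000)})

rng = np.random.RandomState(42)   # fixed seed => same split every run
idx = rng.permutation(len(df))

n_train = int(0.70 * len(df))
n_valid = int(0.15 * len(df))

train = df.iloc[idx[:n_train]]
valid = df.iloc[idx[n_train:n_train + n_valid]]
test = df.iloc[idx[n_train + n_valid:]]

print(len(train), len(valid), len(test))  # 700 150 150
```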

Write the data to CSV

In the next steps, we are going to create a machine learning model using TensorFlow. Since TensorFlow is slow at reading files directly from BigQuery, we write our data out as .csv files.
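A minimal sketch of the export (the file name is a placeholder; in practice each of the train/validation/test splits gets its own file):

```python
import pandas as pd

# Sketch: write a split to CSV so TensorFlow can read it with tf.data
# instead of querying BigQuery. index=False keeps the row index out of
# the file; header handling depends on how the input function is written.
df = pd.DataFrame({"fare_amount": [6.5, 11.0], "passenger_count": [1, 2]})
df.to_csv("taxi-train.csv", index=False)

# Read it back to confirm the round trip.
check = pd.read_csv("taxi-train.csv")
print(check.shape)  # (2, 2)
```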

Even though we are using only a 0.02% sample of the total data, we still have about 151k rows to train on, which is pretty good.

Check here for the next post where I build models.

Connect with me on LinkedIn. You can find the full code here.

Cheers!