NYC Taxi Fare Prediction using PyTorch

Overview

Introduction

The New York City Taxi Fare Prediction Kaggle competition provides a dataset with about 55 million records. The data contains features such as pickup date & time, the latitude & longitude (GPS coordinates) of the pickup and dropoff locations, and the number of passengers.

Goal - Our goal is to predict the fare_amount for a taxi ride given the GPS coordinates, the time, the day of the week, and so on.

Data - For this analysis, we will use data from April 11 to April 24, 2010 (120,000 records). The records have been randomly shuffled. Here are some details about the dataset:

Features

Target

Read the Data

We can see that fares range from \$2.50 to \$49.90, with a mean of \$10.04 and a median of \$7.70.

Feature Engineering

In this section, we will calculate the trip distance and extract date and time features.

Calculate distance

Calculate the distance between pickup and dropoff locations using Haversine Distance. The formula is:

${\displaystyle d=2r\arcsin \left({\sqrt {\sin ^{2}\left({\frac {\varphi _{2}-\varphi _{1}}{2}}\right)+\cos(\varphi _{1})\:\cos(\varphi _{2})\:\sin ^{2}\left({\frac {\lambda _{2}-\lambda _{1}}{2}}\right)}}\right)}$

where

$\begin{split} r&: \textrm {radius of the sphere (Earth's radius averages 6371 km)}\\ \varphi_1, \varphi_2&: \textrm {latitudes of point 1 and point 2}\\ \lambda_1, \lambda_2&: \textrm {longitudes of point 1 and point 2}\end{split}$
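The formula above can be sketched in NumPy as follows. This is a minimal illustration rather than the notebook's original cell: the function name `haversine_km` and the example coordinates are assumptions, and the inputs are taken to be in decimal degrees.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as in the formula above


def haversine_km(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_KM):
    """Great-circle distance in km between points given in decimal degrees.

    Works on scalars as well as NumPy arrays or pandas Series (vectorized).
    """
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = np.radians(lon2) - np.radians(lon1)

    # Term under the square root in the Haversine formula
    a = np.sin(d_phi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(d_lambda / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))


# Example (approximate, illustrative coordinates): Times Square to JFK airport
d = haversine_km(40.7580, -73.9855, 40.6413, -73.7781)
```

Because the function is vectorized, it could be applied to whole DataFrame columns at once, for example to the pickup and dropoff latitude and longitude columns, to produce a distance column in km.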

Extract date time columns

Here, we will extract information like "day of the week", "am vs. pm" etc. The original data was saved in UTC time, so we'll make an adjustment to EDT (UTC-4) by subtracting 4 hours.

Add columns for hour, day of week and am/pm
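A minimal pandas sketch of this step is shown here on a two-row toy frame. The column names (`pickup_datetime`, `EDTdate`, `Hour`, `AMorPM`, `Weekday`) and the timestamp format are assumptions made for illustration.

```python
import pandas as pd

# Toy data; `pickup_datetime` is assumed to hold UTC timestamps as strings
df = pd.DataFrame({
    "pickup_datetime": ["2010-04-19 08:17:56 UTC", "2010-04-17 19:43:53 UTC"],
})

# Parse the first 19 characters (date and time), then subtract 4 hours for EDT
df["EDTdate"] = pd.to_datetime(df["pickup_datetime"].str[:19]) - pd.Timedelta(hours=4)

# Derived columns: hour of day, am/pm flag, and day of the week
df["Hour"] = df["EDTdate"].dt.hour
df["AMorPM"] = df["Hour"].map(lambda h: "am" if h < 12 else "pm")
df["Weekday"] = df["EDTdate"].dt.day_name()
```

A fixed 4-hour shift is adequate here only because the whole sample falls in April, when New York observes daylight saving time; a dataset spanning the full year would need a timezone-aware conversion instead.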

Data Exploration

Let's explore the data.

Distribution of Fare Amount

The plot is heavily skewed to the right, showing that the majority of fare amounts are around $6, with a steep decline indicating that most trips were for a short distance. There are a couple of larger fare amounts indicating longer trips (possibly to the JFK airport... just a wild guess).

Distribution of Passengers

The plot shows that most trips had only one passenger. While some trips had two passengers, there are very few trips where taxis had three or more passengers in a single trip.

Trips by Day of the Week

The plot shows that most trips occurred on Friday, followed by Saturday, reflecting the fact that people tend to go out more as the weekend starts. Sunday had the fewest trips, as people relaxed at home to recharge for the next week.

Trips by Hour of the Day

The plot shows that the majority of the trips occurred between 2PM and 6PM. Given that our data is for New York City, we see a large number of trips throughout the day.

Trips by Day of the Week and Hour

The plot shows that the majority of the trips on a Saturday occurred between 8am and 9pm, indicating that trips started a little late and that there were more trips in the later hours of the day compared to other days.

Distribution of Trip Distance

The plot is heavily skewed to the right, indicating that the majority of the trips taken were for a short distance. There are a couple of blips at around the 10 km and 20 km marks, indicating longer trips (possibly to the JFK airport... just a wild guess).

Distance travelled by Weekday and AM/PM

The plot shows that trips were longer (in km) on Friday, on Sunday in the PM, and on Monday morning. That is also the time when most people travel to the airport. The interquartile range (IQR) of trip distances shows that people travelled a little further in the latter half of the day (PM) than in the morning.

Distance travelled by Day and Hour

While the plot below may look difficult to interpret, we are only interested in the trend of distance for each day, focusing on the median distance travelled.

The plot shows that the median distance travelled was higher at around 12am on Mondays and Thursdays. We can also see the median distance travelled increasing in the evenings on Mondays, Tuesdays and Wednesdays.

Fare Amount by Day and Hour

While the plot below may look difficult to interpret, we are only interested in the trend of fare amount for each day, focusing on the median amount.

The trend for median fare amounts seems to follow the median distance travelled. The plot shows that the median fare amount was higher at around 12am on Mondays and Thursdays. We can also see the median fare amount increasing in the evenings on Mondays, Tuesdays and Wednesdays.

Fare amount by Distance

The plot shows the fare amount increasing with distance, with some anomalies showing lower fares for longer distances.

Categorify the data

Convert categorical columns to category type.

Separate categorical and continuous columns

Convert columns to category type

For the hour column, the category names are the integers 0 through 23, for a total of 24 unique categories. These values also correspond to the codes assigned to each name.
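As a small illustration of how pandas exposes category names and codes, consider the sketch below. The toy frame and the column names `Hour` and `AMorPM` are assumptions chosen so that every hour appears exactly once.

```python
import pandas as pd

# Toy frame containing every hour of the day once
df = pd.DataFrame({
    "Hour": list(range(24)),
    "AMorPM": ["am"] * 12 + ["pm"] * 12,
})

cat_cols = ["Hour", "AMorPM"]
for col in cat_cols:
    df[col] = df[col].astype("category")

hour_names = df["Hour"].cat.categories    # the 24 category names: 0, 1, ..., 23
hour_codes = df["Hour"].cat.codes         # integer code assigned to each row
ampm_names = df["AMorPM"].cat.categories  # 'am', 'pm'
ampm_codes = df["AMorPM"].cat.codes       # 0 for 'am', 1 for 'pm'
```

Names and codes coincide for the hour column only because all 24 hours are present and pandas orders categories by sorted value; for string columns such as `AMorPM`, the codes are simply positions in the sorted list of names.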

Combine categorical columns

Here, we will combine the three columns into a single array.
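One way to build that array is to stack the integer codes of each categorical column side by side. The sketch below uses a four-row toy frame; the column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with three categorical columns
df = pd.DataFrame({
    "Hour": [4, 11, 7, 17],
    "AMorPM": ["am", "am", "am", "pm"],
    "Weekday": ["Mon", "Sat", "Sat", "Fri"],
}).astype("category")

cat_cols = ["Hour", "AMorPM", "Weekday"]

# One row per record, one column of integer codes per categorical feature
cats = np.stack([df[col].cat.codes.values for col in cat_cols], axis=1)
```

Each row of `cats` now holds the three category codes for one trip, which is the layout an embedding lookup expects.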

Tensors and Embeddings

Create tensors for categorical and continuous data
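A minimal sketch of the conversion, assuming the categorical codes, continuous features and target already exist as NumPy arrays (the array contents below are made up for illustration):

```python
import numpy as np
import torch

# Illustrative stand-ins for the real arrays
cats = np.array([[4, 0, 1], [15, 1, 5]], dtype=np.int8)            # category codes
conts = np.array([[40.73, -73.99, 2.1], [40.64, -73.78, 21.8]])    # continuous features
y = np.array([6.5, 45.0])                                          # fare amounts

# Embedding layers expect integer indices, so the codes become int64 tensors
cats_t = torch.tensor(cats, dtype=torch.int64)
conts_t = torch.tensor(conts, dtype=torch.float32)

# Reshape the target to a column so it matches the model output shape (N, 1)
y_t = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
```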

Set Embedding Sizes (One hot encoding for categorical data)

Embeddings serve a similar purpose to one-hot encoding for categorical variables, but represent each category as a dense, learned vector. The rule of thumb for determining the embedding size is to divide the number of unique entries in each column by 2, but not to exceed 50.
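The rule of thumb can be sketched as follows. The category counts (24 hours, 2 am/pm values, 7 weekdays) and the choice to round odd counts up are assumptions for this example.

```python
import torch
import torch.nn as nn

# Number of unique values per categorical column (assumed: hour, am/pm, weekday)
cat_szs = [24, 2, 7]

# Half the number of categories (rounded up here), capped at 50
emb_szs = [(size, min(50, (size + 1) // 2)) for size in cat_szs]

# One embedding table per categorical column
embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])

# Look up two example rows of category codes and concatenate the vectors
cats = torch.tensor([[4, 0, 1], [15, 1, 5]])
x = torch.cat([emb(cats[:, i]) for i, emb in enumerate(embeds)], dim=1)
```

With these sizes, each record's three codes are replaced by 12 + 1 + 4 = 17 learned values, compared with 24 + 2 + 7 = 33 mostly-zero columns under one-hot encoding.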

Build the Model

Define Tabular Model

Here we will define a TabularModel() class that follows the fast.ai library. The goal is to define a model based on the number of continuous columns (given by conts.shape[1]) plus the number of categorical columns and their embeddings (given by len(emb_szs) and emb_szs respectively).

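A fully connected network with 3 hidden layers of size [200,300,200] will be used. On each layer, we will apply a ReLU activation, normalize the layer, and apply dropout that randomly drops 40% of the activations.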
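A model along these lines might look like the sketch below. It is an approximation in the spirit of fast.ai's tabular model, not the notebook's exact code: the use of `BatchNorm1d` for normalization, the dropout on the embeddings, and the figure of 6 continuous columns are all assumptions.

```python
import torch
import torch.nn as nn


class TabularModel(nn.Module):
    def __init__(self, emb_szs, n_cont, out_sz, layers, p=0.5):
        super().__init__()
        # One embedding table per categorical column
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        self.emb_drop = nn.Dropout(p)
        self.bn_cont = nn.BatchNorm1d(n_cont)

        # Input width = total embedding width + number of continuous columns
        n_in = sum(nf for _, nf in emb_szs) + n_cont
        blocks = []
        for width in layers:
            # Linear -> ReLU -> normalization -> dropout for each hidden layer
            blocks += [nn.Linear(n_in, width), nn.ReLU(),
                       nn.BatchNorm1d(width), nn.Dropout(p)]
            n_in = width
        blocks.append(nn.Linear(n_in, out_sz))  # final regression output
        self.layers = nn.Sequential(*blocks)

    def forward(self, x_cat, x_cont):
        emb = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]
        x = self.emb_drop(torch.cat(emb, dim=1))
        x = torch.cat([x, self.bn_cont(x_cont)], dim=1)
        return self.layers(x)


emb_szs = [(24, 12), (2, 1), (7, 4)]  # assumed embedding sizes
model = TabularModel(emb_szs, n_cont=6, out_sz=1, layers=[200, 300, 200], p=0.4)
```

The model takes the categorical codes and the continuous features as two separate tensors, so a forward pass is `model(cats, conts)`.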

Build Model

Train and Test the Model

Split the data
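Because the records were already randomly sorted, one simple option is to slice the tensors into a training part and a test part without shuffling again. The sketch below uses random stand-in tensors, and the 20% test fraction is an assumption.

```python
import torch

torch.manual_seed(0)

# Random stand-ins for the real tensors
n = 1000
cats = torch.randint(0, 2, (n, 3))
conts = torch.randn(n, 6)
y = torch.rand(n, 1)

test_size = int(n * 0.2)  # assumed hold-out fraction

# The data is already in random order, so plain slicing gives a random split
cat_train, cat_test = cats[:-test_size], cats[-test_size:]
con_train, con_test = conts[:-test_size], conts[-test_size:]
y_train, y_test = y[:-test_size], y[-test_size:]
```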

Define loss function and optimizer

Train the Model

Here, we will train the model for 300 epochs and optimize using the Adam optimizer.
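A training loop of this kind can be sketched as below. To keep the example self-contained it uses a small stand-in network and synthetic data. The learning rate, the full-batch updates, and the use of the square root of `nn.MSELoss` as an RMSE loss are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and data so the sketch runs on its own
model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))
X_train = torch.randn(256, 6)
y_train = X_train.sum(dim=1, keepdim=True)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # lr is an assumption

epochs = 300
losses = []

model.train()
for epoch in range(epochs):
    y_pred = model(X_train)
    loss = torch.sqrt(criterion(y_pred, y_train))  # RMSE
    losses.append(loss.item())

    optimizer.zero_grad()  # clear gradients from the previous step
    loss.backward()        # backpropagate
    optimizer.step()       # update the weights
```

Storing `loss.item()` at every epoch gives a plain list of floats that can be plotted against the epoch number to inspect convergence.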

Plot Training Loss

We can see the loss reducing steeply up until 150 epochs and then reducing gradually.

Validate the Model

Here, we will validate the model on test data.
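Validation typically runs with the model in evaluation mode and with gradient tracking disabled. The sketch below uses a stand-in network and random test data, so the numbers it prints are not meaningful; only the pattern is.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and test data
model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Dropout(0.4), nn.Linear(32, 1))
X_test = torch.randn(64, 6)
y_test = torch.rand(64, 1) * 50

model.eval()           # turn off dropout; batch norm would use running statistics
with torch.no_grad():  # no gradients are needed for evaluation
    y_val = model(X_test)
    rmse = torch.sqrt(nn.functional.mse_loss(y_val, y_test))

print(f"RMSE: {rmse.item():.2f}")

# Compare the first few predictions with the actual values
for i in range(5):
    pred, actual = y_val[i].item(), y_test[i].item()
    print(f"{i:2}. predicted {pred:8.2f}  actual {actual:8.2f}  diff {abs(pred - actual):8.2f}")
```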

This means that predicted fares are typically within about ±$3.20 of the actual fare.

Now let's look at the first 20 predicted values:

So while many predictions were off by a few cents, some were off by $17.65.

Save the Model

Here, we will save the state of the model (weights & biases) and not the full definition.
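Saving only the state means the architecture has to be rebuilt before the weights can be loaded again. A minimal sketch with a stand-in model follows; the file name is hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model
model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))

# Hypothetical file name, written to the system temp directory
path = os.path.join(tempfile.gettempdir(), "taxi_fare_model.pt")

# Save only the learned parameters (weights and biases)
torch.save(model.state_dict(), path)

# To reload: rebuild the same architecture, then load the saved parameters
restored = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))
restored.load_state_dict(torch.load(path))
restored.eval()  # switch to evaluation mode before making predictions
```

The trade-off is that the class definition must be available wherever the model is reloaded, but the saved file stays small and does not depend on pickling the class itself.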

Summary

In this project, we used PyTorch to predict taxi fare prices for the NYC taxi fare dataset. We built a TabularModel() class to handle both categorical and continuous variables. A fully connected network with 3 hidden layers of size [200,300,200] was used. A ReLU activation function was applied on each layer, the layer was normalized, and then a dropout layer was used to randomly drop 40% of the activations. The model was trained for 300 epochs, resulting in an RMSE of 3.24.