Taxi Tip¶


Objective: Train a Decision Tree regressor to predict the amount of a taxi tip.

Import libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
sns.set_style("whitegrid")

Load the dataset¶

In [2]:
file_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/yellow_tripdata_2019-06.csv'
df = pd.read_csv(file_url)
df.head()
Out[2]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge
0 1 2019-06-01 00:55:13 2019-06-01 00:56:17 1.0 0.0 1.0 N 145.0 145.0 2.0 3.0 0.5 0.5 0.00 0.0 0.3 4.30 0.0
1 1 2019-06-01 00:06:31 2019-06-01 00:06:52 1.0 0.0 1.0 N 262.0 263.0 2.0 2.5 3.0 0.5 0.00 0.0 0.3 6.30 2.5
2 1 2019-06-01 00:17:05 2019-06-01 00:36:38 1.0 4.4 1.0 N 74.0 7.0 2.0 17.5 0.5 0.5 0.00 0.0 0.3 18.80 0.0
3 1 2019-06-01 00:59:02 2019-06-01 00:59:12 0.0 0.8 1.0 N 145.0 145.0 2.0 2.5 1.0 0.5 0.00 0.0 0.3 4.30 0.0
4 1 2019-06-01 00:03:25 2019-06-01 00:15:42 1.0 1.7 1.0 N 113.0 148.0 1.0 9.5 3.0 0.5 2.65 0.0 0.3 15.95 2.5

Understand the dataset¶

The data was collected and provided by the NYC Taxi and Limousine Commission (TLC). Each record captures pick-up and drop-off dates/times, pick-up and drop-off locations, trip distance, itemized fares, rate type, payment type, driver-reported passenger count, and tip amount.

Each row represents a taxi trip taken in June 2019; the variable tip_amount is the prediction target.
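
Before cleaning, it helps to look at the target's summary statistics. A minimal sketch, using the DataFrame loaded above:

In [ ]:
# Summary statistics of the target: count, mean, spread, and extremes.
df['tip_amount'].describe()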

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3936004 entries, 0 to 3936003
Data columns (total 18 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   VendorID               int64  
 1   tpep_pickup_datetime   object 
 2   tpep_dropoff_datetime  object 
 3   passenger_count        float64
 4   trip_distance          float64
 5   RatecodeID             float64
 6   store_and_fwd_flag     object 
 7   PULocationID           float64
 8   DOLocationID           float64
 9   payment_type           float64
 10  fare_amount            float64
 11  extra                  float64
 12  mta_tax                float64
 13  tip_amount             float64
 14  tolls_amount           float64
 15  improvement_surcharge  float64
 16  total_amount           float64
 17  congestion_surcharge   float64
dtypes: float64(14), int64(1), object(3)
memory usage: 540.5+ MB
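
Note that the two datetime columns were read as plain strings (object dtype). As an optional alternative to converting them later, pandas can parse them at load time; a sketch of the equivalent read_csv call:

In [ ]:
# Optional: parse the datetime columns while reading the CSV,
# which makes the separate pd.to_datetime() conversions below unnecessary.
df = pd.read_csv(file_url, parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"])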

Preprocess the dataset¶

In [4]:
original_size = df.size

# Trips with a $0 recorded tip are assumed to have been tipped in cash (cash tips are not captured), so we drop them.
df = df[df['tip_amount'] > 0]

# We also remove outliers: rows where the tip exceeds the fare amount.
df = df[df['tip_amount'] <= df['fare_amount']]

# Convert 'tpep_dropoff_datetime' and 'tpep_pickup_datetime' columns to datetime objects.
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])

# Extract dropoff hour.
df['dropoff_hour'] = df['tpep_dropoff_datetime'].dt.hour

# Extract dropoff day of the week (0 = Monday, 6 = Sunday).
df['dropoff_day'] = df['tpep_dropoff_datetime'].dt.weekday

# Calculate trip time in seconds.
df['trip_time'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds()

# Drop unnecessary variables.
df = df.drop(['total_amount', 'VendorID', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'mta_tax', 
              'improvement_surcharge', 'congestion_surcharge', 'tpep_pickup_datetime', 'tpep_dropoff_datetime'], axis=1)

print("The dataset was reduced by {:.2f}%".format((1 - df.size / original_size) * 100))
The dataset was reduced by 61.71%
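
It is worth verifying that the filters and engineered features behaved as intended; a minimal sanity-check sketch:

In [ ]:
# Sanity checks: confirm the filters held and the engineered
# features fall in their expected ranges.
print("Zero or negative tips:", (df['tip_amount'] <= 0).sum())
print("Tips above the fare:  ", (df['tip_amount'] > df['fare_amount']).sum())
print("Hours out of range:   ", (~df['dropoff_hour'].between(0, 23)).sum())
print("Negative trip times:  ", (df['trip_time'] < 0).sum())
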
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2713009 entries, 4 to 3936000
Data columns (total 10 columns):
 #   Column           Dtype  
---  ------           -----  
 0   passenger_count  float64
 1   trip_distance    float64
 2   payment_type     float64
 3   fare_amount      float64
 4   extra            float64
 5   tip_amount       float64
 6   tolls_amount     float64
 7   dropoff_hour     int32  
 8   dropoff_day      int32  
 9   trip_time        float64
dtypes: float64(8), int32(2)
memory usage: 207.0 MB

Visualize the dataset¶

In [6]:
columns = list(df.columns)
columns.remove("tip_amount")
number_features = len(columns)
grid_rows = int(np.ceil(number_features / 3))
fig, axs = plt.subplots(grid_rows, 3, figsize=(15, 5 * grid_rows))

for ax, feature in zip(axs.flatten(), columns):
    # Low-cardinality features get count plots; the rest get histograms.
    if df[feature].nunique() <= 10:
        sns.countplot(data=df, x=feature, hue=feature, ax=ax, palette="tab10", legend=False)
    else:
        sns.histplot(data=df, x=feature, ax=ax, bins="doane")
    ax.set_xlabel("")
    ax.set_yscale("log")
    ax.set_title(feature)

for ax in axs.flatten()[number_features:]:
    ax.axis("off")

plt.tight_layout()
plt.show()
[Figure: one panel per feature — count plots (log-scaled counts) for low-cardinality features, log-scaled histograms for the rest]

Visualize the target feature¶

Plot the distribution of the target feature and its relationship with the most strongly correlated feature (largest absolute Pearson coefficient).

In [7]:
correlations = {}

for feature in columns:
    correlations[feature] = stats.pearsonr(df[feature], df["tip_amount"])[0]

max_correlation_feature = max(correlations, key=lambda key: abs(correlations[key]))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

sns.histplot(x="tip_amount", data=df, ax=ax1, bins="doane")
ax1.set_yscale("log")
ax1.set_title("Distribution of tip_amount")

sns.scatterplot(x=df[max_correlation_feature], y=df["tip_amount"], ax=ax2)
ax2.set_title(f"Correlation coefficient = {correlations[max_correlation_feature]:.2f}")

plt.show()
[Figure: log-scaled histogram of tip_amount (left); scatter plot of the most-correlated feature against tip_amount (right)]
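
For reference, the full set of coefficients can be inspected directly; a short sketch reusing the correlations dictionary computed above:

In [ ]:
# Sort all Pearson coefficients by absolute value, strongest first.
for feature, r in sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{feature:<16} {r:+.3f}")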

Convert categorical variables into dummy/indicator variables¶

In [8]:
categorical_columns = ["payment_type", "dropoff_hour", "dropoff_day"]
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
df.head()
Out[8]:
passenger_count trip_distance fare_amount extra tip_amount tolls_amount trip_time payment_type_2.0 payment_type_3.0 payment_type_4.0 ... dropoff_hour_20 dropoff_hour_21 dropoff_hour_22 dropoff_hour_23 dropoff_day_1 dropoff_day_2 dropoff_day_3 dropoff_day_4 dropoff_day_5 dropoff_day_6
4 1.0 1.70 9.5 3.0 2.65 0.0 737.0 False False False ... False False False False False False False False True False
5 2.0 1.60 9.5 3.0 1.00 0.0 652.0 False False False ... False False False False False False False False True False
7 2.0 1.20 7.5 3.0 1.00 0.0 488.0 False False False ... False False False False False False False False True False
9 1.0 8.60 31.5 3.0 7.05 0.0 2041.0 False False False ... False False False False False False False False True False
10 1.0 1.74 11.0 0.5 2.96 0.0 858.0 False False False ... False False False False False False False False True False

5 rows × 39 columns
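
With drop_first=True, the first level of each categorical variable becomes the implicit baseline (here payment_type 1.0, hour 0, and Monday), which avoids perfectly collinear dummy columns. A quick sketch to confirm the encoding:

In [ ]:
# Count the dummy columns generated for each original categorical variable.
for prefix in ["payment_type", "dropoff_hour", "dropoff_day"]:
    dummies = [c for c in df.columns if c.startswith(prefix)]
    print(f"{prefix}: {len(dummies)} dummy columns")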

Split the dataset into train and test subsets¶

In [9]:
X = df.drop("tip_amount", axis=1)
y = df["tip_amount"]

# Hold out a test set (train_test_split defaults to a 75% / 25% split).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (2034756, 38)
X_test shape: (678253, 38)

Train a Decision Tree regressor¶

In [10]:
regressor = DecisionTreeRegressor(max_depth=5)
regressor.fit(X_train, y_train)
Out[10]:
DecisionTreeRegressor(max_depth=5)
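
max_depth=5 keeps the tree small and fast to train, but the value is otherwise arbitrary. A hedged sketch of how one might tune it with cross-validation (subsampling the ~2 M training rows to keep the runtime reasonable):

In [ ]:
from sklearn.model_selection import cross_val_score

# Rough depth search on a random subsample of the training data.
sample = X_train.sample(100_000, random_state=0)
for depth in [3, 5, 8, 12]:
    scores = cross_val_score(DecisionTreeRegressor(max_depth=depth),
                             sample, y_train.loc[sample.index],
                             cv=3, scoring="r2")
    print(f"max_depth={depth:<2}  mean R² = {scores.mean():.3f}")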

Evaluate the Decision Tree regressor¶

In [11]:
y_pred = regressor.predict(X_test)

print(f"MSE = {mean_squared_error(y_test, y_pred):.2f}")
print(f"R² = {r2_score(y_test, y_pred):.2f}")
MSE = 1.68
R² = 0.77
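
To put these numbers in context, it helps to compare against a trivial baseline that always predicts the mean tip; a minimal sketch using scikit-learn's DummyRegressor:

In [ ]:
from sklearn.dummy import DummyRegressor

# Baseline: always predict the training-set mean tip.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
y_base = baseline.predict(X_test)
print(f"Baseline MSE = {mean_squared_error(y_test, y_base):.2f}")
print(f"Baseline R²  = {r2_score(y_test, y_base):.2f}")

By construction the baseline's R² is close to zero, so the tree's R² of 0.77 reflects genuine predictive signal.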