Taxi Tip¶
Objective: Train a Decision Tree regressor to predict the tip amount of a taxi trip.
Import libraries¶
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
sns.set_style("whitegrid")
Load the dataset¶
In [2]:
file_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/yellow_tripdata_2019-06.csv'
df = pd.read_csv(file_url)
df.head()
Out[2]:
| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2019-06-01 00:55:13 | 2019-06-01 00:56:17 | 1.0 | 0.0 | 1.0 | N | 145.0 | 145.0 | 2.0 | 3.0 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 4.30 | 0.0 |
| 1 | 1 | 2019-06-01 00:06:31 | 2019-06-01 00:06:52 | 1.0 | 0.0 | 1.0 | N | 262.0 | 263.0 | 2.0 | 2.5 | 3.0 | 0.5 | 0.00 | 0.0 | 0.3 | 6.30 | 2.5 |
| 2 | 1 | 2019-06-01 00:17:05 | 2019-06-01 00:36:38 | 1.0 | 4.4 | 1.0 | N | 74.0 | 7.0 | 2.0 | 17.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 18.80 | 0.0 |
| 3 | 1 | 2019-06-01 00:59:02 | 2019-06-01 00:59:12 | 0.0 | 0.8 | 1.0 | N | 145.0 | 145.0 | 2.0 | 2.5 | 1.0 | 0.5 | 0.00 | 0.0 | 0.3 | 4.30 | 0.0 |
| 4 | 1 | 2019-06-01 00:03:25 | 2019-06-01 00:15:42 | 1.0 | 1.7 | 1.0 | N | 113.0 | 148.0 | 1.0 | 9.5 | 3.0 | 0.5 | 2.65 | 0.0 | 0.3 | 15.95 | 2.5 |
Understand the dataset¶
The data was collected and provided by the NYC Taxi and Limousine Commission (TLC). The dataset records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, driver-reported passenger counts, and tip amount.
Each row in the dataset represents a taxi trip taken in June 2019; the variable tip_amount is the target variable.
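As an optional sanity check, the target's summary statistics can be inspected before any preprocessing:
In [ ]:
# Sketch: summary statistics of the target variable.
df['tip_amount'].describe()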
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3936004 entries, 0 to 3936003
Data columns (total 18 columns):
 #   Column                 Dtype
---  ------                 -----
 0   VendorID               int64
 1   tpep_pickup_datetime   object
 2   tpep_dropoff_datetime  object
 3   passenger_count        float64
 4   trip_distance          float64
 5   RatecodeID             float64
 6   store_and_fwd_flag     object
 7   PULocationID           float64
 8   DOLocationID           float64
 9   payment_type           float64
 10  fare_amount            float64
 11  extra                  float64
 12  mta_tax                float64
 13  tip_amount             float64
 14  tolls_amount           float64
 15  improvement_surcharge  float64
 16  total_amount           float64
 17  congestion_surcharge   float64
dtypes: float64(14), int64(1), object(3)
memory usage: 540.5+ MB
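Note that the two datetime columns load as plain strings (object dtype); they are converted during preprocessing below. As an alternative sketch, they could instead be parsed directly at load time:
In [ ]:
# Sketch: parse the datetime columns while reading the CSV.
df = pd.read_csv(file_url, parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])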
Preprocess the dataset¶
In [4]:
original_size = df.size
# Trips reporting a $0 tip were presumably paid in cash, so we drop those rows.
df = df[df['tip_amount'] > 0]
# We also remove outliers: rows where the tip exceeds the fare amount.
df = df[(df['tip_amount'] <= df['fare_amount'])]
# Convert 'tpep_dropoff_datetime' and 'tpep_pickup_datetime' columns to datetime objects.
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
# Extract dropoff hour.
df['dropoff_hour'] = df['tpep_dropoff_datetime'].dt.hour
# Extract dropoff day of the week (0 = Monday, 6 = Sunday).
df['dropoff_day'] = df['tpep_dropoff_datetime'].dt.weekday
# Calculate trip time in seconds.
df['trip_time'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds()
# Drop unnecessary variables.
df = df.drop(['total_amount', 'VendorID', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'mta_tax',
'improvement_surcharge', 'congestion_surcharge', 'tpep_pickup_datetime', 'tpep_dropoff_datetime'], axis=1)
print("The dataset was reduced by {:.2f}%".format((1 - df.size / original_size) * 100))
The dataset was reduced by 61.71%
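As an optional check that the filters and engineered features behaved as intended, a few assertions can be run (a minimal sketch):
In [ ]:
# Sketch: confirm the filters held and inspect the engineered trip_time.
assert (df['tip_amount'] > 0).all()                   # zero (cash) tips removed
assert (df['tip_amount'] <= df['fare_amount']).all()  # no tip larger than the fare
print("Trips with non-positive trip_time:", (df['trip_time'] <= 0).sum())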
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2713009 entries, 4 to 3936000
Data columns (total 10 columns):
 #   Column           Dtype
---  ------           -----
 0   passenger_count  float64
 1   trip_distance    float64
 2   payment_type     float64
 3   fare_amount      float64
 4   extra            float64
 5   tip_amount       float64
 6   tolls_amount     float64
 7   dropoff_hour     int32
 8   dropoff_day      int32
 9   trip_time        float64
dtypes: float64(8), int32(2)
memory usage: 207.0 MB
Visualize the dataset¶
In [6]:
columns = list(df.columns)
columns.remove("tip_amount")
number_features = len(columns)
grid_rows = int(np.ceil(number_features / 3))
fig, axs = plt.subplots(grid_rows, 3, figsize=(15, 5 * grid_rows))
for ax, feature in zip(axs.flatten(), columns):
    # Low-cardinality features get a count plot; the rest get a histogram.
    if df[feature].nunique() <= 10:
        sns.countplot(data=df, x=feature, hue=feature, ax=ax, palette="tab10", legend=False)
    else:
        sns.histplot(data=df, x=feature, ax=ax, bins="doane")
    ax.set_xlabel("")
    ax.set_yscale("log")
    ax.set_title(feature)
for ax in axs.flatten()[number_features:]:
ax.axis("off")
plt.tight_layout()
plt.show()
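With roughly 2.7 million rows, rendering these plots can be slow. One optional workaround is to plot from a random sample (a sketch; the sample size is an arbitrary choice):
In [ ]:
# Sketch: plot from a 100,000-row random sample to speed up rendering.
df_sample = df.sample(n=100_000, random_state=0)
sns.histplot(data=df_sample, x="trip_distance", bins="doane")
plt.yscale("log")
plt.show()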
Visualize the target feature¶
Plot the distribution of the target feature and its relationship with the feature that has the highest absolute Pearson correlation coefficient.
In [7]:
correlations = {}
for feature in columns:
correlations[feature] = stats.pearsonr(df[feature], df["tip_amount"])[0]
max_correlation_feature = max(correlations, key=lambda key: abs(correlations[key]))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
sns.histplot(x="tip_amount", data=df, ax=ax1, bins="doane")
ax1.set_yscale("log")
ax1.set_title("Distribution of tip_amount")
sns.scatterplot(x=df[max_correlation_feature], y=df["tip_amount"], ax=ax2)
ax2.set_title(f"Correlation coefficient = {correlations[max_correlation_feature]:.2f}")
plt.show()
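The same per-feature correlations can also be computed in a single pandas call (an equivalent alternative sketch):
In [ ]:
# Sketch: absolute Pearson correlation of every feature with the target, strongest first.
df.corr(numeric_only=True)["tip_amount"].drop("tip_amount").abs().sort_values(ascending=False)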
Convert categorical variables into dummy/indicator variables¶
In [8]:
categorical_columns = ["payment_type", "dropoff_hour", "dropoff_day"]
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
df.head()
Out[8]:
| | passenger_count | trip_distance | fare_amount | extra | tip_amount | tolls_amount | trip_time | payment_type_2.0 | payment_type_3.0 | payment_type_4.0 | ... | dropoff_hour_20 | dropoff_hour_21 | dropoff_hour_22 | dropoff_hour_23 | dropoff_day_1 | dropoff_day_2 | dropoff_day_3 | dropoff_day_4 | dropoff_day_5 | dropoff_day_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1.0 | 1.70 | 9.5 | 3.0 | 2.65 | 0.0 | 737.0 | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 5 | 2.0 | 1.60 | 9.5 | 3.0 | 1.00 | 0.0 | 652.0 | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 7 | 2.0 | 1.20 | 7.5 | 3.0 | 1.00 | 0.0 | 488.0 | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 9 | 1.0 | 8.60 | 31.5 | 3.0 | 7.05 | 0.0 | 2041.0 | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 10 | 1.0 | 1.74 | 11.0 | 0.5 | 2.96 | 0.0 | 858.0 | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
5 rows × 39 columns
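To verify the encoding, the newly created indicator columns can be listed (a minimal sketch; the prefixes follow the column names above):
In [ ]:
# Sketch: list the indicator columns created by get_dummies.
[c for c in df.columns if c.startswith(("payment_type_", "dropoff_hour_", "dropoff_day_"))]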
Split the dataset into train and test subsets¶
In [9]:
X = df.drop("tip_amount", axis=1)
y = df["tip_amount"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (2034756, 38)
X_test shape: (678253, 38)
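By default, train_test_split holds out 25% of the rows, which matches the shapes above. The same split written with an explicit test_size (a sketch, equivalent under scikit-learn's defaults):
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)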
Train a Decision Tree regressor¶
In [10]:
regressor = DecisionTreeRegressor(max_depth=5)
regressor.fit(X_train, y_train)
Out[10]:
DecisionTreeRegressor(max_depth=5)
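Once fitted, the tree can be inspected; for example, a minimal sketch of the impurity-based feature importances:
In [ ]:
# Sketch: top ten features by impurity-based importance.
importances = pd.Series(regressor.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))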
Evaluate the Decision Tree regressor¶
In [11]:
y_pred = regressor.predict(X_test)
print(f"MSE = {mean_squared_error(y_test, y_pred):.2f}")
print(f"R² = {r2_score(y_test, y_pred):.2f}")
MSE = 1.68
R² = 0.77
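The choice max_depth=5 is one fixed setting; deeper trees capture more structure but risk overfitting. A minimal sketch comparing a few depths on the same held-out set (random_state added for reproducibility):
In [ ]:
# Sketch: compare test-set R² across a few tree depths.
for depth in [3, 5, 8, 12]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print(f"max_depth={depth}: R² = {r2_score(y_test, model.predict(X_test)):.2f}")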