Price of Houses¶

Objective: Train a Decision Tree regressor to predict the median price of houses in various areas of Boston.

A Decision Tree regressor is a type of supervised learning algorithm used for regression tasks, where the goal is to predict a continuous value or a numerical output. It is a variant of the Decision Tree algorithm, which is commonly used for classification tasks.
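
To make the idea concrete, here is a minimal sketch on made-up data (the arrays `X_toy` and `y_toy` are illustrative and not part of the dataset used in this notebook). A regression tree that is limited to a single split divides the samples into two groups and predicts, for any new sample, the mean target of the group it falls into.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical): one feature, two clearly separated groups of targets.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

# max_depth=1 allows exactly one split, so the tree has two leaves.
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy)

# Each leaf predicts the mean of the training targets that reached it.
low_pred, high_pred = stump.predict([[2.5], [11.5]])
print("prediction for x=2.5: ", low_pred, "| mean of first group: ", y_toy[:3].mean())
print("prediction for x=11.5:", high_pred, "| mean of second group:", y_toy[3:].mean())
```

Deeper trees repeat the same idea recursively, which yields a piecewise-constant prediction function.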

Import libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
sns.set_style("whitegrid")

Load the dataset¶

In [2]:
file_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv"
df = pd.read_csv(file_url)
df.head()
Out[2]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3 222 18.7 NaN 36.2

Understand the dataset¶

The dataset describes areas/towns, not individual houses. The features are:

  • CRIM: Per capita crime rate by town
  • ZN: Proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS: Proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX: Nitric oxides concentration (parts per 10 million)
  • RM: Average number of rooms per dwelling
  • AGE: Proportion of owner-occupied units built prior to 1940
  • DIS: Weighted distances to five Boston employment centers
  • RAD: Index of accessibility to radial highways
  • TAX: Full-value property-tax rate per $10,000
  • PTRATIO: Pupil-teacher ratio by town
  • LSTAT: Percent lower status of the population
  • MEDV: Median value of owner-occupied homes in $1000s
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     486 non-null    float64
 1   ZN       486 non-null    float64
 2   INDUS    486 non-null    float64
 3   CHAS     486 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      486 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  LSTAT    486 non-null    float64
 12  MEDV     506 non-null    float64
dtypes: float64(11), int64(2)
memory usage: 51.5 KB
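
The `df.info()` output shows that six columns have 486 non-null values out of 506, that is, 20 missing entries each. Counting missing values per column makes this explicit. The sketch below uses a small made-up frame (`demo` is hypothetical) so that it runs without downloading the data; the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with a few missing entries.
demo = pd.DataFrame({
    "CRIM": [0.006, np.nan, 0.027, 0.032],
    "RM": [6.575, 6.421, 7.185, 6.998],
    "LSTAT": [4.98, 9.14, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = demo.isna().sum()

# Number of rows that contain at least one missing value;
# these are the rows that dropna() would remove.
rows_with_missing = demo.isna().any(axis=1).sum()

print(missing_per_column)
print("Rows with at least one missing value:", rows_with_missing)
```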

Visualize the dataset¶

In [4]:
number_features = df.shape[1] - 1
grid_rows = int(np.ceil(number_features / 3))
fig, axs = plt.subplots(grid_rows, 3, figsize=(15, 5 * grid_rows))

# Plot every feature except the target (MEDV, the last column)
for ax, feature in zip(axs.flatten(), df.columns[:-1]):
    if len(df[feature].unique()) <= 10:
        # Few distinct values: treat the feature as categorical
        sns.countplot(data=df, x=feature, hue=feature, ax=ax, palette="tab10", legend=False)
    else:
        # Many distinct values: treat the feature as continuous
        sns.histplot(data=df, x=feature, ax=ax, bins="doane")
    ax.set_xlabel("")
    ax.set_title(feature)

plt.tight_layout()
plt.show()

Preprocess the dataset¶

Drop rows with missing values (this leaves 394 of the original 506 rows):

In [5]:
df.dropna(inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 394 entries, 0 to 504
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     394 non-null    float64
 1   ZN       394 non-null    float64
 2   INDUS    394 non-null    float64
 3   CHAS     394 non-null    float64
 4   NOX      394 non-null    float64
 5   RM       394 non-null    float64
 6   AGE      394 non-null    float64
 7   DIS      394 non-null    float64
 8   RAD      394 non-null    int64  
 9   TAX      394 non-null    int64  
 10  PTRATIO  394 non-null    float64
 11  LSTAT    394 non-null    float64
 12  MEDV     394 non-null    float64
dtypes: float64(11), int64(2)
memory usage: 43.1 KB
In [6]:
X = df.drop("MEDV", axis=1)
y = df["MEDV"]
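
Dropping rows is the simplest option, but here it discards 112 of the 506 rows. One possible alternative, which this notebook does not use, is to fill each missing value with a column statistic such as the median. A minimal sketch with scikit-learn's `SimpleImputer` on a made-up frame (`demo` and `demo_filled` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature frame with missing entries.
demo = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.5],
    "LSTAT": [4.0, 9.0, np.nan, 5.0],
})

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
demo_filled = pd.DataFrame(
    imputer.fit_transform(demo),
    columns=demo.columns,
    index=demo.index,
)

print(demo_filled)
```

When imputation is combined with a train/test split, the imputer is usually fitted on the training subset only, so that no information from the test subset leaks into preprocessing.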

Visualize the target feature¶

Plot the distribution of the target feature, and the relationship between the target feature and the feature whose Pearson correlation coefficient with it is largest in absolute value.

In [7]:
correlations = {}

for feature in X.columns:
    correlations[feature] = stats.pearsonr(X[feature], y)[0]

max_correlation_feature = max(correlations, key=lambda key: abs(correlations[key]))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

sns.histplot(x="MEDV", data=df, ax=ax1)
ax1.set_title("Distribution of MEDV")

sns.scatterplot(x=X[max_correlation_feature], y=y, ax=ax2)
ax2.set_title(f"Correlation coefficient = {correlations[max_correlation_feature]:.2f}")

plt.show()
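
For reference, the Pearson coefficient returned by `stats.pearsonr` is the covariance of the two variables divided by the product of their standard deviations. The sketch below recomputes it by hand on synthetic data (the arrays `x` and `y_demo` are made up) and compares the result with SciPy.

```python
import numpy as np
from scipy import stats

# Synthetic data (hypothetical): y_demo decreases with x, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_demo = -2.0 * x + rng.normal(scale=0.5, size=200)

# Pearson r = sum of products of centered values,
# divided by the square root of the product of the sums of squares.
x_centered = x - x.mean()
y_centered = y_demo - y_demo.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered**2).sum() * (y_centered**2).sum()
)

r_scipy = stats.pearsonr(x, y_demo)[0]
print(f"manual = {r_manual:.4f}, scipy = {r_scipy:.4f}")
```

A strongly negative coefficient indicates a linear relationship just as much as a strongly positive one, which is why the cell above ranks features by the absolute value of the coefficient.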

Split the dataset into train and test subsets¶

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (295, 12)
X_test shape: (99, 12)
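
`train_test_split` is called above without a `test_size` argument, so scikit-learn's default of 25% for the test subset applies. The sketch below passes that value explicitly on placeholder arrays with the same shape as the cleaned data (`X_demo` and `y_demo` are hypothetical stand-ins, filled with zeros) to show where the 295/99 split comes from.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the cleaned dataset.
X_demo = np.zeros((394, 12))
y_demo = np.zeros(394)

# test_size=0.25 is what scikit-learn uses when neither test_size
# nor train_size is given.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print("train shape:", X_tr.shape)
print("test shape:", X_te.shape)
```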

Train a Decision Tree regressor¶

In [9]:
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
Out[9]:
DecisionTreeRegressor()
Parameters
criterion  'squared_error'
splitter  'best'
max_depth  None
min_samples_split  2
min_samples_leaf  1
min_weight_fraction_leaf  0.0
max_features  None
random_state  None
max_leaf_nodes  None
min_impurity_decrease  0.0
ccp_alpha  0.0
monotonic_cst  None
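
Two of the default settings are worth a closer look. With `max_depth=None` and `min_samples_leaf=1`, the tree keeps splitting until its leaves are pure, so it typically reproduces the training targets exactly. With `random_state=None`, repeated runs may produce slightly different trees when several candidate splits are equally good, so the metrics can vary between runs. The sketch below illustrates both points on synthetic data (all names and the data itself are hypothetical); fixing `random_state` is one way to make results repeatable.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)
X_new = rng.uniform(size=(50, 3))

# Two unconstrained trees trained with the same seed.
tree_a = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
tree_b = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# A fully grown tree fits the training data (almost) perfectly.
train_r2 = r2_score(y_demo, tree_a.predict(X_demo))

# With a fixed seed, both trees make identical predictions on unseen points.
same_predictions = np.array_equal(tree_a.predict(X_new), tree_b.predict(X_new))

print(f"depth = {tree_a.get_depth()}, leaves = {tree_a.get_n_leaves()}")
print(f"training R² = {train_r2:.2f}")
print("identical predictions with the same seed:", same_predictions)
```

Because the training fit is nearly perfect by construction, only the scores on held-out data say anything about how well such a tree generalizes.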

Evaluate the Decision Tree regressor¶

In [10]:
y_pred = regressor.predict(X_test)

print(f"MSE = {mean_squared_error(y_test, y_pred):.2f}")
print(f"R² = {r2_score(y_test, y_pred):.2f}")
MSE = 12.42
R² = 0.83
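
For interpretation: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. Since MEDV is expressed in $1000s, the MSE is in squared thousands of dollars; its square root (about 3.52 for MSE = 12.42) is easier to read, corresponding to a root mean squared error of roughly $3,500. The sketch below recomputes both metrics by hand on a few illustrative values (`y_true` reuses the first five MEDV values shown earlier, while `y_hat` is made up and is not model output).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# First five MEDV values from the table above, and made-up predictions.
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_hat = np.array([22.0, 23.0, 33.0, 35.0, 30.0])

residuals = y_true - y_hat

# MSE: mean of squared residuals.
mse_manual = np.mean(residuals**2)

# R²: 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE is in the same units as the target ($1000s).
rmse = np.sqrt(mse_manual)

print(f"MSE:  manual = {mse_manual:.4f}, sklearn = {mean_squared_error(y_true, y_hat):.4f}")
print(f"R²:   manual = {r2_manual:.4f}, sklearn = {r2_score(y_true, y_hat):.4f}")
print(f"RMSE: {rmse:.4f}")
```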