CO2 Emissions¶
Objective: Train Simple and Multiple Linear Regression models to predict CO2 emissions from light-duty vehicles.
Simple Linear Regression is a statistical approach that models the relationship between a dependent variable and a single independent variable. The goal is to establish a linear relationship between the two variables, where the dependent variable is predicted based on the value of the independent variable.
Multiple Linear Regression is an extension of Simple Linear Regression, where the goal is to model the relationship between a dependent variable and multiple independent variables. In this case, the model attempts to explain the variation in the dependent variable using a combination of multiple independent variables.
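In equation form, Simple Linear Regression fits a line

$$\hat{y} = \beta_0 + \beta_1 x$$

while Multiple Linear Regression generalizes it to a hyperplane over $p$ features,

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

where $\hat{y}$ is the predicted value, $\beta_0$ the intercept, and $\beta_1, \dots, \beta_p$ the coefficients learned from the data.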
Import libraries¶
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats
sns.set_style("whitegrid")
Load the dataset¶
file_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv"
df = pd.read_csv(file_url)
df.head()
| | MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |
Understand the dataset¶
The fuel consumption dataset contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for light-duty vehicles.
- MODELYEAR e.g. 2014
- MAKE e.g. Acura
- MODEL e.g. ILX
- VEHICLE CLASS e.g. SUV
- ENGINE SIZE e.g. 4.7
- CYLINDERS e.g. 6
- TRANSMISSION e.g. A6
- FUEL TYPE e.g. Z
- FUEL CONSUMPTION in CITY (L/100 km) e.g. 9.9
- FUEL CONSUMPTION in HWY (L/100 km) e.g. 8.9
- FUEL CONSUMPTION COMB (L/100 km) e.g. 9.2
- FUEL CONSUMPTION COMB (mpg) e.g. 33
- CO2 EMISSIONS (g/km) e.g. 182
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067 entries, 0 to 1066
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   MODELYEAR                 1067 non-null   int64
 1   MAKE                      1067 non-null   object
 2   MODEL                     1067 non-null   object
 3   VEHICLECLASS              1067 non-null   object
 4   ENGINESIZE                1067 non-null   float64
 5   CYLINDERS                 1067 non-null   int64
 6   TRANSMISSION              1067 non-null   object
 7   FUELTYPE                  1067 non-null   object
 8   FUELCONSUMPTION_CITY      1067 non-null   float64
 9   FUELCONSUMPTION_HWY       1067 non-null   float64
 10  FUELCONSUMPTION_COMB      1067 non-null   float64
 11  FUELCONSUMPTION_COMB_MPG  1067 non-null   int64
 12  CO2EMISSIONS              1067 non-null   int64
dtypes: float64(4), int64(4), object(5)
memory usage: 108.5+ KB
Visualize some features of the dataset¶
# Select the features to plot: every numeric column, plus categorical (object)
# columns with at most 10 distinct values; df.columns[1:] skips MODELYEAR.
features = []
for feature in df.columns[1:]:
    if df[feature].dtype == 'O':
        if len(df[feature].unique()) <= 10:
            features.append(feature)
    else:
        features.append(feature)
number_features = len(features)
grid_rows = int(np.ceil(number_features / 3))
fig, axs = plt.subplots(grid_rows, 3, figsize=(15, 5 * grid_rows))
for ax, feature in zip(axs.flatten(), features):
    if len(df[feature].unique()) <= 10:
        sns.countplot(data=df, x=feature, hue=feature, ax=ax, palette="tab10", legend=False)
    else:
        sns.histplot(data=df, x=feature, ax=ax)
    ax.set_xlabel("")
    ax.set_title(feature)
for ax in axs.flatten()[number_features:]:
    ax.axis("off")
plt.tight_layout()
plt.show()
Visualize the relationship between the target and the feature with the highest correlation coefficient¶
features = ["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_CITY", "FUELCONSUMPTION_HWY", "FUELCONSUMPTION_COMB"]
correlations = {}
for feature in features:
    correlations[feature] = stats.pearsonr(df[feature], df["CO2EMISSIONS"])[0]
max_correlation_feature = max(correlations, key=correlations.get)
plt.figure()
sns.scatterplot(data=df, x=max_correlation_feature, y="CO2EMISSIONS")
plt.title(f"Correlation coefficient = {correlations[max_correlation_feature]:.2f}")
plt.show()
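As an optional cross-check (not part of the original lab), pandas can produce the same Pearson coefficients in a single call:

# Pearson correlation of every candidate feature with the target, sorted
print(df[features + ["CO2EMISSIONS"]].corr()["CO2EMISSIONS"].sort_values(ascending=False))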
Preprocess the dataset¶
df_subset = df[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_CITY", "FUELCONSUMPTION_HWY", "FUELCONSUMPTION_COMB", "CO2EMISSIONS"]]
df_subset.head()
| | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | CO2EMISSIONS |
|---|---|---|---|---|---|---|
| 0 | 2.0 | 4 | 9.9 | 6.7 | 8.5 | 196 |
| 1 | 2.4 | 4 | 11.2 | 7.7 | 9.6 | 221 |
| 2 | 1.5 | 4 | 6.0 | 5.8 | 5.9 | 136 |
| 3 | 3.5 | 6 | 12.7 | 9.1 | 11.1 | 255 |
| 4 | 3.5 | 6 | 12.1 | 8.7 | 10.6 | 244 |
Split the dataset into train and test subsets¶
X = df_subset.drop("CO2EMISSIONS", axis=1)
y = df_subset["CO2EMISSIONS"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (800, 5)
X_test shape: (267, 5)
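The 800/267 shapes follow from train_test_split's default test_size of 0.25 (the test set gets the ceiling of 1067 × 0.25 = 267 rows). Spelling the ratio out, the equivalent call would be:

# Equivalent split with the default ratio made explicit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)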
Train a Simple Linear Regression model¶
Use the FUELCONSUMPTION_CITY feature as training data.
simple_linear_regressor = linear_model.LinearRegression()
simple_linear_regressor.fit(X_train["FUELCONSUMPTION_CITY"].to_frame(), y_train)
LinearRegression()
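To inspect the fitted line itself, the learned parameters are exposed as attributes on the estimator (standard scikit-learn attributes; the printed values depend on the run):

# The fitted line: CO2EMISSIONS ≈ intercept_ + coef_[0] * FUELCONSUMPTION_CITY
print("Intercept:", simple_linear_regressor.intercept_)
print("Slope:", simple_linear_regressor.coef_[0])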
Evaluate the Simple Linear Regression model¶
The Mean Squared Error (MSE) quantifies the average squared difference between the predicted values and the actual values in a dataset.
The coefficient of determination (R²) provides an indication of goodness of fit and therefore a measure of how well unseen samples are likely to be predicted by the model, through the proportion of explained variance. Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse).
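For intuition, both metrics are easy to compute by hand. A minimal NumPy sketch with made-up values (y_true and y_hat stand in for y_test and y_pred):

y_true = np.array([196.0, 221.0, 136.0, 255.0, 244.0])  # actual emissions (g/km)
y_hat = np.array([203.0, 214.0, 140.0, 248.0, 251.0])   # hypothetical predictions
mse = np.mean((y_true - y_hat) ** 2)                    # average squared error
ss_res = np.sum((y_true - y_hat) ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)          # total sum of squares
r2 = 1 - ss_res / ss_tot                                # proportion of variance explained
print(f"MSE = {mse:.2f}, R² = {r2:.2f}")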
y_pred = simple_linear_regressor.predict(X_test["FUELCONSUMPTION_CITY"].to_frame())
print(f"MSE = {mean_squared_error(y_test, y_pred):.2f}")
print(f"R² = {r2_score(y_test, y_pred):.2f}")
plt.figure()
sns.scatterplot(x=X_test["FUELCONSUMPTION_CITY"], y=y_test)
sns.scatterplot(x=X_test["FUELCONSUMPTION_CITY"], y=y_pred)
plt.legend(["y_test", "y_pred"])
plt.show()
MSE = 755.36
R² = 0.81
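Once fitted, the model can also score new inputs. For example, for a hypothetical vehicle consuming 12 L/100 km in the city (the value is made up for illustration):

# Predict emissions for a hypothetical city consumption of 12 L/100 km
new_point = pd.DataFrame({"FUELCONSUMPTION_CITY": [12.0]})
print(simple_linear_regressor.predict(new_point))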
Train a Multiple Linear Regression model¶
multiple_linear_regressor = linear_model.LinearRegression()
multiple_linear_regressor.fit(X_train, y_train)
LinearRegression()
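With five inputs there is now one coefficient per feature; pairing them with the column names shows each feature's weight in the fitted hyperplane:

# One learned coefficient per input column
for name, coef in zip(X_train.columns, multiple_linear_regressor.coef_):
    print(f"{name}: {coef:.2f}")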
Evaluate the Multiple Linear Regression model¶
y_pred = multiple_linear_regressor.predict(X_test)
print(f"MSE = {mean_squared_error(y_test, y_pred):.2f}")
print(f"R² = {r2_score(y_test, y_pred):.2f}")
MSE = 597.46
R² = 0.85
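As a sanity check, the model can score a single hypothetical vehicle. The values below mirror the first row of the dataset (actual CO2EMISSIONS = 196), so the prediction should land nearby:

# Hypothetical vehicle matching row 0; column order must match X_train
new_vehicle = pd.DataFrame({
    "ENGINESIZE": [2.0],
    "CYLINDERS": [4],
    "FUELCONSUMPTION_CITY": [9.9],
    "FUELCONSUMPTION_HWY": [6.7],
    "FUELCONSUMPTION_COMB": [8.5],
})
print(multiple_linear_regressor.predict(new_vehicle))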