Iris Flower Species¶
Objective: Train a model using the classic iris dataset for multi-class classification.
Multiclass (or multinomial) classification is the problem of classifying instances into one of three or more classes; classifying instances into one of two classes is called binary classification.
Import libraries¶
In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
Load the dataset¶
In [2]:
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
df = pd.concat([X, y.to_frame()], axis=1)
df.head()
Out[2]:
|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
Understand the dataset¶
The dataset consists of 50 samples of each of three Iris species (Iris setosa, Iris virginica and Iris versicolor). Four features were measured for each sample: the length and the width of the sepals and petals, in centimeters.
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
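The `target` column is just an integer code for the species. As a quick sketch (not one of the original cells), the names behind those codes can be recovered from the full scikit-learn Bunch object, assuming `load_iris()` is called without `return_X_y`:

# Sketch: map the integer labels back to species names via the Bunch attributes
iris = datasets.load_iris()
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.feature_names)  # the four measured columns shown above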
Visualize the dataset¶
In [4]:
sns.pairplot(data=df, hue="target", diag_kind="hist", palette="dark")
plt.show()
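If a more readable legend is preferred, the integer codes can be swapped for species names before plotting. This is an optional sketch on top of the cell above, not part of the original notebook:

# Sketch: relabel the hue column with species names for the pairplot legend
named = df.copy()
named["target"] = named["target"].map(dict(enumerate(datasets.load_iris().target_names)))
sns.pairplot(data=named, hue="target", diag_kind="hist", palette="dark")
plt.show()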
Visualize the class distribution¶
In [5]:
labels, sizes = np.unique(df["target"], return_counts=True)
fig, ax = plt.subplots()
# Label each wedge with its percentage and the corresponding absolute count
ax.pie(
    sizes,
    textprops={"color": "w", "fontsize": 12},
    autopct=lambda pct: "{:.2f}%\n({:d})".format(pct, round(pct / 100 * sum(sizes))),
)
ax.legend(labels)
ax.set_title("target")
plt.show()
Split the dataset into train and test subsets¶
In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (112, 4)
X_test shape: (38, 4)
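Because the classes are perfectly balanced (50 samples each), a plain random split already works well here. For datasets where that is not guaranteed, a stratified split preserves the class proportions in both subsets. A minimal sketch, assuming the `stratify` argument of `train_test_split` (not used in the cell above):

# Sketch: stratified split that keeps the class balance in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=0, stratify=y
)
print(y_train_s.value_counts())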
Train a Logistic Regression model¶
In [7]:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
Out[7]:
LogisticRegression()
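For the multiclass problem, the fitted model learns one coefficient vector per class. A quick sketch for inspecting the fitted attributes (not part of the original cells):

# Sketch: the fitted model holds one row of coefficients per class
print(classifier.classes_)         # class labels, e.g. [0 1 2]
print(classifier.coef_.shape)      # (n_classes, n_features) -> (3, 4)
print(classifier.intercept_.shape) # (3,)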
Evaluate the model¶
In [8]:
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.grid(False)
plt.show()
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000        13
           1     1.0000    0.9375    0.9677        16
           2     0.9000    1.0000    0.9474         9

    accuracy                         0.9737        38
   macro avg     0.9667    0.9792    0.9717        38
weighted avg     0.9763    0.9737    0.9740        38
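The scores above come from a single train/test split. For a dataset this small, a k-fold cross-validation estimate is less sensitive to which 38 samples end up in the test set. A minimal sketch, assuming `cross_val_score` with `cv=5` and a raised `max_iter` (both beyond what the notebook above uses):

# Sketch: 5-fold cross-validated accuracy on the full dataset
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)
print(scores.mean(), scores.std())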