Iris Flower Species¶


Objective: Train a Logistic Regression classifier to predict Iris flower species.

Multiclass (or multinomial) classification is the problem of assigning each instance to exactly one of three or more classes. Here the classes are the three Iris species.
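As a rough illustration of how a multinomial logistic regression model picks one class out of several, the sketch below turns per-class scores into probabilities with the softmax function and selects the most probable class. The `softmax` helper and the score values are made up for this example; they are not taken from the trained model.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the three classes (illustrative values only)
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

# The predicted class is the index with the highest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)
```

The highest score always maps to the highest probability, so the prediction is simply the class with the largest score.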

Import libraries¶

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

Load the dataset¶

In [2]:
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
df = pd.concat([X, y.to_frame()], axis=1)
df.head()
Out[2]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

Understand the dataset¶

The well-known Iris flower dataset consists of 50 samples from each of three Iris species (Iris setosa, Iris versicolor and Iris virginica), encoded as 0, 1 and 2 in the target column. Four features were measured for each sample: the length and the width of the sepals and petals, in centimeters.
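The class balance and the label-to-species mapping described above can be checked directly against the dataset that scikit-learn ships. This is a small sanity check, assuming `load_iris(as_frame=True)` is available; the variable names are chosen for this example only.

```python
from sklearn import datasets

# Load the dataset as a Bunch so the species names are available too
iris = datasets.load_iris(as_frame=True)

# Count the samples per integer label, ordered by label
counts = iris.target.value_counts().sort_index()
names = list(iris.target_names)

for label, count in counts.items():
    print(label, names[label], count)
```

Each of the three labels should appear 50 times, and the order of `target_names` gives the species behind labels 0, 1 and 2.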

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB

Visualize the dataset¶

In [4]:
sns.pairplot(data=df, hue="target", diag_kind="hist", palette="dark")
plt.show()

Visualize the class distribution¶

In [5]:
target_feature = df.columns[-1]
class_names = ["setosa", "versicolor", "virginica"]
labels, sizes = np.unique(df[target_feature], return_counts=True)

fig, ax = plt.subplots()
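def format_slice(pct):
    # Label each slice with its percentage and the absolute sample count
    count = int(round(pct / 100 * sizes.sum()))
    return f"{pct:.2f}%\n({count:d})"

ax.pie(sizes, textprops={"color": "w", "fontsize": 12}, autopct=format_slice)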
ax.legend([str(i) + " (" + class_names[i] + ")" for i in labels])
ax.set_title(target_feature)
plt.show()

Split the dataset into train and test subsets¶

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (112, 4)
X_test shape: (38, 4)
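With no `test_size` given, `train_test_split` holds out 25% of the rows, which for 150 samples rounds up to 38. The split above is not stratified, so the classes are not evenly represented in the test subset (the support column of the evaluation report shows 13, 16 and 9 samples). The sketch below shows one possible alternative that passes `stratify=y` to keep the class proportions equal in both subsets; it is a variation for comparison, not the split used in this notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True, as_frame=True)

# Same 25% hold-out as the default, but stratified on the class labels
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Samples per class in the stratified test subset, ordered by label
test_counts = y_te.value_counts().sort_index().tolist()

print("train size:", len(X_tr))
print("test size:", len(X_te))
print("test samples per class:", test_counts)
```

Because 38 is not divisible by 3, the per-class test counts cannot be identical, but they should differ by at most one sample.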

Train a Logistic Regression classifier¶

In [7]:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
Out[7]:
LogisticRegression()
Parameters
penalty  'l2'
dual  False
tol  0.0001
C  1.0
fit_intercept  True
intercept_scaling  1
class_weight  None
random_state  None
solver  'lbfgs'
max_iter  100
multi_class  'deprecated'
verbose  0
warm_start  False
n_jobs  None
l1_ratio  None
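To get a feel for what the fitted model contains, the sketch below refits a classifier on the same split and inspects its learned parameters and predicted probabilities. It uses `max_iter=1000`, which is an assumption made here to give the solver room to converge; the notebook cell above keeps the default of 100. The name `model` is used to avoid clashing with the notebook's `classifier`.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter=1000 is an assumption for this sketch, not the notebook's setting
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# One row of coefficients per class, one column per feature
print("classes:", model.classes_)
print("coef_ shape:", model.coef_.shape)
print("intercept_ shape:", model.intercept_.shape)

# Class probabilities for the first five test samples; each row sums to 1
proba = model.predict_proba(X_test.iloc[:5])
print(np.round(proba, 3))
print("row sums:", proba.sum(axis=1))
```

`predict` returns, for each row, the class whose probability in `predict_proba` is largest.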

Evaluate the Logistic Regression classifier¶

In [8]:
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=[str(i) + " (" + class_names[i] + ")" for i in labels])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000        13
           1     1.0000    0.9375    0.9677        16
           2     0.9000    1.0000    0.9474         9

    accuracy                         0.9737        38
   macro avg     0.9667    0.9792    0.9717        38
weighted avg     0.9763    0.9737    0.9740        38

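The numbers in the classification report can be reproduced by hand from a confusion matrix. The matrix below is inferred from the report (supports of 13, 16 and 9, one versicolor sample predicted as virginica); it is an assumption reconstructed from those figures rather than a copy of the plotted matrix.

```python
# Confusion matrix inferred from the classification report
# (rows = true class, columns = predicted class). This is a reconstruction,
# not output copied from the notebook.
cm = [
    [13, 0, 0],   # setosa
    [0, 15, 1],   # versicolor: one sample predicted as virginica
    [0, 0, 9],    # virginica
]
n = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = sum(cm[i][i] for i in range(n)) / total

# Precision: diagonal over column sum; recall: diagonal over row sum
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]

# F1: harmonic mean of precision and recall
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

print("accuracy:", round(accuracy, 4))
print("precision:", [round(p, 4) for p in precision])
print("recall:", [round(r, 4) for r in recall])
print("f1:", [round(v, 4) for v in f1])
```

Under this reconstruction, the single misclassified sample accounts for both the 0.9375 recall of class 1 (15 of 16) and the 0.9000 precision of class 2 (9 of 10).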