Poisonous Mushrooms¶

Objective: Train a Support Vector Machine (SVM) model to predict whether or not a mushroom is poisonous.

Import libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import seaborn as sns
sns.set_style('whitegrid')

Load the dataset¶

In [2]:
file_url = 'https://raw.githubusercontent.com/LuisAngelMendozaVelasco/Applied_Data_Science_with_Python_Specialization/main/Applied_Machine_Learning_in_Python/Week2/Labs/data/mushrooms.csv'
df = pd.read_csv(file_url)
df.head()
Out[2]:
class cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color ... stalk-surface-below-ring stalk-color-above-ring stalk-color-below-ring veil-type veil-color ring-number ring-type spore-print-color population habitat
0 p x s n t p f c n k ... s w w p w o p k s u
1 e x s y t a f c b k ... s w w p w o p n n g
2 e b s w t l f c b n ... s w w p w o p n n m
3 p x y w t p f c n n ... s w w p w o p k s u
4 e x s g f n f w b k ... s w w p w o e n a g

5 rows × 23 columns

Understand the dataset¶

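The UCI Mushroom Data Set describes hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each sample is labeled either edible or poisonous; species of unknown edibility are grouped with the poisonous class. The target is the class column, and the remaining 22 columns are categorical features encoded as single letters.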

Variable Name             Description
class                     poisonous=p, edible=e
cap-shape                 bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap-surface               fibrous=f, grooves=g, scaly=y, smooth=s
cap-color                 brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises                   bruises=t, no=f
odor                      almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill-attachment           attached=a, descending=d, free=f, notched=n
gill-spacing              close=c, crowded=w, distant=d
gill-size                 broad=b, narrow=n
gill-color                black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk-shape               enlarging=e, tapering=t
stalk-root                bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk-surface-above-ring  fibrous=f, scaly=y, silky=k, smooth=s
stalk-surface-below-ring  fibrous=f, scaly=y, silky=k, smooth=s
stalk-color-above-ring    brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk-color-below-ring    brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil-type                 partial=p, universal=u
veil-color                brown=n, orange=o, white=w, yellow=y
ring-number               none=n, one=o, two=t
ring-type                 cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
spore-print-color         black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
population                abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
habitat                   grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring    8124 non-null   object
 15  stalk-color-below-ring    8124 non-null   object
 16  veil-type                 8124 non-null   object
 17  veil-color                8124 non-null   object
 18  ring-number               8124 non-null   object
 19  ring-type                 8124 non-null   object
 20  spore-print-color         8124 non-null   object
 21  population                8124 non-null   object
 22  habitat                   8124 non-null   object
dtypes: object(23)
memory usage: 1.4+ MB
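
df.info() reports no null values, but the variable table lists missing=? for stalk-root, so missing entries may be stored as the literal string '?', which pandas does not count as null. The sketch below shows one way to count such placeholders. It uses a small toy frame with made-up values rather than the real dataset, and the names toy, null_counts and placeholder_counts are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the mushroom data (values are illustrative only)
toy = pd.DataFrame({
    "stalk-root": ["b", "?", "e", "?", "c"],
    "odor": ["n", "f", "n", "a", "l"],
})

# isna() does not treat the '?' placeholder as missing...
null_counts = toy.isna().sum()

# ...so count the placeholder explicitly, column by column
placeholder_counts = (toy == "?").sum()

print(null_counts.to_dict())
print(placeholder_counts.to_dict())
```

Applied to the real frame, the same comparison would be `(df == "?").sum()`. Note that the one-hot encoding used later simply treats '?' as one more category of its column.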

Visualize some features of the dataset¶

In [4]:
column_index = [3, 4, 5, 9, 14, 15, 17, 19, 20, 21, 22]
number_features = len(column_index)
grid_rows = int(np.ceil(number_features / 3))
fig, axs = plt.subplots(grid_rows, 3, figsize=(15, 5 * grid_rows))

for ax, feature in zip(axs.flatten(), df.columns[column_index]):
    sns.countplot(data=df, x=feature, hue="class", ax=ax)
    ax.set_xlabel("")
    ax.set_title(feature)

for ax in axs.flatten()[number_features:]:
    ax.axis("off")

plt.tight_layout()
plt.show()
[Figure: count plots of the selected features, with bars split by class]

Visualize the class distribution¶

In [5]:
target_feature = df.columns[0]
labels, sizes = np.unique(df[target_feature], return_counts=True)

fig, ax = plt.subplots()
ax.pie(sizes, textprops={'color': "w", 'fontsize': '12'}, autopct=lambda pct: "{:.2f}%\n({:d})".format(pct, round(pct/100 * sum(sizes))))
ax.legend(["poisonous" if i == "p" else "edible" for i in labels])
ax.set_title(target_feature)
plt.show()
[Figure: pie chart of the class distribution, edible vs. poisonous]

Preprocess the dataset¶

In [6]:
preprocessed_df = pd.get_dummies(df, drop_first=True)
preprocessed_df.head()
Out[6]:
class_p cap-shape_c cap-shape_f cap-shape_k cap-shape_s cap-shape_x cap-surface_g cap-surface_s cap-surface_y cap-color_c ... population_n population_s population_v population_y habitat_g habitat_l habitat_m habitat_p habitat_u habitat_w
0 True False False False False True False True False False ... False True False False False False False False True False
1 False False False False False True False True False False ... True False False False True False False False False False
2 False False False False False False False True False False ... True False False False False False True False False False
3 True False False False False True False False True False ... False True False False False False False False True False
4 False False False False False True False True False False ... False False False False True False False False False False

5 rows × 96 columns
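
To see how pd.get_dummies with drop_first=True turns the letter-coded columns into indicator columns, here is a minimal sketch on a toy frame (the values are made up for illustration). A column with k distinct values yields k - 1 indicator columns, because the first category in sorted order becomes the implicit baseline, and a column with a single distinct value yields none.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the dataset's style
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],      # two categories -> one indicator column
    "odor": ["a", "l", "n", "n"],       # three categories -> two indicator columns
    "veil-type": ["p", "p", "p", "p"],  # one category -> no indicator column
})

encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
# ['class_p', 'odor_l', 'odor_n']
```

This is why the target ends up as a single class_p column, which is True for poisonous samples, and why a constant feature disappears from the encoded frame entirely.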

Split the dataset into train and test subsets¶

In [7]:
X = preprocessed_df.drop("class_p", axis=1)
y = preprocessed_df["class_p"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (6093, 95)
X_test shape: (2031, 95)

Get the most important features¶

In [8]:
decision_tree_classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
most_important_features = (
    pd.DataFrame({"Feature": X_train.columns, "Gini_importance": decision_tree_classifier.feature_importances_})
    .sort_values(by="Gini_importance", ascending=False).reset_index(drop=True)
    .query("Gini_importance >= 0.01")  # keep features with at least 1% importance
)
most_important_features
Out[8]:
Feature Gini_importance
0 odor_n 0.625144
1 stalk-root_c 0.169176
2 stalk-surface-below-ring_y 0.100325
3 spore-print-color_r 0.034375
4 odor_l 0.023504
5 stalk-color-above-ring_w 0.017094
6 spore-print-color_u 0.010353

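Find the optimal regularization parameter value for an SVM model¶

An SVM finds the hyperplane that separates the classes with the largest margin, that is, the greatest distance to the closest training points of each class, which are called support vectors. SVMs work well in high-dimensional spaces and can handle data that is not linearly separable by using kernel functions, which implicitly map the data into a higher-dimensional space where a separating boundary is easier to find. The regularization parameter C controls the trade-off between the two goals: small values favor a wider margin and tolerate more misclassified training points, while large values penalize misclassification more heavily. The grid search here tries five values of C with scikit-learn's default RBF kernel and scores them by cross-validated recall, so that as few poisonous mushrooms as possible are missed.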
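
The effect of a kernel is easiest to see on data that no straight line can separate. The following sketch is not part of the mushroom workflow: it assumes scikit-learn's synthetic make_circles data (one class forming a ring around the other) and compares a linear kernel with the RBF kernel. The variable names are chosen for this illustration only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data: an inner circle surrounded by an outer ring
X_demo, y_demo = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# A linear kernel can only draw a straight boundary through the plane
linear_acc = SVC(kernel="linear").fit(Xd_train, yd_train).score(Xd_test, yd_test)

# The RBF kernel can learn a closed boundary around the inner circle
rbf_acc = SVC(kernel="rbf").fit(Xd_train, yd_train).score(Xd_test, yd_test)

print(f"linear kernel accuracy: {linear_acc:.3f}")
print(f"RBF kernel accuracy:    {rbf_acc:.3f}")
```

On data like this the RBF kernel should score far higher than the linear one, which is the behavior the paragraph above describes.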

In [9]:
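# Candidate regularization strengths: 0.01, 0.1, 1, 10, 100
C = np.logspace(-2, 2, 5)
parameters = {'C': C}

# Cross-validated grid search (5 folds by default), scored by recall on the poisonous class
svc = SVC()
clf = GridSearchCV(svc, parameters, scoring='recall')
# Fit using only the features selected by the decision tree
clf.fit(X_train[most_important_features["Feature"]], y_train)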
Out[9]:
GridSearchCV(estimator=SVC(),
             param_grid={'C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             scoring='recall')
In [10]:
plt.figure()
plt.plot(C, clf.cv_results_['mean_test_score'])
plt.scatter(clf.best_params_['C'], clf.best_score_, color='red')
plt.xlabel("C")
plt.ylabel("Mean recall score")
plt.xscale("log")
plt.show()
[Figure: mean cross-validated recall versus C on a log-scaled x-axis, with the best C marked in red]

Evaluate the best SVM model¶

In [11]:
y_pred = clf.best_estimator_.predict(X_test[most_important_features["Feature"]])
print(classification_report(y_test, y_pred, digits=4))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=["False (edible)", "True (poisonous)"])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support

       False     1.0000    0.9906    0.9953      1061
        True     0.9898    1.0000    0.9949       970

    accuracy                         0.9951      2031
   macro avg     0.9949    0.9953    0.9951      2031
weighted avg     0.9951    0.9951    0.9951      2031

[Figure: confusion matrix of the test-set predictions]
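
The per-class figures in the report can be tied back to confusion-matrix counts. The counts in the sketch below are not read from the plot. They are inferred from the report's support and recall values (all 970 poisonous samples recovered, and a recall of 0.9906 on 1061 edible samples), so treat them as an assumption.

```python
# Counts inferred from the classification report (an assumption, not read from the plot)
tp = 970   # poisonous predicted as poisonous (recall 1.0000 on 970 samples)
fn = 0     # poisonous predicted as edible
tn = 1051  # edible predicted as edible (about 0.9906 * 1061)
fp = 10    # edible predicted as poisonous

precision_poisonous = tp / (tp + fp)
recall_poisonous = tp / (tp + fn)
recall_edible = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision (poisonous): {precision_poisonous:.4f}")
print(f"recall (poisonous):    {recall_poisonous:.4f}")
print(f"recall (edible):       {recall_edible:.4f}")
print(f"accuracy:              {accuracy:.4f}")
```

Under these counts the model misses no poisonous mushroom in the test set and flags about ten edible ones as poisonous, which is the trade-off one would expect from selecting C by recall.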