Poisonous Mushrooms¶
Objective: train a Support Vector Machine (SVM) model to predict whether a mushroom is poisonous.
Import libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import seaborn as sns
sns.set_style('whitegrid')
Load the dataset¶
file_url = 'https://raw.githubusercontent.com/LuisAngelMendozaVelasco/Applied_Data_Science_with_Python_Specialization/main/Applied_Machine_Learning_in_Python/Week2/Labs/data/mushrooms.csv'
df = pd.read_csv(file_url)
df.head()
| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
5 rows × 23 columns
Understand the dataset¶
The UCI Mushroom Data Set contains descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families. Each sample is labeled as definitely edible or definitely poisonous (in this version of the data, the original "unknown edibility" category is folded into poisonous).
| Variable Name | Description |
|---|---|
| class | poisonous=p, edible=e |
| cap-shape | bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s |
| cap-surface | fibrous=f, grooves=g, scaly=y, smooth=s |
| cap-color | brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y |
| bruises | bruises=t, no=f |
| odor | almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s |
| gill-attachment | attached=a, descending=d, free=f, notched=n |
| gill-spacing | close=c, crowded=w, distant=d |
| gill-size | broad=b, narrow=n |
| gill-color | black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y |
| stalk-shape | enlarging=e, tapering=t |
| stalk-root | bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=? |
| stalk-surface-above-ring | fibrous=f, scaly=y, silky=k, smooth=s |
| stalk-surface-below-ring | fibrous=f, scaly=y, silky=k, smooth=s |
| stalk-color-above-ring | brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y |
| stalk-color-below-ring | brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y |
| veil-type | partial=p, universal=u |
| veil-color | brown=n, orange=o, white=w, yellow=y |
| ring-number | none=n, one=o, two=t |
| ring-type | cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z |
| spore-print-color | black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y |
| population | abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y |
| habitat | grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring    8124 non-null   object
 15  stalk-color-below-ring    8124 non-null   object
 16  veil-type                 8124 non-null   object
 17  veil-color                8124 non-null   object
 18  ring-number               8124 non-null   object
 19  ring-type                 8124 non-null   object
 20  spore-print-color         8124 non-null   object
 21  population                8124 non-null   object
 22  habitat                   8124 non-null   object
dtypes: object(23)
memory usage: 1.4+ MB
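All 23 columns are non-null strings, which is slightly misleading: the variable table above documents missing=? for stalk-root, and those placeholders are stored as the literal string '?' rather than NaN, so df.info() cannot flag them. A quick check (illustrative snippet, output not shown):

# The '?' placeholders in stalk-root are ordinary strings, invisible to df.info()
df['stalk-root'].value_counts()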
Visualize some features of the dataset¶
# Features to plot, addressed by column position
column_index = [3, 4, 5, 9, 14, 15, 17, 19, 20, 21, 22]
number_features = len(column_index)
grid_rows = int(np.ceil(number_features / 3))  # three plots per row
fig, axs = plt.subplots(grid_rows, 3, figsize=(15, 5 * grid_rows))

# One countplot per feature, split by class
for ax, feature in zip(axs.flatten(), df.columns[column_index]):
    sns.countplot(data=df, x=feature, hue="class", ax=ax)
    ax.set_xlabel("")
    ax.set_title(feature)

# Hide the unused axes in the last row
for ax in axs.flatten()[number_features:]:
    ax.axis("off")

plt.tight_layout()
plt.show()
Visualize the class distribution¶
target_feature = df.columns[0]
labels, sizes = np.unique(df[target_feature], return_counts=True)
fig, ax = plt.subplots()
ax.pie(sizes, textprops={'color': "w", 'fontsize': '12'}, autopct=lambda pct: "{:.2f}%\n({:d})".format(pct, round(pct/100 * sum(sizes))))
ax.legend(["poisonous" if i == "p" else "edible" for i in labels])
ax.set_title(target_feature)
plt.show()
Preprocess the dataset¶
preprocessed_df = pd.get_dummies(df, drop_first=True)
preprocessed_df.head()
| | class_p | cap-shape_c | cap-shape_f | cap-shape_k | cap-shape_s | cap-shape_x | cap-surface_g | cap-surface_s | cap-surface_y | cap-color_c | ... | population_n | population_s | population_v | population_y | habitat_g | habitat_l | habitat_m | habitat_p | habitat_u | habitat_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | False | False | False | False | True | False | True | False | False | ... | False | True | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | False | True | False | True | False | False | ... | True | False | False | False | True | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | True | False | False | ... | True | False | False | False | False | False | True | False | False | False |
| 3 | True | False | False | False | False | True | False | False | True | False | ... | False | True | False | False | False | False | False | False | True | False |
| 4 | False | False | False | False | False | True | False | True | False | False | ... | False | False | False | False | True | False | False | False | False | False |
5 rows × 96 columns
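With drop_first=True, a column with k distinct values becomes k − 1 indicator columns, since the dropped first level is implied when all the others are False; that is how the 23 raw columns expand into the 96 shown above (class included, as class_p). A minimal toy illustration (hypothetical, not part of the original notebook):

# 'a' is the dropped level: a row of all False encodes 'a'
pd.get_dummies(pd.Series(['a', 'b', 'c', 'a']), drop_first=True)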
Split the dataset into train and test subsets¶
X = preprocessed_df.drop("class_p", axis=1)
y = preprocessed_df["class_p"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (6093, 95) X_test shape: (2031, 95)
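By default, train_test_split holds out 25% of the rows, and a quarter of 8124 is exactly 2031, matching the shapes above. If you want the edible/poisonous ratio preserved in both subsets, a stratified split is a common variant (a sketch, not what this notebook ran):

# Keep the class proportions identical in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)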
Get the most important features¶
decision_tree_classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Rank features by Gini importance and keep those contributing at least 1%
most_important_features = pd.DataFrame(data={"Feature": X_train.columns,
                                             "Gini_importance": decision_tree_classifier.feature_importances_})\
                            .sort_values(by="Gini_importance", ascending=False)\
                            .reset_index(drop=True)\
                            .query("Gini_importance >= 0.01")
most_important_features
| | Feature | Gini_importance |
|---|---|---|
| 0 | odor_n | 0.625144 |
| 1 | stalk-root_c | 0.169176 |
| 2 | stalk-surface-below-ring_y | 0.100325 |
| 3 | spore-print-color_r | 0.034375 |
| 4 | odor_l | 0.023504 |
| 5 | stalk-color-above-ring_w | 0.017094 |
| 6 | spore-print-color_u | 0.010353 |
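For reference, a decision tree's feature_importances_ in scikit-learn is the normalized total reduction in Gini impurity contributed by splits on each feature, where a node holding class proportions $p_k$ has impurity

$$G = 1 - \sum_{k} p_k^2$$

By this measure, odor_n alone accounts for roughly 63% of the tree's impurity reduction, so a handful of one-hot features should already separate the classes well.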
Find the optimal regularization parameter value for an SVM model¶
An SVM model aims to find the hyperplane that separates the data points into classes while maximizing the margin to the closest points of each class, known as support vectors. SVMs are particularly effective in high-dimensional spaces and can handle both linearly separable and non-linearly separable datasets through kernel functions, which implicitly map the data into a higher-dimensional space where a separating boundary is easier to find. Here we tune the regularization parameter C with a grid search scored by recall on the poisonous class: the costly mistake is labeling a poisonous mushroom edible, and maximizing recall minimizes exactly those false negatives.
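Concretely, the soft-margin SVM solves (a standard formulation, shown for reference):

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0$$

A small C favors a wide margin and tolerates misclassified points (strong regularization); a large C penalizes them heavily and can overfit. The grid search below sweeps C across four orders of magnitude.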
C = np.logspace(-2, 2, 5)  # candidate values: 0.01, 0.1, 1, 10, 100
parameters = {'C': C}
svc = SVC()
# Grid search with (default) 5-fold cross-validation, scored by recall on the positive (poisonous) class
clf = GridSearchCV(svc, parameters, scoring='recall')
clf.fit(X_train[most_important_features["Feature"]], y_train)
GridSearchCV(estimator=SVC(),
             param_grid={'C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             scoring='recall')
plt.figure()
plt.plot(C, clf.cv_results_['mean_test_score'])
plt.scatter(clf.best_params_['C'], clf.best_score_, color='red')
plt.xlabel("C")
plt.ylabel("Mean recall score")
plt.xscale("log")
plt.show()
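The search selects C = 1 (the red point above). The chosen value and its cross-validated recall can also be printed directly (a small illustrative snippet, output not shown):

print("Best C:", clf.best_params_['C'])
print("Best CV recall:", clf.best_score_)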
Evaluate the best SVM model¶
y_pred = clf.best_estimator_.predict(X_test[most_important_features["Feature"]])
print(classification_report(y_test, y_pred, digits=4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=["False (edible)", "True (poisonous)"])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support

       False     1.0000    0.9906    0.9953      1061
        True     0.9898    1.0000    0.9949       970

    accuracy                         0.9951      2031
   macro avg     0.9949    0.9953    0.9951      2031
weighted avg     0.9951    0.9951    0.9951      2031
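The poisonous-class recall of 1.0000 means zero false negatives on the test set: with support 970,

$$\text{recall} = \frac{TP}{TP + FN} = \frac{970}{970 + 0} = 1$$

All 10 errors go the other way, edible mushrooms flagged as poisonous (hence the 0.9898 precision on the poisonous class), which is the safe direction for this task to err.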