Tumor Classifier¶
Objective: Train a Support Vector Machine (SVM) model to classify human cell samples as benign or malignant.
Support Vector Machine (SVM) is a type of supervised learning algorithm used for classification and regression tasks. It is a powerful and widely used machine learning technique, particularly effective in high-dimensional spaces and with non-linear relationships between features.
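To make the idea concrete, here is a minimal, self-contained sketch (not part of the lab) that fits an RBF-kernel SVM to a synthetic two-class dataset; make_moons and the hyperparameter values are chosen purely for illustration:

# Minimal illustration (not from the lab): an RBF-kernel SVM fitted
# to a toy dataset that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_toy, y_toy = make_moons(n_samples=200, noise=0.2, random_state=0)
toy_clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # library defaults
toy_clf.fit(X_toy, y_toy)
print("Toy training accuracy:", toy_clf.score(X_toy, y_toy))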
Import libraries¶
In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
Load the dataset¶
In [2]:
file_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv"
df = pd.read_csv(file_url)
df.head()
Out[2]:
| | ID | Clump | UnifSize | UnifShape | MargAdh | SingEpiSize | BareNuc | BlandChrom | NormNucl | Mit | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
Understand the dataset¶
The dataset consists of several hundred records of human cell samples, each containing the values of a set of cellular features. The Class field contains the diagnosis: benign (value = 2) or malignant (value = 4).
| Field name | Description |
|---|---|
| ID | Patient identifier |
| Clump | Clump thickness |
| UnifSize | Uniformity of cell size |
| UnifShape | Uniformity of cell shape |
| MargAdh | Marginal adhesion |
| SingEpiSize | Single epithelial cell size |
| BareNuc | Bare nuclei |
| BlandChrom | Bland chromatin |
| NormNucl | Normal nucleoli |
| Mit | Mitoses |
| Class | Benign or malignant |
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   ID           699 non-null    int64
 1   Clump        699 non-null    int64
 2   UnifSize     699 non-null    int64
 3   UnifShape    699 non-null    int64
 4   MargAdh      699 non-null    int64
 5   SingEpiSize  699 non-null    int64
 6   BareNuc      699 non-null    object
 7   BlandChrom   699 non-null    int64
 8   NormNucl     699 non-null    int64
 9   Mit          699 non-null    int64
 10  Class        699 non-null    int64
dtypes: int64(10), object(1)
memory usage: 60.2+ KB
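Note that BareNuc is the only column typed object rather than int64, which means it contains entries that are not plain integers. A quick check (a sketch; it makes no assumption about which placeholder the file uses) isolates them:

# Find the BareNuc entries that fail numeric conversion; these are the
# rows the preprocessing step below will drop.
non_numeric = df[pd.to_numeric(df['BareNuc'], errors='coerce').isnull()]
print(non_numeric['BareNuc'].unique())
print("Rows affected:", len(non_numeric))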
Preprocess the dataset¶
Drop the rows whose BareNuc value cannot be parsed as a number (these are the missing-value entries that force the column's object dtype), and remove the ID field, which carries no predictive information:
In [4]:
original_size = df.size
df = df[pd.to_numeric(df['BareNuc'], errors='coerce').notnull()].drop("ID", axis=1)
df['BareNuc'] = df['BareNuc'].astype('int')
print("The dataset was reduced by {:.2f}%".format((1 - df.size / original_size) * 100))
The dataset was reduced by 11.17%
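The same cleanup can also be written in explicit steps, which some readers find easier to follow. A hedged, equivalent sketch (df_alt is an illustrative scratch name; it re-reads the file so the frame above is unaffected):

# Equivalent step-by-step cleanup on a scratch copy of the data.
df_alt = pd.read_csv(file_url)
df_alt['BareNuc'] = pd.to_numeric(df_alt['BareNuc'], errors='coerce')  # non-numeric entries become NaN
df_alt = df_alt.dropna(subset=['BareNuc']).drop(columns='ID')
df_alt['BareNuc'] = df_alt['BareNuc'].astype(int)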
Visualize the dataset¶
In [5]:
number_features = len(df.columns) - 1
grid_rows = int(np.ceil(number_features / 3))
fig, axs = plt.subplots(grid_rows, 3, figsize=(15, 5 * grid_rows))
for ax, feature in zip(axs.flatten(), df.columns[:-1]):
if len(df[feature].unique()) <= 10:
sns.countplot(data=df, x=feature, hue="Class", ax=ax, palette="tab10")
ax.set_xlabel("")
ax.set_title(feature)
else:
sns.histplot(data=df, x=feature, hue="Class", ax=ax)
ax.set_xlabel("")
ax.set_title(feature)
for ax in axs.flatten()[number_features:]:
ax.axis("off")
plt.tight_layout()
plt.show()
Visualize the class distribution¶
In [6]:
target_feature = df.columns[-1]
class_names = ["benign", "malignant"]
labels, sizes = np.unique(df[target_feature], return_counts=True)
fig, ax = plt.subplots()
ax.pie(sizes, textprops={'color': "w", 'fontsize': '12'}, autopct=lambda pct: "{:.2f}%\n({:d})".format(pct, round(pct/100 * sum(sizes))))
ax.legend([str(j) + " (" + class_names[i] + ")" for i, j in enumerate(labels)])
ax.set_title(target_feature)
plt.show()
Split the dataset into train and test subsets¶
In [7]:
X = df.drop("Class", axis=1)
y = df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (512, 9)
X_test shape: (171, 9)
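By default, train_test_split holds out 25% of the rows for testing, which is where the 512/171 split comes from. When the classes are imbalanced, a stratified split is a common variant that preserves the benign/malignant ratio in both subsets; a sketch (not used in the rest of the lab, hence the distinct names):

# Variant (sketch): stratify on y so the train and test subsets keep
# the same class ratio as the full dataset.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
print("Train class counts:", y_tr.value_counts().to_dict())
print("Test class counts:", y_te.value_counts().to_dict())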
Train an SVM model¶
In [8]:
classifier = SVC()
classifier.fit(X_train, y_train)
Out[8]:
SVC()
Parameters
| Parameter | Value |
|---|---|
| C | 1.0 |
| kernel | 'rbf' |
| degree | 3 |
| gamma | 'scale' |
| coef0 | 0.0 |
| shrinking | True |
| probability | False |
| tol | 0.001 |
| cache_size | 200 |
| class_weight | None |
| verbose | False |
| max_iter | -1 |
| decision_function_shape | 'ovr' |
| break_ties | False |
| random_state | None |
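Since SVC() was constructed with no arguments, the values above are scikit-learn's defaults; in particular the kernel is 'rbf'. Other kernels can be tried by passing the kernel parameter; a sketch comparing the built-in options on the same split (the resulting accuracies are not part of this lab):

# Sketch: refit the SVM with each built-in kernel and compare test accuracy.
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    print(kernel, "test accuracy: {:.4f}".format(clf.score(X_test, y_test)))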
Evaluate the model¶
In [9]:
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=[str(j) + " (" + class_names[i] + ")" for i, j in enumerate(labels)])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support

           2     0.9714    0.9533    0.9623       107
           4     0.9242    0.9531    0.9385        64

    accuracy                         0.9532       171
   macro avg     0.9478    0.9532    0.9504       171
weighted avg     0.9538    0.9532    0.9534       171
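The scores above come from a single train/test split. For a more stable estimate, k-fold cross-validation averages over several splits; a sketch using scikit-learn's cross_val_score (5 folds chosen arbitrarily, results not reported here):

from sklearn.model_selection import cross_val_score

# Sketch: 5-fold cross-validated accuracy of a default SVC on the
# full cleaned dataset.
scores = cross_val_score(SVC(), X, y, cv=5)
print("5-fold CV accuracy: {:.4f} +/- {:.4f}".format(scores.mean(), scores.std()))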