Tumor Classifier

Objective: Train a Support Vector Machine (SVM) model to classify human cell samples as benign or malignant.

Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression tasks. It is a widely used machine learning technique that is particularly effective in high-dimensional spaces and, through kernel functions, at modeling non-linear relationships between features.
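
As a rough illustration of the point about non-linear relationships, the sketch below compares a linear kernel with the RBF kernel on scikit-learn's synthetic make_circles data. The toy dataset and its parameters are assumptions chosen for illustration and are not part of this notebook's data.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data: one ring inside another (not linearly separable)
X_toy, y_toy = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# Same data, two kernels: a straight boundary versus a curved one
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
print(f"linear kernel accuracy: {linear_acc:.3f}")
print(f"RBF kernel accuracy:    {rbf_acc:.3f}")
```

On data like this, the linear kernel cannot draw a boundary around the inner ring, whereas the RBF kernel can.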

Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

Load the dataset

In [2]:
file_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv"
df = pd.read_csv(file_url)
df.head()
Out[2]:
        ID  Clump  UnifSize  UnifShape  MargAdh  SingEpiSize  BareNuc  BlandChrom  NormNucl  Mit  Class
0  1000025      5         1          1        1            2        1           3         1    1      2
1  1002945      5         4          4        5            7       10           3         2    1      2
2  1015425      3         1          1        1            2        2           3         1    1      2
3  1016277      6         8          8        1            3        4           3         7    1      2
4  1017023      4         1          1        3            2        1           3         1    1      2

Understand the dataset

The dataset consists of 699 records of human cell samples, each described by a set of cellular features. The Class field contains the diagnosis: benign (value = 2) or malignant (value = 4).

Field name   Description
ID           Patient identifier
Clump        Clump thickness
UnifSize     Uniformity of cell size
UnifShape    Uniformity of cell shape
MargAdh      Marginal adhesion
SingEpiSize  Single epithelial cell size
BareNuc      Bare nuclei
BlandChrom   Bland chromatin
NormNucl     Normal nucleoli
Mit          Mitoses
Class        Benign or malignant

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           699 non-null    int64 
 1   Clump        699 non-null    int64 
 2   UnifSize     699 non-null    int64 
 3   UnifShape    699 non-null    int64 
 4   MargAdh      699 non-null    int64 
 5   SingEpiSize  699 non-null    int64 
 6   BareNuc      699 non-null    object
 7   BlandChrom   699 non-null    int64 
 8   NormNucl     699 non-null    int64 
 9   Mit          699 non-null    int64 
 10  Class        699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB
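
BareNuc is the only column reported with dtype object, which suggests that it holds at least one value pandas could not parse as an integer. A minimal sketch for locating such entries follows. The toy Series and its "?" placeholder are assumptions for illustration, not values confirmed by the output above.

```python
import pandas as pd

# Hypothetical stand-in for df["BareNuc"]: numbers stored as strings plus a placeholder
bare_nuc = pd.Series(["1", "10", "?", "4"], name="BareNuc")

# errors="coerce" turns anything unparseable into NaN instead of raising
as_numbers = pd.to_numeric(bare_nuc, errors="coerce")

# Keep only the original entries that failed to parse
non_numeric = bare_nuc[as_numbers.isna()]
print(non_numeric.tolist())
```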

Preprocess the dataset

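Drop the rows whose BareNuc value is not numeric, then drop the ID field. The reported reduction is computed from df.size (rows × columns), so it includes the removed ID column as well as the removed rows: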

In [4]:
original_size = df.size
df = df[pd.to_numeric(df['BareNuc'], errors='coerce').notnull()].drop("ID", axis=1)
df['BareNuc'] = df['BareNuc'].astype('int')
print("The dataset was reduced by {:.2f}%".format((1 - df.size / original_size) * 100))
The dataset was reduced by 11.17%
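
Because df.size counts cells, the 11.17% figure mixes two effects: the dropped rows and the dropped ID column. The arithmetic below separates them using only shapes printed elsewhere in this notebook (699 × 11 before cleaning; 512 + 171 = 683 rows with 9 features plus Class afterwards). It is a back-of-the-envelope check, not a cell from the original notebook.

```python
# Shapes taken from the outputs in this notebook:
# 699 rows x 11 columns before cleaning; 512 + 171 = 683 rows x 10 columns after
rows_before, cols_before = 699, 11
rows_after, cols_after = 512 + 171, 10

# Reduction measured in cells (what df.size reports) versus in rows only
cell_reduction = (1 - (rows_after * cols_after) / (rows_before * cols_before)) * 100
row_reduction = (1 - rows_after / rows_before) * 100

print("Cells removed: {:.2f}%".format(cell_reduction))
print("Rows removed:  {:.2f}% ({} rows)".format(row_reduction, rows_before - rows_after))
```

Measured in rows, only 16 samples (about 2.29%) are lost to the non-numeric BareNuc filter.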

Visualize the dataset

In [ ]:
number_features = len(df.columns) - 1
grid_rows = int(np.ceil(number_features / 3))
fig, axs = plt.subplots(grid_rows, 3, figsize=(15, 5 * grid_rows))

for ax, feature in zip(axs.flatten(), df.columns[:-1]):
    # Count plot for features with few distinct values, histogram otherwise
    if df[feature].nunique() <= 10:
        sns.countplot(data=df, x=feature, hue="Class", ax=ax, palette="tab10")
    else:
        sns.histplot(data=df, x=feature, hue="Class", ax=ax)
    ax.set_xlabel("")
    ax.set_title(feature)

for ax in axs.flatten()[number_features:]:
    ax.axis("off")

plt.tight_layout() 
plt.show()
[Figure: grid of per-feature plots, colored by Class]

Visualize the class distribution

In [6]:
target_feature = df.columns[-1]
class_names = ["benign", "malignant"]
labels, sizes = np.unique(df[target_feature], return_counts=True)

fig, ax = plt.subplots()
ax.pie(sizes, textprops={'color': "w", 'fontsize': '12'}, autopct=lambda pct: "{:.2f}%\n({:d})".format(pct, round(pct/100 * sum(sizes))))
ax.legend([str(j) + " (" + class_names[i] + ")" for i, j in enumerate(labels)])
ax.set_title(target_feature)
plt.show()
[Figure: pie chart of the Class distribution]

Split the dataset into train and test subsets

In [7]:
X = df.drop("Class", axis=1)
y = df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (512, 9)
X_test shape: (171, 9)
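
train_test_split holds out 25% of the samples by default, which accounts for the 512/171 split above. Because the two classes are not equally frequent, passing stratify=y is one way to keep the class proportions similar in both subsets. The sketch below uses synthetic labels (an assumption for illustration, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 683 samples, about one third labelled 4 and the rest 2
X_demo = np.arange(683 * 2).reshape(683, 2)
y_demo = np.where(np.arange(683) % 3 == 0, 4, 2)

# stratify=y_demo asks the splitter to preserve the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)
print(X_tr.shape, X_te.shape)
print("share of class 4 in train:", round((y_tr == 4).mean(), 3))
print("share of class 4 in test: ", round((y_te == 4).mean(), 3))
```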

Train an SVM model

In [8]:
classifier = SVC()
classifier.fit(X_train, y_train) 
Out[8]:
SVC()
Parameters
C  1.0
kernel  'rbf'
degree  3
gamma  'scale'
coef0  0.0
shrinking  True
probability  False
tol  0.001
cache_size  200
class_weight  None
verbose  False
max_iter  -1
decision_function_shape  'ovr'
break_ties  False
random_state  None
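
SVC() is used here with its defaults (RBF kernel, C=1.0, gamma='scale'). RBF kernels depend on distances between samples, so features on very different scales are commonly standardized first. The features in this dataset appear to share a similar small integer range, so scaling may matter little here. The sketch below, on synthetic data from make_classification (an assumption for illustration), shows one common way to combine scaling with the classifier:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with 9 features, mirroring only the shape of the problem
X_syn, y_syn = make_classification(n_samples=300, n_features=9, random_state=0)

# The scaler is fit inside each training fold, so no test-fold statistics leak in
model = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(model, X_syn, y_syn, cv=5)
print("fold accuracies:", scores.round(3))
```

Wrapping both steps in a pipeline means the same object can later be passed to fit and predict in place of the bare classifier.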

Evaluate the model

In [9]:
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=[str(j) + " (" + class_names[i] + ")" for i, j in enumerate(labels)])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support

           2     0.9714    0.9533    0.9623       107
           4     0.9242    0.9531    0.9385        64

    accuracy                         0.9532       171
   macro avg     0.9478    0.9532    0.9504       171
weighted avg     0.9538    0.9532    0.9534       171

[Figure: confusion matrix for the test set]
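
The per-class figures in the report can be reproduced by hand. The counts below are reconstructed from the reported support, recall, and precision values (102 of 107 benign and 61 of 64 malignant samples classified correctly). They are inferred from the report, not read from the plotted matrix:

```python
# Rows are true classes, columns are predicted classes, order: [2 (benign), 4 (malignant)]
confusion = [[102, 5],
             [3, 61]]

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total

# Treat class 4 (malignant) as the positive class
tp, fn = confusion[1][1], confusion[1][0]
fp = confusion[0][1]
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  {accuracy:.4f}")
print(f"precision {precision:.4f}  recall {recall:.4f}  f1 {f1:.4f}")
```

With these counts, three malignant samples are predicted as benign and five benign samples are predicted as malignant. In a diagnostic setting the recall for the malignant class is often the figure of most interest, since each false negative is a missed malignant sample.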