Spam Detection¶

Objective: Train multiple classifiers and evaluate their effectiveness in predicting whether a message is spam or not.

Import libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.svm import SVC
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set_style("whitegrid")

Load the dataset¶

In [2]:
file_url = 'https://raw.githubusercontent.com/LuisAngelMendozaVelasco/Applied_Data_Science_with_Python_Specialization/main/Applied_Text_Mining_in_Python/Week3/Labs/data/spam.csv'
df = pd.read_csv(file_url)
df['target'] = np.where(df['target'] == 'spam', 1, 0)
df.head()
Out[2]:
text target
0 Go until jurong point, crazy.. Available only ... 0
1 Ok lar... Joking wif u oni... 0
2 Free entry in 2 a wkly comp to win FA Cup fina... 1
3 U dun say so early hor... U c already then say... 0
4 Nah I don't think he goes to usf, he lives aro... 0

Understand the dataset¶

The SMS Spam Collection dataset is a widely used benchmark for text classification, specifically for identifying spam SMS messages. It contains 5,572 SMS messages in English, each labeled as either "ham" (non-spam) or "spam"; the cell above maps these labels to 0 and 1.

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5572 non-null   object
 1   target  5572 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 87.2+ KB

Visualize the class distribution¶

In [4]:
target_feature = df.columns[-1]
class_names = ["non-spam", "spam"]
labels, sizes = np.unique(df[target_feature], return_counts=True)

fig, ax = plt.subplots()
ax.pie(sizes, textprops={'color': "w", 'fontsize': '12'}, autopct=lambda pct: "{:.2f}%\n({:d})".format(pct, round(pct/100 * sum(sizes))))
ax.legend([str(i) + " (" + class_names[i] + ")" for i in labels])
ax.set_title(target_feature)
plt.show()
[Figure: pie chart of the class distribution (non-spam vs. spam)]

Preprocess the dataset¶

Convert the messages to a matrix of token counts¶

Count vectorization is a method used in Natural Language Processing (NLP) to convert text documents into numerical vectors based on the frequency of words or tokens. It involves tokenizing the text, which means breaking it down into individual words or tokens, and then counting the occurrences of each token in the document. This process results in a matrix where each row represents a document and each column represents a unique token, with the cell values indicating the frequency of each token in the corresponding document.
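
As a minimal illustration, here is count vectorization applied to a made-up two-message corpus (not the SMS data). Note that CountVectorizer's default token pattern only keeps tokens of two or more characters, so the word "a" is dropped:

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["win a free prize now", "are you free now"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)  # sparse 2x6 matrix

print(toy_vectorizer.get_feature_names_out())
# ['are' 'free' 'now' 'prize' 'win' 'you']
print(toy_counts.toarray())
# [[0 1 1 1 1 0]
#  [1 1 1 0 0 1]]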

In [5]:
vectorizer = CountVectorizer()
vectorizer.fit(df['text'])
Out[5]:
CountVectorizer()

The longest tokens in the vocabulary:

In [6]:
tokens = [(token, len(token)) for token in vectorizer.vocabulary_.keys()]
print("There are {} unique tokens.".format(len(tokens)))
tokens = pd.DataFrame(sorted(tokens, key=lambda item: item[1], reverse=True), columns=["token", "length"])
tokens.head(10)
There are 8672 unique tokens.
Out[6]:
token length
0 hypotheticalhuagauahahuagahyuhagga 34
1 com1win150ppmx3age16subscription 32
2 minmoremobsemspobox45po139wa 28
3 50pmmorefrommobile2bremoved 27
4 minmobsmorelkpobox177hp51fl 27
5 150ppmpobox10183bhamb64xe 25
6 callcost150ppmmobilesvary 25
7 150ppermesssubscription 23
8 tscs087147403231winawk 22
9 datebox1282essexcm61xn 22
In [7]:
plt.figure()
sns.histplot(data=tokens, x="length", bins="doane")
plt.title("Distribution of token lengths")
plt.xlabel("Token length")
plt.yscale("log")
plt.show()
[Figure: histogram of token lengths (log-scaled y-axis)]

Get the length of the messages¶

In [8]:
df['length'] = df['text'].str.len()
nonspam_mean_length = np.mean(df[df['target'] == 0]['length'])
spam_mean_length = np.mean(df[df['target'] == 1]['length'])

print("Average length of spam messages: {:.2f}".format(spam_mean_length))
print("Average length of non-spam messages: {:.2f}".format(nonspam_mean_length))
Average length of spam messages: 138.87
Average length of non-spam messages: 71.02
In [9]:
plt.figure()
sns.histplot(data=df, x="length", hue="target", bins="doane")
plt.title("Distribution of message lengths")
plt.xlabel("Message length")
plt.yscale("log")
plt.show()
[Figure: histogram of message lengths by class (log-scaled y-axis)]

Get the number of digits in the messages¶

In [10]:
df['digits'] = df['text'].str.findall(r'\d').str.len()
nonspam_mean_digits = np.mean(df[df['target'] == 0]['digits'])
spam_mean_digits = np.mean(df[df['target'] == 1]['digits'])

print("Average number of digits in spam messages: {:.2f}".format(spam_mean_digits))
print("Average number of digits in not spam messages: {:.2f}".format(nonspam_mean_digits))
Average number of digits in spam messages: 15.76
Average number of digits in not spam messages: 0.30
In [11]:
plt.figure()
sns.histplot(data=df, x="digits", hue="target", bins="doane")
plt.title("Distribution of message digit counts")
plt.xlabel("Message digit count")
plt.yscale("log")
plt.show()
[Figure: histogram of message digit counts by class (log-scaled y-axis)]

Get the number of non-word characters in the messages¶

In [12]:
df['non_word'] = df['text'].str.findall(r'\W').str.len()
nonspam_mean_non_word = np.mean(df[df['target'] == 0]['non_word'])
spam_mean_non_word = np.mean(df[df['target'] == 1]['non_word'])

print("Average number of non-word characters in spam messages: {:.2f}".format(spam_mean_non_word))
print("Average number of non-word characters in non-spam messages: {:.2f}".format(nonspam_mean_non_word))
Average number of non-word characters in spam messages: 29.04
Average number of non-word characters in non-spam messages: 17.29
In [13]:
plt.figure()
sns.histplot(data=df, x="non_word", hue="target", bins="doane")
plt.title("Distribution of message non-word counts")
plt.xlabel("Message non-word count")
plt.yscale("log")
plt.show()
[Figure: histogram of message non-word character counts by class (log-scaled y-axis)]
In [14]:
df.head()
Out[14]:
text target length digits non_word
0 Go until jurong point, crazy.. Available only ... 0 111 0 28
1 Ok lar... Joking wif u oni... 0 29 0 11
2 Free entry in 2 a wkly comp to win FA Cup fina... 1 155 25 33
3 U dun say so early hor... U c already then say... 0 49 0 16
4 Nah I don't think he goes to usf, he lives aro... 0 61 0 14
In [15]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

X_vectorized = vectorizer.transform(df['text'])
X_vectorized = add_feature(X_vectorized, [df['length'], df['digits'], df['non_word']])
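
Appending the three engineered features (length, digits, non_word) to the 8,672 token-count columns gives a matrix with 8,675 columns, which is the width seen in the train/test split below.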

Split the dataset into train and test subsets¶

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, df['target'], random_state=0)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (4179, 8675)
X_test shape: (1393, 8675)

Train a Multinomial Naive Bayes classifier¶

A Multinomial Naive Bayes classifier is particularly effective in text classification and natural language processing applications. It assumes that the features are discrete counts or frequencies, such as word counts in documents, and it models the likelihood of these features using a multinomial distribution. This classifier is based on Bayes' theorem and assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. It is widely used for tasks like spam filtering, document classification, sentiment analysis, and customer segmentation.
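
As a minimal sketch of the decision rule on toy counts (not the SMS data): MultinomialNB combines each class's log-prior with Laplace-smoothed per-token log-probabilities and predicts the higher-scoring class.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X_toy = np.array([[3, 0],   # token counts per document: ("free", "meeting")
                  [0, 2]])
y_toy = np.array([1, 0])    # 1 = spam, 0 = non-spam

nb = MultinomialNB(alpha=1.0)
nb.fit(X_toy, y_toy)
print(nb.class_log_prior_)   # log P(class)
print(nb.feature_log_prob_)  # log P(token | class), Laplace-smoothed
print(nb.predict([[2, 0]]))  # a "free"-heavy message -> array([1])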

In [17]:
classifier = MultinomialNB(alpha=1e-3)
classifier.fit(X_train, y_train)
Out[17]:
MultinomialNB(alpha=0.001)

Evaluate the Multinomial Naive Bayes classifier¶

In [18]:
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=[str(i) + " (" + class_names[i] + ")" for i in labels])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support

           0     0.9933    0.9975    0.9954      1196
           1     0.9844    0.9594    0.9717       197

    accuracy                         0.9921      1393
   macro avg     0.9889    0.9784    0.9836      1393
weighted avg     0.9921    0.9921    0.9921      1393

[Figure: confusion matrix for the Multinomial Naive Bayes classifier]

Train an SVM model¶

A Support Vector Machine finds the hyperplane that separates the two classes with the maximum margin; the regularization parameter C trades off margin width against misclassified training samples, so a large value such as C=1e3 penalizes training errors heavily.

In [19]:
classifier = SVC(C=1e3)
classifier.fit(X_train, y_train)
Out[19]:
SVC(C=1000.0)

Evaluate the SVM model¶

In [20]:
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=[str(i) + " (" + class_names[i] + ")" for i in labels])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support

           0     0.9917    0.9983    0.9950      1196
           1     0.9894    0.9492    0.9689       197

    accuracy                         0.9914      1393
   macro avg     0.9906    0.9738    0.9820      1393
weighted avg     0.9914    0.9914    0.9913      1393

[Figure: confusion matrix for the SVM classifier]

Train a Logistic Regression classifier¶

Logistic regression models the probability that a message is spam as a logistic function of a weighted sum of its features. As with the SVM, C is the inverse of the regularization strength, so larger values apply less regularization.

In [21]:
classifier = LogisticRegression(C=100, max_iter=1000)
classifier.fit(X_train, y_train)
Out[21]:
LogisticRegression(C=100, max_iter=1000)

Evaluate the Logistic Regression classifier¶

In [22]:
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=[str(i) + " (" + class_names[i] + ")" for i in labels])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support

           0     0.9925    0.9975    0.9950      1196
           1     0.9843    0.9543    0.9691       197

    accuracy                         0.9914      1393
   macro avg     0.9884    0.9759    0.9820      1393
weighted avg     0.9914    0.9914    0.9913      1393

[Figure: confusion matrix for the Logistic Regression classifier]

All three classifiers exceed 99% accuracy on the test set. Multinomial Naive Bayes achieves the best spam F1-score (0.9717), narrowly ahead of Logistic Regression (0.9691) and the SVM (0.9689).