Spam Detection
Objective: Train multiple classifiers and evaluate their effectiveness in predicting whether a message is spam or not.
Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.svm import SVC
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import LogisticRegression
import seaborn as sns
sns.set_style("whitegrid")
Load the dataset
file_url = 'https://raw.githubusercontent.com/LuisAngelMendozaVelasco/Applied_Data_Science_with_Python_Specialization/main/Applied_Text_Mining_in_Python/Week3/Labs/data/spam.csv'
df = pd.read_csv(file_url)
df['target'] = np.where(df['target'] == 'spam', 1, 0)
df.head()
| | text | target |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni... | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |
Understand the dataset
The SMS Spam Collection dataset is a widely used benchmark for text classification, built for identifying spam SMS messages. The original collection contains 5,574 English SMS messages, each labeled as either “spam” or “ham” (non-spam); the CSV loaded here contains 5,572 rows, as the df.info() output shows.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   text    5572 non-null   object
 1   target  5572 non-null   int64
dtypes: int64(1), object(1)
memory usage: 87.2+ KB
Visualize the class distribution
target_feature = df.columns[-1]
class_names = ["non-spam", "spam"]
labels, sizes = np.unique(df[target_feature], return_counts=True)
fig, ax = plt.subplots()
ax.pie(sizes, textprops={'color': "w", 'fontsize': '12'}, autopct=lambda pct: "{:.2f}%\n({:d})".format(pct, round(pct/100 * sum(sizes))))
ax.legend([str(i) + " (" + class_names[i] + ")" for i in labels])
ax.set_title(target_feature)
plt.show()
Preprocess the dataset
Convert the messages to a matrix of token counts
Count vectorization is a method used in Natural Language Processing (NLP) to convert text documents into numerical vectors based on the frequency of words or tokens. It involves tokenizing the text, which means breaking it down into individual words or tokens, and then counting the occurrences of each token in the document. This process results in a matrix where each row represents a document and each column represents a unique token, with the cell values indicating the frequency of each token in the corresponding document.
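To make the row-per-document, column-per-token layout concrete, here is a minimal pure-Python sketch of the idea. It is not how CountVectorizer is implemented; it only imitates two of its defaults (lowercasing, and keeping tokens of two or more word characters). The three toy messages and the helper names `tokenize` and `count_vectorize` are hypothetical.

```python
import re
from collections import Counter

# Assumption: imitate CountVectorizer's defaults (lowercase, tokens of 2+ word characters).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def count_vectorize(docs):
    """Return (vocabulary, matrix): one row per document, one column per token."""
    unique_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    vocabulary = {tok: col for col, tok in enumerate(unique_tokens)}
    matrix = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        matrix.append([counts.get(tok, 0) for tok in vocabulary])
    return vocabulary, matrix


# Hypothetical toy messages, not taken from the dataset.
toy_docs = ["Free entry, free prize", "Ok see you later", "Win a free prize now"]
vocabulary, matrix = count_vectorize(toy_docs)

print(list(vocabulary))
# ['entry', 'free', 'later', 'now', 'ok', 'prize', 'see', 'win', 'you']
for row in matrix:
    print(row)
# [1, 2, 0, 0, 0, 1, 0, 0, 0]
# [0, 0, 1, 0, 1, 0, 1, 0, 1]
# [0, 1, 0, 1, 0, 1, 0, 1, 0]
```

Note that the single-character token "a" is dropped by the token pattern, which is also why one-letter words never appear in the fitted vocabulary.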
vectorizer = CountVectorizer()
vectorizer.fit(df['text'])
CountVectorizer()
Tokens with the largest lengths:
tokens = [(token, len(token)) for token in vectorizer.vocabulary_.keys()]
print("There are {} unique tokens.".format(len(tokens)))
tokens = pd.DataFrame(sorted(tokens, key=lambda item: item[1], reverse=True), columns=["token", "length"])
tokens.head(10)
There are 8672 unique tokens.
| | token | length |
|---|---|---|
| 0 | hypotheticalhuagauahahuagahyuhagga | 34 |
| 1 | com1win150ppmx3age16subscription | 32 |
| 2 | minmoremobsemspobox45po139wa | 28 |
| 3 | 50pmmorefrommobile2bremoved | 27 |
| 4 | minmobsmorelkpobox177hp51fl | 27 |
| 5 | 150ppmpobox10183bhamb64xe | 25 |
| 6 | callcost150ppmmobilesvary | 25 |
| 7 | 150ppermesssubscription | 23 |
| 8 | tscs087147403231winawk | 22 |
| 9 | datebox1282essexcm61xn | 22 |
plt.figure()
sns.histplot(data=tokens, x="length", bins="doane")
plt.title("Distribution of token lengths")
plt.xlabel("Token length")
plt.yscale("log")
plt.show()
Get the length of the messages
df['length'] = df['text'].str.len()
nonspam_mean_length = np.mean(df[df['target'] == 0]['length'])
spam_mean_length = np.mean(df[df['target'] == 1]['length'])
print("Average length of spam messages: {:.2f}".format(spam_mean_length))
print("Average length of non-spam messages: {:.2f}".format(nonspam_mean_length))
Average length of spam messages: 138.87
Average length of non-spam messages: 71.02
plt.figure()
sns.histplot(data=df, x="length", hue="target", bins="doane")
plt.title("Distribution of message lengths")
plt.xlabel("Message length")
plt.yscale("log")
plt.show()
Get the number of digits in the messages
df['digits'] = df['text'].str.findall(r'\d').str.len()
nonspam_mean_digits = np.mean(df[df['target'] == 0]['digits'])
spam_mean_digits = np.mean(df[df['target'] == 1]['digits'])
print("Average number of digits in spam messages: {:.2f}".format(spam_mean_digits))
print("Average number of digits in non-spam messages: {:.2f}".format(nonspam_mean_digits))
Average number of digits in spam messages: 15.76
Average number of digits in non-spam messages: 0.30
plt.figure()
sns.histplot(data=df, x="digits", hue="target", bins="doane")
plt.title("Distribution of message digit counts")
plt.xlabel("Message digit count")
plt.yscale("log")
plt.show()
Get the number of non-word characters in the messages
df['non_word'] = df['text'].str.findall(r'\W').str.len()
nonspam_mean_non_word = np.mean(df[df['target'] == 0]['non_word'])
spam_mean_non_word = np.mean(df[df['target'] == 1]['non_word'])
print("Average number of non-word characters in spam messages: {:.2f}".format(spam_mean_non_word))
print("Average number of non-word characters in non-spam messages: {:.2f}".format(nonspam_mean_non_word))
Average number of non-word characters in spam messages: 29.04
Average number of non-word characters in non-spam messages: 17.29
plt.figure()
sns.histplot(data=df, x="non_word", hue="target", bins="doane")
plt.title("Distribution of message non-word counts")
plt.xlabel("Message non-word count")
plt.yscale("log")
plt.show()
df.head()
| | text | target | length | digits | non_word |
|---|---|---|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 | 111 | 0 | 28 |
| 1 | Ok lar... Joking wif u oni... | 0 | 29 | 0 | 11 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 | 155 | 25 | 33 |
| 3 | U dun say so early hor... U c already then say... | 0 | 49 | 0 | 16 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 | 61 | 0 | 14 |
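As a quick check of what the three engineered features measure, the sketch below recomputes them for row 1 of the table above with the standard `re` module, using the same patterns as the notebook (`\d` and `\W`). It shows that `\W` counts spaces as well as punctuation, so `non_word` is not a pure punctuation count.

```python
import re

# Row 1 of the table above; its text is short enough to be displayed in full.
message = "Ok lar... Joking wif u oni..."

length = len(message)                       # number of characters
digits = len(re.findall(r"\d", message))    # characters that are digits
non_word = len(re.findall(r"\W", message))  # characters that are not letters, digits, or "_"

print(length, digits, non_word)  # 29 0 11

# \W matches whitespace too: 5 spaces + 6 dots account for all 11 matches.
spaces = message.count(" ")
dots = message.count(".")
print(spaces, dots)  # 5 6
```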
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
X_vectorized = vectorizer.transform(df['text'])
X_vectorized = add_feature(X_vectorized, [df['length'], df['digits'], df['non_word']])
Split the dataset into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, df['target'], random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (4179, 8675)
X_test shape: (1393, 8675)
Train a Multinomial Naive Bayes classifier
A Multinomial Naive Bayes classifier is well suited to text classification. It assumes that the features are discrete, non-negative counts or frequencies, such as word counts in documents, and models their likelihood within each class with a multinomial distribution. The classifier applies Bayes' theorem under the "naive" assumption that features are conditionally independent given the class. It is widely used for tasks like spam filtering, document classification, and sentiment analysis. The alpha parameter set below controls additive smoothing, which keeps a token that never appears in a class from receiving zero probability.
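The decision rule can be sketched in a few lines of pure Python. This is an illustrative sketch of the standard smoothed multinomial model, not scikit-learn's code: each class gets a log prior plus, for every token, the token count times the log of a smoothed per-class token probability. The toy count vectors, the three-word vocabulary, and the function names are hypothetical, and the sketch uses alpha=1.0 for readability rather than the much smaller value used in this notebook.

```python
import math

# Hypothetical toy data: token counts over the vocabulary ["free", "meet", "prize"].
X = [[2, 0, 1], [1, 0, 2], [0, 2, 0], [0, 1, 0]]
y = [1, 1, 0, 0]  # 1 = spam, 0 = non-spam


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log token probabilities per class."""
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        token_totals = [sum(col) for col in zip(*rows)]
        # Additive smoothing: add alpha to every token count in the class.
        denom = sum(token_totals) + alpha * n_features
        log_likelihood[c] = [math.log((n + alpha) / denom) for n in token_totals]
    return log_prior, log_likelihood


def predict(x, log_prior, log_likelihood):
    """Pick the class with the highest log prior + sum(count * log probability)."""
    scores = {
        c: log_prior[c] + sum(n * ll for n, ll in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


log_prior, log_likelihood = fit_multinomial_nb(X, y)
print(predict([1, 0, 1], log_prior, log_likelihood))  # 1
print(predict([0, 1, 0], log_prior, log_likelihood))  # 0
```

Without smoothing, the token "meet" would have probability zero under the spam class here, and any message containing it could never be scored as spam regardless of its other tokens.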
classifier = MultinomialNB(alpha=1e-3)
classifier.fit(X_train, y_train)
MultinomialNB(alpha=0.001)
Evaluate the Multinomial Naive Bayes classifier
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=[str(i) + " (" + class_names[i] + ")" for i in labels])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support
           0     0.9933    0.9975    0.9954      1196
           1     0.9844    0.9594    0.9717       197
    accuracy                         0.9921      1393
   macro avg     0.9889    0.9784    0.9836      1393
weighted avg     0.9921    0.9921    0.9921      1393
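The figures in the report above follow directly from the four confusion matrix counts. The sketch below recomputes the spam-class (label 1) row and the overall accuracy. The counts are an assumption: they were inferred by working backwards from the reported precision, recall, and support, not read from the notebook's confusion matrix plot.

```python
# Assumption: counts inferred from the report above (support 1196 and 197),
# not taken from the notebook's confusion matrix plot.
tn, fp, fn, tp = 1193, 3, 8, 189

precision = tp / (tp + fp)   # of the messages flagged as spam, how many are spam
recall = tp / (tp + fn)      # of the spam messages, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"{precision:.4f} {recall:.4f} {f1:.4f} {accuracy:.4f}")
# 0.9844 0.9594 0.9717 0.9921
```

With only 197 spam messages among 1,393 test messages, accuracy is dominated by the non-spam class, so the spam-class recall is the more informative number when comparing the classifiers.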
Train an SVM model
classifier = SVC(C=1e3)
classifier.fit(X_train, y_train)
SVC(C=1000.0)
Evaluate the SVM model
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=[str(i) + " (" + class_names[i] + ")" for i in labels])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support
           0     0.9917    0.9983    0.9950      1196
           1     0.9894    0.9492    0.9689       197
    accuracy                         0.9914      1393
   macro avg     0.9906    0.9738    0.9820      1393
weighted avg     0.9914    0.9914    0.9913      1393
Train a Logistic Regression classifier
classifier = LogisticRegression(C=100, max_iter=1000)
classifier.fit(X_train, y_train)
LogisticRegression(C=100, max_iter=1000)
Evaluate the Logistic Regression classifier
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=[str(i) + " (" + class_names[i] + ")" for i in labels])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support
           0     0.9925    0.9975    0.9950      1196
           1     0.9843    0.9543    0.9691       197
    accuracy                         0.9914      1393
   macro avg     0.9884    0.9759    0.9820      1393
weighted avg     0.9914    0.9914    0.9913      1393