Credit Card Fraud Detection¶
Objective: Train multiple classifiers and evaluate their effectiveness in predicting credit card fraud on an imbalanced dataset.
Import libraries¶
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, precision_recall_curve
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
Load the dataset¶
file_url = "https://raw.githubusercontent.com/LuisAngelMendozaVelasco/Applied_Data_Science_with_Python_Specialization/main/Applied_Machine_Learning_in_Python/Week3/Labs/data/fraud_data.csv.gz"
df = pd.read_csv(file_url)
df.head()
| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.176563 | 0.323798 | 0.536927 | 1.047002 | -0.368652 | -0.728586 | 0.084678 | -0.069246 | -0.266389 | 0.155315 | ... | -0.109627 | -0.341365 | 0.057845 | 0.499180 | 0.415211 | -0.581949 | 0.015472 | 0.018065 | 4.67 | 0 |
| 1 | 0.681109 | -3.934776 | -3.801827 | -1.147468 | -0.735540 | -0.501097 | 1.038865 | -0.626979 | -2.274423 | 1.527782 | ... | 0.652202 | 0.272684 | -0.982151 | 0.165900 | 0.360251 | 0.195321 | -0.256273 | 0.056501 | 912.00 | 0 |
| 2 | 1.140729 | 0.453484 | 0.247010 | 2.383132 | 0.343287 | 0.432804 | 0.093380 | 0.173310 | -0.808999 | 0.775436 | ... | -0.003802 | 0.058556 | -0.121177 | -0.304215 | 0.645893 | 0.122600 | -0.012115 | -0.005945 | 1.00 | 0 |
| 3 | -1.107073 | -3.298902 | -0.184092 | -1.795744 | 2.137564 | -1.684992 | -2.015606 | -0.007181 | -0.165760 | 0.869659 | ... | 0.130648 | 0.329445 | 0.927656 | -0.049560 | -1.892866 | -0.575431 | 0.266573 | 0.414184 | 62.10 | 0 |
| 4 | -0.314818 | 0.866839 | -0.124577 | -0.627638 | 2.651762 | 3.428128 | 0.194637 | 0.670674 | -0.442658 | 0.133499 | ... | -0.312774 | -0.799494 | -0.064488 | 0.953062 | -0.429550 | 0.158225 | 0.076943 | -0.015051 | 2.67 | 0 |
5 rows × 30 columns
Understand the dataset¶
Each row in the dataframe corresponds to a credit card transaction. The features are the confidential variables V1 through V28 and Amount, the amount of the transaction. The target is stored in the Class column, where 1 marks a fraudulent transaction and 0 marks a non-fraudulent one.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21693 entries, 0 to 21692
Data columns (total 30 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   V1      21693 non-null  float64
 1   V2      21693 non-null  float64
 2   V3      21693 non-null  float64
 3   V4      21693 non-null  float64
 4   V5      21693 non-null  float64
 5   V6      21693 non-null  float64
 6   V7      21693 non-null  float64
 7   V8      21693 non-null  float64
 8   V9      21693 non-null  float64
 9   V10     21693 non-null  float64
 10  V11     21693 non-null  float64
 11  V12     21693 non-null  float64
 12  V13     21693 non-null  float64
 13  V14     21693 non-null  float64
 14  V15     21693 non-null  float64
 15  V16     21693 non-null  float64
 16  V17     21693 non-null  float64
 17  V18     21693 non-null  float64
 18  V19     21693 non-null  float64
 19  V20     21693 non-null  float64
 20  V21     21693 non-null  float64
 21  V22     21693 non-null  float64
 22  V23     21693 non-null  float64
 23  V24     21693 non-null  float64
 24  V25     21693 non-null  float64
 25  V26     21693 non-null  float64
 26  V27     21693 non-null  float64
 27  V28     21693 non-null  float64
 28  Amount  21693 non-null  float64
 29  Class   21693 non-null  int64
dtypes: float64(29), int64(1)
memory usage: 5.0 MB
Analyze the Amount feature¶
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
sns.histplot(data=df, x="Amount", hue="Class", bins="doane", ax=axs[0])
axs[0].set_yscale('log')
axs[0].set_title("Histogram")
sns.boxplot(data=df, x="Class", y="Amount", hue="Class", ax=axs[1], legend=False)
axs[1].set_yscale('log')
axs[1].set_title("Box plot")
plt.show()
Visualize the class distribution¶
target_feature = df.columns[-1]
labels, sizes = np.unique(df[target_feature], return_counts=True)
labels = ["1 (fraud)" if i else "0 (non-fraud)" for i in labels]
plt.figure()
g = sns.barplot(x=labels, y=sizes)
g.bar_label(g.containers[0], [str(round(100 * size / sum(sizes), 2)) + "%" + " (" + str(size) + ")" for size in sizes])
plt.yscale("log")
plt.title(target_feature)
plt.show()
Split the dataset into train and test subsets¶
X = df.drop(target_feature, axis=1)
y = df[target_feature]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (16269, 29)
X_test shape: (5424, 29)
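The split above does not pass `stratify`, so the fraud share in the train and test subsets can differ slightly from the share in the full dataset. As a minimal sketch on synthetic data (the array sizes and the 2% positive rate are assumptions made for illustration, not properties of the fraud dataset), passing `stratify` to `train_test_split` keeps the class ratio the same in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced dataset (hypothetical values):
# 1000 rows, 3 features, 20 positives (2%).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo preserves the 2% positive share in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

With only a handful of positives, an unstratified split can leave the test subset with too few fraud cases for the recall estimate to be stable, which is the main reason to consider this option.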
Find the optimal regularization parameter value for a SVM model¶
C = np.logspace(0, 6, 3)
parameters = {'C': C}
svc = SVC()
clf_svc = GridSearchCV(svc, parameters, scoring='recall')
clf_svc.fit(X_train, y_train)
GridSearchCV(estimator=SVC(), param_grid={'C': array([1.e+00, 1.e+03, 1.e+06])},
             scoring='recall')
SVC(C=np.float64(1000000.0))
plt.figure()
plt.plot(C, clf_svc.cv_results_['mean_test_score'])
plt.scatter(clf_svc.best_params_['C'], clf_svc.best_score_, color='red')
plt.xlabel("C")
plt.ylabel("Mean recall score")
plt.xscale("log")
plt.xticks(C, C)
plt.show()
Evaluate the best SVM model¶
y_pred = clf_svc.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=["0 (non-fraud)", "1 (fraud)"])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support
           0     0.9970    0.9991    0.9980      5344
           1     0.9275    0.8000    0.8591        80
    accuracy                         0.9961      5424
   macro avg     0.9623    0.8995    0.9285      5424
weighted avg     0.9960    0.9961    0.9960      5424
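The fraud-class figures in the report above follow from four confusion-matrix counts. In the sketch below, the counts are inferred from the reported precision, recall, and support rather than read from the plotted matrix, and `report_from_counts` is a hypothetical helper name:

```python
def report_from_counts(tp, fp, fn, tn):
    """Return (precision, recall, f1, accuracy) for the positive class."""
    precision = tp / (tp + fp)          # share of flagged transactions that are fraud
    recall = tp / (tp + fn)             # share of fraud transactions that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Assumed counts, chosen to be consistent with the report:
# 80 fraud cases (64 caught, 16 missed) and 5344 non-fraud cases (5 false alarms).
svm_scores = report_from_counts(tp=64, fp=5, fn=16, tn=5339)
print([round(v, 4) for v in svm_scores])  # [0.9275, 0.8, 0.8591, 0.9961]

# A model that never predicts fraud still reaches high accuracy on this test set.
baseline_accuracy = 5344 / 5424
print(round(baseline_accuracy, 4))  # 0.9853
```

The baseline shows why accuracy says little here: flagging nothing at all already scores about 98.5%, while recall on the fraud class would be zero. That is consistent with the grid search being scored on recall.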
Find the optimal regularization parameter for a Logistic Regression classifier¶
A Logistic Regression classifier uses a logistic (or sigmoid) function to transform a linear combination of input features into a probability value ranging between 0 and 1. This probability indicates the likelihood that a given input corresponds to one of two predefined categories.
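As a minimal sketch of that transformation (the weights, bias, and feature values below are hypothetical, and `predict_fraud_probability` is an illustrative name rather than a scikit-learn function), the linear combination is computed first and then passed through the sigmoid:

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_fraud_probability(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # linear combination
    return sigmoid(z)

# Hypothetical coefficients and inputs, chosen only for illustration.
p = predict_fraud_probability(weights=[0.8, -1.2], bias=-0.5, features=[1.0, 0.5])
print(round(p, 4))

# With the default 0.5 cutoff, this sample would be labelled 0 (non-fraud).
label = int(p >= 0.5)
print(label)
```

A linear score of zero maps to a probability of exactly 0.5, so the sign of the linear combination decides the predicted class at the default cutoff.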
C = np.logspace(-3, 3, 3)
parameters = {'C': C}
logistic_regression_classifier = LogisticRegression(max_iter=1000)
clf_lr = GridSearchCV(logistic_regression_classifier, parameters, scoring='recall')
clf_lr.fit(X_train, y_train)
GridSearchCV(estimator=LogisticRegression(max_iter=1000),
             param_grid={'C': array([1.e-03, 1.e+00, 1.e+03])},
             scoring='recall')
LogisticRegression(C=np.float64(1.0), max_iter=1000)
plt.figure()
plt.plot(C, clf_lr.cv_results_['mean_test_score'])
plt.scatter(clf_lr.best_params_['C'], clf_lr.best_score_, color='red')
plt.xlabel("C")
plt.ylabel("Mean recall score")
plt.xscale("log")
plt.xticks(C, C)
plt.show()
Evaluate the best Logistic Regression classifier¶
y_pred = clf_lr.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=["0 (non-fraud)", "1 (fraud)"])
plt.grid(False)
plt.show()
              precision    recall  f1-score   support
           0     0.9968    0.9996    0.9982      5344
           1     0.9692    0.7875    0.8690        80
    accuracy                         0.9965      5424
   macro avg     0.9830    0.8936    0.9336      5424
weighted avg     0.9964    0.9965    0.9963      5424
The precision-recall curve is generally the preferred diagnostic when class imbalance is moderate to large, as it is here, because both metrics focus on the minority (fraud) class.
y_proba = clf_lr.best_estimator_.predict_proba(X_test)[:, -1]
precision, recall, _ = precision_recall_curve(y_test, y_proba)
plt.figure()
plt.plot(precision, recall)
plt.title('Precision - Recall Curve')
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.show()
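One practical use of the curve is choosing a decision threshold other than the default 0.5. The sketch below uses toy scores rather than the fraud data and picks the threshold with the highest recall among those that meet a minimum precision. `pick_threshold` and the 0.6 precision floor are illustrative assumptions, not part of the original analysis:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_threshold(y_true, scores, min_precision):
    """Return the threshold with the highest recall whose precision is at
    least min_precision, or None if no threshold qualifies."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    qualifying = np.flatnonzero(precision >= min_precision)
    if qualifying.size == 0:
        return None
    best = qualifying[np.argmax(recall[qualifying])]
    return float(thresholds[best])

# Toy labels and scores (hypothetical values).
y_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

chosen = pick_threshold(y_toy, scores_toy, min_precision=0.6)
print("chosen threshold:", chosen)
```

On real data, the threshold should be selected on a validation subset or through cross-validation rather than on the test set, so that the reported test metrics remain an unbiased estimate.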