Table of contents
1. Introduction
2. Description
2.1. Challenges Involved
2.2. Dataset Used
3. Building The Project
3.1. Step 1: Importing Libraries
3.2. Step 2: Loading Data
3.2.1. Showing Data
3.3. Step 3: Classification of Data
3.4. Step 4: Details of Amount of Fraud Data
3.5. Step 5: Correlation Matrix
3.6. Step 6: Separation of X and Y Values
3.7. Step 7: Building Random Forest Model
3.8. Step 8: Evaluation Parameters
3.9. Step 9: Confusion Matrix
4. Frequently Asked Questions
4.1. What do you mean by Credit Card Fraud Detection Project in Data Mining?
4.2. What do you mean by Random Forests?
4.3. What is a Correlation Matrix?
4.4. What is a Confusion Matrix?
5. Conclusion
Last Updated: Mar 27, 2024

Credit Card Fraud Detection Project in Data Mining

Introduction

Data Mining is a field where we extract and discover patterns in large amounts of data. It sits at the intersection of machine learning, database systems, and statistical analysis. In today’s world, many scams and frauds take place, and technology is often misused to harm other people. A classic example is credit card fraud.

This article will study the Credit Card Fraud Detection Project in Data Mining.

Description

In the Credit Card Fraud Detection Project in Data Mining, we identify fraudulent credit card transactions. This is useful because it keeps companies from charging customers for transactions they haven't made. Accurate fraud detection prevents financial losses, helps customers maintain trust in their credit card companies, and contributes to a secure financial transaction system.

The prerequisites for building this project are Python, Statistics, Data Mining, Machine Learning, Data Cleaning, and Data Visualization.

Challenges Involved

Before moving to the project, let us look at some of the challenges involved in the Credit Card Fraud Detection Project in Data Mining.

  • There is an enormous amount of data. So, the model we aim to build must be fast enough to process it.
     
  • Imbalanced data should be taken care of. There are very few fraud cases in total.
     
  • Data availability is a concern since it is private information.
     
  • Scammers are clever, constantly adapting to changes and updating their fraud techniques. The model must be retrained regularly to keep up with them.
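
The imbalance challenge is easy to underestimate, so here is a minimal sketch of why it matters. The labels below are hypothetical (995 genuine and 5 fraud transactions, not taken from the dataset), and the "model" simply labels everything as genuine. It still reaches a very high accuracy while catching no fraud at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 995 genuine (0) and 5 fraud (1) transactions.
y_true = np.array([0] * 995 + [1] * 5)

# A useless "model" that labels every transaction as genuine.
y_always_genuine = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_genuine)
baseline_recall = recall_score(y_true, y_always_genuine)

print("Accuracy:", baseline_accuracy)  # 0.995
print("Recall:", baseline_recall)      # 0.0, no fraud case is found
```

This is why the evaluation later in the article looks at precision, recall, and other metrics in addition to accuracy.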

Dataset Used

For this project, we use a credit card transactions dataset taken from Kaggle. The features of this dataset are as follows:

  • V1-V28: Attributes of the credit card transaction.
     
  • Amount: The amount involved in the credit card transaction.
     
  • Class: This represents whether the transaction is fraud(1) or genuine(0).

Building The Project

Let us start building the Credit Card Fraud Detection Project in Data Mining with the following steps. You can build it in your Jupyter Notebook or use Google Colaboratory.

Step 1: Importing Libraries

First, let us import all the necessary libraries as follows.

import pandas as pd
import numpy as np
import seaborn as snsbn
import matplotlib.pyplot as plot
from matplotlib import gridspec
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix

 

Step 2: Loading Data

The next step is to load the data into our project. If you are using Google Colaboratory for the project, as we are, you can upload the file to the session storage to use it. Next, we write the following code to import the file.

total_data = pd.read_csv("Book1.csv")

 

Here, we use the pandas library to read our CSV file.
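
After loading a file, it can help to run a few quick sanity checks before going further. The sketch below is an illustration only: it reads a tiny inline CSV with made-up values (the names sample_csv and sample_data are hypothetical) instead of the real file, and checks the shape, column types, and missing values.

```python
import io
import pandas as pd

# Tiny inline CSV standing in for the real file (values are made up).
sample_csv = io.StringIO(
    "V1,V2,Amount,Class\n"
    "-1.36,-0.07,149.62,0\n"
    "1.19,0.27,2.69,0\n"
    "-2.31,1.95,0.00,1\n"
)
sample_data = pd.read_csv(sample_csv)

print(sample_data.shape)   # (number of rows, number of columns)
print(sample_data.dtypes)  # column types inferred by pandas

# Count missing values in each column.
missing_per_column = sample_data.isnull().sum()
print(missing_per_column)
```

The same three checks can be applied to total_data once the real file is loaded.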

Showing Data

Next, let us show the first few rows of the data using the following command:

total_data.head()

 

OUTPUT

dataset

Step 3: Classification of Data

Now, it is time to show the imbalanced data. Use the following code to display the number and percentage of fraud cases from the dataset:

fraud = total_data[total_data['Class'] == 1]
valid = total_data[total_data['Class'] == 0]
fraudpercentage = (len(fraud)/float(len(valid)+len(fraud)))*100
validpercentage = (len(valid)/float(len(valid)+len(fraud)))*100
print('Percentage of fraud transactions')
print(fraudpercentage)
print('Percentage of genuine transactions')
print(validpercentage)
print('Number of Fraud Cases: {}'.format(len(total_data[total_data['Class'] == 1])))
print('Number of Total Transactions: {}'.format(len(total_data)))

 

Here, we counted the fraudulent transactions by selecting the rows where the Class attribute has a value of 1. In the same way, we counted the valid transactions, where Class has a value of 0. Then, we calculated their respective percentages and displayed them.

OUTPUT

fraud and genuine transactions percentages

We can see that only 0.38% of cases are fraud cases.
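
The same counts and percentages can also be obtained in a more compact way with pandas' value_counts. This is a sketch on a hypothetical ten-row frame (toy_data is not the real dataset), shown only to illustrate the call.

```python
import pandas as pd

# Hypothetical labels: 3 fraud rows among 10 transactions.
toy_data = pd.DataFrame({"Class": [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]})

# Number of rows per class.
class_counts = toy_data["Class"].value_counts()

# Share of each class, converted to a percentage.
class_percentages = toy_data["Class"].value_counts(normalize=True) * 100

print(class_counts)
print(class_percentages)
```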

Step 4: Details of Amount of Fraud Data

Let us find out the details of the amount of fraudulent transactions using the following commands:

print('Details of Fraud Transactions:')
fraud.Amount.describe()

 

OUTPUT

details of fraud transactions

Step 5: Correlation Matrix

Now, let us plot the correlation matrix for the dataset. It tells us how the features correlate with each other. It helps us determine the most relevant features for our model. The commands are as follows:

correlation_matrix = total_data.corr()
figure_ = plot.figure(figsize = (14, 10))
snsbn.heatmap(correlation_matrix, vmax = .9, square = True)
plot.show()

 

We used the matplotlib library to create the figure. Then, we used the seaborn library to draw the correlation matrix as a heatmap, capping the color scale at a maximum value of 0.9 (vmax).

OUTPUT

correlation matrix

Most features do not correlate with each other. But some of them do. For example, V2 correlates negatively with the Amount. We also see the correlation between Amount and V7 is positive. This way, it helps us in the determination of necessary attributes.
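
A heatmap is good for an overview, but a sorted list is often easier to read when the question is which features relate to the label. The sketch below uses synthetic data (the toy frame and its values are assumptions, not the Kaggle dataset) in which the label is deliberately derived from V1, and ranks the features by the absolute value of their correlation with Class.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 500

# Synthetic stand-in for the real data.
toy = pd.DataFrame({
    "V1": rng.normal(size=n_rows),
    "V2": rng.normal(size=n_rows),
    "Amount": rng.exponential(scale=50.0, size=n_rows),
})
# Make the hypothetical label depend on V1 so one correlation stands out.
toy["Class"] = (toy["V1"] > 1.5).astype(int)

# Correlation of every feature with the label.
with_class = toy.corr()["Class"].drop("Class")

# Order the features by absolute correlation, strongest first.
order = with_class.abs().sort_values(ascending=False).index
class_correlation = with_class[order]
print(class_correlation)
```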

Step 6: Separation of X and Y Values

Now, we divide the X and Y values from the dataset. It means dividing the data into input parameters and output values accordingly. Use the following commands to achieve the same:

x = total_data.drop(['Class'], axis = 1)
y = total_data["Class"]
print(x.shape)
print(y.shape)
xdata = x.values
ydata = y.values

 

OUTPUT

separate x and y values

Step 7: Building Random Forest Model

Now, we will divide the dataset into two subsets. One will be used for training the model and the other for testing it. We write the following code to achieve it:

xtrain, xtest, ytrain, ytest = train_test_split(
    xdata, ydata, test_size = 0.4, random_state = 40)

 

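We use scikit-learn's train_test_split to divide the dataset. With test_size = 0.4, 40% of the rows are held out for testing, and random_state = 40 makes the split reproducible.

Next, we will use scikit-learn to build a random forest model. We do this by writing the following code: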
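
With very few fraud cases, a purely random split can leave the training and test sets with different fraud shares. One option, which is not used in this article's code, is the stratify parameter of train_test_split. The sketch below uses hypothetical data (1000 rows, 20 of them fraud) to show that both parts keep roughly the same fraud share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 20 of them fraud (2%).
features = np.arange(1000).reshape(-1, 1)
labels = np.array([1] * 20 + [0] * 980)

# stratify=labels keeps the class proportions the same in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.4, random_state=40, stratify=labels)

print("Fraud share in training set:", y_tr.mean())
print("Fraud share in test set:", y_te.mean())
```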

r_f_c = RandomForestClassifier()
r_f_c.fit(xtrain, ytrain)
ypred = r_f_c.predict(xtest)

 

Here, we first train the RandomForestClassifier() on the xtrain and ytrain data. Then, we use it to predict the output values, ypred, from the xtest input values.

Step 8: Evaluation Parameters

Now that we have built a Random Forest Model, let us calculate the value of some evaluation parameters for our model using the following code:

n_outliers = len(fraud)
n_errors = (ypred != ytest).sum()
print("Using Random Forest classifier")
 
accuracy_ = accuracy_score(ytest, ypred)
print("The accuracy is {}".format(accuracy_))
 
precision_ = precision_score(ytest, ypred)
print("The precision is {}".format(precision_))
 
recall_ = recall_score(ytest, ypred)
print("The recall is {}".format(recall_))
 
f1_score_ = f1_score(ytest, ypred)
print("The F1-Score is {}".format(f1_score_))
 
MCC_ = matthews_corrcoef(ytest, ypred)
print("The Matthews correlation coefficient is {}".format(MCC_))

 

We calculate the accuracy by comparing the predicted labels with the actual labels: it is the fraction of all predictions that are correct. Precision is the fraction of transactions predicted as fraud that really are fraud. Recall is the fraction of actual fraud transactions that the model correctly identifies. The F1 score is the harmonic mean of precision and recall. Finally, we calculate the Matthews Correlation Coefficient (MCC), which gives a high score only if there are good results in all four confusion matrix categories.
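
To make these definitions concrete, here is a small worked sketch that computes each metric by hand from the counts of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn). The ten labels are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

# Hypothetical true labels and predictions (1 = fraud, 0 = genuine).
y_actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_actual, y_predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # fraud, predicted fraud
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # genuine, predicted genuine
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # genuine, predicted fraud
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # fraud, predicted genuine

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
print(round(mcc, 4))                    # 0.5833
```

These hand-computed values should agree with what accuracy_score, precision_score, recall_score, f1_score, and matthews_corrcoef return for the same labels.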

OUTPUT

random forest classifier evaluations

Step 9: Confusion Matrix

Finally, let us make the Confusion Matrix for our model. The Confusion Matrix compares the actual classes with the predicted classes. It shows how many predictions were correct and incorrect for each class. Write the following code to display the Confusion Matrix for our model:

LABELS = ['GENUINE', 'FRAUD']
conf_matrix = confusion_matrix(ytest, ypred)
plot.figure(figsize =(15, 15))
snsbn.heatmap(conf_matrix, xticklabels = LABELS,
            yticklabels = LABELS, annot = True, fmt ="d");
plot.title("CONFUSION MATRIX")
plot.ylabel('TRUE class')
plot.xlabel('PREDICTED class')
plot.show()

 

We label the classes as GENUINE and FRAUD. Then, we create our confusion matrix from the actual and predicted output values and draw it as a heatmap.

OUTPUT

confusion matrix

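We can see that our model is very accurate. Because fraud cases are so rare, it is worth checking the FRAUD row of the matrix along with the precision and recall values, since accuracy alone can look high even when fraud cases are missed. You can try different models too.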
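
The four cells of the confusion matrix can also be read out as plain numbers, which is handy when no plot is available. The sketch below uses hypothetical labels (true_labels and predicted_labels are not from the dataset). In scikit-learn's layout, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels (1 = fraud, 0 = genuine).
true_labels      = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(true_labels, predicted_labels, labels=[0, 1])

# Flatten row by row: true negatives, false positives,
# false negatives, true positives.
tn_count, fp_count, fn_count, tp_count = cm.ravel()

print(cm)
print("TN={}, FP={}, FN={}, TP={}".format(
    tn_count, fp_count, fn_count, tp_count))  # TN=5, FP=1, FN=1, TP=3
```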

Frequently Asked Questions

What do you mean by Credit Card Fraud Detection Project in Data Mining?

It means that, given an enormous dataset of credit card transactions, we have to find the patterns of fraudulent transactions using Data Mining. It helps in securing financial transaction systems. We achieve it through Machine Learning, Python, Data Visualization, Statistics, and other concepts.

What do you mean by Random Forests?

Random Forest is a supervised machine-learning algorithm. It is an ensemble classifier that trains many decision trees, each on a random subset of the dataset. It then combines the trees' predictions, by majority vote or by averaging their predicted probabilities, to improve the accuracy of the model's predictions.
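
The averaging step can be checked directly. The sketch below assumes scikit-learn's RandomForestClassifier and synthetic data from make_classification (not the credit card dataset). It averages the class probabilities of the individual trees and compares the result with the forest's own predict_proba output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, used only for illustration.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_toy, y_toy)

# Average the class probabilities predicted by the individual trees.
tree_average = np.mean(
    [tree.predict_proba(X_toy) for tree in forest.estimators_], axis=0)
forest_probabilities = forest.predict_proba(X_toy)

# In scikit-learn, the two are expected to match up to rounding.
probabilities_match = np.allclose(tree_average, forest_probabilities)
print(probabilities_match)
```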

What is a Correlation Matrix?

Correlation Matrix is a matrix formed among different attributes of a dataset. It is used to determine which factors correlate with which factors. Attributes can correlate both negatively and positively with each other. It helps in determining the necessary attributes for a model.

What is a Confusion Matrix?

The Confusion Matrix compares the actual classes with the predicted classes. It shows how many predictions were correct and incorrect. It has four categories: true positives, false positives, false negatives, and true negatives.

Conclusion

Credit Card frauds are on the rise today. Scammers are clever and constantly adapt to new technology to hinder the lives of innocent people. Thus, it is crucial to identify and rectify these fraud cases to secure the financial transactions. This article studied the Credit Card Fraud Detection Project in Data Mining. We started with the project description, prerequisites, and challenges. Then, we took a sample dataset and built our project following the ordered steps. And we also created a Random Forest model for it.

Happy Coding!
