Introduction
Data Mining is the field of extracting and discovering patterns in large amounts of data. It sits at the intersection of machine learning, database systems, and statistical analysis. In today's world, scams and frauds are common, and many people misuse technology to harm others. A classic example is credit card fraud.
This article walks through the Credit Card Fraud Detection Project in Data Mining.
Description
In the Credit Card Fraud Detection Project in Data Mining, we identify fraudulent credit card transactions. This helps companies avoid charging customers for transactions they did not make. Accurate fraud detection prevents financial losses, helps customers maintain trust in their credit card companies, and contributes to a secure financial transaction system.
Before moving to the project, let us look at some of the challenges involved in the Credit Card Fraud Detection Project in Data Mining.
There is an enormous amount of data. So, the model we aim to build must be fast enough to process it.
Imbalanced data should be taken care of. There are very few fraud cases in total.
Data availability is a concern since it is private information.
Scammers are clever, constantly adapting to changes and updating their fraud techniques. The model must be updated regularly to keep up with them.
Dataset Used
For this project, we use a dataset taken from Kaggle. The features of this dataset are as follows:
V1-V28: Attributes of the credit card transaction.
Amount: The amount involved in the credit card transaction.
Class: Whether the transaction is fraud (1) or genuine (0).
Building The Project
Let us start building the Credit Card Fraud Detection Project in Data Mining with the following steps. You can build it in your Jupyter Notebook or use Google Colaboratory.
Step 1: Importing Libraries
First, let us import all the necessary libraries as follows.
import pandas as pd
import numpy as np
import seaborn as snsbn
import matplotlib.pyplot as plot
from matplotlib import gridspec
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.metrics import confusion_matrix
Step 2: Loading Data
The next step is to load the data into our project. If you are using Google Colaboratory for the project, as we are, you can upload the file to the session storage to use it. Next, we write the following code to import the file.
total_data = pd.read_csv("Book1.csv")
Here, we use the pandas library to read our CSV file.
Showing Data
Next, let us display the first few rows of the data with the following command:
total_data.head()
OUTPUT
Step 3: Classification of Data
Now, let us see how imbalanced the data is. Use the following code to display the number and percentage of fraud cases from the dataset:
# Split the rows by class label: 1 = fraud, 0 = genuine
fraud = total_data[total_data['Class'] == 1]
valid = total_data[total_data['Class'] == 0]
# Share of each class among all labeled transactions, in percent
total = len(valid) + len(fraud)
fraudpercentage = (len(fraud) / total) * 100
validpercentage = (len(valid) / total) * 100
print('Percentage of fraud transactions')
print(fraudpercentage)
print('Percentage of genuine transactions')
print(validpercentage)
print('Number of Fraud Cases: {}'.format(len(fraud)))
print('Number of Total Transactions: {}'.format(len(total_data)))
Here, we count the fraudulent transactions by selecting the rows where the Class attribute has a value of 1. In the same way, we count the valid transactions, where Class has a value of 0. Then, we calculate their respective percentages and display them.
OUTPUT
We can see that only 0.38% of cases are fraud cases.
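As an alternative to filtering the frame twice, the class balance can be read off with pandas' `value_counts`. The sketch below assumes a DataFrame with a binary `Class` column as described above; the helper name `class_balance` and the tiny demo frame are made up for illustration and are not part of the original project.

```python
import pandas as pd


def class_balance(data: pd.DataFrame, label_col: str = "Class") -> pd.DataFrame:
    """Return the count and percentage of rows for each class label."""
    counts = data[label_col].value_counts().sort_index()
    percentages = data[label_col].value_counts(normalize=True).sort_index() * 100
    return pd.DataFrame({"count": counts, "percent": percentages})


# Hypothetical toy data: 9 genuine rows (0) and 1 fraud row (1).
demo = pd.DataFrame({"Class": [0] * 9 + [1]})
balance = class_balance(demo)
print(balance)
```

On the real data, calling `class_balance(total_data)` should report the same percentages as the code in this step.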
Step 4: Details of Amount of Fraud Data
Let us find out the details of the amount of fraudulent transactions using the following commands:
print('Details of Fraud Transactions:')
fraud.Amount.describe()
OUTPUT
Step 5: Correlation Matrix
Now, let us print the correlation matrix for it. It tells us how the features correlate with each other. It helps us determine the most relevant features for our model. The commands are as follows:
We used the matplotlib library to create the figure. Then, we used the seaborn library to draw a heatmap of the correlation matrix, with the maximum value of the color scale set to 0.9.
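The original commands for this step are not shown, so the following is only a minimal sketch of what they might look like. It assumes the `snsbn` and `plot` aliases from Step 1; the helper name `plot_correlation_heatmap`, the figure size, and the random demo frame standing in for `total_data` are illustrative assumptions.

```python
import matplotlib.pyplot as plot
import numpy as np
import pandas as pd
import seaborn as snsbn


def plot_correlation_heatmap(data: pd.DataFrame, vmax: float = 0.9) -> pd.DataFrame:
    """Draw a heatmap of pairwise correlations and return the correlation matrix."""
    corr = data.corr()  # Pearson correlation between every pair of columns
    plot.figure(figsize=(12, 9))  # the figure size is an arbitrary choice
    snsbn.heatmap(corr, vmax=vmax, square=True)  # cap the color scale at vmax
    return corr


# Hypothetical stand-in for total_data: random columns named like the dataset's.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["V1", "V2", "Amount"])
corr_matrix = plot_correlation_heatmap(demo)
```

In a notebook the figure renders on its own; in a script, call `plot.show()` afterwards.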
OUTPUT
Most features do not correlate with each other, but some do. For example, V2 correlates negatively with Amount, while V7 correlates positively with Amount. These relationships help us decide which attributes are relevant for the model.
Step 6: Separation of X and Y Values
Now, we separate the X and Y values in the dataset. It means splitting the data into input features and output labels. Use the following commands to do so:
x = total_data.drop(['Class'], axis = 1)
y = total_data["Class"]
print(x.shape)
print(y.shape)
xdata = x.values
ydata = y.values
OUTPUT
Step 7: Building Random Forest Model
Now, we will divide the dataset into two parts: one for training the model and the other for testing it. We write the following code to achieve it:
Here, we use RandomForestClassifier() to first train the model on the xtrain and ytrain data. Then, we use it to predict the ypred output values from the test inputs.
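The code for this step is missing from the text, so here is a hedged sketch of the split-train-predict sequence it describes. The 80/20 split, the `random_state` values, the names `xtest` and `ytest`, and the synthetic data standing in for `xdata` and `ydata` are all assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for xdata / ydata: an imbalanced synthetic dataset.
xdata, ydata = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
xtrain, xtest, ytrain, ytest = train_test_split(
    xdata, ydata, test_size=0.2, random_state=42
)

# Fit the forest on the training rows, then predict labels for the test rows.
rfc = RandomForestClassifier(random_state=42)
rfc.fit(xtrain, ytrain)
ypred = rfc.predict(xtest)
print(ypred.shape)
```

With so few fraud rows, passing `stratify=ydata` to `train_test_split` is one way to keep the fraud ratio the same in both parts.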
Step 8: Evaluation Parameters
Now that we have built a Random Forest model, let us calculate some evaluation metrics for it using the following code:
We calculate the accuracy by comparing the predicted Y values with the actual ones. Next, we calculate the precision, which is the fraction of transactions predicted as fraud that are actually fraud. Then, we calculate the recall, which is the number of members of a class that are correctly identified divided by the total number of members of that class. Then, we calculate the F1 score, which is the harmonic mean of precision and recall. Finally, we calculate the Matthews Correlation Coefficient (MCC), which gives a high score only if there are good results in all four confusion matrix categories.
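As the evaluation code is not shown, the sketch below illustrates one way to compute these five scores with scikit-learn. The helper name `evaluate` and the hand-made labels are hypothetical; in the project you would pass the actual test labels and `ypred` instead.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)


def evaluate(y_true, y_pred) -> dict:
    """Compute the five scores for binary labels, where 1 means fraud."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }


# Hypothetical labels: 2 true positives, 1 false positive,
# 1 false negative, and 4 true negatives.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
scores = evaluate(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

For these toy labels the accuracy is 6/8, precision and recall are both 2/3, and the MCC is 7/15, which shows how accuracy alone can look better than the other scores when one class dominates.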
OUTPUT
Step 9: Confusion Matrix
Finally, let us make the Confusion Matrix for our model. The Confusion Matrix summarizes the model's performance by showing how many predictions for each class were correct and how many were incorrect. Write the following code to display the Confusion Matrix for our model:
We label the classes as either GENUINE or FRAUD. Then, we create our confusion matrix from the actual and predicted output values and draw it as a heatmap.
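The plotting code for this step is not included in the text; a minimal sketch under the same assumptions as before (the `snsbn` and `plot` aliases from Step 1) could look like this. The helper name `plot_confusion_matrix`, the figure size, the axis titles, and the toy labels are illustrative choices.

```python
import matplotlib.pyplot as plot
import pandas as pd
import seaborn as snsbn
from sklearn.metrics import confusion_matrix

LABELS = ["GENUINE", "FRAUD"]  # names for class 0 and class 1


def plot_confusion_matrix(y_true, y_pred) -> pd.DataFrame:
    """Draw the confusion matrix as a heatmap and return it as a labeled table."""
    matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
    table = pd.DataFrame(matrix, index=LABELS, columns=LABELS)
    plot.figure(figsize=(6, 6))
    snsbn.heatmap(table, annot=True, fmt="d")  # print the integer counts in each cell
    plot.title("Confusion matrix")
    plot.ylabel("Actual class")
    plot.xlabel("Predicted class")
    return table


# Hypothetical labels; in the project, pass the actual test labels and ypred.
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 0]
cm_table = plot_confusion_matrix(y_true, y_pred)
print(cm_table)
```

Rows correspond to the actual class and columns to the predicted class, which is the layout scikit-learn's `confusion_matrix` returns.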
OUTPUT
We can see that our model is very accurate. You can try different models too.
Frequently Asked Questions
What do you mean by Credit Card Fraud Detection Project in Data Mining?
It means that, given an enormous dataset of credit card transactions, we have to find the patterns of fraudulent transactions using Data Mining. It helps secure financial transaction systems. We achieve it through Machine Learning, Python, Data Visualization, Statistics, and other concepts.
What do you mean by Random Forests?
Random Forest is a supervised machine-learning algorithm. It is a classifier that builds many decision trees on different subsets of the dataset. It then combines the trees' predictions, by majority vote or by averaging, to improve the accuracy of the model's predictions.
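To make the averaging concrete, the sketch below checks it against scikit-learn's `RandomForestClassifier`, which averages the class probabilities predicted by its individual trees. The synthetic data and the choice of 25 trees are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: a small synthetic binary classification problem.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each fitted decision tree is available in forest.estimators_.
tree_probs = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
mean_probs = tree_probs.mean(axis=0)

# The forest's probabilities are the average over its trees.
matches_average = np.allclose(mean_probs, forest.predict_proba(X))

# The predicted label is the class with the highest averaged probability.
labels_from_mean = forest.classes_[mean_probs.argmax(axis=1)]
matches_labels = np.array_equal(labels_from_mean, forest.predict(X))
print(matches_average, matches_labels)
```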
What is a Correlation Matrix?
A Correlation Matrix is a table of the correlation coefficients between every pair of attributes in a dataset. It shows which attributes correlate with which others. Attributes can correlate both negatively and positively with each other. It helps in determining the necessary attributes for a model.
What is a Confusion Matrix?
The Confusion Matrix summarizes the performance of a classification model. It expresses how many predictions were correct and incorrect. It has four categories: true positives, false positives, false negatives, and true negatives.
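For a binary problem, the four cells can be unpacked directly from scikit-learn's `confusion_matrix`: true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). The labels below are hypothetical, with 1 standing for fraud and 0 for genuine.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, with 1 = fraud and 0 = genuine.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

# For binary labels ordered [0, 1], ravel() yields the cells as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # prints "TN=4 FP=1 FN=1 TP=2"
```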
Conclusion
Credit card fraud is on the rise today. Scammers are clever and constantly adapt to new technology to harm innocent people. Thus, it is crucial to identify and stop these fraud cases to secure financial transactions. This article studied the Credit Card Fraud Detection Project in Data Mining. We started with the project description, prerequisites, and challenges. Then, we took a sample dataset and built our project step by step, ending with a Random Forest model.