Are these credit card transactions genuine or fraudulent?

Francisca Dias

Table of Contents

This report shows how to deal with highly imbalanced data.

For this purpose I will work on a dataset that contains transactions made by credit cards. There are 284,807 transactions, of which 492 are fraud. This dataset is therefore highly imbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.

Imbalanced data is common in classification problems: most of the time, classes are not represented equally.

Here I will show that excellent accuracy scores (close to 100%) do not reflect the true value of a model: they only reflect the underlying class distribution.

As we will see shortly, the reason we get close to 100% accuracy on imbalanced data is that our models look at the data and cleverly decide that the best thing to do is to always predict "Not Fraud", achieving high accuracy without detecting a single fraud.
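The effect is easy to reproduce on synthetic labels: a classifier that always predicts the majority class scores near-perfect accuracy without learning anything. A minimal sketch using scikit-learn's DummyClassifier (the data here is made up for illustration, not the credit card set):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# 10,000 synthetic transactions, roughly 0.2% fraud (class 1)
rng = np.random.RandomState(7)
y = (rng.rand(10000) < 0.002).astype(int)
X = rng.rand(10000, 5)  # the features are irrelevant to this classifier

# Always predict the most frequent class ("not fraud")
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)

acc = accuracy_score(y, dummy.predict(X))
print('Accuracy of "always Not Fraud":', acc)
```

Despite catching zero frauds, this classifier scores well above 99% accuracy, which is exactly the trap described above.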

To overcome this problem I will look at other performance metrics beyond "accuracy":

  • Confusion Matrix
  • Classification Report (precision, recall and f1 score) and
  • Precision-Recall

Decision trees often perform well on imbalanced datasets, so I will use two tree-based classifiers: bagged CART and Random Forest.

The dataset contains transactions made by credit cards in September 2013 by European cardholders. These transactions occurred over two days.

Here we have 492 frauds out of 284,807 transactions: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation (PCA, principal component analysis, is a dimensionality reduction technique that transforms the dataset into a compressed form).
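As a quick aside on what a PCA transformation does, the sketch below (on synthetic data, not this dataset) compresses 10 correlated features into 2 principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 200 samples of 10 features that really live in a 2-dimensional subspace
base = rng.randn(200, 2)
X = base.dot(rng.randn(2, 10)) + 0.01 * rng.randn(200, 10)

# Keep the 2 directions of largest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (200, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this data
```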

Due to confidentiality issues the original features are not provided.

Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'.

Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset.

The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
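One way to use 'Amount' for cost-sensitive learning, sketched here but not pursued in this report, is to pass the transaction amounts as per-example weights so that misclassifying a large transaction costs more during training. Most scikit-learn classifiers accept a `sample_weight` argument to `fit` (the data below is synthetic):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X = rng.randn(500, 4)
y = (X[:, 0] > 1.5).astype(int)                # synthetic, rare positive class
amount = rng.exponential(scale=100, size=500)  # synthetic transaction amounts

# Weight each example by its amount: errors on big transactions hurt more
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X, y, sample_weight=amount)

print(tree.score(X, y))
```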

Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

This dataset can be found here.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('creditcard.csv')
df.head()
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

We have 284807 instances to work with and can confirm the data has 31 attributes including the class attribute.

In [2]:
df.shape
(284807, 31)

We can see that all of the attributes are numeric: every feature is a float, and the class label is an integer.

In [3]:
df.dtypes
Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

We can see that the classes are imbalanced between 0 (not fraud) and 1 (fraud).

In [4]:
df.groupby('Class').size()
0    284315
1       492
dtype: int64

The most commonly used measure of classifier performance is accuracy: the percent of correct classifications obtained.

Although accuracy is an easy metric to understand and to compare classifiers with, it ignores factors that should be taken into account when honestly assessing the performance of a classifier.

In [5]:
from sklearn.model_selection import train_test_split

array = df.values
X = array[:,0:30]
y = array[:,30]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
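With only 0.172% positives, a purely random split can leave the test set with an unrepresentative number of frauds. As an aside, passing `stratify=y` to `train_test_split` keeps the class ratio identical in both splits; a sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(7)
y = np.zeros(10000, dtype=int)
y[:20] = 1                      # 0.2% positive class
X = rng.rand(10000, 3)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

# The 20 positives are split 16 / 4, matching the 80/20 split exactly
print(y_tr.sum(), y_te.sum())
```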
In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

model_cart = BaggingClassifier()
model_rf = RandomForestClassifier()

model_cart.fit(X_train, y_train)
model_rf.fit(X_train, y_train)

predicted_cart = model_cart.predict(X_test)
predicted_rf = model_rf.predict(X_test)

print('CART score is:',model_cart.score(X_test, y_test))
print('Random Forest score is:',model_rf.score(X_test, y_test))
CART score is: 0.999525999789
Random Forest score is: 0.999578666479

As we can see above, our classifiers reached about 0.999 accuracy. But this just reflects the underlying class distribution. We need to look beyond this metric.

For this problem, imagine that we predicted that a transaction was fraudulent and it turns out it wasn't: a False Positive. This is a mistake, but not a serious one.

BUT, what if we predicted that a transaction was not fraudulent and it turned out to be fraud? A False Negative. This is a big mistake, with serious financial consequences.

We need a method which will take into account all of these numbers.
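The confusion matrix is exactly such a method: it tabulates true and false positives and negatives in one table. A toy check with hand-made labels, using scikit-learn's `confusion_matrix` (the cells below use `pd.crosstab`, which produces the same counts):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
```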

In [7]:
confusion_cart = pd.crosstab(y_test, predicted_cart, rownames=['Actual'], colnames=['Predicted'], margins=True)
confusion_cart
Predicted 0.0 1.0 All
0.0 56856 6 56862
1.0 21 79 100
All 56877 85 56962
In [8]:
confusion_rf = pd.crosstab(y_test, predicted_rf, rownames=['Actual'], colnames=['Predicted'], margins=True)
confusion_rf
Predicted 0.0 1.0 All
0.0 56858 4 56862
1.0 20 80 100
All 56878 84 56962

We are trying to minimize the false negatives:

Random Forest (FN = 20) seems to perform better than CART (FN = 21).
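Minimizing false negatives is the same thing as maximizing recall, since recall = TP / (TP + FN). A quick check with the Random Forest counts from the table above (TP = 80, FN = 20):

```python
tp, fn = 80, 20    # Random Forest counts from the confusion matrix above
recall = tp / (tp + fn)
print(recall)      # 0.8, the recall reported for class 1.0 below
```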

In [9]:
from sklearn.metrics import classification_report

# CART
report_cart = classification_report(y_test, predicted_cart)
print(report_cart)

# Random Forest
report_rf = classification_report(y_test, predicted_rf)
print(report_rf)
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00     56862
        1.0       0.93      0.79      0.85       100

avg / total       1.00      1.00      1.00     56962

             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00     56862
        1.0       0.95      0.80      0.87       100

avg / total       1.00      1.00      1.00     56962

In [15]:
from sklearn.metrics import average_precision_score

average_precision_Cart = average_precision_score(y_test, predicted_cart)
average_precision_RF = average_precision_score(y_test, predicted_rf)

print('Average precision-recall score for CART: {0:0.2f}'.format(average_precision_Cart))
print('Average precision-recall score for Random Forest: {0:0.2f}'.format(average_precision_RF))
Average precision-recall score for CART: 0.73
Average precision-recall score for Random Forest: 0.76

The official documentation can be found here.

Precision-Recall is a useful measure of prediction success when the classes are very imbalanced. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned.

The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).
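To trace that tradeoff, the curve needs probability scores rather than hard 0/1 predictions (the average-precision cells above score the hard predictions, which collapses the curve to a single operating point). A sketch on made-up scores using `precision_recall_curve`; with a real model the scores would come from something like `model.predict_proba(X_test)[:, 1]`:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.6])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# One precision/recall pair per threshold, plus the final (1, 0) endpoint
print(precision)
print(recall)
print(thresholds)
```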

A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels. An ideal system with high precision and high recall will return many results, with all results labeled correctly.

Precision (P) is defined as the number of true positives (T_p) over the number of true positives plus the number of false positives (F_p): P = T_p / (T_p + F_p).
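That definition is easy to check by hand on a tiny example, against scikit-learn's `precision_score`:

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 2
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1

print(tp / (tp + fp))                    # 2/3
print(precision_score(y_true, y_pred))   # the same value
```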
