A Note on Model Evaluation for Imbalanced Data

Kyle Chung

Last Updated: 2020-07-08


A little bit of background first. Recently I attended a Google online workshop on their XAI service, which used a dataset of simulated payments for a fraud detection exercise as the working demo.

The workshop script can be found here: https://bit.ly/xai-fraud-codelab

As the title of this notebook suggests, I'm more interested in the imbalanced dataset problem than in model explainability. During the workshop I raised a question about how the model should be evaluated against an extremely imbalanced dataset like this one (and, really, for all fraud-related applications), but unfortunately, due to time constraints, I couldn't get a fruitful discussion with the lecturer.

In this notebook I'm going to roughly reproduce the codelab demo, but with a focus on pointing out how the model evaluation can MISLEAD the end users. I say roughly, because I'm going to do the exercise in a slightly different way, in that:

  1. I'm going to use datatable instead of pandas, just for fun!
  2. I'm going to use lightgbm instead of tensorflow for the boosted tree model
  3. I'm going to evaluate the model in a business-oriented manner
  4. I will skip the model explainability part since it is out of scope

Readers are of course encouraged to finish the codelab before diving into this notebook.

Preparation

The dataset is from Kaggle and is also hosted on Google Cloud Storage. We will use gsutil to download the data. In addition, we use datatable for data manipulation, lightgbm for machine learning, and plotly for a quick visualization. We also import scikit-learn, but just for some metric computation, since I'm too lazy to code it on my own.

I'm a big fan of data.table when coding in the R language, so no doubt I'm excited that it has made it to Python. The development is still in early beta, though, and the documentation is not very comprehensive. Anyway, it is good to see something challenging the dominance of pandas in the Python data community.

In [1]:
import os
import random

import datatable as dt
from datatable import f, by
import lightgbm as lgb
from sklearn.metrics import (
  confusion_matrix,
  precision_recall_curve,
  precision_recall_fscore_support
)
import plotly.express as px

random.seed(666)
In [2]:
infile = "fraud_data_kaggle.csv"
if not os.path.exists(infile):
  os.system("gsutil cp gs://financial_fraud_detection/{infile} .".format(infile=infile))
In [3]:
# Read the data.
DT = dt.fread(infile)

# Remove unused columns.
for col in ["nameOrig", "nameDest", "isFlaggedFraud"]:
  del DT[:, col]
In [4]:
DT.head(5)
Out[4]:
   | step | type     | amount  | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud
---+------+----------+---------+---------------+----------------+----------------+----------------+--------
 0 | 1    | PAYMENT  | 9839.64 | 170136        | 160296         | 0              | 0              | 0
 1 | 1    | PAYMENT  | 1864.28 | 21249         | 19384.7        | 0              | 0              | 0
 2 | 1    | TRANSFER | 181     | 181           | 0              | 0              | 0              | 1
 3 | 1    | CASH_OUT | 181     | 181           | 0              | 21182          | 0              | 1
 4 | 1    | PAYMENT  | 11668.1 | 41554         | 29885.9        | 0              | 0              | 0

The dataset is extremely imbalanced.

Among all the records, the positive labels (fraud cases) comprise only 0.13%.

In [5]:
# The expression is much more compact in datatable than in pandas.
# But due to Python's own design it is still not comparable to what we can write in the R language,
# which is even more compact.

y_cnt = DT[:, dt.count(), by(f.isFraud)]
y_cnt[:, "pct"] = f.count / dt.sum(f.count)
y_cnt
Out[5]:
   | isFraud | count   | pct
---+---------+---------+-----------
 0 | 0       | 6354407 | 0.998709
 1 | 1       | 8213    | 0.00129082

Under-Sampling BEFORE Train-Test Split

Now we are departing from the original codelab approach.

In the original codelab, the majority class is undersampled first and the train-test split comes afterwards. Here we are going to separate the training and testing sets BEFORE undersampling. In this way we keep a testing set with the original label distribution intact.

Once the testing set is separated, we undersample the remaining training set. I set a sampling rate such that the resulting class ratio is similar to that of the original codelab, roughly 75:25 negative to positive (i.e., about 25% fraud cases).

In [6]:
# Do train-test split BEFORE undersampling.
test_rate = .1
test_id = random.sample(range(DT.nrows), int(DT.nrows * test_rate))
nontest_id = list(set(range(DT.nrows)) - set(test_id))

# On the training set, undersample the negative cases.
DT_train = DT[nontest_id, :][f.isFraud == 0, :]
DT_train = DT_train[random.sample(range(DT_train.nrows), int(DT_train.nrows * .004)), :]
DT_train.rbind(DT[nontest_id, :][f.isFraud == 1, :])

# Inspect the resulting class distribution.
DT_train[:, {"count": dt.count(), "pct": dt.count() / DT_train.nrows}, by(f.isFraud)]
Out[6]:
   | isFraud | pct      | count
---+---------+----------+------
 0 | 0       | 0.757383 | 22876
 1 | 1       | 0.242617 | 7328

Once the undersampling is done, we further separate a random subset for validation purposes. Note that the label distribution will be the same for the final training set and the validation set: both are subject to under-sampling of the majority class.

In [7]:
# Further separate the undersampled training set into training and validation set.
valid_id = random.sample(range(DT_train.nrows), int(DT_train.nrows * .2))
train_id = list(set(range(DT_train.nrows)) - set(valid_id))

DT_valid = DT_train[valid_id, :]
DT_train = DT_train[train_id, :]
DT_test = DT[test_id, :]

y_train = DT_train[:, f.isFraud].to_numpy()[:, 0]
y_valid = DT_valid[:, f.isFraud].to_numpy()[:, 0]
y_test = DT_test[:, f.isFraud].to_numpy()[:, 0]

One-Hot Encoding

As a machine learning practitioner, I always appreciate the beauty of using lightgbm along with pandas, in that it can automatically handle categorical features (with the pandas.Categorical type) without us doing any pre-processing. But since we chose datatable this time, we need to tackle this on our own.

Fortunately, there is only one categorical feature in the dataset: the transaction type.

In [8]:
# Do one-hot encoding on feature "type" since `datatable` doesn't support categorical encoding.
# Okay now we miss pandas a bit.
# Collect the full set of transaction types from the complete dataset,
# so that no category is missed in any of the splits.
types = set(DT[:, f.type].to_list()[0])


def onehotencode(DT):
  # Note that DT is changed in-place.
  for v in types:
    newv = "type_" +  v
    DT[:, newv] = (f.type == v)
  del DT[:, f.type]
  

for data in [DT_train, DT_valid, DT_test]:
  onehotencode(data)

Gradient Boosting Trees

We are not going to fine-tune the hyper-parameters since that is out of scope. For readers interested in this kind of (very, very powerful) model, I have another notebook with some detailed discussion:

Demystify Modern Gradient Boosting Trees

For now we are just going to set up some basic configuration.

In [9]:
# Setup for lightgbm modeling.
feature_names = DT_train[:, f[:].remove(f.isFraud)].names

lgb_train = lgb.Dataset(DT_train[:, f[:].remove(f.isFraud)], y_train,
                        feature_name=feature_names)
lgb_valid = lgb.Dataset(DT_valid[:, f[:].remove(f.isFraud)], y_valid,
                        feature_name=feature_names, reference=lgb_train)
lgb_test = lgb.Dataset(DT_test[:, f[:].remove(f.isFraud)], y_test,
                       feature_name=feature_names, reference=lgb_train)

params = {
  "boosting_type": "gbdt",
  "objective": "binary",
  "metric": ["binary_logloss", "binary_error", "auc"],
  "num_leaves": 31,
  "max_depth": 4,
  "learning_rate": .03,
  "verbose": 0,
  "first_metric_only": True  # For early stopping.
}
In [10]:
# Be aware that the early stopping works only on the last dataset in `valid_sets`.
# So we need to make sure the order is correct. We use the validation set to determine early stopping, if any.

metrics = dict()
model = lgb.train(
  params,
  lgb_train,
  num_boost_round=500,
  early_stopping_rounds=20,
  verbose_eval=-1,
  valid_sets=[lgb_train, lgb_test, lgb_valid],
  valid_names=["train", "test", "valid"],
  evals_result=metrics
)
Training until validation scores don't improve for 20 rounds
Did not meet early stopping. Best iteration is:
[500]	train's binary_logloss: 0.0120697	train's binary_error: 0.00285549	train's auc: 0.999942	test's binary_logloss: 0.0149658	test's binary_error: 0.00615155	test's auc: 0.999083	valid's binary_logloss: 0.021377	valid's binary_error: 0.00629139	valid's auc: 0.999301
Evaluated only: binary_logloss

Choose the Metric Wisely...

For an imbalanced dataset, accuracy is always high since the majority class is easy to predict. Worse, we can cheat by simply predicting the majority class all the time. Even if we don't cheat, and less obviously, AUC will also be high, since again the model can easily rank the majority class correctly.

As one can see in this example, even when the model is trained with under-sampling and evaluated on the testing set with the original label distribution, AUC is high and is not informative at all.
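
To make this concrete, here is a minimal sketch (not part of the original codelab) showing that a trivial classifier which always predicts the majority class already scores an accuracy of roughly 99.87% on the untouched testing set, simply because that is the share of negative cases:

import numpy as np
from sklearn.metrics import accuracy_score

# A trivial baseline that always predicts the majority (non-fraud) class.
p_baseline = np.zeros_like(y_test)

# Accuracy is ~0.9987 simply because ~99.87% of the records are negative,
# even though this "model" catches exactly zero fraud cases.
print("Baseline accuracy:", accuracy_score(y_test, p_baseline))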

In [11]:
# Tidy the resulting metrics for the best iteration.

metrics_all = dt.Frame(metrics["train"])[model.best_iteration - 1,:]
metrics_valid = dt.Frame(metrics["valid"])[model.best_iteration - 1,:]
metrics_test = dt.Frame(metrics["test"])[model.best_iteration - 1,:]

metrics_all[:, "data"] = "train"
metrics_valid[:, "data"] = "valid"
metrics_test[:, "data"] = "test"

metrics_all.rbind(metrics_valid)
metrics_all.rbind(metrics_test)
metrics_all
Out[11]:
   | binary_logloss | binary_error | auc      | data
---+----------------+--------------+----------+------
 0 | 0.0120697      | 0.00285549   | 0.999942 | train
 1 | 0.021377       | 0.00629139   | 0.999301 | valid
 2 | 0.0149658      | 0.00615155   | 0.999083 | test

In general, reporting AUC alone won't be informative at all. The challenge of an imbalanced dataset is usually the precision-recall trade-off. But before going into that, let's print the confusion matrices.

In [12]:
p_train = model.predict(DT_train[:, f[:].remove(f.isFraud)])
p_valid = model.predict(DT_valid[:, f[:].remove(f.isFraud)])
p_test = model.predict(DT_test[:, f[:].remove(f.isFraud)])
In [13]:
confusion_matrix(y_train, p_train > .5)
Out[13]:
array([[18233,    54],
       [   15,  5862]])
In [14]:
confusion_matrix(y_valid, p_valid > .5)
Out[14]:
array([[4559,   30],
       [   8, 1443]])

In the original codelab, the model evaluation just ended here, which I doubt is a good example of how we should handle an imbalanced dataset.

...AND ALSO the Evaluation Dataset

The confusion matrices for both the training and validation sets look quite okay. Indeed, they look amazing! But that is not the end of the story. How about the confusion matrix for our final testing set, which is NOT under-sampled?

In [15]:
confusion_matrix(y_test, p_test > .5)
Out[15]:
array([[631465,   3912],
       [     2,    883]])

Unfortunately, we see lots of FALSE POSITIVES here. But this is indeed EXPECTED. And it is also more relevant than reporting results on the under-sampled validation set. Why? Because when we deploy the model, we will NOT under-sample the real data at inference time. Every single case matters. So what's the point of showing model performance on an under-sampled dataset that is NEVER going to be representative of the real data coming in the future?

For an imbalanced dataset, when we under-sample or change the class weights in the loss function to force the model to focus more on the minority class, the natural consequence is low precision, because the model is now more encouraged to make positive predictions.

Let's see the precision and recall for the validation set (subject to under-sampling):

In [16]:
precision_valid, recall_valid, f1_valid, sup_valid = precision_recall_fscore_support(y_valid, p_valid > .5)
print("Precision on valid:", precision_valid[1])
print("Recall on valid   :", recall_valid[1])
Precision on valid: 0.9796334012219959
Recall on valid   : 0.994486560992419

And also the testing set (no under-sampling but representative of real data):

In [17]:
precision_test, recall_test, f1_test, sup_test = precision_recall_fscore_support(y_test, p_test > .5)
print("Precision on test:", precision_test[1])
print("Recall on test   :", recall_test[1])
Precision on test: 0.18415015641293014
Recall on test   : 0.9977401129943503
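
The arithmetic behind that 0.18 is straightforward. Here is a quick sketch based on the testing-set confusion matrix above: the false positive rate is tiny (about 0.6% of the negatives), but since negatives outnumber positives by more than 700 to 1, the false positives still swamp the true positives and drag precision down:

# Numbers taken from the testing-set confusion matrix above.
tn, fp, fn, tp = 631465, 3912, 2, 883

fpr = fp / (fp + tn)        # ~0.006: only ~0.6% of negatives are misclassified...
precision = tp / (tp + fp)  # ~0.184: ...yet those 3912 false positives dwarf the 883 true positives.
recall = tp / (tp + fn)     # ~0.998

print(fpr, precision, recall)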

Or, to make it even clearer, we can plot the trade-off between precision and recall across all decision thresholds:

In [18]:
# Plot the precision-recall curve along the decision thresholds on testing set.
p2, r2, t2 = precision_recall_curve(y_test, p_test)
# precision/recall have one more element than thresholds; drop the last point to align them.
prc2 = dt.Frame({"precision": p2[:-1], "recall": r2[:-1], "threshold": t2})
prc2 = prc2.to_pandas().melt(id_vars=["threshold"], value_vars=["precision", "recall"],
                             var_name="metric", value_name="score")

px.line(prc2, x="threshold", y="score", color="metric",
        title="Precision-Recall Curve on Testing Set")

This is the relevant information for the end user. When we report results on an under-sampled dataset, we lure the user into thinking our model is far better than it really is. In the end, when the user monitors the model performance in production, it will become obvious that our original evaluation was misleading.

>97% of BOTH precision and recall? I'm sorry but that's not gonna happen. :(

Take-Aways

This is a technical note on handling imbalanced datasets in machine learning applications.

  1. It is usually about the precision-recall trade-off. So focus on reporting and improving precision and recall.
  2. The end user should have input on which one (precision or recall) is more important than the other, based on the use case (a sketch of how such input can translate into a threshold choice follows below).
  3. AUC, as a ranking metric, will be high by nature and is not informative for judging model performance.
  4. We should report the modeling result on a representative dataset. A dataset that is under-sampled is NOT representative of the data coming in the future, unless under-sampling is part of the application.
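
As a closing illustration (not part of the original codelab), here is a minimal sketch of how such a business input could be turned into a decision threshold on the representative testing set. The 90% recall target is purely hypothetical:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical business requirement: catch at least 90% of fraud cases.
min_recall = .9

precision, recall, thresholds = precision_recall_curve(y_test, p_test)

# precision/recall have one more element than thresholds; drop the last point to align them.
feasible = recall[:-1] >= min_recall

# Among all thresholds meeting the recall target, pick the one with the best precision.
best = np.argmax(np.where(feasible, precision[:-1], -1))
print("Chosen threshold:", thresholds[best])
print("Precision:", precision[best], "Recall:", recall[best])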