Kyle Chung
2020-07-08 Last Updated
A little bit of background first. Recently I attended a Google online workshop on their XAI service, using a dataset of simulated payments for a fraud detection exercise as the working demo.
The workshop script can be found here: https://bit.ly/xai-fraud-codelab
As the title of this notebook suggests, I'm more interested in the imbalanced dataset problem than in model explainability. During the workshop I raised a question about how the model should be evaluated against an extremely imbalanced dataset like this one (as in all fraud-related applications), but unfortunately, due to time constraints, I couldn't get a fruitful discussion going with the lecturer.
In this notebook I'm going to roughly reproduce the codelab demo, but with a focus on pointing out how the model evaluation can MISLEAD the end users. I say roughly, because I'm going to do the exercise in a slightly different way, in that I use:
- datatable instead of pandas, just for fun!
- lightgbm instead of tensorflow for the boosted tree model
The readers are of course encouraged to finish the codelab first before diving into this notebook.
The dataset is from Kaggle and is also hosted on Google Cloud Storage. We will use gsutil to download the data. In addition, we use datatable for manipulating the data, lightgbm for machine learning, and plotly for a quick visualization. We also import scikit-learn, but just for some metric generation, since I'm too lazy to code it on my own.
I'm a big fan of data.table when I'm coding in the R language, so no doubt I'm excited that they made it to Python. But the development is still in its early beta and the documentation is not very comprehensive. Anyway, it is good to see something challenging the dominance of pandas in the Python data community.
import os
import random
import datatable as dt
from datatable import f, by
import lightgbm as lgb
from sklearn.metrics import (
confusion_matrix,
precision_recall_curve,
precision_recall_fscore_support
)
import plotly.express as px
random.seed(666)
infile = "fraud_data_kaggle.csv"
if not os.path.exists(infile):
os.system("gsutil cp gs://financial_fraud_detection/{infile} .".format(infile=infile))
# Read the data.
DT = dt.fread(infile)
# Remove unused columns.
for col in ["nameOrig", "nameDest", "isFlaggedFraud"]:
del DT[:, col]
DT.head(5)
The dataset is extremely imbalanced.
Among all the records, the positive labels (fraud cases) comprise only 0.13%.
# The expression is much more compact in datatable than in pandas.
# But due to Python's own design it is still not comparable to what we can write in the R language,
# which is even more compact.
y_cnt = DT[:, dt.count(), by(f.isFraud)]
y_cnt[:, "pct"] = f.count / dt.sum(f.count)
y_cnt
Now we are departing from the original codelab approach.
In the original codelab we undersample the majority class and then proceed to do a train-test split for modeling. Here we are going to separate the train and test sets BEFORE undersampling. In this way we are able to keep a testing set with the original label distribution intact.
Once we separate out the testing set, we undersample the remaining training set. I set a sampling rate such that the resulting class distribution is similar to that of the original codelab, roughly 25:75 for positive to negative cases.
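As a side note, that .004 rate is not magic: given a target positive share, the required rate on the negative class can be backed out with a quick calculation. A minimal sketch, assuming rough class counts for this dataset after the non-test split (the figures and variable names here are mine, for illustration only):
# Back out the negative-class sampling rate needed to hit a target positive share.
# The counts below are approximate figures for this dataset after the 90% non-test split.
n_pos = 7_400        # positive cases, kept intact
n_neg = 5_700_000    # negative cases, to be undersampled
target_pos_share = .25
# We want: n_pos / (n_pos + rate * n_neg) == target_pos_share
rate = n_pos * (1 - target_pos_share) / (target_pos_share * n_neg)
print(round(rate, 4))  # ~.0039, close to the hard-coded .004 used below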
# Do train-test split BEFORE undersampling.
test_rate = .1
test_id = random.sample(range(DT.nrows), int(DT.nrows * test_rate))
nontest_id = list(set(range(DT.nrows)) - set(test_id))
# On the training set, undersample the negative cases.
DT_train = DT[nontest_id, :][f.isFraud == 0, :]
DT_train = DT_train[random.sample(range(DT_train.nrows), int(DT_train.nrows * .004)), :]
DT_train.rbind(DT[nontest_id, :][f.isFraud == 1, :])
# Inspect the resulting class distribution.
DT_train[:, {"count": dt.count(), "pct": dt.count() / DT_train.nrows}, by(f.isFraud)]
Once the undersampling is done, we further separate out a random set for validation purposes. Note that the label distribution will be the same for the final training set and the validation set: they are both subject to under-sampling of the majority class.
# Further separate the undersampled training set into training and validation set.
valid_id = random.sample(range(DT_train.nrows), int(DT_train.nrows * .2))
train_id = list(set(range(DT_train.nrows)) - set(valid_id))
DT_valid = DT_train[valid_id, :]
DT_train = DT_train[train_id, :]
DT_test = DT[test_id, :]
y_train = DT_train[:, f.isFraud].to_numpy()[:, 0]
y_valid = DT_valid[:, f.isFraud].to_numpy()[:, 0]
y_test = DT_test[:, f.isFraud].to_numpy()[:, 0]
As a machine learning practitioner, I always appreciate the beauty of using lightgbm along with pandas, in that it can automatically fit categorical features (with the pandas.Categorical type) without us doing any pre-processing. But since we chose datatable this time, we need to tackle this on our own. Fortunately there is only one categorical feature in the dataset: the transaction type.
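For comparison, here is a rough sketch of what the pandas route could look like. It is not part of this notebook's pipeline; the column handling mirrors this dataset, but the snippet is only illustrative:
# Sketch only: with pandas, lightgbm can consume the raw "type" column directly
# once it is cast to a categorical dtype, so no manual one-hot encoding is needed.
import pandas as pd
import lightgbm as lgb

df = pd.read_csv("fraud_data_kaggle.csv")
df = df.drop(columns=["nameOrig", "nameDest", "isFlaggedFraud"])
df["type"] = df["type"].astype("category")

X = df.drop(columns=["isFraud"])
y = df["isFraud"]

# lightgbm picks up pandas categorical columns automatically;
# they can also be listed explicitly via `categorical_feature`.
lgb_data = lgb.Dataset(X, y, categorical_feature=["type"])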
# Do one-hot encoding on feature "type" since `datatable` doesn't support categorical encoding.
# Okay now we miss pandas a bit.
types = set(DT_valid[:, f.type].to_list()[0])
def onehotencode(DT):
# Note that DT is changed in-place.
for v in types:
newv = "type_" + v
DT[:, newv] = (f.type == v)
del DT[:, f.type]
for data in [DT_train, DT_valid, DT_test]:
onehotencode(data)
We are not going to fine-tune the hyper-parameters since that is out of scope. For readers interested in this kind of (very, very powerful) model I have another notebook with some detailed discussions:
Demystify Modern Gradient Boosting Trees
For now we are just going to set up some basic configuration.
# Setup for lightgbm modeling.
feature_names = DT_train[:, f[:].remove(f.isFraud)].names
lgb_train = lgb.Dataset(DT_train[:, f[:].remove(f.isFraud)], y_train,
feature_name=feature_names)
lgb_valid = lgb.Dataset(DT_valid[:, f[:].remove(f.isFraud)], y_valid,
feature_name=feature_names, reference=lgb_train)
lgb_test = lgb.Dataset(DT_test[:, f[:].remove(f.isFraud)], y_test,
feature_name=feature_names, reference=lgb_train)
params = {
"boosting_type": "gbdt",
"objective": "binary",
"metric": ["binary_logloss", "binary_error", "auc"],
"num_leaves": 31,
"max_depth": 4,
"learning_rate": .03,
"verbose": 0,
"first_metric_only": True # For early stopping.
}
# Be aware that the early stopping works only on the last dataset in `valid_sets`.
# So we need to make sure the order is correct. We use the validation set to determine early stopping, if any.
metrics = dict()
model = lgb.train(
params,
lgb_train,
num_boost_round=500,
early_stopping_rounds=20,
verbose_eval=-1,
valid_sets=[lgb_train, lgb_test, lgb_valid],
valid_names=["train", "test", "valid"],
evals_result=metrics
)
For an imbalanced dataset, accuracy is always high since the majority class is easy to predict. On top of that, we can cheat by predicting only the majority class all the time. Less obviously, even if we don't cheat, AUC will also be high, since again the model can easily rank the majority class correctly.
As one can see in this example, even when the model is trained with under-sampling and evaluated on a testing set with the original label distribution, AUC is high and not informative at all.
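To make the "cheating" point concrete, here is a throwaway baseline that always predicts the majority class, evaluated on the testing set built above (the numpy import is added just for this sketch):
import numpy as np

# A dummy "model" that always predicts the negative (majority) class.
p_dummy = np.zeros(len(y_test))

# Accuracy is already ~99.87% simply because frauds are only ~0.13% of the records,
# even though this baseline never catches a single fraud.
print("Majority-class baseline accuracy:", (p_dummy == y_test).mean())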
# Tidy the resulting metrics for the best iteration.
metrics_all = dt.Frame(metrics["train"])[model.best_iteration - 1,:]
metrics_valid = dt.Frame(metrics["valid"])[model.best_iteration - 1,:]
metrics_test = dt.Frame(metrics["test"])[model.best_iteration - 1,:]
metrics_all[:, "data"] = "train"
metrics_valid[:, "data"] = "valid"
metrics_test[:, "data"] = "test"
metrics_all.rbind(metrics_valid)
metrics_all.rbind(metrics_test)
metrics_all
In general, reporting AUC won't be informative at all. The challenge of an imbalanced dataset is usually the precision-recall trade-off. But before we get to that, let's print the confusion matrices.
p_train = model.predict(DT_train[:, f[:].remove(f.isFraud)])
p_valid = model.predict(DT_valid[:, f[:].remove(f.isFraud)])
p_test = model.predict(DT_test[:, f[:].remove(f.isFraud)])
confusion_matrix(y_train, p_train > .5)
confusion_matrix(y_valid, p_valid > .5)
In the original codelab, the model evaluation just ended here, which I doubt is a good example of how we should handle an imbalanced dataset.
The confusion matrices for both the training and validation sets look quite okay. Indeed, they look amazing! But that is not the end of the story. How about the confusion matrix for our final testing set, which is NOT under-sampled?
confusion_matrix(y_test, p_test > .5)
Unfortunately, we see lots of FALSE POSITIVEs here. But this is indeed EXPECTED, and it is also more relevant than reporting the result on the under-sampled validation set. Why? Because when we deploy the model, we will NOT under-sample our real data at inference time. Every single case matters. So what's the point of showing model performance on an under-sampled dataset that is NEVER going to be representative of the real data coming in the future?
For an imbalanced dataset, when we under-sample or change the class weights in the loss function so as to force the model to focus more on the minority class, the natural consequence is a low precision, because the model is now more encouraged to make a positive prediction.
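The drop in precision can also be reasoned about directly: recall and the false positive rate do not depend on the class mix, but precision does. A back-of-the-envelope sketch, using made-up recall and false-positive-rate figures rather than the actual outputs of this notebook:
# Precision as a function of class prevalence, holding recall and FPR fixed.
def precision_at_prevalence(recall, fpr, prevalence):
    expected_tp = recall * prevalence
    expected_fp = fpr * (1 - prevalence)
    return expected_tp / (expected_tp + expected_fp)

recall, fpr = .99, .01  # hypothetical model characteristics

# On a 25%-positive (under-sampled) set the precision looks great...
print(precision_at_prevalence(recall, fpr, prevalence=.25))    # ~0.97

# ...but at the original 0.13% prevalence the same model drowns in false positives.
print(precision_at_prevalence(recall, fpr, prevalence=.0013))  # ~0.11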
Let's see the precision and recall for the validation set (subject to under-sampling):
precision_valid, recall_valid, f1_valid, sup_valid = precision_recall_fscore_support(y_valid, p_valid > .5)
print("Precision on valid:", precision_valid[1])
print("Recall on valid :", recall_valid[1])
And also the testing set (no under-sampling but representative of real data):
precision_test, recall_test, f1_test, sup_test = precision_recall_fscore_support(y_test, p_test > .5)
print("Precision on test:", precision_test[1])
print("Recall on test :", recall_test[1])
Or, to make it even clearer, we can plot the trade-off between precision and recall along all the decision thresholds:
# Plot the precision-recall curve along the decision thresholds on testing set.
p2, r2, t2 = precision_recall_curve(y_test, p_test)
prc2 = dt.Frame({"precision": p2[1:], "recall": r2[1:], "threshold": t2})
prc2 = prc2.to_pandas().melt(id_vars=["threshold"], value_vars=["precision", "recall"],
var_name="metric", value_name="score")
px.line(prc2, x="threshold", y="score", color="metric",
title="Precision-Recall Curve on Testing Set")
This is the relevant information for the end user. When we report results on an under-sampled dataset, we lure the user into thinking our model is really good, which is not true. In the end, when the user monitors the model performance in the future, it will become obvious that our original evaluation was a false statement.
>97% of BOTH precision and recall? I'm sorry but that's not gonna happen. :(
This is a technical note on handling imbalanced datasets in machine learning applications.