Abstract
Model explainability has gained more and more attention among machine learning practitioners, especially with the popularization of deep learning frameworks, which further promotes the use of increasingly complicated models in pursuit of accuracy. In reality, however, the model with the highest accuracy may not be the one that gets deployed. Trust is one important factor affecting the adoption of complicated models. In this notebook we give a brief introduction to several popular methods for model explainability, with a focus on hands-on examples demonstrating how we can actually explain a model, across a variety of model families.
This notebook is written with reticulate, a package that allows inter-operation between R and Python.
Why do we need to explain a machine learning model? The benefit of an explainable model over a black-box model is that the model can be trusted. Trust can be important in many real applications where the successful deployment of a machine learning model requires the trust of end users. Sometimes trust plays an even bigger role than model accuracy.
Other than trust, model explainability (or interpretability, interchangeably used hereafter) may also guide us in the correct direction to further improve the model.1
In general, a linear model is more interpretable than a non-linear one, but it also suffers from lower accuracy. More advanced, and hence more complicated, models usually have worse interpretability.
One should not confuse model explainability with actual causality. Being able to explain a model doesn’t mean that we can identify any ground-truth causal relation behind the model. Model explainability is for, and only for, the model, not for the facts we’d like to model. Nevertheless, understanding how we can reason about the model will definitely help us better model the actual pattern behind the scenes.
In this notebook we will walk through 3 popular approaches to model prediction explanation, each of which comes with a dedicated Python package:
All coding examples will be based on the above 3 packages. What if we’d like to explore these approaches in R? For lime we have iml, which contains more than just the LIME approach. For shap, there is an R port called shapper. As for interpret, it already comes with its own R API.
An explanation model \(g(x\prime)\) is an interpretable approximation of the original model \(f(x)\). Its sole purpose is to provide the extra explainability the original model fails to offer due to its own complexity.
The general idea is to use a simplified input \(x\prime\) such that \(x = h_x(x\prime)\), where \(h_x(\cdot)\) is a mapping function for any given raw input \(x\). Then the interpretable approximation can be written as:
\[ g(x\prime) \approx f(h_x(x\prime)). \]
Additive feature attribution methods specify an explanation model of the following form:
\[ g(z\prime) = \phi_0 + \sum_{j = 1}^n \phi_j z_j\prime, \]
where \(n\) is the total number of simplified features and \(z_j\prime \in \{0, 1\}\) is simply an indicator. In many such methods, the simplified input is the indicator of feature presence. Apparently, the choice of an additive model is for (linear) interpretability. The simplified features are an interpretable representation of the original model features.
As we will see, additivity is the key to explainability. All the approaches we will discuss in this notebook follow this philosophy.
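To make the additive form concrete, here is a minimal numeric sketch; the attribution values are made up purely for illustration and are not produced by any of the packages above.
import numpy as np
# Hypothetical attributions for three simplified (binary) features.
phi_0 = 0.5                         # Base value phi_0.
phi = np.array([0.2, -0.1, 0.05])   # phi_j for each simplified feature.
z = np.array([1, 0, 1])             # z'_j: presence indicators.
# g(z') = phi_0 + sum_j phi_j * z'_j
g = phi_0 + phi @ z
print(g)  # 0.75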
One very popular such additive model is LIME (Ribeiro, Singh, and Guestrin (2016)). LIME stands for Local Interpretable Model-Agnostic Explanations. As its full name suggests, LIME can be applied to any machine learning model. LIME achieves prediction-level interpretability by approximating the original model with an explanation model locally around that prediction.
From their original paper:
By “explaining a prediction”, we mean presenting textual or visual artifacts that provide qualitative understanding of the relationship between the instance’s components (e.g. words in text, patches in an image) and the model’s prediction.
The feature space of the original model will in general differ from that of the explanation model. An explanation model uses an interpretable representation of the original features as its training input. Different data types have different interpretable representations.
LIME proposes an explanation model \(g(x\prime)\) with a domain of \(\{0, 1\}\). That is, it acts on the absence or presence of the interpretable features \(x\prime\). The choice is, obviously, for better interpretability.
Language Data
For text classification problems with human language as the source input, the most straightforward interpretable representation is a binary indicator vector over a bag of words. The explanation model will then try to reason which word or token drives the prediction in which direction. This holds regardless of the form of the original model features, be it a word count matrix, a term frequency-inverse document frequency (TF-IDF) matrix, or numerical embeddings.
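As a minimal sketch of this mapping (with a made-up vocabulary and review; this is not lime’s actual code), the interpretable representation of a review is just a presence indicator per vocabulary word:
import numpy as np
# Hypothetical interpretable features (vocabulary) and a raw review.
vocab = ["good", "bad", "movie", "boring", "great"]
text = "a great movie , not boring at all"
tokens = set(text.split())
# x' is a binary indicator of which vocabulary words are present in the review.
x_prime = np.array([int(w in tokens) for w in vocab])
print(dict(zip(vocab, x_prime)))  # {'good': 0, 'bad': 0, 'movie': 1, 'boring': 1, 'great': 1}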
Image Data
For image tasks, the interpretable representation is a joint set of contiguous superpixels that divide the original image into pieces. A superpixel is a group of pixels with similar characteristics. In plain words, just like we segment a sentence into tokens, we segment the image into multiple small pieces and, again, use a binary vector to indicate the absence or presence of each piece for the later perturbation step.
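A rough sketch of the idea, using scikit-image’s SLIC segmenter on one of its sample images (lime_image ships its own default segmenter, so this is for illustration only):
import numpy as np
from skimage.data import astronaut
from skimage.segmentation import slic
img = astronaut()                                     # Any RGB image array would do here.
segments = slic(img, n_segments=50, compactness=10)   # Superpixel id per pixel.
n_superpixels = len(np.unique(segments))
z_prime = np.ones(n_superpixels, dtype=int)           # 1 = superpixel kept, 0 = removed in a perturbation.
print(n_superpixels, z_prime[:10])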
Numerical Data
There is no binarization for numerical features. Instead, a random perturbation (discussed in the next section) is directly done on the target example. Here is the docstring taken from the LimeTabularExplainer implementation in lime:
Explains predictions on tabular (i.e. matrix) data.
For numerical features, perturb them by sampling from a Normal(0,1) and
doing the inverse operation of mean-centering and scaling, according to the
means and stds in the training data. For categorical features, perturb by
sampling according to the training distribution, and making a binary
feature that is 1 when the value is the same as the instance being
explained.
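The perturbation described in the docstring can be sketched roughly as follows; this is a minimal illustration with made-up training statistics, not lime’s actual code:
import numpy as np
rng = np.random.RandomState(64)
# Numerical feature: sample from N(0, 1), then undo the mean-centering and scaling
# using (hypothetical) training-set statistics.
mean, std = 54.0, 9.0                        # e.g. mean/std of "age" in the training data.
z_num = rng.normal(size=5) * std + mean
# Categorical feature: sample from the (hypothetical) training frequencies and keep
# a binary flag that is 1 when the sample matches the instance being explained.
categories, freqs = np.array([0, 1, 2, 3]), np.array([.4, .3, .2, .1])
x_cat = 2                                    # Value of the instance being explained.
z_cat = rng.choice(categories, size=5, p=freqs)
z_cat_binary = (z_cat == x_cat).astype(int)
print(z_num.round(1), z_cat, z_cat_binary)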
In order to estimate the explanation model for a prediction on a target example, features transformed into the interpretable space are perturbed to generate similar examples around that target example. This is referred to as sampling for local exploration.
Given an example \(x\prime\) to be explained, its non-zero interpretable features (remember the space has a domain of \(\{0, 1\}\)) are uniformly sampled to generate a locally similar example \(z\prime\). So \(z\prime\) will always have a subset (or at most an equal set) of the non-zero features that \(x\prime\) has.
Take text data for illustration. If a particular example has the following tokens in the interpretable space:
A B C H I J
Then a possible local sample can be something like:
A C H J
(Imagine those letters are actual words present in the raw text of the example.)
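A minimal sketch of this sampling step (mirroring the simple uniform scheme we also use in the from-scratch section later; the printed subset will vary with the random seed):
import numpy as np
rng = np.random.RandomState(64)
present_tokens = ["A", "B", "C", "H", "I", "J"]  # Interpretable features of the target example.
# z' in the interpretable space: keep each present token with probability 1/2.
z_prime = rng.uniform(size=len(present_tokens)) > .5
local_sample = [t for t, keep in zip(present_tokens, z_prime) if keep]
print(local_sample)  # e.g. ['A', 'C', 'H', 'J']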
By default in the paper 5,000 samples are generated for each single explanation. A hyperparameter \(K\) (default at 10) caps how many non-zero interpretable features we’d like to estimate in the subsequent model learning phase, to keep the model not only tractable to solve but also manageable for human interpretation.
Note that for tabular data the notion of non-zero above does not apply, because a value of 0 can be valid. So instead of doing local sampling subject to non-zero values, all feature values are taken into consideration and we randomize them (by also randomly choosing which feature values to randomize).
Now each perturbed example is first transformed back to the original feature space and then fed into the original model to get the predicted label. That is, from \(z\prime\) we need to get \(f(z)\), where \(f\) is the original model and \(z\) the original feature representation. These labels serve exactly as the labels to train the local explanation model, where each random perturbation \(z\) is weighted by \(\pi_x(z)\), a proximity function that measures how close \(z\) is to \(x\) in the original space.
In the original paper the proximity function is set to be an exponential kernel:
\[ \pi_x(z) = \exp \bigg( \frac{-D(x, z)^2}{\sigma^2} \bigg), \]
where \(D(x, z)\) is the cosine distance for text and the L2 distance for images, and \(\sigma\) is a hyperparameter defaulting to 25 in lime.
The learner is a simple linear model:
\[ g(x\prime) = W \cdot x\prime. \]
We learn the explanation model weights \(W\) by minimizing the sum of proximity-weighted squared losses for all perturbed local samples:
\[ Loss = \sum_{z, z\prime} \pi_x(z) \cdot \big(f(z) - g(z\prime)\big)^2. \]
The learning algorithm proposed in the original LIME paper is the LASSO. In the actual implementation of the lime package, however, Ridge regression is used as the default learner. Despite this, the top \(K\) features for the learner are still chosen by a LASSO path.
Another discrepancy between the original paper and the actual implementation is the proximity function \(\pi_x(z)\). In the actual implementation proximity is calculated in the interpretable space rather than in the original space, so strictly speaking we should have denoted the function as \(\pi_{x\prime}(z\prime)\). Indeed, in the package source code the proximity function is defined as:2
\[ \pi_{x\prime}(z\prime) = \sqrt{ \exp \bigg( \frac{-(D(x\prime, z\prime) \times 100)^2}{\sigma^2} \bigg)}. \]
As one may realize by now, the original model can be a total black box. We only use its predictions as labels to learn the explanation model. Be aware that here \(f(z)\) returns the predicted probability as the label, so the explanation model is a regressor, not a classifier.
Linearity
Since the local explanation model is a linear model, the explanation is subject to linearity. If we have any evidence suggesting heavy non-linearity around a prediction, the output of such an explanation model won’t be faithful. Of course this is more of a compromise for better interpretability.
No Explanation for a NULL Effect for Text Classifiers
As for the local sampling, notice that we only subsample from the features present in the target example. So the explanation is about the presence or absence of things actually present in the target example. A feature that is not present in the example can never be part of the explanation. This could potentially miss an important null effect of a feature.
Take the same example above:
A B C H I J
It could be the case that, instead of the presence of the 6 features, the missingness of feature Z is the most important driving force for the machine learning model to make the prediction. However, due to the local sampling scheme, the importance of (the null of) Z can never be estimated by the explanation model.
Be aware that in lime’s implementation this limitation only applies to text classifiers: the initial local sampling is subject to only non-missing features, i.e., present tokens. For tabular data all features are considered present; the way we mark a feature as absent (missing) is by replacing it randomly.
This is indeed a very important concept: how do we represent missingness? In the later discussion on the Shapley value (and more), we will see that the ability to properly represent a feature value as missing is the key to attributing feature importance in a theoretically sound manner.
We use the Large Movie Review Dataset to do a binary sentiment classification exercise. We will use machine learning libraries such as scikit-learn and tensorflow to quickly build a variety of (rather complicated and hard to interpret) models and use lime to experiment with explanation modeling.
3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)]
import os
import json
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
import warnings
warnings.simplefilter(action="ignore", category=Warning)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
print(tf.__version__)
2.0.0
/device:GPU:0
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
import joblib
print(sklearn.__version__)
0.22
# Create model dir to cache all models trained in the notebook.
model_dir = "models"
if not os.path.exists(model_dir):
os.makedirs(model_dir)
# Directory to cache dataset.
home = os.path.expanduser("~")
cache_dir = os.path.join(home, ".keras")
First, we prepare the movie review dataset.3
import tensorflow_datasets as tfds
# Load the data as tf.data.Dataset.
imdb = tfds.load(name="imdb_reviews", as_supervised=True,
data_dir=os.path.join(home, "tensorflow_datasets"))
The dataset is a perfectly balanced dataset with 50,000 examples, half for positive and half for negative sentiment.
# Extract all texts as list since we want to use libraries other than tensorflow as well.
# And since this is a small dataset, we don't care about memory usage.
# We skip the use of a dataset iterator.
imdb_reviews_train = []
imdb_reviews_test = []
imdb_y_train = []
imdb_y_test = []
for x, y in imdb["train"].batch(128):
imdb_reviews_train.extend(x.numpy())
imdb_y_train.extend(y.numpy())
for x, y in imdb["test"].batch(128):
imdb_reviews_test.extend(x.numpy())
imdb_y_test.extend(y.numpy())
# TF works on bytes, but some other packages may only work on decoded string.
imdb_reviews_train = [b.decode("utf8") for b in imdb_reviews_train]
imdb_reviews_test = [b.decode("utf8") for b in imdb_reviews_test]
imdb_y_train = np.array(imdb_y_train)
imdb_y_test = np.array(imdb_y_test)
# Take one review.
print(imdb_reviews_train[87])
Any movie that portrays the hard-working responsible husband as the person who has to change because of bored, cheating wife is an obvious result of 8 years of the Clinton era.<br /><br />It's little wonder that this movie was written by a woman.
0
We use the data prepared by tensorflow-datasets here just to save some time. For those who want to process the data in its original format (where each review is in its own .txt file), the files can be downloaded with this piece of code:
imdb_remote_path = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
imdb_fname = os.path.basename(imdb_remote_path)
imdb_local_path = os.path.join(cache_dir, "datasets", imdb_fname)
if not os.path.exists(imdb_local_path):
_ = tf.keras.utils.get_file(fname=imdb_fname, origin=imdb_remote_path,
extract=True, cache_dir=cache_dir)
Let’s build a random forest with TF-IDF as our feature space. We will use the popular scikit-learn library for implementation.4
# We drop words that are too frequent or too rare in the training dataset.
imdb_vectorizer = TfidfVectorizer(lowercase=True, min_df=10, max_df=.9)
imdb_X_train = imdb_vectorizer.fit_transform(imdb_reviews_train)
imdb_X_test = imdb_vectorizer.transform(imdb_reviews_test)
print(len(imdb_vectorizer.vocabulary_)) # Without OOV token.
18518
imdb_rf_model_file = os.path.join(model_dir, "text_rf.joblib")
# Save/reload the model to save notebook rendering time.
if os.path.exists(imdb_rf_model_file):
imdb_rf = joblib.load(imdb_rf_model_file)
else:
imdb_rf = RandomForestClassifier(n_estimators=300, random_state=64, n_jobs=-2)
_ = imdb_rf.fit(imdb_X_train, imdb_y_train)
_ = joblib.dump(imdb_rf, imdb_rf_model_file)
imdb_rf_pred = imdb_rf.predict(imdb_X_test)
imdb_rf_yhat = imdb_rf.predict_proba(imdb_X_test)[:,1]
print(classification_report(imdb_y_test, imdb_rf_pred))
precision recall f1-score support
0 0.84 0.86 0.85 12500
1 0.86 0.84 0.85 12500
accuracy 0.85 25000
macro avg 0.85 0.85 0.85 25000
weighted avg 0.85 0.85 0.85 25000
0.9274221727999999
As a baseline without extensive tuning (we didn’t tune anything indeed!), random forest seems to perform fairly well on this dataset.
As part of the algorithm’s design we are able to derive a global view of feature importance. This is based on how much each feature reduces the impurity over all tree splits. For example, we can plot the top 20 features:
sorted_vocab = sorted(imdb_vectorizer.vocabulary_.items(), key=lambda kv: kv[1])
sorted_vocab = [w for w, i in sorted_vocab]
imdb_rf_feat_imp = pd.Series(imdb_rf.feature_importances_, index=sorted_vocab).sort_values()
ax = imdb_rf_feat_imp.tail(20).plot(kind="barh")
plt.show()
As one can see, common adjectives describing good or bad things generally have a larger impact in the model, which is totally expected. But we also see influential words such as just and minutes, which are quite neutral and contain no useful information on their own. They may be jointly important in the model since a tree model allows interaction between variables. But we won’t be able to go deeper than the unconditional view we derived as a global feature ranking.
Interpretation of the impurity-based ranking must be done very carefully. For example, related features will theoretically have similar impact, but only one of them will gain a higher score (and suppress the other) in the ranking. Which one stands out is essentially random, due to the way tree splitting is performed during training.
In general it is NOT recommended to use impurity or loss-based feature ranking to interpret a tree ensemble model. Such ranking information is still useful for understanding different aspects of the model, and can be used to subset features to counter over-fitting, if any. But it won’t really explain the model at the prediction level: why is my model making this particular prediction? And this is exactly why we need an explanation model in the first place.
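To see the suppression effect on a toy example, one can duplicate a column and compare the two impurity scores. This is a minimal sketch with synthetic data; how the shared importance splits between the two copies will vary with the random seed:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, random_state=64)
X_dup = np.hstack([X, X[:, [0]]])  # Feature 5 is an exact copy of feature 0.
rf = RandomForestClassifier(n_estimators=100, random_state=64).fit(X_dup, y)
print(rf.feature_importances_[[0, 5]])  # The importance of the original feature is now shared/split.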
Now let’s move on to model explanation with LIME. We pick one true positive and one false positive case made by our random forest model to see how the explanation model explains each.
from lime.lime_text import LimeTextExplainer
# We need a pipeline since LimeTextExplainer.explain_instance expects raw text input.
imdb_rf_pipe = make_pipeline(imdb_vectorizer, imdb_rf)
imdb_rf_explainer = LimeTextExplainer(class_names=["Negative", "Positive"], random_state=64)
imdb_rf_tp_idx = np.where(np.logical_and(imdb_rf_pred == 1, imdb_y_test == 1))[0]
imdb_rf_fp_idx = np.where(np.logical_and(imdb_rf_pred == 1, imdb_y_test == 0))[0]
# We take one true positive and one false positive example to demo explanation.
imdb_rf_tp_exp = imdb_rf_explainer.explain_instance(
imdb_reviews_test[imdb_rf_tp_idx[0]], imdb_rf_pipe.predict_proba, num_features=6)
imdb_rf_fp_exp = imdb_rf_explainer.explain_instance(
imdb_reviews_test[imdb_rf_fp_idx[0]], imdb_rf_pipe.predict_proba, num_features=6)
# For ipynb, one can simply call imdb_tp_exp.show_in_notebook(text=True) to embed the html output.
imdb_rf_tp_exp.save_to_file("/tmp/explain_text_rf_tp.html")
imdb_rf_fp_exp.save_to_file("/tmp/explain_text_rf_fp.html")
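Besides saving the rendered HTML, the fitted weights can also be read off the explanation object directly, which is handy for programmatic checks (a small sketch; as_list() is part of lime’s explanation API, and recent versions also expose the local fit’s R² as score):
# Top tokens and their weights for this prediction, as (token, weight) pairs.
print(imdb_rf_tp_exp.as_list())
# R^2 of the local surrogate fit, if available in the installed lime version.
print(imdb_rf_tp_exp.score)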
Our RF model doesn’t seem to be very confident on this particular positive example. There is no single dominant word driving the prediction in the correct direction. The contributing words are also mostly neutral on their own. We can infer that the result of this prediction will be very sensitive and not robust. Admittedly, this review does show a mixture of positive and negative views.
Now let’s look at a false positive example, which our RF model wrongly labeled as a positive review.
In this example a single positive word, great, (wrongly) dominates the prediction toward a positive sentiment. And we realize the model didn’t respond well to some negative signals, especially the word bore.
If we examine more cases we may get more clues about how the model misbehaves, and we can come up with a strategy to improve it accordingly. For now we’ll stop here and experiment with other learning algorithms hereafter.
Let’s use the text data example above to build a LIME model from scratch to better understand every detail of the technique.
from scipy.sparse import csr_matrix
from sklearn.linear_model import Ridge, lars_path
from sklearn.metrics.pairwise import cosine_distances
N = 1000 # Number of local samples.
K = 10 # Number of features to use.
i = imdb_rf_tp_idx[0] # The target example.
x = imdb_X_test[i]
local_samples_x = csr_matrix(np.ones([N, 1])) * x # Container for perturbation.
present_tok_id = x.nonzero()[1] # Present features.
# Generate random design matrix for the explanation model.
# This is under the interpretable (binary) space.
local_samples_z = (np.random.uniform(size=(N, len(present_tok_id))) > .5).astype(int)
# Predict local samples by the original model.
remove_ind = pd.DataFrame(zip(*np.where(local_samples_z == 0)), columns=["rid", "pos"])
for r in range(1, N):
# We keep the first sample as the original target example.
df = remove_ind[remove_ind.rid == r]
if not df.empty:
tok_ids_to_remove = present_tok_id[df.pos]
local_samples_x[r,tok_ids_to_remove] = 0
local_samples_x.eliminate_zeros()
local_samples_y = imdb_rf.predict_proba(local_samples_x)[:,1]
# Calculate proximity weights under the interpretable space.
def pi_x(z):
kernel_width = 25
dist = cosine_distances(z[0].reshape(1, -1), z).ravel()
return np.sqrt(np.exp(-((dist * 100) ** 2) / kernel_width ** 2))
weights = pi_x(local_samples_z)
# Subset top K features with LAR path.
weighted_z = ((local_samples_z - np.average(local_samples_z, axis=0, weights=weights))
* np.sqrt(weights[:, np.newaxis]))
weighted_y = ((local_samples_y - np.average(local_samples_y, weights=weights))
* np.sqrt(weights))
_, _, coefs = lars_path(weighted_z, weighted_y, method="lasso")
nonzero_coefs = range(weighted_z.shape[1])
for i in range(len(coefs.T) - 1, 0, -1):
nonzero_coefs = coefs.T[i].nonzero()[0]
if len(nonzero_coefs) <= 10:
break
# Learn the explanation model.
explainer = Ridge(alpha=1, fit_intercept=True, random_state=64)
_ = explainer.fit(local_samples_z[:,nonzero_coefs], local_samples_y, sample_weight=weights)
# Fitness (R^2).
# This can be a score to judge how good the local approximation is.
print(explainer.score(local_samples_z[:,nonzero_coefs], local_samples_y, sample_weight=weights))
0.8103558697830913
exp = pd.DataFrame({
"tok": np.array(sorted_vocab)[present_tok_id[nonzero_coefs]],
"imp": explainer.coef_
})
print(exp.sort_values("imp", ascending=False))
tok imp
4 love 0.044999
0 also 0.033278
9 very 0.025392
7 silly -0.015901
1 been -0.021245
2 better -0.031299
3 even -0.038800
6 only -0.040141
8 thing -0.062872
5 nothing -0.078848
We use a smaller local sample size in our exercise, but we can already reproduce quite closely the feature contribution scores from lime’s API.
Now let’s try a shallow neural network model with word embeddings trained from scratch. We use the tensorflow.keras API to quickly build and train a neural net. We average the word embeddings as the document embedding for each review, then feed it forward through a ReLU layer before the sigmoid activation for cross-entropy optimization.
As an exercise, instead of re-using the vocabulary built by TfidfVectorizer with scikit-learn, we will re-tokenize the text data with the keras.preprocessing module. The inherent consistency under the Keras framework will also simplify our later work on network layering.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Build vocabulary. We use similar size as in our previous TfidfVectorizer.
# Since we will use zero padding, 0 cannot be used as OOV index.
# Keras tokenizer by default reserves 0 already. OOV token, if used, will be indexed at 1.
# Note that len(tokenizer.index_word) will be all vocabulary instead of `num_words`.
vocab_size = 20001 # +1 for 0 index used for padding.
oov_token = "<unk>"
tokenizer = Tokenizer(lower=True, oov_token=oov_token, num_words=vocab_size)
tokenizer.fit_on_texts(imdb_reviews_train)
# Encode text with padding to ensure fixed-length input.
seq_train = tokenizer.texts_to_sequences(imdb_reviews_train)
seq_train_padded = pad_sequences(seq_train, padding="post")
maxlen = seq_train_padded.shape[1]
seq_test = tokenizer.texts_to_sequences(imdb_reviews_test)
seq_test_padded = pad_sequences(seq_test, padding="post", maxlen=maxlen)
assert tokenizer.index_word[1] == oov_token
assert seq_train_padded.max() == vocab_size - 1
# Wrap Keras Sequential model with scikit-learn API to have the predict_proba method.
# Easier to interact with lime API.
nn_model_file = os.path.join(model_dir, "text_clf_nn.h5")
def nn_model_fn():
embedding_size = 64
model = tf.keras.Sequential([
tf.keras.layers.Embedding(
vocab_size, embedding_size, input_length=maxlen,
mask_zero=True, name="word_embedding"),
tf.keras.layers.GlobalAveragePooling1D(name="doc_embedding"),
tf.keras.layers.Dense(embedding_size / 2, activation="relu", name="relu"),
tf.keras.layers.Dense(1, activation="sigmoid", name="sigmoid")
], name="nn_classifier")
model.compile(optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"])
return model
print(nn_model_fn().summary(line_length=90))
Model: "nn_classifier"
__________________________________________________________________________________________
Layer (type) Output Shape Param #
==========================================================================================
word_embedding (Embedding) (None, 2493, 64) 1280064
__________________________________________________________________________________________
doc_embedding (GlobalAveragePooling1D) (None, 64) 0
__________________________________________________________________________________________
relu (Dense) (None, 32) 2080
__________________________________________________________________________________________
sigmoid (Dense) (None, 1) 33
==========================================================================================
Total params: 1,282,177
Trainable params: 1,282,177
Non-trainable params: 0
__________________________________________________________________________________________
None
imdb_nn = tf.keras.wrappers.scikit_learn.KerasClassifier(nn_model_fn)
if not os.path.exists(nn_model_file):
metrics = imdb_nn.fit(
x=seq_train_padded, y=imdb_y_train,
batch_size=256, epochs=10,
validation_data=(seq_test_padded, imdb_y_test),
validation_steps=20,
callbacks=[
tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
tf.keras.callbacks.ModelCheckpoint(nn_model_file, monitor="val_loss", save_best_only=True)
],
verbose=2)
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
25000/25000 - 12s - loss: 0.6309 - accuracy: 0.7501 - val_loss: 0.1054 - val_accuracy: 0.8287
Epoch 2/10
25000/25000 - 11s - loss: 0.3714 - accuracy: 0.8757 - val_loss: 0.0684 - val_accuracy: 0.8641
Epoch 3/10
25000/25000 - 11s - loss: 0.2440 - accuracy: 0.9118 - val_loss: 0.0600 - val_accuracy: 0.8801
Epoch 4/10
25000/25000 - 11s - loss: 0.1886 - accuracy: 0.9346 - val_loss: 0.0574 - val_accuracy: 0.8871
Epoch 5/10
25000/25000 - 11s - loss: 0.1527 - accuracy: 0.9484 - val_loss: 0.0580 - val_accuracy: 0.8883
Epoch 6/10
25000/25000 - 11s - loss: 0.1233 - accuracy: 0.9612 - val_loss: 0.0604 - val_accuracy: 0.8887
imdb_nn.model = tf.keras.models.load_model(nn_model_file) # Restore the best model with wrapper.
imdb_nn.classes_ = np.array([0, 1])
imdb_nn_yhat = imdb_nn.predict_proba(seq_test_padded)[:,1]
imdb_nn_pred = (imdb_nn_yhat > .5).astype(int)
print(classification_report(imdb_y_test, imdb_nn_pred))
precision recall f1-score support
0 0.88 0.90 0.89 12500
1 0.89 0.88 0.89 12500
accuracy 0.89 25000
macro avg 0.89 0.89 0.89 25000
weighted avg 0.89 0.89 0.89 25000
0.9539228543999999
Based on the testing AUC score, our shallow neural network model did outperform the random forest. Let’s see what the explanation model tells us about the behavior of the neural network model.
def nn_predict_fn(text):
# This is for sklearn wrapper only.
seq = tokenizer.texts_to_sequences(text)
seq = pad_sequences(seq, padding="post", maxlen=maxlen)
return imdb_nn.predict_proba(seq)
imdb_nn_explainer = LimeTextExplainer(class_names=["Negative", "Positive"])
# Explain the same examples as in RF.
imdb_nn_tp_exp = imdb_nn_explainer.explain_instance(
imdb_reviews_test[imdb_rf_tp_idx[0]], nn_predict_fn, num_features=6)
imdb_nn_fp_exp = imdb_nn_explainer.explain_instance(
imdb_reviews_test[imdb_rf_fp_idx[0]], nn_predict_fn, num_features=6)
imdb_nn_tp_exp.save_to_file("/tmp/explain_text_nn_tp.html")
imdb_nn_fp_exp.save_to_file("/tmp/explain_text_nn_fp.html")
The above is the LIME explanation of the same positive example previously explained with the RF model. We realize that, though both models eventually give a positive prediction, the neural network model has a very different opinion on how the positive prediction is formed. Instead of being confused and indecisive, the NN model is actually over-confident about this prediction! Some neutral words have a disproportionate contribution to the positive label, pointing out a potential direction to improve the model. For example, could a bigram tokenizer do better?
How about the second example (which is a negative review)? Our NN model also makes a mistake on this negative review, predicting it as a positive one.
What’s different here is the reaction to the negative word bore, which is not seen in the RF.
Without an explanation model, it wouldn’t be easy for us to compare two models at this level of detail.
One step further, let’s use pre-trained word embeddings for the neural nets and build another explanation model. We will use GloVe (Pennington, Socher, and Manning (2014)). We use just the smaller GloVe model since our dataset is quite small.
# Download GloVe pre-trained embeddings.
# The file is about 800MB so may take some time.
glove6b_remote_path = "http://nlp.stanford.edu/data/glove.6B.zip"
glove6b_local_path = os.path.join(cache_dir, "datasets", "glove.6B.50d.txt")
glove6b_fname = os.path.basename(glove6b_remote_path)
if not os.path.exists(glove6b_local_path):
_ = tf.keras.utils.get_file(fname=glove6b_fname, origin=glove6b_remote_path,
extract=True, cache_dir=cache_dir)
glove_all = pd.read_csv(glove6b_local_path, sep=" ", header=None, index_col=0, quoting=3)
In building the GloVe embedding matrix we need to take special care of the out-of-vocabulary token AND the padding index, since we will be using the Keras API.
# Map vocabulary to pre-trained embeddings.
matched_toks = []
for i, w in tokenizer.index_word.items():
if i < vocab_size:
if w in glove_all.index:
matched_toks.append(w)
else:
matched_toks.append(oov_token)
# Note that GloVe pre-trained embeddings does not include its own OOV token.
# We will use a global average embedding to represent OOV token.
print(len([t for t in matched_toks if t == oov_token])) # How many OOVs?
861
glove_all.loc[oov_token] = glove_all.values.mean(axis=0)
glove = glove_all.loc[matched_toks].values
# Append dummy 0-index vector to support padding.
glove = np.vstack([np.zeros((1, glove.shape[1])), glove])
print(glove.shape)
(20001, 50)
Now let’s build the neural network. Most of the code will be the same as before; only the Embedding layer now uses a constant matrix for initialization. We make the GloVe embeddings trainable so they will further adapt to our specific dataset.
tr_model_file = os.path.join(model_dir, "text_clf_tr.h5")
def tr_model_fn():
embedding_size = glove.shape[1]
model = tf.keras.Sequential([
tf.keras.layers.Embedding(
vocab_size, embedding_size, input_length=maxlen,
embeddings_initializer=tf.keras.initializers.Constant(glove),
trainable=True, mask_zero=True, name="glove_embedding"),
tf.keras.layers.GlobalAveragePooling1D(name="doc_embedding"),
tf.keras.layers.Dense(embedding_size / 2, activation="relu", name="relu"),
tf.keras.layers.Dense(1, activation="sigmoid", name="sigmoid")
], name="tr_classifier")
model.compile(optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"])
return model
print(tr_model_fn().summary(line_length=90))
Model: "tr_classifier"
__________________________________________________________________________________________
Layer (type) Output Shape Param #
==========================================================================================
glove_embedding (Embedding) (None, 2493, 50) 1000050
__________________________________________________________________________________________
doc_embedding (GlobalAveragePooling1D) (None, 50) 0
__________________________________________________________________________________________
relu (Dense) (None, 25) 1275
__________________________________________________________________________________________
sigmoid (Dense) (None, 1) 26
==========================================================================================
Total params: 1,001,351
Trainable params: 1,001,351
Non-trainable params: 0
__________________________________________________________________________________________
None
imdb_tr = tf.keras.wrappers.scikit_learn.KerasClassifier(tr_model_fn)
if not os.path.exists(tr_model_file):
imdb_tr = tf.keras.wrappers.scikit_learn.KerasClassifier(tr_model_fn)
metrics = imdb_tr.fit(
x=seq_train_padded, y=imdb_y_train,
batch_size=256, epochs=20,
validation_data=(seq_test_padded, imdb_y_test),
validation_steps=20,
callbacks=[
tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
tf.keras.callbacks.ModelCheckpoint(tr_model_file, monitor="val_loss", save_best_only=True)
],
verbose=2)
Train on 25000 samples, validate on 25000 samples
Epoch 1/20
25000/25000 - 12s - loss: 0.6605 - accuracy: 0.6317 - val_loss: 0.1278 - val_accuracy: 0.6955
Epoch 2/20
25000/25000 - 12s - loss: 0.5532 - accuracy: 0.7613 - val_loss: 0.1028 - val_accuracy: 0.7812
Epoch 3/20
25000/25000 - 11s - loss: 0.4191 - accuracy: 0.8334 - val_loss: 0.0821 - val_accuracy: 0.8316
Epoch 4/20
25000/25000 - 11s - loss: 0.3285 - accuracy: 0.8724 - val_loss: 0.0718 - val_accuracy: 0.8523
Epoch 5/20
25000/25000 - 12s - loss: 0.2755 - accuracy: 0.8936 - val_loss: 0.0661 - val_accuracy: 0.8680
Epoch 6/20
25000/25000 - 11s - loss: 0.2397 - accuracy: 0.9082 - val_loss: 0.0636 - val_accuracy: 0.8730
Epoch 7/20
25000/25000 - 11s - loss: 0.2124 - accuracy: 0.9203 - val_loss: 0.0609 - val_accuracy: 0.8797
Epoch 8/20
25000/25000 - 12s - loss: 0.1901 - accuracy: 0.9307 - val_loss: 0.0604 - val_accuracy: 0.8799
Epoch 9/20
25000/25000 - 12s - loss: 0.1708 - accuracy: 0.9389 - val_loss: 0.0599 - val_accuracy: 0.8820
Epoch 10/20
25000/25000 - 11s - loss: 0.1544 - accuracy: 0.9462 - val_loss: 0.0601 - val_accuracy: 0.8842
Epoch 11/20
25000/25000 - 11s - loss: 0.1417 - accuracy: 0.9522 - val_loss: 0.0608 - val_accuracy: 0.8861
imdb_tr.model = tf.keras.models.load_model(tr_model_file) # Restore the best model with wrapper.
imdb_tr.classes_ = np.array([0, 1])
imdb_tr_yhat = imdb_tr.predict_proba(seq_test_padded)[:,1]
imdb_tr_pred = (imdb_tr_yhat > .5).astype(int)
print(classification_report(imdb_y_test, imdb_tr_pred))
precision recall f1-score support
0 0.87 0.90 0.88 12500
1 0.89 0.87 0.88 12500
accuracy 0.88 25000
macro avg 0.88 0.88 0.88 25000
weighted avg 0.88 0.88 0.88 25000
0.9515735391999999
Our NN model with transfer learning has a similar AUC score to the vanilla NN. Let’s use explanation modeling to see if there is any actual difference.
def tr_predict_fn(text):
# This is for sklearn wrapper only.
seq = tokenizer.texts_to_sequences(text)
seq = pad_sequences(seq, padding="post", maxlen=maxlen)
return imdb_tr.predict_proba(seq)
imdb_tr_explainer = LimeTextExplainer(class_names=["Negative", "Positive"])
# Explain the same examples as in RF.
imdb_tr_tp_exp = imdb_tr_explainer.explain_instance(
imdb_reviews_test[imdb_rf_tp_idx[0]], tr_predict_fn, num_features=6)
imdb_tr_fp_exp = imdb_tr_explainer.explain_instance(
imdb_reviews_test[imdb_rf_fp_idx[0]], tr_predict_fn, num_features=6)
imdb_tr_tp_exp.save_to_file("/tmp/explain_text_tr_tp.html")
imdb_tr_fp_exp.save_to_file("/tmp/explain_text_tr_fp.html")
For the same positive review, again the model shows over-confidence. Even the dominant words are the same.
For the negative review, interestingly, the transfer learning NN indeed makes a correct prediction of the negative label.5
As a final exercise on text classification, let’s experiment with explanation modeling for a recurrent neural network (RNN). RNNs are known to capture sequential dependencies better than n-gram bag-of-words approaches.
(Note that, even for a single recurrent layer, training an RNN will be prohibitively slow without a GPU.)
rnn_model_file = os.path.join(model_dir, "text_clf_rnn.h5")
def rnn_model_fn():
embedding_size = glove.shape[1]
model = tf.keras.Sequential([
tf.keras.layers.Embedding(
vocab_size, embedding_size, input_length=maxlen,
embeddings_initializer=tf.keras.initializers.Constant(glove),
trainable=True, mask_zero=True, name="glove_embedding"),
tf.keras.layers.GRU(64, dropout=.2, name="GRU"),
tf.keras.layers.Dense(1, activation="sigmoid", name="sigmoid")
], name="rnn_classifier")
model.compile(optimizer="adam",
loss="binary_crossentropy",
metrics=["accuracy"])
return model
print(rnn_model_fn().summary(line_length=90))
Model: "rnn_classifier"
__________________________________________________________________________________________
Layer (type) Output Shape Param #
==========================================================================================
glove_embedding (Embedding) (None, 2493, 50) 1000050
__________________________________________________________________________________________
GRU (GRU) (None, 64) 22272
__________________________________________________________________________________________
sigmoid (Dense) (None, 1) 65
==========================================================================================
Total params: 1,022,387
Trainable params: 1,022,387
Non-trainable params: 0
__________________________________________________________________________________________
None
imdb_rnn = tf.keras.wrappers.scikit_learn.KerasClassifier(rnn_model_fn)
if not os.path.exists(rnn_model_file):
metrics = imdb_rnn.fit(
x=seq_train_padded, y=imdb_y_train,
batch_size=32, epochs=10,
validation_data=(seq_test_padded, imdb_y_test),
validation_steps=20,
callbacks=[
tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
tf.keras.callbacks.ModelCheckpoint(rnn_model_file, monitor="val_loss", save_best_only=True)
],
verbose=2)
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
25000/25000 - 53s - loss: 0.4903 - accuracy: 0.7515 - val_loss: 0.0077 - val_accuracy: 0.8813
Epoch 2/10
25000/25000 - 47s - loss: 0.2628 - accuracy: 0.8935 - val_loss: 0.0059 - val_accuracy: 0.9141
Epoch 3/10
25000/25000 - 47s - loss: 0.1932 - accuracy: 0.9267 - val_loss: 0.0063 - val_accuracy: 0.8906
Epoch 4/10
25000/25000 - 47s - loss: 0.1488 - accuracy: 0.9456 - val_loss: 0.0060 - val_accuracy: 0.9016
imdb_rnn.model = tf.keras.models.load_model(rnn_model_file) # Restore the best model with wrapper.
imdb_rnn.classes_ = np.array([0, 1])
imdb_rnn_yhat = imdb_rnn.predict_proba(seq_test_padded)[:,1] # Inference of the RNN takes time.
imdb_rnn_pred = (imdb_rnn_yhat > .5).astype(int)
print(classification_report(imdb_y_test, imdb_rnn_pred))
precision recall f1-score support
0 0.91 0.89 0.90 12500
1 0.89 0.91 0.90 12500
accuracy 0.90 25000
macro avg 0.90 0.90 0.90 25000
weighted avg 0.90 0.90 0.90 25000
0.9631042208
The RNN with pre-trained GloVe embeddings seems to work very well, even for such a small dataset. Let’s see how the explanation can differ, again, for the same two examples:
def rnn_predict_fn(text):
# This is for sklearn wrapper only.
seq = tokenizer.texts_to_sequences(text)
seq = pad_sequences(seq, padding="post", maxlen=maxlen)
return imdb_rnn.predict_proba(seq)
imdb_rnn_explainer = LimeTextExplainer(class_names=["Negative", "Positive"])
# Explain the same examples as in RF.
imdb_rnn_tp_exp = imdb_rnn_explainer.explain_instance(
imdb_reviews_test[imdb_rf_tp_idx[0]], rnn_predict_fn, num_features=6)
imdb_rnn_fp_exp = imdb_rnn_explainer.explain_instance(
imdb_reviews_test[imdb_rf_fp_idx[0]], rnn_predict_fn, num_features=6)
imdb_rnn_tp_exp.save_to_file("/tmp/explain_text_rnn_tp.html")
imdb_rnn_fp_exp.save_to_file("/tmp/explain_text_rnn_fp.html")
We see the same over-confidence from all the NN models on this particular positive review.
For the negative review, the RNN also correctly predicts the label. This may relate to both models using the pre-trained embeddings.
Lots of data can be represented in tabular format. Here we will use the UCI Heart Disease dataset for the demo. In particular, we use the Cleveland dataset, which is commonly used in machine learning research.6
ucihd_remote_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
ucihd_fname = os.path.basename(ucihd_remote_path)
ucihd_local_path = os.path.join(cache_dir, "datasets", ucihd_fname)
if not os.path.exists(ucihd_local_path):
_ = tf.keras.utils.get_file(fname=ucihd_fname, origin=ucihd_remote_path,
extract=False, cache_dir=cache_dir)
The dataset contains both numerical and categorical features (all already encoded as numerics; please refer to the in-line comments for documentation). It is tiny in both the number of features and the number of examples, but as a demo case it serves the purpose well.
ucihd_attr = [
"age",
"sex", # 0 = female 1 = male
"cp", # chest pain type 1: typical angina 2: atypical angina 3: non-anginal pain 4: asymptomatic
"trestbps", # resting blood pressure (in mm Hg on admission to the hospital)
"chol", # serum cholestoral in mg/dl
"fbs", # (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
"restecg", # resting electrocardiographic results 0: normal 1: having ST-T wave abnormality 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
"thalach", # maximum heart rate achieved
"exang", # exercise induced angina (1 = yes; 0 = no)
"oldpeak", # ST depression induced by exercise relative to rest
"slope", # the slope of the peak exercise ST segment
"ca", # number of major vessels (0-3) colored by flouroscopy
"thal", # 3 = normal; 6 = fixed defect; 7 = reversable defect
"label" # diagnosis of heart disease (angiographic disease status) 0: < 50% diameter narrowing 1-4: > 50% diameter narrowing
]
ucihd = pd.read_csv(ucihd_local_path, header=None, names=ucihd_attr, na_values="?")
categorical_attr = ["sex", "cp", "fbs", "restecg", "exang", "thal"]
for col in categorical_attr:
ucihd[col] = ucihd[col].astype("category")
# Clean label.
ucihd.loc[ucihd["label"] > 1, "label"] = 1
print(ucihd.shape)
(303, 14)
label
0 164
1 139
dtype: int64
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal label
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3 3.0 0.0 6.0 0
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2.0 3.0 3.0 1
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6 2.0 2.0 7.0 1
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5 3.0 0.0 3.0 0
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4 1.0 0.0 3.0 0
Again we try to explain tree ensembles.
# sklearn's implementation of RF doesn't allow missing value.
# For categorical (as string) we can leave one special category for missing,
# but for numerical we need to do some special encoding or imputation.
ucihd_2 = ucihd.copy()
ucihd_2.loc[ucihd_2["ca"].isna(), "ca"] = -1 # Encode missing numerical.
# One-hot encode all categorical features.
ucihd_2 = pd.get_dummies(ucihd_2, columns=categorical_attr, dummy_na=True)
ucihd_y = ucihd_2.pop("label")
ucihd_X_train, ucihd_X_test, ucihd_y_train, ucihd_y_test = train_test_split(
ucihd_2, ucihd_y.values, test_size=.3, random_state=64)
ucihd_rf = RandomForestClassifier(n_estimators=100, random_state=64)
_ = ucihd_rf.fit(ucihd_X_train, ucihd_y_train)
ucihd_rf_yhat = ucihd_rf.predict_proba(ucihd_X_test)[:,1]
ucihd_rf_pred = ucihd_rf.predict(ucihd_X_test)
print(classification_report(ucihd_y_test, ucihd_rf_pred))
precision recall f1-score support
0 0.82 0.84 0.83 50
1 0.80 0.78 0.79 41
accuracy 0.81 91
macro avg 0.81 0.81 0.81 91
weighted avg 0.81 0.81 0.81 91
0.9004878048780488
As one can see RF performs very well on this dataset.
To explain a model trained with numerical features, lime by default will discretize continuous variables into quantiles for ease of interpretation. Discretization is done using statistics derived from the training dataset.
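The binning idea can be sketched with pandas quartiles (lime ships its own discretizer classes; this is only to show what kind of bins the explainer will later report, using the numerical feature age as an example):
import pandas as pd
# Quartile bins of a numerical feature, computed on the training split.
print(pd.qcut(ucihd_X_train["age"], q=4).value_counts().sort_index())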
from lime.lime_tabular import LimeTabularExplainer
cat_ind = [i for i, col in enumerate(ucihd_2.columns) if "_" in col]
ucihd_rf_explainer = LimeTabularExplainer(
ucihd_X_train.values, class_names=["Negative", "Positive"],
feature_names=ucihd_2.columns,
categorical_features=cat_ind)
ucihd_rf_tp_idx = np.where(np.logical_and(ucihd_rf_pred == 1, ucihd_y_test == 1))[0]
ucihd_rf_fp_idx = np.where(np.logical_and(ucihd_rf_pred == 1, ucihd_y_test == 0))[0]
# We take one true positive and one false positive for examples.
ucihd_rf_tp_exp = ucihd_rf_explainer.explain_instance(
ucihd_X_test.iloc[ucihd_rf_tp_idx[0]], ucihd_rf.predict_proba, num_features=4)
ucihd_rf_fp_exp = ucihd_rf_explainer.explain_instance(
ucihd_X_test.iloc[ucihd_rf_fp_idx[0]], ucihd_rf.predict_proba, num_features=4)
ucihd_rf_tp_exp.save_to_file("/tmp/explain_tab_rf_tp.html")
ucihd_rf_fp_exp.save_to_file("/tmp/explain_tab_rf_fp.html")
Following the same idea in our discussion on text classifiers, we choose two examples, one true positive and the other false positive, from the RF predictions to demonstrate explanation modeling.
The explanation suggests several dominant features pushing toward the positive label. For categoricals, each category serves as an individual contribution to the explanation. This is a natural consequence of one-hot encoding in our feature space.
For the false positive case, the model is less confident. There are indeed more features driving the prediction negatively. But one strong positive contribution from the feature ca (number of major vessels colored by fluoroscopy) cancels out the entire negative driving force.
Gradient boosting trees (GBT) are a powerful model family proven to work exceptionally well in many different applications. Yet due to their ensembling nature, GBTs are also hard to interpret in general.
Here we demo lightgbm’s implementation of GBT with LIME explanation.
import lightgbm as lgb
ucihd_tr = lgb.Dataset(ucihd_X_train, label=ucihd_y_train)
ucihd_te = lgb.Dataset(ucihd_X_test, label=ucihd_y_test)
ucihd_lgb_params = {
"learning_rate": .01,
"boosting_type": "gbdt",
"objective": "binary",
"metric": ["binary_logloss", "auc"],
"num_leaves": 8,
"max_depth": 3,
"min_data_per_leaf": 5,
"verbose": -1,
"seed": 64
}
ucihd_bst = lgb.train(
params=ucihd_lgb_params,
num_boost_round=300, early_stopping_rounds=20,
train_set=ucihd_tr, valid_sets=[ucihd_te],
verbose_eval=10)
Training until validation scores don't improve for 20 rounds
[10] valid_0's binary_logloss: 0.655886 valid_0's auc: 0.775854
[20] valid_0's binary_logloss: 0.628993 valid_0's auc: 0.785854
[30] valid_0's binary_logloss: 0.60576 valid_0's auc: 0.81122
[40] valid_0's binary_logloss: 0.585625 valid_0's auc: 0.824146
[50] valid_0's binary_logloss: 0.569745 valid_0's auc: 0.826585
[60] valid_0's binary_logloss: 0.556003 valid_0's auc: 0.826098
[70] valid_0's binary_logloss: 0.542218 valid_0's auc: 0.839024
[80] valid_0's binary_logloss: 0.532141 valid_0's auc: 0.843902
[90] valid_0's binary_logloss: 0.524471 valid_0's auc: 0.84439
[100] valid_0's binary_logloss: 0.516474 valid_0's auc: 0.85122
[110] valid_0's binary_logloss: 0.510493 valid_0's auc: 0.85122
[120] valid_0's binary_logloss: 0.506372 valid_0's auc: 0.852683
[130] valid_0's binary_logloss: 0.498944 valid_0's auc: 0.851951
[140] valid_0's binary_logloss: 0.493064 valid_0's auc: 0.854878
[150] valid_0's binary_logloss: 0.49013 valid_0's auc: 0.856341
[160] valid_0's binary_logloss: 0.487886 valid_0's auc: 0.853415
Early stopping, best iteration is:
[145] valid_0's binary_logloss: 0.491347 valid_0's auc: 0.856341
ucihd_lgb_yhat = ucihd_bst.predict(ucihd_X_test)
ucihd_lgb_pred = (ucihd_lgb_yhat > .5).astype(int)
print(classification_report(ucihd_y_test, ucihd_lgb_pred))
precision recall f1-score support
0 0.80 0.78 0.79 50
1 0.74 0.76 0.75 41
accuracy 0.77 91
macro avg 0.77 0.77 0.77 91
weighted avg 0.77 0.77 0.77 91
0.8563414634146341
On this particular (rather small) dataset RF indeed outperforms GBT. As a matter of fact, based on existing benchmarks a simple logistic regression may have an even higher score for this problem. Nevertheless, let’s move on to our explanation model with LIME:
def ucihd_lgb_predict_fn(x):
# We need to output 2 columns for binary prob prediction.
p = ucihd_bst.predict(x).reshape(-1, 1)
return np.hstack((1 - p, p))
ucihd_lgb_explainer = LimeTabularExplainer(
ucihd_X_train.values, class_names=["Negative", "Positive"],
feature_names=ucihd_2.columns,
categorical_features=cat_ind)
# We take the same examples previously explained in our RF explanation model.
ucihd_lgb_tp_exp = ucihd_lgb_explainer.explain_instance(
ucihd_X_test.iloc[ucihd_rf_tp_idx[0]], ucihd_lgb_predict_fn, num_features=4)
ucihd_lgb_fp_exp = ucihd_lgb_explainer.explain_instance(
ucihd_X_test.iloc[ucihd_rf_fp_idx[0]], ucihd_lgb_predict_fn, num_features=4)
ucihd_lgb_tp_exp.save_to_file("/tmp/explain_tab_lgb_tp.html")
ucihd_lgb_fp_exp.save_to_file("/tmp/explain_tab_lgb_fp.html")
The behavior of GBT looks similar to that of RF on these two examples. In both cases, the variable ca has a dominant impact on the final decision. The two models also share the same confusion on the negative example.
lightgbm
This section is a digression on lightgbm usage.
Since lime’s API requires us to prepare our dataset in a one-hot encoded representation, our lightgbm code uses the same data pipeline as the scikit-learn random forest. But that is actually not optimal for lightgbm. The following code chunk showcases the best practice of encoding categoricals in lightgbm: we don’t encode them at all!
# We leave both missings and categoricals as-is in the dataset.
ucihd_train, ucihd_test = train_test_split(ucihd, test_size=.3, random_state=64)
ucihd_tr = lgb.Dataset(
ucihd_train.drop("label", axis=1), label=ucihd_train["label"],
categorical_feature=categorical_attr,
free_raw_data=False)
ucihd_te = lgb.Dataset(
ucihd_test.drop("label", axis=1), label=ucihd_test["label"],
categorical_feature=categorical_attr,
free_raw_data=False)
ucihd_bst_2 = lgb.train(
params=ucihd_lgb_params,
num_boost_round=300, early_stopping_rounds=20,
train_set=ucihd_tr, valid_sets=[ucihd_te],
verbose_eval=-1)
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[94] valid_0's binary_logloss: 0.519726 valid_0's auc: 0.852683
ucihd_lgb_yhat = ucihd_bst_2.predict(ucihd_test.drop("label", axis=1))
ucihd_lgb_pred = (ucihd_lgb_yhat > .5).astype(int)
print(roc_auc_score(ucihd_test["label"], ucihd_lgb_yhat))
0.8526829268292683
To summarize, there are two very special properties of the lightgbm algorithm. First, lightgbm treats missing values natively as a special tree split point. This allows us to keep the original missings as-is, and in many cases it results in better accuracy than imputation.7 In addition, lightgbm encodes categorical variables internally in a more efficient way, so we don’t even need to do one-hot encoding on our own. Of course in this tiny dataset we won’t see any noticeable difference, but for large applications the performance impact can be huge. In any case, by skipping the one-hot encoding pipeline our code becomes much neater as well.
We use the image dataset from Open Images. For speed it is recommended to download the files from Amazon S3 using awscli. We will only use a subset of all the available images.8
# Install AWS client tool.
pip install awscli
# Data is partitioned by first char of hexidecimal image file name, from 0 to f.
# We download two partitions.
# (This can take considerable time so we didn't make this chunk executable as rendering notebook.)
aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_0.tar.gz .
aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_f.tar.gz .
tar xvzf train_0.tar.gz
tar xvzf train_f.tar.gz
To simplify our discussion, we will build a classifier that only cares about cats and dogs. Hence we will only use a small part of the downloaded images.
# We untar the image data partitions under here.
img_data_dir = os.environ.get("IMG_DATA_DIR", "e:/data/open_images")
# Scan through all downloaded image files.
def img_df(path):
img_files = os.listdir(path)
df = pd.DataFrame({"filename": img_files})
df["id"] = [os.path.splitext(f)[0] for f in df["filename"]]
df["fullpath"] = [os.path.join(path, f) for f in img_files]
return df
img_data = pd.concat([
img_df(os.path.join(img_data_dir, "train_0")),
img_df(os.path.join(img_data_dir, "train_f"))
])
# Get boxable object classes.
# Take only cats and dogs!
img_classes = pd.read_csv(
os.path.join(img_data_dir, "class-descriptions-boxable.csv"),
header=None, names=["id", "class"])
cat_dog_id = img_classes[img_classes["class"].isin(["Cat", "Dog"])]
img_labels = pd.read_csv(
os.path.join(img_data_dir, "train-annotations-human-imagelabels-boxable.csv"))
img_labels = img_labels.merge(cat_dog_id, left_on="LabelName", right_on="id")
img_labels = img_labels[img_labels["Confidence"] == 1] # Remove false labels verified by a human.
img_labels.drop(["Source", "LabelName", "Confidence", "id"], axis=1, inplace=True)
print(img_labels.ImageID.str[:1].value_counts()) # Check distribution of cats and dogs among partitions.
0 3425
1 2385
2 2348
7 2327
9 2276
3 2271
4 2248
5 2245
b 2242
6 2235
d 2232
c 2230
a 2189
8 2168
f 2003
e 1997
Name: ImageID, dtype: int64
img_data = img_labels.merge(img_data, left_on="ImageID", right_on="id")
print(img_data.groupby("class").size()) # Available dogs and cats.
class
Cat 2134
Dog 3294
dtype: int64
Here is a cat from our dataset:
from tensorflow.keras.preprocessing import image
def cat_shot(i, data=img_data, filename=False):
"""Return a cat image in file path or plot."""
outfile = data.loc[data["class"] == "Cat", "fullpath"].iloc[i]
if filename:
return outfile
else:
img = image.load_img(outfile)
# Down-scale to reduce notebook file size.
w, h = img.size
img = img.resize((w // 2, h // 2))
ax = plt.imshow(img)
plt.show()
cat_shot(110)
And more!
Let’s experiment with a MobileNet-V2 (Sandler et al. (2018)) with pre-trained ImageNet weights. The ImageNet network has 1000 classes, including more than just one cat category. So we can predict not only that it is a cat but also its species: a tabby, a Persian, …
from tensorflow.keras.applications import mobilenet_v2
from tensorflow.keras.applications import imagenet_utils
mobilev2 = mobilenet_v2.MobileNetV2(weights="imagenet")
def img_prep(infile, add_batch_dim=True):
img = image.load_img(infile, target_size=(224, 224))
img_a = image.img_to_array(img)
if add_batch_dim:
img_a = np.expand_dims(img_a, axis=0)
m = mobilenet_v2.preprocess_input(img_a)
return m
def img_predict(infile):
i = img_prep(infile)
p = mobilev2.predict(i)
return imagenet_utils.decode_predictions(p)
acat_file = cat_shot(110, filename=True) # This is a tabby.
for e in img_predict(acat_file)[0]:
print(e)
('n02123045', 'tabby', 0.4200025)
('n02123159', 'tiger_cat', 0.22227849)
('n04265275', 'space_heater', 0.011310321)
('n03642806', 'laptop', 0.011113907)
('n02124075', 'Egyptian_cat', 0.009064828)
acat_file = cat_shot(137, filename=True) # This is a Persian.
for e in img_predict(acat_file)[0]:
print(e)
('n02123394', 'Persian_cat', 0.9659566)
('n02085782', 'Japanese_spaniel', 0.0024251523)
('n02127052', 'lynx', 0.0009122683)
('n02086079', 'Pekinese', 0.00045317275)
('n02129165', 'lion', 0.00037420567)
The predictions look good! Now let’s try a LIME explanation:
from lime import lime_image
from skimage.segmentation import mark_boundaries
def show_img_exp(infile, model):
explainer = lime_image.LimeImageExplainer()
img = img_prep(infile, add_batch_dim=False)
exp = explainer.explain_instance(img, model.predict, hide_color=0)
temp_img, mask = exp.get_image_and_mask(exp.top_labels[0], positive_only=False)
plt.imshow(mark_boundaries(temp_img, mask))
plt.show()
show_img_exp(acat_file, mobilev2)
WARNING: Logging before flag parsing goes to stderr.
W1219 17:26:34.537584 34400 image.py:700] Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
The pre-trained ImageNet network has 1000 classes, which is too many for us since we only care about cats and dogs (not even the species). We’d like to use transfer learning to build a cat-or-dog model on top of the original ImageNet network. To accomplish this, first we need to build a training pipeline. Images are very different from textual or tabular data: they are relatively huge in size. So for any image-related task it is better to build an on-disk data pipeline to make sure our training loop doesn’t waste too much memory.
# Train-test split.
img_data_train, img_data_test = train_test_split(img_data, test_size=.1, random_state=64)
img_train_generator = image.ImageDataGenerator(
preprocessing_function=mobilenet_v2.preprocess_input
).flow_from_dataframe(
img_data_train, x_col="fullpath", classes=["Cat", "Dog"],
batch_size=32, target_size=(224, 224))
Found 4885 validated image filenames belonging to 2 classes.
img_test_generator = image.ImageDataGenerator(
preprocessing_function=mobilenet_v2.preprocess_input
).flow_from_dataframe(
img_data_test, x_col="fullpath", classes=["Cat", "Dog"],
batch_size=32, target_size=(224, 224), shuffle=False)
# Test the generator.
Found 543 validated image filenames belonging to 2 classes.
(32, 224, 224, 3)
(32, 2)
With the data generator in place, we can now train a model (preferably with a GPU). We strip the final 1000-unit output layer (used for the softmax in the original architecture), apply one global average pooling layer over all convolution kernels, add two fully-connected layers with a decreasing number of units, and then a final 2-unit layer for a 2-class softmax.
cat_dog_model_file = os.path.join(model_dir, "cat_dog.h5")
mobilev2_base = mobilenet_v2.MobileNetV2(weights="imagenet", include_top=False)
mobilev2_base.trainable = False # Freeze the pre-trained weights.
def cat_dog_model_fn():
model = tf.keras.Sequential([
mobilev2_base,
tf.keras.layers.GlobalAveragePooling2D(name="Pooling"),
tf.keras.layers.Dense(768, activation="relu", name="FC-1"),
tf.keras.layers.Dense(128, activation="relu", name="FC-2"),
tf.keras.layers.Dense(2, activation="softmax", name="Softmax")
], name="cat_or_dog")
model.compile(optimizer="adam",
loss="categorical_crossentropy",
metrics=["accuracy"])
return model
cat_dog_model = cat_dog_model_fn()
print(cat_dog_model.summary(line_length=90))
Model: "cat_or_dog"
__________________________________________________________________________________________
Layer (type) Output Shape Param #
==========================================================================================
mobilenetv2_1.00_224 (Model) (None, None, None, 1280) 2257984
__________________________________________________________________________________________
Pooling (GlobalAveragePooling2D) (None, 1280) 0
__________________________________________________________________________________________
FC-1 (Dense) (None, 768) 983808
__________________________________________________________________________________________
FC-2 (Dense) (None, 128) 98432
__________________________________________________________________________________________
Softmax (Dense) (None, 2) 258
==========================================================================================
Total params: 3,340,482
Trainable params: 1,082,498
Non-trainable params: 2,257,984
__________________________________________________________________________________________
None
if not os.path.exists(cat_dog_model_file):
n_step = img_train_generator.n // img_train_generator.batch_size
metrics = cat_dog_model.fit_generator(
generator=img_train_generator,
steps_per_epoch=n_step, epochs=10,
validation_data=img_test_generator,
callbacks=[
tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2),
tf.keras.callbacks.ModelCheckpoint(cat_dog_model_file, monitor="val_loss", save_best_only=True)
],
verbose=2)
Epoch 1/10
152/152 - 73s - loss: 0.3781 - accuracy: 0.8216 - val_loss: 0.2090 - val_accuracy: 0.9116
Epoch 2/10
152/152 - 64s - loss: 0.2731 - accuracy: 0.8766 - val_loss: 0.2649 - val_accuracy: 0.8785
Epoch 3/10
152/152 - 66s - loss: 0.2548 - accuracy: 0.8889 - val_loss: 0.2279 - val_accuracy: 0.9134
cat_dog_model = tf.keras.models.load_model(cat_dog_model_file) # Restore the best model.
cat_dog_yhat = cat_dog_model.predict(img_test_generator)
cat_dog_pred = (cat_dog_yhat[:,1] > .5).astype(int)
print(classification_report(img_test_generator.labels, cat_dog_pred))
precision recall f1-score support
0 0.85 0.93 0.89 207
1 0.96 0.90 0.93 336
accuracy 0.91 543
macro avg 0.90 0.92 0.91 543
weighted avg 0.92 0.91 0.91 543
0.9762479871175523
Here are some explanations.
{'Cat': 0, 'Dog': 1}
[[0.9952461 0.00475389]]
W1219 17:30:25.099273 34400 image.py:700] Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
[[0.9770295 0.02297052]]
W1219 17:30:35.533436 34400 image.py:700] Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
[[0.9689663 0.03103369]]
W1219 17:30:45.606943 34400 image.py:700] Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
[[0.7678571 0.23214294]]
W1219 17:30:55.827044 34400 image.py:700] Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
The last case is indeed challenging for the model. Here the model failed to capture the cat’s head hiding in-between a white blanket. (The red area decreases the prediction for the top class.) Whatever it is predicting, it is not really capturing the correct signal in the photo. Indeed, our model is very sensitive to this particular image: re-training may give a different prediction.
Surely our model isn’t trained seriously. But we do see that, via model explanation, we get closer to understanding how it infers, provided that our explanation is locally faithful.
Shapley value is a game theory term. In a cooperative game with \(n\) players (denoted as set \(N\) with \(\vert N\vert = n\)), the Shapley value of a player is its contribution to the total payoff of the game, after taking into account all possible coalitions among players.
When each player in a cooperative game may contribute differently to the payoff, Shapley value is a way to determine how important each player is to the final outcome.
We define a function \(\nu(S)\), where \(S\) is a set of \(m\) players (\(\vert S\vert = m\)) and the value of \(\nu\) is the expected payoff obtained by the members of \(S\) acting as a coalition. The Shapley value of an individual member \(j\) is given by the following formula:
\[ \begin{aligned} \varphi_j(\nu) &= \sum _{S\subseteq N\setminus \{j\}} \gamma(S, N) \cdot \bigg[ \nu(S\cup j) - \nu(S) \bigg] \\ &= \sum _{S\subseteq N\setminus \{j\}} \gamma(S, N) \cdot \bigg[ \nu(S_{j+}) - \nu(S_{j-}) \bigg], \end{aligned} \]
where
\[ \gamma(S, N) = \frac {m!(n - m - 1)!}{n!} \]
is the permutation proportional weight, \(N\setminus \{j\}\) is the entire player set except player \(j\), \(S_{j+}\) denotes a coalition including player \(j\), and \(S_{j-}\) denotes the same coalition without player \(j\).
All Paths to a Full Coalition
To understand the equation a bit better: the factorial \(n!\) is the total number of possible ways to grow a coalition from empty to full. For example, with 3 players we have 3 choices for the first member, then 2 choices for the second, and only 1 choice left to complete the coalition of 3, so the total number of paths to a full coalition is \(3! = 3 \times 2 \times 1 = 6\).
Individual Contribution
In each path a player \(j\) joins the coalition at some point. This is where the attribution term \(\nu(S_{j+}) - \nu(S_{j-})\) is calculated for that player. Since there are \(n!\) different paths, player \(j\) has \(n!\) such attribution values.
Weighted by Possibility
While there are \(n!\) different paths from empty to full coalition, in the 3-player example there are only \(2^{n-1} = 4\) distinct pairs of \((S_{j+}, S_{j-})\) for player \(j\), and they are not equally likely to occur. This is why we have the numerator term in \(\gamma(S,N)\), which scales each pair by its frequency among all possible paths; \(m\) is the coalition size before \(j\) joins. For example, when \(m = 0\) (joining from the empty coalition), the frequency of \(j\) being the first player in the coalition is \(m!(n - m - 1)! = 0!(3 - 0 - 1)! = 2\) out of \(n! = 3! = 6\) paths.
So essentially the Shapley value of player \(j\) is the weighted average payoff difference, over all possible team compositions (coalitions), between teams with and without \(j\).
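To make the formula concrete, here is a tiny Python sketch (not part of the original analysis) that enumerates all coalitions and computes exact Shapley values for a hypothetical 3-player game; the payoff table inside nu is made up purely for illustration.
from itertools import combinations
from math import factorial

players = {"a", "b", "c"}

def nu(coalition):
    # Hypothetical payoffs for each coalition (the empty coalition pays 0).
    payoffs = {frozenset(): 0, frozenset("a"): 1, frozenset("b"): 2, frozenset("c"): 2,
               frozenset("ab"): 4, frozenset("ac"): 4, frozenset("bc"): 5, frozenset("abc"): 8}
    return payoffs[frozenset(coalition)]

def shapley(j, players, nu):
    n = len(players)
    others = players - {j}
    value = 0.0
    for m in range(n):  # Coalition size before j joins.
        for S in combinations(others, m):
            weight = factorial(m) * factorial(n - m - 1) / factorial(n)
            value += weight * (nu(set(S) | {j}) - nu(set(S)))
    return value

print({j: round(shapley(j, players, nu), 3) for j in players})
The three values sum to \(\nu(N) - \nu(\emptyset) = 8\), the efficiency property that makes this additive attribution consistent with the full payoff.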
Lipovetsky and Conklin (2001) postulate a machine learning model inference task as such a cooperative game. Each feature value is now a player, and the payoff is the difference between the final predicted value and the average prediction. In this way, the Shapley value of a feature represents its contribution to the prediction. Based on the Shapley values of all features, we are able to explain a blackbox prediction.9
Now the real question is: how can we implement the characteristic function \(\nu(S)\) in order to calculate the payoff difference between any two feature coalitions? There are many different implementations of this; here we briefly discuss some of them.
\(\nu\) in this case is exactly the model prediction function. One obvious way of calculating the term \(\nu(S_{j+}) - \nu(S_{j-})\) for a given feature coalition is hence to train two separate models, one using only the coalition as its feature set (\(S_{j-}\)) and the other using the coalition plus feature \(j\) (\(S_{j+}\)). Though this method yields the exact Shapley value, it is obviously infeasible for large applications.
To bypass the need for re-training, an approximate way to obtain \(\nu(S)\) is to marginalize out the effect of all features not present in the coalition set \(S\). This is done by predicting the same target instance, but with its non-coalition feature values replaced by random draws from the data.
This is better illustrated with an example. Let’s take the first row of the UCI Heart Disease data we explored just before:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal label
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3 3.0 0.0 6.0 0
To make the example neat, assume we are only looking at a model trained on the following four attributes: age
, sex
, cp
, ca
. Our goal now is to explain that model prediction given these 4 feature values. We illustrate the idea using the following plot (assuming age
is the feature to calculate Shapley value):
Each row in the plot represents a possible coalition case. For example, Row 1 indicates there is no coalition at all, while Row 7 indicates a coalition of the other 3 features. In each case, features not included in the coalition are randomized by replacing them with a random draw from the dataset. For each coalition case, if we generate enough such (artificial) instances and average the results, we marginalize out their effects on the model, as if they were not included in the model in the first place.10
For the target feature to calculate Shapley value, we also use the same randomization technique to indicate whether it is in or out of the coalition. Now \(S_{j+}\) can be obtained by averaging model predictions over all randomized samples with feature j
included in the coalition and all the other features replaced by random draws. Similarly for \(S_{j-}\), with only one tweak: feature \(j\) is now also randomized. For each possible coalition (corresponding to each row in the plot) we need to do this marginalization computation. After traversing all possible coalitions, we take the weighted average of all prediction differences, using the weighting function \(\gamma(S, N)\), to arrive at the estimated Shapley value.
Even though this approach eliminates the need to re-train a pair of models for every coalition, the number of possible coalitions still grows exponentially with the total number of features. As a result, it is still not feasible for any large application.
To further reduce the computing time, another approach is to use Monte Carlo samples to approximate the coalition combinations. The task then essentially simplifies to solving:
\[ \hat\varphi_j(\nu) = \frac{1}{R} \sum _{r = 1}^R \bigg[ \nu(S_{j+}^r) - \nu(S_{j-}^r) \bigg], \]
where \(R\) is the number of Monte Carlo samples, \(S_{j+}^r\) is a feature set in which a random subset of features has its values replaced by random draws from the data while feature \(j\) is kept fixed, and \(S_{j-}^r\) is the same except that feature \(j\) is also randomized.
This approximation makes the Shapley value approach feasible even for large applications.
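A minimal sketch of this Monte Carlo estimator, assuming a generic predict function that returns a single score and a background data matrix X_background (both hypothetical names), could look like this; it realizes the random coalitions via random feature orderings.
import numpy as np

def shapley_mc(predict, x, j, X_background, R=1000, seed=0):
    """Monte Carlo estimate of the Shapley value of feature j for a single instance x."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    phi = 0.0
    for _ in range(R):
        z = X_background[rng.integers(len(X_background))]  # A random background row.
        order = rng.permutation(n)                          # A random feature ordering.
        pos = int(np.where(order == j)[0][0])
        x_plus, x_minus = x.copy(), x.copy()
        # Features that would join the coalition after j take the background values...
        x_plus[order[pos + 1:]] = z[order[pos + 1:]]
        # ...and for the "without j" term, feature j itself is also replaced.
        x_minus[order[pos:]] = z[order[pos:]]
        phi += float(predict(x_plus.reshape(1, -1))) - float(predict(x_minus.reshape(1, -1)))
    return phi / R
Sampling a uniformly random ordering of features is equivalent to sampling coalitions with the weight \(\gamma(S, N)\), so the simple average above estimates \(\varphi_j\).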
In theory a Shapley value considers all possible feature interactions and is global for the underlying machine learning model. This is very different from LIME, where the explanation is only local to the original model; consequently Shapley explanations have better theoretical properties than LIME. We can interpret an individual prediction, or we can calculate Shapley values for all examples and aggregate them to profile global feature importance. Again, the original model can be any blackbox algorithm: all we need is access to its prediction interface AND the original training dataset in order to do the sampling approximation.
We will skip the hands-on exercise on Shapley value approach, jumping directly to a more general and also computationally efficient approach: SHAP. But the discussion here will definitely help us understand SHAP since they are closely related to each other.
Lundberg and Lee (2017) propose SHAP (SHapley Additive exPlanations), yet another additive feature attribution method for model explainability. As its name suggests, SHAP is built on top of the Shapley value. It is indeed a more theoretically sound approach, unifying both LIME and the Shapley value (and more). Of course SHAP is also model-agnostic: in theory it can be applied to any machine learning model, but shap
comes with a customized fast implementation particularly for gradient boosting trees (GBT). It supports APIs of well-known open source GBT libraries such as xgboost
, lightgbm
, and catboost
.
shap
also comes with more visualization methods for feature investigation, especially for feature interaction exploration. Before we proceed to the hands-on section, let’s briefly explore the methodology first.
Remember that the additive feature attribution method for an explanation model \(g(\cdot)\) has the following form:
\[ g(z\prime) = \phi_0 + \sum_{j = 1}^n \phi_j z_j\prime. \]
LIME estimates the contribution factor \(\phi_j\) using a local regularized regression model, with a weighting function measuring proximity between the sampled binarized features \(z\prime\) and the target binarized feature \(x\prime\). SHAP, instead, postulates \(\phi_j\) as the Shapley value of feature \(j\). It turns out that if we correctly set up the weighting function in a linear model, we can obtain the Shapley value using a weighted regression with randomized coalition sampling (just as we discussed earlier).
The weighting function proven to recover the Shapley value is:
\[ \pi_{x\prime}(z\prime) = \frac{m - 1} {\binom{m}{\vert\ z\prime\vert}\vert z\prime\vert(m - \vert z\prime\vert)}, \]
where \(m\) is the maximum possible coalition size for the given example (the number of simplified features), and \(\vert z\prime\vert\) is the size of the current coalition.
Now we optimize the loss:
\[ Loss = \sum_{z, z\prime} \pi_{x\prime}(z\prime) \cdot \big(f(z) - g(z\prime)\big)^2. \]
(This objective function is in general the same as in LIME.)
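As a small illustration (our own, not shap’s internal code), the kernel weight above can be computed directly; coalitions of size 0 or \(m\) receive infinite weight, which in practice is handled as equality constraints.
from math import factorial, inf

def n_choose_k(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

def shap_kernel_weight(m, s):
    """SHAP kernel weight for a coalition of size s out of m simplified features."""
    if s == 0 or s == m:
        return inf  # Enforced as constraints (the intercept and the full prediction).
    return (m - 1) / (n_choose_k(m, s) * s * (m - s))

print([round(shap_kernel_weight(4, s), 4) for s in range(1, 4)])  # m = 4: [0.25, 0.125, 0.25]
Nearly empty and nearly full coalitions get the largest weights, reflecting that they carry the most information about individual main effects.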
In shap
, the KernelExplainer
implements the weighted linear model to estimate Shapley values, roughly by the following procedure:
1. Sample coalition vectors \(z\prime \in \{0, 1\}^m\) over the simplified (binary) features.
2. Map each sampled coalition back to the original feature space and obtain model predictions, marginalizing out the absent features over a background dataset.
3. Weight each sample with the kernel \(\pi_{x\prime}(z\prime)\).
4. Fit a weighted linear model of the predictions on the coalition indicators.
5. Return the fitted coefficients \(\phi_j\) as the estimated Shapley values.
Step 2 is to marginalize out the impact of absent features, which we’ve discussed already in the Shapley value section. shap
does this with what it calls a “background dataset.” The size of the background dataset significantly impacts the performance of the explainer, since each coalition sample must be averaged over the model predictions for all background data points.
For completeness, here is a quote from the docstring of KernelExplainer
about the marginalization task:
...Since most models aren't designed to handle arbitrary missing data at test
time, we simulate "missing" by replacing the feature with the values it takes in the
background dataset. So if the background dataset is a simple sample of all zeros, then
we would approximate a feature being missing by setting it to zero. For small problems
this background dataset can be the whole training set, but for larger problems consider
using a single reference value or using the kmeans function to summarize the dataset.
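As a hedged usage sketch (not executed here), one way to follow that advice is to summarize the background set with shap.kmeans before constructing the explainer; model, X_train, and X_test are placeholders.
import shap

# Summarize the training data into 10 weighted centroids to keep the explainer fast.
background = shap.kmeans(X_train, 10)
explainer = shap.KernelExplainer(model.predict_proba, background)
# Explain a handful of rows; more background points and more samples mean slower
# but more stable Shapley value estimates.
shap_values = explainer.shap_values(X_test[:5])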
This approach closely connects LIME with the Shapley value approach. We summarize the major differences between SHAP and LIME below.
Weighting Function
In LIME the choice of the weighting function \(\pi_x\prime\) is heuristic, while in SHAP it is based on theoretical work that recovers the Shapley value, which is in turn grounded in cooperative game theory.
Regularization
Based on the theoretical derivation, SHAP uses a linear regression without regularization in order to faithfully recover the Shapley value.11 LIME instead uses a LASSO path to prefer a sparse solution. The choice is again heuristic, and in lime
it is possible to switch the local model to any other linear regressor.
SHAP as a Specific LIME Implementation
Put differently, if we deliberately choose the weighting function and the local regressor, we can use the LIME approach to estimate exactly the Shapley value.
Another important but subtle difference is in explaining text classifiers. lime
restricts its local sampling to the tokens present in the target example, while shap
doesn’t.
We (purposely) skip the proof that the above weighted linear model recovers the Shapley value. Interested readers can refer to the supplementary material of the original paper. Here instead we discuss the intuition behind the scenes.
If we look at the definition of Shapley value restated here:
\[ \varphi_j(\nu) = \sum _{S\subseteq N\setminus \{j\}} \frac {m!(n - m - 1)!}{n!} \bigg[ \nu(S_{j+}) - \nu(S_{j-}) \bigg], \]
we realize that we are measuring an expected value of differences. To estimate such a difference with respect to a binary variable, we can simply include a dummy variable in the design matrix of a regression problem. The only task left is to specify a weighting scheme such that the expectation takes into account how feature coalitions can be formed. SHAP is novel because it derives the correct weighting scheme such that the regression coefficients are exactly the Shapley values.
Although KernelExplainer
is model-agnostic, it is also based on stricter assumptions in approximating the Shapley value. For certain models, especially tree ensembles and deep neural networks, there are better ways to estimate the Shapley value.
A limitation of the above linear regression approach is the assumption of feature independence. It is a natural consequence of coalition sampling that marginalizes out absent features in order to construct regression samples for estimating the Shapley value.
Lundberg, Erion, and Lee (2018) extend the work in shap
, introducing TreeExplainer
to relax this constraint via a model-specific algorithm for calculating the Shapley value. This is possible due to the nature of the tree algorithm: the marginalization phase can take feature dependency into account by examining the tree structure and averaging only over the terminal nodes reachable given the coalition set. In this way features are no longer treated as independently distributed during marginalization; instead, they follow the distribution induced by the underlying trees.
In TreeExplainer
the exact Shapley value is estimated directly from the tree nodes instead of via a linear model. One added benefit is the extension of Shapley values to Shapley interaction values, so a feature attribution vector can be generalized to a feature attribution matrix capturing the interactions with all other features.
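As a hedged usage sketch of that feature (model and X are placeholders for a fitted tree ensemble and its input data), the interaction values can be requested like this:
import shap

explainer = shap.TreeExplainer(model)
# For a single-output model this is typically an (n_samples, n_features, n_features) array:
# the diagonal holds main effects, off-diagonal entries the pairwise interactions.
interaction_values = explainer.shap_interaction_values(X)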
Other than the specific approach for tree ensembles, in shap
the GradientExplainer
is used to handle deep neural nets (and is hence model-specific), especially image models, or in a broader sense any differentiable model. It connects the Shapley value with integrated gradients (Sundararajan, Taly, and Yan (2017)), another model explanation approach specifically designed for deep neural networks.
To understand SHAP’s expected gradients, we first need a brief overview of integrated gradients. Mathematically, the integrated gradient for input dimension \(j\) is defined as:
\[ IG_j(x) = (x_j - x_j\prime) \int_0^1\frac{\partial f(x\prime + \alpha (x - x\prime))}{\partial x_j} d\alpha, \]
where \(f\) is the model function, \(x\prime\) is the base input, and \(x\) is the instance to be explained. A commonly used base input for an image model is an all-black image; for a language model it is a zero embedding.
The integration is done along the straight line between the base input and the target input. Instead of just taking the derivative of the output with respect to the input at a single point, we take model output derivatives at all points along the path from the base input to the target input, and accumulate them.
Why don’t we just calculate \(\frac{\partial f(x)}{\partial x_j}\)? By definition it should tell us how much change in the output can be attributed to a change in the input. But this does not work well for a deep neural network due to gradient saturation: an important feature may well have a tiny gradient. Interested readers can refer to Sundararajan, Taly, and Yan (2016) for a detailed discussion of gradient saturation.
Hence, instead of calculating the gradient at the input alone, we calculate gradients all along the way from the input being “missing” (represented by a base input) to the input gradually becoming the observed one. We accumulate all the resulting gradients to attribute importance to that particular dimension.
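To make the definition concrete, here is a minimal sketch (our own, not shap’s code) that approximates the integral with a trapezoidal Riemann sum using tf.GradientTape; model, x, baseline, and target_class are placeholders, and x is assumed to be a preprocessed array of shape (H, W, C).
import tensorflow as tf

def integrated_gradients(model, x, baseline, target_class, steps=50):
    """Approximate IG along the straight line from baseline to x."""
    x = tf.cast(x, tf.float32)
    baseline = tf.cast(baseline, tf.float32)
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps + 1), (-1, 1, 1, 1))
    path = baseline[tf.newaxis, ...] + alphas * (x - baseline)[tf.newaxis, ...]
    with tf.GradientTape() as tape:
        tape.watch(path)
        outputs = model(path)[:, target_class]  # Class score at every point on the path.
    grads = tape.gradient(outputs, path)
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)  # Trapezoidal rule.
    return (x - baseline) * avg_grads            # Scale by the input difference.
Feeding all interpolation points through the model in one batch can be memory-heavy, so in practice the path may need to be processed in smaller chunks.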
Though they share similar properties, integrated gradients are not the exact Shapley value. To calculate the Shapley value using integrated gradients for a differentiable model, instead of integrating along the straight line between the base input and the target input, we need to integrate along all possible extremal paths and take the average. There are \(n!\) such paths for \(n\) features in Euclidean space; each path corresponds to a specific way to grow the coalition from the empty set to the full feature set. This is illustrated in the following plot:
For a simple case of only two features, the total number of possible coalition paths is just 2. Taking the origin as the base input and \((1, 1)\) as the target example, for each feature we compute path integrated gradients along the two red paths and take the average to arrive at the Shapley value of that feature.
In the original IG paper the authors mention that the choice between averaging over extremal paths and the straight-line approach is somewhat subjective. Since the former incurs a considerably higher computational cost, they propose the latter.
A base input is clearly required in order to calculate the integrated gradients of a particular instance, and its choice can heavily affect our ability to interpret a model. By definition the values in the base input must represent feature missingness. A black image as the base input implicitly assumes that a black pixel means missing, which may or may not be true depending on the actual application. It is entirely possible that in the original model a black pixel is an important signal for the label; in such a case, using a black image as the base input is a no-go. The same problem applies not just to images but to any application where a base input may wrongly represent missingness to the model.
As a result, how to choose a base input becomes a critical issue for integrated gradients, and there are studies focusing specifically on this area. One possible direction is to choose multiple base inputs and average over them to arrive at the final result; this is referred to as expected gradients. The rationale is the same as in all the methods we already discussed: we marginalize out the effect of a feature value in the model inference task, as if that value were not present in the first place.12
Note that in the Shapley value approach the base input is implicitly assumed. In theory it is the model output for an empty coalition; in practice it is the average predicted value over a background dataset (usually the training dataset).
From shap
’s GradientExplainer
docstring:
... Integrated gradients values are a bit
different from SHAP values, and require a single reference value to integrate from. As an adaptation
to make them approximate SHAP values, expected gradients reformulates the integral as an expectation
and combines that expectation with sampling reference values from the background dataset. This leads
to a single combined expectation of gradients that converges to attributions that sum to the
difference between the expected model output and the current output.
So the idea is to first introduce a background distribution (using the training dataset) as the base input, then take the expectation of the integrated gradients over that distribution to arrive at the final attribution value. (There are two expectations involved: one over the interpolation path and one over the background samples.)
Empirically, for input dimension \(j\) the attribution value we are calculating now becomes:
\[ EG_j(x) = \frac{1}{k} \sum_{i=1}^k (x_j - x_j^{i\prime}) \cdot \frac{\partial f\big(x^{i\prime} + \alpha^i(x - x^{i\prime})\big)}{\partial x_j}, \]
where \(k\) is the number of background samples, \(x^{i\prime}\) is the \(i\)-th background sample, \(x_j^{i\prime}\) is the \(j\)-th dimension of that sample, and \(\alpha^i\) is a random draw from the uniform distribution on \([0, 1]\).
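A rough sketch of this estimator (again our own, not shap’s implementation), sampling one background reference and one interpolation point per draw; model, x, and background are placeholders.
import numpy as np
import tensorflow as tf

def expected_gradients(model, x, background, target_class, k=100, seed=0):
    """Monte Carlo expected gradients for one preprocessed input x of shape (H, W, C)."""
    rng = np.random.default_rng(seed)
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    total = tf.zeros_like(x)
    for _ in range(k):
        ref = tf.convert_to_tensor(background[rng.integers(len(background))], dtype=tf.float32)
        alpha = float(rng.uniform())              # A random point along the straight-line path.
        point = ref + alpha * (x - ref)
        with tf.GradientTape() as tape:
            tape.watch(point)
            score = model(point[tf.newaxis, ...])[0, target_class]
        total += (x - ref) * tape.gradient(score, point)
    return (total / k).numpy()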
Note that this sampling approach implicitly assumes feature independence, which is generally not true. For example, pixels are not independent of their surrounding pixels. So the solution is at best an approximation. Of all the above shap
explainers we’ve discussed, only the TreeExplainer
does not need to assume feature independence.
The method of expected gradients does not average over all possible extremal paths from a single base input. Instead, it averages the integrated gradients along straight lines from a set of base inputs. So strictly speaking expected gradients do not exactly follow the notion of the Shapley value (from SHAP’s point of view), where the attribution must take into account the discrete order in which features join the coalition. If we accept that IG over the straight line approximates averaging IG along all extremal paths, then expected gradients can be viewed as an approximation of the Shapley value, where the base input is no longer a single reference but a distribution.
We should be aware that all the works we’ve discussed are inspired by Shapley values, but there is no strict definition of feature missingness in the first place when we apply the concept to a machine learning model. In SHAP’s philosophy, a missing feature is represented by marginalizing that feature over a background dataset. Janzing, Minorics, and Blöbaum (2019) refer to this situation as the “many Shapley values” for model explanation:
…there are many ways to apply the Shapley value that differ in how they reference the model, the training data, and the explanation context.
Formally speaking, given a model function \(f\), a feature vector \(x\) and a feature subset \(S\) (coalition), SHAP’s philosophy of feature missingness can be expressed as solving this conditional model output:
\[ E\big[f(x) | x_S\big], \]
where \(x_S\) is held fixed and the remaining features are marginalized over their distribution.
Rather than the conditional expectation approach, Janzing, Minorics, and Blöbaum (2019) promote using a base input to represent feature missingness: features marked as out of the coalition are replaced by their corresponding values in the base input vector. This is exactly the philosophy adopted by integrated gradients.
Using a base input to derive feature attributions has a benefit when we’d like to explicitly explain a model in a comparative context. For example: why did the predicted probability increase compared to one month ago for the same entity? The entity may only have changed some feature values over time, and some features, like gender, are most likely unchanged. In such a scenario a conditional expectation approach that integrates over the background dataset may generate artificial attributions that are not relevant to the question asked.
In the end, no single approach is universally perfect. We should carefully choose the explanation model to faithfully reflect the question we’d like to answer about our model.
Let’s skip more technical details and move directly on to the actual hands-on examples.
shap.TreeExplainer
performance is bad for scikit-learn
’s RandomForestClassifier
.13 To fully explore its capabilities we will skip the RF and move on to a GBT model.
In the previous section we didn’t train a GBT for the text classification problem. So let’s quickly build one such model first (with the same TF-IDF vectorization as we did for the RF model).
# lightgbm does not allow utf-8 encoded feature names.
# Since important tokens are most likely ascii-compatible for our dataset,
# we simply strip non-ascii as a workaround for this exercise.
def remove_non_ascii(s):
return "".join([i if ord(i) < 128 else "_" for i in s])
sorted_vocab_ascii = [remove_non_ascii(v) for v in sorted_vocab]
imdb_X_tr = lgb.Dataset(imdb_X_train, label=imdb_y_train, feature_name=sorted_vocab_ascii)
imdb_X_te = lgb.Dataset(imdb_X_test, label=imdb_y_test, feature_name=sorted_vocab_ascii)
imdb_lgb_params = {
"learning_rate": .05,
"boosting_type": "gbdt",
"objective": "binary",
"metric": ["binary_logloss", "auc"],
"num_leaves": 16,
"max_depth": 4,
"min_data_per_leaf": 20,
"verbose": -1
}
imdb_lgb_model_file = os.path.join(model_dir, "text_clf_lgb.txt")
# Save/reload model to save notebook rendering time.
if os.path.exists(imdb_lgb_model_file):
# Parameters are not loaded back? (Which cause the subsequent call to shap_values fail.)
# https://github.com/microsoft/LightGBM/issues/2613
# As a workaround we pass the same parameters to re-construct the model.
imdb_bst = lgb.Booster(model_file=imdb_lgb_model_file, params=imdb_lgb_params)
else:
imdb_bst = lgb.train(
params=imdb_lgb_params,
num_boost_round=1000, early_stopping_rounds=20,
train_set=imdb_X_tr, valid_sets=[imdb_X_te],
verbose_eval=100)
_ = imdb_bst.save_model(imdb_lgb_model_file)
Training until validation scores don't improve for 20 rounds
[100] valid_0's binary_logloss: 0.479194 valid_0's auc: 0.88184
[200] valid_0's binary_logloss: 0.424974 valid_0's auc: 0.906732
[300] valid_0's binary_logloss: 0.394577 valid_0's auc: 0.918715
[400] valid_0's binary_logloss: 0.374584 valid_0's auc: 0.925959
[500] valid_0's binary_logloss: 0.359882 valid_0's auc: 0.930936
[600] valid_0's binary_logloss: 0.348809 valid_0's auc: 0.934481
[700] valid_0's binary_logloss: 0.340231 valid_0's auc: 0.937016
[800] valid_0's binary_logloss: 0.333412 valid_0's auc: 0.938953
[900] valid_0's binary_logloss: 0.327711 valid_0's auc: 0.94053
[1000] valid_0's binary_logloss: 0.323358 valid_0's auc: 0.94162
Did not meet early stopping. Best iteration is:
[1000] valid_0's binary_logloss: 0.323358 valid_0's auc: 0.94162
imdb_lgb_yhat = imdb_bst.predict(imdb_X_test)
imdb_lgb_pred = (imdb_lgb_yhat > .5).astype(int)
print(classification_report(imdb_y_test, imdb_lgb_pred))
precision recall f1-score support
0 0.88 0.85 0.86 12500
1 0.85 0.88 0.87 12500
accuracy 0.86 25000
macro avg 0.86 0.86 0.86 25000
weighted avg 0.86 0.86 0.86 25000
0.9416203136
Based on the testing AUC score we find that GBT performs comparably to the neural network models.
Just like with RF, we have access to the overall feature importance of a GBT model:14
The global feature ranking reveals that some highly ranked features are meaningless on their own, especially the word it
. But as discussed earlier, we shouldn’t over-interpret the ranks without proper explanation modeling.
Since shap.TreeExplainer
is customized for GBT for speed, we can feed in all testing examples to calculate all shap values at once.
import shap
# Sparse matrix is supported by shap for lightgbm models.
imdb_lgb_explainer = shap.TreeExplainer(imdb_bst)
imdb_lgb_shap_values = imdb_lgb_explainer.shap_values(imdb_X_test)
def imdb_lgb_shap_plot(test_id, matplotlib=False):
shap_plt = shap.force_plot(
imdb_lgb_explainer.expected_value[1],
imdb_lgb_shap_values[1][test_id,:],
imdb_X_test[test_id,:].toarray(), # We still need a dense matrix here.
feature_names=sorted_vocab,
matplotlib=matplotlib
)
return shap_plt
One advantage of shap
on GBT models is that it is efficient enough to traverse all the testing examples. Based on all the resulting shap values we can then derive a global feature importance, judged by each feature’s average shap value (contribution). Note that this is different from the impurity-based or split-count-based feature ranking derived from RF/GBT during training: it is an aggregation of all local prediction explanations (contributions) over the testing data.
shap.summary_plot(imdb_lgb_shap_values, imdb_X_test, feature_names=sorted_vocab,
plot_type="bar", max_display=20, show=False, plot_size=.25)
plt.show()
As we can see, the ranking based on shap values for the testing set is in general different from the ranking based on the training split. And it is more interpretable: features with a higher rank literally have a higher average impact on the testing dataset. The ranking can also be conditioned on labels.
The most important application of shap
still lies in instance-level explanation. We stick with the previous two reviews. For the review that RF correctly labels as positive, we have the shap
explanation with the following visualization:
shap.plots.force.save_html("/tmp/shap_explain_lgb_text_rf_tp.html",
imdb_lgb_shap_plot(imdb_rf_tp_idx[0]), full_html=False)
Note that by default shap
for lightgbm
shows log-odds rather than probabilities in the plot, so a positive value indicates a positive prediction and a negative value a negative one.
To verify this:
def to_log_odds(p):
return np.log(p / (1 - p))
def to_p(log_odds):
return np.exp(log_odds)/(1 + np.exp(log_odds))
# Take the first true positive to examine:
p = imdb_bst.predict(imdb_X_test[imdb_rf_tp_idx[0],:].toarray())
print(p)
[0.71683258]
[0.92880398]
The base value shown in the force plot is the mean of the model output over the background dataset. For TreeExplainer
the background dataset is the training dataset. To verify this:
-0.16883212836027323
0.45789194210729806
Note that the expectation is taken over the raw outputs instead of probabilities. For any given prediction, the shap values of all features should sum up to the difference between the predicted log-odds and the expected log-odds. To verify this on the specific positive example:
expected_log_odds = imdb_lgb_explainer.expected_value[1]
predicted_log_odds = to_log_odds(p)
print(predicted_log_odds - expected_log_odds) # The difference.
[1.09763611]
shap_v = pd.DataFrame({
"token": sorted_vocab,
"shap_value": imdb_lgb_shap_values[1][imdb_rf_tp_idx[0],:],
"tfidf": np.squeeze(imdb_X_test[imdb_rf_tp_idx[0]].toarray())
})
shap_v = shap_v.sort_values("shap_value", ascending=False)
print(shap_v) # Shap values of all features for the example.
token shap_value tfidf
8464 incredible 0.469118 0.138549
9893 love 0.358313 0.076663
4420 definitely 0.267002 0.109114
1420 bad 0.200221 0.000000
728 also 0.188378 0.066295
... ... ... ...
8908 it -0.135065 0.031411
1773 better -0.136536 0.075703
7329 great -0.173293 0.000000
14932 silly -0.259718 0.125069
11305 nothing -0.725885 0.084064
[18518 rows x 3 columns]
1.0976361111226303
And here is the SHAP explanation for the RF false positive case:
shap.plots.force.save_html("/tmp/shap_explain_lgb_text_rf_fp.html",
imdb_lgb_shap_plot(imdb_rf_fp_idx[0]), full_html=False)
From both cases we learn, for example, that the absence of bad
(indicated by a 0 tf-idf value) contributes positively to the final prediction. This is different from LIME, where each explanation of a text classifier is conditioned only on the tokens present in the target example. With SHAP, all features are assigned a contribution.
As of 2019-12-19 shap.DeepExplainer
does not yet support TF 2.0.15 Let’s take this chance to use the model-agnostic shap.KernelExplainer
. The compromise is that it runs very slowly, even for just one prediction.
# Explain the two RF examples previously used.
imdb_exp_ind = np.array([imdb_rf_tp_idx[0], imdb_rf_fp_idx[0]])
# We use only a small set of data as the background for feature marginalization.
imdb_nn_shap_explainer = shap.KernelExplainer(imdb_nn.predict_proba, seq_train_padded[:50])
# This can be VERY slow. (~1.5 min for just 2 predictions.)
# Since we have a relatively large input space. (Max len of input sequence.)
imdb_nn_kernel_shap_values = imdb_nn_shap_explainer.shap_values(
seq_test_padded[imdb_exp_ind], silent=True)
# The attribution is assigned on original input sequence.
# To convert it into readable content we need to make them back to token id.
def imdb_nn_shap_plot(test_id, matplotlib=False):
# Strip paddings.
features = seq_test_padded[test_id]
maxlen = (features != 0).sum()
features = features[:maxlen]
feature_names = [tokenizer.index_word.get(i, "<pad>") for i in features]
shap_plt = shap.force_plot(
imdb_nn_shap_explainer.expected_value[1],
imdb_nn_kernel_shap_values[1][test_id,:maxlen],
features=features,
feature_names=feature_names,
matplotlib=matplotlib
)
return shap_plt
shap.plots.force.save_html("/tmp/shap_explain_nn_text_rf_tp.html",
imdb_nn_shap_plot(0), full_html=False)
shap.plots.force.save_html("/tmp/shap_explain_nn_text_rf_fp.html",
imdb_nn_shap_plot(1), full_html=False)
We do the same exercise on the tabular dataset previously explained by lime
.
ucihd_rf_explainer = shap.TreeExplainer(ucihd_rf)
ucihd_rf_shap_values = ucihd_rf_explainer.shap_values(ucihd_X_test)
def ucihd_rf_shap_plot(test_id, matplotlib=False):
shap_plt = shap.force_plot(
ucihd_rf_explainer.expected_value[1],
ucihd_rf_shap_values[1][test_id,:],
ucihd_X_test.iloc[[test_id]],
matplotlib=matplotlib
)
return shap_plt
From the global ranking we can confirm that variable ca
is definitely an influential feature.
We can plot partial dependency based on shap values of two features over the entire testing dataset. For example, by knowing that ca
is important, we’d like to further know how age
can impact the contribution of ca
across different examples.
shap.dependence_plot("age", ucihd_rf_shap_values[1], ucihd_X_test, interaction_index="ca", show=False)
plt.gcf().subplots_adjust(left=.25)
plt.show()
The result suggests, for example, that
ca
has less impact for younger people. Such findings can be examined by domain experts to see whether the model is learning the pattern we expect, or at least one that we can reason about.
Note that for scikit-learn
RF models, by default shap
reports probabilities instead of log-odds. This behavioral difference results from the optimization customized for the GBT model family.
# The true positive case in RF.
shap.plots.force.save_html("/tmp/shap_explain_ucihd_rf_tp.html",
ucihd_rf_shap_plot(ucihd_rf_tp_idx[0]), full_html=False)
# The false positive case in RF.
shap.plots.force.save_html("/tmp/shap_explain_ucihd_rf_fp.html",
ucihd_rf_shap_plot(ucihd_rf_fp_idx[0]), full_html=False)
For GBT we feed in the optimized model, in which categoricals are encoded internally without explicit one-hot encoding.
ucihd_lgb_explainer = shap.TreeExplainer(ucihd_bst_2)
ucihd_lgb_shap_values = ucihd_lgb_explainer.shap_values(ucihd_test.drop("label", axis=1))
def ucihd_lgb_shap_plot(test_id, matplotlib=False):
shap_plt = shap.force_plot(
ucihd_lgb_explainer.expected_value[1],
ucihd_lgb_shap_values[1][test_id,:],
ucihd_test.iloc[[test_id]].drop("label", axis=1),
matplotlib=matplotlib
)
return shap_plt
# The true positive case in RF.
shap.plots.force.save_html("/tmp/shap_explain_lgb_ucihd_rf_tp.html",
ucihd_lgb_shap_plot(ucihd_rf_tp_idx[0]), full_html=False)
# The false positive case in RF.
shap.plots.force.save_html("/tmp/shap_explain_lgb_ucihd_rf_fp.html",
ucihd_lgb_shap_plot(ucihd_rf_fp_idx[0]), full_html=False)
As one may now realize, by explicitly one-hot encoding the categorical features we essentially split them into separate features in their interpretable representation. This can be either good or bad, depending on the actual use case. In this regard, a library such as lightgbm
provides the flexibility to let us choose whether or not to do the one-hot encoding. So the way we want to construct the explanation model may well affect how we implement the original model.
We use expected gradients to explain our image model, focusing on the top 2 predictions among the ImageNet 1000 classes.
# Prepare background distribution of ImageNet embeddings.
# Since we freeze our imagenet embeddings, we can use the built-in summary embeddings.
bg_X, _ = shap.datasets.imagenet50()
# NOTE:
# GradientExplainer does not support Keras models without pre-determined input shape.
eg_explainer = shap.GradientExplainer(
model=mobilev2,
data=mobilenet_v2.preprocess_input(bg_X))
infile = cat_shot(19, data=img_data_test, filename=True)
x = img_prep(infile)
shap_values, top_ind = eg_explainer.shap_values(x, ranked_outputs=2)
# Retrieve the original imagenet labels.
# This is auto downloaded when we init the imagenet embeddings.
with open(os.path.join(cache_dir, "models", "imagenet_class_index.json"), "r") as f:
imagenet_classes = json.load(f)
# Convert predicted label to readable class.
top_cls_ind = np.array([imagenet_classes[str(i)][1] for i in top_ind[0]])
# Plot EG explanation.
shap.image_plot(shap_values, x, top_cls_ind[np.newaxis,:])
W1219 17:34:26.405433 34400 image.py:700] Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Apparently the hiding cat’s head is also ignored by the pre-trained ImageNet model; the blanket plays a bigger role in shaping the prediction. Interestingly, the second most probable prediction is a dog.
Nori et al. (2019) publish the open source package interpret
for a fast implementation of Generalized Additive Models with Pairwise Interactions, or GA2M (Lou et al. (2013)). As of 2019-12-19, interpret
is still in its alpha release with limited documentation. The library contains two groups of modeling frameworks:
glassbox
: explainable machine learning models
blackbox
: machine learning explanation models (such as LIME and SHAP)
We’ve already covered the mainstream approach in the second group, i.e., models that (locally) approximate the original model (assumed to be a blackbox) for better explainability. The more interesting part of interpret
is to bring about another type of model that is readily interpretable from its very origin, and yet still competitively accurate: the Explainable Boosting Machine, or EBM.
EBM is an additive model of the form:
\[ g(E(y)) = \beta_0 + \sum f_j (x_j) + \sum f_{ij}(x_i, x_j), \]
where \(g(\cdot)\) is a link function (the logit link for binary classification, for example), and \(f_j\) is the feature function for the \(j\)-th feature, learned by a gradient boosting algorithm using only that feature at a time, in a round-robin fashion over all features. \(f_{ij}\) is a pairwise interaction feature function that further boosts the accuracy of the model while remaining interpretable.
The model is interpretable since the contribution of any individual feature can be directly quantified by its corresponding feature function \(f_j\). The explanation extends to pairwise interactions if pairwise feature functions are also estimated.
The individual feature functions are also referred to as shape functions in the GAM literature. The name probably comes from the fact that, once the function is learned, we can plot its output value \(f_j(x_j)\) against its input value \(x_j\), effectively visualizing the shape of the feature’s possible contributions over a dataset.
Note that although a GAM is linear due to its additivity, each individual shape function can be (and mostly is) non-linear. Lou, Caruana, and Gehrke (2012) conducted comprehensive experiments on GAMs (without variable interactions) over several datasets, and found that bagged trees as shape functions combined with gradient boosting as the GAM learner achieved the best accuracy among several other choices.
To briefly summarize the gradient boosting training loop, here is the pseudo code:
set total iteration = M
set total feature = N
initialize all f_j with 0
for m in 1 to M:
for j in 1 to N:
calculate residuals with full GAM model
learn f_j against residuals and update the full GAM model
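The pseudocode can be fleshed out into a toy regression version, one shallow tree per feature per round with squared loss; this is only a sketch of the idea, not interpret’s implementation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gam_boost(X, y, n_iter=50, learning_rate=0.1, max_leaf_nodes=4):
    """Cyclic one-feature gradient boosting for an additive regression model."""
    n, p = X.shape
    intercept = float(y.mean())
    shape_trees = [[] for _ in range(p)]  # One list of small trees per feature.
    pred = np.full(n, intercept)
    for _ in range(n_iter):
        for j in range(p):                # Round-robin over features.
            residual = y - pred
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
            tree.fit(X[:, [j]], residual)
            pred += learning_rate * tree.predict(X[:, [j]])
            shape_trees[j].append(tree)
    return intercept, shape_trees

def shape_function(trees_j, xj, learning_rate=0.1):
    """Evaluate the learned shape function f_j on a vector of values of feature j."""
    xj = np.asarray(xj, dtype=float).reshape(-1, 1)
    return learning_rate * sum(t.predict(xj) for t in trees_j)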
It is infeasible to include all pairwise interactions since the number of possible pairs grows quadratically with the total number of features. GA2M proposes an efficient way to rank potentially significant interaction pairs, largely reducing the number of feature functions to learn. The pair ranking (or detection) algorithm is called FAST in the original paper.
Given the current best model, potential interactions are detected on the model residuals, and the residual sum of squares (RSS) is used as the criterion for whether to include an additional interaction. If the RSS does not decrease enough, the additional interaction provides no benefit to the current best model.
In a bit more detail, FAST involves two stages:
In stage one, the process of learning one-dimensional feature functions is also called feature shaping. In stage two, where we extend GAM to GA2M, \(f_{ij}\) is a bivariate tree model with an efficient implementation based on cumulative histograms of the two features involved. Additionally, continuous features are discretized into equi-frequency bins (say, 256 bins) to speed up the tree split search.16
GA2M ranks the importance of each feature (either one-dimensional or an interaction) by the standard deviation of its function values: \(\sqrt{E(f_j^2)}\) (assuming the shape functions are centered). Note that in the reduced form where the feature function itself is linear, \(f_j(x_j) = w_jx_j\), the importance becomes \(\sqrt{E(f_j^2)} = \vert w_j\vert\sqrt{E(x_j^2)}\), which is directly proportional to the weight and equals \(\vert w_j\vert\) for standardized features.
In plain words, the importance score measures how volatile the contribution of a given feature is over a given training set. A feature with higher volatility plays a bigger role in shaping the final model decision, and hence is (globally) more important.
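Continuing the toy sketch above, the importance of feature j would then simply be the root mean square of its shape-function values over the training data (X and shape_trees being the hypothetical fitted objects from that sketch).
# Root-mean-square of each learned shape function over the training data.
importance = [
    np.sqrt(np.mean(shape_function(shape_trees[j], X[:, j]) ** 2))
    for j in range(X.shape[1])
]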
EBM is not efficient for text datasets. Due to the algorithm’s design it takes too long for a bag-of-words model, since there are too many feature functions to estimate. If we fit an EBM on the movie review dataset (definitely not a big one), we encounter an OOM (out-of-memory) issue even without interaction terms. As a result, we will skip the discussion of EBM on a text classifier. (The same restriction applies to image datasets.)
ExplainableBoostingClassifier
has a scikit-learn
-style API and hence is straightforward to use.
from interpret.glassbox import ExplainableBoostingClassifier
ucihd_ebm = ExplainableBoostingClassifier(
n_estimators=16, feature_names=ucihd_2.columns, n_jobs=1)
_ = ucihd_ebm.fit(ucihd_X_train, ucihd_y_train)
ucihd_ebm_yhat = ucihd_ebm.predict_proba(ucihd_X_test)[:,1]
ucihd_ebm_pred = (ucihd_ebm_yhat > .5).astype(int)
print(classification_report(ucihd_y_test, ucihd_ebm_pred))
precision recall f1-score support
0 0.88 0.86 0.87 50
1 0.83 0.85 0.84 41
accuracy 0.86 91
macro avg 0.86 0.86 0.86 91
weighted avg 0.86 0.86 0.86 91
0.9341463414634146
The model performs very well on the heart disease dataset, outperforming both RF and GBT.17
interpret
comes with a rich set of visualization tools (with plotly
as its backend). Model explanation is divided into two groups: global and local.
For global explanation, we have access to both the global feature importance and per-feature contribution statistics.
Name Type # Unique % Non-zero
0 age continuous 40 1.000
1 trestbps continuous 45 1.000
2 chol continuous 129 1.000
3 thalach continuous 83 1.000
4 oldpeak continuous 39 0.679
5 slope continuous 3 1.000
6 ca continuous 5 0.387
7 sex_0.0 categorical 2 0.335
8 sex_1.0 categorical 2 0.665
9 sex_nan categorical 1 0.000
10 cp_1.0 categorical 2 0.080
11 cp_2.0 categorical 2 0.170
12 cp_3.0 categorical 2 0.292
13 cp_4.0 categorical 2 0.458
14 cp_nan categorical 1 0.000
15 fbs_0.0 categorical 2 0.830
16 fbs_1.0 categorical 2 0.170
17 fbs_nan categorical 1 0.000
18 restecg_0.0 categorical 2 0.505
19 restecg_1.0 categorical 2 0.014
20 restecg_2.0 categorical 2 0.481
21 restecg_nan categorical 1 0.000
22 exang_0.0 categorical 2 0.670
23 exang_1.0 categorical 2 0.330
24 exang_nan categorical 1 0.000
25 thal_3.0 categorical 2 0.571
26 thal_6.0 categorical 2 0.071
27 thal_7.0 categorical 2 0.349
28 thal_nan categorical 2 0.009
# Global feature importance.
ucihd_ebm_global.visualize().write_html("/tmp/ucihd_ebm_feat_imp.html", include_plotlyjs=False)
# Global contribution on age.
fid = ucihd_ebm_global.selector.Name.tolist().index("age")
ucihd_ebm_global.visualize(fid).write_html("/tmp/ucihd_ebm_age_imp.html", include_plotlyjs=False)
# Global contribution on trestbps.
fid = ucihd_ebm_global.selector.Name.tolist().index("trestbps")
ucihd_ebm_global.visualize(fid).write_html("/tmp/ucihd_ebm_trestbps_imp.html", include_plotlyjs=False)
# Global contribution on sex.
fid = ucihd_ebm_global.selector.Name.tolist().index("sex_0.0")
ucihd_ebm_global.visualize(fid).write_html("/tmp/ucihd_ebm_sex_imp.html", include_plotlyjs=False)
As we discussed earlier, the global feature importance score is the standard deviation of the feature function output. Features marked as more important are more “active” in shaping the model decision.18
More importantly, we must be able to explain a specific model prediction locally. EBM is inherently able to do exactly that. Using interpret
this can be done easily with a couple of lines:
# Explain the same instances previously on RF.
ucihd_exp_ind = np.array([ucihd_rf_tp_idx[0], ucihd_rf_fp_idx[0]])
# We can feed multiple examples at the same time.
ucihd_ebm_local = ucihd_ebm.explain_local(
ucihd_X_test.iloc[ucihd_exp_ind,:], ucihd_y_test[ucihd_exp_ind])
ucihd_ebm_local.visualize(0).write_html("/tmp/ucihd_ebm_exp_tp.html", include_plotlyjs=False)
ucihd_ebm_local.visualize(1).write_html("/tmp/ucihd_ebm_exp_fp.html", include_plotlyjs=False)
For the false positive case made by both RF and GBT, EBM is able to correctly predict the negative label. We still see a positive ca
value contributing a lot toward a positive prediction, but EBM also picks up several negative factors that jointly negate the positive impact, ending up with a correct negative prediction.
Throughout the exercises above we only demonstrate a limited number of local examples, so nothing conclusive can be said here about which model is more reasonable for each problem in making its decisions. But with more investigation there may be more insight into which model can be trusted more than the others.
We summarize the benefits of explanation modeling here. In general it allows us…
While explanation is good, we also need to be careful about its limitations. Ghorbani, Abid, and Zou (2019), for example, demonstrate adversarial attacks on the interpretation of deep neural networks in an image classification task, showing that the interpretation can be drastically changed by such an attack without changing the model prediction, where the attack is a perturbation not noticeable to human eyes. Such limitations come directly from the complexity of the neural networks themselves, especially their high dimensionality and non-linearity.
To sum up, research on explainable machine learning models is still relatively new and very active. This notebook reviews several influential approaches with hands-on coding examples showing how we can experiment and interact with model explanations.19
Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2015. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.” http://tensorflow.org/.
Ghorbani, Amirata, Abubakar Abid, and James Zou. 2019. “Interpretation of Neural Networks Is Fragile.” In Proceedings of the AAAI Conference on Artificial Intelligence, 33:3681–8.
Janzing, Dominik, Lenon Minorics, and Patrick Blöbaum. 2019. “Feature Relevance Quantification in Explainable AI: A Causality Problem.” arXiv Preprint arXiv:1910.13413.
Lipovetsky, Stan, and Michael Conklin. 2001. “Analysis of Regression in Game Theory Approach.” Applied Stochastic Models in Business and Industry 17 (4): 319–30.
Lou, Yin, Rich Caruana, and Johannes Gehrke. 2012. “Intelligible Models for Classification and Regression.” In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150–58. ACM.
Lou, Yin, Rich Caruana, Johannes Gehrke, and Giles Hooker. 2013. “Accurate Intelligible Models with Pairwise Interactions.” In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 623–31. ACM.
Lundberg, Scott M, Gabriel G Erion, and Su-In Lee. 2018. “Consistent Individualized Feature Attribution for Tree Ensembles.” arXiv Preprint arXiv:1802.03888.
Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 4765–74. Curran Associates, Inc. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
Maas, Andrew L., Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. “Learning Word Vectors for Sentiment Analysis.” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142–50. Portland, Oregon, USA: Association for Computational Linguistics. http://www.aclweb.org/anthology/P11-1015.
Nori, Harsha, Samuel Jenkins, Paul Koch, and Rich Caruana. 2019. “InterpretML: A Unified Framework for Machine Learning Interpretability.” arXiv Preprint arXiv:1909.09223.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “Glove: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44. ACM.
Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. “Mobilenetv2: Inverted Residuals and Linear Bottlenecks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–20.
Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. 2016. “Gradients of Counterfactuals.” arXiv Preprint arXiv:1611.02639.
———. 2017. “Axiomatic Attribution for Deep Networks.” In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3319–28. JMLR.org.
Ushey, Kevin, JJ Allaire, and Yuan Tang. 2019. Reticulate: Interface to ’Python’. https://CRAN.R-project.org/package=reticulate.
Some people further differentiate explainability from interpretability, by characterizing interpretability as knowing how without knowing why, and explainability as knowing not only how but also why. In this notebook, for simplicity, we don’t make this distinction.↩︎
In the R package iml
the default distance function for LIME is Gower’s distance, which is designed to handle distance in data with mixed types.↩︎
Keras also comes with the dataset preprocessed as integer sequences (from tf.keras.datasets import imdb
).↩︎
It will be much faster if we choose xgboost
’s or lightgbm
’s implementation of random forest. However, to demonstrate compatibility of lime
with scikit-learn
we purposely choose the slower implementation here.↩︎
Making a neural network model reproducible is tricky, especially with a GPU. In our experiments we find that sometimes the word bore
becomes the main driving force in lowering the score, but sometimes it does not. Whether this is due to the sensitivity of the original model or of the local explanation model is not clear.↩︎
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.↩︎
xgboost
is the first to introduce such missing-value treatment among the GBT packages. lightgbm
follows.↩︎
For more usages please refer to the repo: open-images-dataset.↩︎
In their original work the Shapley value is used to explain a large-scale regression problem, but the concept easily extends to any machine learning model, not just regression.↩︎
Such a sampling scheme actually assumes independence among features, which is in general not true. More sophisticated sampling approaches can be applied, but they are out of scope for this notebook.↩︎
In shap.KernelExplainer
’s actual implementation, a LASSO with a non-zero regularization term may still be used. Interested readers can refer to the source code of KernelExplainer
.↩︎
For a detailed discussion on integrated gradients here is a great article: VisualizingExpectedGradients.↩︎
It is indeed also buggy: with the newest version we still encounter a sparse input error and an additivity check failure.↩︎
By default lightgbm
calculates the importance by counting how many times a feature contributes to an optimal split during training. It also supports the impurity-based approach with argument importance_type
set to "gain"
.↩︎
This is referred to as the histogram tree approach and is also adopted in LightGBM’s implementation of gradient boosting trees. Experiments suggest that such an approach greatly reduces training time without compromising model accuracy.↩︎
To use the FAST algorithm to detect interactions we need to explicitly pass an integer argument interactions
to the constructor. By default interactions=0
, i.e., no interactions are estimated at all. After several experiments it seems that in this particular case interactions do not help improve the model, so we stick to the default setting.↩︎
The feature shaping plot for an interaction is drawn as a heatmap. We didn’t demo that since interactions didn’t help our model at all.↩︎
Google Cloud just launched their xAI product for model explainability. They implement two approaches: integrated gradients and sampling Shapley values. Refer to their whitepaper for more details.↩︎