1 Motivation

Modern gradient boosting trees (GBT) are undoubtedly among the most powerful machine learning algorithms for traditional supervised learning tasks of the recent decade. Ever since the creation of XGBoost there has been a boom of modern GBT designs in the open source community, aiming not only at model accuracy but also at computational efficiency.

In this notebook we try to unbox two such powerful GBT frameworks: xgboost and lightgbm. We will focus more on the methodology than on their APIs, to deeply understand how these algorithms work and why they are so effective compared to the vanilla version of GBT, which was formally introduced nearly 20 years ago (Friedman 2001).

2 Gradient Boosting Machines

Before we step into the modern design of gradient boosting trees, let’s have a quick look at the traditional (vanilla) gradient boosting machines.

Gradient boosting is a model ensembling technique. It combines multiple models into one in an iterative and additive manner. Each individual model is indeed a weak learner, but jointly they end up providing powerful predictions. Mathematically speaking, the model can be expressed as:

\[ \hat{y} = F_k(x) = \sum_k f_k(x), \]

where \(k\) is number of iterations (and hence number of weak learners), \(f_k\) belongs to some model family and \(x\) is the feature vector.

A weak learner can be a model with fewer features, fewer parameters, or a simpler structure.

2.1 Functional Gradient Descent

Now if we rewrite the above model as:

\[ \begin{aligned} y &= F_k(x) + \epsilon \\ &= F_{k-1}(x) + f_k(x) + \epsilon, \end{aligned} \]

then we have:

\[ f_k(x) = y - F_{k-1}(x) - \epsilon. \]

The equation tells us that the function added at round \(k\) fits the model residual of the previous ensemble (\(F_{k-1}\)). If our loss is squared error, the model residual is indeed the negative gradient of the loss w.r.t. the function at the last step (ignoring the constant term):

\[ - \frac{\partial\frac{1}{2}\big[y - F_{k-1}\big]^2}{\partial F_{k-1}} = (y - F_{k-1}). \]

To generalize the idea, given any loss function \(L(y, F(x))\), to solve for the optimal function \(f_k\) to add at round \(k\) we are actually doing functional gradient descent:

\[ f_k(x) = - \frac{\partial L(y, F_{k-1}(x))}{\partial F_{k-1}(x)}. \]

The negative gradient is also referred to as pseudo-residual. When our loss function is not squared error, we no longer literally fit a model on residuals but on the pseudo-residuals.
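To make the idea concrete, here is a minimal functional gradient descent sketch in R (a toy loop on simulated data, not any library's implementation): at each round we fit a small rpart tree to the pseudo-residuals and add it to the ensemble with a learning rate. Swapping the loss only changes how the pseudo-residuals are computed.

library(rpart)

set.seed(42)
n_toy <- 500
x_toy <- runif(n_toy, -3, 3)
y_toy <- sin(x_toy) + rnorm(n_toy, sd=.3)
d_toy <- data.frame(x=x_toy, y=y_toy)

neg_gradient <- function(y, f, loss=c("squared", "absolute")) {
  # Pseudo-residual: the negative gradient of the loss w.r.t. the current prediction.
  loss <- match.arg(loss)
  if ( loss == "squared" ) y - f else sign(y - f)
}

f_hat <- rep(mean(y_toy), n_toy)  # Start from a constant prediction.
nu <- .1                          # Learning rate (shrinkage).
for ( k in 1:100 ) {
  d_toy$r <- neg_gradient(y_toy, f_hat, loss="squared")
  weak <- rpart(r ~ x, data=d_toy, control=rpart.control(maxdepth=2))
  f_hat <- f_hat + nu * predict(weak, d_toy)  # Additive update.
}
mean((y_toy - f_hat)^2)  # Training MSE after boosting.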

2.2 Boosting v.s. Gradient Boosting

The name gradient boosting can be easily confusing. As we have already seen, gradient boosting is indeed a gradient descent algorithm, but one operating in function space. That is, to optimize the objective function we are not looking for parameters but for functions. In the training phase we iteratively search for a function (a weak learner) to be included in our model.

Boosting is yet another different model ensembling technique, not to be confused with gradient boosting. The general idea of boosting is to also train a model iteratively. In each round the training examples will be re-weighted based on the results (losses) from the last run. Harder examples will gain higher weights so the subsequent training will shift focus onto them. The most popular boosting algorithm is probably AdaBoost, abbreviated from adaptive boosting.

Boosting is similar to gradient boosting in that both are model ensembling with additive weak learners and iterative training. But they are very different in the way they search for the optimal function at each iteration. Boosting fits a learner with examples re-weighted by loss generated from the previous ensemble. Gradient boosting instead fits a learner with pseudo-residuals: the negative functional gradient w.r.t. the loss.
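As a rough illustration of the contrast, here is a toy AdaBoost-style re-weighting loop on simulated data (a sketch of the textbook AdaBoost.M1 scheme with labels in {-1, +1}, not any library's implementation); compare it with the pseudo-residual fitting loop shown in the previous section.

library(rpart)

set.seed(777)
n_ada <- 500
x1_ada <- rnorm(n_ada)
x2_ada <- rnorm(n_ada)
y_ada <- ifelse(x1_ada + x2_ada > 0, 1, -1)
d_ada <- data.frame(x1=x1_ada, x2=x2_ada, y=factor(y_ada))

w <- rep(1 / n_ada, n_ada)  # Uniform example weights to start with.
stumps <- list()
alphas <- numeric()
for ( k in 1:10 ) {
  # A decision stump as the weak learner, trained on the current weights.
  stump <- rpart(y ~ x1 + x2, data=d_ada, weights=w, method="class",
                 control=rpart.control(maxdepth=1))
  pred <- ifelse(predict(stump, d_ada, type="class") == "1", 1, -1)
  miss <- pred != y_ada
  err <- sum(w * miss) / sum(w)
  alpha <- .5 * log((1 - err) / err)
  w <- w * exp(alpha * miss)  # Mis-classified (harder) examples gain higher weights.
  w <- w / sum(w)
  stumps[[k]] <- stump
  alphas[k] <- alpha
}

# Final prediction: sign of the alpha-weighted vote over all stumps.
vote <- Reduce(`+`, lapply(seq_along(stumps), function(k) {
  alphas[k] * ifelse(predict(stumps[[k]], d_ada, type="class") == "1", 1, -1)
}))
mean(sign(vote) == y_ada)  # Training accuracy of the boosted committee.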

2.3 Linear Model as Weak Learner

The choice of weak learner has a huge impact on the effectiveness of gradient boosting machines. Not all weak learners gain equally from such an ensemble. In this section we demonstrate a model ensemble using a weak linear learner. It turns out that gradient boosting with a linear model may do no good at all.

Let’s create some simulated data based on a simple ground-truth (non-linear) data generating process:

library(data.table)

set.seed(666)

# DGP: y = 3 x_1 + 6 x_2 -1.5 x_1x_2 + e
n <- 1000
x1 <- rnorm(n, 2, 4)
x2 <- runif(n)
e <- rnorm(n)
y <- 3 * x1 + 6 * x2 - 1.5 * x1 * x2 + e

# Train-test split.
is_test <- runif(n) < .2
D <- data.table(x1=x1, x2=x2, y=y)
head(D)

We fit a linear model with all features (consider it a “strong” learner):

# Fit a linear model with all features.
f <- lm(y ~ x1 + x2, data=D[!is_test])
D <- D[, yhat:=predict(f, D)]

# MSE as our loss function.
mse <- function(y, yhat) {
  mean((y - yhat)^2)
}

# Compute testing loss.
with(D[is_test], mse(y, yhat))
[1] 3.67646

Now let’s ensemble a weak learner which only takes one feature. We train the model based on gradient boosting framework. Since our loss is squared error, we can simply fit the model residual of the previous ensemble:

f1 <- lm(y ~ x1, data=D[!is_test])
D <- D[, yhat1:=predict(f1, D)]
D <- D[, resid1:=y - yhat1]

with(D[is_test], mse(y, yhat1))
[1] 4.47007
f2 <- lm(resid1 ~ x2, data=D[!is_test])
D <- D[, yhat2:=predict(f2, D)]
D <- D[, resid2:=y - yhat1 - yhat2]

with(D[is_test], mse(y, yhat1 + yhat2))
[1] 3.678013
f3 <- lm(resid2 ~ x1, data=D[!is_test])
D <- D[, yhat3:=predict(f3, D)]
D <- D[, resid3:=y - yhat1 - yhat2 - yhat3]

with(D[is_test], mse(y, yhat1 + yhat2 + yhat3))
[1] 3.676449
f4 <- lm(resid3 ~ x2, data=D[!is_test])
D <- D[, yhat4:=predict(f4, D)]
D <- D[, resid4:=y - yhat1 - yhat2 - yhat3 - yhat4]

with(D[is_test], mse(y, yhat1 + yhat2 + yhat3 + yhat4))
[1] 3.67646

As one may realize, the boosted model won’t outperform a simple linear model with all features included. Is it because of our naive built-from-scratch implementation? Not really. Let’s use xgboost–one of the state-of-the-art gradient boosting frameworks–to train a GBM with a linear learner:

library(xgboost)

Dx <- xgb.DMatrix(as.matrix(D[!is_test, .(x1, x2)]), label=D[!is_test, y])
Dt <- xgb.DMatrix(as.matrix(D[is_test, .(x1, x2)]), label=D[is_test, y])

bst <- xgb.train(params=list(objective="reg:squarederror", booster="gblinear"), 
                 data=Dx, watchlist=list(test=Dt),
                 nround=50, early_stopping_rounds=5, verbose=0)

yhat <- predict(bst, as.matrix(D[is_test, .(x1, x2)]))
mse(D[is_test, y], yhat)
[1] 3.678262

The result is roughly the same.

In general, gradient boosting does not help improve a linear model. This is largely due to the fact that GBM is an additive model. Combining linear models additively just results in yet another linear model. Indeed, using a linear model as the base learner amounts to solving a full linear system (a model with all features) in an iterative and approximate manner. So at best the result is the same as a single linear model with all features.
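To see this concretely with the models we just fitted, we can sum the coefficients of f1 through f4 by feature; the result should approximately recover the coefficients of the single full model f (a quick check, assuming those objects are still in the session).

boosted_coefs <- c(
  coef(f1)["(Intercept)"] + coef(f2)["(Intercept)"] +
    coef(f3)["(Intercept)"] + coef(f4)["(Intercept)"],
  coef(f1)["x1"] + coef(f3)["x1"],
  coef(f2)["x2"] + coef(f4)["x2"]
)
print(boosted_coefs)  # Accumulated coefficients of the boosted linear learners.
print(coef(f))        # Coefficients of the single linear model with all features.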

2.4 Tree Model as Weak Learner

Based on empirical findings, the best weak learner in GBM is a tree. Most of the gradient boosting (or just boosting) implementations use tree models as their base learner.

For our simulated data, gradient boosted trees can easily outperform the linear model:

bst2 <- xgb.train(params=list(objective="reg:squarederror", booster="gbtree"), 
                  data=Dx, watchlist=list(test=Dt),
                  nround=50, early_stopping_rounds=5, verbose=0)

yhat <- predict(bst2, as.matrix(D[is_test, .(x1, x2)]))
mse(D[is_test, y], yhat)  # Smaller loss achieved.
[1] 1.432373

For the rest of the notebook we will focus our discussion on tree-based gradient boosting.

3 Tree Ensembles

Before we talk about gradient boosting trees, let’s have a brief review on tree models in general, from a single tree to tree ensembles.

Tree models are non-differentiable. Gradient-based optimizers won’t directly work on tree models. This makes them very different from other machine learning models such as neural networks.

Tree models are powerful because they are non-linear. Features are allowed to interact with each other in a sequential and loss-guided manner. And they are invariant to feature scale since a tree split is only ordinal. This simplifies data preprocessing and hence the overall pipeline.

3.1 A Single Tree

The idea of a single tree model is rather simple. The model uses a split finding algorithm to enumerate over all the possible splits on all the features. Once a best split is found based on a particular measure, the model moves on to find the next split given the current node, growing the tree deeper and deeper until a certain criterion is met.

Here is a general pseudo code describing a single split algorithm in its exact form:

[Exact Tree Split Algorithm]

m: total number of features
n: total number of examples
s: score to measure the quality of a split
x_ij: the i-th example value on feature j

for feature j in 1 to m:
  sort x_j
  for example x_ij in 1 to n on feature j:
    compute s

do split using j* at x_ij* associated with maximum s

A tree model involves multiple sequential splits to arrive at the terminal node which gives the prediction. Here are several important components that must be taken care of in a detailed implementation:

Feature Iteration

The algorithm is greedy. To search for the optimal split all features are traversed. The immediate consequence is the over-fitting nature of a single tree. And it is such difficulty in generalization that leads to the idea of tree ensembles.

Feature Value Iteration

For each feature, a sorting operation is required to traverse over the (distinct) sorted values of that feature. The more values a feature can take the more computing resources are needed to find the best split.

Quality of Split Measure

This is probably the most important part of a tree algorithm. It directly affects the accuracy of the resulting tree. The quality score measures how well a node splits the examples. In the literature this is often referred to as impurity. A classical impurity measure is the Gini index for classification and the residual sum of squares for regression.

Depth of the Tree

How deep (how many splits) should we grow the tree? This is a hyper-parameter that can affect model generalization. For classification we can grow a tree as deep as until every node contains only one single example. That is, until no split can be further achieved. A tree that is too deep will definitely suffer from over-fitting.

A single tree is usually very bad at generalization. The only benefit is probably its interpretability since the tree structure itself is a set of human friendly decision rules. However due to its poor generalization over unseen data, such interpretability is practically of no use.

3.1.1 A Regression Tree

Continuing with our simulated dataset, let’s quickly build a single regression tree and evaluate its performance on the testing set.

library(rpart)  # Recursive PARTitioning 
library(rpart.plot)

f <- rpart(y ~ x1 + x2, data=D[!is_test], method="anova", model=TRUE)
rpart.plot(f, digits=3)  # For each node the mean value and population are shown.

The testing score is worse than a linear model:

mse(D[is_test, y], predict(f, D[is_test]))
[1] 6.3468

Even though our true model is non-linear, a tree failed to outperform a linear model. Does that make sense? Our tree may be too simple for the task of predicting a continuous value. The tree is only capable of predicting 9 distinct values, which makes it difficult to compete with a linear model on a continuous metric such as squared error.

Indeed it is very easy to build a more complicated (less regulated) tree to outperform the linear model:

f2 <- rpart(y ~ x1 + x2, data=D[!is_test], method="anova", model=TRUE,
            control=rpart.control(minsplit=10, cp=5e-4))
uniqueN(f2$where)  # How many terminal nodes?
[1] 33
mse(D[is_test, y], predict(f2, D[is_test]))
[1] 2.996025

Putting aside regularization and model complexity, a common quality-of-split measure for a regression tree is the sum of squares \(SS = \sum (y_i -\bar{y})^2\). So for a given node we find a split such that the decrease in sum of squares \(SS_T - (SS_L + SS_R)\) is the largest. \(SS_T\) is the \(SS\) of the current node, \(SS_R\) is the \(SS\) of the right-hand node resulting from the split, and \(SS_L\) the \(SS\) of the left-hand node. Note that the prediction of a regression tree node is its mean value. Hence \(SS\) is the squared error of that node.
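As a small illustration of this measure (an illustrative helper, not rpart’s internal routine), we can compute the decrease in sum of squares for a few candidate split values on x1 in our training data:

ss <- function(y) sum((y - mean(y))^2)

ss_decrease <- function(dt, feature, value) {
  # Decrease in sum of squares when splitting the node at `value` on `feature`.
  left <- dt[get(feature) <= value, y]
  right <- dt[get(feature) > value, y]
  ss(dt$y) - (ss(left) + ss(right))
}

# Decrease in SS for a few candidate split values on x1 over the training set.
sapply(c(-2, 0, 2, 4), function(v) ss_decrease(D[!is_test], "x1", v))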

3.1.2 A Classification Tree

We use the old-school UCI Iris dataset to quickly demo a single classification tree.

confusion_matrix <- function(y_true, y_pred) {
  table(y_true, y_pred)
}

# Split IRIS dataset into train-test.
is_test_iris <- runif(nrow(iris)) < .3
iris_train <- iris[!is_test_iris,]
iris_test <- iris[is_test_iris,]

# Grow a single classification tree.
f <- rpart(Species ~ ., data=iris_train, method="class")
rpart.plot(f)

# Confusion matrix for the training set.
print(confusion_matrix(iris_train$Species, predict(f, iris_train, type="class")))
            y_pred
y_true       setosa versicolor virginica
  setosa         39          0         0
  versicolor      0         34         2
  virginica       0          1        33
# Confusion matrix for the testing set.
print(confusion_matrix(iris_test$Species, predict(f, iris_test, type="class")))
            y_pred
y_true       setosa versicolor virginica
  setosa         11          0         0
  versicolor      0         12         2
  virginica       0          2        14

A single classification tree can perform well on both the training and testing sets for Iris, since the dataset is a bit too trivial. In any real-world application, a single tree is usually not enough to solve the problem. And this is where we need more trees.

3.2 From One Tree to Many Trees

To overcome the shortfall of a single tree, ensembling techniques are used to combine multiple trees into one single model. There are two very successful ensembling techniques for tree models: bagging and boosting. The general idea is to combine many simple (hence weak) trees into a tree committee for the final prediction. It turns out that the variability in many simple/weak and especially non-linear learners can jointly outperform a single (and usually over-fitting) complicated/strong learner.

3.2.1 Bagging Ensembles

The most well-known bagging tree model is the Random Forest (RF hereafter). In each single tree of an RF the split algorithm is modified to only iterate over a subsample of all features. This reduces the ability of any single tree to over-fit the training data and creates diversity among different trees. RF also uses bootstrap sampling on the training dataset: rather than searching for a split on the full training set, each tree only searches over a set of bootstrapped samples. Some model variants even perform only a small number of random splits (instead of finding the best split with a full scan) in the feature value iteration phase.

Although in theory RF is claimed to be strong against over-fitting thanks to both feature (column) subsampling and bootstrap (row) sampling, in reality RF can still easily over-fit without proper tuning.

One obvious advantage of RF, in terms of computational efficiency, is that it is embarrassingly parallel. Most modern implementations of RF support single-machine CPU parallelism or even distributed computing.

3.2.2 Boosting Ensembles

Another ensembling technique which has been empirically proven to be even more powerful is boosting. We have already discussed the difference between boosting and gradient boosting. Here we specifically refer to gradient boosting since it usually gives better results and it is our main topic in this notebook.

Unlike RF which trains multiple trees independently and aggregates the results, gradient boosting trees (GBT hereafter) adopt an additive training procedure, so each tree is iteratively added into the model.

Similar to RF or any other tree ensemble, each individual tree is a very weak learner. But jointly they become very good at generalization. Boosting trees have indeed achieved quite a few state-of-the-art results in real-world applications.

Even though GBT was introduced in the literature almost 20 years ago, it is only in the recent 5 years that several very powerful open source implementations have emerged to really shape practitioners’ choice of toolbox. In the following sections we’d like to deep-dive into two of them: xgboost and lightgbm. We will focus especially on their novelties in pushing GBT to the extreme.

4 XGBoost

Chen and Guestrin (2016) open-sourced the xgboost package, probably one of the most influential GBT frameworks to date. It also comes with the widest variety of language API support: Python, R, Julia, Java, …, etc. It is also more or less integrated with distributed computing platforms such as Spark, and cloud solutions such as GCP and AWS.

print(packageVersion("xgboost"))
[1] '0.90.0.2'

4.1 Regularization on Impurity

The name xgboost, though, actually refers to the engineering goal of pushing the limit of computing resources for boosted tree algorithms, which is the reason why many people use xgboost. As a model, it might be more suitably called regularized gradient boosting.1

One important and novel feature of XGBoost is the introduction of regularization into the impurity measure. Before this, most tree ensembles focused on traditional ways of regularization, such as pruning, subsampling, or limiting the number of leaves.

In xgboost the formulation of the objective function makes its tree ensembles extremely flexible. Losses as well as evaluation metrics can be plugged and played. In general, the model has the following cost function:

\[ \text{Cost} = \sum_{i=1}^n L(y_i, \hat{y_i}) + \frac{1}{2}\lambda\vert\vert W \vert\vert^2 + \alpha\vert\vert W\vert\vert, \]

where \(L(y_i, \hat{y_i})\) is a per-instance loss, \(W\) is the vector of all tree leaf weights, and \(\lambda\) and \(\alpha\) are the L2 and L1 regularization terms, respectively.

(By default xgboost has \(\lambda = 1\) and \(\alpha = 0\).)

The cost function indeed looks no different than any other differentiable machine learning model we already know. What makes it novel is that xgboost uses the cost as the impurity measure to implement the split finding algorithm in a tree, which is not differentiable. But how?

4.2 Additive Training

For convenience let’s re-state the gradient boosting model function here, with a bit more detailed notations:

\[ \begin{aligned} \hat{y}^{(k)} = F_k(x) &= \sum_k f_k(x) \\ &= \sum_{t=1}^{k-1}f_t(x) + f_k(x), \end{aligned} \]

where \(\hat{y}^{(k)}\) is the model prediction for model at round \(k\).

Define the loss function (without regularization term for now for simplicity) at round \(k\) for \(i = 1, ..., n\) examples to be:

\[ \begin{aligned} \text{Loss} &= \sum_{i=1}^n L(y_i, \hat{y_i}^{(k)}) \\ &= \sum_{i=1}^n L(y_i, \hat{y_i}^{(k-1)} + f_k(x_i)). \end{aligned} \]

Here \(\hat{y_i}^{(k-1)}\) is purely determined by the model already trained in the previous rounds; we only care about picking a tree model \(f_k(x_i)\) which can further reduce the loss the most.

When the loss is a squared error, the function is somewhat tractable:

\[ \begin{aligned} \text{MSE-Loss} &= \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y_i}^{(k)})^2 \\ &= \frac{1}{n}\sum_{i=1}^n \bigg[ \underbrace{(y_i - \hat{y_i}^{(k-1)})^2}_\text{Constant} - 2(y_i - \hat{y_i}^{(k-1)})f_k(x_i) + f_k(x_i)^2\bigg] \\ &= \frac{1}{n}\sum_{i=1}^n \bigg[ - 2\underbrace{(y_i - \hat{y_i}^{(k-1)})}_\text{Residual at k-1}f_k(x_i) + f_k(x_i)^2\bigg] + \text{Constant}. \end{aligned} \]

But when the loss is, say, a cross-entropy loss (which is fairly common), the function becomes much harder to deal with.

4.3 Second-Order Loss Approximation

In order to allow a general plug-and-play framework for different loss functions, instead of solving the exact functional form of a particular loss, xgboost approximates the loss with a second-order Taylor series expansion.

The 2nd-order expansion of a function \(L(x)\) at point \(x_0\) can be written as:

\[ L(x) \approx L(x_0) + L'(x_0)(x - x_0) + \frac{1}{2}L''(x_0)(x - x_0)^2. \]

Now define:

\[ x - x_0 = a, \]

we then have:

\[ L(x_0 + a) \approx L(x_0) + L'(x_0)a + \frac{1}{2}L''(x_0)a^2, \]

or simply:

\[ L(x + a) \approx L(x) + L'(x)a + \frac{1}{2}L''(x)a^2. \]
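As a quick numeric sanity check of this expansion (pure base R, with the binary cross-entropy loss for a positive example standing in for \(L\)):

loss_fn <- function(q) -log(q)   # Cross-entropy loss for a positive example (y = 1).
grad_fn <- function(q) -1 / q    # First-order derivative.
hess_fn <- function(q) 1 / q^2   # Second-order derivative.

q0 <- .4; a <- .05
exact <- loss_fn(q0 + a)
approx <- loss_fn(q0) + grad_fn(q0) * a + .5 * hess_fn(q0) * a^2
c(exact=exact, approx=approx)  # The two values should be very close.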

Using this form we can re-write our loss function:

\[ \begin{aligned} \text{Loss} &= \sum_{i=1}^n L(y_i, \hat{y_i}^{(k)}) \\ &= \sum_{i=1}^n L(y_i, \hat{y_i}^{(k-1)} + f_k(x_i)), \\ \text{Approx. Loss} &= \sum_{i=1}^n \Bigg[ \underbrace{ \vphantom{\frac{\partial L(y_i, \hat{y_i}^{(k-1)})}{\partial \hat{y_i}^{(k-1)}}} L(y_i, \hat{y_i}^{(k-1)})}_\text{Constant} + \underbrace{\frac{\partial L(y_i, \hat{y_i}^{(k-1)})}{\partial \hat{y_i}^{(k-1)}}}_{g_i}f_k(x_i) + \frac{1}{2}\underbrace{\frac{\partial^2 L(y_i, \hat{y_i}^{(k-1)})}{\partial^2 \hat{y_i}^{(k-1)}}}_{h_i}f_k(x_i)^2 \Bigg] \\ &= \sum_{i=1}^n \Bigg[ g_if_k(x_i) + \frac{1}{2}h_if_k(x_i)^2 \Bigg] + \text{Constant}. \end{aligned} \]

We follow the author’s notation on the first order term as \(g_i\) (g for gradient) and the second order term as \(h_i\) (h for Hessian). Together with the regularization term it becomes the quality of split measure which ultimately guides the model to pick up the next learner \(f_k\) at round \(k\).

Essentially XGBoost employs a functional gradient descent up to the second-order derivative. This makes it more precise in finding the next best learner.

Note that the values of \(g_i\) and \(h_i\) both remain the same within the current round, meaning that the computing cost is largely reduced when we are evaluating different tree splits. This is one of the reasons why XGBoost is so much faster than any vanilla implementation of GBT.

4.4 Tree Split Optimization

Now let’s formally parameterize a tree model \(f_k(x_i)\) with \(T\) terminal nodes (leaves), \(w_j\) denotes the weight for \(j\)-th leaf, and \(I_j\) the set of examples assigned to leaf \(j\). The approximated loss (with regularization term this time) at round \(k\) can then be re-written in a group-by-leaf summation:

\[ \begin{aligned} Cost^k &= \sum_{i=1}^n \Bigg[ g_if_k(x_i) + \frac{1}{2}h_if_k(x_i)^2 \Bigg] + \frac{1}{2}\lambda\vert\vert W \vert\vert^2 + \alpha\vert\vert W\vert\vert \\ &= \sum_{j=1}^T \Bigg[ \sum_{i \in I_j} g_iw_j + \frac{1}{2}\sum_{i \in I_j}h_iw_j^2 \Bigg] + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 + \alpha\sum_{j=1}^T \vert w_j\vert \\ &= \sum_{j=1}^T \Bigg[ \sum_{i \in I_j} g_iw_j + \frac{1}{2}\big(\sum_{i \in I_j}h_i + \lambda\big)w_j^2 + \alpha\vert w_j\vert \Bigg]. \end{aligned} \]

Given the cost function we can solve for optimal weight \(w_j^*\) minimizing the cost with a first-order-condition:

\[ \frac{\partial Cost^k}{\partial w_j} = 0, \]

which gives:

\[ \sum_{i \in I_j} g_i + (\sum_{i \in I_j}h_i + \lambda)w_j + \alpha\frac{w_j}{\vert w_j\vert} = 0. \]

For a null L1 penalty (\(\alpha = 0\)) the solution is simply:

\[ \begin{aligned} w_j^* &= - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j}h_i + \lambda}, \\ Cost^{k*} &= - \frac{1}{2} \sum_{j=1}^T \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j}h_i + \lambda}. \end{aligned} \]
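These two formulas translate directly into a couple of small helper functions (a sketch following the formulas above, for the \(\alpha = 0\) case and with \(\lambda\) defaulting to xgboost’s 1):

leaf_weight <- function(g, h, lambda=1) {
  # Optimal leaf weight w_j* given the per-example gradients/hessians of the leaf.
  - sum(g) / (sum(h) + lambda)
}
leaf_cost <- function(g, h, lambda=1) {
  # The leaf's contribution to the optimal cost.
  - .5 * sum(g)^2 / (sum(h) + lambda)
}

# Example: a leaf holding 3 examples under squared loss (g = yhat - y, h = 1).
g_leaf <- c(-1.2, .4, .8)
h_leaf <- c(1, 1, 1)
leaf_weight(g_leaf, h_leaf)
leaf_cost(g_leaf, h_leaf)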

Now we have solved the weights that minimize the cost once \(T\) is determined. We will also need the optimized cost to determine the best split. That is, the quality of split measure. The best split should result in two nodes such that the total cost of two nodes is reduced the most from that of their parent node.

A learning rate can be applied to the optimal weights at each iteration, making the ensembler more conservative in updating itself.

Here is a pseudo code describing the core algorithm:

[XGBoost Additive Training Algorithm]

F: initial model
M: total number of features
N: total number of examples
K: total number of iterations (tree models)
T: total number of leaves
g_ik: g_i for example i at round k
h_ik: h_i for example i at round k
g_ijk: g_i for example i at round k assigned to leaf j
h_ijk: h_i for example i at round k assigned to leaf j
cost: the score at leaf j, - sum(g_ijk)^2 / (2 * (sum(h_ijk) + lambda))
x_im: the i-th example value on feature m

for round k in 1 to K:
  compute g_ik, h_ik for all i
  while number of leaves < T:
    for feature m in 1 to M:
      sort x_m
      for example i in 1 to N on feature m:
        split by x_im into left and right leaves
        compute sum of cost on left and right leaves
    find best split using feature m* at x_im* associated with minimum cost
  update model F <- F + tree learned

4.4.1 Common Loss Functions

xgboost already comes with several built-in loss functions. We can also define a custom loss easily by just implementing the first-order (\(g\)) and second-order (\(h\)) gradients of that loss w.r.t. the model output. Here we simply review the two most commonly used loss functions.

Squared Loss

Deriving \(g\) and \(h\) for a squared loss is trivial:

\[ \begin{aligned} g_i &= \frac{\partial(y_i - \hat{y}_i^{(k-1)})^2}{\partial\hat{y}_i^{(k-1)}} = 2(\hat{y}_i^{(k-1)} - y_i), \\ h_i &= \frac{\partial^2(y_i - \hat{y}^{(k-1)})^2}{\partial^2\hat{y}^{(k-1)}} = 2. \end{aligned} \]

We can ignore the constant factor, so the gradient \(g_i\) is simply \(\hat{y} - y\) and the hessian is unity.

Let’s use R’s symbolic derivatives to verify our derivation:

dy2y <- deriv3(~ (y - yhat)^2, c("yhat"))

yhat <- rnorm(5)
y <- rnorm(5)

2*(yhat - y)  # Gradient by hands.
[1]  3.187627  4.822539  4.377736 -4.599785  2.648223
eval(dy2y)  # By symbolic derivatives.
[1] 2.540241 5.814221 4.791143 5.289506 1.753271
attr(,"gradient")
          yhat
[1,]  3.187627
[2,]  4.822539
[3,]  4.377736
[4,] -4.599785
[5,]  2.648223
attr(,"hessian")
, , yhat

     yhat
[1,]    2
[2,]    2
[3,]    2
[4,]    2
[5,]    2

Cross-Entropy Loss

A subtle detail must be taken care of for the cross-entropy loss. Intuitively we’d like to just take the derivative w.r.t. the model output probability:

\[ \begin{aligned} g_i &= \frac{\partial\big[- y_i\ln\hat{y}_i^{(k-1)} - (1 - y_i)\ln(1 - \hat{y}_i^{(k-1)})\big]} {\partial\hat{y}_i^{(k-1)}} = \frac{\hat{y}_i^{(k-1)} - y_i}{(1 - \hat{y}_i^{(k-1)})\hat{y}_i^{(k-1)}}, \\ h_i &= \frac{\partial^2\big[- y_i\ln\hat{y}_i^{(k-1)} - (1 - y_i)\ln(1 - \hat{y}_i^{(k-1)})\big]} {\partial^2\hat{y}_i^{(k-1)}} = \frac{1 - y_i}{(1 - \hat{y}_i^{(k-1)})^2} + \frac{y_i}{(\hat{y}_i^{(k-1)})^2}. \end{aligned} \]

Again use symbolic differentiation to verify our derivation:

# Demonstrate symbolic derivatives.
dq2q <- deriv3(~ - (y*log(q) + (1-y)*log(1 - q)), c("q"))
y <- sample(c(0, 1), 5, replace=TRUE)
q <- runif(5)

(1 - y)/(1 - q) - y / q  # Gradient by hands.
[1]  3.092679 -1.506702  1.476308 -1.771791  1.126489
(1 - y)/(1-q)^2 + y / q^2  # Hessian by hands.
[1] 9.564663 2.270150 2.179485 3.139244 1.268978
eval(dq2q)  # By symbolic derivatives.
[1] 1.1290377 0.4099229 0.3895444 0.5719910 0.1191058
attr(,"gradient")
             q
[1,]  3.092679
[2,] -1.506702
[3,]  1.476308
[4,] -1.771791
[5,]  1.126489
attr(,"hessian")
, , q

            q
[1,] 9.564663
[2,] 2.270150
[3,] 2.179485
[4,] 3.139244
[5,] 1.268978

The above derivation assumes \(\hat{y}\) is the sigmoid model output (probability). However, in xgboost’s implementation of logistic regression the term \(\hat{y}\) is taken before the sigmoid transformation (what the author refers to as the margin). So the model weights are raw scores (logits) instead of probabilities. This makes sense since the model is additive: it is better to do the add operation before the final sigmoid transformation of the ensembler. That is, each tree predicts raw scores and the final model ensemble is the sigmoid of the sum of all the raw scores (from the assigned leaves). This also largely simplifies the derivatives because now what we will be doing is:

\[ g_i = \frac{\partial L}{\partial t}\frac{\partial t}{\partial\hat{y}}, \]

where \(t\) is the sigmoid \(t(\hat{y}) = \frac{1}{1 + \exp(-\hat{y})}\).

Based on the fact that \(\frac{\partial t}{\partial \hat{y}} = t(1 - t)\), we finally have (writing \(\hat{y}_i^{(k-1)}\) simply as \(\hat{y}\) to save typing):

\[ \begin{aligned} g_i &= \frac{t - y_i}{t(1 - t)} \times t(1-t) \\ &= t(\hat{y}) - y_i, \\ h_i &= t(\hat{y})(1 - t(\hat{y})). \end{aligned} \]
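These are exactly the gradients one would return from a custom objective. Below is a hedged sketch of how they could be wired into xgboost’s custom-objective interface; the built-in binary:logistic objective already does this internally, so the sketch is only for illustration, and the data object in the usage comment is hypothetical.

logloss_obj <- function(preds, dtrain) {
  # preds are raw margins (before the sigmoid), as discussed above.
  y <- getinfo(dtrain, "label")
  p <- 1 / (1 + exp(-preds))   # t(yhat): sigmoid of the margin.
  list(grad=p - y,             # g_i = t(yhat) - y_i
       hess=p * (1 - p))       # h_i = t(yhat) * (1 - t(yhat))
}

# Hypothetical usage on a binary-labeled DMatrix (not created in this notebook):
# bst_logit <- xgb.train(params=list(eta=.3), data=some_binary_dmatrix,
#                        nrounds=10, obj=logloss_obj)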

4.4.2 Exercise: Implement a Single XGB Split

To verify our understanding, as an exercise we build a single tree with just one split (a decision stump). We then compare the result of our implementation to that of xgboost to see if they agree with each other.

lambd <- 1  # L2 regularization.
all_features <- c("x1", "x2")

cost <- function(df) {
  # The XGB optimal cost based on 2nd-order Taylor expansion.
  - sum(df$g)^2 / (2*(sum(df$h) + lambd))
}

cost_on_split <- function(D, split_ind) {
  # Split the data into left and right child nodes.
  DL <- D[1:split_ind, .(g, h)]
  DR <- D[(split_ind + 1):nrow(D), .(g, h)]
  cost(DL) + cost(DR)
}

feature_iteration <- function(D, all_features) {
  feature_best_costs <- list()
  best_split_ind <- list()
  best_split_val <- list()
  for ( x in all_features ) {
    # Sort examples by x.
    setorderv(D, x)

    # Feature value iteration.
    split_cost <- rep(NA_real_, nrow(D) - 1)
    for ( i in seq_len(nrow(D) - 1) ) {
      split_cost[i] <- cost_on_split(D, i)
    }
    feature_best_costs[[x]] <- min(split_cost)
    best_split_ind[[x]] <- which.min(split_cost)
    best_split_val[[x]] <- D[[x]][best_split_ind[[x]]]
  }

  # Determine which feature gains the largest reduction in cost.
  split_col <- names(which.min(feature_best_costs))
  list(split_col=split_col,
       split_ind=best_split_ind[[split_col]],
       split_val=best_split_val[[split_col]])
}


y <- D$y[!is_test]
dy2y <- deriv3(~ (y - yhat)^2, c("yhat"))
yhat <- .5  # This is the initial prediction used by xgboost.
dy <- eval(dy2y)
D <- D[!is_test, g:=attr(dy, "gradient")[,"yhat"]]
D <- D[!is_test, h:=attr(dy, "hessian")[,"yhat","yhat"]]

# Feature iteration on the root split.
r <- feature_iteration(D[!is_test], all_features)
print(r)
$split_col
[1] "x1"

$split_ind
[1] 394

$split_val
[1] 1.821115
# Verify the result with a real xgboost implementation.
bst3 <- xgb.train(params=list(objective="reg:squarederror", max_depth=1),
                  data=Dx, nround=1)
xgb.model.dt.tree(model=bst3)

So the optimal split at the root node from our toy implementation agrees with that of xgboost. Note that the split value reported by xgboost does not correspond to an actual value of the feature. It is a virtual value just for splitting purposes. Indeed it is the average of the two closest values at the split. To check this:

D2 <- D[!is_test]
setorder(D2, x1)
print(mean(D2[r$split_ind:(r$split_ind + 1), x1]))
[1] 1.826493

4.4.3 Parallelization

Unlike bagging trees, which are embarrassingly parallel, boosting trees are trained iteratively, which makes them harder to parallelize. We can, however, still parallelize the training process, not at tree level but at feature level.

Since the algorithm needs to traverse all (or a subset of) features to find the best split, feature iteration is readily parallelizable. All modern GBT libraries support this kind of parallelization.

4.4.4 Comparing to Vanilla GBT

In vanilla gradient boosting at each iteration we fit a tree to the pseudo residuals, i.e., negative gradient of loss function w.r.t. model output. The tree (or any other base learner) is learning from the pseudo residuals with its own loss. Such loss is local and may not reflect the global objective we want to optimize. XGBoost incorporates that global objective (with regularization) into each iteration, making the base learner more relevant to our end goal.

The second-order approximation of global objective further reduces the computing cost by converting the loss computation into a group-by-leaf sum operation, with all gradients pre-computed at the beginning of each iteration.

Notice that in a vanilla GBT we need to calculate the gradients to get the pseudo-residuals, and the base learners need to calculate the gradients of their own objective function (or calculate the impurity measure) in order to do optimization. XGBoost hence is much faster than vanilla GBT in training.

4.5 Approximated Tree Split

The exact split finding algorithm may not be scalable in large applications. So instead xgboost supports an approximated split algorithm described in the following pseudo code:2

[Approximated Tree Split Algorithm]

m: total number of features
s: score to measure the quality of a split
eps: percentile increment, dividing feature roughly into 1/eps buckets
x_pj: the p-th example percentile value on feature j

for feature j in 1 to m:
  divide feature j into buckets based on eps
  aggregate s for each bucket

for feature j in 1 to m:
  for all value x_pj on feature j:
    compute s based on bucketed aggregation

do split using j* at x_pj* associated with maximum s

The idea is to split only at percentile values, introducing one additional parameter called sketch_eps (default 0.03 in xgboost) to control the granularity of the buckets. So the number of searches for each continuous variable will be capped at roughly 1/sketch_eps instead of the number of distinct values, which can be huge.

By default in xgboost the bucketization is proposed once per iteration, at tree initialization. All subsequent splits within that tree are performed on the same set of buckets for each feature. It is also possible to re-propose new buckets after each optimal split, as pointed out in the original paper, but such control is not yet open to the end user in the xgboost interface as of 2019-12-27.
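To make the bucketization idea concrete, here is a rough sketch using plain quantiles on our training feature x1 (this only approximates the idea; xgboost actually uses a weighted quantile sketch):

eps <- .03  # Mirrors the default sketch_eps.
candidate_splits <- quantile(D[!is_test, x1], probs=seq(eps, 1 - eps, by=eps))
length(candidate_splits)  # Roughly 1/eps candidates instead of every distinct value.
head(candidate_splits)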

More specifically, in xgboost the parameter tree_method controls the type of splitting algorithm:

  • tree_method="exact": The exact splitting algorithm by default when dataset is small
  • tree_method="approx": The approximated splitting algorithm
  • tree_method="hist": The histogram tree approach (refer to LightGBM section for details)

4.6 Sparsity Awareness: Split on Missing

One novelty in xgboost is its capability of dealing with missing values. It solves the missing value problem by learning a default branch (either the left or the right child node) for each split. During training, the missing values at a node can go to either the left or the right branch. The quality score is computed twice for each candidate split, once assigning the missing values to the left branch and once to the right. In the end the split algorithm returns not only the optimal split but also the optimal default branch for missing values at each node.
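To illustrate the default-branch idea (a toy sketch with made-up gradient sums, not xgboost’s actual code), the gain of a candidate split can be evaluated twice, once sending the missing-value group to the left and once to the right, keeping the better direction as the default:

gain <- function(GL, HL, GR, HR, lambda=1) {
  # Reduction in cost from the parent node when splitting into (GL, HL) and (GR, HR).
  .5 * (GL^2 / (HL + lambda) + GR^2 / (HR + lambda) - (GL + GR)^2 / (HL + HR + lambda))
}

# Made-up gradient/hessian sums: non-missing examples on each side, plus the missing group.
GL <- -4; HL <- 10
GR <- 6;  HR <- 12
G_na <- -2; H_na <- 3

gain_missing_left  <- gain(GL + G_na, HL + H_na, GR, HR)
gain_missing_right <- gain(GL, HL, GR + G_na, HR + H_na)
c(left=gain_missing_left, right=gain_missing_right)  # The larger gain wins the default branch.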

xgboost is indeed not the first tree model to handle missing values in its own way. For example, rpart also handles missing values in a particular manner. In rpart, each split comes with additional surrogate variables to handle cases where the split feature value is missing at inference time. A surrogate variable is a feature other than the best split feature, used to assign examples to the resulting child nodes, subject to a minimum quality criterion. For more details readers may refer to Therneau and Atkinson (2019).

Compared to rpart’s surrogate variable approach, which increases the computing cost, xgboost’s approach is more computationally efficient and seems to also result in better model accuracy.

4.7 Random Forest via XGBoost

xgboost is a general framework for model (especially tree) ensembles. Since there is no rule stopping us from building more than one tree at each iteration, we can effectively grow multiple trees in a single round to train a random forest with xgboost. Even more, we can train a boosted random forest by growing multiple trees over multiple iterations. That is, we use RF as our base learner in a gradient boosting machine.

5 LightGBM

Ke et al. (2017) open source the lightgbm package, yet another powerful GBT framework with its own remarkable novel implementation.3

library(lightgbm)
packageVersion("lightgbm")
[1] '2.3.2'

Here is a simple training task using lightgbm on our simulated data:

Dx2 <- lgb.Dataset(as.matrix(D[!is_test, .(x1, x2)]), label=D[!is_test, y])
Dt2 <- lgb.Dataset(as.matrix(D[is_test, .(x1, x2)]), label=D[is_test, y])

# We set the parameters such that lgb behaves like xgb's default for common parameters.
lgbst <- lgb.train(params=list(objective="regression", boosting="gbdt", lambda_l2=1,
                               min_sum_hessian_in_leaf=1, min_data_in_leaf=0),
                   data=Dx2, valids=list(test=Dt2),
                   num_iterations=100, early_stopping_rounds=5, verbose=0)

yhat <- predict(lgbst, as.matrix(D[is_test, .(x1, x2)]))
mse(D[is_test, y], yhat)
[1] 1.360733

In general lightgbm trains even faster than xgboost, with comparable model accuracy. This is achieved by a combination of different strategies (some of them novel). We will walk through each of them in the following sections.

5.1 Histogram Tree (Quantization)

lightgbm uses a histogram-based tree split algorithm.4

Instead of splitting at raw feature values, the histogram-based approach bucketizes continuous features into discrete bins, forming a limited number of feature bins (in lightgbm the default is max_bin=255), and then splits on those bins. This largely reduces the complexity from \(O(n)\) to \(O(\text{max_bin})\) during split finding for a feature.5

5.1.1 Greedy Equal-Density Binning

There are two strategies for binning: equal bin length (Li, Wu, and Burges 2008) or equal bin density. lightgbm adopts the latter, so the number of data points will be roughly the same for most bins. This effectively transforms the feature distribution into a uniform one when finding the best split.

lightgbm doesn’t use the entire training dataset to construct the bin boundaries. Instead it uses a random subset (only applicable when the data size is huge enough). This random sample size can be controlled by the parameter bin_construct_sample_cnt=200000. Additionally, min_data_in_bin=3 can be used to control the minimal number of data points inside one bin.

Note that this approach is different from the percentile splitting we’ve discussed for approximated tree split implemented in xgboost:

The existing approximate splitting method in xgboost also buckets continuous features into discrete bins to speed up training. The approx method generates a new set of bins for each iteration, whereas the hist method re-uses the bins over multiple iterations.6

5.1.2 Histogram Subtraction Trick

To calculate the information gain from a split we need to calculate the cost of the resulting two nodes. Remember that the cost of a single node \(j\) is:

\[ \text{Cost of Node j} = - \frac{1}{2} \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j}h_i + \lambda}. \]

A key observation here is that what we need is simply summation of gradients inside the node. Since an example is assigned to either left (\(I_j^L\)) or right (\(I_j^R\)) child after a split, apparently we have:

\[ \begin{aligned} \sum_{i \in I_j} g_i &= \sum_{i \in I_j^L} g_i + \sum_{i \in I_j^R} g_i, \\ \sum_{i \in I_j} h_i &= \sum_{i \in I_j^L} h_i + \sum_{i \in I_j^R} h_i. \end{aligned} \]

This means that when calculating the total gain of a split, we only need to compute the gradient sums of one child node; the sums of the other node can be obtained by subtracting them from those of the parent, before plugging into the cost function. To save even more resources, we can always choose to compute on the smaller child node. This is one key factor why lightgbm can run even faster than xgboost’s default or approximated tree methods.
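A tiny sketch of the trick with toy gradients (illustration only):

set.seed(1)
g_all <- rnorm(1000)   # Toy per-example gradients.
h_all <- rep(1, 1000)  # Toy hessians (unity, as with squared loss).
left_idx <- sample(c(TRUE, FALSE), 1000, replace=TRUE)

G_parent <- sum(g_all); H_parent <- sum(h_all)
G_left <- sum(g_all[left_idx]); H_left <- sum(h_all[left_idx])

# Right child by subtraction: no second pass over the data is needed.
G_right <- G_parent - G_left
H_right <- H_parent - H_left

node_cost <- function(G, H, lambda=1) -.5 * G^2 / (H + lambda)
c(left=node_cost(G_left, H_left), right=node_cost(G_right, H_right))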

5.1.3 Speed-Accuracy Tradeoff

In the original paper it is demonstrated with experiments that a total of 255 bins (plus one bin specifically for zeroes) can perform very well. We can always choose more bins for potentially better accuracy, but at the price of slowing down the training.

We can use a single decision stump to easily see the effect of max_bin. First let’s run with the default setting:

lgbst2 <- lgb.train(params=list(objective="regression", boosting="gbdt", lambda_l2=1,
                                num_leaves=2),
                    data=Dx2, num_iterations=1, verbose=0)

lgb.model.dt.tree(model=lgbst2)[]

The optimal root split coincides with that of xgboost and also our toy implementation. Note that we have far more than max_bin=255 distinct values in our splitting variable:

distinct_cnt <- uniqueN(D[!is_test, x1])
print(distinct_cnt)
[1] 830

Instead of scanning 829 candidate split points, LightGBM scans at most 255 bins to locate the same optimal split.

Now let’s reduce the maximal number of bins to a rather small number:

# We need to re-construct the dataset since lgb caches the bins in dataset.
Dx2 <- lgb.Dataset(as.matrix(D[!is_test, .(x1, x2)]), label=D[!is_test, y])
lgbst3 <- lgb.train(params=list(objective="regression", boosting="gbdt", lambda_l2=1,
                                num_leaves=2, max_bin=10),
                    data=Dx2, num_iterations=1, verbose=0)

lgb.model.dt.tree(model=lgbst3)[]

By lowering max_bin we now end up with a sub-optimal split, with a smaller gain for the split.

5.1.4 Handling Sparsity

The histogram tree cannot be fully optimized for sparse data, though, since a feature’s bin value needs to be accessed no matter whether it is zero or not. As a result, to speed up the training runtime there must be other strategies targeted at sparsity optimization.

5.2 Gradient-Based One-Side Sampling

To reduce the training time one idea is to reduce the number of candidate splits when finding the best split. Without any approximation or quantization, for each feature the algorithm needs to traverse all distinct feature values to find the best split. This can be prohibitively slow for large applications. xgboost tackles this with its approximated tree split using only percentiles as split values. lightgbm combines several different strategies. The histogram tree method we just discussed is one of them. Another such strategy is called Gradient-Based One-Side Sampling (GOSS).

When finding the best split, GOSS down-samples the examples using their absolute gradient sizes as sampling weights. It is based on the fact that examples with smaller gradients are generally already well-trained (and hence their errors are smaller). Examples with larger gradients, instead, should be the focus for the model to improve further. The concept is borrowed from AdaBoost, where examples are re-weighted at each boosting round to guide the model toward learning the harder examples.

GOSS keeps the examples with large gradients and only samples among the examples with small gradients. Since this screws up the distribution, which in turn hurts the model’s ability to learn, GOSS compensates for this effect by applying a constant factor \(\frac{1 - a}{b}\) to the sampled small-gradient examples, to “re-weight” the data back to its original distribution. Here \(a\) is the top percentage of examples to keep, sorted descendingly by gradient, and \(b\) is the sampling rate for the rest of the examples (with smaller gradients than those top ones). The authors provide theoretical grounds in their paper to support that GOSS approximates well when the data size is huge.
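Here is a toy sketch of the sampling scheme (illustration only; the keep/sample rates below roughly mirror what lightgbm exposes as top_rate and other_rate):

set.seed(2)
g_goss <- rnorm(10000)  # Toy per-example gradients.
a <- .2                 # Fraction of large-gradient examples always kept.
b <- .1                 # Sampling rate for the remaining small-gradient examples.

ord <- order(abs(g_goss), decreasing=TRUE)
top <- ord[1:floor(a * length(g_goss))]
rest <- ord[(floor(a * length(g_goss)) + 1):length(g_goss)]
sampled <- sample(rest, floor(b * length(g_goss)))

# Sampled small-gradient examples are up-weighted by (1 - a) / b.
weights <- c(rep(1, length(top)), rep((1 - a) / b, length(sampled)))
c(scanned_fraction=(length(top) + length(sampled)) / length(g_goss),
  reweighted_total=sum(weights))  # Re-weighted size matches the original data size.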

To use GOSS in lightgbm we need to specify the parameter boosting="goss". The default boosting="gbdt" is just the modern GBT with leaf-wise growing histogram trees.

5.3 Exclusive Feature Bundling

GOSS deals with long data (many examples). What about very wide (high-dimensional) data? In such a case, lightgbm adopts a strategy to group together features that rarely co-exist into a single feature and then does the split scan. This is referred to as Exclusive Feature Bundling, or EFB.

lightgbm formulates EFB as a graph coloring problem, taking features as vertices and adding edges for every two features that are not mutually exclusive. To speed things up even more, it introduces a certain level of “pollution” by allowing features that are not 100% mutually exclusive to also be bundled together. This can be controlled by the parameter max_conflict_rate=0. The higher the rate, the more the model accuracy is compromised.

EFB can be enabled by parameter enable_bundle (and by default it is on already).

5.4 Leaf-Wise Tree Split

Leaf-wise tree split is also called best-first split (Shi 2007). Classical trees grow their nodes in a depth-first order. At each layer all nodes are split in order to construct the next layer. The final tree layout can be asymmetric due to post-training pruning or other regularization techniques, but the growth policy is to always go deeper first for each node at the same level.

A leaf-wise growing tree doesn’t always split every node at the same layer down to the next layer. Instead, it chooses the node that can result in the best gain for every split. Since one of the resulting two additional nodes at current split may result in the next best split, the tree may well grow horizontally (leaf-wise) instead of vertically (depth-wise).

Empirically, a single leaf-wise growing tree tends to over-fit more than a depth-wise growing tree. But as a weak learner (with a shallower structure) in gradient boosting it can outperform the depth-wise approach. To further regularize the tree, lightgbm also has max_depth to control the depth of the tree, even though it is growing leaf-wise.

5.5 Optimal Categorical Encoding

Usually we use one-hot encoding to represent categorical variables. To be memory-friendly, such data input is usually a sparse matrix. But when a categorical feature has high cardinality with a skewed distribution, one-hot encoding hurts the performance of a tree learner.

Here is a direct quote from lightgbm’s official doc:

Instead of one-hot encoding, the optimal solution is to split on a categorical feature by partitioning its categories into 2 subsets. If the feature has k categories, there are 2^(k-1) - 1 possible partitions. But there is an efficient solution for regression trees. It needs about O(k * log(k)) to find the optimal partition. The basic idea is to sort the categories according to the training objective at each split. More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.

Readers interested can refer to Fisher (1958) for the details of the solution to optimal partition in a categorical variable.
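A toy sketch of the sort-then-scan idea (under a squared loss, so sum_gradient / sum_hessian is just the mean residual per category; this is an illustration, not lightgbm’s code):

set.seed(3)
cat_x <- sample(letters[1:6], 500, replace=TRUE)
g_cat <- rnorm(500) + as.integer(factor(cat_x))  # Toy gradients that depend on the category.
h_cat <- rep(1, 500)

# sum_gradient / sum_hessian per category, then sort: k - 1 candidate splits
# along this ordering instead of 2^(k-1) - 1 subset partitions.
stat <- tapply(g_cat, cat_x, sum) / tapply(h_cat, cat_x, sum)
names(sort(stat))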

5.6 Parallel Learning

We’ve mentioned that GBT is an iterative training algorithm, so theoretically the parallelization must be done at a level other than the base learner. In lightgbm parallelization is seriously implemented at several such levels, with support for both multi-threading and multi-machine runtimes.

(By default tree_learner=serial so there is no parallelization at learning.)

5.6.1 Feature Parallel

tree_learner=feature

This is the most straightforward parallelization we can think of. Data is partitioned vertically so each thread only inspects a subset of features. Best splits are found simultaneously among those features to get the local best splits, then a communication overhead is required to arrive at the global best. The speedup is limited if our data is not wide but very long.

5.6.2 Data Parallel

tree_learner=data

Instead of vertical parallelization, lightgbm also supports horizontal parallelization. Local gradient histograms are constructed for each partition and then merged into a global histogram for finding the best split.

5.6.3 Voting Parallel

tree_learner=voting

Meng et al. (2016) propose the idea of parallel voting tree. The general idea can be summarized into the following flow:

  1. Partition data according to multiple threads/machines
  2. In each partition a local voting is done to propose top-k features for the best split
  3. Top-2k features are collected based on all local votes
  4. Full data for the top-2k is collected and a global best split is searched within them

Such a parallel voting strategy has limited communication overhead but approximates very well in accuracy. In lightgbm, by default when doing parallel voting we have top_k=20.

In general, since lightgbm is already lightning fast, only for very huge data will we need to resort to parallel voting.

6 XGBoost v.s. LightGBM

xgboost and lightgbm have a lot in common, even their APIs (in both Python and R) are highly similar to each other. They use the same global loss function with 2nd-order approximation and regularization to optimize the model ensembler. They allow plug-and-play custom losses and evaluation metrics. They handle missing values in the same way by learning a default branch at each split, …

Still, there are subtle differences we should be aware of. In this section we discuss several important differences between the two.

6.1 On Tree Type

As a matter of fact, since the histogram-based approach together with leaf-wise growth is so powerful, after lightgbm’s entry into the community xgboost also quickly caught up with the same implementation. So in xgboost we can set the following parameters

tree_method="hist"
grow_policy="lossguide"

to train a XGBoost model with leaf-wise growing histogram-based trees, just like LightGBM.

6.2 On Base Learner

lightgbm only supports tree learner (using the boosting parameter) while xgboost also supports linear learners (using the booster parameter).

In xgboost two linear learners are supported: A simple linear regressor and a generalized linear model with tweedie distribution.

For tree learners other than the regression tree, they both support two additional tree models:

  1. Random Forest
  2. DART (Rashmi and Gilad-Bachrach 2015)

One notable difference in RF is that while xgboost allows training a boosted RF, lightgbm only supports training a regular RF (there is no boosting involved).

6.3 On Hyperparameters

There are tons of hyperparameters in the modern GBT models we discussed in this notebook. Here we only walk through the important and non-trivial ones.

6.3.1 Regularization Parameters

6.3.1.1 L1/L2 Penalties

Both packages implement the same penalties on cost function. Only the names and default values are different. In lightgbm L1/L2 penalties are defined by:

lambda_l1=0
lambda_l2=0

And in xgboost they are:

alpha=0
lambda=1

6.3.1.2 Leaf Size

One way to regularize a tree model is to limit the size of its terminal node (leaf). In lightgbm this can be done by either limiting the minimum number of examples or limiting the minimum sum of hessian (\(\sum_{i \in I_j}h_i\) for node \(j\)). Here are the default values:

min_data_in_leaf=20
min_sum_hessian_in_leaf=1e-3

In lightgbm an improper value for this parameter usually results in the following warning message during training runtime:

[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

In such a case, try lowering the value, since the algorithm is complaining that it cannot split further without violating the minimum node-size limits, given the targeted number of leaves. Of course another cure is to reduce tree complexity by decreasing num_leaves.

In xgboost to control leaf size we can only limit on hessian but not the exact example counts:

min_child_weight=1

Remember that in the case of a squared loss the hessian is indeed unity (after dropping the constant factor), so the hessian sum will be the same as the example count.

Notice the difference in default values between lightgbm and xgboost. The former has less tolerance for over-fitting (it is more regularized).

6.3.1.3 Tree Size

A larger tree has more leaves and hence is more complicated. More leaves also mean more interaction among variables, which in turn controls the degree of non-linearity in the model.7

Since lightgbm grows a tree leaf-wise, there are two parameters to control tree size:

max_depth=-1
num_leaves=31

num_leaves is the main factor, and max_depth supplements it.

In xgboost one can choose to grow a tree either depth-wise (the default) or leaf-wise. For depth-wise tree there is only one relevant parameter:

max_depth=6

For leaf-wise growing tree to control number of leaves:

max_leaves=0

For a leaf-wise tree to grow as many leaves as a depth-wise tree, the former will usually get deeper than the latter. This is why simultaneously controlling the number of leaves and the tree depth is a good strategy. Be aware that by default lightgbm doesn’t limit tree depth at all.

6.3.1.4 Sub-Sampling

Both libraries support column sampling and row sampling. In lightgbm we have:

bagging_freq=0
bagging_fraction=1
feature_fraction=1
feature_fraction_bynode=1

Note that bagging_freq controls how many iterations pass before the bagging is re-done. So bagging_freq=1 will do row sampling at the start of every iteration. We can even subsample only positive or negative examples (for binary classification) by using pos_bagging_fraction or neg_bagging_fraction. This can be handy when dealing with an imbalanced dataset.

For feature sampling, feature_fraction controls the sampling at the beginning of each iteration, while feature_fraction_bynode controls the sampling at each tree node. The latter is in general more regularized since the model training is subject to stronger randomness.

In xgboost we have the following relevant parameters:

subsample=1
colsample_bytree=1
colsample_bylevel=1
colsample_bynode=1

It has less control over bagging: only one parameter, subsample, is used, and the bagging is done at the beginning of each iteration. It has more control over column sampling, though: besides column sampling at the iteration or node level, it also supports column sampling at the tree-depth level.

Note that in both libraries the bagging is done without replacement. So none of them implements bootstrap bagging, which is proposed by the original random forest paper.

6.4 On Imbalanced Data

6.4.1 Class Re-Weighting

Simply based on hyperparameters, there are two ways to handle an imbalanced dataset in a binary classification task. The first approach is to re-weight the examples. Both xgboost and lightgbm have a parameter for this purpose:

scale_pos_weight=1

To re-balance the distribution to 50-50 the weight should be:

number of negative samples / number of positive samples

This will help the model in its general ranking performance (so metrics such as AUC will improve a lot). In lightgbm there is also a convenient boolean parameter just for this: is_unbalance.
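For example, given a hypothetical 0/1 label vector, the suggested value is just the class ratio:

label <- rbinom(1000, 1, .05)  # A hypothetical, heavily imbalanced 0/1 label.
scale_pos_weight <- sum(label == 0) / sum(label == 1)
scale_pos_weight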

How does the weighting factor affect our model training? It is directly through the gradients. The scale factor will simply be used as a multiplier on both gradient \(g_i\) and hessian \(h_i\) for that example, provided that the example \(i\) is a positive example.

In xgboost’s source code it is:

_out_gpair[_idx] = GradientPair(Loss::FirstOrderGradient(p, label) * w,
                                Loss::SecondOrderGradient(p, label) * w);

where w is the scale factor (after considering also user-supplied custom weights for each example).

And in lightgbm’s source code it is:

const double response = -label * sigmoid_ / (1.0f + std::exp(label * sigmoid_ * score[i]));
const double abs_response = fabs(response);
gradients[i] = static_cast<score_t>(response * label_weight  * weights_[i]);
hessians[i] = static_cast<score_t>(abs_response * (sigmoid_ - abs_response) * label_weight * weights_[i]);

where label is 1 for positive and -1 for negative, label_weight is the scale factor, sigmoid_ is the scale factor for sigmoid function (default at 1), and weights_ is user-supplied custom weights.

6.4.2 Margin Clipping

But since any non-unity weight also screws up the distribution of our training data, the model will not be able to estimate the class probability correctly. In cases where estimating the correct probability matters, we have another parameter that may help, which is also shared by the two libraries:

max_delta_step=0

Here is a quote from xgboost’s official document on this parameter:

Usually this parameter is not needed, but it might help in logistic regression when class is extremely imbalanced. Set it to value of 1-10 might help control the update.

Technically, max_delta_step caps the leaf weight (the base model logit output, or margin) learnt at each iteration. Note that this is different from the learning rate, which only applies a fraction of the update; max_delta_step effectively limits the absolute update amount. And the learning rate will still apply to the capped update.

This is a bit similar to gradient clipping in gradient descent, where we cap the gradient update to a hard limit to avoid aggressive update.

6.5 On Categorical Encoding

One important user-facing difference is lightgbm’s special support for categorical features without one-hot encoding. This is probably one of the handiest features in modern machine learning libraries. It not only improves the model accuracy but also simplifies the data pipeline and hence reduces engineering efforts.

For xgboost, categoricals must be either one-hot encoded or converted to integers under an ordinal assumption. Empirically, even when the feature is not ordinal, converting it into integers sometimes still works pretty well in terms of accuracy, with the added benefit that the training runtime can be reduced a lot and the data pipeline simplified.
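
For reference, such an integer conversion is a one-liner. This is illustration only; the resulting order is just the alphabetical factor levels, i.e. an arbitrary ordinal assumption.

# Illustration only: encode a non-ordinal feature as plain integers.
color <- c("red", "green", "blue", "green", "red")
as.integer(factor(color))  # 3 2 1 2 3 (levels sorted alphabetically: blue, green, red)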

To demonstrate the performance gain from the optimal categorical encoding in LightGBM, let's simulate some data. Here we use the same data generating process set up previously, but with one more categorical variable that enters linearly into the model.

# Create simulated data for categorical experiment.
library(Matrix)

n <- 10000
z1 <- rnorm(n, 2, 4)
z2 <- runif(n)
e <- rnorm(n)

# Create one categorical with moderate cardinality and skewed distribution.
categorical <- sample(1:20, size=n, replace=TRUE, prob=1.3^(seq(2, 21)))
table(categorical)
categorical
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
  16   21   30   34   53   59   62  105  124  180  198  294  346  509  602  788 
  17   18   19   20 
1017 1445 1837 2280 
sparse_cat <- t(fac2sparse(factor(categorical)))  # One-hot encode.
stopifnot(all.equal(as.numeric(table(categorical)), 
                    colSums(sparse_cat), check.attributes=FALSE))

# Random effects on categories.
cat_beta <- rnorm(ncol(sparse_cat))

# DGP: g = 3 z_1 + 6 z_2 -1.5 z_1z_2 + categorical dummy effects + e
g <- 3 * z1 + 6 * z2 - 1.5 * z1 * z2 + sparse_cat %*% cat_beta + e
g <- g[,1]  # Strip redundant dim.

# Prepare model design matrix.
in_mat <- cbind(z1, z2, sparse_cat)
head(in_mat)  # Training data in sparse matrix format.
6 x 22 sparse Matrix of class "dgCMatrix"
   [[ suppressing 22 column names 'z1', 'z2', '1' ... ]]
                                                                   
[1,] -1.6390872 0.006529306 . . . . . . . . . . . . 1 . . . . . . .
[2,] -4.3118135 0.694535395 . . . . . . . . . . . . . . . . . 1 . .
[3,]  1.5151302 0.992225272 . . . . . . . . . . . 1 . . . . . . . .
[4,]  3.5821921 0.705857143 . . . . . . . . . . . . . . . . . . 1 .
[5,] -0.5556953 0.787081922 . . . . . . . . . . . . . . . . . 1 . .
[6,]  1.0538218 0.330773519 . . . . . . . . . . . . . . . . . 1 . .
in_dt <- data.table(z1=z1, z2=z2, cat=categorical)
head(in_dt)  # Training data in tabular format.
# Train-test split.
is_test_g <- runif(n) < .2

Now train an XGBoost model with the one-hot-encoded sparse matrix as input. To be fair, we set up the parameters such that xgboost also grows leaf-wise histogram trees as its base learners.

run_xgb <- function(v=1) {
  xgb_dtrain <- xgb.DMatrix(in_mat[!is_test_g,], label=g[!is_test_g])
  xgb_dtest <- xgb.DMatrix(in_mat[is_test_g,], label=g[is_test_g])

  # We set the parameters such that xgb use lgb-style leaf-wise histogram tree.
  xgb.train(params=list(objective="reg:squarederror", booster="gbtree",
                        eta=.3, max_leaves=7, max_depth=3, nthreads=1,
                        tree_method="hist", grow_policy="lossguide"),
            data=xgb_dtrain, watchlist=list(test=xgb_dtest),
            nround=50, early_stopping_rounds=5, verbose=v, print_every_n=10)
}

invisible(run_xgb())
[1] test-rmse:8.549810 
Will train until test_rmse hasn't improved in 5 rounds.

[11]    test-rmse:1.565751 
[21]    test-rmse:1.313399 
[31]    test-rmse:1.241104 
[41]    test-rmse:1.219501 
[50]    test-rmse:1.196451 

Now it's LightGBM's turn. We directly feed a data matrix with only 3 columns, one of them being categorical but represented as integers, without the ordinal assumption, by specifying it explicitly via the categorical_feature argument when constructing the lgb.Dataset.

run_lgb <- function(v=1) {
  lgb_dtrain <- lgb.Dataset(as.matrix(in_dt[!is_test_g]), label=g[!is_test_g],
                            colnames=names(in_dt), categorical_feature="cat")
  lgb_dtest <- lgb.Dataset(as.matrix(in_dt[is_test_g]), label=g[is_test_g],
                           colnames=names(in_dt), categorical_feature="cat")

  # We set the parameters such that lgb behaves like xgb's default for common parameters.
  lgb.train(params=list(objective="regression", boosting="gbdt",
                        lambda_l2=1, learning_rate=.3, metric="rmse",
                        num_leaves=7, max_depth=3, num_threads=1,
                        min_sum_hessian_in_leaf=1, min_data_in_leaf=0),
            data=lgb_dtrain, valids=list(test=lgb_dtest),
            num_iterations=50, early_stopping_rounds=5, verbose=v, eval_freq=10)
}

invisible(run_lgb())
[1]:    test's rmse:6.84807 
[11]:   test's rmse:1.43646 
[21]:   test's rmse:1.21961 
[31]:   test's rmse:1.17573 
[41]:   test's rmse:1.14199 
[50]:   test's rmse:1.12623 

In our specific parameter setup LightGBM outperforms XGBoost, but the difference is minor since we didn't tune either model seriously. There are already lots of benchmarks on the internet with different experiments and datasets, and the results are in general mixed. We consider the two models comparable in terms of accuracy. Here we only care about the performance gap resulting from the different treatment of categorical variables.

Now we benchmark the model training runtime:

library(microbenchmark)

print(microbenchmark(
  xgb=run_xgb(v=0),
  lgb=run_lgb(v=0),
  times=10
))
Unit: milliseconds
 expr      min       lq      mean   median       uq      max neval
  xgb 788.7158 790.1674 796.14318 792.5025 800.2261 817.4937    10
  lgb  26.1314  28.3054  29.80156  28.9952  29.4064  38.3660    10

Clearly LightGBM is much faster than XGBoost here, thanks to its optimal categorical encoding scheme. Not to mention it also saves us the trouble of preparing a one-hot encoder for our data.

7 Concluding Remarks

Modern gradient boosting trees are powerful machine learning models for lots of real-world applications. In this notebook we thoroughly review two such frameworks from the open source community: xgboost and lightgbm. We choose them not only because they are both implemented with very high engineering quality, but also because they are both extremely popular and widely adopted among data science practitioners.

For tools so widely adopted, we feel there is a need to demystify why they are so effective by understanding how they differ from traditional (vanilla) gradient boosting machines. We focus on the technical details and the novelties each brings in bettering the GBT family. We hope this notebook helps practitioners better understand the tools at hand, so they can utilize them to their maximum potential.

8 References

Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. KDD ’16. New York, NY, USA: ACM. https://doi.org/10.1145/2939672.2939785.

Fisher, Walter D. 1958. “On Grouping for Maximum Homogeneity.” Journal of the American Statistical Association 53 (284): 789–98.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 1189–1232.

Ke, Guolin, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” In Advances in Neural Information Processing Systems, 3146–54.

Li, Ping, Qiang Wu, and Christopher J. Burges. 2008. “McRank: Learning to Rank Using Multiple Classification and Gradient Boosting.” In Advances in Neural Information Processing Systems, 897–904.

Meng, Qi, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, and Tie-Yan Liu. 2016. “A Communication-Efficient Parallel Algorithm for Decision Tree.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1279–87. Curran Associates, Inc. http://papers.nips.cc/paper/6381-a-communication-efficient-parallel-algorithm-for-decision-tree.pdf.

Rashmi, Korlakai Vinayak, and Ran Gilad-Bachrach. 2015. “DART: Dropouts Meet Multiple Additive Regression Trees.” In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS).

Shi, Haijian. 2007. “Best-First Decision Tree Learning.” PhD thesis, The University of Waikato.

Therneau, Terry, and Beth Atkinson. 2019. Rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.


  1. Quoted from the creator of xgboost: https://www.quora.com/What-is-the-difference-between-the-R-gbm-gradient-boosting-machine-and-xgboost-extreme-gradient-boosting↩︎

  2. Approximated split search is not particularly novel in XGBoost. It has been well studied in other academic works.↩︎

  3. As of 2019-12-27 lightgbm is not yet available on CRAN for its R wrapper. Please refer to the official instruction to install it separately.↩︎

  4. Histogram trees are not particularly novel in lightgbm. The approach has been around in the literature for quite a long time, and other GBT libraries have already implemented it.↩︎

  5. The complexity of building the histogram itself is still \(O(n)\), but the operation is simple and negligible compared to the split finding.↩︎

  6. https://github.com/dmlc/xgboost/issues/1950↩︎

  7. When a subsequent split is conditioned on a previous split on another variable, we consider this a variable interaction.↩︎

## Linear Model as Weak Learner

The choice of weak learner has a huge impact on the effectiveness of gradient boosting machines.
Not all weak learners can gain equally from such ensemble.
In this section we demonstrate a model ensemble using a weak linear learner.
It turns out that gradient boosting with a linear model may do no good at all.

Let's create some simulated data based on a simple ground-truth (non-linear) data generating process:

```{r sim_data}
library(data.table)

set.seed(666)

# DGP: y = 3 x_1 + 6 x_2 -1.5 x_1x_2 + e
n <- 1000
x1 <- rnorm(n, 2, 4)
x2 <- runif(n)
e <- rnorm(n)
y <- 3 * x1 + 6 * x2 - 1.5 * x1 * x2 + e

# Train-test split.
is_test <- runif(n) < .2
D <- data.table(x1=x1, x2=x2, y=y)
head(D)
```

We fit a linear model with all features (consdier it as a "strong" learner):

```{r sim_data_fit}
# Fit a linear model with all features.
f <- lm(y ~ x1 + x2, data=D[!is_test])
D <- D[, yhat:=predict(f, D)]

# MSE as our loss function.
mse <- function(y, yhat) {
  mean((y - yhat)^2)
}

# Compute testing loss.
with(D[is_test], mse(y, yhat))
```

Now let's ensemble a weak learner which only takes one feature.
We train the model based on gradient boosting framework.
Since our loss is squared error,
we can simply fit the model residual of the previous ensemble:

```{r gbm_linear_learner}
f1 <- lm(y ~ x1, data=D[!is_test])
D <- D[, yhat1:=predict(f1, D)]
D <- D[, resid1:=y - yhat1]

with(D[is_test], mse(y, yhat1))

f2 <- lm(resid1 ~ x2, data=D[!is_test])
D <- D[, yhat2:=predict(f2, D)]
D <- D[, resid2:=y - yhat1 - yhat2]

with(D[is_test], mse(y, yhat1 + yhat2))

f3 <- lm(resid2 ~ x1, data=D[!is_test])
D <- D[, yhat3:=predict(f3, D)]
D <- D[, resid3:=y - yhat1 - yhat2 - yhat3]

with(D[is_test], mse(y, yhat1 + yhat2 + yhat3))

f4 <- lm(resid3 ~ x2, data=D[!is_test])
D <- D[, yhat4:=predict(f4, D)]
D <- D[, resid4:=y - yhat1 - yhat2 - yhat3 - yhat4]

with(D[is_test], mse(y, yhat1 + yhat2 + yhat3 + yhat4))
```

As one may realize,
the boosted model won't outperform a simple linear model with all features included.
Is it because of our naive built-from-scratch implementation?
Not really.
Let's use `xgboost`--one of the state-of-the-art gradient boosting frameworks--to train a GBM with linear learner:

```{r xgb_linear_learner}
library(xgboost)

Dx <- xgb.DMatrix(as.matrix(D[!is_test, .(x1, x2)]), label=D[!is_test, y])
Dt <- xgb.DMatrix(as.matrix(D[is_test, .(x1, x2)]), label=D[is_test, y])

bst <- xgb.train(params=list(objective="reg:squarederror", booster="gblinear"), 
                 data=Dx, watchlist=list(test=Dt),
                 nround=50, early_stopping_rounds=5, verbose=0)

yhat <- predict(bst, as.matrix(D[is_test, .(x1, x2)]))
mse(D[is_test, y], yhat)
```

The result is roughly the same.

In general,
gradient boosting does not help imporving a linear model.
This is by large due to the fact that GBM is an additive model.
Combining linear models additively will just result in yet another linear model.
Indeed,
using a linear model as the base learner will have the effect of solving a full linear system (a model with all features) in an iterative and approximated manner.
So at best the result is the same as a single linear model with all features.

## Tree Model as Weak Learner

Based on empirical findings,
the best weak learner in GBM is a tree.
Most of the gradient boosting (or just boosting) implementations use tree models as their base learner.

For our simulated data,
gradient boosted trees can easily outperform the linear model:

```{r xgb_tree_learner}
bst2 <- xgb.train(params=list(objective="reg:squarederror", booster="gbtree"), 
                  data=Dx, watchlist=list(test=Dt),
                  nround=50, early_stopping_rounds=5, verbose=0)

yhat <- predict(bst2, as.matrix(D[is_test, .(x1, x2)]))
mse(D[is_test, y], yhat)  # Smaller loss achieved.
```

For the rest of the notebook we will focus our discussion on tree-based gradient boosting.

# Tree Ensembles

Before we talk about gradient boosting trees,
let's have a brief review on tree models in general,
from a single tree to tree ensembles.

Tree models are non-differentiable.
Gradient-based optimizers won't directly work on tree models.
This makes them very different from other machine learning models such as neural networks.

Tree models are powerful because they are non-linear.
Features are allowed to interact with each other in a sequential and loss-guided manner.
And they are invariant to feature scale since a tree split is only ordinal.
This simplifies data preprocessing and hence the overall pipeline.

## A Single Tree

The idea of a single tree model is rather simple.
The model uses a split finding algorithm to enumerate over all the possible splits on all the features.
Once a best split is found based on a particular measure,
the model move on to find the next split given the current node,
growing the tree deeper and deeper until a certain criterion is met.

Here is a general pseudo code describing a single split algorithm in its *exact* form:

```
[Exact Tree Split Algorithm]

m: total number of features
n: total number of examples
s: score to measure the quality of a split
x_ij: the i-th example value on feature j

for feature j in 1 to m:
  sort x_j
  for example x_ij in 1 to n on feature j:
    compute s

do split using j* at x_ij* associated with maximum s
```

A Tree model involves multiple *sequential* splits to arrive at the terminal node which gives prediction.
Here are several important components that must be taken care for a detailed implementation:

#### Feature Iteration {-}

The algorithm is greedy.
To search for the optimal split all features are traversed.
The immediate consequence is the over-fitting nature of a single tree.
And it is such difficulty in generalization leads to the idea of tree ensembles.

#### Feature Value Iteration {-}

For each feature,
a sorting operation is required to traverse over the (distinct) sorted values of that feature.
The more values a feature can take the more computing resources are needed to find the best split.

#### Quality of Split Measure {-}

This is probably the most important part of a tree algorithm.
It directly affects the accuracy of the resulting tree.
The quality score measures how good a node is splitting the examples.
In the literature this is often refered to as *impurity*.
A classical impurity measure is Gini index for classification and residual sum of squares for regression.

#### Depth of the Tree {-}

How deep (how many splits) should we grow the tree?
This is a hyper-parameter that can affect model generalization.
For classification we can grow a tree as deep as until every node contains only one single example.
That is,
until no split can be further achieved.
A tree that is too deep will definitely suffer from over-fitting.

A single tree is usually very bad at generalization.
The only benefit is probably its *interpretability* since the tree structure itself is a set of human friendly decision rules.
However due to its poor generalization over unseen data,
such interpretability is practically of no use.

### A Regression Tree

Continue with our simulated dataset,
let's quickly build a single regression tree and evaluate its performance on the testing set.

```{r sim_data_tree_train}
library(rpart)  # Recursive PARTitioning 
library(rpart.plot)

f <- rpart(y ~ x1 + x2, data=D[!is_test], method="anova", model=TRUE)
rpart.plot(f, digits=3)  # For each node the mean value and population are shown.
```

The testing score is worse than a linear model:

```{r sim_data_tree_eval}
mse(D[is_test, y], predict(f, D[is_test]))
```

Even though our true model is non-linear,
a tree failed to outperform a linear model.
Does that make sense?
Our tree may be too simple to perform a task on predicting a continous value.
The tree is only capable of predicting 9 distinct values,
which make it difficult to compete with a linear model on a continous metric such as squared error.

Indeed it is very easy to build a more complicated (less regulated) tree to outperform the linear model:

```{r sim_data_better_tree}
f2 <- rpart(y ~ x1 + x2, data=D[!is_test], method="anova", model=TRUE,
            control=rpart.control(minsplit=10, cp=5e-4))
uniqueN(f2$where)  # How many terminal nodes?
mse(D[is_test, y], predict(f2, D[is_test]))
```

Put aside regularization and model complexity,
a common quality of split measure for a regression tree is the sum of squares $SS = \sum (y_i -\bar{y})^2$.
So for a given node we find a split such that the decrease in sum-of-squares $SS_T - (SS_L + SS_R)$ is the largest.
$SS_T$ is the $SS$ of current node,
$SS_R$ is $SS$ of the right-hand node resulted from the split,
$SS_L$ the $SS$ of left-hand node.
Note that the prediction of a regression tree node is its mean value.
Hence $SS$ is the squared error of that node.

### A Classification Tree

We use the old-school [UCI Iris](https://archive.ics.uci.edu/ml/datasets/iris) dataset to quickly demo a single classification tree.

```{r rpart_train}
confusion_matrix <- function(y_true, y_pred) {
  table(y_true, y_pred)
}

# Split IRIS dataset into train-test.
is_test_iris <- runif(nrow(iris)) < .3
iris_train <- iris[!is_test_iris,]
iris_test <- iris[is_test_iris,]

# Grow a single classification tree.
f <- rpart(Species ~ ., data=iris_train, method="class")
rpart.plot(f)
```

```{r rpart_eval_train}
# Confusion matrix for the training set.
print(confusion_matrix(iris_train$Species, predict(f, iris_train, type="class")))
```

```{r rpart_eval_test}
# Confusion matrix for the testing set.
print(confusion_matrix(iris_test$Species, predict(f, iris_test, type="class")))
```

A single classification tree model can perform both well in training and testing set for Iris.
Since the dataset is a bit too trivial.
In any real world application,
a single tree is usually not enough to solve the problem.
And this is where we need *more* trees.

## From One Tree to Many Trees

To overcome the shortfall of a single tree,
ensembling technique is used to combine multiple trees into one single model.
There are two very successful ensembling techniques used for tree models:
bagging and boosting.
The general idea is to combine many simple (hence weak) trees as a tree committee for final prediction.
It turns out that the variability in many simple/weak and especially non-linear learners can jointly outperform a single (and usually over-fitting) complicated/strong learner.

### Bagging Ensembles

The most well-known bagging tree model is Random Forest (RF hereafter).
In each single tree in RF the split algorithm is modified to only iterate over a subsample of all features.
This reduces the ability of any single tree to over-fit the training data and create diversity among different trees.
RF also use boostrap sampling on the training dataset when doing feature value iteration.
Rather than searching split on full training set it only searches on a set of boostrapped samples.
Some model variants even perform only a small number of random splits (instead of finding the best split with a full scan) on the feature value iteration phase.

Although it is claimed in theory that RF is strong against over-fitting due to both feature (column) subsampling and boostrap (row) sampling,
in reality RF can still easily over-fit without proper tunning.

One obvious advantage of RF,
in terms of computational efficiency,
is that it is [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel).
Most modern implementations of RF supports single-machine cpu parallel or even distributed computing.

### Boosting Ensembles

Another ensembling technique which has been proven to be even more power empirically is boosting.
We have already discussed the difference between boosting and gradient boosting.
Here we specifically refer to gradient boosting since it usually gives better results and it is our main topic in this notebook.

Unlike RF which trains multiple trees independently and aggregates the results,
gradient boosting trees (GBT hereafter) adopt an additive training procedure,
so each tree is iteratively added into the model.

Similar to RF or any other tree ensembling,
each individual tree is a very weak learner.
But jointly they become very good at generalization.
Boosting trees indeed have achieved quite some of the state-of-the-art performance in real world applications.

Even though GBT has been introduced in the literature for almost 20 years already,
it is the recent 5 years that several very powerful open source implementations have emerged to really shape the practitioners' choice of toolbox.
In the following sections we'd like to deep-dive into two of them:
`xgboost` and `lightgbm`.
We will focus especially on their novelties in bettering GBT to the extreme.

# XGBoost

@Chen2016 open source the [`xgboost`](https://github.com/dmlc/xgboost) package,
probably one of the most influential GBT framework to date.
It also comes with the most variety of langauge API support:
Python, R, Julia, Java, ..., etc.
It is also more or less integrated with distributed computing platform such as Spark,
and cloud solution such as GCP and AWS.

```{r xgb_ver}
print(packageVersion("xgboost"))
```

## Regularization on Impurity

> The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms.
Which is the reason why many people use xgboost.
For model, it might be more suitable to be called as regularized gradient boosting.
^[Quoted from the creator of xgboost: https://www.quora.com/What-is-the-difference-between-the-R-gbm-gradient-boosting-machine-and-xgboost-extreme-gradient-boosting]

One important and novel feature of XGBoost is to introduce regularization in the impurity measure.
Before this,
most tree ensembles focus on using traditional way of regularization,
such as pruning, subsampling, or limiting number of leaves.

In `xgboost` the formulation of the objective function makes its tree ensembles extremely flexible.
Loss can be plug-and-played as well as evaluation metrics.
In general,
the model has the following cost function:

$$
\text{Cost} = \sum_{i=1}^n L(y_i, \hat{y_i}) +
\frac{1}{2}\lambda\vert\vert W \vert\vert^2 +
\alpha\vert\vert W\vert\vert,
$$

where $L(y_i, \hat{y_i})$ is a per-instance loss,
$W$ is the weights of all tree leaves,
$\lambda$ and $\alpha$ are L2 and L1 regularization term,
respectively.

(By default `xgboost` has $\lambda = 1$ and $\alpha = 0$.)

The cost function indeed looks no different than any other differentiable machine learning model we already know.
What makes it novel is that `xgboost` uses the cost as the impurity measure to implement the split finding algorithm in a tree,
which is not differentiable.
But how?

## Additive Training

For convenience let's re-state the gradient boosting model function here,
with a bit more detailed notations:

$$
\begin{aligned}
\hat{y}^{(k)} = F_k(x)
&= \sum_k f_k(x) \\
&= \sum_{t=1}^{k-1}f_t(x) + f_k(x),
\end{aligned}
$$

where $\hat{y}^{(k)}$ is the model prediction for model at round $k$.

Define the loss function (without regularization term for now for simplicity) at round $k$ for $i = 1, ..., n$ examples to be:

$$
\begin{aligned}
\text{Loss} &= \sum_{i=1}^n L(y_i, \hat{y_i}^{(k)}) \\
&= \sum_{i=1}^n L(y_i, \hat{y_i}^{(k-1)} + f_k(x_i)).
\end{aligned}
$$

Here $\hat{y_i}^{(k-1)}$ is purely determined by model already trained in the previous round,
we only care about picking up a tree model $f_k(x_i)$ which can further reduce the loss the most.

When the loss is a squared error,
the function is somewhat trackable:

$$
\begin{aligned}
\text{MSE-Loss} &= \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y_i}^{(k)})^2 \\
&= \frac{1}{n}\sum_{i=1}^n \bigg[
\underbrace{(y_i - \hat{y}^{(k-1)})^2}_\text{Constant}
- 2(y_i - \hat{y_i}^{(k-1)})f_k(x_i)
+ f_k(x_i)^2\bigg] \\
&= \frac{1}{n}\sum_{i=1}^n \bigg[
- 2\underbrace{(y_i - \hat{y_i}^{(k-1)}}_\text{Residual at k-1})f_k(x_i)
+ f_k(x_i)^2\bigg] + \text{Constant}.
\end{aligned}
$$

But when the loss is, say, a cross-entropy loss (which is fairly common),
the function becomes much harder to deal with.

## Second-Order Loss Approximation

In order to allow a general plug-and-play framework for different loss functions,
instead of solving the exact functional form of a particular loss,
`xgboost` approximates the loss with a second-order [Taylor series]((https://en.wikipedia.org/wiki/Taylor_series)) expansion.

The 2nd-order expansion of a function $L(x)$ at point $x_0$ can be written as:

$$
L(x) \approx L(x_0) + L'(x_0)(x - x_0) + \frac{1}{2}L''(x_0)(x - x_0)^2.
$$

Now define:

$$
x - x_0 = a,
$$

we then have:

$$
L(x_0 + a) \approx L(x_0) + L'(x_0)a + \frac{1}{2}L''(x_0)a^2,
$$

or simply:

$$
L(x + a) \approx L(x) + L'(x)a + \frac{1}{2}L''(x)a^2.
$$

Use the form we can re-write our loss function:

$$
\begin{aligned}
\text{Loss}
&= \sum_{i=1}^n L(y_i, \hat{y_i}^{(k)}) \\
&= \sum_{i=1}^n
L(y_i, \hat{y_i}^{(k-1)} + f_k(x_i)), \\
\text{Approx. Loss} &= \sum_{i=1}^n
\Bigg[
\underbrace{
\vphantom{\frac{\partial L(y_i, \hat{y_i}^{(k-1)})}{\partial \hat{y_i}^{(k-1)}}}
L(y_i, \hat{y_i}^{(k-1)})}_\text{Constant} +
\underbrace{\frac{\partial L(y_i, \hat{y_i}^{(k-1)})}{\partial \hat{y_i}^{(k-1)}}}_{g_i}f_k(x_i) +
\underbrace{\frac{\partial^2 L(y_i, \hat{y_i}^{(k-1)})}{\partial^2 \hat{y_i}^{(k-1)}}}_{h_i}f_k(x_i)^2
\Bigg] \\
&= \sum_{i=1}^n
\Bigg[
g_if_k(x_i) +
h_if_k(x_i)^2
\Bigg] + \text{Constant}.
\end{aligned}
$$

We follow the author's notation on the first order term as $g_i$ (g for gradient) and the second order term as $h_i$ (h for [Hessian](https://en.wikipedia.org/wiki/Hessian_matrix)).
Together with the regularization term it becomes the quality of split measure which ultimately guides the model to pick up the next learner $f_k$ at round $k$.

Essentially XGBoost employs a functional gradient descent up to the second-order derivative.
This makes it more precise in finding the next best learner.

Note that the value of $g_i$ and $h_i$ both remain the same for the current round,
meaning that the computing cost is largely reduced when we are evaluating among different tree splits.
This is one of the reason why XGBoost is so much faster than any other vanilla implementation of GBT.

## Tree Split Optimization

Now let's formally parameterize a tree model $f_k(x_i)$ with $T$ terminal nodes (leaves),
$w_j$ denotes the weight for $j$-th leaf,
and $I_j$ the set of examples assigned to leaf $j$.
The approximated loss (with regularization term this time) at round $k$ can then be re-written in a *group-by-leaf* summation:

$$
\begin{aligned}
Cost^k &=
\sum_{i=1}^n
\Bigg[
g_if_k(x_i) + h_if_k(x_i)^2
\Bigg] +
\frac{1}{2}\lambda\vert\vert W \vert\vert^2 +
\alpha\vert\vert W\vert\vert \\
&= \sum_{j=1}^T
\Bigg[
\sum_{i \in I_j} g_iw_j + \frac{1}{2}\sum_{i \in I_j}h_iw_j^2
\Bigg] +
\frac{1}{2}\lambda\sum_{j=1}^T w_j^2 + \alpha\sum_{j=1}^T w_j \\
&= \sum_{j=1}^T
\Bigg[
\sum_{i \in I_j} g_iw_j + \frac{1}{2}(\sum_{i \in I_j}h_i + \lambda)w_j^2 + \alpha\vert w_j\vert
\Bigg].
\end{aligned}
$$

Given the cost function we can solve for optimal weight $w_j^*$ minimizing the cost with a first-order-condition:

$$
\frac{\partial Cost^k}{\partial w_j} = 0,
$$

which gives:

$$
\sum_{i \in I_j} g_i + (\sum_{i \in I_j}h_i + \lambda)w_j + \alpha\frac{w_j}{\vert w_j\vert} = 0.
$$

For a null L1 penalty ($\alpha = 0$) the solution is simply:

$$
\begin{aligned}
w_j^* &= - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j}h_i + \lambda}, \\
Cost^{k*} &= - \frac{1}{2} \sum_{j=1}^T \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j}h_i + \lambda}.
\end{aligned}
$$

Now we have solved the weights that minimize the cost once $T$ is determined.
We will also need the optimized cost to determine the best split.
That is,
the quality of split measure.
The best split should result in two nodes such that the total cost of two nodes is reduced the most from that of their parent node.

A learning rate can be applied to the optimal weights at each iteration,
making the ensembler more conservative in updating itself.

Here is a pseudo code describing the core algorithm:

```
[XGBoost Additive Training Algorithm]

F: initial model
M: total number of features
N: total number of examples
K: total number of iterations (tree models)
T: total number of leaves
g_ik: g_i for example i at round k
h_ik: h_i for example i at round k
g_ijk: g_i for example i at round k assigned to leaf j
h_ijk: h_i for example i at round k assigned to leaf j
cost: the score at leaf j, sum(g_ijk)^2 / (sum(h_ijk) + lambda)
x_im: the i-th example value on feature m

for round k in 1 to K:
  while number of split < T:
    compute g_ik, h_ik for all i
    for feature j in 1 to m:
      sort x_j
      for example x_im in i = 1 to n on feature m:
        split by x_im into left and right leaves
        compute sum of cost on left and right leaves
    find best split using feature m* at x_im* associated with minimum cost
  update model F <- F + tree learned
```

### Common Loss Functions

`xgboost` already comes with several loss functions built-in.
We can also define our custom loss easily by just implementing the first ($g$) and second order ($h$) gradients of that loss w.r.t. model output.
Here we simply review the two mostly used loss functions.

#### Squared Loss {-}

Deriving $g$ and $h$ for a squared loss is trivial:

$$
\begin{aligned}
g_i &= \frac{\partial(y_i - \hat{y}_i^{(k-1)})^2}{\partial\hat{y}_i^{(k-1)}} = 2(\hat{y}_i^{(k-1)} - y_i), \\
h_i &= \frac{\partial^2(y_i - \hat{y}^{(k-1)})^2}{\partial^2\hat{y}^{(k-1)}} = 2.
\end{aligned}
$$

We can ignore constant so the gradient $g_i$ is simply $\hat{y} - y$ and the hessian is unity.

Let's use R's *symbolic derivatives* to verify our derivation:

```{r symbolic_deriv_squared_loss}
dy2y <- deriv3(~ (y - yhat)^2, c("yhat"))

yhat <- rnorm(5)
y <- rnorm(5)

2*(yhat - y)  # Gradient by hands.

eval(dy2y)  # By symbolic derivatives.
```

#### Cross-Entropy Loss {-}

A subtle details must be taken care for cross-entropy loss.
Intuitively we'd like to just take the derivative w.r.t. the model output probability:

$$
\begin{aligned}
g_i &= \frac{\partial\big[- y_i\ln\hat{y}_i^{(k-1)} - (1 - y_i)\ln(1 - \hat{y}_i^{(k-1)})\big]}
{\partial\hat{y}_i^{(k-1)}} =
\frac{\hat{y}_i^{(k-1)} - y_i}{(1 - \hat{y}_i^{(k-1)})\hat{y}_i^{(k-1)}}, \\
h_i &= \frac{\partial^2\big[- y_i\ln\hat{y}_i^{(k-1)} - (1 - y_i)\ln(1 - \hat{y}_i^{(k-1)})\big]}
{\partial^2\hat{y}_i^{(k-1)}} = \frac{1 - y_i}{(1 - \hat{y}_i^{(k-1)})^2} + \frac{y_i}{(\hat{y}_i^{(k-1)})^2}.
\end{aligned}
$$

Again use symbolic differentiation to verify our derivation:

```{r symbolic_deriv_log_loss}
# Demonstrate symbolic derivatives.
dq2q <- deriv3(~ - (y*log(q) + (1-y)*log(1 - q)), c("q"))
y <- sample(c(0, 1), 5, replace=TRUE)
q <- runif(5)

(1 - y)/(1 - q) - y / q  # Gradient by hands.

(1 - y)/(1-q)^2 + y / q^2  # Hessian by hands.

eval(dq2q)  # By symbolic derivatives.
```

The above derivation assumes $\hat{y}$ is the sigmoid model output (probability).
While actually in `xgboost`'s implementation for logistic regression the term $\hat{y}$ is *before* sigmoid transformation (what the auther refers to as *margin*).
So the model weights are raw scores (*logits*) instead of probabilities.
This makes sense since the model is additive.
It is better to do the add operation before the final sigmoid transformation of the ensembler.
That is,
each tree predicts raw scores and the final model ensemble is the sigmoid of the sum of all the raw scores (from the assigned leaves).
This also largely simplifies the derivatives because now what we will be doing is:

$$
g_i = \frac{\partial L}{\partial t}\frac{\partial t}{\partial\hat{y}},
$$

where $t$ is the sigmoid $t(\hat{y}) = \frac{1}{1 + \exp(-\hat{y})}$.

Based on the fact that $\frac{\partial t}{\partial \hat{y}} = t(1 - t)$,
we finally have (simplying $\hat{y}_i^{(k-1)}$ as $\hat{y}$ to save typings):

$$
\begin{aligned}
g_i &= \frac{t - y_i}{t(1 - t)} \times t(1-t) \\
&= t(\hat{y}) - y_i, \\
h_i &= t(\hat{y})(1 - t(\hat{y})).
\end{aligned}
$$

### Exercise: Implement a Single XGB Split

To verify our understanding,
as an exercise we built a single tree with just one split (a *decision stump*).
And we comapre the result of our implementation to that of `xgboost` to see if they agree with each other.

```{r xgb_one_split_from_scratch}
lambd <- 1  # L2 regularization.
all_features <- c("x1", "x2")

cost <- function(df) {
  # The XGB optimal cost based on 2nd-order Taylor expansion.
  - sum(df$g)^2 / (2*(sum(df$h) + lambd))
}

cost_on_split <- function(D, split_ind) {
  # Split the data into left and right child nodes.
  DL <- D[1:split_ind, .(g, h)]
  DR <- D[(split_ind + 1):nrow(D), .(g, h)]
  cost(DL) + cost(DR)
}

feature_iteration <- function(D, all_features) {
  feature_best_costs <- list()
  best_split_ind <- list()
  best_split_val <- list()
  for ( x in all_features ) {
    # Sort examples by x.
    setorderv(D, x)

    # Feature value iteration.
    split_cost <- rep(NA_real_, nrow(D) - 1)
    for ( i in seq_len(nrow(D) - 1) ) {
      split_cost[i] <- cost_on_split(D, i)
    }
    feature_best_costs[[x]] <- min(split_cost)
    best_split_ind[[x]] <- which.min(split_cost)
    best_split_val[[x]] <- D[[x]][best_split_ind[[x]]]
  }

  # Determine which feature gains the largest reduction in cost.
  split_col <- names(which.min(feature_best_costs))
  list(split_col=split_col,
       split_ind=best_split_ind[[split_col]],
       split_val=best_split_val[[split_col]])
}


y <- D$y[!is_test]
dy2y <- deriv3(~ (y - yhat)^2, c("yhat"))
yhat <- .5  # This is the initial prediction used by xgboost.
dy <- eval(dy2y)
D <- D[!is_test, g:=attr(dy, "gradient")[,"yhat"]]
D <- D[!is_test, h:=attr(dy, "hessian")[,"yhat","yhat"]]

# Feature iteration on the root split.
r <- feature_iteration(D[!is_test], all_features)
print(r)
```

```{r xgb_tree_learner_verify}
# Verify the result with a real xgboost implementation.
bst3 <- xgb.train(params=list(objective="reg:squarederror", max_depth=1),
                  data=Dx, nround=1)
xgb.model.dt.tree(model=bst3)
```

So the optimal split at root node from our toy implementation agrees with that of `xgboost`.
Note that the split value reported by `xgboost` does not correspond to an actual value in the feature.
It is a virtual value just for splitting purpose.
Indeed it is the average value between the closest two values at the split.
To check this:

```{r check_split_val}
D2 <- D[!is_test]
setorder(D2, x1)
print(mean(D2[r$split_ind:(r$split_ind + 1), x1]))
```

### Parallelization

Unlike bagging trees which are embarassingly parallel,
boosting trees are trained iteratively,
which makes them harder to parallelize.
We can, however, still parallelize the training process,
not at tree-level but at feature-level.

Since the algorithm needs to traverse over all (or a subset of) features to find the best split,
feature iteration is readily parallelable.
All modern GBT libraries support this kind of parallelization.

### Comparing to Vanilla GBT

In vanilla gradient boosting at each iteration we fit a tree to the pseudo residuals,
i.e., negative gradient of loss function w.r.t. model output.
The tree (or any other base learner) is learning from the pseudo residuals with its own loss.
Such loss is local and may not reflect the global objective we want to optimize.
XGBoost incorporates that global objective (with regularization) into each iteration,
making the base learner more relevant to our end goal.

The second-order approximation of global objective further reduces the computing cost by converting the loss computation into a group-by-leaf sum operation,
with all gradients pre-computed at the beginning of each iteration.

Notice that in a vanilla GBT we need to calculate the gradients to get the pseudo-residuals,
and the base learners need to calculate the gradients of their own objective function (or calculate the impurity measure) in order to do optimization.
XGBoost hence is much faster than vanilla GBT in training.

## Approximated Tree Split

The exact split finding algorithm may not be scalable in large applications.
So instead `xgboost` supports an approximated split algorithm described in the following pseudo code:
^[Approximated split search is not particularly novel in XGBoost.
It has been well studied in other academic works.]

```
[Approximated Tree Split Algorithm]

m: total number of features
s: score to measure the quality of a split
eps: percentile increment, dividing feature roughly into 1/eps buckets
x_pj: the p-th example percentile value on feature j

for feature j in 1 to m:
  divided feature j into buckets based on eps
  aggregate s for each bucket

for feature j in 1 to m:
  for all value x_pj on feature j:
    compute s based on bucketed aggregation

do split using j* at x_pj* associated with maximum s
```

The idea is to split only at percentile value,
introducing one additional parameter called `sketch_eps` (default at 0.03 in `xgboost`) to control the granularity of the buckets.
So the number of search for each continous variable will be capped roughly at `1/sketch_eps` instead of the number of distinct values,
which can be huge.

By default in `xgboost` the bucketization is done only once at the model initialization phase in each iteration.
All subsequent splits are performed on the same set of buckets for each feature.
It is also possible to re-propose new buckets after each optimal split,
as pointed out in the original paper,
but such control is not yet open to end user in `xgboost` interface as of `r format(Sys.time(), '%Y-%m-%d')`.

More specifically,
in `xgboost` the parameter `tree_method` controls the type of splitting algorithm:

+ `tree_method="exact"`: The exact splitting algorithm by default when dataset is small
+ `tree_method="approx"`: The approximated splitting algorithm
+ `tree_method="hist"`: The histogram tree approach (refer to [LightGBM section](#lgb) for details)

## Sparsity Awareness: Split on Missing

One novelty in `xgboost` is its capability of dealing with missing values.
It solves the missing value problem by learning a default branch (either the left or the right child node) for each node in a split.
During training,
in each node the missing value can go to either left or right branch.
Quality score for both branches are now computed based on the assumption that missing values should go to the other branch.
In the end the split algorithm returns not only optimal split but also optimal default branch for missing value at each node.

`xgboost` is indeed not the first tree model to handle missing values in its own way.
For example,
`rpart` also handles missing value in a particular manner.
In `rpart`,
each split will come with additional surrogate variables to handle cases where the split feature value is missing at inference time.
A surrogate variable is a feature other than the best split feature but used to classify the resulting child nodes passing a minimum quality criterion.
For more details readers may refer to @rpart.

Comparing to `rpart`'s surrogate variable approach,
which increases computing cost,
`xgboost`'s approach is more computationally efficient and seems to also result in better model accuracy.

## Random Forest via XGBoost

`xgboost` is a general framework for model (especially tree) ensemblers.
Since there is no rule stopping us from building more than one tree at each iteration,
we can effectively grow multiple trees with only one round to train a random forest with `xgboost`.
Even more,
we can train a *boosting random forest* by growing multiple trees and with multiple iterations.
That is,
we use RF as our base learner in a gradient boosting machine.

# LightGBM {#lgb}

@ke2017lightgbm open source the [`lightgbm`](https://github.com/microsoft/LightGBM) package,
yet another powerful GBT framework with its own remarkable novel implementation.
^[As of `r format(Sys.time(), '%Y-%m-%d')` `lightgbm` is not yet available on [CRAN](https://cran.r-project.org) for its R wrapper.
Please refer to the [official instruction](https://github.com/microsoft/LightGBM/tree/master/R-package) to install it separately.]

```{r import_lgb, message=FALSE}
library(lightgbm)
packageVersion("lightgbm")
```

Here is a simple training task using `lightgbm` on our simulated data:

```{r lgb_tree_learner}
Dx2 <- lgb.Dataset(as.matrix(D[!is_test, .(x1, x2)]), label=D[!is_test, y])
Dt2 <- lgb.Dataset(as.matrix(D[is_test, .(x1, x2)]), label=D[is_test, y])

# We set the parameters such that lgb behaves like xgb's default for common parameters.
lgbst <- lgb.train(params=list(objective="regression", boosting="gbdt", lambda_l2=1,
                               min_sum_hessian_in_leaf=1, min_data_in_leaf=0),
                   data=Dx2, valids=list(test=Dt2),
                   num_iterations=100, early_stopping_rounds=5, verbose=0)

yhat <- predict(lgbst, as.matrix(D[is_test, .(x1, x2)]))
mse(D[is_test, y], yhat)
```

In general `lightgbm` runs even faster than `xgboost` in training,
but with comparable model accuracy.
This is achieved by a combination of different strategies (some of them novel).
We will walk-through each of them in the following sections.

## Histogram Tree (Quantization)

`lightgbm` uses histogram-based tree split algorithm.
^[Histogram tree is not particularly novel in `lightgbm`.
The approach has been proposed for quite long time in the literature and there are also GBT libraries already implemented the approach.]

Instead of splitting at raw feature values histogram-based approach bucketizes continous features into discrete bins,
forming a limited number of feature bins (in `lightgbm` this is default `max_bin=255`) and then split on those bins.
This largely reduces the complexity from $O(n)$ to $O(\text{max_bin})$ during split finding for a feature.
^[The complexity of building the histogram itself is still $O(n)$ but the operation is simple and negligible comparing to the split finding.]

### Greedy Equal-Density Binning

There are two strategies in binning:
equal bin length [@li2008mcrank] or equal bin density.
`lightgbm` adopts the latter.
So number of data points will be roughly the same for most bins.
This effectively transforms feature distribution into a uniform one when finding the best split.

`lightgbm` doesn't use the entire training dataset to construct the boundaries of bins.
Instead it use a random subset (only applicable when data size is huge enough).
This random sample size can be controlled by parameter `bin_construct_sample_cnt=200000`.
Additionally,
`min_data_in_bin=3` can be used to control minimal number of data inside one bin.

Note that this approach is different from the percentile splitting we've discussed for approximated tree split implemented in `xgboost`:

> The existing approximate splitting method in `xgboost` also buckets continuous features into discrete bins to speed up training.
The `approx` method generates a new set of bins for each iteration,
whereas the `hist` method re-uses the bins over multiple iterations.
^[https://github.com/dmlc/xgboost/issues/1950]

### Histogram Subtraction Trick

To calulate the information gain from a split we need to calculate the cost for the resulting two nodes.
Remember that a cost of a single node $j$ is:

$$
\text{Cost of Node j} = - \frac{1}{2} \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j}h_i + \lambda}.
$$

A key observation here is that what we need is simply summation of gradients inside the node.
Since an example is assigned to either left ($I_j^L$) or right ($I_j^R$) child after a split,
apparently we have:

$$
\begin{aligned}
\sum_{i \in I_j} g_i &= \sum_{i \in I_j^L} g_i + \sum_{i \in I_j^R} g_i, \\
\sum_{i \in I_j} h_i &= \sum_{i \in I_j^L} h_i + \sum_{i \in I_j^R} h_i.
\end{aligned}
$$

This means that when calculating the total gain of a split,
we only need to calculate the cost of one node.
And the cost of the other node can be obtained by first substracting the gradient sums from that of the parent,
then squaring the first-order gradient sum before pluging into the cost function to arrive at the final cost of the other node.
To save even more resources,
we can always choose to calculate on the smaller child node.
This is one key factor why `lightgbm` can run even faster than `xgboost`'s default or the approximated tree method.

### Speed-Accuracy Tradeoff

In the original paper it is demonstrated with experiments that a total number of 255 bins (plus one bin specifically for zeroes) can perform very well.
We can always choose more bins for potentially better accuracy,
but at the price of slowering down the training speed.

We can use a single decision stump to easily see the effect of `max_bin`.
First let's run with the default setting:

```{r lgb_decision_stump}
lgbst2 <- lgb.train(params=list(objective="regression", boosting="gbdt", lambda_l2=1,
                                num_leaves=2),
                    data=Dx2, num_iterations=1, verbose=0)

lgb.model.dt.tree(model=lgbst2)[]
```

The optimal root split coincide with `xgboost` and also our toy implementation.
Note that we have far more than `max_bin=255` distinct values in our splitting variable:

```{r distinct_count}
distinct_cnt <- uniqueN(D[!is_test, x1])
print(distinct_cnt)
```

Instead of performing `r distinct_cnt - 1` splits LightGBM only performs 255 splits to locate the same optimal split.

Now let's reduce the maximal number of bins to a rather small number:

```{r lgb_decision_stump_large_bin}
# We need to re-construct the dataset since lgb caches the bins in dataset.
Dx2 <- lgb.Dataset(as.matrix(D[!is_test, .(x1, x2)]), label=D[!is_test, y])
lgbst3 <- lgb.train(params=list(objective="regression", boosting="gbdt", lambda_l2=1,
                                num_leaves=2, max_bin=10),
                    data=Dx2, num_iterations=1, verbose=0)

lgb.model.dt.tree(model=lgbst3)[]
```

By lowering down `max_bin` now we end up with a sub-optimal split,
with smaller gain for the split.

### Handling Sparsity

Histogram tree cannot be fully optimized for sparse data though.
Since no matter a feature value is zero or not,
its bin value needs to be accessed.
As a result,
to speed up training runtime there must be other strategies targeted at sparsity optimization.

## Gradient-Based One-Side Sampling

To reduce the training time one idea is to reduce number of splits in finding the best split.
Without any approximation or quantization,
for each feature the algorithm needs to traverse over all distinct feature values to find the best split.
This can be prohibitively slow for large applications.
`xgboost` tackles this with its approximated tree split using only percentiles as split value.
`lightgbm` has several different strategies combined together.
The histogram tree method we just discussed is one of them.
Another such strategy is called Gradient-Based One-Side Sampling (GOSS).

When finding the best split,
GOSS down-samples the examples with their absolute gradient sizes as sampling weights.
It is based on the fact that examples with smaller gradients generally means they are already well-trained (and hence the error is smaller).
Examples with larger gradients,
instead,
should be the focus for the model to imporve further.
The concept is borrowed from AdaBoost,
where examples are re-weighted at each boosting round to guide the model on learning the harder examples.

GOSS keeps examples with large gradients and only do sampling on examples with small gradients.
Since this will screw up the distribution,
which in turn hurts the model's ability to learn,
GOSS compensates such effect by applying a constant factor $\frac{1 - a}{b}$ to sampled examples with small gradient to "re-weight" the data back to its original distribution.
Here $a$ is the top percentage of examples to keep,
sorted descendingly by gradients,
and $b$ the sampling rate for the rest of the examples (with smaller gradients than those top ones).
The authors in their paper provide the theoretical ground to support that GOSS can approximate well when the data size is huge.

To use GOSS in `lightgbm` we need to specifiy the parameter `boosting="goss"`.
By default `booster="gbdt"` is just the modern GBT with leaf-wise growing histogram trees.

## Exclusive Feature Bundling

GOSS deals with data in long size.
What about very wide (high dimentionality) data?
In such case,
`lightgbm` adopts a strategy to group together features that rarely co-exist into single feature and do the split scan.
This is referred to as Exclusive Feature Bundling, or EFB.

`lightgbm` postulates EFB as a [graph coloring problem](https://en.wikipedia.org/wiki/Graph_coloring) by taking features as vertices and adding edges for every two features if they are not mutually exclusive.
To speed up even more,
it introduces a certain level of "pollution" by allowing features that are not 100% mutually exclusive to be also bundled together.
This can be controlled by parameter `max_conflict_rate=0`.
The higher the rate the more compromises for the model accuracy.

EFB can be enabled by parameter `enable_bundle` (and by default it is on already).

## Leaf-Wise Tree Split

Leaf-wise tree split is also called best-first split [@shi2007best].
Classical trees grow their nodes in a depth-first order:
at each layer all nodes are split in order to construct the next layer.
The final tree layout can be asymmetric due to post-training pruning or other regularization techniques,
but the growth policy is to always go deeper first for every node at the same level.

A leaf-wise growing tree doesn't always split every node at the current layer down to the next layer.
Instead,
it chooses, for every split, the node that results in the best gain.
Since the next best split is not necessarily at one of the two nodes just created,
the tree may well grow horizontally (leaf-wise) instead of always going deeper (depth-wise).

Empirically,
a single leaf-wise growing tree tends to over-fit more than a depth-wise growing tree.
But as a weak learner (with a shallower structure) in gradient boosting it can outperform the depth-wise approach.
To further regularize the tree,
`lightgbm` also provides `max_depth` to limit the depth of the tree,
even though it is growing leaf-wise.

## Optimal Categorical Encoding

Usually we use one-hot encoding to represent categorical variables.
To be memory-friendly,
such data input is usually a sparse matrix.
But when a categorical feature has high cardinality (often with a skewed distribution),
one-hot encoding hurts the performance of a tree learner.

Here is a direct quote from `lightgbm` 's official doc:

> Instead of one-hot encoding,
> the optimal solution is to split on a categorical feature by partitioning its categories into 2 subsets.
> If the feature has k categories,
> there are `2^(k-1) - 1` possible partitions.
> But there is an efficient solution for regression trees.
> It needs about `O(k * log(k))` to find the optimal partition.
> The basic idea is to sort the categories according to the training objective at each split.
> More specifically,
> LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (`sum_gradient` / `sum_hessian`) and then finds the best split on the sorted histogram.

Readers interested can refer to @fisher1958grouping for the details of the solution to optimal partition in a categorical variable.
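
The following is a toy sketch of our own (with a simplified gain and no regularization) of the `O(k * log(k))` idea: aggregate gradients and hessians per category, sort categories by `sum_gradient / sum_hessian`, and then scan the sorted order just like a numerical feature.

```{r optimal_cat_split_sketch}
# Toy sketch of the sorted-category split search (not lightgbm's actual code).
set.seed(777)
cat_x <- sample(1:6, 500, replace=TRUE)
g <- rnorm(500) + .5 * cat_x   # Pretend these are per-example gradients.
h <- rep(1, 500)               # Hessians (unity, as for squared loss).

G <- tapply(g, cat_x, sum)
H <- tapply(h, cat_x, sum)
ord <- order(G / H)            # Sort categories by sum_gradient / sum_hessian.

best <- -Inf
for (i in seq_len(length(G) - 1)) {
  left <- ord[1:i]
  GL <- sum(G[left]); HL <- sum(H[left])
  GR <- sum(G) - GL;  HR <- sum(H) - HL
  gain <- GL^2 / HL + GR^2 / HR - sum(G)^2 / sum(H)  # Simplified structure-score gain.
  if (gain > best) { best <- gain; best_left <- names(G)[left] }
}
best_left  # Categories assigned to the left branch at the best split.
```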

## Parallel Learning

We've mentioned that GBT is an iterative training algorithm,
so theoretically the parallelization must happen at a level other than across base learners.
In `lightgbm` parallelization is seriously implemented at several such levels,
with support for both multi-threading and multi-machine runtimes.

(By default `tree_learner=serial` so there is no parallel learning.)

### Feature Parallel

```
tree_learner=feature
```

This is the most straightforward parallelization we can think of.
Data is partitioned vertically so each thread only inspects a subset of features.
Best splits are searched simultaneously within those feature subsets to get the local best splits,
then a communication step is required to arrive at the global best.
The speedup is limited if our data is not wide but very long.

### Data Parallel

```
tree_learner=data
```

Instead of vertical parallelization,
`lightgbm` also supports horizontal parallelization.
A local gradient histogram is constructed for each data partition and then merged into a global histogram for finding the best split.
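
The merge step is literally an element-wise sum of per-partition histograms. A tiny sketch of our own:

```{r data_parallel_sketch}
# Toy sketch of merging local gradient histograms (not lightgbm's actual code).
set.seed(777)
x <- runif(1000)
g <- rnorm(1000)
bins <- cut(x, breaks=seq(0, 1, by=.25))        # 4 bins of one feature.
part <- sample(1:2, 1000, replace=TRUE)         # Two data partitions (workers).

local_hist <- tapply(g, list(part, bins), sum)  # Per-partition histogram of gradient sums.
global_hist <- colSums(local_hist)              # Merge: element-wise sum across partitions.

rbind(local_hist, global=global_hist)
```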

### Voting Parallel

```
tree_learner=voting
```

@NIPS2016_6381 propose the idea of the parallel voting tree.
The general idea can be summarized into the following flow:

1. Partition data across multiple threads/machines
2. In each partition a local voting is done to propose the top-k features for the best split
3. The top-2k features are collected based on all local votes
4. Full histograms for the top-2k features are collected and a global best split is searched within them

Such a parallel voting strategy has limited communication overhead yet approximates the full search very well in accuracy.
In `lightgbm`,
by default when doing parallel voting we have `top_k=20`.

In general,
since `lightgbm` is already lightning fast,
only for very large data do we need to resort to parallel voting.
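
A rough sketch of the voting flow above (ours, purely illustrative): each partition ranks features by its local split gains and votes for its top-k; globally we keep the 2k most-voted features, and only those need their full histograms merged.

```{r voting_parallel_sketch}
# Toy sketch of the two-stage voting flow (not lightgbm's actual code).
set.seed(777)
n_feat <- 50; n_part <- 4; k <- 5
# Pretend each partition has computed a local gain for every feature.
local_gain <- matrix(rexp(n_feat * n_part), nrow=n_part)

# Stage 1: each partition votes for its local top-k features.
votes <- apply(local_gain, 1, function(gain) order(gain, decreasing=TRUE)[1:k])

# Stage 2: keep the 2k features with the most votes; only they need full histograms.
vote_count <- table(factor(as.vector(votes), levels=seq_len(n_feat)))
candidates <- as.integer(names(sort(vote_count, decreasing=TRUE)))[1:(2 * k)]
candidates
```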

# XGBoost v.s. LightGBM

`xgboost` and `lightgbm` have a lot in common;
even their APIs (in both Python and R) are highly similar to each other.
They use the same global loss function with a 2nd-order approximation and regularization to optimize the ensemble.
They allow plug-and-play custom losses and evaluation metrics.
They handle missing values in the same way by learning a default branch at each split,
and so on.

Still,
there are subtle differences we should be aware of.
In this section we discuss several important differences between the two.

## On Tree Type

As a matter of fact,
since the histogram-based approach together with leaf-wise growth is so powerful,
after `lightgbm` entered the community,
`xgboost` quickly caught up with the same implementation.
So in `xgboost` we can set the following parameters

```
tree_method="hist"
grow_policy="lossguide"
```

to train an XGBoost model with leaf-wise growing histogram-based trees,
just like LightGBM.

## On Base Learner

`lightgbm` only supports tree learner (using the `boosting` parameter) while `xgboost` also supports linear learners (using the `booster` parameter).

In `xgboost` two linear learners are supported:
a simple linear regressor and a generalized linear model with the [tweedie distribution](https://en.wikipedia.org/wiki/Tweedie_distribution).

For tree learners, beyond the standard regression tree both libraries support two additional tree models:

1. Random Forest
2. DART [@rashmi2015dart]

One notable difference in RF is that while `xgboost` allows training a boosted RF,
`lightgbm` only supports training a regular RF (there is no boosting involved).
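
For reference, a minimal parameter sketch for switching base learners (parameter names as we understand them from each library's documentation; to our knowledge `xgboost` emulates a forest via `num_parallel_tree` plus sub-sampling):

```
booster="gbtree"       (xgboost; or "gblinear", "dart")
num_parallel_tree=1    (xgboost; > 1 grows a forest of trees per boosting round)
boosting="gbdt"        (lightgbm; or "rf", "dart", "goss")
```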

## On Hyperparameters

There are tons of hyperparameters in the modern GBT models discussed in this notebook.
Here we only walk through the important and non-trivial ones.

### Regularization Parameters

#### L1/L2 Penalties

Both packages implement the same penalties on cost function.
Only the names and default values are different.
In `lightgbm` L1/L2 penalties are defined by:

```
lambda_l1=0
lambda_l2=0
```

And in `xgboost` they are:

```
alpha=0
lambda=1
```

#### Leaf Size

One way to regularize a tree model is to limit the size of its terminal node (leaf).
In `lightgbm` this can be done by either limiting the minimum number of examples or limiting the minimum sum of hessian ($\sum_{i \in I_j}h_i$ for node $j$).
Here are the default values:

```
min_data_in_leaf=20
min_sum_hessian_in_leaf=1e-3
```

In `lightgbm` an improper value for these parameters usually results in the following warning message during training:

```
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
```

In such a case,
try lowering the value,
since the algorithm is complaining that it cannot split further because the minimum node size limit has been reached,
given the targeted number of leaves.
Of course another cure is to reduce tree complexity by decreasing `num_leaves`.

In `xgboost`, to control leaf size we can only limit the hessian sum but not the exact example count:

```
min_child_weight=1
```

Remember that in the case of a squared loss the hessian is unity (after normalization),
so the hessian sum is simply the example count.

Notice the difference in the default values between `lightgbm` and `xgboost`:
the former has less tolerance for over-fitting (it is more regularized by default).

#### Tree Size

A larger tree has more leaves and is hence more complicated.
More leaves also mean more interaction among variables,
which in turn controls the degree of non-linearity in the model.
^[When a subsequent split conditions on a previous split on another variable,
we consider this a variable interaction.]

Since `lightgbm` grows a tree leaf-wise,
there are two parameters to control tree size:

```
max_depth=-1
num_leaves=31
```

`num_leaves` is the main factor,
and `max_depth` supplements it.


In `xgboost` one can choose to grow a tree either depth-wise (the default) or leaf-wise.
For a depth-wise tree there is only one relevant parameter:

```
max_depth=6
```

For a leaf-wise growing tree, the number of leaves is controlled by:

```
max_leaves=0
```

For a leaf-wise tree to grow as many leaves as a depth-wise tree,
it usually has to grow deeper than the depth-wise one.
This is why simultaneously controlling both the number of leaves and the tree depth is a good strategy.
Be aware that by default `lightgbm` doesn't limit tree depth at all.
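
A quick sanity check on how the two parameters interact: a depth-wise tree of depth $d$ has at most $2^d$ leaves, so a leaf-wise tree capped at, say, 31 leaves needs a depth of at least 5 to realize them all, and in practice often grows deeper.

```{r leaves_vs_depth}
# Maximum number of leaves a perfectly balanced depth-wise tree of depth d can have.
d <- 1:8
data.frame(max_depth=d, max_leaves=2^d)
# With the lightgbm default num_leaves=31, any max_depth >= 5 is not binding for a
# balanced tree, but a leaf-wise tree may still grow deeper if left unconstrained.
```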

#### Sub-Sampling

Both libraries support column sampling and row sampling.
In `lightgbm` we have:

```
bagging_freq=0
bagging_fraction=1
feature_fraction=1
feature_fraction_bynode=1
```

Note that `bagging_freq` controls how often (in iterations) the bagging is re-done.
So `bagging_freq=1` does row sampling at the start of every iteration.
We can even subsample only positive or negative examples (for binary classification),
using `pos_bagging_fraction` or `neg_bagging_fraction`.
This can be handy when dealing with imbalanced datasets.

For feature sampling,
`feature_fraction` controls the sampling at the beginning of each iteration,
while `feature_fraction_bynode` controls the sampling at each tree node.
The latter is in general more regularizing since the model training is subject to stronger randomness.

In `xgboost` we have the following relevant parameters:

```
subsample=1
colsample_bytree=1
colsample_bylevel=1
colsample_bynode=1
```

It has less control over bagging:
only one parameter, `subsample`, is used,
and the bagging is done at the beginning of each iteration.
It has more control over column sampling, though.
Besides column sampling at the iteration or node level,
it also supports column sampling at the tree-depth level.

Note that in both libraries the bagging is done *without replacement*.
So neither of them implements bootstrap bagging,
which is what the original random forest paper proposes.
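
To put the two naming schemes side by side, here is a hedged sketch of roughly comparable sub-sampling settings (the 0.8 values are arbitrary, purely for illustration):

```
bagging_freq=1          (lightgbm)
bagging_fraction=0.8    (lightgbm)
feature_fraction=0.8    (lightgbm)
subsample=0.8           (xgboost)
colsample_bytree=0.8    (xgboost)
```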

## On Imbalanced Data

### Class Re-Weighting

Based purely on hyperparameters, there are two ways to handle an imbalanced dataset in a binary classification task.
The first approach is to re-weight the examples.
Both `xgboost` and `lightgbm` have a parameter for this purpose:

```
scale_pos_weight=1
```

To re-balance the distribution to 50-50 the weight should be:

```
number of negative samples / number of positive samples
```

This helps the model's general ranking performance (metrics such as AUC can improve a lot).
For `lightgbm` there is also a convenient boolean parameter that does exactly this:
`is_unbalance`.
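
In code, the re-weighting factor is just a ratio of class counts; a minimal sketch (`y` below is a hypothetical 0/1 label vector):

```{r scale_pos_weight_sketch}
# Hypothetical 0/1 labels for a heavily imbalanced binary task.
set.seed(777)
y <- rbinom(10000, 1, .03)

scale_pos_weight <- sum(y == 0) / sum(y == 1)
scale_pos_weight  # Pass this value as scale_pos_weight to xgboost or lightgbm.
```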

How does the weighting factor affect model training?
Directly through the gradients:
the scale factor is simply used as a multiplier on *both* the gradient $g_i$ and the hessian $h_i$ of example $i$,
provided that example $i$ is a positive example.

In `xgboost`'s source code it is:

```cpp
_out_gpair[_idx] = GradientPair(Loss::FirstOrderGradient(p, label) * w,
                                Loss::SecondOrderGradient(p, label) * w);
```

where `w` is the scale factor (after considering also user-supplied custom weights for each example).

And in `lightgbm`'s source code it is:

```cpp
const double response = -label * sigmoid_ / (1.0f + std::exp(label * sigmoid_ * score[i]));
const double abs_response = fabs(response);
gradients[i] = static_cast<score_t>(response * label_weight  * weights_[i]);
hessians[i] = static_cast<score_t>(abs_response * (sigmoid_ - abs_response) * label_weight * weights_[i]);
```

where `label` is 1 for positive and -1 for negative,
`label_weight` is the scale factor,
`sigmoid_` is the scale factor inside the sigmoid function (default 1),
and `weights_` are the user-supplied custom weights.

### Margin Clipping

But since any non-unity weight also screws up the distribution of our training data,
the model will no longer be able to estimate the class probability correctly.
In case estimating the correct probability matters,
there is another parameter that may help,
which is also shared by the two libraries:

```
max_delta_step=0
```

Here is a quote from `xgboost`'s official document on this parameter:

> Usually this parameter is not needed,
> but it might help in logistic regression when class is extremely imbalanced.
> Set it to value of 1-10 might help control the update.

Technically,
`max_delta_step` caps the leaf weight (the base model logit output, or *margin*) learnt at each iteration.
Note that this is different from the learning rate,
which only applies a fraction of the update:
`max_delta_step` effectively limits the absolute amount of the update,
and the learning rate still applies on top of the capped update.

This is a bit similar to *gradient clipping* in gradient descent,
where we cap the gradient update to a hard limit to avoid aggressive update.
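
To illustrate the difference from the learning rate, here is a toy sketch of our own (not the libraries' exact update rule): the raw leaf weight is first capped by `max_delta_step` and the learning rate is then applied to the capped value.

```{r max_delta_step_sketch}
# Toy illustration of capping a leaf weight before shrinkage (not the exact library code).
G <- -50; H <- 0.5; lambda <- 1   # Aggregated gradient/hessian of a leaf, L2 penalty.
eta <- 0.3
max_delta_step <- 0.7

w_raw <- -G / (H + lambda)                                  # Unconstrained leaf weight (margin).
w_capped <- sign(w_raw) * min(abs(w_raw), max_delta_step)   # Hard cap on the update.
c(raw=eta * w_raw, capped=eta * w_capped)                   # Learning rate applies either way.
```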

## On Categorical Encoding

One important user-facing difference is `lightgbm`'s special support for categorical features *without* one-hot encoding.
This is probably one of the most handy functions in modern machine learning libraries.
It not only improves model accuracy but also simplifies the data pipeline and hence reduces engineering effort.

For `xgboost`, categoricals must be either one-hot encoded or converted to integers under an ordinal assumption.
Empirically,
even when the feature is not ordinal,
converting it into integers sometimes still works pretty well in terms of accuracy,
while the training runtime can be reduced a lot and the data pipeline simplified.

To demonstrate the performance gain from LightGBM's optimal categorical encoding,
let's simulate some data.
Here we use the same data generating process set up previously,
but with one more categorical variable that enters the model linearly.

```{r sim_data_for_categorical_exp}
# Create simulated data for categorical experiment.
library(Matrix)

n <- 10000
z1 <- rnorm(n, 2, 4)
z2 <- runif(n)
e <- rnorm(n)

# Create one categorical with moderate cardinality and skewed distribution.
categorical <- sample(1:20, size=n, replace=TRUE, prob=1.3^(seq(2, 21)))
table(categorical)

sparse_cat <- t(fac2sparse(factor(categorical)))  # One-hot encode.
stopifnot(all.equal(as.numeric(table(categorical)), 
                    colSums(sparse_cat), check.attributes=FALSE))

# Random effects on categories.
cat_beta <- rnorm(ncol(sparse_cat))

# DGP: g = 3 z_1 + 6 z_2 -1.5 z_1z_2 + categorical dummy effects + e
g <- 3 * z1 + 6 * z2 - 1.5 * z1 * z2 + sparse_cat %*% cat_beta + e
g <- g[,1]  # Strip redundant dim.

# Prepare model design matrix.
in_mat <- cbind(z1, z2, sparse_cat)
head(in_mat)  # Training data in sparse matrix format.

in_dt <- data.table(z1=z1, z2=z2, cat=categorical)
head(in_dt)  # Training data in tabular format.

# Train-test split.
is_test_g <- runif(n) < .2
```

Now let's train an XGBoost model with the one-hot-encoded sparse matrix as input.
To be fair,
we set up the parameters such that `xgboost` also grows leaf-wise histogram trees as its base learners.

```{r categorical_exp_xgb}
run_xgb <- function(v=1) {
  xgb_dtrain <- xgb.DMatrix(in_mat[!is_test_g,], label=g[!is_test_g])
  xgb_dtest <- xgb.DMatrix(in_mat[is_test_g,], label=g[is_test_g])

  # We set the parameters such that xgb use lgb-style leaf-wise histogram tree.
  xgb.train(params=list(objective="reg:squarederror", booster="gbtree",
                        eta=.3, max_leaves=7, max_depth=3, nthread=1,
                        tree_method="hist", grow_policy="lossguide"),
            data=xgb_dtrain, watchlist=list(test=xgb_dtest),
            nrounds=50, early_stopping_rounds=5, verbose=v, print_every_n=10)
}

invisible(run_xgb())
```

Now it's LightGBM's turn.
We directly feed a data matrix with only 3 columns,
one of them being the categorical represented as integers,
and drop the ordinal assumption by explicitly listing it in the `categorical_feature` argument of the training call.

```{r categorical_exp_lgb}
run_lgb <- function(v=1) {
  lgb_dtrain <- lgb.Dataset(as.matrix(in_dt[!is_test_g]), label=g[!is_test_g],
                            colnames=names(in_dt), categorical_feature="cat")
  lgb_dtest <- lgb.Dataset(as.matrix(in_dt[is_test_g]), label=g[is_test_g],
                           colnames=names(in_dt), categorical_feature="cat")

  # We set the parameters such that lgb behaves like xgb's default for common parameters.
  lgb.train(params=list(objective="regression", boosting="gbdt",
                        lambda_l2=1, learning_rate=.3, metric="rmse",
                        num_leaves=7, max_depth=3, num_threads=1,
                        min_sum_hessian_in_leaf=1, min_data_in_leaf=0),
            data=lgb_dtrain, valids=list(test=lgb_dtest),
            num_iterations=50, early_stopping_rounds=5, verbose=v, eval_freq=10)
}

invisible(run_lgb())
```

In our specific parameter setup LightGBM outperforms XGBoost,
but the margin is minor since we didn't seriously tune either model.
There are already lots of benchmarks on the internet with different experiments and datasets,
and the results are in general mixed.
We consider the two models comparable in terms of accuracy.
Here we only care about the performance gap resulting from the different treatments of categorical variables.

Now we benchmark the model training runtime:

```{r benchmark_cat}
library(microbenchmark)

print(microbenchmark(
  xgb=run_xgb(v=0),
  lgb=run_lgb(v=0),
  times=10
))
```

Clearly LightGBM is much faster than XGBoost with its optimal categorical encoding scheme.
Not to mention it also saves our ass from preparing a one-hot encoder for our data.

# Concluding Remarks

Modern gradient boosting trees are powerful machine learning models for lots of real-world applications.
In this notebook we thoroughly review two such frameworks from the open source community:
`xgboost` and `lightgbm`.
We choose them not only because they are both implemented with very high engineering quality,
but also because they are extremely popular and widely adopted among data science practitioners.

For tools so widely spread,
we feel there is a need to demystify why they are so effective,
by understanding how they differ from the traditional (vanilla) gradient boosting machines.
We focus on the technical details and the novelties that better the GBT family.
We hope the notebook helps practitioners better understand the tools at hand,
so they can utilize them to their maximum potential.

# References
