Abstract
We review the Monte Carlo bootstrap sampling approach, a powerful technique for measuring the error of a statistical inference task when the variance of our estimator is not analytically tractable. We explain why bootstrapping works with a minimum amount of theoretical derivation, and demonstrate with hands-on examples.
At the core of statistical inference is the use of an estimator \(\hat{\theta}(X)\) to guess the unknown value of a population parameter of interest \(\theta\), based on a random sample dataset \(X\) drawn from that population. The most common case of \(\theta\) is the population mean \(\mu = E(x)\), where \(x\) denotes the population following an unknown distribution \(x \overset{d}\sim F\). Other popular parameters include a median, a variance, a confidence interval of the mean, a regression model coefficient, a confidence interval of the regression coefficient, and so on.
How do we know our guess (our estimator) is a good one? That is, how do we measure the error of our estimator in statistical inference, without knowing the ground truth of the population? The distribution function \(F\) of the population is unknown. As a result, to make statistical inference from a sample dataset we rely on the Central Limit Theorem (CLT). In its vanilla version the CLT only requires the population to have a finite second moment and our sample dataset to be independently and identically distributed (i.i.d.). It allows us to derive the limiting distribution (as \(n \rightarrow \infty\)) of the sampling distribution of the sample mean, which in turn can be used to measure the error of our guess \(\hat{\theta} = \frac{1}{n}\sum_ix_i\) for the unknown population parameter \(\theta = \mu\).
To be concrete, the vanilla CLT indicates that the sample mean \(\bar{X}_n = \frac{1}{n}\sum_ix_i\) satisfies:
\[ \sqrt{n}(\bar{X}_n - \mu) \overset{d}\rightarrow \text{Normal}(0, \sigma^2), \]
where \(\mu\) is the population mean and \(\sigma\) the population standard deviation. In other words, for large \(n\) the sampling distribution of \(\bar{X}_n\) is approximately Normal with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\). This is elegant, since the Normal distribution is mathematically very friendly to deal with.
Ideally we’d like to create multiple (as many as possible) random sample datasets to measure our guess against the population. In reality, due to resource constraints we usually end up with one and only one such sample dataset. But based on the CLT we have solid theoretical ground to measure our guess (via its asymptotic sampling distribution) even from only one sample of size \(n\), provided that \(n\) is large enough.
Different versions of the CLT can be formulated depending on the estimator we are using, in order to derive the asymptotic sampling distribution of that estimator. So we also have a CLT applicable to, say, a regression model coefficient. This is exactly how we test the null effect of a regressor on the regressand. In general, any estimator in the form of a weighted average can be handled by a CLT.
The usual workflow of statistical inference is to use the CLT to derive an analytical solution for the asymptotic sampling distribution of the estimator. But this is not always achievable. The more complicated the estimator, the harder it is to analyze in a tractable way. And some estimators can be hard to deal with even if they look simple on their own. One well-known example is the sample median (to guess the population median, of course), which requires order statistics to come into play. Indeed there is no closed-form solution for the variance of the sample median unless we assume the population distribution to be of a certain kind.1
This is where bootstrap sampling comes to shine. The idea of bootstrap sampling is to use the sample, and the only sample, as a surrogate population, to approximate the underlying sampling distribution of our estimator, which is otherwise intractable.
Efron (1992) proposed the bootstrapping technique as a solution to estimate the variance (or the confidence interval) of a given estimator, so that we can draw conclusions about the quality of our inference, especially when there is no analytical solution for the variance of that estimator.
Bootstrapping can be simply defined as drawing an i.i.d. sample of the same size \(n\) from a given sample. In some literature this is also called a resample. To strictly satisfy i.i.d., the sampling must be done with replacement.
With enough computing power, we can generate as many resamples \(X_b\) as we want, and for each resample we calculate the value of our estimator \(\hat{\theta}_b = \hat{\theta}(X_b)\) (termed the estimate). The resulting distribution of bootstrapped estimates from a given sample \(X\) is referred to as the bootstrap distribution.
It has been proven that the limiting distribution of the bootstrap distribution is actually the same as that of the sampling distribution, for a variety of estimators widely used in inference (Singh 1981).
To be clear let’s put it in notation:
\[ \underbrace{(\hat{\theta}_b - \hat{\theta})}_\text{Bootstrap Distribution} \overset{d}\rightarrow \underbrace{(\hat{\theta} - \theta) \vphantom{\hat{\theta}_b}}_\text{Sampling Distribution}, \]
where \(\hat{\theta}\) is the estimate derived from our only sample to infer the parameter \(\theta\), and \(\hat{\theta}_b\) is the estimate derived from the bootstrap sample out of the original sample.2
For practical reasons we only have one sample to derive \(\hat{\theta}\), but we can draw as many bootstrap samples from that sample as we want to derive lots of \(\hat{\theta}_b\). Knowing that for a large enough (original) sample the bootstrap distribution will approximate the sampling distribution, we can estimate the variance of our estimator without knowing its closed-form solution.
To sum up, bootstrapping is just another route to arrive at the sampling distribution of our estimator, in order to measure how good our guess is against the population unknown. In the following section we discuss the heuristics as to why bootstrapping works.
To do the bootstrap we essentially take the random sample \(X\) from the original population \(F\), and do i.i.d. sampling from that sample as if the sample itself were another population with a distribution function \(\hat{F}\). Only this time the population is finite and has a discrete distribution function: the empirical distribution function \(\hat{F}_n(X)\).
From Wikipedia:
In statistics, an empirical distribution function is the distribution function associated with the empirical measure of a sample. This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value.
Specifically, the empirical distribution function (EDF) and empirical PDF of a sample \(x \overset{d}\sim F\) of size \(n\) can be written as:
\[ \begin{aligned} \hat{F}_n(x) &= \frac{1}{n} \sum_{i=1}^nI(x_i \leq x), \\ \hat{p}_n(x) &= \frac{d\hat{F}_n}{dx} = \frac{1}{n}, \end{aligned} \]
where \(I\) is the indicator function. Strictly speaking \(\hat{F}_n\) is a step function, so the “density” here is loose notation for a probability mass of \(\frac{1}{n}\) placed on each observed data point.
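To make the formula concrete, here is a tiny sanity check (with made-up numbers; the object names are just for illustration) showing that it matches R’s built-in ecdf:
x_demo <- c(2.1, 3.5, 3.5, 5.0)
Fhat <- function(a) mean(x_demo <= a)  # (1/n) * sum over I(x_i <= a).
Fhat(3.5)          # 0.75
ecdf(x_demo)(3.5)  # Also 0.75, from the built-in EDF.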
Intuitively and loosely speaking, \(\hat{F}_n\) is just an estimator for \(F\), the true population distribution. But here we are not guessing a single population parameter but the entire distribution. That is, we are not guessing a scalar but a function.3
Why introduce the notion of the EDF? To see its value, it’s better to step back and examine our best-friend parameter: the population mean \(\mu = E(x)\). By definition the derivative of a CDF \(F\) is a PDF \(p\):
\[ \frac{dF(x)}{dx} = p(x). \]
Now we observe that the population mean can be re-written as:
\[ E(x) = \int xp(x)dx = \int xdF(x). \]
That is to say, population mean is indeed a function of CDF. We can easily extend this to the variance as a function of CDF as well, and to many other statistics.
Of course we don’t know CDF of the population. But we do know the EDF from our observed sample dataset, which itself is an estimator for CDF. This immediately gives us one estimator for population mean of the following form:
\[ \begin{aligned} E(x) \approx \int xd\hat{F}_n(x) &= \int x\hat{p}_n(x)dx \\ &= \sum_{i=1}^n x_i \cdot \frac{1}{n} \\ &= \frac{1}{n}\sum_{i=1}^nx_i, \end{aligned} \]
to guess the population mean.4
To no surprise, this estimator is exactly what we call the sample mean, proven to be a very good estimator for the population mean. But the idea generalizes to any estimator, as long as the estimator can be expressed as a function of the CDF, which in turn is approximated by the EDF. In the literature such estimators are also referred to as statistical functionals.
More formally, under i.i.d. sample we have:
\[ \begin{aligned} E\big[\hat{F}_n(a)\big] &= \frac{1}{n}E\big[\sum_iI(x_i \leq a)\big] \\ &= \frac{1}{n}\sum_iE\big[I(x_i \leq a)\big] \\ &= \frac{1}{n} \cdot n \cdot \Pr(x \leq a) \\ &= F(a). \end{aligned} \]
Hence the EDF is unbiased at every value \(a\). This means that any statistical functional \(G(\hat{F}_n)\) which is a linear function of the EDF will be unbiased as well:
\[ E\big[G(\hat{F}_n)\big] = G(F). \]
One thing remains unclear for those who are skeptical: how well does \(\hat{F}_n\) approximate \(F\)? Technically speaking, what is the variance of \(\hat{F}_n\)? The good news is that we do have a sound theoretical result bounding how close the EDF will be to the CDF. This is the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality:
\[ \Pr\big( \sup_{x \in R}\vert \hat{F}_n(x) - F(x) \vert > \epsilon \big) \leq 2e^{-2n\epsilon^2},\forall\epsilon > 0. \]
In plain words, the EDF converges uniformly to the CDF at an exponential rate in \(n\).5
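To get a rough feel for this convergence, here is a small sketch (the Normal population, sample size, and grid are arbitrary choices for illustration) that computes the sup distance between the EDF of a simulated sample and the true CDF:
set.seed(123)
n_demo <- 1000
z <- rnorm(n_demo)                     # Pretend-unknown population: standard Normal.
grid <- seq(-4, 4, length.out=1000)
max(abs(ecdf(z)(grid) - pnorm(grid)))  # Sup distance between EDF and true CDF over the grid.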
So the intuition of why bootstrapping works lies in the following facts: the EDF is an unbiased estimator of the CDF and converges to it uniformly at an exponential rate (the DKW inequality); many estimators of interest are statistical functionals, i.e., functions of the CDF that can be approximated by plugging in the EDF; and the resulting bootstrap distribution converges to the sampling distribution of the estimator (Singh 1981).
There are well-known cases where bootstrapping is not able to approximate the sampling distribution of the estimator, for example when the estimator is an extreme order statistic (such as the sample maximum) or when the population does not have finite variance. Such failures are all related to the applicability of CLT: the proof of the convergence of the bootstrap distribution toward the sampling distribution is based on CLT, so anything disqualifying CLT will result in the failure of bootstrapping.
It is possible to derive the closed-form solution for a bootstrap distribution, but doing so kind of defeats the purpose of using the bootstrap. The very reason we resort to bootstrapping is to bypass the need for a tractable derivation. So we will use Monte Carlo simulation to approximate the bootstrap distribution. Usually we are particularly interested in the variance of the bootstrap distribution.
Just a little recap. Given a particular estimator \(\hat{\theta}(\cdot)\) as a random variable, we are usually concerned about its expectation and variance. The expectation tells us whether the estimator is unbiased or not. The variance tells us how volatile our guess is, and also whether it can be consistent–converging to the true parameter of interest.
By definition, the variance of a random variable \(\hat{\theta}\) is:
\[ \begin{aligned} Var(\hat{\theta}) &= E\big[(\hat{\theta} - E(\hat{\theta}))^2\big] \\ &= E(\hat{\theta}^2) - E(\hat{\theta})^2. \end{aligned} \]
Now to approximate the above result with Monte Carlo of \(M\) repetitions, we can simply do:
\[ Var(\hat{\theta}) \approx \frac{1}{M}\sum_m \hat{\theta}_m^2 - \bigg(\frac{1}{M}\sum_m \hat{\theta}_m\bigg)^2 , \]
where each \(\hat{\theta}_m\) is derived from a bootstrapped sample at the \(m\)-th repetition.
The above approximation holds thanks to the Law of Large Numbers (as \(M \rightarrow \infty\)):
\[ \begin{aligned} \frac{1}{M}\sum_m \hat{\theta}_m &\overset{p}\rightarrow E(\hat{\theta}), \\ \frac{1}{M}\sum_m \hat{\theta}_m^2 &\overset{p}\rightarrow E(\hat{\theta}^2). \end{aligned} \]
Remember that we now express the estimator \(\hat{\theta}(X)\) as a function of the EDF: \(\hat{\theta}(\hat{F}_n(X))\), i.e., a statistical functional. To draw a random sample from this function we do i.i.d. sampling with replacement against \(\hat{F}_n\). Once the sample is constructed, we calculate the corresponding estimate value, and we repeat this many times to derive the entire distribution of the resulting estimates. That is the simulated bootstrap distribution.
There are two sources of error that can arise from this approach: simulation error and estimation error.
Simulation error occurs because we only run a finite number of Monte Carlo repetitions. Increasing the number of repetitions effectively decreases this error. Thanks to the widespread computing power nowadays, this is not a big issue at all.
Estimation error is the error due to statistical noise when we use \(\hat{F}_n\) to approximate \(F\), and by nature it cannot be fully eliminated unless \(n\) goes to infinity. To reduce this error we need a larger \(n\). Remember that the DKW inequality tells us that the convergence is at an exponential rate in \(n\).
Let’s demonstrate the use of Monte Carlo bootstrapping with our best-friend estimator: the sample mean. Assume a Beta distribution as our pretend-unknown population with the PDF:
beta_a <- 2
beta_b <- 10
mu_beta <- beta_a / (beta_a + beta_b)
var_beta <- (beta_a * beta_b) / ((beta_a + beta_b)^2*(beta_a + beta_b + 1))
beta_pdf <- function(x) dbeta(x, beta_a, beta_b)
curve(beta_pdf, from=0, to=1, ylab="Density",
main=sprintf("(Unknown) Population Beta(%s, %s) PDF", beta_a, beta_b))
and the CDF:
beta_cdf <- function(x) pbeta(x, beta_a, beta_b)
curve(beta_cdf, from=0, to=1, ylab="Probability",
main=sprintf("(Unknown) Population Beta(%s, %s) CDF", beta_a, beta_b))
Now we draw a random sample of a fair size from the population. We plot the resulting histogram as an empirical density:
set.seed(666)
n <- 5000
x <- rbeta(n, beta_a, beta_b)
hist(x, main="Histogram of X", probability=TRUE)
and the EDF:
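One way to plot it (the exact plotting options are a guess here) is with R’s built-in ecdf:
plot(ecdf(x), main="Empirical Distribution Function of X", xlab="X", ylab="Fn(x)")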
The variance of the sample mean is analytically tractable, so indeed we don’t really need the bootstrap to tackle this problem. But we can use it to verify whether the bootstrap arrives at the correct approximation to our analytical solution.
Here is the derivation of the variance of sample mean, provided that the sample is i.i.d.:
\[ \begin{aligned} Var(\frac{1}{n}\sum_ix_i) &= \frac{1}{n^2}Var(\sum_ix_i) \\ &= \frac{1}{n^2}\sum_iVar(x_i) &\text{(due to independently distributed)} \\ &= \frac{1}{n^2}\cdot n Var(x) &\text{(due to identically distributed)} \\ &= \frac{Var(x)}{n}. \end{aligned} \]
Obviously the variance goes to zero as sample size goes to infinity. Hence sample mean as an estimator is consistent.
We can readily calculate the standard error of the sample mean given our sample:
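For instance, plugging in the population variance we defined above:
sqrt(var_beta / n)  # Standard error of the sample mean using the true population variance.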
[1] 0.001461763
But remember that we don’t know our population, so there is no way we have access to the population variance (the numerator). Instead we will almost always use the sample variance to approximate the solution:
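For instance, using the sample variance of our observed data:
sqrt(var(x) / n)  # Standard error of the sample mean using the sample variance.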
[1] 0.001489787
Now let’s say we don’t know the analytical solution. If we could draw more than just one sample dataset, we could use the resulting distribution to measure the error of our estimator. Here is a corresponding Monte Carlo simulation to do exactly this:6
M <- 10000
mc_sampling <- function(M, n, print=TRUE) {
mc_xbar <- rep(NA_real_, M)
for ( i in seq_len(M) ) {
xm <- rbeta(n, beta_a, beta_b)
mc_xbar[i] <- mean(xm)
}
v <- sqrt(sum(mc_xbar^2) / M - (sum(mc_xbar) / M)^2) # Or simply sd(mc_xbar).
if ( print ) print(v)
mc_xbar
}
mc_xbar <- mc_sampling(M, n)
[1] 0.001465021
The simulated distribution is the sampling distribution of our estimator. The CLT tells us that the sampling distribution of the sample mean is asymptotically Normal. Here, without even knowing that, we can just use Monte Carlo to approximate the asymptotic sampling distribution. But this is hypothetical only: in reality we don’t have access to multiple random samples due to various resource constraints, so this approach only exists in theory.
Here we can plot the simulated estimates. It is a finite sample approximation to the sampling distribution. We also plot the theoretical limiting distribution given by CLT in blue curve.
x_mc <- seq(min(mc_xbar), max(mc_xbar), length=100)
y_mc <- dnorm(x_mc, mu_beta, sqrt(var_beta / n))
hist(mc_xbar, main="Sampling Distribution (Sample Mean)", xlab="X", probability=TRUE)
lines(x_mc, y_mc, col="blue", lwd=2)
One can see how closely the simulated sampling distribution in histogram follows the theoretical limiting distribution in blue curve. When CLT applies, we can bypass the need to use multiple sample datasets and instead use only one to measure our statistical inference.
Bootstrapping is yet another route to the sampling distribution. Even though we only have one sample dataset, and without applying CLT, we can use bootstrapping to generate as many resamples as we want. And this is totally feasible given enough computing power.
The same Monte Carlo setup, but now each sample is generated by bootstrapping from the only sample we have at hand:
M <- 10000
bs_sampling <- function(x, M, print=TRUE) {
bs_xbar <- rep(NA_real_, M)
for ( i in seq_len(M) ) {
bs <- sample(x, size=length(x), replace=TRUE) # The bootstrap.
bs_xbar[i] <- mean(bs)
}
v <- sqrt(sum(bs_xbar^2) / M - (sum(bs_xbar) / M)^2) # Or simply sd(bs_xbar).
if ( print ) print(v)
bs_xbar
}
bs_xbar <- bs_sampling(x, M)
[1] 0.001474795
The approximation is as good as if we hypothetically created multiple samples from the original population (as in the previous Monte Carlo exercise).
Below we plot all bootstrapped sample means. We also plot a green curve as the sampling distribution as if the original sample is the population. (That is, the original sample mean becomes the “population mean”.)
x_bs <- seq(min(bs_xbar), max(bs_xbar), length=100)
y_bs <- dnorm(x_bs, mean(x), sqrt(var_beta / n))
hist(bs_xbar, probability=TRUE,
main="Distribution of Bootstrapped Sample Mean",
sub=sprintf("Sample Size = %i", n), xlab="X")
lines(x_mc, y_mc, col="blue", lwd=2)
lines(x_bs, y_bs, col="green", lwd=2)
Notice that we also plot the blue curve as the limiting distribution of the sample mean. Our previous Monte Carlo exercise has already confirmed this is the sampling distribution based on CLT. Clearly the bootstrapped estimates do NOT magically approximate the sampling distribution. There is a bias introduced by the original sample mean, which deviates statistically from the true mean. But the two do share the same variance (shape). And since we are asking for the variance of our estimator, such bias does not affect our estimation of the variance at all!
So the question is, can the resulting bootstrapped estimates approximate the sampling distribution in both shape and location? The answer is yes, but we need a large enough sample size. Let’s double up our original sample size and redo the entire exercise:
n2 <- 10000
x2 <- rbeta(n2, beta_a, beta_b)
mc_xbar2 <- mc_sampling(M, n2, print=FALSE)
bs_xbar2 <- bs_sampling(x2, M, print=FALSE)
x_mc2 <- seq(min(mc_xbar2), max(mc_xbar2), length=100)
y_mc2 <- dnorm(x_mc2, mu_beta, sqrt(var_beta / n2))
x_bs2 <- seq(min(bs_xbar2), max(bs_xbar2), length=100)
y_bs2 <- dnorm(x_bs2, mean(x2), sqrt(var_beta / n2))
hist(bs_xbar2, probability=TRUE,
main="Distribution of Bootstrapped Sample Mean",
sub=sprintf("Sample Size = %i", n2), xlab="X")
lines(x_mc2, y_mc2, col="blue", lwd=2)
lines(x_bs2, y_bs2, col="green", lwd=2)
Now the two distributions are getting closer to each other. We also observe that the sample-size requirement for the bootstrap to approximate the limiting distribution (in both shape and location) is much stricter than that for the CLT to be applicable.7
To show this clearly, we run another Monte Carlo simulation with a sample size of only 100, and we can see how well the sampling distribution already approximates the limiting distribution:
n3 <- 100
x3 <- rbeta(n3, beta_a, beta_b)
mc_xbar3 <- mc_sampling(M, n3, print=FALSE)
x_mc3 <- seq(min(mc_xbar3), max(mc_xbar3), length=100)
y_mc3 <- dnorm(x_mc3, mu_beta, sqrt(var_beta / n3))
hist(mc_xbar3, main="Sampling Distribution (Sample Mean)",
sub=sprintf("Sample Size = %i", n3), xlab="X", probability=TRUE)
lines(x_mc3, y_mc3, col="blue", lwd=2)
This suggests that whenever we do have an analytical solution for the limiting distribution of our estimator based on the corresponding CLT, we should just use that for our inference task instead of trying bootstrapping.8
So, yes, for a classical statistic like sample mean, there is not much value added using bootstrap. But it can always serve as a good educational case study.
In our comparison above between the bootstrap distribution and the sampling distribution, what we actually plotted is \(\hat{\theta}\) vs. \(\hat{\theta}_b\). But the approximation is better for \((\hat{\theta}_b - \hat{\theta})\) vs. \((\hat{\theta} - \theta)\), and even better for \(\frac{(\hat{\theta}_b - \hat{\theta})}{se_b}\) vs. \(\frac{(\hat{\theta} - \theta)}{se}\). The last one is referred to as the second-order correction of the bootstrap in the literature.
For our previous example if we plot instead the de-meaned version:
# Plot smooth density instead of histogram to make it easier to see the overlapping.
plot(density(bs_xbar - mean(x)), col="green",
main="Monte Carlo Sampling Distribution v.s. Bootstrap Distribution",
sub=sprintf("Sample Size = %i", n),
xlab="X (De-Meaned)")
lines(density(mc_xbar - mu_beta), col="blue")
legend("topright",
c(expression(hat(theta)[b] - hat(theta)),
expression(hat(theta) - theta)),
col=c("green", "blue"), pch=c(1, 1))
We see that the convergence is indeed quite good already.
boot
Rather than implementing the sampling procedure from scratch, we can also take advantage of the built-in package boot (Canty and Ripley 2019) in R to simplify our code:
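One way to run it (the bootstrapped statistic here is simply the mean of the resampled observations; the multicore option is optional):
library(boot)
# Bootstrap the sample mean with M resamples.
bs_r <- boot(data=x, statistic=function(x, i) mean(x[i]), R=M, parallel="multicore")
print(bs_r)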
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = x, statistic = function(x, i) mean(x[i]), R = M,
parallel = "multicore")
Bootstrap Statistics :
original bias std. error
t1* 0.1691994 -1.099041e-05 0.00148536
The bias reported by boot is simply the difference between the average of our simulated estimates and the original estimate:
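A quick sketch of that check, using the fields of the returned boot object (t holds the simulated estimates, t0 the original estimate):
mean(bs_r$t) - bs_r$t0  # Average of bootstrapped estimates minus the original estimate.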
[1] -1.099041e-05
It should be very close to zero if nothing went wrong.
The returned object from boot also has a convenient plot method for visualization purposes:
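For example, simply calling plot on the returned object:
plot(bs_r)  # Histogram and quantile plot of the bootstrapped estimates.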
We can also use the function boot.array to check how many times a data point is resampled at each repetition. For example, to check how many times the first two data points are resampled in the first 10 repetitions:
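A sketch of that check:
boot.array(bs_r)[1:10, 1:2]  # Resampling frequencies: first 10 repetitions, first 2 data points.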
[,1] [,2]
[1,] 0 1
[2,] 2 1
[3,] 0 1
[4,] 0 0
[5,] 0 1
[6,] 2 1
[7,] 1 1
[8,] 0 0
[9,] 1 0
[10,] 1 2
The row dimension is the repetition and the column dimension is the data point.
Since we now have access to all simulated estimates, we can easily construct the confidence interval as well:
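For example, a simple percentile interval from the simulated estimates:
quantile(bs_r$t, c(.025, .975))  # 95% bootstrap percentile confidence interval.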
2.5% 97.5%
0.1662635 0.1721149
Of course for a simple estimator such as the sample mean, we do have the analytical solution for its confidence interval. But bootstrapping is able to do the approximation without knowing that solution.
Another possible metric (though highly similar to the variance) we may want to look at is the mean squared error (MSE). We can approximate the MSE of our estimator using bootstrapping:
\[ \text{MSE}(\hat{\theta}) = \frac{1}{M}\sum_m(\hat{\theta}_m - \hat{\theta})^2. \]
So for the sample mean example, we can do:
# We use root mean squared error instead since the number is small.
sqrt(mean((bs_r$t - mean(x))^2))
[1] 0.001485327
We can also construct the confidence interval of the mean via the (bootstrapped) t-statistic. The t-statistic is:
\[ t = \frac{\bar{X} - \mu}{s / \sqrt{n}}, \]
where \(s\) is the standard deviation of sample \(X\).
The (two-sided) confidence interval for \(\mu\) is:
\[ \bar{X}\pm t_{(1 - \alpha/2)} \times \frac{s}{\sqrt{n}}, \]
where \(t_{(1 - \alpha/2)}\) is the \((1 - \alpha/2)\) quantile of the t-distribution with \(n - 1\) degrees of freedom, giving a confidence level of \(1 - \alpha\).
For our demo example the ground truth of the t-statistic is:
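A sketch of the computation, using the true population mean mu_beta we defined earlier (normally unknown):
(mean(x) - mu_beta) / (sd(x) / sqrt(n))  # t-statistic against the true mean.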
[1] 1.700033
Or we can use the built-in test function to also compute the confidence interval conveniently:
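For example, testing against the true mean (which we happen to know in this demo):
t.test(x, mu=mu_beta)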
One Sample t-test
data: x
t = 1.7, df = 4999, p-value = 0.08919
alternative hypothesis: true mean is not equal to 0.1666667
95 percent confidence interval:
0.1662787 0.1721200
sample estimates:
mean of x
0.1691994
Note that the above interval is not derived by plugging in the unknown parameter \(\mu\), but is based on the limiting distribution of the t-statistic: Student’s t-distribution. To verify this:
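A sketch of that verification, using quantiles of the t-distribution with n - 1 degrees of freedom:
mean(x) + qt(c(.025, .975), df=n - 1) * sd(x) / sqrt(n)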
[1] 0.1662787 0.1721200
Instead we can use the bootstrapped t-statistic to approximate the unknown true t-statistic:
\[ t_b = \frac{\bar{X}_b - \bar{X}}{s_b / \sqrt{n}}. \]
Then all we need to do is to replace \(t\) with \(t_b\) in the equation:
\[ \bar{X}\pm t_{b{(1 - \alpha/2)}} \times \frac{s}{\sqrt{n}}. \]
The exact exercise follows.
# Re-define the bootstrapper since this time we need access to sd(x_b) as well.
t_bs_sampling <- function(x, M) {
bs_xbar <- rep(NA_real_, M)
bs_sd <- rep(NA_real_, M)
for ( i in seq_len(M) ) {
bs <- sample(x, size=length(x), replace=TRUE) # The bootstrap.
bs_xbar[i] <- mean(bs)
bs_sd[i] <- sd(bs) # Compute also the standard deviation for each resample.
}
list(xbar=bs_xbar, sd=bs_sd)
}
res <- t_bs_sampling(x, M)
bs_t <- (res$xbar - mean(x)) / (res$sd/sqrt(n)) # Bootstrap t-statistics.
# Bootstrapped confidence interval of mu using t-statistic.
mean(x) + (sd(x) / sqrt(n)) * quantile(bs_t, c(.025, .975))
2.5% 97.5%
0.1662145 0.1721219
What we demonstrate here is that the bootstrap distribution of the t-statistic also converges to the sampling distribution of the t-statistic.
Now we understand the basic idea of bootstrapping. In this section we further discuss its application to a linear regression model.
We create some simulated data from a hypothetical true model:
set.seed(777)
size <- 10000
num_feature <- 2
# Regressors (design matrix).
X <- rbeta(size * (num_feature + 1), beta_a, beta_b)
X <- cbind(1, matrix(X, ncol=num_feature + 1))
colnames(X) <- paste0("x_", 0:3)
# Noise.
e <- rnorm(size)
# True parameters.
true_coefs <- runif(num_feature + 2)
names(true_coefs) <- colnames(X)
# Response.
y <- (X %*% true_coefs + e)
# We will purposely omit one covariate to make the data a little bit "real".
Xy <- as.data.frame(X[,1:(num_feature + 1)])
Xy$y <- y
head(Xy)
The true parameters are:
x_0 x_1 x_2 x_3
0.4783329 0.4621635 0.2008297 0.5402106
The last parameter will be omitted from our model fit, since we pretend it is not observable.
We fit the model and report the heteroskedasticity-consistent standard errors:9
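Here is a sketch of one way to do this, assuming the lmtest and sandwich packages (Zeileis and Hothorn 2002; Zeileis 2004); the specific HC type is left at vcovHC's default, which is an assumption:
library(lmtest)    # coeftest().
library(sandwich)  # vcovHC().
# x_0 is already a column of ones, so suppress the implicit intercept.
lm_model <- lm(y ~ . - 1, data=Xy)
coeftest(lm_model, vcov.=vcovHC(lm_model))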
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
x_0 0.589523 0.024877 23.6977 < 2.2e-16 ***
x_1 0.364150 0.098327 3.7035 0.0002138 ***
x_2 0.070404 0.095387 0.7381 0.4604799
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The simplest approach is to resample at the observation level. To speed up the repetitions we will directly compute the analytical solution rather than calling lm, to avoid the overhead. Here is a quick single-threaded implementation from scratch:
ols <- function(X, y) {
# Analytical solution for OLS estimator.
solve(t(X) %*% X) %*% t(X) %*% y
}
bs_coefs <- matrix(NA_real_, nrow=1 + num_feature, ncol=M)
for ( i in seq_len(M) ) {
Xy_b <- as.matrix(Xy[sample(size, replace=TRUE),])
coefs_b <- ols(X=Xy_b[, 0:num_feature + 1], y=Xy_b[, ncol(Xy_b)])
bs_coefs[,i] <- coefs_b
}
Instead, let’s take advantage of boot to simplify and also speed up our code:
reg <- function(Xy, i) {
Xy <- Xy[i,]
X <- Xy[, 0:num_feature + 1]
y <- Xy[, ncol(Xy)]
solve(t(X) %*% X) %*% t(X) %*% y
}
bs_coefs <- boot(as.matrix(Xy), reg, R=M, parallel="multicore")
print(bs_coefs)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = as.matrix(Xy), statistic = reg, R = M, parallel = "multicore")
Bootstrap Statistics :
original bias std. error
t1* 0.58952303 -4.014462e-04 0.02519527
t2* 0.36415043 1.791231e-03 0.09865505
t3* 0.07040356 9.224955e-05 0.09552950
To plot the bootstrap distribution of the 3rd coefficient:
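For example, using the index argument of boot's plot method:
plot(bs_coefs, index=3)  # Bootstrap distribution of the 3rd coefficient (x_2).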
Case-wise bootstrapping implicitly assumes the covariates are random (sampled along with the response), which may not be justifiable if we believe our data is collected in a highly controlled manner. For example, the data may come from an experimental design (where some of the features are directly set by the researcher). In such a scenario we can resort to another approach: bootstrapping the model residuals.
If we’d like to hold our design matrix fixed, we can instead resample the residuals from the fitted model. Be aware that this implicitly assumes the error term is identically distributed.
Anyway let’s do the exercise. Here is again the single-threaded implementation from scratch:
bs_coefs2 <- matrix(NA_real_, nrow=1 + num_feature, ncol=M)
yhat <- predict(lm_model, Xy)
for ( i in seq_len(M) ) {
e_b <- sample(lm_model$residuals, size=size, replace=TRUE) # Residual bootstrapping.
y_b <- yhat + e_b
bs_coefs2[,i] <- ols(X=X[, 0:num_feature + 1], y=y_b)
}
And with boot:
X_ <- X[, 0:num_feature + 1] # Design matrix is fixed.
yhat <- predict(lm_model, Xy)
reg2 <- function(e, i) {
y <- yhat + e[i]
solve(t(X_) %*% X_) %*% t(X_) %*% y
}
bs_coefs2 <- boot(lm_model$residuals, reg2, R=M, parallel="multicore")
print(bs_coefs2)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = lm_model$residuals, statistic = reg2, R = M, parallel = "multicore")
Bootstrap Statistics :
original bias std. error
t1* 0.58952303 0.0002952371 0.02477298
t2* 0.36415043 -0.0005145451 0.09641844
t3* 0.07040356 -0.0017440291 0.09556832
As one can see, in our toy example the two approaches don’t show much discrepancy.
Since residual bootstrapping relies on the fitted residuals, the original model specification must be meaningful in the first place.
Time series are usually autocorrelated. This means that neither case-wise nor residual bootstrapping will work, because the underlying assumptions are heavily violated. To deal with autocorrelation, a variety of block bootstrap methods have been proposed. The general idea is to resample at the block level instead of the individual level in order to maintain the underlying autocorrelation structure. We will skip the detailed discussion here to keep the scope small.10
Canty, Angelo, and B. D. Ripley. 2019. Boot: Bootstrap R (S-Plus) Functions.
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge: Cambridge University Press. http://statwww.epfl.ch/davison/BMA/.
Efron, Bradley. 1992. “Bootstrap Methods: Another Look at the Jackknife.” In Breakthroughs in Statistics, 569–93. Springer.
Singh, Kesar. 1981. “On the Asymptotic Accuracy of Efron’s Bootstrap.” The Annals of Statistics 9 (6): 1187–95.
Zeileis, Achim. 2004. “Econometric Computing with HC and HAC Covariance Matrix Estimators.” Journal of Statistical Software 11 (10): 1–17. https://doi.org/10.18637/jss.v011.i10.
Zeileis, Achim, and Torsten Hothorn. 2002. “Diagnostic Checking in Regression Relationships.” R News 2 (3): 7–10. https://CRAN.R-project.org/doc/Rnews/.
To emphasize one more time: for the CLT to apply to the sample mean, no assumption about the population distribution is made except the finite-second-moment requirement, which is a fairly weak assumption.↩︎
Though we didn’t discuss the technical details here, the proof of this convergence is indeed based on CLT.↩︎
Here our wording is a bit Fisherian, because from a Bayesian point of view even a population parameter is a random variable, not a constant scalar.↩︎
This is based on the plug-in principle.↩︎
Notation \(\sup\) denotes the supremum: The smallest number that is greater than or equal to every number in the set.↩︎
Here we didn’t adjust the denominator to \(M-1\) in the formula we used; the function sd does that by default. Theoretically the adjustment makes the estimate unbiased. The effect is negligible though, since we control the number of repetitions, which is usually a big enough number.↩︎
Some researchers recommend \(n^2\) repetitions where \(n\) is the original sample size. Or \(n\ln n\) if \(n^2\) is prohibitive.↩︎
Old-school textbooks may mention the rule-of-thumb sample size of 30 as a generally accepted large-enough number. This is an over-simplification and can be dangerous in practice. Whether a sample size is big enough for the CLT to give a good approximation really depends on the nature of the population distribution, which unfortunately is unknown in the first place. There are some heuristics, though. For example, the more symmetric the population, the smaller the sample size required for a good approximation.↩︎
Even though we know that our true model has homoskedastic error, the HC approach is more robust in practice.↩︎
Block bootstrap is not limited to time series. It can also apply to data with spatial correlation.↩︎