The method laid out below is the one described in Section 6.3.3 of Davison and Hinkley (1997), _Bootstrap Methods and Their Application_, thanks to Glen_b and his comment here. Given that there are several questions on Cross Validated about this topic, I thought it was worth writing up.
The linear regression model is:
$$Y_i = X_i\beta + \epsilon_i$$
We have data, $i = 1, 2, \ldots, N$, which we use to estimate $\beta$ as:
$$\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$$
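As a minimal sketch of the estimator in code (NumPy; the simulated data and all variable names are illustrative assumptions, loosely mirroring the R example further down):

```python
import numpy as np

# Hypothetical data, just to illustrate the formula; X carries an intercept column.
rng = np.random.default_rng(0)
N = 100
X = np.column_stack([np.ones(N), rng.uniform(0, 100, size=N)])
Y = 1 + X[:, 1] + rng.exponential(scale=4, size=N) - 4

# beta_hat_OLS = (X'X)^{-1} X'Y, computed via a linear solve rather than
# an explicit matrix inverse (numerically safer, same estimator).
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```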
Now, we want to predict what $Y$ will be for a new data point, given that we know $X$ for it. This is the prediction problem. Let's call the new $X$ (which we know) $X_{N+1}$ and the new $Y$ (which we would like to predict) $Y_{N+1}$. The usual prediction (if we are assuming that the $\epsilon_i$ are iid and uncorrelated with $X$) is:
$$Y^p_{N+1} = X_{N+1}\hat{\beta}_{OLS}$$
The forecast error made by this prediction is:
$$e^p_{N+1} = Y_{N+1} - Y^p_{N+1}$$
We can re-write this equation as:
$$Y_{N+1} = Y^p_{N+1} + e^p_{N+1}$$
Now, $Y^p_{N+1}$ we have already calculated. So, if we want to bound $Y_{N+1}$ in an interval, say, 90% of the time, all we need to do is estimate consistently the 5th and 95th percentiles/quantiles of $e^p_{N+1}$, call them $e^5$ and $e^{95}$, and the prediction interval will be $[Y^p_{N+1} + e^5, \; Y^p_{N+1} + e^{95}]$.
How to estimate the quantiles/percentiles of $e^p_{N+1}$? Well, we can write:
$$e^p_{N+1} = Y_{N+1} - Y^p_{N+1} = X_{N+1}\beta + \epsilon_{N+1} - X_{N+1}\hat{\beta}_{OLS} = X_{N+1}(\beta - \hat{\beta}_{OLS}) + \epsilon_{N+1}$$
The strategy will be to sample (in a bootstrap kind of way) many times from $e^p_{N+1}$ and then calculate percentiles in the usual way. So, maybe we will sample 10,000 times from $e^p_{N+1}$, and then estimate the 5th and 95th percentiles as the 500th and 9,500th smallest members of the sample.
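That percentile step can be sketched as follows (the normal variates here are just a placeholder standing in for the 10,000 bootstrap draws):

```python
import numpy as np

rng = np.random.default_rng(1)
draws = rng.normal(size=10_000)  # stand-in for 10,000 bootstrap draws on e^p_{N+1}

# 5th and 95th percentiles as the 500th and 9,500th smallest draws
# (1-indexed), i.e. order statistics of the sorted sample.
srt = np.sort(draws)
e5, e95 = srt[500 - 1], srt[9500 - 1]
```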
To draw on $X_{N+1}(\beta - \hat{\beta}_{OLS})$, we can bootstrap errors (bootstrapping cases would be fine, too, but we are assuming iid errors anyway). So, on each bootstrap replication, you draw $N$ times with replacement from the variance-adjusted residuals (see next paragraph) to get $\epsilon^*_i$, then make new $Y^*_i = X_i\hat{\beta}_{OLS} + \epsilon^*_i$, then run OLS on the new dataset $(Y^*, X)$ to get this replication's $\beta^*_r$. At last, this replication's draw on $X_{N+1}(\beta - \hat{\beta}_{OLS})$ is $X_{N+1}(\hat{\beta}_{OLS} - \beta^*_r)$.
Given we are assuming iid $\epsilon$, the natural way to sample from the $\epsilon_{N+1}$ part of the equation is to use the residuals we have from the regression, $\{e_1, e_2, \ldots, e_N\}$. Residuals have different and generally too small variances, so we will want to sample from $\{s_1 - \bar{s}, s_2 - \bar{s}, \ldots, s_N - \bar{s}\}$, the variance-corrected residuals, where $s_i = e_i/\sqrt{1 - h_i}$ and $h_i$ is the leverage of observation $i$.
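For concreteness, here is a minimal NumPy sketch of the leverage computation and the variance-corrected residuals (the simulated data and all names are illustrative, not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
X = np.column_stack([np.ones(N), rng.uniform(0, 100, size=N)])
Y = 1 + X[:, 1] + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat

# Leverage h_i: diagonal of the hat matrix H = X (X'X)^{-1} X'.
H_diag = np.einsum('ij,ji->i', X, np.linalg.solve(X.T @ X, X.T))

# Variance-corrected residuals s_i = e_i / sqrt(1 - h_i), then centred.
s = resid / np.sqrt(1 - H_diag)
s_centred = s - s.mean()
```

Since $0 < h_i < 1$, each $|s_i| \ge |e_i|$, which is exactly the variance inflation the adjustment is meant to provide.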
And, finally, the algorithm for making a 90% prediction interval for $Y_{N+1}$, given that $X$ is $X_{N+1}$, is:
- Make the prediction $Y^p_{N+1} = X_{N+1}\hat{\beta}_{OLS}$.
- Make the variance-adjusted residuals $\{s_1 - \bar{s}, s_2 - \bar{s}, \ldots, s_N - \bar{s}\}$, where $s_i = e_i/\sqrt{1 - h_i}$.
- For replications $r = 1, 2, \ldots, R$:
  - Draw $N$ times with replacement on the adjusted residuals to make bootstrap residuals $\{\epsilon^*_1, \epsilon^*_2, \ldots, \epsilon^*_N\}$.
  - Generate bootstrap $Y^* = X\hat{\beta}_{OLS} + \epsilon^*$.
  - Calculate the bootstrap OLS estimator for this replication, $\beta^*_r = (X'X)^{-1}X'Y^*$.
  - Obtain the bootstrap residuals from this replication, $e^*_r = Y^* - X\beta^*_r$.
  - Calculate the bootstrap variance-adjusted residuals from this replication, $s^* - \bar{s^*}$.
  - Draw one of the bootstrap variance-adjusted residuals from this replication, $\epsilon^*_{N+1,r}$.
  - Calculate this replication's draw on $e^p_{N+1}$: $e^{p*}_r = X_{N+1}(\hat{\beta}_{OLS} - \beta^*_r) + \epsilon^*_{N+1,r}$.
- Find the 5th and 95th percentiles of the $e^{p*}_r$, call them $e^5$ and $e^{95}$.
- The 90% prediction interval for $Y_{N+1}$ is $[Y^p_{N+1} + e^5, \; Y^p_{N+1} + e^{95}]$.
Here is R code implementing the procedure:
# This script gives an example of the procedure to construct a prediction interval
# for a linear regression model using a bootstrap method. The method is the one
# described in Section 6.3.3 of Davison and Hinkley (1997),
# _Bootstrap Methods and Their Application_.
set.seed(12344321)
# Generate bivariate regression data
x <- runif(n=100,min=0,max=100)
y <- 1 + x + (rexp(n=100,rate=0.25)-4)
my.reg <- lm(y~x)
summary(my.reg)
# Predict y for x=78:
y.p <- coef(my.reg)["(Intercept)"] + coef(my.reg)["x"]*78
y.p
# Create adjusted residuals
leverage <- influence(my.reg)$hat
my.s.resid <- residuals(my.reg)/sqrt(1-leverage)
my.s.resid <- my.s.resid - mean(my.s.resid)
the.replication <- function(reg, s, x_Np1 = 0){
  # Make bootstrap residuals
  ep.star <- sample(s, size = length(reg$residuals), replace = TRUE)
  # Make bootstrap Y
  y.star <- fitted(reg) + ep.star
  # Do bootstrap regression
  x <- model.frame(reg)[,2]
  bs.reg <- lm(y.star ~ x)
  # Create bootstrapped adjusted residuals
  bs.lev <- influence(bs.reg)$hat
  bs.s <- residuals(bs.reg)/sqrt(1 - bs.lev)
  bs.s <- bs.s - mean(bs.s)
  # Calculate draw on prediction error
  # (use the reg argument, not the global my.reg, so the function is self-contained)
  xb.xb <- coef(reg)["(Intercept)"] - coef(bs.reg)["(Intercept)"]
  xb.xb <- xb.xb + (coef(reg)["x"] - coef(bs.reg)["x"]) * x_Np1
  return(unname(xb.xb + sample(bs.s, size = 1)))
}
# Do bootstrap with 10,000 replications
ep.draws <- replicate(n=10000,the.replication(reg=my.reg,s=my.s.resid,x_Np1=78))
# Create prediction interval
y.p+quantile(ep.draws,probs=c(0.05,0.95))
# prediction interval using normal assumption
predict(my.reg,newdata=data.frame(x=78),interval="prediction",level=0.90)
# Quick and dirty Monte Carlo to see which prediction interval is better
# That is, what are the 5th and 95th percentiles of Y_{N+1}
#
# To do it properly, I guess we would want to do the whole procedure above
# 10,000 times and then see what percentage of the time each prediction
# interval covered Y_{N+1}
y.np1 <- 1 + 78 + (rexp(n=10000,rate=0.25)-4)
quantile(y.np1,probs=c(0.05,0.95))
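The "proper" coverage experiment described in the comments above could be sketched like this (a NumPy re-implementation of the procedure; the function name `bootstrap_pi`, the small sample sizes, and the replication counts are all my own illustrative choices, kept small so it runs quickly):

```python
import numpy as np

rng = np.random.default_rng(3)

def bootstrap_pi(x, y, x_new, R=200, level=0.90):
    """Bootstrap prediction interval for simple linear regression,
    following the residual-resampling scheme sketched above."""
    N = len(y)
    X = np.column_stack([np.ones(N), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    fitted = X @ beta
    # Leverages and centred variance-adjusted residuals
    h = np.einsum('ij,ji->i', X, np.linalg.solve(X.T @ X, X.T))
    s = (y - fitted) / np.sqrt(1 - h)
    s = s - s.mean()
    y_p = beta[0] + beta[1] * x_new
    draws = np.empty(R)
    for r in range(R):
        y_star = fitted + rng.choice(s, size=N, replace=True)
        b_star = np.linalg.solve(X.T @ X, X.T @ y_star)
        s_star = (y_star - X @ b_star) / np.sqrt(1 - h)  # same X, same leverages
        s_star = s_star - s_star.mean()
        draws[r] = (beta[0] - b_star[0]) + (beta[1] - b_star[1]) * x_new \
                   + rng.choice(s_star)
    lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2])
    return y_p + lo, y_p + hi

# Coverage check: does the interval contain a fresh Y_{N+1} ~90% of the time?
hits = 0
trials = 200
for _ in range(trials):
    x = rng.uniform(0, 100, size=50)
    y = 1 + x + rng.exponential(scale=4, size=50) - 4  # rate 0.25 = scale 4
    lo, hi = bootstrap_pi(x, y, x_new=78)
    y_new = 1 + 78 + rng.exponential(scale=4) - 4
    hits += (lo <= y_new <= hi)
coverage = hits / trials
```

With these sizes the empirical coverage should land near the nominal 90%, though small-sample bootstrap intervals can under-cover slightly.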