Bootstrap prediction intervals



Is there any bootstrap technique available to compute prediction intervals for point predictions obtained e.g. via linear regression or another regression method (k-nearest neighbours, regression trees, etc.)?

Somehow I feel that the sometimes-suggested approach of just bootstrapping the point prediction (see e.g. Prediction intervals for kNN regression) provides not a prediction interval but a confidence interval.

An example in R

# STEP 1: GENERATE DATA

set.seed(34345)

n <- 100 
x <- runif(n)
y <- 1 + 0.2*x + rnorm(n)
data <- data.frame(x, y)


# STEP 2: COMPUTE CLASSIC 95%-PREDICTION INTERVAL
fit <- lm(y ~ x)
plot(fit) # not shown but looks fine with respect to all relevant aspects

# Classic prediction interval based on standard error of forecast
predict(fit, list(x = 0.1), interval = "p")
# -0.6588168 3.093755

# Classic confidence interval based on standard error of estimation
predict(fit, list(x = 0.1), interval = "c")
# 0.893388 1.54155


# STEP 3: NOW BY BOOTSTRAP
B <- 1000
pred <- numeric(B)
for (i in 1:B) {
  boot <- sample(n, n, replace = TRUE)
  fit.b <- lm(y ~ x, data = data[boot,])
  pred[i] <- predict(fit.b, list(x = 0.1))
}
quantile(pred, c(0.025, 0.975))
# 0.8699302 1.5399179

Obviously, the 95% basic bootstrap interval matches the 95% confidence interval, not the 95% prediction interval. So my question: how do I do this properly?


At least in the case of ordinary least squares, you need more than just the point predictions. You would want to use the estimated residuals to construct the prediction interval as well.
Kodiologist


@duplo: Thanks for pointing this out. The correct length of the classic prediction interval depends directly on the normality assumption for the error term, so if that assumption is too optimistic, a bootstrap version derived from it will surely be too optimistic as well. I wonder whether there is a general bootstrap method that works for regression (not necessarily OLS).
Michael M

I think conformal inference may be what you are looking for: it lets you construct resampling-based prediction intervals with valid finite-sample coverage that are not overly conservative. There is a good paper at arxiv.org/pdf/1604.04173.pdf that works as an introduction to the topic, and there is an R package at github.com/ryantibs/conformal.
Simon Boge Brant
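For reference, a minimal sketch of the split conformal idea (my own illustration, not the linked package), reusing n and data from the example above; the plain empirical quantile below ignores the small finite-sample correction a full implementation would apply:

# Split conformal sketch: fit on one half, calibrate on the other half.
set.seed(1)
train  <- sample(n, n/2)                    # random 50/50 split
fit.tr <- lm(y ~ x, data = data[train, ])
scores <- abs(data$y[-train] - predict(fit.tr, data[-train, ]))  # calibration scores
q.hat  <- quantile(scores, 0.95)
predict(fit.tr, list(x = 0.1)) + c(-1, 1)*q.hat  # approximate 95% prediction interval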

Answers:



The method laid out below is the one described in Section 6.3.3 of Davison and Hinkley (1997), _Bootstrap Methods and Their Application_. Thanks to Glen_b and his comment here. Given that there are several questions on Cross Validated about this topic, I thought it was worth a write-up.

The linear regression model is:

$$Y_i = X_i \beta + \epsilon_i$$

We have data, $i = 1, 2, \ldots, N$, which we use to estimate $\beta$:

$$\hat{\beta}_{OLS} = \left(X'X\right)^{-1}X'Y$$

Now, we want to predict what $Y$ will be for a new data point, given that we know $X$ for it. This is the prediction problem. Let's call the new $X$ (which we know) $X_{N+1}$ and the new $Y$ (which we would like to predict) $Y_{N+1}$. The usual prediction (if we are assuming that the $\epsilon_i$ are iid and uncorrelated with $X$) is:

$$Y^p_{N+1} = X_{N+1}\hat{\beta}_{OLS}$$

The forecast error made by this prediction is:

$$e^p_{N+1} = Y_{N+1} - Y^p_{N+1}$$

We can re-write this equation like:

$$Y_{N+1} = Y^p_{N+1} + e^p_{N+1}$$

Now, $Y^p_{N+1}$ we have already calculated. So, if we want to bound $Y_{N+1}$ in an interval, say, 90% of the time, all we need to do is estimate consistently the 5th and 95th percentiles/quantiles of $e^p_{N+1}$, call them $e^5$ and $e^{95}$, and the prediction interval will be $\left[Y^p_{N+1} + e^5,\, Y^p_{N+1} + e^{95}\right]$.

How do we estimate the quantiles/percentiles of $e^p_{N+1}$? Well, we can write:

$$e^p_{N+1} = Y_{N+1} - Y^p_{N+1} = X_{N+1}\beta + \epsilon_{N+1} - X_{N+1}\hat{\beta}_{OLS} = X_{N+1}\left(\beta - \hat{\beta}_{OLS}\right) + \epsilon_{N+1}$$

The strategy will be to sample (in a bootstrap kind of way) many times from $e^p_{N+1}$ and then calculate the percentiles in the usual way. So, maybe we will sample 10,000 times from $e^p_{N+1}$, and then estimate the 5th and 95th percentiles as the 500th and 9,500th smallest members of the sample.
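(In R, these order statistics are exactly what quantile() with type = 1, the inverse of the empirical CDF, picks out; the snippet assumes a vector of draws such as the ep.draws created in the code further down.)

# 500th and 9,500th smallest of 10,000 draws, as sample order statistics:
quantile(ep.draws, probs = c(0.05, 0.95), type = 1)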

To draw on $X_{N+1}\left(\beta - \hat{\beta}_{OLS}\right)$, we can bootstrap the errors (bootstrapping cases would be fine, too, but we are assuming iid errors anyway). So, on each bootstrap replication, you draw $N$ times with replacement from the variance-adjusted residuals (see the next paragraph) to get $\epsilon^*_i$, then make new $Y^*_i = X_i\hat{\beta}_{OLS} + \epsilon^*_i$, then run OLS on the new dataset $(Y^*, X)$ to get this replication's $\beta^*_r$. At last, this replication's draw on $X_{N+1}\left(\beta - \hat{\beta}_{OLS}\right)$ is $X_{N+1}\left(\hat{\beta}_{OLS} - \beta^*_r\right)$.

Given that we are assuming iid $\epsilon$, the natural way to sample from the $\epsilon_{N+1}$ part of the equation is to use the residuals we have from the regression, $\{e_1, e_2, \ldots, e_N\}$. Residuals have different and generally too small variances, so we will want to sample from $\{s_1 - \bar{s}, s_2 - \bar{s}, \ldots, s_N - \bar{s}\}$, the variance-corrected residuals, where $s_i = e_i/\sqrt{1-h_i}$ and $h_i$ is the leverage of observation $i$.

And, finally, the algorithm for making a 90% prediction interval for $Y_{N+1}$, given that $X$ is $X_{N+1}$, is:

  1. Make the prediction $Y^p_{N+1} = X_{N+1}\hat{\beta}_{OLS}$.
  2. Make the variance-adjusted residuals, $\{s_1 - \bar{s}, s_2 - \bar{s}, \ldots, s_N - \bar{s}\}$, where $s_i = e_i/\sqrt{1-h_i}$.
  3. For replications $r = 1, 2, \ldots, R$:
    • Draw $N$ times on the adjusted residuals to make bootstrap residuals $\{\epsilon^*_1, \epsilon^*_2, \ldots, \epsilon^*_N\}$
    • Generate bootstrap $Y^* = X\hat{\beta}_{OLS} + \epsilon^*$
    • Calculate the bootstrap OLS estimator for this replication, $\beta^*_r = \left(X'X\right)^{-1}X'Y^*$
    • Obtain the bootstrap residuals from this replication, $e^*_r = Y^* - X\beta^*_r$
    • Calculate the bootstrap variance-adjusted residuals from this replication, $s^* - \bar{s}^*$
    • Draw one of the bootstrap variance-adjusted residuals from this replication, $\epsilon^*_{N+1,r}$
    • Calculate this replication's draw on $e^p_{N+1}$, $e^{p*}_r = X_{N+1}\left(\hat{\beta}_{OLS} - \beta^*_r\right) + \epsilon^*_{N+1,r}$
  4. Find the 5th and 95th percentiles of the simulated $e^{p*}_r$, $e^5$ and $e^{95}$
  5. The 90% prediction interval for $Y_{N+1}$ is $\left[Y^p_{N+1} + e^5,\, Y^p_{N+1} + e^{95}\right]$.

Here is R code:

# This script gives an example of the procedure to construct a prediction interval
# for a linear regression model using a bootstrap method.  The method is the one
# described in Section 6.3.3 of Davison and Hinkley (1997),
# _Bootstrap Methods and Their Application_.


set.seed(12344321)

# Generate bivariate regression data
x <- runif(n=100,min=0,max=100)
y <- 1 + x + (rexp(n=100,rate=0.25)-4)

my.reg <- lm(y~x)
summary(my.reg)

# Predict y for x=78:
y.p <- coef(my.reg)["(Intercept)"] + coef(my.reg)["x"]*78
y.p

# Create adjusted residuals
leverage <- influence(my.reg)$hat
my.s.resid <- residuals(my.reg)/sqrt(1-leverage)
my.s.resid <- my.s.resid - mean(my.s.resid)



the.replication <- function(reg,s,x_Np1=0){
  # Make bootstrap residuals
  ep.star <- sample(s,size=length(reg$residuals),replace=TRUE)

  # Make bootstrap Y
  y.star <- fitted(reg)+ep.star

  # Do bootstrap regression
  x <- model.frame(reg)[,2]
  bs.reg <- lm(y.star~x)

  # Create bootstrapped adjusted residuals
  bs.lev <- influence(bs.reg)$hat
  bs.s   <- residuals(bs.reg)/sqrt(1-bs.lev)
  bs.s   <- bs.s - mean(bs.s)

  # Calculate draw on prediction error
  xb.xb <- coef(reg)["(Intercept)"] - coef(bs.reg)["(Intercept)"]
  xb.xb <- xb.xb + (coef(reg)["x"] - coef(bs.reg)["x"])*x_Np1
  return(unname(xb.xb + sample(bs.s,size=1)))
}

# Do bootstrap with 10,000 replications
ep.draws <- replicate(n=10000,the.replication(reg=my.reg,s=my.s.resid,x_Np1=78))

# Create prediction interval
y.p+quantile(ep.draws,probs=c(0.05,0.95))

# prediction interval using normal assumption
predict(my.reg,newdata=data.frame(x=78),interval="prediction",level=0.90)


# Quick and dirty Monte Carlo to see which prediction interval is better
# That is, what are the 5th and 95th percentiles of Y_{N+1}
# 
# To do it properly, I guess we would want to do the whole procedure above
# 10,000 times and then see what percentage of the time each prediction 
# interval covered Y_{N+1}

y.np1 <- 1 + 78 + (rexp(n=10000,rate=0.25)-4)
quantile(y.np1,probs=c(0.05,0.95))
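And here is a rough sketch of the coverage check mentioned in the comments above (my addition, not from Davison and Hinkley): repeat the whole procedure many times, with fewer bootstrap replications to keep the run time manageable, and count how often each 90% interval covers a fresh draw of $Y_{N+1}$.

# Monte Carlo coverage check: simulate fresh data, build both 90% intervals,
# and record whether each covers a newly drawn Y_{N+1} at x = 78.
n.sim <- 200   # Monte Carlo repetitions (kept small for speed)
cover.boot <- cover.norm <- logical(n.sim)
for (sim in 1:n.sim) {
  sim.d <- data.frame(x = runif(n=100, min=0, max=100))
  sim.d$y <- 1 + sim.d$x + (rexp(n=100, rate=0.25) - 4)
  reg.s <- lm(y ~ x, data = sim.d)
  lev.s <- influence(reg.s)$hat
  s.s <- residuals(reg.s)/sqrt(1 - lev.s)
  s.s <- s.s - mean(s.s)
  y.p.s <- predict(reg.s, newdata = data.frame(x = 78))
  ep.s <- replicate(n = 500, the.replication(reg = reg.s, s = s.s, x_Np1 = 78))
  pi.boot <- y.p.s + quantile(ep.s, probs = c(0.05, 0.95))
  pi.norm <- predict(reg.s, newdata = data.frame(x = 78),
                     interval = "prediction", level = 0.90)[2:3]
  y.new <- 1 + 78 + (rexp(n = 1, rate = 0.25) - 4)   # a fresh Y_{N+1}
  cover.boot[sim] <- pi.boot[1] <= y.new & y.new <= pi.boot[2]
  cover.norm[sim] <- pi.norm[1] <= y.new & y.new <= pi.norm[2]
}
mean(cover.boot)   # bootstrap interval: should be close to 0.90
mean(cover.norm)   # normal-theory interval, for comparison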

Thank you for the useful, detailed explanation. Following these lines, I think that a general technique outside OLS (tree-based techniques, nearest neighbours, etc.) won't be easily available, right?
Michael M

There is this one for random forests: stats.stackexchange.com/questions/49750/… which sounds similar.
Bill

As far as I can tell, if you abstract $X\beta$ to $f(X, \theta)$, this technique works for any model.
shadowtalker
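To make that concrete, here is a hedged sketch of the generic version (my own, hypothetical boot.pi helper, not from the answer): plug in any fit/predict pair. The OLS-specific leverage adjustment is simply dropped here and raw centred residuals are used instead, so the intervals are only approximate outside OLS:

# Generic residual-bootstrap prediction interval for any regression method.
# fit.fun(x, y) must return a fitted model; pred.fun(fit, x) must return predictions.
boot.pi <- function(x, y, x.new, fit.fun, pred.fun, R = 2000, level = 0.90) {
  fit0  <- fit.fun(x, y)
  pred0 <- pred.fun(fit0, x.new)          # point prediction f(x.new, theta-hat)
  res   <- y - pred.fun(fit0, x)          # in-sample residuals
  res   <- res - mean(res)                # centred (no leverage adjustment here)
  ep <- replicate(R, {
    y.star   <- pred.fun(fit0, x) + sample(res, length(y), replace = TRUE)
    fit.star <- fit.fun(x, y.star)
    res.star <- y.star - pred.fun(fit.star, x)
    res.star <- res.star - mean(res.star)
    (pred0 - pred.fun(fit.star, x.new)) + sample(res.star, 1)
  })
  alpha <- (1 - level)/2
  pred0 + quantile(ep, probs = c(alpha, 1 - alpha))
}

# Example with OLS (any other regression could be slotted in):
# boot.pi(x, y, x.new = 78,
#         fit.fun  = function(x, y) lm(y ~ x, data = data.frame(x, y)),
#         pred.fun = function(fit, x) predict(fit, data.frame(x = x)))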

How do you generalise the "variance adjusted residuals" - the OLS approach relies on the leverage - is there a leverage calculation for an arbitrary f(X) estimator?
David Waterworth