Linear regression prediction interval

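If the best linear approximation of my data points (using least squares) is the line $y=mx+b$, how can I calculate the approximation error? If I compute the standard deviation $\sigma$ of the differences between observations and predictions, $e_i=real(x_i)-(mx_i+b)$, can I later say that a real (but unobserved) value $y_r=real(x_0)$ belongs to the interval $[y_p-\sigma, y_p+\sigma]$ (where $y_p=mx_0+b$) with probability ~68%, assuming a normal distribution?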

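To clarify:

I made observations of a function $f(x)$ by evaluating it at some points $x_i$. I fit these observations to a line $l(x)=mx+b$. For an $x_0$ that I did not observe, I'd like to know how big $f(x_0)-l(x_0)$ can be. Using the method above, is it correct to say that $f(x_0) \in [l(x_0)-\sigma, l(x_0)+\sigma]$ with probability ~68%?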

— bmx

I think you are asking about prediction intervals. Note, however, that you use "$x_i$" rather than "$y_i$". Is this a typo? We don't predict $x$s.

— gung - Reinstate Monica

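@gung: For example, I use $x$ to denote time and $y$ the value of some variable at that time, so $y=f(x)$ means that I observed $y$ at time $x$. I want to know how far the prediction of the fitted function can be from the real value of $y$. Does that make sense? The function $real(x_i)$ returns the "correct" value of $y$ at $x_i$, and my data points consist of ${(x_i, real(x_i))}$.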

— bmx

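That seems perfectly reasonable. The part I was focusing on is "$e_i=real(x_i)-(mx_i+b)$"; typically we think of the errors / residuals in a regression model as "$e_i=y_i-(mx_i+b)$". The SD of the residuals does play a role in calculating prediction intervals. It's that "$x_i$" that's odd to me; I'm wondering if it's a typo, or whether you're asking about something I don't recognize.

— gung - Reinstate Monica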

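I think I see; I missed your edit. This suggests that the system is perfectly deterministic, and that if you had access to the real underlying function you could always predict $y_i$ perfectly, without error. That isn't the way we typically think about regression models.

— gung - Reinstate Monica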

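bmx, it sounds to me like you have a clear idea of your question and a good awareness of some of the issues. You might be interested in reviewing three closely related threads. stats.stackexchange.com/questions/17773 describes prediction intervals in non-technical terms; stats.stackexchange.com/questions/26702 gives a more mathematical description; and in stats.stackexchange.com/questions/9131, Rob Hyndman provides the formula you seek. If these don't fully answer your question, at least they may give you standard notation and vocabulary for clarifying it.

— whuber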

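@whuber has pointed you to three good answers, but perhaps I can still write something of value. Your explicit question, as I understand it, is:

Given my fitted model $\hat y_i=\hat mx_i + \hat b$ (notice I added 'hats'), and assuming my residuals are normally distributed, $\mathcal N(0, \hat\sigma^2_e)$, can I predict that an as yet unobserved response, $y_{new}$, with a known predictor value, $x_{new}$, will fall within the interval $(\hat y -\sigma_e, \hat y +\sigma_e)$, with probability 68%?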

Intuitively, the answer seems like it should be 'yes', but the true answer is maybe. It will be the case when the parameters (i.e., $m, b,$ & $\sigma$) are known without error. Since you estimated these parameters, we need to take their uncertainty into account.
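The 68% figure for the known-parameter case can be checked directly. The following is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of the original question.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
z = NormalDist()

# Probability that a normal draw lands within one standard
# deviation of its mean, i.e. inside (mu - sigma, mu + sigma).
coverage = z.cdf(1.0) - z.cdf(-1.0)
print(round(coverage, 4))  # 0.6827
```

This is the coverage you would get only if $m$, $b$ and $\sigma$ were known exactly.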

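Let's think about the standard deviation of your residuals first. Since this is estimated from your data, there can be some error in the estimate. As a result, the distribution you should use to form your prediction interval is $t_\text{df error}$, not the normal. However, since the $t$ converges rapidly to the normal, this is less likely to be a problem in practice.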
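To get a feel for how quickly the $t$ approaches the normal, one can compare the probability of falling within $\pm 1$ standard deviation under each distribution. This is a rough sketch, not a library-grade routine: the helper names `t_pdf` and `t_central_prob` are hypothetical, and the $t$ probability is obtained by numerically integrating the density with Simpson's rule so that only the standard library is needed.

```python
import math
from statistics import NormalDist


def t_pdf(t, df):
    """Density of Student's t with `df` degrees of freedom."""
    # Work on the log scale so the gamma functions stay stable for large df.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def t_central_prob(k, df, steps=2000):
    """P(-k <= T <= k) by composite Simpson's rule (steps must be even)."""
    h = k / steps
    total = t_pdf(0.0, df) + t_pdf(k, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Integrate over [0, k] and double, using the symmetry of the density.
    return 2 * total * h / 3


normal_prob = NormalDist().cdf(1.0) - NormalDist().cdf(-1.0)
for df in (3, 10, 30, 100):
    print(df, round(t_central_prob(1.0, df), 4), "vs normal", round(normal_prob, 4))
```

With few error degrees of freedom the $\pm 1 s$ coverage sits below the normal value, and the gap closes as the degrees of freedom grow.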

So, can we just use $\hat y_\text{new}\pm t_{(1-\alpha/2,\ \text{df error})}s$, instead of $\hat y_\text{new}\pm z_{(1-\alpha/2)}s$, and go about our merry way? Unfortunately, no. The bigger problem is that there is uncertainty about your estimate of the conditional mean of the response at that location, due to the uncertainty in your estimates $\hat m$ & $\hat b$. Thus, the standard deviation of your predictions needs to incorporate more than just $s_\text{error}$. Because variances add, the estimated variance of the predictions will be:

$s^2_\text{predictions(new)}=s^2_\text{error}+\text{Var}(\hat mx_\text{new}+\hat b)$

Notice that the "$x$" is subscripted to represent the specific value for the new observation, and that the "$s^2$" is correspondingly subscripted. That is, your prediction interval is contingent on the location of the new observation along the $x$ axis. For a least-squares line, the second term is estimated by $s^2_\text{error}\left(\frac{1}{N}+\frac{(x_\text{new}-\bar x)^2}{\sum(x_i-\bar x)^2}\right)$, so the standard deviation of your predictions is:

$s_\text{predictions(new)}=\sqrt{s^2_\text{error}\left(1+\frac{1}{N}+\frac{(x_\text{new}-\bar x)^2}{\sum(x_i-\bar x)^2}\right)}$

As an interesting side note, we can infer a few facts about prediction intervals from this equation. First, prediction intervals will be narrower the more data we had when we built the prediction model (this is because there's less uncertainty in $\hat m$ & $\hat b$). Second, predictions will be most precise if they are made at the mean of the $x$ values you used to develop your model, as the numerator of the third term will be $0$. The reason is that, under normal circumstances, uncertainty in the estimated slope does not affect the fitted value at the mean of $x$; only the uncertainty about the true vertical position of the regression line remains. Thus, some lessons to be learned for building prediction models are: that more data is helpful, not with finding 'significance', but with improving the precision of future predictions; and that you should center your data collection efforts on the interval where you will need to be making predictions in the future (to minimize that numerator), but spread the observations as widely from that center as you can (to maximize that denominator).

Having calculated the correct value in this manner, we can then use it with the appropriate $t$ distribution as noted above.
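The whole calculation can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the function name `prediction_interval` and the toy data are made up, the residual variance uses $N-2$ error degrees of freedom (the usual choice for a line with an intercept), and the $t$ critical value is passed in by the caller (here 2.306, the tabulated 97.5% quantile for 8 degrees of freedom).

```python
import math


def prediction_interval(xs, ys, x_new, t_crit):
    """Return (lower, upper, s_pred) for a new response at x_new."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Least-squares slope and intercept.
    m_hat = sxy / sxx
    b_hat = y_bar - m_hat * x_bar

    # Residual variance with n - 2 error degrees of freedom (assumption).
    sse = sum((y - (m_hat * x + b_hat)) ** 2 for x, y in zip(xs, ys))
    s2_error = sse / (n - 2)

    # Standard deviation of the prediction at x_new.
    s_pred = math.sqrt(s2_error * (1 + 1 / n + (x_new - x_bar) ** 2 / sxx))

    y_hat = m_hat * x_new + b_hat
    return y_hat - t_crit * s_pred, y_hat + t_crit * s_pred, s_pred


# Made-up toy data, for illustration only.
xs = list(range(1, 11))
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
T_CRIT = 2.306  # 95% two-sided, df = 10 - 2 = 8

for x_new in (5.5, 15):
    lo, hi, s_pred = prediction_interval(xs, ys, x_new, T_CRIT)
    print(x_new, round(lo, 3), round(hi, 3), round(s_pred, 3))
```

The interval at $x_\text{new}=5.5$ (the mean of the toy $x$ values) is narrower than the one at $x_\text{new}=15$, which lies outside the observed range.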

— gung - Reinstate Monica